Summary
Databases are useful for inspecting the current state of your application, but inspecting the history of that data can get messy without a way to track changes as they happen. Debezium is an open source platform for reliable change data capture that you can use to build supplemental systems for everything from maintaining audit trails to real-time updates of your data warehouse. In this episode, Gunnar Morling and Randall Hauch explain why it got started, how it works, and some of the myriad ways that you can use it. If you have ever struggled with implementing your own change data capture pipeline, or understanding when it would be useful, then this episode is for you.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
- Your host is Tobias Macey and today I’m interviewing Randall Hauch and Gunnar Morling about Debezium, an open source distributed platform for change data capture
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by describing what Change Data Capture is and some of the ways that it can be used?
- What is Debezium and what problems does it solve?
- What was your motivation for creating it?
- What are some of the use cases that it enables?
- What are some of the other options on the market for handling change data capture?
- Can you describe the systems architecture of Debezium and how it has evolved since it was first created?
- How has the tight coupling with Kafka impacted the direction and capabilities of Debezium?
- What, if any, other substrates does Debezium support (e.g. Pulsar, Bookkeeper, Pravega)?
- What are the data sources that are supported by Debezium?
- Given that you have branched into non-relational stores, how have you approached organization of the code to allow for handling the specifics of those engines while retaining a common core set of functionality?
- What is involved in deploying, integrating, and maintaining an installation of Debezium?
- What are the scaling factors?
- What are some of the edge cases that users and operators should be aware of?
- Debezium handles the ingestion and distribution of database changesets. What are the downstream challenges or complications that application designers or systems architects have to deal with to make use of that information?
- What are some of the design tensions that exist in the Debezium community between acting as a simple pipe vs. adding functionality for interpreting/aggregating/formatting the information contained in the changesets?
- What are some of the common downstream systems that consume the outputs of Debezium?
- What challenges or complexities are involved in building clients that can consume the changesets from the different engines that you support?
- What are some of the most interesting, unexpected, or innovative ways that you have seen Debezium used?
- What have you found to be the most challenging, complex, or complicated aspects of building, maintaining, and growing Debezium?
- What is in store for the future of Debezium?
Contact Info
- Randall
- Gunnar
- gunnarmorling on GitHub
- @gunnarmorling on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- Debezium
- Confluent
- Kafka Connect
- RedHat
- Bean Validation
- Change Data Capture
- DBMS == DataBase Management System
- Apache Kafka
- Apache Flink
- Yugabyte DB
- PostgreSQL
- MySQL
- Microsoft SQL Server
- Apache Pulsar
- Pravega
- NATS
- Amazon Kinesis
- Pulsar IO
- WePay
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you get everything you need to run a fast, reliable, and bulletproof data platform. If you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances, and they've got GPU instances as well.
Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. And you listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference, the Strata Data Conference, and PyCon US.
Go to dataengineeringpodcast.com/conferences to learn more about these and other events and take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey. And today, I'm interviewing Randall Hauch and Gunnar Morling about Debezium, an open source distributed
[00:01:44] Unknown:
platform for change data capture. So, Randall, can you start by introducing yourself? Hi. Thanks. My name is Randall. I am a software engineer at Confluent. I've been there for almost 3 years. I work on the Kafka Connect framework, which is part of the Apache Kafka project, and I work on Confluent's family of Kafka Connect connectors. Before that, I was at Red Hat, which is where I created the Debezium project in the last few years I was there. And prior to that, I spent most of my career in various forms of data integration. And, Gunnar, how about yourself? Yes.
[00:02:21] Unknown:
I work as a software engineer at Red Hat. I'm the current project lead of Debezium. So I took over from Randall a few years ago. And before that, I also used to work on other data related projects at Red Hat. So I used to be the spec lead for the Bean Validation 2.0 spec, and I also
[00:02:41] Unknown:
have worked as a member of the Hibernate team for a few years. And Randall, do you remember how you first got involved in the area of data management?
[00:02:48] Unknown:
Yeah. So before I was at Red Hat, I worked at a couple of startups that were all around data integration. We were trying to solve sort of the disparate data problem where you have data in multiple databases, multiple data systems. And, really, I've worked in that area, in related areas,
[00:03:12] Unknown:
ever since. And, Gunnar, do you remember how you got involved in data management?
[00:03:16] Unknown:
Yeah. So I think it came to be when I was actually looking into Bean Validation. So I was, you know, working at another company, and back in the day I was reading up on this new spec, Bean Validation 1.0, and I was writing a blog post about it because I really liked the idea. And then Emmanuel Bernard, who was the spec lead then, reached out to me and asked, hey, wouldn't you be interested in contributing to this and to the reference implementation? So that's what I did for a few years in my spare time. And at some point, I had the chance to work on, you know, data related projects
[00:03:51] Unknown:
full time at at Red Hat, and that's what I've been doing since then. And so before we get too far into what Debezium itself is, can we talk a bit about what change data capture is and some of the use cases that it enables? So essentially change data capture is the
[00:04:05] Unknown:
process of getting change events out of your database. So whenever there's a new record or something got updated or something got deleted from the database, the change data capture process will, you know, get hold of this event and stream it to consumers. So typically the way you do it is via some messaging infrastructure such as Apache Kafka, so it's all nicely decoupled. And then you can have consumers which react to those change events, and you can enable all sorts of use cases. So you could use it just to replicate data from 1 database to another or to propagate data from your database into a search index or a cache. But you also could use it for things like maintaining audit logs, for instance, or for propagating data between microservices, or
[00:04:48] Unknown:
for running streaming queries. So whenever something in your data changes, you get an updated query result. So all these kinds of things you can do. And so in terms of Debezium itself, can you describe a bit about the problems that it solves and some of your motivation for creating it, Randall? Yeah. So
[00:05:05] Unknown:
with change data capture, all of the different databases have various support for letting client applications be able to see and consume those events that we were just talking about. Some databases, you know, don't even support it. And really, it all comes from the fact that the databases are really focused around returning and operating on sort of the current state of data. And so if you want a history of that data, databases internally store that information in order to handle transactions and recover from transactions and from failures. But they oftentimes don't make it available, or at least that was not really part of the sort of typical user interface. And so all these different databases expose the ability to see these events in various ways. And so if you are working with a variety of databases, it's really difficult to know and use DBMS-specific features for each of these systems that you're working with. And so when I was at Red Hat, that was 1 of the problems that we had. We were trying to, you know, as I mentioned earlier, work with a lot of different DBMSes.
And Red Hat customers, you know, had a large variety of different types of DBMSes. And, you know, we had, at the time, the desire to be able to listen to all of the events. And so, you know, doing that for each DBMS specifically would be very challenging. And so we wanted to create a framework that abstracted away that process. And so applications, different consumers that were interested in those events, they could simply subscribe to, you know, we started out with Apache Kafka. They could subscribe to Apache Kafka and just magically see all the events, because Debezium would take care of reading those change events from each of the DBMSes and inserting them into Kafka topics.
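To make the "subscribe and see the events" idea concrete, here is a minimal sketch of a downstream application reading one of the per-table topics that Debezium populates, using the plain Kafka consumer API. The broker address and the topic name (Debezium derives topic names from the logical server name plus schema and table) are illustrative assumptions only.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ChangeEventPrinter {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "cdc-demo");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        props.put("auto.offset.reset", "earliest");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Debezium names topics <server>.<schema>.<table>; this one is hypothetical.
            consumer.subscribe(List.of("dbserver1.inventory.customers"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // Key is the row's primary key; value is the change event envelope as JSON.
                    System.out.printf("key=%s value=%s%n", record.key(), record.value());
                }
            }
        }
    }
}
```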
[00:07:00] Unknown:
I know that the general case of what people think of with change data capture is for relational systems. But when I was going through the documentation, I was noticing that there are also systems such as Cassandra and MongoDB that are supported. And I'm curious, what are some of the biggest challenges in terms of being able to provide support to all these different database engines for being able to pull the change sets out and then provide some at least somewhat unified way of being able to process them and what the sort of lowest common denominator is in terms of the types of information that you can expose. Right. I mean, so it's as Randall was mentioning, so it's always different APIs. Right? So there's 1 way you would get change events out of MySQL. There's a different way to get them out of Postgres, out of Cassandra, and so on. And so for us as the Debezium team, it means we need to do this original engineering effort, for each of the connectors. And then we try to abstract
[00:07:55] Unknown:
or we try to keep you as a user as much as possible away from the nitty gritty details. So in particular, if it comes to the relational connectors, you don't really have to think too much about, okay, does this event come from SQL Server or does it come from Postgres, let's say? So there it works, I would say, in a pretty unified way. It's looking a bit different if you look at MongoDB. It's just a different way of storing data. Right? There you have those documents and you can have a nested structure. There's no fixed schema. So the events you would get from MongoDB, they look quite a bit different. And then, well, Cassandra, it's yet another case because, well, this is a distributed store by default, so you don't have those strong guarantees when it comes to ordering of events and so on, depending on from which node of the cluster we receive the change event. So for those NoSQL stores, it's a bit less transparent for the consumer which particular database this event is coming from. But then still we strive for as much uniformity as possible. So we try to have similar config options, similar semantics, and we try to structure things for you as a user as similarly as possible so you don't have that much friction when moving from 1 connector to the other. And so
[00:09:10] Unknown:
being able to work across these different systems is definitely valuable. I'm curious if you've seen cases where people are blending events from different data stores to either populate a secondary or tertiary data store or be able to provide some sort of unified logic in terms of the types of events that are coming out of those different systems, or if you think that the general case is that people are handling those change sets as their own distinct streams for separate purposes.
[00:09:37] Unknown:
Yeah. So I would say, you know, the way that I've seen Debezium used, and other connectors used, is that, you know, the primary purpose is to get those events into a streaming system. And once they're there, then you can do various kinds of manipulations on all of the events in those streams. And if you have different streams for different databases or different streams for different tables in different databases, then, you know, you can process each of those streams in slightly different ways, depending upon either the structure of those streams, the type of events, or, you know, if you wanna filter certain kinds of things. So all of that can be sort of downstream, and I think, at least, you know, 1 of the primary goals of Debezium, at least when we started out, was just to get these events into a streaming system that had rich
[00:10:34] Unknown:
mechanisms for doing additional rich processing on top. Yeah. Absolutely. That's that's also something we see. So people use, technologies like Kafka streams or, let's say, Apache Flink to process, those events, once they are, in Kafka or whatever they are using. And, that's also what something what we would like to facilitate more going forward. So, 1 very common requirement, for instance, is to aggregate or to correlate multiple change streams from from multiple tables. So 1 example would be like a purchase order. If you think about a relational database, typically, this would be persisted in at least 2 tables. You would have a table with the order headers and then you would have another table with an entry for each order line of this purchase order. So if you wanted to look at the full complete state of the purchase order, you need to look at the rows from those 2 tables. And this means in case of Debezium, you need also to look at the change events from those 2 topics. So you would have 1 topic per per table. And people, are trying to aggregate this data. And then, well, you can do this actually quite quite nicely using Kafka streams, for instance. So then you could have a stream processing application there, which essentially subscribed subscribes to those 2 topics and and joins them. So it would give you again 1 aggregated event which describes the full, purchase order and all its order lines. And
[00:11:59] Unknown:
given the fact that change data capture is used for so many different use cases and that the sort of source integration is different for each of the different engines that you're working with. I'm curious what other solutions exist as far as being able to handle those change data streams, and if they are generally more single purpose for a single database, or if there is anything else working in the same space as Debezium and sort of what the comparison side. We try to
[00:12:28] Unknown:
focus mostly on the stuff we do ourselves. I'm not really aware of any other open source project, for sure, which would aim at having such a variety of supported databases. So there are always solutions where you could use another tool for, let's say, MySQL, or you could use something for this particular other database. But I don't think there's another open source project which tries to have this complete coverage of,
[00:12:54] Unknown:
yeah, mostly all those relevant databases. Yeah. There are a number of libraries, open source libraries that, as Gunnar mentioned, target, you know, particular DBMSes or particular data systems. But 1 of the things that we tried to do when we created Debezium was do the capture of these events well and put those events into a system that had sufficient guarantees that we didn't lose any data, that data wasn't reordered, that you could always pick up. Right? So if anything went wrong, the system would be fault tolerant and would pick up where it left off. And that was really the primary justification for, you know, using and building on top of Apache Kafka in the first place. Kafka is able to provide those guarantees. And so when you do that, right, you sort of decouple the responsibilities of knowing how to talk to the data system and extract those change events from storing those in a system that can be easily processed and used, regardless of sort of what went wrong or when you're reading them or how much lag you have, you know, how far behind consumers are from the actual, you know, Debezium connectors that are pushing data into those streaming systems. Right. I mean, so an interesting,
[00:14:09] Unknown:
development I saw lately is that more of those modern databases, which are typically distributed, like YugabyteDB and so on, tend to come with CDC APIs out of the box. So I would say those more modern database vendors, they typically see the need for having this kind of functionality, and, you know, it's not an afterthought as maybe used to be the case, but they tend to think about this upfront. That's great. And now, I was following 1 discussion about whether such a database should also take care of writing the changes to all the different downstream systems itself, and I would not recommend doing that. Right? So as a database, I think they should rather focus on having a good API and maybe have a way to put the events into Kafka, but they should not be in the business of thinking about all those different things, because it's just something you cannot keep up with. Right? So there will always be new data warehouses, new search indexes, new use cases. So it's just better for the source side really to focus on that aspect, get change events out, as Randall mentioned, in a consistent way, being sure the order isn't messed up and all this kind of stuff. And then you would use, you know, other sink connectors to get the data from Kafka and to write it to the particular systems where you would like to have your data. Right. And a lot of those systems that,
[00:15:29] Unknown:
you know, support, you know, writing data or streams of events into particular systems when they're systems like elasticsearch. Right? You sort of have these point to point capabilities. And, you know, as we all know in in a lot of data architectures, point to point combinations, get really complicated and and and, messy, and it's much better, oftentimes to have sort of a hub and spoke to where you don't have all of the the, you know, the point to point from database a into, you know, a search index or into a an object store that you have more flexibility when you have sort of a hub and spoke design. That I mean, that's also something I always like to emphasize. So once you have to change events in in Apache Kafka,
[00:16:11] Unknown:
well, you can keep the data there for a long time. Right? Essentially, as much as you have disk space, so you could, keep your events for a year or for, essentially, for an indefinite amount of time. And you could replay topics from the beginning. So this means you even could set up new consumers far down in the future provided you still have those change events, and then they could take the events and write them to this new data warehouse, which you maybe not even,
[00:16:37] Unknown:
thought about when those change events were produced. So I think that's another big advantage of having this kind of architecture. Yeah. That's definitely a huge piece of the CDC space is that, as you mentioned, if you don't have all of these events from the beginning of time, you do need to take an initial snapshot of the structure and contents of the database to then be able to take advantage of it going forward. And so I'm curious what you have found as far as options for people who are coming to change data capture after they've already got an existing database And any ways that Debezium is able to help with that overall effort of being able to populate a secondary data store with all of the contents that you care about, whether it's from just a subset of the tables or you need to know the entire schema, things like that? Yeah. Really, ever ever since the beginning of Debezium, that was an important aspect to be able to sort of take an initial snapshot
[00:17:30] Unknown:
and then consistently pick up where that snapshot left off and capture all of the changes from that point. And so, you know, when you're sort of starting a Debezium connector, you very often wanna begin with that snapshot and then transition to capturing the changes. For a consumer of 1 of those change event streams, they often don't really necessarily care about the difference. At some point, you know, they see that a record exists with some key, and then a bit later, you know, after the changes are being captured, they would see that that record has changed. They see a change event, and then maybe the record is deleted. And so most of the time, the consumers don't often care. And 1 of the interesting things when we sort of looked at Kafka, Kafka was just introducing the ability to have compacted topics. With a compacted topic, basically, you sort of keep the last record with the same key. And so, you know, as time goes on, you can sort of throw away the earlier records that have the same key. And so if you're capturing change events and you're using, let's say, the primary key as the key for the record, then if there are multiple events, depending upon how you set up your topic, you can set that up as compacted, and you can say, oh, I only wanna keep, you know, the last month of events. And any consumer that begins reading that stream, let's say, you know, several months from now, they start from the beginning, and they're essentially getting a snapshot. Right? It just happens to be the snapshot where they start with the most recent events for every record in a table, let's say. And so that's a very interesting way of, again, sort of decoupling, you know, how to read, take an initial snapshot, and start capturing events on the production side, and on the consumption side, still providing essentially the semantics of an initial snapshot, but one that isn't necessarily rooted in a particular time that the producer actually created its first snapshot.
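Randall's compacted-topic setup is a property of the destination topic rather than of Debezium itself. A minimal sketch of creating such a topic with Kafka's AdminClient follows; the topic name, partition count, and retention settings are assumptions for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CompactedTopicSetup {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Compaction keeps the newest record per key (here: the captured table's primary key),
            // so a late-joining consumer still sees one record per row, which reads like a snapshot.
            // The added delete policy ages out records older than the retention window, mirroring
            // Randall's "only keep the last month of events" example.
            NewTopic topic = new NewTopic("dbserver1.inventory.customers", 3, (short) 1)
                    .configs(Map.of(
                            "cleanup.policy", "compact,delete",
                            "retention.ms", String.valueOf(30L * 24 * 60 * 60 * 1000)));
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```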
[00:19:33] Unknown:
Yeah. It's really interesting that you are talking about snapshots, because it's quite a specific part of the story. But just interestingly, right now we are looking into ways for improving this because, well, depending on the size of your data, taking the snapshot can take a substantial amount of time. So essentially, what it does is it scans all the tables. You can set up filters so you can specify which of the tables, or collections from MongoDB, you would like to have captured, and then it'll scan those tables. You could also limit the amount of data. So let's say you have some deleted records. You could exclude those from the snapshot. But still, it could take some amount of time. And what we are looking into right now is to parallelize those snapshots. So the idea would be, well, instead of, you know, processing 1 table after the other, you could also, let's say, have 4 or whatever number of worker threads and then read multiple tables in parallel. And we were just figuring out, okay, for Postgres, for instance, we can do this quite easily in a consistent way. So we would have those parallel snapshot workers. And once they are all done, we still have the ability to continue the streaming from 1 consistent point in the transaction log. So that's 1 of the improvements we have planned, yeah, hopefully for the next year or so, to speed up those initial snapshots.
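The parallel snapshot work described here had not shipped at the time of this conversation, so the following is only a toy illustration of the general idea, scanning several tables with a pool of worker threads, and not Debezium's actual implementation. Connection details and table names are made up, and a real implementation would additionally pin all workers to one consistent position in the transaction log.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelSnapshotSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection details and table list, purely for illustration.
        String url = "jdbc:postgresql://localhost:5432/inventory";
        List<String> tables = List.of("customers", "orders", "order_lines", "products");

        ExecutorService workers = Executors.newFixedThreadPool(4);
        for (String table : tables) {
            workers.submit(() -> {
                try (Connection conn = DriverManager.getConnection(url, "app", "secret");
                     Statement stmt = conn.createStatement();
                     ResultSet rs = stmt.executeQuery("SELECT * FROM " + table)) {
                    long rows = 0;
                    while (rs.next()) {
                        rows++; // a real snapshot would emit a "read" event per row here
                    }
                    System.out.println(table + ": " + rows + " rows scanned");
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
        }
        workers.shutdown();
        workers.awaitTermination(1, TimeUnit.HOURS);
    }
}
```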
[00:20:55] Unknown:
And so can we dig a bit deeper into how Debezium itself is architected and some of the evolution that it's gone through from when you first began working on it and through to where it is now and how the different use cases and challenges that you've confronted have influenced the direction that it's taken? Yeah. Maybe, Randall, you can talk about the beginning. That would be very interesting for me too.
[00:21:18] Unknown:
Yeah. So, I mean, I was working on several different things. I was looking at a variety of technologies, and it wasn't really until I came across Kafka that I really understood, you know, the requirement of having those guarantees that I mentioned earlier on the change stream. Because there was always the problem of, you know, what happens if my change event reader, right, where I'm reading these change events from the database, what happens when that goes down? And when it comes back up, like, how do you guarantee all of that? And so I went through several different prototypes, started out initially with MySQL only, really just trying to understand how to read the data from MySQL and how to use Kafka in a way that would persist all the information and that was fault tolerant. And I had tried that with a couple of different systems. You know, I had studied various kinds of distributed system technologies and data management technologies before I got to Kafka. But then once, sort of, you know, all the pieces started to fit together, I really, you know, focused on, okay, how could I get this to work with MySQL?
It became apparent that just starting to read the binary log from MySQL wouldn't be that useful, because, you know, what happened to all the data? If I start reading this, all of a sudden I get some changes, but I may have a million records in my table before I even start doing that. And so it became apparent that the initial snapshot was going to be a very important part of bootstrapping any kind of practical consumption application. And, you know, once I had something working for MySQL, then, you know, we looked a little bit at, you know, Postgres, and I didn't wanna build a system that was, you know, designed all around MySQL and wouldn't work for anything else. And so it really just, at that point, you know, slowly evolved into what I would call sort of practical Kafka Connect connector implementations, and that was essentially what the Debezium
[00:23:24] Unknown:
GitHub repository was seeded with. Yeah. And so when we took over, or when I took over, I think there were the connectors for MySQL, Postgres, and MongoDB. And I think Postgres was pretty much at its beginnings. So we were evolving those, then we were adding SQL Server. We have some incubating support for Oracle, and there might, there probably will be more. But really, 1 challenge which became apparent is that we were duplicating efforts between those different connectors. So typically, somebody from the community would come, and they would say, I would like to have this particular feature. So, could you implement this, or could I implement it? I will send you a PR for that, a pull request. And then we figured, okay, so that's a great feature, but now we are actually implementing this in 4 different places. And, you know, that's obviously not a great way to do it. So we tried to unify more and more parts of the code base. And this is still an ongoing process, so we are still not quite there where we would like to be. But typically now, if this kind of situation happens, somebody comes and they would like to have a new feature there, we either can implement it in a way that it's just working for all the connectors we have, in particular across all the relational connectors that works quite well. Or we say, okay, so we have this new feature, and maybe it's fine, we just add it for this connector there, but we already do this in a way so that in the future it will be possible to refactor this and, you know, extract this functionality into some more generic code. The reason for that just being, so typically, if somebody comes, they are on a particular database. So there's this MySQL user or there's this Postgres user. Very often, they are very eager to work on this stuff. So that's really 1 of the things I'm very excited about in this community. So we ask, hey, okay, that's a great idea, how about you implement it yourself? And then they often say, oh, yeah, I would like to do it. But then, well, they are on MySQL, on Postgres, or whatever database they use. And now it would be asking a little bit too much, typically, to say, okay, can you, you know, make it happen so it works for all the databases, in particular if this needs some large scale refactoring?
So that's where we have this kind of balance, and we try to be as generic as possible. But still, we don't want to overwhelm a contributor with with this,
[00:25:35] Unknown:
kind of work. Yeah. And that that's a big challenge, and that's, something that, Gunnar and the the community, you know, as I when when I transitioned, project leadership to to Gunnar, that was something that, you know, was really, a lot of technical debt. When we started out, you know, only having 3 connectors, it was very difficult. And, actually, we we started with with MySQL, and then we added sort of a prototype for for Postgres, but it really was not, fully formed. And, you know, even from that point, we could see that there was going to be some duplication, but there was it wasn't clear where those, abstractions needed to be. And we just hadn't seen enough of those patterns. And and I think, you know, the community under Gunnar's leadership has done, you know, a a really great job of trying to refactor and find those. And it's very challenging because, you know, all of these databases, not only are their APIs different, but a lot of times their semantics are different. And so, you know, finding sort of common patterns among, 4 or 5 different, you know, very different APIs,
[00:26:38] Unknown:
and and how to use them, can be quite challenging. Yeah. Absolutely. I would say it's it's still an ongoing battle for us. I think so we we made some some good progress there, but still there's many things, you know, which we would like to see done differently. But, it's it's, you know, it's an ongoing effort, I would say. Yeah. The fact that the different databases
[00:26:56] Unknown:
have differences in terms of just how they operate in general, I'm interested in how that manifests as far as the types of information that they can expose for change data capture, and what that means for the end user as far as whether there are some sort of more granular events available, or if there's any sort of tuning available on the Debezium side to say, aggregate all the events up to sort of this level of granularity, or I wanna be able to see every type of change event, you know, not necessarily just at transaction boundaries, but, like, every change within that transaction. I'm just curious, are there any differences along those lines? Yeah. So currently,
[00:27:41] Unknown:
really, you would get a change event for each change. So for each insert, each update, each, delete, you would get a event or you would get an event. Now what definitely can differ just depending on your configuration is what is the payload of the event. So giving 1 particular example for Postgres, it's a matter of configuration whether for an update event, you would get the complete old row state, or not. So, typically, by default as events are designed, you would get the full new state of this record which was updated and also the full previous state. So that's, can be very useful, obviously. But, then well, depending on how the database is configured, you might not get this complete, previous row state. Maybe you just get a state of those columns which actually were updated in this particular, statement.
So let's say your table has 10 columns and you just update 2 out of those 10, then depending on the configuration, you might just see the old value for those, or maybe you can only see the value for those columns which are affected. That's also the case for Cassandra, for instance. So definitely, you would have some differences at this level. Now, as you were mentioning, transactional guarantees and awareness of transaction boundaries, that's also something which comes up very often. So typically, or as it is right now, you would get all the changes without any awareness of the transactional boundaries. And what we receive as feedback from the community is that people would like to have this awareness. So let's say you are doing 5 updates within 1 transaction. There are many use cases where you essentially would like to await all those 5 change events, maybe they are in different tables, and then process them at once instead of processing them 1 by 1 as they arrive. So that's also something we would like to add, probably very early next year. So we would have a way to expose the information about those transactional boundaries. So that probably would be a separate topic which has markers: okay, now a transaction has started; then there would be another event, okay, this transaction has completed, and by the way, within this transaction, this and that number of records in different tables has been affected. So this would allow a consumer to essentially await all the events which are coming from 1 transaction and then do this kind of aggregated processing. So that's definitely a very common requirement, and we would like to better support this.
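For readers who have not seen one, a Debezium change event value is an envelope whose payload carries the operation type plus the before and after row images Gunnar describes here. A small sketch of pulling those fields out with Jackson, assuming the JSON converter and the envelope layout of the 1.0-era connectors:

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class EnvelopeInspector {

    private static final ObjectMapper MAPPER = new ObjectMapper();

    // Prints the operation plus the old and new row state from a Debezium change event value.
    // With the JSON converter and schemas enabled, the envelope sits under a "payload" node;
    // with schemas disabled, the value is the envelope itself.
    public static void inspect(String eventValue) throws Exception {
        JsonNode envelope = MAPPER.readTree(eventValue);
        if (envelope.has("payload")) {
            envelope = envelope.get("payload");
        }
        String op = envelope.path("op").asText();   // "c" = create, "u" = update, "d" = delete, "r" = snapshot read
        JsonNode before = envelope.path("before");  // old row state; may be null or partial, as discussed above
        JsonNode after = envelope.path("after");    // new row state; null for deletes
        System.out.printf("op=%s%n  before=%s%n  after=%s%n", op, before, after);
    }
}
```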
[00:30:15] Unknown:
Yeah. And I'm curious what you have seen as far as just tension in terms of the system design and feature sets of Debezium, from just being a raw pipe, for these are the events that are coming through, do with them what you will, to being something that has some level of interpretation of those events, for being able to say, okay, these are the events that happened, this is what the actual outcome of that is, so that there is maybe less business logic required on the consumer end to be able to take advantage of those events, at the expense of being able to have all of the granularity
[00:30:54] Unknown:
for people who either might want it or have to be able to have it for maybe some sort of compliance reasons? Yeah. I mean, so yeah. As I mentioned, that's something we would like to better support. So it's not quite clear to me yet how it looks, so we will work, or the community will work, towards making this a reality. But 1 idea I could see us doing is to have some sort of ready made stream processing application. So it would be a service we would provide as part of the Debezium platform. And this service would be configurable, so you could tell it, okay, process those topics. So coming back to this order and order line example. So you could say, okay, take those 2 topics, order headers and order lines. And then, by means of this additional transaction boundary topic which we are going to create, please produce transactionally consistent events which represent the entire purchase order and all its order lines. So then you would have events on a higher level than the pure row level events. So that's 1 of the ideas I could see us doing, having this kind of configurable downstream component. So it could be based on Kafka Streams, for instance.
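Gunnar's order-header/order-line aggregation can be sketched with Kafka Streams roughly as follows. The topic names and keys are assumptions (out of the box the order-line topic would be keyed by the line's own primary key, so a real topology would first re-key it by order id), and the string-based aggregation is deliberately naive: it ignores updates and deletes of individual lines and skips proper serdes.

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;

public class PurchaseOrderAggregator {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "purchase-order-aggregator");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // Change events for the order header table, keyed by order id.
        KTable<String, String> orders = builder.table(
                "dbserver1.inventory.purchase_orders",
                Consumed.with(Serdes.String(), Serdes.String()));

        // Change events for the order line table, assumed to be re-keyed by order id
        // upstream. Each order's lines are collected into a JSON-ish array string.
        KTable<String, String> linesPerOrder = builder
                .stream("order_lines_by_order_id", Consumed.with(Serdes.String(), Serdes.String()))
                .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
                .aggregate(
                        () -> "[]",
                        (orderId, line, agg) -> "[]".equals(agg)
                                ? "[" + line + "]"
                                : agg.substring(0, agg.length() - 1) + "," + line + "]",
                        Materialized.with(Serdes.String(), Serdes.String()));

        // Join header and lines into one aggregate event per purchase order.
        orders.join(linesPerOrder,
                        (order, lines) -> "{\"order\":" + order + ",\"lines\":" + lines + "}")
                .toStream()
                .to("purchase_orders_with_lines", Produced.with(Serdes.String(), Serdes.String()));

        new KafkaStreams(builder.build(), props).start();
    }
}
```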
And then another thing which we actually already have is support for the outbox pattern. So this is addressing a concern which we sometimes see: okay, people feel not that comfortable about exposing the internal structure of the table. So maybe they are in a world of microservices, and now they feel not comfortable about exposing their internal purchase order table and all its columns to downstream consumers. So let's say they want to rename a column or change its type and whatnot. So they want to have this kind of flexibility, and they don't want to impact any downstream consumers right away. And this outbox pattern has a nice way around it. And the way it works is, as part of your transaction, your application updates its tables. So it would update the purchase order table, it would update the order lines table, and so on. But then at the same time, it also would produce an event and write it into a separate table, the outbox table. So this really is like a messaging queue, you could say. And then you would use Debezium to capture the changes, essentially just the inserts, from this outbox table. And now for the events within this outbox table, you have the full control over their structure, and they would be independent from your internal database model. So if you were to rename this column in your internal table, well, you would not necessarily have to rename also the corresponding field in the outbox event. So you have some sort of decoupling there. And this is something we have added support for. So there's a routing component there, which would, for instance, allow you to take the events from this outbox table and stream them into different topics in Kafka. And this is definitely something where we see a strong uptake of interest in the community. So people have blogged about it, and that's definitely something which is very interesting because it addresses, I think, the questions you mentioned.
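A bare-bones sketch of the application side of the outbox pattern with plain JDBC: the business table and the outbox table are written in one transaction, and Debezium only captures the outbox inserts. Table and column names are made up, loosely following the conventions that Debezium's outbox event router expects (they are configurable).

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.UUID;

public class OrderService {

    // Writes the business data and the outbox event in ONE transaction, so the outbox
    // row (and therefore the event Debezium captures) exists if and only if the order does.
    public void placeOrder(String orderId, String customerId, String orderJson) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/orders", "app", "secret")) {
            conn.setAutoCommit(false);
            try {
                try (PreparedStatement insertOrder = conn.prepareStatement(
                        "INSERT INTO purchase_order (id, customer_id) VALUES (?, ?)")) {
                    insertOrder.setString(1, orderId);
                    insertOrder.setString(2, customerId);
                    insertOrder.executeUpdate();
                }
                try (PreparedStatement insertOutbox = conn.prepareStatement(
                        "INSERT INTO outbox (id, aggregatetype, aggregateid, type, payload) "
                                + "VALUES (?, ?, ?, ?, ?)")) {
                    insertOutbox.setString(1, UUID.randomUUID().toString());
                    insertOutbox.setString(2, "order");        // used by the outbox router to pick the target topic
                    insertOutbox.setString(3, orderId);        // becomes the Kafka message key
                    insertOutbox.setString(4, "OrderCreated");
                    insertOutbox.setString(5, orderJson);      // event payload, decoupled from the table layout
                    insertOutbox.executeUpdate();
                }
                conn.commit();
            } catch (Exception e) {
                conn.rollback();
                throw e;
            }
        }
    }
}
```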
[00:33:52] Unknown:
So I also think that, you know, there are a lot of different ways that people wanna use change data capture. They wanna get these event streams, and people have different ways of using their data stores. So there will be some users that have essentially large transactions where they're updating lots of records at once. And with those kinds of patterns, right, it's very difficult to sort of put all of those events into, like, to create a sort of super event, simply because of the size of what that super event might be. But I think a lot of the cases are all around these really smaller transactions.
Purchase order, I think, is a great example where there's a there's a relatively limited number of rows that are are are changed or or entities that are changed in in a data store. And so I think oftentimes, you know, it does make sense. It is a common pattern to be able to sort of reconstruct what those, you know, super events might be that are more transactionally oriented rather than,
[00:34:52] Unknown:
sort of, individual entity events. And for somebody who wants to get Debezium set up and integrated into their platform, what are the components that are required, and some of the challenges that are associated with maintaining it, keeping it running, and going through system upgrades?
[00:35:09] Unknown:
So the main component, or let's say the main runtime we target, is Kafka Connect. So Kafka Connect is a separate project under the Apache Kafka umbrella, and it's a framework and also a runtime for building and operating those connectors. And you would have source and sink connectors, and Debezium essentially is a family of source connectors. So they get data from your database into Kafka. So you would have to have your Kafka Connect cluster. This typically is clustered, so you have, again, high availability and these kinds of qualities. So you would deploy the connectors there.
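Deploying a Debezium connector onto that Kafka Connect cluster boils down to posting a JSON configuration to the Connect REST API. Here is a hedged sketch using Java's built-in HTTP client; the property names shown are from the MySQL connector of roughly the 1.0 era and some have since been renamed, so treat them as illustrative rather than authoritative.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RegisterConnector {
    public static void main(String[] args) throws Exception {
        // Connector configuration as JSON; hostnames and credentials are placeholders.
        String config = """
                {
                  "name": "inventory-connector",
                  "config": {
                    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
                    "database.hostname": "mysql",
                    "database.port": "3306",
                    "database.user": "debezium",
                    "database.password": "dbz",
                    "database.server.id": "184054",
                    "database.server.name": "dbserver1",
                    "database.whitelist": "inventory",
                    "database.history.kafka.bootstrap.servers": "kafka:9092",
                    "database.history.kafka.topic": "schema-changes.inventory"
                  }
                }
                """;

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8083/connectors")) // Kafka Connect REST endpoint
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(config))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```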
Then we expose some sorts of metrics and monitoring options so you always can be aware of, okay, what state is the connector in? Is it doing a snapshot? At which table is it in the snapshot? How many rows has it processed? Or is it in streaming mode? So, you know, which was the last event? What is the kind of lag you would observe between the point in time when the event happened in the database and when we process it? So you have all those monitoring options. In terms of upgrading, that's a very interesting question because there are many moving parts. Alright? So there's the Debezium version.
Maybe you do schema changes while the connector isn't running. Consumers, they must be aware of all these things. So that's something you have to be careful with. We always document very carefully the upgrading steps. So let's say you go from 1 Debezium version to the next, you would have a guide which describes, okay, that's the procedure you have to follow. And then, generally, we try to be very cautious and careful about not breaking things in an incompatible way. So that's actually very interesting. When I took over from Randall, I think the current version was 0.5 or 0.4, so some early 0.x version. And now you would think, okay, you know, that's very early, and it was very early in the project lifecycle. But still, people were already using this heavily in production. It only grew ever since. So now if you do an upgrade, and we put out 0.8, 0.9, 0.10, well, we have lots of production users already. And now, of course, we don't want to make their lives harder. Right? So we are very cautious, if we have to change something, to deprecate options so it's not a hard break. You have some time to adapt to certain changes and this kind of stuff. And now just very recently, yesterday, literally, we went to the 1.0 version. So that's the big 1.0 final version we have been working towards for a long time. And now we are even more strict going forward about those guarantees. So that's the message structure, you know. There are guarantees in terms of how this evolves in future upgrades and this kind of thing. Yeah. And Debezium
[00:37:52] Unknown:
and really event streaming, they're they're complicated distributed systems because they rely upon a a number of, you know, already distributed components. And so Apache Kafka, you've got the Apache Kafka broker cluster, and and that, in itself is is nontrivial in terms of setting it up and keeping it running and make sure you don't lose anything. There's the connect cluster as as Gunnar mentioned, and and then there's the the connectors that are deployed on onto that, connect cluster. And they all can sort of be upgraded, at different times, and at different levels. And so, you know, the Kafka community does really, I think, a fantastic job of maintaining backward compatibility, and and we try to do the same thing in encouraging the connector developers to to maintain backward compatibility so that, you know, as connectors are upgraded from 1 version to the next, that, you know, users don't have to, you know, change a configuration
[00:38:47] Unknown:
in order to do the upgrade or after they do the upgrade. And I think, you know, Debezium does a really great job of that. Yeah. And let's say I have not heard too many complaints about us breaking things, so that's a good sign. Another variable there is the topology of your database. So, typically, people run their database at least in some sort of cluster, so they can fail over from a primary to a secondary node in case there's a hardware failure, this kind of thing. So this means also Debezium typically must be able to reconnect to this new primary node, so the earlier secondary node, the newly promoted primary node. So we need to be able to support that. And that's something people already use in the community. So I have feedback there, but it's still something we also want ourselves to more systematically test and have this kind of setup for. And sometimes people don't want to connect Debezium to their current production database. Maybe they feel they just want to give it a try and they don't want to connect to the primary database right away. So there are ways, for instance, with MySQL, where you could have a MySQL cluster and use the MySQL clustering to, you know, replicate amongst the nodes within the cluster. And then you would use Debezium, and it would connect to 1 of the secondary nodes. And, obviously, this adds a little bit of latency, but then you have this decoupling between the primary database and Debezium, if that's what you are after. So that's, again, 1 particular deployment scenario which people use and which we need to test, as it's another moving
[00:40:20] Unknown:
piece here in the puzzle. And different connectors, as they talk to these particular DBMSes, different types of connectors place different loads on that source system. And so, you know, the load on a MySQL database might be, you know, a certain amount, and the load on a SQL Server database may be different, and that also affects how many connectors you can run, because each of those connectors will be essentially reading the, you know, database logs and placing a load on the database. And so a little bit of that depends on how the relational database system, you know, exposes that functionality to Debezium and other CDC clients. So that's 1 other factor to consider.
And really, you know, Debezium, this sounds really complicated because it is. Right? I started Debezium with an emphasis on solving enterprise data problems. And so, right, anytime you're doing that, you have to deal with all these fault tolerance issues and network degradation issues. And it's very important in order to get those guarantees, right, that was sort of why we chose Kafka, and it's why connectors are sort of written the way they're written. And when somebody picks up Debezium, or even some of the other connectors, it can seem like it's overly complicated when you're just, you know, hey, I've got a local database, and I just wanna try this out. I wanna see what the streams of database changes are. And in order to get there, you sort of have to set up, you know, Kafka and Kafka Connect, and you have to deploy a connector, and then you have to, you know, consume the events. And so at this very small sort of proof of concept scale, it can seem a little overwhelming.
But, you know, all of that infrastructure is really the best way to get enterprise-scale functionality and change data capture that is tolerant of faults and
[00:42:12] Unknown:
degradation throughout the system that every enterprise has from time to time. And I think it's interesting that you called out the sort of close connection between the design of Debezium and its initial build target of Kafka. And I'm curious if you have explored, or what the sort of level of support is for, other streaming backends such as Pulsar or Pravega, or if you've looked at any other architectures or sort of deployment substrates for the Debezium project itself. Yeah. Actually, we do. So,
[00:42:46] Unknown:
definitely, Kafka is what, I don't know, 95% of the people are using Debezium with. But then, actually, 1 thing which we had even when I took over is what we call the embedded engine. And this allows you to use Debezium without Kafka and Kafka Connect; it's a way to consume change events directly in your application. So you just register a callback method. And now, interestingly, we see people using this embedded engine to connect it to other kinds of substrates, as you call them. So definitely people are using this with Amazon Kinesis. I'm aware of people using it with NATS. So people are using this. Then, as you were mentioning, Pulsar, that's again a very interesting case because they even have Debezium support out of the box. So there's a counterpart to Apache Kafka Connect, which is Pulsar IO. So it's essentially the same thing, a runtime and framework for connectors, but targeting Apache Pulsar. And now, if you get Pulsar, you already get the Debezium connectors with it, and they are running essentially via Pulsar IO in this case. So that's definitely interesting to see. And it's also on our road map going forward to abstract a little bit more from the Kafka Connect API. So it would be, you know, even easier or maybe more efficient to implement or to get those kinds of use cases where you would like to use something else. And what are some of the most interesting or unexpected or innovative ways that you've seen Debezium used? So, 1 thing I'm seeing is huge installations, and I find that very interesting. So I know there's 1 user, they have 35,000 MySQL databases, 35 k MySQL databases, and they stream changes out of those databases using Debezium. And, typically, these kinds of use cases are multi-tenant use cases. So they would have, like, 1 schema with the same set of tables for each of their tenants. So that's why they have this large amount of logical databases. Or the other day, somebody came and says, oh, I have, you know, a couple of hundred Debezium Postgres connectors, and now I have this particular problem, so please help me. And for us, it's always very interesting. Well, we don't really have a way to test with a couple of hundred connector instances. So getting this kind of feedback from these people there is very valuable, because, well, in this case, they said, okay, you know, I have this huge amount of tables there, and this is running just fine. Just my connector, it needs 80 gigabytes of heap space. So that's a bit too much. Can we do something about it? And then, well, we were taking a look and we saw, okay, so there's, you know, this sort of metadata structure which we need to have for each table column, and now they have, like, loads of columns.
And, well, we could optimize this a little bit, and we could remove some redundancy. And now they went from 80 to 70 gigs of heap space. So that's a nice improvement for just 1 line of code change for us, essentially. So I would say having those large installations, that's very interesting to see. Oftentimes, or sometimes, people build their own control planes. So I know of people, they're also in this multi-tenancy business, so they need to spin up hundreds of connectors regularly, and it's just not feasible to do this by hand. Right? So they have all this automated, they have scripts to do that, then they have some monitoring layer so that they can see all their connectors, and some nice sort of dashboard. And this is really fascinating to see, how far people push Debezium to its limits there.
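A rough sketch of the embedded engine mentioned earlier in this answer, which trades the Kafka Connect deployment for an in-process callback that could forward events to Kinesis, NATS, Pulsar, or anything else. Class and property names assume the io.debezium.engine API from around the 1.0 release and the Postgres connector; connection details are placeholders.

```java
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import io.debezium.engine.ChangeEvent;
import io.debezium.engine.DebeziumEngine;
import io.debezium.engine.format.Json;

public class EmbeddedCdc {
    public static void main(String[] args) {
        // Requires debezium-embedded plus the chosen connector on the classpath.
        Properties props = new Properties();
        props.setProperty("name", "embedded-inventory");
        props.setProperty("connector.class", "io.debezium.connector.postgresql.PostgresConnector");
        props.setProperty("offset.storage", "org.apache.kafka.connect.storage.FileOffsetBackingStore");
        props.setProperty("offset.storage.file.filename", "/tmp/offsets.dat");
        props.setProperty("database.hostname", "localhost");
        props.setProperty("database.port", "5432");
        props.setProperty("database.user", "postgres");
        props.setProperty("database.password", "secret");
        props.setProperty("database.dbname", "inventory");
        props.setProperty("database.server.name", "dbserver1");

        // The registered callback receives every change event; from here it could be
        // forwarded to another messaging system or handled inside the application.
        DebeziumEngine<ChangeEvent<String, String>> engine = DebeziumEngine.create(Json.class)
                .using(props)
                .notifying(event -> System.out.println(event.key() + " -> " + event.value()))
                .build();

        ExecutorService executor = Executors.newSingleThreadExecutor();
        executor.execute(engine); // DebeziumEngine implements Runnable
    }
}
```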
[00:46:34] Unknown:
And in your own experience of helping to build and grow the Debezium project and community, what have you found to be some of the most challenging or complex or complicated aspects? Randall, do you wanna take it? Yeah. I think starting out, when we created the Debezium project, or when I created the Debezium project, there was really nobody there. That's very typical for open source projects, and that's the way Red Hat sort of does its, you know, research and development. But there were people that had the same problems, and they wanted to collaborate and work on a solution. And there were more people that wanted to just use a solution, whether it was available or not. And so slowly, the community sort of started building up. And then as, you know, the community started building up, you know, the connectors became more stable, had more functionality, had sufficient functionality for more and more use cases. And, you know, I think the great thing about the Debezium community is it's, you know, growing quickly, now at a rate that, you know, increases every year. And there's a very large number of people that have contributed to Debezium, and probably a significantly larger number of users of the Debezium connectors. And, you know, I think the last few years,
[00:47:49] Unknown:
the project has done a fantastic job engaging the community and getting people to collaborate and work together to build, you know, these connectors that are very valuable for lots of people. Yeah. Absolutely. I mean, the community building part, that's also what I, you know, find the most interesting, I would say. I mean, it's always interesting to do the technical side of things and solve this very complex problem there. But really growing this community and seeing how it, you know, gets bigger and bigger, that's really fascinating. And 1 thing I'm very proud of, it's a very friendly community. So I don't remember any incident where we had, you know, any sort of flame wars going on or whatever. So people are helpful. And the best thing I always like to see is when somebody comes to the user chat and they have a question, and then somebody else from the community helps them out. So that's really great, to see people help out on the mailing list. And also, as Randall mentioned, many people contribute. So right now, we have about 150 different people who have contributed. You know, sometimes it's a small bug fix, but then there's other people who stick around and, for instance, they use, let's say, the SQL Server connector, so they have a continued interest in making this better and better. And then in the case of the Cassandra connector, this is even led by the community. So it's not Red Hat engineers who are leading this effort, but it's people from a company called WePay. So they open sourced this under the Debezium umbrella, and they are leading this effort. So this is great to see, how all those people from different backgrounds, different organizations come together, and we all work together
[00:49:18] Unknown:
in a joint effort to make this open source CDC platform a reality. Alright. Are there any other aspects of the Debezium project or change data capture that we didn't discuss yet that you think we should cover before we close out the show? I think we discussed everything
[00:49:34] Unknown:
that's relevant. I would say if you are interested, come to debezium.io. You can find all the information there and you can find the links to the community there. And, well, we can, you know, get any discussion started.
[00:49:47] Unknown:
Alright. Well, for anybody who does want to get in touch with either of you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, starting with you, Randall, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. I think,
[00:50:04] Unknown:
really just sort of I mean, we have a lot of different components for for using data in different ways, and I think, you know, enterprises are are capturing a lot more data than than ever, and being able to use that effectively and and have, sort of a a collective understanding of within, you know, companies of what data they have, what data processes they have, the relationships, you know, around provenance, and and data governance. You know, we're we're we're really, I think, still we're we're making progress, but still there's there's just a very large ecosystem that we have to get our hands around. And I think that we'll be continuing to to struggle with that and to to make progress maybe more slowly than what we prefer. And, Gunnar, how about yourself? Yeah. I, I would say,
[00:50:54] Unknown:
I I would see the main challenges in a pretty similar area. So I think if you have a large organization, there tends to be many, many data producers, many data formats flying around. There's different ways how data is exposed in synchronous ways, asynchronous ways. You have APIs. You have data streams. You have Kafka, maybe different fabrics, like, I don't know, under other messaging infrastructure. So just being able to see, what's going on, what are the data formats, who is owning them, who is allowed to see them, and who can, you know, see which part of the data. Answering all these questions, I think this is really the key point. So how do formats evolve? How how long should format versions stay around? So having some sort of insight in all this data formats and schemas and structures in your enterprise, I think, having an answer from that or having good tools for that, I would think that would be a very valuable asset to have. Alright. Well, thank you both for taking the time today to join me and discuss your experience of building the Debezium project and helping to grow the community around it. It's definitely a very interesting project and 1 that I have seen referenced numerous times, and it's been mentioned in the show a number of times. So thank you both for all of your efforts on that front, and I hope you each enjoy the rest of your day. Thank you very much, Tobias. Absolutely. Thank you so much for having me. It was a pleasure.
[00:52:15] Unknown:
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Sponsor Message
Interview with Randall Hauch and Gunnar Morling
Understanding Change Data Capture (CDC)
Debezium: Solving CDC Problems
Challenges in Supporting Multiple Database Engines
Comparison with Other CDC Solutions
Initial Snapshots and Consistency
Debezium's Architecture and Evolution
Granularity and Transactional Boundaries
Setting Up and Maintaining Debezium
Exploring Other Streaming Backends
Interesting Use Cases and Community Contributions
Challenges in Building and Growing Debezium
Gaps in Data Management Tooling
Closing Remarks and Contact Information