Summary
The world of business is becoming increasingly dependent on information that is accurate up to the minute. For analytical systems, the only way to provide this reliably is by implementing change data capture (CDC). Unfortunately, this is a non-trivial undertaking, particularly for teams that don’t have extensive experience working with streaming data and complex distributed systems. In this episode Raghu Murthy, founder and CEO of Datacoral, does a deep dive on how he and his team manage change data capture pipelines in production.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.
- RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
- Your host is Tobias Macey and today I’m interviewing Raghu Murthy about his recent work of making change data capture more accessible and maintainable
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by giving an overview of what CDC is and when it is useful?
- What are the alternatives to CDC?
- What are the cases where a more batch-oriented approach would be preferable?
- What are the factors that you need to consider when deciding whether to implement a CDC system for a given data integration?
- What are the barriers to entry?
- What are some of the common mistakes or misconceptions about CDC that you have encountered in your own work and while working with customers?
- How does CDC fit into a broader data platform, particularly where there are likely to be other data integration pipelines in operation? (e.g. Fivetran/Airbyte/Meltano/custom scripts)
- What are the moving pieces in a CDC workflow that need to be considered as you are designing the system?
- What are some examples of the configuration changes necessary in source systems to provide the needed log data?
- How would you characterize the current landscape of tools available off the shelf for building a CDC pipeline?
- What are your predictions about the potential for a unified abstraction layer for log-based CDC across databases?
- What are some of the potential performance/uptime impacts on source databases, both during the initial historical sync and once you hit a steady state?
- How can you mitigate the impacts of the CDC pipeline on the source databases?
- What are some of the implementation details that application developers and DBAs need to be aware of for data modeling in the source systems to allow for proper replication via CDC?
- Are there any performance challenges that need to be addressed in the consumers or destination systems? e.g. parallelism
- Can you describe the technical implementation and architecture that you use for implementing CDC?
- How has the design evolved as you have grown the scale and sophistication of your system?
- In the destination system, what data modeling decisions need to be made to ensure that the replicated information is usable for analytics?
- What additional attributes need to be added to track things like row modifications, deletions, schema changes, etc.?
- How do you approach treatment of data copies in the DWH? (e.g. ELT – keep all source tables and use DBT for converting relevant tables into star/snowflake/data vault/wide tables)
- What are your thoughts on the viability of a data lake as the destination system? (e.g. S3/Parquet or Trino/Drill/etc.)
- CDC is a topic that is generally reserved for conversations about databases, but what are some of the other systems where we could think about implementing CDC? e.g. APIs and third party data sources
- How can we integrate CDC into metadata/lineage tooling?
- How do you handle observability of CDC flows?
- What is involved in debugging a replication flow?
- How can we build data quality checks into CDC workflows?
- What are some of the most interesting, innovative, or unexpected ways that you have seen CDC used?
- What are the most interesting, unexpected, or challenging lessons that you have learned from digging deep into CDC implementation?
- When is CDC the wrong choice?
- What are some of the industry or technology trends around CDC that you are most excited by?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- DataCoral
- DataCoral Blog
- Hive
- Hadoop
- DBT
- FiveTran
- Change Data Capture
- Metadata First Blog Post
- Debezium
- UUID == Universally Unique Identifier
- Airflow
- Oracle Goldengate
- Parquet
- Trino
- AWS Lambda
- Data Mesh
- Enterprise Message Bus
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
Your host is Tobias Macey. And today, I'd like to welcome back Raghu Murthy from Datacoral to talk about his recent work of making change data capture more accessible and maintainable for his customers. So, Raghu, can you start by introducing yourself? Thanks, Tobias. It's been about a couple of years since I've been here.
[00:01:13] Unknown:
So I'm the founder and CEO of Datacoral. I've been building a data engineering platform over the past few years, and of late, we have found that a lot of companies want to make their production database data available for analytics in their data warehouse. So we've been building a solution for them that I'm happy to talk about. My background has been in distributed systems and databases. I've been doing that for a while now, starting off at Yahoo back in the early 2000s, then Facebook, which is now about 10, 12 years ago. I worked on the big data stack, including Hive and Hadoop, and pretty much every layer of the data infrastructure stack there.
And after that, I ended up leading engineering at a startup called Bebop, which built an enterprise application platform,
[00:01:58] Unknown:
and then worked on Datacoral. I've been doing that for about 5 years. As you mentioned, you've been in the sort of big data and distributed systems space for a while, and we dug a bit into Datacoral specifically in the previous episode. I'll add a link to the show notes for that for anybody who wants to go back and learn more there. So I'm wondering if you can just
[00:02:17] Unknown:
give a bit of context of some of the ways that Datacoral has grown or evolved or changed since the last time that we spoke. At the outset, Datacoral was started with the explicit goal of making it easy for companies to leverage their data without having to build the plumbing around data pipelines. So our software would automate a lot of that plumbing, and we have had some success with that. And, also, the space has evolved. As you know, in the past couple of years, we've had some really great tools like dbt and Fivetran and the other kinds of data integration services that have shown up. And what we have found is that we are actually working a lot more closely with these tools, and our microservices architecture has made it very easy for us to live alongside a lot of them. At the same time, we are also finding that the end-to-end platform that we have built has allowed us to offer some sophisticated solutions to problems that are actually pretty hard for other tools to solve, specifically something like change data capture, or CDC.
So we have had a lot of customers who typically start off using our connectors for services like Salesforce and Marketo. But then once they start getting into replicating data from their production databases into the warehouse, that's where our connectors really shine. And after that, we are able to help orchestrate any of the model building or the transformations that need to happen after the data lands in the data warehouse. We went through, I think in the previous podcast, how we have provided a shared metadata layer for the entire data pipeline. Our entire architecture is built metadata first. I've also written a little bit about it recently in a blog post about how to even think about building a data stack. And as far as Datacoral itself is concerned, we have made it very easy for our customers to start off small. They can use any of the connectors or even the transformation orchestration that we have built out. But then after that, they can grow and add more and more capabilities or bring in other tools as well. So I've done a couple of episodes where
[00:04:22] Unknown:
I covered different aspects of change data capture, most notably talking to the folks behind the Debezium project, which is a framework for being able to manage the overall stream of change data capture data. But for people who are trying to understand a bit more about when it's useful, can you give a bit more context about what change data capture is and some of the particular use cases that it's beneficial for? Yeah. Definitely. So CDC, or change data capture, is a solution for a very specific problem in the context of analytics.
[00:05:02] Unknown:
So in general, when you have production databases like Oracle, MySQL, or Postgres, they are used to build applications. They are transaction processing systems, so they allow applications to make changes to the data and also, of course, serve the data. These databases, for purposes of disaster recovery or even high availability, have their own kind of system replication log that allows a replica to be created off these production databases. And this change log, which is used to create replicas, has of late been made available for other clients, if you will, to read these changes and do other things with them. So in the case of analytics, when you think about data that is sitting in your production databases, for the most part, that data is actually probably the most valuable data in your data warehouse.
Right? So this is data that your application is using. It could have transactions around your billing, how your product is being used. And for an analyst, if you wanna actually query that data, you can't really run those queries on that production database because you'll be overloading that production database. And there is another aspect of understanding the data, which doesn't get talked about that much, which is that understanding the data not only requires you to analyze the data as of that point in time, but you should also have a sense for how the data is changing over time, because that is what gives you the true understanding of what the data means. Clearly, the source of truth of the data is your application, the business logic of that application.
And the application is making changes to the database, the production database. So now as an analyst, for you to truly understand what's happening with that data, you also wanna know how the data is changing. So now combine these two things: not wanting to run analysis on the production database, and the need to understand how the data is changing. The perfect solution for that seems to be CDC, where you get a reliable log of all the changes that are happening in your data, and you take that log and put it into a different system to do the analysis rather than trying to do the analysis in the production database itself. So that combination is what makes CDC really interesting for analytics.
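As an illustrative aside (not something from the conversation), here is a minimal Python sketch of what a single decoded change event from a database replication log typically looks like; the field names are made up for the example.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class ChangeEvent:
    """One decoded entry from a database replication log (illustrative shape only)."""
    source_table: str            # e.g. "public.orders"
    operation: str               # "insert", "update", or "delete"
    primary_key: dict            # e.g. {"order_id": 42}
    before: Optional[dict]       # row image before the change (updates and deletes)
    after: Optional[dict]        # row image after the change (inserts and updates)
    committed_at: datetime       # commit timestamp of the source transaction
    log_position: str            # e.g. a Postgres LSN or MySQL binlog offset

# The analytics copy of a table is rebuilt by replaying events in log order,
# while the raw events themselves can be kept around to study how data changed.
example = ChangeEvent(
    source_table="public.orders",
    operation="update",
    primary_key={"order_id": 42},
    before={"order_id": 42, "status": "pending"},
    after={"order_id": 42, "status": "shipped"},
    committed_at=datetime(2021, 3, 1, 12, 0, 5),
    log_position="0/16B3748",
)
```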
[00:07:19] Unknown:
In terms of the historical context of the data and understanding what the changes are, that's also reminiscent of an architectural pattern for software design where, rather than writing the specific data into the database, you have a sort of queue of events that then get processed by the application to mutate state, so that you can maintain a bit more context about the intent behind some of these events. So CDC is kind of the same idea, but on the other end of the application.
[00:07:51] Unknown:
Yeah. Absolutely. That's absolutely right. The only thing that I would say is that building applications that way, in terms of maintaining a queue of events, is actually pretty hard, and most application developers don't really do that. It takes a lot of scaffolding to build applications in that way. So, typically, applications are directly making changes to the database instead of creating this queue of events. But the good news is that the databases themselves are producing that kind of queue of events on the other end. So CDC is a way to capture value on the other end. For people who are
[00:08:26] Unknown:
working on building data integration systems, a lot of times they'll start with a more batch-oriented approach where they'll dump the entirety of a table or some subset of the table using a SQL query and then export that as a CSV or a Parquet file and load it either into S3 or a data warehouse or what have you. And if they're working with an API, then they'll likely go with a more batch-oriented approach as well. So I'm wondering if you can just discuss some of the alternatives to change data capture as a means of data integration, but also maybe some of the alternatives for being able to maintain some of that contextual history and the history of data changes
[00:09:06] Unknown:
within the database or within other systems. You mentioned kind of 2 things. Right? 1 is the batch oriented kind of data integration, and then there is CDC, which has a connotation of streaming. So I wanna actually distinguish between the 2 or at least disambiguate between the 2. You can technically have, like, a micro batch CDC as well. You kind of build the systems in that way. But the way to kind of think about data integration specifically, especially around production databases, we just talked about how you can fetch the changes. So data integrations are typically built to be able to kinda fetch changes or, like, fetch data from the sources, like you mentioned, and put them into the data warehouses.
But when you think about change data capture, it is not only fetching the data or getting the changes from the source and applying them to the data warehouse. You're actually getting every single change as well. So that's the difference between general data integration and CDC. But batching versus streaming is almost orthogonal. So you could potentially do CDC in a micro-batch way. For example, when you're fetching the change events from, let's say, a production database, in order to apply those changes to the data warehouse, you can apply them in batches. You don't really have to apply the changes row by row to the data warehouse; in fact, it's not even efficient to do so. Instead, you can do it in micro-batches.
So CDC, specifically for data integrations, is a slightly different problem that needs to be solved. And especially because you're loading data into a data warehouse, even though you're fetching a stream of changes, you'll have to batch them up to efficiently load them and update tables inside of the warehouse. Right? So the main challenge in CDC for data integration is what is the best way for you to get all the changes in real time, and then what is the best way for you to apply the changes to the data warehouse. So the line is definitely blurring when you think about data integrations in general and CDC.
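To make the micro-batch idea concrete, here is a hedged sketch of the apply side: collapse a batch of change records to the newest change per primary key, stage them, and upsert into the warehouse in one statement. The table names, the staging convention, and the exact MERGE syntax are assumptions that vary by warehouse.

```python
def latest_change_per_key(events):
    """Collapse a micro-batch so only the newest change per primary key survives.

    `events` are change records shaped like dicts with "primary_key", "operation",
    and "committed_at" fields; a real implementation would order by log position.
    """
    latest = {}
    for ev in sorted(events, key=lambda e: e["committed_at"]):
        latest[tuple(sorted(ev["primary_key"].items()))] = ev
    return list(latest.values())

def build_merge_statement(table, key_column, columns):
    """Build an upsert that applies a staged micro-batch to the warehouse table.

    Assumes the batch was first loaded into a staging table named <table>_changes
    with an extra "operation" column; MERGE dialects differ across warehouses.
    """
    set_clause = ", ".join(f"t.{c} = s.{c}" for c in columns)
    insert_cols = ", ".join(columns)
    insert_vals = ", ".join(f"s.{c}" for c in columns)
    return (
        f"MERGE INTO {table} AS t "
        f"USING {table}_changes AS s ON t.{key_column} = s.{key_column} "
        f"WHEN MATCHED AND s.operation = 'delete' THEN DELETE "
        f"WHEN MATCHED THEN UPDATE SET {set_clause} "
        f"WHEN NOT MATCHED AND s.operation <> 'delete' THEN "
        f"INSERT ({insert_cols}) VALUES ({insert_vals})"
    )

print(build_merge_statement("orders", "order_id", ["order_id", "status", "updated_at"]))
```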
[00:11:07] Unknown:
Change data capture as an approach to data integration has been growing in terms of popularity and availability in recent years because of the growing list of technologies that are available to implement it and the data sources that produce an event log for being able to replicate. But for teams who are considering whether to go down the road of actually setting up a CDC pipeline to pull data from their production systems into a data warehouse, what are some of the factors that they need to consider when making that decision and some of the barriers to entry that still exist for
[00:11:45] Unknown:
the overall sort of category of technologies and the architectural pattern and the just the systems necessary to make this a reality? There are multiple factors that go into whether CDC is even the right solution for you. The very first thing is whether the data source actually can support change data capture. So most modern databases allow you to replicate these changes from the database to a different system, as in you get these kinda change logs that you can read. But you do need to make changes to the sources, so you need to configure it to support replication.
If you don't have access to the database or enough permissions on the database to enable replication, then CDC is not even an option. In other cases, your data may not even be valuable enough for you to invest in CDC. Right? So if you have some configuration data that's sitting somewhere and you just want a copy of it, and you don't really need a very highly reliable, high fidelity copy in your data warehouse, there's not even a need to invest in CDC. You can run something that's very simple, just reading the data every so often and then applying the changes to the warehouse. The thing that CDC does, specifically in the case of data integration, is raise the bar for what a good data integration means. Right? In some sense, if all data integration could be done as CDC in a simple way, then that is what I would think everybody would want, because it gives you really reliable copies.
You can not only see point-in-time data, but you can see exactly what has changed. But the first barrier to entry is that you don't even know if CDC is an option, or you don't even know what CDC is. So hopefully people are learning about CDC through podcasts like this and the fact that there are more people talking about it. And then after that, the technology has to become easier to use. So even now, yes, at Datacoral we have built a packaged way of doing CDC, but many parts have to come together for you to have a solution that actually works reliably. So I think over time both of these are gonna improve. One is the technology becoming a lot easier, with a lot less friction to use, and that also feeds into more and more people learning about it and wanting it. So CDC is definitely, in our minds at least, a better way to do data integration than what has been happening so far.
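For a sense of what "configuring the source to support replication" can involve, here is a hedged sketch for a Postgres source; the exact settings, privileges, and decoding plugin depend on the Postgres version and on whether it is a managed service, so treat it as a starting point rather than a recipe.

```python
# A hedged sketch of configuring a Postgres source for log-based CDC.
# Assumes a superuser connection and Postgres 10+; on managed services (RDS,
# Cloud SQL, etc.) some of these settings live in parameter groups instead.
import psycopg2

conn = psycopg2.connect("dbname=appdb user=postgres")  # hypothetical DSN
conn.autocommit = True  # ALTER SYSTEM cannot run inside a transaction block
cur = conn.cursor()

# 1. Turn on logical decoding (takes effect after a server restart).
cur.execute("ALTER SYSTEM SET wal_level = 'logical'")
cur.execute("ALTER SYSTEM SET max_replication_slots = 10")
cur.execute("ALTER SYSTEM SET max_wal_senders = 10")

# 2. Expose the tables whose changes should be streamed.
cur.execute("CREATE PUBLICATION cdc_pub FOR ALL TABLES")

# 3. Create a replication slot for the CDC connector to read from.  The slot
#    retains WAL until the consumer acknowledges it, so an abandoned slot can
#    fill up the disk on the source, which is one of the operational risks here.
cur.execute(
    "SELECT pg_create_logical_replication_slot(%s, %s)",
    ("cdc_slot", "pgoutput"),
)
```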
[00:14:19] Unknown:
And in terms of the actual database model and the source systems, beyond just being able to enable replication, are there any data modeling or table design considerations that need to be made when preparing to set up CDC to make sure that it is reliable and that you could actually
[00:14:40] Unknown:
meaningfully reconstitute the data? The most obvious one is the fact that, if you are working with a database, you want all of your tables to have primary keys. If a table does not have primary keys, then it becomes fairly inefficient, even with CDC, to replicate all of the changes reliably into a data warehouse. The other thing is around having timestamps that indicate when a particular row was updated. This is not a necessary condition, but it definitely helps. And it in fact adds some level of discipline in how the applications themselves are built, so that whenever they're updating a row in a table, they're also updating the timestamp in the updated-at column, if you will. And that allows the changes, when they come in, to be applied to the destination tables and to get a reliable copy of that data that is also easily queryable in the data warehouse.
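As a small illustration of these two recommendations (a primary key plus a reliably maintained updated-at timestamp), here is a hedged sketch of a Postgres source table; the names are hypothetical and the trigger syntax assumes Postgres 11 or later.

```python
# Illustrative DDL for a CDC-friendly source table: a primary key so changes can
# be applied reliably downstream, and an updated_at column kept current by a
# trigger even if application code forgets to set it.
SOURCE_TABLE_DDL = """
CREATE TABLE orders (
    order_id    BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    customer_id BIGINT NOT NULL,
    status      TEXT NOT NULL,
    updated_at  TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE FUNCTION touch_updated_at() RETURNS trigger AS $$
BEGIN
    NEW.updated_at := now();
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER orders_touch_updated_at
    BEFORE UPDATE ON orders
    FOR EACH ROW EXECUTE FUNCTION touch_updated_at();
"""
```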
[00:15:44] Unknown:
And in terms of the primary keys and identifiers, do you find that there's any utility in adopting something like a UUID in place of an integer for the primary keys, or anything like that, for being able to use them in downstream analytics systems?
[00:15:59] Unknown:
Yes. They definitely help. But like I talked about with the timestamps, they're not a necessary condition. Right? So as long as there is a primary key in the change logs, there is enough context to know which specific table's change it is for every change record that comes in. So then, with whatever primary key, or primary keys for that matter, we are able to find, our CDC-based data integration should be able to apply the changes into the destination data warehouse in a reliable way. UUIDs are really good at providing uniqueness across a really large namespace, but they are also, in some cases, kind of expensive. Right? Sometimes that overhead may not be necessary, but, yeah, having UUIDs as primary keys is definitely
[00:16:42] Unknown:
helpful. Going back a little bit to some of our earlier conversation, you were mentioning how systems such as dbt and Fivetran have been able to integrate well with your existing business and that change data capture has been a real growth area for you. I'm wondering if you can just talk a bit about some of the ways that change data capture fits into the broader data platform, particularly where you have other data sources like APIs or third party systems or databases that maybe you're just pulling in batch because they don't have enough valuable data to merit the investment of CDC.
[00:17:19] Unknown:
Yeah. Before I talk about specific tools, I wanna talk about what our overall philosophy has been in terms of building out an end-to-end data pipeline and an overall data stack. And that'll give you an idea of how we think about the different tools fitting in. So we started off metadata first. Right? And by metadata, I mean the configurations for the data integrations themselves, the actual transformation logic, and any orchestration metadata or orchestration configuration that can be stored in a central place so that data pipelines can actually leverage all of this metadata. I've written quite a bit about it, and there's more that we are writing about it as well. We won't have time to get into all of that. But the overall principle that we follow is this: take any connector that is pulling data from different places, whether from APIs, databases, or even event streams, specifically for data integrations.
If you can translate your requirements into a micro-batch-oriented data pipeline, then it is actually fairly straightforward to synchronize all of your data integrations and the corresponding transformations. Here's a quick example of how this might typically play out. If you have a shared metadata layer, then a data integration's configuration will have information about how often data is actually coming in. Let's say data is coming in from Salesforce every day, and you're using a data integration tool like Fivetran. And then you also have a CDC connector that is pulling data in a continuous way but is updating the data warehouse, let's say, every hour, just to make the math a little bit easy.
So now when you wanna run a transformation, which is a daily report, you wanna make sure that all the data for that day arrived from Salesforce using the Salesforce connector, and you wanna wait for 24 hours of the CDC data to also have been applied to the data warehouse before your reporting query actually runs. Right? And this synchronization is typically hand-coded in data pipelines using workflow managers like Airflow. But in our case, because we started off metadata first, our orchestration can automatically wait for all of the data for a given day to show up before a transformation for that day runs, without any additional programming needed.
So that applies when we work with other tools, and we have our own connectors too. We have about 80-odd connectors that are pulling from APIs or making event streams available, and we also have the CDC connectors. Having a shared metadata layer is how we are able to synchronize all of them.
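Here is a rough sketch of that metadata-driven wait, not Datacoral's actual API: each integration records a high-water mark of what has landed in the warehouse, and a daily transformation runs only once every input has covered the full day. The connector names and the in-memory dictionary are stand-ins for a real metadata store.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical shared-metadata view: the latest timestamp through which each
# integration has landed data in the warehouse.
HIGH_WATER_MARKS = {
    "salesforce_daily": datetime(2021, 3, 2, 0, 15, tzinfo=timezone.utc),
    "postgres_cdc":     datetime(2021, 3, 2, 1, 0, tzinfo=timezone.utc),
}

def ready_to_run(report_day, inputs):
    """A daily report runs only when every input covers the full report day."""
    day_start = report_day.replace(hour=0, minute=0, second=0, microsecond=0)
    day_end = day_start + timedelta(days=1)
    return all(HIGH_WATER_MARKS[name] >= day_end for name in inputs)

# The orchestrator polls this check instead of hand-coding waits in a workflow manager.
if ready_to_run(datetime(2021, 3, 1, tzinfo=timezone.utc),
                ["salesforce_daily", "postgres_cdc"]):
    print("all inputs for 2021-03-01 have landed; kick off the daily report")
```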
[00:20:12] Unknown:
RudderStack's smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and Zendesk help you go beyond event streaming. With RudderStack, you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up for free at dataengineeringpodcast.com/rudder today.
In terms of the overall landscape of tooling for change data capture, I'm wondering if you can give your perspective as somebody who is running this as a production service and some of the shortcomings that you've seen in terms of what you've had to implement or what you've been able to take off the shelf and just sort of your overall sense of where we're headed as an industry?
[00:21:08] Unknown:
Specifically around CDC, as you know, there are multiple solutions available. I broadly classify them into three different categories. One is the data integration providers who also offer CDC connectors. Right? These connectors are able to do the basic functionality of shipping the change logs from the source and applying them in the destination, but they're not fully featured. There are all kinds of changes that need to be handled, especially around schema changes; there might be cases where they apply too much load on the source systems, or it might be that they're not able to provide data quality guarantees.
So that's the first kind, which is data integration providers offering CDC connectors. The second class is your traditional CDC providers, starting with, let's say, Oracle GoldenGate. They are really sophisticated. They provide pretty much everything that you want from a CDC solution. They've been around for a really long time. They started off as monoliths, and they offer anything and everything that you want, but they're a pretty big, heavy lift to even bring into your overall stack. And, also, it turns out they're incredibly expensive. And the third category is a lot of homegrown solutions that are being used, either leveraging some of the open source technologies like Debezium that people are putting together themselves, or other solutions like ours where people have taken these end-to-end data engineering platforms and tried to build a CDC solution.
What we have done is actually package it so that it starts looking like the first category, which is a connector that can be just added with a few clicks. But then it is a lot closer to kind of a fully featured kind of CDC solution, especially around data integration that we are able to provide. And as far as
[00:22:53] Unknown:
the challenges that you've seen in your own work and in working with customers who maybe have started to try to go down the path of implementing CDC on their own. What are some of the complexities that crop up once you actually start to dig into this and run it at a production scale?
[00:23:12] Unknown:
Yeah. That's something that we talk to our customers about a lot, especially ones who already have a solution, or think that they can build their own, or want to build their own. It turns out CDC is actually a disappointingly hard problem to solve. There are a lot of challenges that you have to deal with when you're working with a CDC solution. The first thing, of course, that we always talk to them about is security. Do you want to expose your production databases to the Internet? If you're okay with that, then maybe some SaaS solution works for you. Otherwise, you have to build something in house, something that's working inside of your environment. The second thing is around the initial setup itself. As soon as you deploy a CDC solution, it gives you the ability to fetch the changes from the time that you have deployed the connector, right, or from the time that you have enabled CDC. So there's a whole lot of historical data that you need to synchronize to your data warehouse first, and then, in a consistent way, hand off from this historical sync into a change data capture way of fetching changes.
And for these historical syncs themselves, there's a lot of challenges. How do you make sure that you don't overwhelm the source system when you're doing the historical syncs? How do you make sure that you're doing them fast? Do you have enough intelligence to figure out how much concurrency is allowed by the source systems or how much concurrency you can take up in the source systems so that you get both the necessary throughput as well as not really overwhelm the source systems. The next challenge is on an ongoing basis, how is your kinda CDC solution performing?
There are cases where there could be things like replication lag. It could be either because replication or the CDC system itself has slowed, or maybe you're fetching the changes from a replica and there is a lag in the replica itself. How do you deal with that replica lag? How do you deal with cases where a row may have been updated, but not all columns have been updated accurately, so it's not a high fidelity copy? Are you able to do data quality checks, using things like checksums, to make sure that the copy that you have in the data warehouse is a high fidelity copy of the source? And this specific problem around data quality is actually pretty tricky. Right? Because you know that you want there to be a high fidelity copy, but the source is also changing. So if the source is changing, you don't know at what point in time you should even perform the check, because there might be a bunch of changes that have not yet been applied in the destination.
So then how do you make sure that what is in the destination is an accurate representation of what was in the source maybe a few seconds or a few minutes ago? Being able to do that in a consistent and reliable way is a pretty hard problem. And the last part that I would mention is around the ongoing provisioning and maintenance of a CDC solution. If your data volumes grow, is your CDC solution able to auto-scale, or do you have to reprovision or stop your replication and restart it? And even there, when you stop and restart it, do you end up losing data? What happens when, let's say in the case of Postgres CDC, you drop the replication slot? Then when you bring it back up again, there's a whole bunch of changes that were lost. How are you able to handle those? You can go on and on about providing solutions for each of these challenges.
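To illustrate the "at what point in time do you even perform the check" problem, here is a hedged sketch of a consistency check that only compares rows both sides have definitely had a chance to apply; the row shape and the fixed headroom are assumptions, and real systems usually compare per-chunk checksums rather than whole tables.

```python
import hashlib
from datetime import datetime, timedelta

def rows_as_of(fetch_rows, table, cutoff):
    """Fetch rows whose updated_at is at or before the cutoff.

    `fetch_rows` stands in for a query against the source or the warehouse; only
    rows last touched before the cutoff are compared, since newer changes may not
    have reached the destination yet.
    """
    return [r for r in fetch_rows(table) if r["updated_at"] <= cutoff]

def checksum(rows):
    """Order-independent digest of the rows being compared (keyed by an 'id' column)."""
    digest = hashlib.sha256()
    for row in sorted(rows, key=lambda r: str(r["id"])):
        digest.update(repr(sorted(row.items())).encode())
    return digest.hexdigest()

def copies_match(fetch_source_rows, fetch_dest_rows, table, replication_lag):
    # Leave headroom for replication lag plus the micro-batch apply interval.
    cutoff = datetime.utcnow() - replication_lag - timedelta(minutes=5)
    return (checksum(rows_as_of(fetch_source_rows, table, cutoff))
            == checksum(rows_as_of(fetch_dest_rows, table, cutoff)))
```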
I wanna repeat though that CDC for data integration actually elevates data integration to a level that it hasn't been so far, which is providing this level of transactional guarantees that all the data that's in the data warehouse
[00:26:49] Unknown:
is a reliable copy of what was in the source. And that brings up a lot of the questions of data quality and observability of the overall system. You mentioned a little bit of that as far as being able to compare summary statistics of the data warehouse versus the source database and sort of when to compute those statistics. But what have you seen as the integration points or the extension points for CDC pipelines to be able to add custom data quality checks or observability, to understand what the current state of the replication is, and just some of the overall ways to also integrate that with your metadata, so being able to understand the lineage of the records that are being transferred through the CDC pipeline?
[00:27:35] Unknown:
Yeah. So like I mentioned, we start with the metadata. Right? So we have a data model for the metadata that is essentially the CDC pipeline metadata: what is the latest batch of records that were fetched as well as applied into the data warehouse, are there any translations happening from the source types to the destination types, and also the latest timestamps that were seen at the source versus at the destination. All of this metadata is what is available in our metadata layer, and this metadata is also what is being used by our orchestration.
So we have the ability to notify our downstream processing that all data for a given time period, let's say an hour or a day or whatever, is now available for analysis. That notification is actually driven by the metadata that we have already captured; there is nothing different that we are doing for this type of orchestration of downstream work. And the same thing goes for observability as well, where that exact same pipeline metadata that we have captured and that our orchestration system uses is also what is available for customers to inspect what is happening in the CDC pipeline. We provide out-of-the-box data quality checks, but they can add their own too. And when there are data quality checks that fail, we are able to show them the specific records where the data quality checks failed, and we provide them the ability to trace just those records from the source, through being staged, all the way to being applied to the destination warehouse.
So our end to end kind of data engineering platform that we have built from the ground up to be metadata first allows us to provide these things out of the box without having it be something that we glom on top of what the actual data flow itself is.
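As a loose sketch of the kind of per-table record such a metadata layer might keep (the field names here are invented, not Datacoral's schema), the point is that orchestration, observability, and quality checks all read from the same place:

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class CdcPipelineMetadata:
    """Illustrative record of what a metadata-first CDC pipeline tracks per table."""
    source_table: str
    destination_table: str
    last_batch_fetched: str             # newest batch pulled from the change log
    last_batch_applied: str             # newest batch merged into the warehouse
    latest_source_timestamp: datetime   # newest commit timestamp seen at the source
    latest_applied_timestamp: datetime  # newest commit timestamp reflected downstream
    type_mappings: dict = field(default_factory=dict)   # source column type -> warehouse type
    failed_quality_checks: list = field(default_factory=list)  # record ids to trace end to end
```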
[00:29:29] Unknown:
In terms of the downstream systems, are there any data modeling considerations or specific advice that you have as far as how to actually think about representing the information in the destination system, particularly because you do have these historical records of the values of different rows at different points in time? You could simply represent it as a direct mirror of the source database, or you might need to maintain that understanding of those historical rows. There's a lot of options here. Right? And a lot of it is determined by what the business use cases are. If
[00:30:08] Unknown:
the most simple use case is that of treating CDC as just yet another kind of data integration, I don't care about any of how the underlying data is coming in as, like, a set of change logs or whatever. I just need a version of the data that represents what was in the source at some point in time. So in those cases, like I mentioned, having primary keys and a CDC solution that is able to apply changes to the destination is, for the most part, enough. So then you get into a little bit more detail as to what do you wanna do with hard deletes on your source. So first of all, is your application doing soft deletes or hard deletes? If it is soft deletes, it's fine. But if it is hard deletes, then do you want to replicate the hard deletes also in the destination? So this is something that the data analysts or the people who are creating the data can choose.
So if they say, okay, apply the hard deletes in the destination warehouse, then you do need the primary keys so you can figure out which rows to delete in the destination. Or if you wanna just turn a hard delete in the source into a soft delete in the destination, then pretty much every table would have something like a deleted-at column, which gets updated by the CDC solution. So as an analyst, you know that a particular record was deleted at some point in time, and you can either choose to query those records or not. The next level of consideration is, do you want to query the changes themselves, or do you just, like I mentioned in the simpler use case, want the end result of the CDC? If you are going to query the changes themselves, then you typically don't want to store the changes in a similar schema as the final tables, if you will, which are true replicas of the source. We have customers who only want to see the deleted records, the hard-deleted records, in a separate table. Those tend to be more like logs. So the changes that are coming in from the source database, we are able to load them in append mode, if you will, in the destination warehouses for people to do things like auditing or even debugging their application in some cases.
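A minimal sketch of the hard-delete versus soft-delete choice just described, assuming change records carry a commit timestamp; whether to physically remove the row or stamp a deleted-at column is a per-table configuration decision.

```python
from datetime import datetime
from typing import Optional

def apply_source_delete(warehouse_row: dict, deleted_at: datetime,
                        mode: str = "soft") -> Optional[dict]:
    """Apply a hard delete seen in the source change log to the warehouse copy.

    mode="hard": return None so the caller drops the row from the destination too.
    mode="soft": keep the row but stamp deleted_at so analysts can still query
    (or filter out) deleted records.
    """
    if mode == "hard":
        return None
    soft_deleted = dict(warehouse_row)
    soft_deleted["deleted_at"] = deleted_at
    return soft_deleted

row = {"order_id": 42, "status": "shipped", "deleted_at": None}
print(apply_source_delete(row, datetime(2021, 3, 1, 12, 0, 5)))
```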
But in the case of getting, like, a an updated version of the data in the destination warehouse, clearly, there are some consideration as to the type of warehouse. Does it support merge queries, for example, or does it support kind of doing in place updates? So those become some of the considerations. And I think you had mentioned something about data lakes. We can kinda talk about what it means to kind of do CDC into a data lake as well. Yeah. I think digging into that is definitely useful because
[00:32:40] Unknown:
with a data warehouse where it's a database engine, it's a little bit easier to understand. You know, you translate the log of events into the insert, delete, update queries at the destination system. But when you're working with a data lake, particularly if you're working at the file level where you're trying to batch up all these events into a set of parquet files and then load that at s 3 and then be able to represent that in something like the Hive metastore for querying using Presto or Trino. There's a lot more complexity involved there, and so I'm curious to understand what your experience has been there and some of the ways that you can think about creating the pipeline to be able to batch up those records into a parquet file or a set of JSON log files or whatever it might be to then be able to actually perform analyses on it through various query engines or just from doing direct compute using something like Spark?
[00:33:38] Unknown:
In general, doing in-place updates in these data lakes is a hard problem. I mean, there are newer variations of these kinds of data lakes, like Delta, that seem to allow updates, but there is still some way to go for them to be efficient. But the way that we, at least, have architected our entire pipeline, like I mentioned, to apply a change to a data warehouse, doing individual row inserts or updates is actually not efficient. You have to batch them up. You don't have to batch them up by day or anything like that, but you still have to batch them up enough so that you're not unnecessarily overwhelming the data warehouse. So whenever we get these changes as a stream, we actually batch them up and load them into a file system like S3 before we copy that in merge mode into our data warehouse.
So now if you wanna query the logs as is from the data lake that can be sitting on top of a s 3 or a file system, what you get are these batches of changes that have kinda come along. So when you think about CDC, we didn't really talk about it. You get a single stream of changes that happen to all tables in a particular database. So these changes are not coming in as separate streams. You can set it up that way, but the setup itself ends up becoming kind of fairly complex. So you can assume, for simplicity that you have a single stream, but that stream of changes has changes from across all tables. So the first thing that you do in a data lake, if you wanna query everything in a data lake, is actually split that stream into changes that can be applied to a specific table. So if you have a table like customers and, let's say, orders, changes to both customers and orders is coming in the same stream.
So you end up creating two different tables in, let's say, Presto, one for customer changes and one for order changes. And you can then query those changes separately because they can have separate schemas; they are different tables. And then you can also build a pipeline that can create a table that has all the latest records that you have seen in the change logs. So we do the exact same thing, where when we do a historical sync, that entire snapshot of the data is also stored in the data lake and available as a table. And then you can build a pipeline that is applying those changes to this table.
But given that it's immutable data sitting in these file systems, we're essentially making copies. Even back in the day at Facebook, we used to do this, where every day a new version of these tables shows up. And that version is basically a join of the changes that came in with whatever was there the previous day, let's say, because it's a daily pipeline. So you end up creating these daily partitions of the latest snapshots as of each day, and then you can use things like retention and so on so that your space doesn't just blow up. But that is one way of approaching being able to both query the changes themselves and get the latest snapshot over a period of time. One last thing about having these snapshots over a long period of time: people were able to do more efficient longitudinal analysis because they now have snapshots over, let's say, three months. Right? So they're able to run analysis on those snapshots, where each snapshot is a different partition in those tables.
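A rough sketch of that snapshot-plus-changes pattern for a data lake, with plain Python dictionaries standing in for partitioned Parquet files; in practice this would run in something like Spark or Presto over daily partitions.

```python
def next_snapshot(previous_snapshot, todays_changes, key="id"):
    """Fold one day of change records into yesterday's snapshot to produce today's partition.

    `previous_snapshot` is a list of row dicts; `todays_changes` is a list of change
    records with "operation", "primary_key", "after", and "committed_at" fields.
    """
    rows = {row[key]: row for row in previous_snapshot}
    for change in sorted(todays_changes, key=lambda c: c["committed_at"]):
        if change["operation"] == "delete":
            rows.pop(change["primary_key"][key], None)
        else:  # insert or update: keep the latest row image
            rows[change["after"][key]] = change["after"]
    return list(rows.values())

# Each day's output is written as a new partition, e.g. s3://lake/orders/dt=2021-03-01/,
# so analysts can query the raw change log, the latest snapshot, or any historical day.
```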
[00:37:03] Unknown:
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting, and often takes hours to days. Datafold helps data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage, and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow and dbt and seamlessly plugs into CI workflows.
Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they'll send you a cool water flask. In terms of the actual moving pieces, we've kind of skated around the edges of it. But I'm wondering if you can dig a bit more into all of the different systems that are involved in actually running a full CDC solution. So, obviously, there are the source databases. There are the destination systems. But what are some of the other pieces that go in the middle? Because you're talking about, for instance, having all of the events in a single stream and then being able to split out the streams and process those. So what are some of the components that go there? Is it something like a KSQL stream or, you know, Kafka Streams that you're working with, or something like an Airflow DAG or a Dagster pipeline that you're working with to be able to process those streams in a batch-oriented manner? Just some of the different technologies and components that go into that. We end up using services like Lambda, AWS Lambda, or containers for limited
[00:38:55] Unknown:
kind of cases. So there are four different things that I can talk about in terms of the requirements, and there are different components that go into the different aspects. The very first thing that we do whenever a connector is added is these historical syncs. These historical syncs are all done pretty much using Lambdas, because we try to batch up these small batches of data that can be moved forward very efficiently, and we can pick up from where we left off. And then as we are reading these change logs, this is where more long-running processes are needed, so we end up using containers for that. But then as these changes are getting written out into a file system like S3, we also have the ability to split that data up into micro-batches that are written out on a per-table basis into the file system.
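Here is a hedged sketch of that per-table splitting step: one mixed stream of change records fans out into per-table micro-batch objects before being loaded. The key layout and JSON serialization are made up; a production version would write Parquet and handle checkpointing and retries.

```python
import json
from collections import defaultdict

def split_into_table_batches(change_records):
    """Group a mixed stream of change records by the table they belong to."""
    batches = defaultdict(list)
    for record in change_records:
        batches[record["source_table"]].append(record)
    return batches

def write_micro_batches(batches, put_object):
    """Write one object per table per micro-batch.

    `put_object(key, body)` stands in for an S3 upload (e.g. boto3's put_object).
    """
    for table, records in batches.items():
        safe_position = str(records[-1]["log_position"]).replace("/", "-")
        key = f"cdc/{table}/batch-{safe_position}.json"
        put_object(key, "\n".join(json.dumps(r) for r in records))
```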
And there are also cases where we have to fetch checksums or compute some of these statistics on the source. Those are also done in a fully serverless manner. And then, of course, the loads that happen into the data warehouse, they're also done in a serverless way. And then the last part is detecting schema changes. So we have these microservices that are constantly looking at the source database, especially in cases where the DDL changes are not also part of the change log. We have sensors that are constantly looking at the source systems to detect these changes and then apply them, and those end up triggering historical syncs and so on. So all of that orchestration is something that we have built internally on top of our metadata-first orchestration system. We end up using a lot of services out of AWS, but the fact that we have built this metadata-driven orchestration makes a lot of this pipeline building a lot easier for ourselves than if we were to bring in a bunch of other systems as well. In terms
[00:40:42] Unknown:
of the experience that you've had of digging deep into change data capture and building a service around it for your customers to be able to trust and rely on for their data systems. What are some of the most interesting or unexpected or challenging lessons that you've learned in that process?
[00:40:58] Unknown:
I think one of the first things that we were surprised about is how some analysts, in order to truly understand the data, are not only analyzing the data, but also analyzing the change logs themselves, because they wanted to see how the data evolved over time. Typically, when you're thinking about CDC, especially around data integration, you're just storing the final updated values, and you're doing it in these batches, so you're seeing the changes in the destination every so often. But the fact that they wanted to see every single change was something that was, you know, interesting for us. Thankfully for us, it was already available for them to query, so that's something we've been able to serve. There are other challenges, especially around sharded databases, that we have a reasonable handle on, but we don't offer something directly in our product just yet. As you can imagine, sharded databases are typically built out because people want their production databases to scale reliably, and the way that they scale is by creating multiple smaller databases that each hold a small fraction of the customer data instead of one big database.
So being able to provide a CDC solution for sharded databases is an interesting challenge. And then the other challenge that we are working on is a way to provide transactional consistency in the data warehouse. It's a little bit nuanced. And, again, we are looking at how the expectation of data quality from a data integration gets elevated using CDC. For example, when I'm trying to do analysis on a particular set of tables that came through CDC, how do we make sure that all the tables that have been updated have been updated with rows in such a way that they're transactionally consistent with the source database? If a single transaction in the source updated 10 different tables, how do we make it so that all 10 tables are also updated in the destination before the analysis gets kicked off? So that is something that we are working on, and I think it's a pretty interesting challenge there. In your experience of
[00:43:11] Unknown:
using CDC and working with your customers who are deploying these CDC solutions, what are some of the most interesting or innovative or unexpected ways that you have seen it used?
[00:43:26] Unknown:
People wanna just take the changes and then maybe update a search index, or they wanna update other production databases to be able to leverage only specific changes that have been applied in other parts of the application. So these are kind of generic CDC use cases. Our customers definitely do talk to us about not only the data integration side, but also the generic CDC use case, which I guess translates to the whole data mesh, this kind of new thing that's coming up where people want a way to consistently update different kinds of databases, what they call polyglot systems,
[00:44:03] Unknown:
from the same change log that is coming from a production database. From working with all of the different source databases in particular and other source systems, what are your thoughts or hopes on the future direction of change data capture as a technology pattern and the ways that these different source systems can and should evolve to make it easier to achieve and maybe some unified API or sort of common interface for being able to consume these event streams so that it's easier for us as data engineers to be able to actually consume them and use them effectively in downstream systems.
[00:44:44] Unknown:
So that's kind of the holy grail. Right? People have talked about what are called enterprise event ledgers, where, if systems were built that way, every application, every database, every system would actually be generating events for every operation that it performs. So now, as an analyst on the other end, getting access to all of these events is incredible. But it takes a lot of discipline to actually build it out that way from the get-go. What we are seeing, though, is that there are even some API-based services, and I think Stripe is a really good example, where they not only have APIs to fetch the data from different places, but they also offer an event stream of every single change that has happened across all of the different objects that are within Stripe.
So I think a lot of the movement towards this way of doing CDC for all kinds of applications and services, and of course databases, which have had it for a while, all goes along with whatever the requirements are, right, or what the expectations are. If there are more and more analysts saying, okay, I want consistent data, I want to be able to do analysis or even build data applications with really consistent data across different services, that is when these kinds of source systems will be pushed to also offer a way to get these replication logs. And then I think the databases have actually done a pretty good job of designing replication systems. Right? So following a lot of what we have learned from these databases and how they do replication, I think building applications that way will definitely help. There are newer and newer ways of talking about what you just said, just having a unified API for all changes across all systems.
Back in the day, they used to be called an enterprise message bus. This was TIBCO. So there's that realm. And there's this kind of enterprise event ledger is another way of talking about it. Data mesh is another 1. So I'm sure people are trying to get to the same answer in, like, kind of different eras, but it's a hard problem to solve. And our goal is with our connectors and the way we have built things out, we have built out an interface just for ourselves so that we can support different kinds of databases in terms of sources to be able to build CDC solutions. We are also happy to kind of, like, publish our interfaces to see if that's something that can help overall.
[00:47:09] Unknown:
Are there any other aspects of the overall space of change data capture and the ways that you are managing it at Data Coral and some of the ways that it's used that we didn't discuss yet that you'd like to cover before we close out the show? I think the most important thing to get out of this episode
[00:47:25] Unknown:
would be for people to kind of get themselves familiarized with what CDC even means and start thinking about, specifically, around data integrations. Is CDC even the right thing for you or, like, how could CDC help you? Clearly, we need to talk a lot more about change data capture, not only in the realm of data integration, but in general because it offers a way to provide really consistent replicas
[00:47:50] Unknown:
of data in different systems. Well, for anybody who wants to follow along with the work that you're doing and get in touch, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:48:07] Unknown:
I would actually say that there's probably no gap in the tooling. There are, in fact, too many tools. People don't know how to put them together. I see the gap in the way our data stack is being built. Like, we've been working on, like, a pretty opinionated way of making it kind of metadata first instead of kind of glomming on, like, metadata systems after building the data stack. So I hope that with this kind of renewed interest in a lot of these kind of metadata systems, there is a movement towards actually building a data stack
[00:48:38] Unknown:
metadata onwards. Right? So have metadata first as your way of thinking about building a data stack. Well, thank you very much for taking the time today to join me and discuss all the work that you've been doing with change data capture and the experiences that you've had from that. It's definitely a very interesting and growing problem domain. So I appreciate all the time and energy you've put into helping us figure that out and sharing your experiences, and I hope you enjoy the rest of your day. Yeah. Thank you so much for having me again. I always enjoy our conversation. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Guest Welcome
Data Coral's Evolution and Goals
Change Data Capture (CDC) Overview
Alternatives to CDC and Data Integration
Factors and Barriers in Implementing CDC
Database Modeling for CDC
Integrating CDC with Broader Data Platforms
Landscape of CDC Tooling
Complexities in Running CDC at Scale
Data Quality and Observability in CDC
Data Modeling Considerations for CDC
CDC in Data Lakes
Components of a CDC Solution
Lessons Learned from Implementing CDC
Future Directions and Hopes for CDC
Closing Remarks and Contact Information