Summary
Data integration from source systems to their downstream destinations is the foundational step for any data product. The increasing expectation that information be instantly accessible drives the need for reliable change data capture. The team at Fivetran has recently introduced that functionality to power real-time data products. In this episode Mark Van de Wiel explains how they integrated CDC functionality into their existing product and discusses the nuances of different approaches to change data capture from various sources.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
- You wake up to a Slack message from your CEO, who’s upset because the company’s revenue dashboard is broken. You’re told to fix it before this morning’s board meeting, which is just minutes away. Enter Metaplane, the industry’s only self-serve data observability tool. In just a few clicks, you identify the issue’s root cause, conduct an impact analysis—and save the day. Data leaders at Imperfect Foods, Drift, and Vendr love Metaplane because it helps them catch, investigate, and fix data quality issues before their stakeholders ever notice they exist. Setup takes 30 minutes. You can literally get up and running with Metaplane by the end of this podcast. Sign up for a free-forever plan at dataengineeringpodcast.com/metaplane, or try out their most advanced features with a 14-day free trial. Mention the podcast to get a free "In Data We Trust World Tour" t-shirt.
- RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
- Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.
- Your host is Tobias Macey and today I’m interviewing Mark Van de Wiel about Fivetran’s implementation of change data capture and the state of streaming data integration in the modern data stack
Interview
- Introduction
- How did you get involved in the area of data management?
- What are some of the notable changes/advancements at Fivetran in the last 3 years?
- How has the scale and scope of usage for real-time data changed in that time?
- What are some of the differences in usage for real-time CDC data vs. event streams that have been the driving force for a large amount of real-time data?
- What are some of the architectural shifts that are necessary in an organization's data platform to take advantage of CDC data streams?
- What are some of the shifts in e.g. cloud data warehouses that have happened/are happening to allow for ingestion and timely processing of these data feeds?
- What are some of the different ways that CDC is implemented in different source systems?
- What are some of the ways that CDC principles might start to bleed into e.g. APIs/SaaS systems to allow for more unified processing patterns across data sources?
- What are some of the architectural/design changes that you have had to make to provide CDC for your customers at Fivetran?
- What are the most interesting, innovative, or unexpected ways that you have seen CDC used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on CDC at Fivetran?
- When is CDC the wrong choice?
- What do you have planned for the future of CDC at Fivetran?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends at Linode. With their new managed database service, you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes with automated backups, 40 gigabit connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show.
You wake up to a Slack message from your CEO who's upset because the company's revenue dashboard is broken. You're told to fix it before this morning's board meeting, which is just minutes away. Enter Metaplane, the industry's only self serve data observability tool. In just a few clicks, you identify the issue's root cause, conduct an impact analysis, and save the day. Data leaders at Imperfect Foods, Drift, and Vendr love Metaplane because it helps them catch, investigate, and fix data quality issues before their stakeholders ever notice they exist. Setup takes 30 minutes. You can literally get up and running with Metaplane by the end of this podcast. Sign up for a free forever plan at dataengineeringpodcast.com/metaplane or try out their most advanced features with a 14 day free trial. And if you mention the podcast, you get a free "In Data We Trust World Tour" t-shirt. Your host is Tobias Macey, and today I'm interviewing Mark Van de Wiel about Fivetran's implementation of change data capture and the state of streaming data integration in the modern data stack. So Mark, can you start by introducing yourself?
[00:01:47] Unknown:
Thank you. So, yes, Mark Van de Wiel. My job is field CTO at Fivetran. I joined Fivetran about a year ago through the acquisition of HVR. And throughout my career, I've worked in the data replication and BI and analytics space.
[00:02:02] Unknown:
And do you remember how you first got started working in data?
[00:02:05] Unknown:
Absolutely. I was in university. I came across a project in my graduation year for the Dutch government where we wanted to build an information system for decision makers. So that's how I ended up in the data space. I learned about Oracle, and everything from there is history.
[00:02:24] Unknown:
You've been at Fivetran for about a year. I actually interviewed one of the founders about 3 years ago about Fivetran, allowing for the fact that you haven't been there for that whole span. I'm wondering if you can maybe call out some of the notable changes or advancements that have happened at Fivetran within the past 3 years or even just in the past year since you've been there?
[00:02:44] Unknown:
Yeah. Absolutely. So with Fivetran, we provide data integration as a service. And one of the key aspects of Fivetran is connectors. How many applications, software as a service applications, and databases can we connect to so that organizations can synchronize and consolidate data in an analytical environment in the cloud? So we always continue to evolve and expand the number of connectors that we work with. Over the past 3 years, we've grown from, let's say, about 100 connectors up to, right now, probably over 200 connectors, and this continues to go on.
In addition to that, security has had an incredible focus. Organizations are naturally very concerned about access to the data. You hear about data breaches almost on a daily basis. And for our platform, as a managed service, we obviously have to cater for the level of security that our customers expect. Now lastly, I wanted to highlight, of course, the HVR acquisition. Fivetran has had database connectors. However, there was a recognition about a year ago, just over a year ago, that building industry-leading database connectors, specifically around change data capture, is incredibly difficult.
And so Fivetran went out and bought what was considered to be one of the market-leading technologies with the HVR acquisition to make that part of the managed service. And that's what we've been focused on this past year: to integrate those technologies and make high volume data replication part of the Fivetran managed service.
[00:04:19] Unknown:
So in terms of the usage of change data capture and the utility and requirement for these near real time data feeds, what do you see as some of the scale and scope of usage across the industry both now and maybe compared to 3 to 5 years ago?
[00:04:39] Unknown:
That's a great question, and it kinda shows the boundaries that we're pushing with this kind of technology. Very recently, I've been working with a customer in the financial services industry who have a single database that generates up to 15 terabytes of changes in a day. So if you think about the sheer volume of changed data that goes on in this database, it's incredible. I think it shows how we've been progressing if I think back not just 3 to 5 years, but to some of the early days of my career when there were surveys out there about the largest data warehouse databases in the marketplace. Right? Like, Yahoo came out and Walmart came out. They had systems that were sometimes in the tens of terabytes in total volume, including the indexes, everything that was residing in the database.
And now, if we consider, like, 15 terabytes of change volume in a day, that is incredible. And that is then one out of many systems that organizations try to consolidate in their analytical environment. And you can imagine the volume dimension, how that has evolved and how that has become very dominant in our space.
[00:05:51] Unknown:
And as far as the approaches to change data capture, there are a number of different products out there. One of the most popular ones in the open source space is Debezium. And I'm wondering if you can talk to some of the different ways that change data capture manifests, where different database engines maybe have it built in natively versus having to bolt it on as an afterthought, and how that influences the kind of maturity and robustness of the capabilities that it provides.
[00:06:23] Unknown:
We could spend all of this recording on that topic, I suppose. But just to keep it at a relatively high level, you're absolutely right. Some of the technologies provide native options for log based change data capture. And let me actually take one more step back and go back to the concept of log based replication. Fundamentally, almost every transaction processing database that's out there will use a transaction log to record the changes. It becomes the ledger of what was going on in the database, and it is foundationally the basis for database recovery. The system crashes or the software crashes, the system restarts, the database restarts, and it goes back to the most recently committed state of the database by replaying changes from the transaction log on top of the data that was on disk. Now log based change data capture is widely considered to be the least intrusive approach to then get those changes out of the database.
Now, indeed, some databases have native capabilities to retrieve those changes out of the database, like the Postgres write-ahead log reader. Oracle has the LogMiner capability. SQL Server provides CDC tables, and other databases have different options. Those options can absolutely be used within the context of the database. However, whatever the limitations might be that are associated with that technology, possibly think about overhead, or think about the implementation, think about some of the data types, maybe limitations that are imposed upon you by the provider of the technology. Those are the ones that you have to live with.
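As a concrete illustration of the native options described here, the following is a minimal sketch of Postgres logical decoding through a replication slot, using the built-in test_decoding output plugin. The connection string and slot name are illustrative placeholders, and the server would need wal_level=logical plus a role with replication privileges.

```python
# Minimal sketch of Postgres's native log-based capture path (the write-ahead
# log reader mentioned above), using the built-in test_decoding output plugin.
# Connection string and slot name are illustrative placeholders.
import psycopg2

conn = psycopg2.connect("dbname=appdb user=replicator")  # role needs REPLICATION
conn.autocommit = True
cur = conn.cursor()

# Create a logical replication slot once; Postgres retains WAL from this point on.
cur.execute(
    "SELECT * FROM pg_create_logical_replication_slot('cdc_demo', 'test_decoding')"
)

# ... normal INSERT/UPDATE/DELETE traffic runs against the database here ...

# Read and consume the decoded changes that accumulated in the slot.
cur.execute(
    "SELECT lsn, xid, data FROM pg_logical_slot_get_changes('cdc_demo', NULL, NULL)"
)
for lsn, xid, data in cur.fetchall():
    # 'data' is a textual change description, e.g.
    # table public.orders: UPDATE: id[integer]:42 status[text]:'shipped'
    print(lsn, xid, data)

# Drop the slot when finished so the server stops retaining WAL for it.
cur.execute("SELECT pg_drop_replication_slot('cdc_demo')")
```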
Also, consider that running inside of a database generally comes with a certain amount of additional overhead. There are security validations. There is parsing. There are all of these, let's say, routines that are called with every operation that you submit down to a database. And this is all for a good thing. Right? Like, it's secure. It's recoverable. It's all of these attributes that we really love about these database technologies. However, there are systems, and certainly when I think back to the history of HVR, we've come across absolutely mission critical database systems where, essentially, a slowdown of the database technology would have a direct revenue impact to the organization. And there was a need for the absolute least intrusive way to perform change data capture. And with HVR and now Fivetran, we embarked upon this concept of so called binary log reading, where we essentially submit some changes to a database, access the database's transaction log files directly, and go figure out what happened in those transaction log files so that we can, in the end, parse out these changes for heterogeneous replication.
We run outside the database. There is no additional overhead. We came up with an architecture that was distributed, where we do some of the heavy lifting very close to the source, but then any more extensive processing happens downstream. There's compression. There's encryption in the mix. And with those, we've proven that we could acquire customers who had those mission critical, let's say, central core database technologies and could successfully capture changes out of these with absolutely minimal impact to database processing. And that is what customers really wanted in the end, and so the binary log reading has absolutely been a great success for us. Now if you compare that to Debezium that you had mentioned, Debezium largely falls back on the technologies, the capabilities that the database vendors provide from an out of the box point of view. Right? Like, it covers many use cases, and it's absolutely great technology. But there are just cases where some of the, let's say, the biggest databases, the busiest systems that are out there need something that goes slightly beyond that. And that's where the binary log readers come in. Another interesting element of this
[00:10:32] Unknown:
problem space and conversation is the different ways that real time data manifests, where right now we're talking about change data capture coming from the database. A lot of the conversation around real time data has, for the most part, been driven by event streams. So, like, click stream analytics or application generated events, sensor driven events, and being able to process those as they are emitted. And I'm wondering what you see as the differences in terms of capabilities and use cases for that data between these kind of real time event streams versus change data capture events. You know, some of the ways that we're able to lean on those technologies that were developed in those earlier stages of real time data to be able to
[00:11:19] Unknown:
facilitate things like change data capture and where we have to build additional capabilities above and beyond that. When you think about log streaming and you talk about some of the sources you've mentioned, right, whether it's clickstream or IoT data, a lot of these data streams are arguably relatively straightforward. Right? Like, with the clickstream, it's like, okay. I'm browsing the Internet, and here is where I go. It's always incremental. It's not like, okay, I go back, and the click that I did 10 minutes ago, I decided not to do that click but do a different click. Like, those kinds of operations don't happen. Right? And if we generate, let's say, sensor data based on some technology that resides inside of our car, for example, and we track, like, what is the thickness of our brake pads.
Like, every data point is a new data point. It's not like we're going back and updating historical data points. Now if we contrast that to CDC and relational database technology, of course, when we look at how we operate relational databases, there is generally dominantly inserts, but there's also updates. And in some cases, systems genuinely process deletes, and we wanna deal with those. And then for some of the use cases, you wanna look at it as an incremental stream of changes. But for many use cases, you also want to get the current state of the data. And I think where some of the most powerful use cases or some of the most interesting use cases come to bear is where you combine some of those sources, where maybe it's some of the reference data or maybe it's some of the core processing that happens in the ERP that is relevant in the context of what's going on with our IoT data or with our clickstream data or with our social media posts. And we're combining those datasets into a use case that shows an integrated, consolidated overview of a set of systems with real time aspects and attributes to it that in the end make the organization more competitive or more efficient or can save costs, like, whatever the ultimate business outcome is for the organization.
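To make that contrast concrete, here is a small, purely illustrative sketch (not Fivetran code): an append-only clickstream is enriched with reference data whose current state is maintained from a CDC feed that, unlike the clickstream, carries inserts, updates, and deletes. All table, field, and event names are hypothetical.

```python
# Illustrative sketch: an append-only clickstream enriched with reference data
# whose current state is maintained from a CDC feed. Unlike clickstream events,
# the CDC feed carries inserts, updates, and deletes against existing rows.
reference = {}  # current state of a hypothetical 'customers' table, keyed by primary key

def apply_cdc(op):
    """Apply one change event to the current-state view of the reference table."""
    if op["kind"] in ("insert", "update"):
        reference[op["key"]] = op["row"]
    elif op["kind"] == "delete":
        reference.pop(op["key"], None)

def enrich(click):
    """Join an immutable clickstream event with the latest reference row."""
    customer = reference.get(click["customer_id"], {})
    return {**click, "segment": customer.get("segment", "unknown")}

# The CDC feed: note the update and the delete, which never occur in the clickstream.
for op in [
    {"kind": "insert", "key": 1, "row": {"segment": "trial"}},
    {"kind": "update", "key": 1, "row": {"segment": "enterprise"}},
    {"kind": "delete", "key": 2},
]:
    apply_cdc(op)

print(enrich({"customer_id": 1, "page": "/pricing"}))
# {'customer_id': 1, 'page': '/pricing', 'segment': 'enterprise'}
```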
[00:13:26] Unknown:
For organizations who maybe have already invested in these real time streaming capabilities for event streams, for IoT, what are some of the new systems or new architectural patterns that they need to adopt in order to be able to also factor in change data capture feeds or be able to effectively process and analyze those data sources?
[00:13:50] Unknown:
Yeah. So if you consider the stream analytics or the technologies that you would use for streaming analytics, right, whether it's Kafka or it's something like a Google Pub/Sub or a Kinesis or an equivalent Azure technology, none of these technologies really provide the change data capture itself. Like, sure, I can point my data feed, my IoT feed, to the data stream, and it'll absorb it, and I can run my analytics against it as changes arrive, etcetera. Now if we wanna combine that data with a dataset that comes out of a more traditional database, it's like, well, okay, so what's the change data capture? What is delivering those changes into the data stream so we can incorporate those changes as part of the use case? So when we look at the CDC technologies and the role they play in the context of existing event stream use cases, how do we incorporate those datasets? Like, what is the change data capture mechanism? And that's where our technology can come in and be that feed that takes the data out of the SaaS application in near real time or takes it out of a database with, let's say, at most a couple of seconds of latency. Like, at the end of the day, we talk about real time, and I suppose we didn't define what real time really is. But in reality, it's always near real time. We're not directly querying the source. We're not delaying the transactions.
Instead, we're capturing the change once the commit hits the system, and, generally, within a couple of seconds of when the commit hits the system, it can get delivered into the data stream.
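A hypothetical sketch of that handoff, assuming the kafka-python client, a local broker, and made-up topic and record shapes: captured change records are published into a Kafka topic so existing streaming analytics can join them with clickstream or IoT topics.

```python
# Hypothetical sketch of delivering captured changes into an existing event
# stream (Kafka here). Assumes the kafka-python client; the broker address,
# topic name, and record shape are placeholders.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def publish_change(change):
    # Key by primary key so all changes for a given row land on the same
    # partition and are consumed in commit order for that row.
    producer.send("cdc.orders", key=str(change["id"]), value=change)

publish_change({"id": 42, "op": "update", "status": "shipped"})
producer.flush()
```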
[00:15:33] Unknown:
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state of the art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up for free or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder. In terms of the applicability and adoption of change data capture, one of the things that's necessary for making it viable and desirable is the ability to actually analyze that feed as it's coming in. And, you know, up till now, there have been a lot of additional capabilities that are necessary to make that feasible. And I'm wondering what are some of the shifts and evolutions in the ecosystem that you've seen that make this a more tractable problem for people who don't necessarily have, you know, millions of dollars of financing to be able to spend on all of the engineering time and infrastructure investment
[00:16:47] Unknown:
to be able to process these feeds as they come in. I think we see the technologies evolve on the streaming side. Right? So if you think about Kafka as an example, it's been already a few years that KSQL became available, right, with essentially a structured query language capability to be able to query streaming data. And that was maybe a first out of multiple where there is more of the ability to automate the analytics of what's happening to the data stream. And, likewise, I think if you consider, like, a Databricks as a destination, there's now Delta Live Tables, where the transformations, the analytics, the analysis, whatever it is that needs to happen to the data once it arrives, some of that happens automatically. So I think there is absolutely that evolution of the technologies that enable those use cases, and we see that happening more and more. Now all of that said, I think we still see a lot of use of our technology in somewhat relatively, let's say, traditional use cases as well, where it's like, okay, we need reporting, and we've been doing batch reporting for all this time. And now we start with consolidated reporting, and we don't wanna do too much of an upfront investment. We start by using cloud technologies, pay as you go, and we're feeding those different data sources. And, oh, wow, we can do closer to real time. Let's see what kind of use cases come out of that. We still see a lot of that as well, in addition to some of the more, call it, leading edge streaming use cases.
[00:18:23] Unknown:
Another interesting element of that is being able to actually update views or run queries as the data changes, where systems such as Materialize and some of the other kind of real time databases have been built to make that a possibility. And cloud data warehouses have generally been built up as these scalable systems that allow you to run these queries on a periodic basis and be able to process massive amounts of data without having to wait, you know, minutes or hours for it to complete. Although there is the question of being able to do that affordably if you want to keep your data up to date. So I'm wondering what are some of those kind of capabilities in the access path that make this a tractable problem and viable for companies to be able to actually want to pull in those data sources and query them on a continual basis.
[00:19:15] Unknown:
Like you mentioned, right, there's a lot of technologies in this space that look at the problem from a slightly different angle. And I'll take it all the way back to, call it, federated access to systems. Right? Like, you consolidate your view of the world by just always connecting to the source applications. Now one of the benefits, of course, is you're looking at up to date data because you're hitting the source directly. However, if you try to pull large volumes of data together and you wanna join these datasets and combine these datasets on the fly, well, that becomes a challenging problem. And indeed, scalability costs come into play as well as, arguably, the real time aspect. You could say, well, yeah, we're accessing the source, so of course the data is real time. If it then takes a couple of hours to pull the datasets together, well, then it's arguably no longer real time because it took a couple of hours to consolidate the datasets.
There is then the approach of, let's call it, CDC landed in a data warehouse or a data lake, and you run, let's say, the data load or the transformation routines on top of those. Right? Extract, load, transform, or in some cases, extract, transform, load, these kinds of approaches. And then there are also the solutions like you mentioned. What Materialize does, for example, is essentially build a single view on top of a set of datasets and then kinda, like, provide that update automatically, but materialize the datasets so that access to the queries or access to the data is high performance when you need it. You're not always recomputing, reconsolidating that data. It's already there. I think all of these are different approaches to a very similar problem, and you may find that, indeed, based on the budget and, let's say, the data volume, the dataset you have, one option might work better for you than another option. But I think all of these are viable approaches within the context of
[00:21:11] Unknown:
near real time, real time analytics, and also combining streaming datasets as well. In terms of the way that Fivetran is approaching change data capture and the requirements for a customer to be able to actually incorporate that into their data platform and data analytics, how much of the overall process does Fivetran own and what are some of the capabilities that are necessary on the customer side to be able to handle those feeds that you're being able to send and manage the kind of integration flow for?
[00:21:44] Unknown:
When we set up and configure CDC for a customer's data source, we do have a set of requirements that the customer has to fulfill in order to enable us to do the change data capture. That set of requirements, we wanna keep that as minimal as possible, but, of course, we have to recognize that we have to end up with a working solution. So we provide multiple options with the Fivetran technology as it relates to CDC, and we ask the customer to self select what is the best option for their environment. In some cases, that means that all they need to do is create a database user with the adequate privileges, come to our portal, enter the credentials for the source and destination systems, and data can start syncing.
In other cases, and specifically as it relates to the higher volume use cases, we're gonna request the customer to install an agent in their environment. And that agent is going to essentially allow for this higher volume use case where we want low latency access to the data. We wanna take advantage of compression as any and all of the data moves across the wire. We're taking advantage of 5 to 10 x compression. We're taking advantage of strategies to optimize the use of network bandwidth even in high latency networks. So some of these strategies come into play, and there's essentially options in between those kind of extremes, where there's almost no configuration from a customer perspective, and somewhat more where the customer needs to do the installation on a server on their side before they come to our website, our portal, to enter some credentials and configure that pipeline.
[00:23:30] Unknown:
In terms of the way that Fivetran has been doing business, where it's largely been batch oriented from a source to a destination, what are some of the internal architectural changes that have been necessary as you have been integrating the HVR technology and the CDC capabilities to be able to have kind of a unified end user experience for interacting with the Fivetran platform, bridging across these batch and streaming modes?
[00:23:57] Unknown:
With the Fivetran platform, we've obviously seen pressure, as you hinted at, to get to lower latency. Now as it stands, we've brought the sync frequency down to once per minute, so customers can go in and configure their syncs to run once a minute. Now, of course, the assumption would be that the syncs are running within a minute. And I think if you look at some of the most popular data destinations, whether that is Snowflake or Databricks or BigQuery or Redshift, like, some of those, call it, analytical or data lake or lakehouse kind of technologies, those technologies aren't necessarily suitable for sub second kind of latency. Right? Like, that's where Kafka comes in as a technology and Kinesis pops up and those kinds of technologies.
Now if you consider that the delivery into the destination is gonna have to go through micro batches anyway, well, then a one-minute sync frequency is actually quite good. Right? Like, if you can get data end to end within a minute, that is, I think, quite remarkable for some of these data platforms. Now we do indeed, as you said, wanna drive this further down, all the way down to, like, okay, we're running everything continuously, and we're just delivering the data into the destination, into the data stream, as it arrives on the source. We're not quite there yet, but this is absolutely part of our journey as we integrate those technologies.
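A rough sketch of the micro-batch pattern described here (illustrative only, not Fivetran internals): changes stream in continuously but are committed to the analytical destination on a fixed cadence, here once per minute, with load_batch standing in for whatever bulk-load path the warehouse exposes.

```python
# Illustrative micro-batch loop: changes arrive continuously but are flushed
# to the destination once per minute. load_batch() is a placeholder for the
# warehouse's bulk-load path (e.g. stage files and issue a COPY/MERGE).
import time

SYNC_INTERVAL_SECONDS = 60

def load_batch(rows):
    print(f"loading {len(rows)} change rows into the destination")

def run(change_source):
    """Consume an iterator of change records and flush them in micro batches."""
    buffer, last_flush = [], time.monotonic()
    for change in change_source:
        buffer.append(change)
        if time.monotonic() - last_flush >= SYNC_INTERVAL_SECONDS:
            load_batch(buffer)
            buffer, last_flush = [], time.monotonic()
    if buffer:  # flush any trailing changes when the stream ends
        load_batch(buffer)
```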
[00:25:27] Unknown:
For change data capture, largely you've been talking about it in the context of databases, but in principle, there's nothing that constrains it to working in those types of systems specifically. And I'm wondering what you see as some of the potential for applying those patterns to other types of data sources. So things like SaaS APIs, being able to bring change data capture semantics into maybe data warehouses for reverse ETL, being able to bring change data capture into maybe event stream pipelines so that you can have a unified interface for processing those, and some of the, I guess, standards that would be useful to be able to start to build on top of for making change data capture a more maintainable approach to data integration.
[00:26:18] Unknown:
Yeah. Change data capture, as you said, is a very generic concept, and it does, of course, apply well beyond databases. We talk about it a lot in the context of databases, but that doesn't mean it doesn't apply to APIs. Now on the API side, however, we are dependent on what the API provides. And we will absolutely, and we already have, a number of connectors that, in fact, do change data capture based on the APIs that are available. But, again, it largely depends on what the API provider made available to the consumers of the endpoint.
I suppose we will use CDC whenever we can. And in some cases, we can do this mostly by relying on, like, a last modified date. This kind of information is sometimes available through APIs. We have to know how we can deal with deletes for those kinds of use cases. We do have, or we are also investing in, technologies where we're essentially doing a very quick comparison between source and destination data and computing the differences between those so that we can bring them back in sync by just selectively fetching the differences or selectively applying the differences as it relates to deletes.
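As a sketch of those two techniques for API sources, here is a hypothetical example: incremental pulls keyed on a last-modified timestamp, plus a primary-key comparison to infer deletes the API never reports. The endpoint, its updated_since parameter, and the field names are invented for illustration and don't correspond to any particular vendor's API.

```python
# Hypothetical sketch of CDC against a SaaS API: incremental pulls on a
# last-modified timestamp, plus a key comparison to infer deletes the API
# does not report. The endpoint and its 'updated_since' parameter are made up.
import requests

BASE_URL = "https://api.example.com/v1/tickets"

def fetch_updated(since_iso):
    """Pull only rows modified after the given timestamp (incremental sync)."""
    resp = requests.get(BASE_URL, params={"updated_since": since_iso})
    resp.raise_for_status()
    return resp.json()["results"]

def detect_deletes(source_ids, destination_ids):
    """Compare primary keys from a source key listing against the destination;
    anything present downstream but missing upstream was deleted."""
    return set(destination_ids) - set(source_ids)

# Example: row 3 exists in the destination but no longer at the source,
# so it should be removed or soft-deleted downstream.
print(detect_deletes(source_ids={1, 2, 4}, destination_ids={1, 2, 3, 4}))  # {3}
```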
So there are all these different approaches in the making, if you like, to allow for different use cases that may fit better for one scenario versus another. But, yeah, like the API use case. And I think in general, when you think about CDC, as the data volumes increase, CDC becomes more and more important. Right? Because a full load is no longer possible. Now when we started the conversation, I mentioned 15 terabytes of changes in a day on a single database. If you consider, let's say, the use of Salesforce for a typical organization, or you consider the use of Zendesk or ServiceNow or some of these SaaS based applications that we have connectors for, are you making, like, 15 terabytes worth of changes within a day with the use of your platform? And I think the answer generally is going to be no. Probably not.
Right? Well, in fact, 15 terabytes is, I think, quite an exception even in the on prem world. Again, if we look at the volumes, I think CDC is more important in the database world because there are just more changes than if we compare those to the SaaS world. But as APIs evolve, as we get more
[00:28:56] Unknown:
broad application of CDC paradigms and practices to data integration as a general practice, I'm wondering if you have seen any movement or efforts across the broader community to introduce some form of maybe standards definition or, you know, try to build consensus around what that can and should look like from a technical implementation and interface design perspective?
[00:29:25] Unknown:
Yeah. So unfortunately, there is no such standard. In fact, what I think is actually making our service quite valuable is arguably the lack of standards. Right? Because consider that there are no standards and there are even no rules out there. If you think about updates to APIs, those APIs do change. New attributes get added. In some cases, attributes get removed. And if you rely on extracting data through those APIs and you built your own solution, it works on day 1, and on day 5 it no longer works. And now you have to go figure out, like, okay, where did my attribute go? Or why did it no longer work? And you have no explanation for that. With our Fivetran managed service, we maintain these things. So we know what's going on on the platform.
We can see when things break. We analyze. We have relationships with the source provider. So in some cases, we get updates when things change. We don't always get the updates, but we do still recognize when things end up failing. And we proactively start addressing these changes. Now you also have to realize that, based on our estimation, there's more than 20,000 API platforms out there, and that number increases on a daily basis, right, by double digit, possibly triple digit numbers. Right? There's a lot of APIs out there, and it'd be wonderful if there were a standard, but there isn't one. We haven't seen one. Maybe one day, we can only hope, we will be so popular that we can propose a standard, and application or API providers will be willing to adopt our standard because they look at us as a market-leading technology and they see the benefits.
But, yeah, like, it's not there today.
[00:31:10] Unknown:
And so another interesting aspect of the adoption of change data capture is for the case where you already have an existing data integration workflow for a given database, but you want to be able to start moving to this more continuous feed of updates rather than having to have scheduled batch jobs. And I'm wondering what the process looks like for customers who maybe have already been using, for instance, a Postgres source and syncing that into their data warehouse. And then they say, okay, now I actually want to make this a continuous feed so that I get all of the updates as they happen, but I don't wanna break any of my existing use cases for that data that's present, and just what that process might look like for moving from the batch oriented to the change data capture feeds.
[00:31:58] Unknown:
In the batch oriented world, an important question to ask is always like, okay. To what extent do you apply transformations on the data as you take it out of the source and put it into the destination? When we think about the CDC technologies, by far the easiest, and from our perspective, the proposed approach is to essentially take a straight copy of source tables that you're interested in into the destination with minimal transformations. Some of the transformations that are quite popular are things like soft deletes. Instead of physically deleting a row when a row got deleted on the source, we mark it as deleted instead. And, of course, it's very easy to filter out those deletes if you don't want them. But if you do have some post processing happening on your system, then it's actually very convenient to know what rows get deleted because otherwise you'd have to run a relatively expensive operation in order to figure that out.
So we go from a batch oriented mode to an extract load where it's, for the most part, a regular copy of the table. If there are transformations, then you wanna do the transformations on top of the data. Now, transformations are part of the Fivetran platform. We integrate with DBT packages, and we actually provide a lot of DBT packages out to the open source world, and the use of the transformations is free of charge within our platform. The motivation there is specifically around the use of the SaaS applications. Right? Like, I mentioned that the APIs change and we have to change our connectors in order to be compatible with those APIs. That also means that then the destination definition evolves over time.
You wanna have that starting point, and you wanna use transformations on top of whatever that starting point is as it evolves, and we provide those transformations there. Now DBT is a widely adopted technology that's out there. Our recommendation would be to also utilize that in the context of, well, if we are doing database extract load and there are extensive transformations that used to be there in the batch world, well, then maybe incorporate those as part of the ELT process. Still continue to use your existing destination tables. We can align the change data capture with the state of the destination and essentially pick up and continue from there, and then we'll have some nice capabilities like lineage charts that can eventually show you, based on the data in the destination and through the transforms, where that data came from.
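The soft-delete convention mentioned a moment earlier can be sketched like this (the flag column name is a generic placeholder rather than any product's actual column): a delete captured on the source is applied downstream as an update that marks the row, so consumers can either filter flagged rows out or use them to drive incremental post-processing.

```python
# Sketch of the soft-delete convention: a source-side DELETE becomes an update
# that flags the row in the destination. The '_deleted' column name is a
# generic placeholder, not necessarily what any specific tool uses.
destination = {
    101: {"status": "open",   "_deleted": False},
    102: {"status": "closed", "_deleted": False},
}

def apply_change(change):
    if change["op"] == "delete":
        # Keep the row and just mark it, so downstream jobs can see what
        # disappeared without an expensive comparison against the source.
        destination[change["id"]]["_deleted"] = True
    else:  # insert or update carries the full row image
        destination[change["id"]] = {**change["row"], "_deleted": False}

apply_change({"op": "delete", "id": 102})

# Consumers that don't care about deletes simply filter the flagged rows out.
active_rows = {k: v for k, v in destination.items() if not v["_deleted"]}
print(active_rows)  # only row 101 remains visible
```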
[00:34:39] Unknown:
Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it's no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. That's where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark and can be deployed in AWS, Azure, or GCP.
Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $5,000 when you become a customer. In terms of the maybe behavioral changes in the ways that the organization interacts with the data once they start using these change data capture feeds and having a more continuous view of the information that their various systems are generating, what are some of the ways that that might influence their approach to their core data practices, like the architectural capabilities, the types of data products or assets that they're building, and also some of the ways that that bleeds into some of the operational characteristics of the business as far as how much they rely on the data and their overall perception of the reliability of the information that they're getting from those downstream data products?
[00:36:28] Unknown:
So I think this is where organizations, in some cases, have a clear path to where they wanna go, and they have plans with data as it becomes available closer to real time. And we've seen, over time, a lot of new data products that get delivered. Right? Like organizations who build a data product out of consolidated data sources because, let's say, they work with organizations who deploy their products in their plants. And now, by providing a consolidated view of how their systems operate across the different plants, they can provide a data product that has genuine value to their customers, and they can start selling that. And the closer to real time the dataset is, the more valuable the information is to the customer. We've also seen it, for example, in the package delivery space, where the organization started consolidating data feeds as they started tracking packages in real time across the warehouse. Where the bottlenecks were, those could get resolved, but they were then also able to better provide delivery guarantees to the customers who received the shipment in the end. So it's a win-win scenario where bottlenecks disappear, but at the same time, new data products get delivered. So I think that these are very exciting evolutions. You asked a relatively generic question. You asked about data architecture and where does that fit in. And I think I'll take a step back and say that, or at least what we've seen a lot, is that organizations, as they started embarking on some of these use cases, in many cases didn't necessarily understand the power of this near real time data and the possibilities that it would deliver to them. They naturally ended up looking at cloud technologies to solve these challenges. And the reason for that was because we had large volumes of data.
We knew that we needed a lot of processing capabilities in order to get the data processed. But at the same time, there was no appetite, there was no justification, for an incredibly large upfront investment. So you wanted to go with a pay as you go service where you knew you had scalability on demand, but you didn't have, like, a very large upfront payment in order to build a solution where you'd have to evaluate, like, what is the total return on investment, etcetera. So gravitating towards a cloud technology was a natural choice there. And I think that's where we saw an acceleration of adoption of cloud technology specifically around these use cases where large volumes, near real time, complex analytical processes were required. And now, over the course of the last few years, you've seen cloud providers deliver technologies, deliver services, that are actually very useful to make those kinds of use cases even more powerful. Right? Like, you go to AWS, you go to Google, or you go to Azure, and you can find readily built machine learning algorithms where all you need to do is figure out how to feed your data through the algorithm, and out come some machine learning results that you can start utilizing to improve your business, to improve your organization, where traditionally you would have had to kinda start building those models from scratch, and you'd have to figure out, like, oh, what are the relevant attributes that we should look at? There is, I think, a plethora of technologies and services that have been developed around these use cases. And I think it's still relatively early days for some of this. I think there is still a lot we can learn, a lot more services that are going to be developed here, but also, from an organizational perspective, lots of opportunities for organizations to take more and more advantage of some of these capabilities that are out there. In your work of
[00:40:25] Unknown:
building change data capture at HVR and now at Fivetran, and then integrating those capabilities into the Fivetran platform and working with your customers there, what are some of the most interesting or innovative or unexpected ways that you've seen the CDC capabilities used?
[00:40:41] Unknown:
Yeah. With the CDC capabilities, I think the operational use cases are where some very interesting scenarios have been developed over time. And there's an example that comes to mind in the manufacturing space where there is complex machinery that gets built and has all kinds of sensors in the technology. That technology gets shipped to clients. The maintenance, the ongoing performance of that machinery, is highly dependent on, let's say, the quality of the individual components. And we all understand that with a very expensive investment in complex machinery, the performance, the efficiency, and the uptime of this technology is incredibly important.
And to be able to maximize the uptime for customers, knowing that, well, okay, components wear and tear, they start degrading over time, and at some point they're going to fail. Figuring out, like, okay, what is the best way, or how can we integrate all these different data sources in a way that we will be able to do preventive maintenance of this complex machinery, where even maintenance itself is relatively complex. And in some cases, right, the machinery is locomotives, etcetera. And you have to bring together parts. You have to bring engineers. You have to allow for time. You may need, let's say, a garage space or something like that where you can perform the maintenance. All of these things, the right tools, have to come to the right place at the right time in order to do the service, when the ultimate goal is to maximize the uptime, the efficiency, let's say, the average speed, the value that the customer can get out of their machinery. I think some of those use cases, those are some of the most exciting ones that I've seen developed over time.
[00:42:44] Unknown:
In your own experience of working in this space and investing in and building CDC technologies, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:42:55] Unknown:
There's a couple of lessons there. There is certainly a lesson around the volume. Right? Like, where we started the conversation, to recognize and realize how much the volume has changed. But I think another lesson is that you can never overestimate how complex the data infrastructure at a customer ends up being. Or consider the database technologies. Of course, the transaction processing technologies have evolved for a long time, and they've been very mature for literally decades. There's a lot of capabilities there. And organizations do utilize some of the more complex, let's say, corner case capabilities in those technologies.
They rely on them. They want the data, the results, the changes replicated. How can you help them? It gets complicated over time. The challenge that we're in is how to make that very simple, and that is an ongoing challenge that keeps us busy.
[00:43:52] Unknown:
And so for people who are interested in being able to gain more consistent visibility into their data and be able to understand how things are evolving as they happen, what are the cases where CDC is the wrong choice and maybe they are better off just sticking with batches and maybe just ratcheting down the schedule?
[00:44:13] Unknown:
So we see customers use data sources of various kinds. And in some cases, for historical reasons, a lot of data feeds ended up in an existing data warehouse, and now, for whatever reason, there is the desire to move on from the data warehouse technology. However, the loads into the data warehouse come from many angles. And, dominantly, on a daily basis, let's say, there is a truncate and reload of that data happening. It's become sizable. It's big. And now they wanna start using that data as the initial source for the adoption of a new technology. Well, truncate and reload is just not the best use case for change data capture, certainly not the log based change data capture that we focused on during most of our conversation here. That's where, like, a comparison and applying just the differences becomes more relevant. Because, like, hey, if we do a batch reload and, let's say, we have a few years of historical data that we're dealing with, but still it's a truncate and reload on a daily basis, well, then 99 plus percent of the data actually doesn't change.
However, if you looked at it from a change data capture perspective, well, the table is emptied on a daily basis, and it's reloaded. Lots of changes. But in practice, there aren't that many changes. So that is absolutely an example where, certainly, the log based CDC is not the right approach.
[00:45:35] Unknown:
As you continue to build and invest in CDC at Fivetran, what are some of the things you have planned for the near to medium term?
[00:45:43] Unknown:
There's the continued focus on getting to the high volume use cases. The use cases that are absolutely critical to the customer's primary business processes, and unlocking the data and incorporating the data in their consolidated data feeds, their streaming analytics, their data warehouse workloads. That is ongoing, but at the same time, there's that desire to simplify the use cases. If you think about, like, okay, we have a particular database technology, and you go to the Fivetran website and you look at, well, okay, I want to unlock data out of this technology, we might present you with 2, 3, in some cases 4 or 5, different approaches to get the data out. You end up self selecting, like, okay, this is probably the best approach for me, and maybe you talk to a representative from Fivetran to help guide you through this.
However, in the ideal world, you shouldn't have to make that choice. We should be able to present the right choice to you. So maybe that is a flow of a few questions. And, of course, we can never reach your systems out of the cloud unless you provide the credentials to do so, so you may end up having to install a bit of software in your data center if that is what starts the handshake. But at the end of the day, if you wanna replicate 500 tables out of your ERP system and a couple of those are loaded via a truncate and reload, for example, well, then we shouldn't have to ask you to self select, like, okay, there's 2 different options here, and for those 2 tables you wanna use this option, and for the rest you wanna use this other option. We should be able to figure this out ourselves and make this so simple that, in the end, of course, we will always need to ask for credentials to the system, but beyond that, we absolutely minimize and simplify what that user experience looks like. And I think there is still room for improvement, and that's what we'll be looking at over the next few years, I imagine.
[00:47:48] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:48:04] Unknown:
I think it's related to visualization. Right? Like, from a visualization perspective, it's the ability to discover what would be the right visualization of a dataset. I think there are still missing technology components there that would essentially make the right choice of how we can visualize the results from a dataset. That's just not there. You have to know what you have to look for, and I think that's gonna be my submission to your question. Thank you. Alright. Well, thank you very much for taking the time today to join me and share the work that you've been doing on change data capture
[00:48:38] Unknown:
at Fivetran and definitely very interesting and constantly evolving space. I appreciate all the time and energy that you and your team are putting into that, and I hope you enjoy the rest of your day. Thank you. You too.
[00:48:56] Unknown:
Thank you for listening. Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast helps you go from idea to production with machine learning. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts at dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Sponsor Messages
Interview with Mark Van de Wiel: Introduction and Background
Notable Changes and Advancements at Fivetran
Change Data Capture and Industry Usage
Approaches to Change Data Capture
Real-Time Data and Event Streams
Analyzing Change Data Capture Feeds
Fivetran's Approach to Change Data Capture
Expanding Change Data Capture Beyond Databases
Transitioning from Batch to Change Data Capture
Impact of Change Data Capture on Organizational Practices
Interesting Use Cases of Change Data Capture
Future Plans for Change Data Capture at Fivetran
Biggest Gaps in Data Management Tooling
Closing Remarks