Summary
Data integration from source systems to their downstream destinations is the foundational step for any data product. The increasing expectation that information be instantly accessible drives the need for reliable change data capture. The team at Fivetran has recently introduced that functionality to power real-time data products. In this episode Mark Van de Wiel explains how they integrated CDC functionality into their existing product and discusses the nuances of different approaches to change data capture from various sources.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
- You wake up to a Slack message from your CEO, who’s upset because the company’s revenue dashboard is broken. You’re told to fix it before this morning’s board meeting, which is just minutes away. Enter Metaplane, the industry’s only self-serve data observability tool. In just a few clicks, you identify the issue’s root cause, conduct an impact analysis—and save the day. Data leaders at Imperfect Foods, Drift, and Vendr love Metaplane because it helps them catch, investigate, and fix data quality issues before their stakeholders ever notice they exist. Setup takes 30 minutes. You can literally get up and running with Metaplane by the end of this podcast. Sign up for a free-forever plan at dataengineeringpodcast.com/metaplane, or try out their most advanced features with a 14-day free trial. Mention the podcast to get a free "In Data We Trust World Tour" t-shirt.
- RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
- Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.
- Your host is Tobias Macey and today I’m interviewing Mark Van de Wiel about Fivetran’s implementation of change data capture and the state of streaming data integration in the modern data stack
Interview
- Introduction
- How did you get involved in the area of data management?
- What are some of the notable changes/advancements at Fivetran in the last 3 years?
- How has the scale and scope of usage for real-time data changed in that time?
- What are some of the differences in usage for real-time CDC data vs. event streams that have been the driving force for a large amount of real-time data?
- What are some of the architectural shifts that are necessary in an organization's data platform to take advantage of CDC data streams?
- What are some of the shifts in e.g. cloud data warehouses that have happened/are happening to allow for ingestion and timely processing of these data feeds?
- What are some of the different ways that CDC is implemented in different source systems?
- What are some of the ways that CDC principles might start to bleed into e.g. APIs/SaaS systems to allow for more unified processing patterns across data sources?
- What are some of the architectural/design changes that you have had to make to provide CDC for your customers at Fivetran?
- What are the most interesting, innovative, or unexpected ways that you have seen CDC used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on CDC at Fivetran?
- When is CDC the wrong choice?
- What do you have planned for the future of CDC at Fivetran?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends at Linode. With their new managed database service, you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes with automated backups, 40 gigabit connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show.
You wake up to a Slack message from your CEO who's upset because the company's revenue dashboard is broken. You're told to fix it before this morning's board meeting, which is just minutes away. Enter Metaplane, the industry's only self serve data observability tool. In just a few clicks, you identify the issue's root cause, conduct an impact analysis, and save the day. Data leaders at Imperfect Foods, Drift, and Vendr love Metaplane because it helps them catch, investigate, and fix data quality issues before their stakeholders ever notice they exist. Setup takes 30 minutes. You can literally get up and running with Metaplane by the end of this podcast. Sign up for a free forever plan at dataengineeringpodcast.com/metaplane or try out their most advanced features with a 14 day free trial. And if you mention the podcast, you get a free "In Data We Trust World Tour" t-shirt. Your host is Tobias Macey, and today I'm interviewing Mark Van de Wiel about Fivetran's implementation of change data capture and the state of streaming data integration in the modern data stack. So Mark, can you start by introducing yourself?
[00:01:47] Unknown:
Thank you. So, yes, Mark Van de Wiel. My job is field CTO at Fivetran. I joined Fivetran about a year ago through the acquisition of HVR. And throughout my career, I've worked in the data replication and BI and analytics space.
[00:02:02] Unknown:
And do you remember how you first got started working in data?
[00:02:05] Unknown:
Absolutely. I was in university. I came across a project in my graduation year for the Dutch government where we wanted to build an information system for decision makers. So that's how I ended up in the data space. I learned about Oracle, and everything from there is history.
[00:02:24] Unknown:
You've been at Fivetran for about a year. I actually interviewed one of the founders about 3 years ago about Fivetran, allowing for the fact that you haven't been there for that whole span. I'm wondering if you can maybe call out some of the notable changes or advancements that have happened at Fivetran within the past 3 years or even just in the past year since you've been there?
[00:02:44] Unknown:
Yeah. Absolutely. So with Fivetran, we provide data integration as a service. And one of the key aspects of Fivetran is connectors. How many applications, software as a service applications, and databases can we connect to so that organizations can synchronize and consolidate data in an analytical environment in the cloud? So we always continue to evolve and expand the number of connectors that we work with. Over the past 3 years, we've grown from, let's say, about 100 connectors up to, right now, probably over 200 connectors, and this continues to go on.
In addition to that, security has had an incredible focus. Organizations are naturally very concerned about access to the data. You hear about data breaches almost on a daily basis. And for our platform, as a managed service, we obviously have to cater for the level of security that our customers expect. Now lastly, I wanted to highlight, of course, the HVR acquisition. Fivetran has had database connectors. However, there was a recognition about a year ago, just over a year ago, that building industry-leading database connectors, specifically around change data capture, is incredibly difficult.
And so Fivetran went out and bought what was considered to be one of the market-leading technologies with the HVR acquisition to make that part of the managed service. And that's what we've been focused on this past year: to integrate those technologies and make high volume data replication part of the Fivetran managed service.
[00:04:19] Unknown:
So in terms of the usage of change data capture and the utility and requirement for these near real time data feeds, what do you see as some of the scale and scope of usage across the industry both now and maybe compared to 3 to 5 years ago?
[00:04:39] Unknown:
That's a great question, and it kinda shows the boundaries that we're pushing with this kind of technology. Very recently, I've been working with a customer in the financial services industry who have a single database that generates up to 15 terabytes of changes in a day. So if you think about the sheer volume of changed data that goes on in this database, it's incredible. I think it shows how we've been progressing if I think back not just 3 to 5 years, but to some of the early days of my career when there were surveys out there about the largest data warehouse databases in the marketplace. Right? Like, Yahoo came out and Walmart came out. They had systems that were sometimes in the tens of terabytes in total volume, including the indexes, everything that was residing in the database.
And now, if we consider, like, 15 terabytes of change volume in a day, that is incredible. And that is then one out of many systems that organizations try to consolidate in their analytical environment. And you can imagine the volume dimension, how that has evolved and how that has become very dominant in our space.
[00:05:51] Unknown:
And as far as the approaches to change data capture, there are a number of different products out there. One of the most popular ones in the open source space is Debezium. And I'm wondering if you can talk to some of the different ways that change data capture manifests, where different database engines maybe have it built in natively versus having to bolt it on as an afterthought, and how that influences the kind of maturity and robustness of the capabilities that it provides.
[00:06:23] Unknown:
We could spend all of this recording on that topic, I suppose. But just to keep it at a relatively high level, you're absolutely right. Some of the technologies provide native options for log based change data capture. And let me actually take one more step back and go back to the concept of log based replication. Fundamentally, almost every transaction processing database that's out there will use a transaction log to record the changes. It becomes the ledger of what was going on in the database, and it is foundationally the basis for database recovery. The system crashes or the software crashes, the system restarts, the database restarts, and it goes back to the most recently committed state of the database by replaying changes from the transaction log on top of the data that was on disk. Now log based change data capture is widely considered to be the least intrusive approach to then get those changes out of the database.
Now, indeed, some databases have native capabilities to retrieve those changes out of the database, like the Postgres write-ahead log reader. Oracle has the LogMiner capability. SQL Server provides CDC tables, and other databases have different options. Those options can absolutely be used within the context of the database. However, whatever the limitations might be that are associated with that technology, possibly think about overhead, or think about the implementation, think about some of the data types, maybe limitations that are imposed upon you by the provider of the technology. Those are the ones that you have to live with.
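As a concrete illustration of the native options described here, the following is a minimal sketch of Postgres logical decoding through a replication slot, using the built-in test_decoding output plugin. The connection string and slot name are illustrative placeholders, and the server would need wal_level=logical plus a role with replication privileges.

```python
# Minimal sketch of Postgres's native log-based capture path (the write-ahead
# log reader mentioned above), using the built-in test_decoding output plugin.
# Connection string and slot name are illustrative placeholders.
import psycopg2

conn = psycopg2.connect("dbname=appdb user=replicator")  # role needs REPLICATION
conn.autocommit = True
cur = conn.cursor()

# Create a logical replication slot once; Postgres retains WAL from this point on.
cur.execute(
    "SELECT * FROM pg_create_logical_replication_slot('cdc_demo', 'test_decoding')"
)

# ... normal INSERT/UPDATE/DELETE traffic runs against the database here ...

# Read and consume the decoded changes that accumulated in the slot.
cur.execute(
    "SELECT lsn, xid, data FROM pg_logical_slot_get_changes('cdc_demo', NULL, NULL)"
)
for lsn, xid, data in cur.fetchall():
    # 'data' is a textual change description, e.g.
    # table public.orders: UPDATE: id[integer]:42 status[text]:'shipped'
    print(lsn, xid, data)

# Drop the slot when finished so the server stops retaining WAL for it.
cur.execute("SELECT pg_drop_replication_slot('cdc_demo')")
```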
Also, consider that running inside of a database generally comes with a certain amount of additional overhead. There are security validations. There is parsing. There are all of these, let's say, routines that are called with every operation that you submit down to a database. And this is all for a good thing. Right? Like, it's secure. It's recoverable. It's all of these attributes that we really love about these database technologies. However, there are systems, and certainly when I think back to the history of HVR, we've come across absolutely mission critical database systems where, essentially, a slowdown of the database technology would have a direct revenue impact to the organization. And there was a need for the absolute least intrusive way to perform change data capture. And with HVR and now Fivetran, we embarked upon this concept of so called binary log reading, where we essentially submit some changes to a database, access the database's transaction log files directly, and go figure out what happened in those transaction log files so that we can, in the end, parse out these changes for heterogeneous replication.
We run outside the database. There is no additional overhead. We came up with an architecture that was distributed, where we do some of the heavy lifting very close to the source, but then any more extensive processing happens downstream. There's compression. There's encryption in the mix. And with those, we've proven that we could acquire customers who had those mission critical, let's say, central core database technologies and could successfully capture changes out of these with absolutely minimal impact to database processing. And that is what customers really wanted in the end, and so the binary log reading has absolutely been a great success for us. Now if you compare that to Debezium that you had mentioned, Debezium largely falls back on the technologies, the capabilities that the database vendors provide from an out of the box point of view. Right? Like, it covers many use cases, and it's absolutely great technology. But there are just cases where some of the, let's say, the biggest databases, the busiest systems that are out there need something that goes slightly beyond that. And that's where the binary log readers come in. Another interesting element of this
[00:10:32] Unknown:
problem space and conversation is the different ways that real time data manifests, where right now we're talking about change data capture coming from the database. A lot of the conversation around real time data has, for the most part, been driven by event streams. So, like, click stream analytics or application generated events, sensor driven events, and being able to process those as they are emitted. And I'm wondering what you see as the differences in terms of capabilities and use cases for that data between these kind of real time event streams versus change data capture events. You know, some of the ways that we're able to lean on those technologies that were developed in those earlier stages of real time data to be able to
[00:11:19] Unknown:
facilitate things like change data capture and where we have to build additional capabilities above and beyond that. When you think about log streaming and you talk about some of the sources you've mentioned, right, whether it's clickstream or IoT data, a lot of these data streams are arguably relatively straightforward. Right? Like, with the clickstream, it's like, okay. I'm browsing the Internet, and here is where I go. It's always incremental. It's not like, okay, I go back, and the click that I did 10 minutes ago, I decided not to do that click but do a different click. Like, those kinds of operations don't happen. Right? And if we generate, let's say, sensor data based on some technology that resides inside of our car, for example, and we track, like, what is the thickness of our brake pads.
Like, every data point is a new data point. It's not like we're going back and updating historical data points. Now if we contrast that to CDC and relational database technology, of course, when we look at how we operate relational databases, there is generally dominantly inserts, but there's also updates. And in some cases, systems genuinely process deletes, and we wanna deal with those. And then for some of the use cases, you wanna look at it as an incremental stream of changes. But for many use cases, you also want to get the current state of the data. And I think where some of the most powerful use cases or some of the most interesting use cases come to bear is where you combine some of those sources, where maybe it's some of the reference data or maybe it's some of the core processing that happens in the ERP that is relevant in the context of what's going on with our IoT data or with our clickstream data or with our social media posts. And we're combining those datasets into a use case that shows an integrated, consolidated overview of a set of systems with real time aspects and attributes to it that in the end make the organization more competitive or more efficient or can save costs, like, whatever the ultimate business outcome is for the organization.
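To make that contrast concrete, here is a small, purely illustrative sketch (not Fivetran code): an append-only clickstream is enriched with reference data whose current state is maintained from a CDC feed that, unlike the clickstream, carries inserts, updates, and deletes. All table, field, and event names are hypothetical.

```python
# Illustrative sketch: an append-only clickstream enriched with reference data
# whose current state is maintained from a CDC feed. Unlike clickstream events,
# the CDC feed carries inserts, updates, and deletes against existing rows.
reference = {}  # current state of a hypothetical 'customers' table, keyed by primary key

def apply_cdc(op):
    """Apply one change event to the current-state view of the reference table."""
    if op["kind"] in ("insert", "update"):
        reference[op["key"]] = op["row"]
    elif op["kind"] == "delete":
        reference.pop(op["key"], None)

def enrich(click):
    """Join an immutable clickstream event with the latest reference row."""
    customer = reference.get(click["customer_id"], {})
    return {**click, "segment": customer.get("segment", "unknown")}

# The CDC feed: note the update and the delete, which never occur in the clickstream.
for op in [
    {"kind": "insert", "key": 1, "row": {"segment": "trial"}},
    {"kind": "update", "key": 1, "row": {"segment": "enterprise"}},
    {"kind": "delete", "key": 2},
]:
    apply_cdc(op)

print(enrich({"customer_id": 1, "page": "/pricing"}))
# {'customer_id': 1, 'page': '/pricing', 'segment': 'enterprise'}
```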
[00:13:26] Unknown:
For organizations who maybe have already invested in these real time streaming capabilities for event streams, for IoT, what are some of the new systems or new architectural patterns that they need to adopt in order to be able to also factor in change data capture feeds or be able to effectively process and analyze those data sources?
[00:13:50] Unknown:
Yeah. So if you consider the stream analytics or the technologies that you would use for streaming analytics, right, whether it's Kafka or it's something like a Google Pub/Sub or a Kinesis or an equivalent Azure technology, none of these technologies really provide the change data capture itself. Like, sure, I can point my data feed, my IoT feed, to the data stream, and it'll absorb it, and I can run my analytics against it as changes arrive, etcetera. Now if we wanna combine that data with a dataset that comes out of a more traditional database, it's like, well, okay, so what's the change data capture? What is delivering those changes into the data stream so we can incorporate those changes as part of the use case? So when we look at the CDC technologies and the role they play in the context of existing event stream use cases, how do we incorporate those datasets? Like, what is the change data capture mechanism? And that's where our technology can come in and be that feed that takes the data out of the SaaS application in near real time or takes it out of a database with, let's say, at most a couple of seconds of latency. Like, at the end of the day, we talk about real time, and I suppose we didn't define what real time really is. But in reality, it's always near real time. We're not directly querying the source. We're not delaying the transactions.
Instead, we're capturing the change once the commit hits the system, and, generally, within a couple of seconds of when the commit hits the system, it can get delivered into the data stream.
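A hypothetical sketch of that handoff, assuming the kafka-python client, a local broker, and made-up topic and record shapes: captured change records are published into a Kafka topic so existing streaming analytics can join them with clickstream or IoT topics.

```python
# Hypothetical sketch of delivering captured changes into an existing event
# stream (Kafka here). Assumes the kafka-python client; the broker address,
# topic name, and record shape are placeholders.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def publish_change(change):
    # Key by primary key so all changes for a given row land on the same
    # partition and are consumed in commit order for that row.
    producer.send("cdc.orders", key=str(change["id"]), value=change)

publish_change({"id": 42, "op": "update", "status": "shipped"})
producer.flush()
```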
[00:15:33] Unknown:
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state of the art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up for free or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder. In terms of the applicability and adoption of change data capture, one of the things that's necessary for making it viable and desirable is the ability to actually analyze that feed as it's coming in. And, you know, up till now, there have been a lot of additional capabilities that are necessary to make that feasible. And I'm wondering what are some of the shifts and evolutions in the ecosystem that you've seen that make this a more tractable problem for people who don't necessarily have, you know, millions of dollars of financing to be able to spend on all of the engineering time and infrastructure investment
[00:16:47] Unknown:
to be able to process these feeds as they come in. I think we see the technologies evolve on the streaming side. Right? So if you think about Kafka as an example, it's been already a few years that KSQL became available, right, with essentially a structured query language capability to be able to query streaming data. And that was maybe a first out of multiple where there is more of the ability to automate the analytics of what's happening to the data stream. And, likewise, I think if you consider, like, a Databricks as a destination, there's now Delta Live Tables, where the transformations, the analytics, the analysis, whatever it is that needs to happen to the data once it arrives, some of that happens automatically. So I think there is absolutely that evolution of the technologies that enable those use cases, and we see that happening more and more. Now all of that said, I think we still see a lot of use of our technology in somewhat relatively, let's say, traditional use cases as well, where it's like, okay, we need reporting, and we've been doing batch reporting for all this time. And now we start with consolidated reporting, and we don't wanna do too much of an upfront investment. We start by using cloud technologies, pay as you go, and we're feeding those different data sources. And, oh, wow, we can do closer to real time. Let's see what kind of use cases come out of that. We still see a lot of that as well, in addition to some of the more, call it, leading edge streaming use cases.
[00:18:23] Unknown:
Another interesting element of that is being able to actually update views or run queries as the data changes, where systems such as Materialize and some of the other kind of real time databases have been built to make that a possibility. And cloud data warehouses have generally been built up as these scalable systems that allow you to run these queries on a periodic basis and be able to process massive amounts of data without having to wait, you know, minutes or hours for it to complete. Although there is the question of being able to do that affordably if you want to keep your data up to date. So I'm wondering what are some of those kind of capabilities in the access path that make this a tractable problem and viable for companies to be able to actually want to pull in those data sources and query them on a continual basis.
[00:19:15] Unknown:
Like you mentioned, right, there's a lot of technologies in this space that look at the problem from a slightly different angle. And I'll take it all the way back to, call it, federated access to systems. Right? Like, you consolidate your view of the world by just always connecting to the source applications. Now one of the benefits, of course, is you're looking at up to date data because you're hitting the source directly. However, if you try to pull large volumes of data together and you wanna join these datasets and combine these datasets on the fly, well, that becomes a challenging problem. And indeed, scalability costs come into play as well as, arguably, the real time aspect. You could say, well, yeah, we're accessing the source, so of course the data is real time. If it then takes a couple of hours to pull the datasets together, well, then it's arguably no longer real time because it took a couple of hours to consolidate the datasets.
There is then the approach of, let's call it, CDC landed in a data warehouse or a data lake, and you run, let's say, the data load or the transformation routines on top of those. Right? Extract, load, transform, or in some cases, extract, transform, load, these kinds of approaches. And then there are also the solutions like you mentioned. What Materialize does, for example, is essentially build a single view on top of a set of datasets and then kinda, like, provide that update automatically, but materialize the datasets so that access to the queries or access to the data is high performance when you need it. You're not always recomputing, reconsolidating that data. It's already there. I think all of these are different approaches to a very similar problem, and you may find that, indeed, based on the budget and, let's say, the data volume, the dataset you have, one option might work better for you than another option. But I think all of these are viable approaches within the context of
[00:21:11] Unknown:
near real time, real time analytics, and also combining streaming datasets as well. In terms of the way that Fivetran is approaching change data capture and the requirements for a customer to be able to actually incorporate that into their data platform and data analytics, how much of the overall process does Fivetran own and what are some of the capabilities that are necessary on the customer side to be able to handle those feeds that you're being able to send and manage the kind of integration flow for?
[00:21:44] Unknown:
When we set up and configure CDC for a customer's data source, we do have a set of requirements that the customer has to fulfill in order to enable us to do the change data capture. That set of requirements, we wanna keep that as minimal as possible, but, of course, we have to recognize that we have to end up with a working solution. So we provide multiple options with the Fivetran technology as it relates to CDC, and we ask the customer to self select what is the best option for their environment. In some cases, that means that all they need to do is create a database user with the adequate privileges, come to our portal, enter the credentials for the source and destination systems, and data can start syncing.
In other cases, and specifically as it relates to the higher volume use cases, we're gonna request the customer to install an agent in their environment. And that agent is going to essentially allow for this higher volume use case where we want low latency access to the data. We wanna take advantage of compression as any and all of the data moves across the wire. We're taking advantage of 5 to 10 x compression. We're taking advantage of strategies to optimize the use of network bandwidth even in high latency networks. So some of these strategies come into play, and there's essentially options in between those kind of extremes, where there's almost no configuration from a customer perspective, and somewhat more where the customer needs to do the installation on a server on their side before they come to our website, our portal, to enter some credentials and configure that pipeline.
[00:23:30] Unknown:
In terms of the way that Fivetran has been doing business, where it's largely been batch oriented from a source to a destination, what are some of the internal architectural changes that have been necessary as you have been integrating the HVR technology and the CDC capabilities to be able to have kind of a unified end user experience for interacting with the Fivetran platform, bridging across these batch and streaming modes?
[00:23:57] Unknown:
With the Fivetran platform, we've obviously seen pressure, as you hinted at, to get to lower latency. Now as it stands, we've brought the sync frequency down to once per minute, so customers can go in and configure their syncs to run once a minute. Now, of course, the assumption would be that the syncs are running within a minute. And I think if you look at some of the most popular data destinations, whether that is Snowflake or Databricks or BigQuery or Redshift, like, some of those, call it, analytical or data lake or lakehouse kind of technologies, those technologies aren't necessarily suitable for sub second kind of latency. Right? Like, that's where Kafka comes in as a technology and Kinesis pops up and those kinds of technologies.
Now if you consider that the delivery into the destination is gonna have to go through micro batches anyway, well, then a one-minute sync frequency is actually quite good. Right? Like, if you can get data end to end within a minute, that is, I think, quite remarkable for some of these data platforms. Now we do indeed, as you said, wanna drive this further down, all the way down to, like, okay, we're running everything continuously, and we're just delivering the data into the destination, into the data stream, as it arrives on the source. We're not quite there yet, but this is absolutely part of our journey as we integrate those technologies.
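A rough sketch of the micro-batch pattern described here (illustrative only, not Fivetran internals): changes stream in continuously but are committed to the analytical destination on a fixed cadence, here once per minute, with load_batch standing in for whatever bulk-load path the warehouse exposes.

```python
# Illustrative micro-batch loop: changes arrive continuously but are flushed
# to the destination once per minute. load_batch() is a placeholder for the
# warehouse's bulk-load path (e.g. stage files and issue a COPY/MERGE).
import time

SYNC_INTERVAL_SECONDS = 60

def load_batch(rows):
    print(f"loading {len(rows)} change rows into the destination")

def run(change_source):
    """Consume an iterator of change records and flush them in micro batches."""
    buffer, last_flush = [], time.monotonic()
    for change in change_source:
        buffer.append(change)
        if time.monotonic() - last_flush >= SYNC_INTERVAL_SECONDS:
            load_batch(buffer)
            buffer, last_flush = [], time.monotonic()
    if buffer:  # flush any trailing changes when the stream ends
        load_batch(buffer)
```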
[00:25:27] Unknown:
For change data capture, largely you've been talking about it in the context of databases, but in principle, there's nothing that constrains it to working in those types of systems specifically. And I'm wondering what you see as some of the potential for applying those patterns to other types of data sources. So things like SaaS APIs, being able to bring change data capture semantics into maybe data warehouses for reverse ETL, being able to bring change data capture into maybe event stream pipelines so that you can have a unified interface for processing those, and some of the, I guess, standards that would be useful to be able to start to build on top of for making change data capture a more maintainable approach to data integration.
[00:26:18] Unknown:
Yeah. Change data capture, as you said, is a very generic concept, and it does, of course, apply well beyond databases. We talk about it a lot in the context of databases, but that doesn't mean it doesn't apply to APIs. Now on the API side, however, we are dependent on what the API provides. And we will absolutely, and we already have, a number of connectors that, in fact, do change data capture based on the APIs that are available. But, again, it largely depends on what the API provider made available to the consumers of the endpoint.
I suppose we will use CDC whenever we can. And in some cases, we can do this mostly by relying on, like, a last modified date. This kind of information is sometimes available through APIs. We have to know how we can deal with deletes for those kinds of use cases. We do have, or we are also investing in, technologies where we're essentially doing a very quick comparison between source and destination data and computing the differences between those so that we can bring them back in sync by just selectively fetching the differences or selectively applying the differences as it relates to deletes.
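As a sketch of those two techniques for API sources, here is a hypothetical example: incremental pulls keyed on a last-modified timestamp, plus a primary-key comparison to infer deletes the API never reports. The endpoint, its updated_since parameter, and the field names are invented for illustration and don't correspond to any particular vendor's API.

```python
# Hypothetical sketch of CDC against a SaaS API: incremental pulls on a
# last-modified timestamp, plus a key comparison to infer deletes the API
# does not report. The endpoint and its 'updated_since' parameter are made up.
import requests

BASE_URL = "https://api.example.com/v1/tickets"

def fetch_updated(since_iso):
    """Pull only rows modified after the given timestamp (incremental sync)."""
    resp = requests.get(BASE_URL, params={"updated_since": since_iso})
    resp.raise_for_status()
    return resp.json()["results"]

def detect_deletes(source_ids, destination_ids):
    """Compare primary keys from a source key listing against the destination;
    anything present downstream but missing upstream was deleted."""
    return set(destination_ids) - set(source_ids)

# Example: row 3 exists in the destination but no longer at the source,
# so it should be removed or soft-deleted downstream.
print(detect_deletes(source_ids={1, 2, 4}, destination_ids={1, 2, 3, 4}))  # {3}
```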
So there are all these different approaches in the making, if you like, to allow for different use cases that may fit better for one scenario versus another. But, yeah, like the API use case. And I think in general, when you think about CDC, as the data volumes increase, CDC becomes more and more important. Right? Because a full load is no longer possible. Now when we started the conversation, I mentioned 15 terabytes of changes in a day on a single database. If you consider, let's say, the use of Salesforce for a typical organization, or you consider the use of Zendesk or ServiceNow or some of these SaaS based applications that we have connectors for, are you making, like, 15 terabytes worth of changes within a day with the use of your platform? And I think the answer generally is going to be no. Probably not.
Right? Well, in fact, 15 terabytes is, I think, quite an exception even in the on prem world. Again, if we look at the volumes, I think CDC is more important in the database world because there are just more changes than if we compare those to the SaaS world. But as APIs evolve, as we get more
[00:28:56] Unknown:
broad application of CDC paradigms and practices to data integration as a general practice, I'm wondering if you have seen any movement or efforts across the broader community to introduce some form of maybe standards definition or, you know, try to build consensus around what that can and should look like from a technical implementation and interface design perspective?
[00:29:25] Unknown:
Yeah. So unfortunately, there is no such standard. In fact, what I think is actually making our service quite valuable is arguably the lack of standards. Right? Because consider that there are no standards and there are even no rules out there. If you think about updates to APIs, those APIs do change. New attributes get added. In some cases, attributes get removed. And if you rely on extracting data through those APIs and you built your own solution, it works on day 1, and on day 5 it no longer works. And now you have to go figure out, like, okay, where did my attribute go? Or why did it no longer work? And you have no explanation for that. With our Fivetran managed service, we maintain these things. So we know what's going on on the platform.
We can see when things break. We analyze. We have relationships with the source provider. So in some cases, we get updates when things change. We don't always get the updates, but we do still recognize when things end up failing. And we proactively start addressing these changes. Now you also have to realize that, based on our estimation, there's more than 20,000 API platforms out there, and that number increases on a daily basis, right, by double digit, possibly triple digit numbers. Right? There's a lot of APIs out there, and it'd be wonderful if there were a standard, but there isn't one. We haven't seen one. Maybe one day, we can only hope, we will be so popular that we can propose a standard, and application or API providers will be willing to adopt our standard because they look at us as a market-leading technology and they see the benefits.
But, yeah, like, it's not there today.
[00:31:10] Unknown:
And so another interesting aspect of the adoption of change data capture is for the case where you already have an existing data integration workflow for a given database, but you want to be able to start moving to this more continuous feed of updates rather than having to have scheduled batch jobs. And I'm wondering what the process looks like for customers who maybe have already been using, for instance, a Postgres source and syncing that into their data warehouse. And then they say, okay, now I actually want to make this a continuous feed so that I get all of the updates as they happen, but I don't wanna break any of my existing use cases for that data that's present, and just what that process might look like for moving from the batch oriented to the change data capture feeds.
[00:31:58] Unknown:
In the batch oriented world, an important question to ask is always like, okay. To what extent do you apply transformations on the data as you take it out of the source and put it into the destination? When we think about the CDC technologies, by far the easiest, and from our perspective, the proposed approach is to essentially take a straight copy of source tables that you're interested in into the destination with minimal transformations. Some of the transformations that are quite popular are things like soft deletes. Instead of physically deleting a row when a row got deleted on the source, we mark it as deleted instead. And, of course, it's very easy to filter out those deletes if you don't want them. But if you do have some post processing happening on your system, then it's actually very convenient to know what rows get deleted because otherwise you'd have to run a relatively expensive operation in order to figure that out.
So we go from a batch oriented mode to an extract load where it's, for the most part, a regular copy of the table. If there are transformations, then you wanna do the transformations on top of the data. Now, transformations are part of the Fivetran platform. We integrate with DBT packages, and we actually provide a lot of DBT packages out to the open source world, and the use of the transformations is free of charge within our platform. The motivation there is specifically around the use of the SaaS applications. Right? Like, I mentioned that the APIs change and we have to change our connectors in order to be compatible with those APIs. That also means that then the destination definition evolves over time.
You wanna have that starting point, and you wanna use transformations on top of whatever that starting point is as it evolves, and we provide those transformations there. Now DBT is a widely adopted technology that's out there. Our recommendation would be to also utilize that in the context of, well, if we are doing database extract load and there are extensive transformations that used to be there in the batch world, well, then maybe incorporate those as part of the ELT process. Still continue to use your existing destination tables. We can align the change data capture with the state of the destination and essentially pick up and continue from there, and then we'll have some nice capabilities like lineage charts that can eventually show you, based on the data in the destination and through the transforms, where that data came from.
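The soft-delete convention mentioned a moment earlier can be sketched like this (the flag column name is a generic placeholder rather than any product's actual column): a delete captured on the source is applied downstream as an update that marks the row, so consumers can either filter flagged rows out or use them to drive incremental post-processing.

```python
# Sketch of the soft-delete convention: a source-side DELETE becomes an update
# that flags the row in the destination. The '_deleted' column name is a
# generic placeholder, not necessarily what any specific tool uses.
destination = {
    101: {"status": "open",   "_deleted": False},
    102: {"status": "closed", "_deleted": False},
}

def apply_change(change):
    if change["op"] == "delete":
        # Keep the row and just mark it, so downstream jobs can see what
        # disappeared without an expensive comparison against the source.
        destination[change["id"]]["_deleted"] = True
    else:  # insert or update carries the full row image
        destination[change["id"]] = {**change["row"], "_deleted": False}

apply_change({"op": "delete", "id": 102})

# Consumers that don't care about deletes simply filter the flagged rows out.
active_rows = {k: v for k, v in destination.items() if not v["_deleted"]}
print(active_rows)  # only row 101 remains visible
```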
[00:34:39] Unknown:
Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it's no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. That's where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark and can be deployed in AWS, Azure, or GCP.
Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $5,000 when you become a customer. In terms of the maybe behavioral changes in the ways that the organization interacts with the data once they start using these change data capture feeds and having a more continuous view of the information that their various systems are generating, what are some of the ways that that might influence their approach to their core data practices, like the architectural capabilities, the types of data products or assets that they're building, and also some of the ways that that bleeds into some of the operational characteristics of the business as far as how much they rely on the data and their overall perception of the reliability of the information that they're getting from those downstream data products?
[00:36:28] Unknown:
So I think this is where organizations, in some cases, have a clear path to where they wanna go, and they have plans with data as it becomes available closer to real time. And we've seen, over time, a lot of new data products that get delivered. Right? Like organizations who build a data product out of consolidated data sources because, let's say, they work with organizations who deploy their products in their plants. And now, by providing a consolidated view of how their systems operate across the different plants, they can provide a data product that has genuine value to their customers, and they can start selling that. And the closer to real time the dataset is, the more valuable the information is to the customer. We've also seen it, for example, in the package delivery space, where the organization started consolidating data feeds as they started tracking packages in real time across the warehouse. Where the bottlenecks were, those could get resolved, but they were then also able to better provide delivery guarantees to the customers who received the shipment in the end. So it's a win-win scenario where bottlenecks disappear, but at the same time, new data products get delivered. So I think that these are very exciting evolutions. You asked a relatively generic question. You asked about data architecture and where does that fit in. And I think I'll take a step back and say that, or at least what we've seen a lot, is that organizations, as they started embarking on some of these use cases, in many cases didn't necessarily understand the power of this near real time data and the possibilities that it would deliver to them. They naturally ended up looking at cloud technologies to solve these challenges. And the reason for that was because we had large volumes of data.
We knew that we needed a lot of processing capabilities in order to get the data processed. But at the same time, there was no appetite, there was no justification, for an incredibly large upfront investment. So you wanted to go with a pay as you go service where you knew you had scalability on demand, but you didn't have, like, a very large upfront payment in order to build a solution where you'd have to evaluate, like, what is the total return on investment, etcetera. So gravitating towards a cloud technology was a natural choice there. And I think that's where we saw an acceleration of adoption of cloud technology specifically around these use cases where large volumes, near real time, complex analytical processes were required. And now, over the course of the last few years, you've seen cloud providers deliver technologies, deliver services, that are actually very useful to make those kinds of use cases even more powerful. Right? Like, you go to AWS, you go to Google, or you go to Azure, and you can find readily built machine learning algorithms where all you need to do is figure out how to feed your data through the algorithm, and out come some machine learning results that you can start utilizing to improve your business, to improve your organization, where traditionally you would have had to kinda start building those models from scratch, and you'd have to figure out, like, oh, what are the relevant attributes that we should look at? There is, I think, a plethora of technologies and services that have been developed around these use cases. And I think it's still relatively early days for some of this. I think there is still a lot we can learn, a lot more services that are going to be developed here, but also, from an organizational perspective, lots of opportunities for organizations to take more and more advantage of some of these capabilities that are out there. In your work of
[00:40:25] Unknown:
building change data capture at HVR and now at Fivetran, and then integrating those capabilities into the Fivetran platform and working with your customers there, what are some of the most interesting or innovative or unexpected ways that you've seen the CDC capabilities used?
[00:40:41] Unknown:
Yeah. With the CDC capabilities, I think the operational use cases are where some very interesting scenarios have been developed over time. And there's an example that comes to mind in the manufacturing space where there is complex machinery that gets built and has all kinds of sensors in the technology. That technology gets shipped to clients. The maintenance, the ongoing performance of that machinery, is highly dependent on, let's say, the quality of the individual components. And we all understand that with a very expensive investment in complex machinery, the performance, the efficiency, and the uptime of this technology is incredibly important.
And to be able to maximize the uptime for customers, knowing that, well, okay, components wear and tear, they start degrading over time, and at some point they're going to fail. Figuring out, like, okay, what is the best way, or how can we integrate all these different data sources in a way that we will be able to do preventive maintenance of this complex machinery, where even maintenance itself is relatively complex. And in some cases, right, the machinery is locomotives, etcetera. And you have to bring together parts. You have to bring engineers. You have to allow for time. You may need, let's say, a garage space or something like that where you can perform the maintenance. All of these things, the right tools, have to come to the right place at the right time in order to do the service, when the ultimate goal is to maximize the uptime, the efficiency, let's say, the average speed, the value that the customer can get out of their machinery. I think some of those use cases, those are some of the most exciting ones that I've seen developed over time.
[00:42:44] Unknown:
In your own experience of working in this space and investing in and building CDC technologies, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:42:55] Unknown:
There's a couple of lessons there. There is certainly a lesson around the volume. Right? Like, where we started the conversation, to recognize and realize how much the volume has changed. But I think another lesson is that you can never overestimate how complex the data infrastructure at a customer ends up being. Or consider the database technologies. Of course, the transaction processing technologies have evolved for a long time, and they've been very mature for literally decades. There's a lot of capabilities there. And organizations do utilize some of the more complex, let's say, corner case capabilities in those technologies.
They rely on them. They want the data, the results, the changes replicated. How can you help them? It gets complicated over time. The challenge that we're in is how to make that very simple, and that is an ongoing challenge that keeps us busy.
[00:43:52] Unknown:
And so for people who are interested in being able to gain more consistent visibility into their data and be able to understand how things are evolving as they happen, what are the cases where CDC is the wrong choice and maybe they are better off just sticking with batches and maybe just ratcheting down the schedule?
[00:44:13] Unknown:
So we see customers use data sources of various kinds. And in some cases, for historical reasons, a lot of data feeds ended up in an existing data warehouse, and now, for whatever reason, there is the desire to move on from the data warehouse technology. However, the loads into the data warehouse come from many angles. And, dominantly, on a daily basis, let's say, there is a truncate and reload of that data happening. It's become sizable. It's big. And now they wanna start using that data as the initial source for the adoption of a new technology. Well, truncate and reload is just not the best use case for change data capture, certainly not the log based change data capture that we focused on during most of our conversation here. That's where, like, a comparison and applying just the differences becomes more relevant. Because, like, hey, if we do a batch reload and, let's say, we have a few years of historical data that we're dealing with, but still it's a truncate and reload on a daily basis, well, then 99 plus percent of the data actually doesn't change.
However, if you looked at it from a change data capture perspective, well, the table is emptied on a daily basis, and it's reloaded. Lots of changes. But in practice, there aren't that many changes. So that is absolutely an example where, certainly, the log based CDC is not the right approach.
[00:45:35] Unknown:
As you continue to build and invest in CDC at Fivetran, what are some of the things you have planned for the near to medium term?
[00:45:43] Unknown:
There's the continued focus on getting to the high volume use cases. The use cases that are absolutely critical to the customer's primary business processes, and unlocking the data and incorporating the data in their consolidated data feeds, their streaming analytics, their data warehouse workloads. That is ongoing, but at the same time, there's that desire to simplify the use cases. If you think about, like, okay, we have a particular database technology, and you go to the Fivetran website and you look at, well, okay, I want to unlock data out of this technology, we might present you with 2, 3, in some cases 4 or 5, different approaches to get the data out. You end up self selecting, like, okay, this is probably the best approach for me, and maybe you talk to a representative from Fivetran to help guide you through this.
However, in the ideal world, you shouldn't have to make that choice. We should be able to present the right choice to you. So maybe that is a flow of a few questions. And, of course, we can never reach your systems out of the cloud unless you provide the credentials to do so, so you may end up having to install a bit of software in your data center if that is what starts the handshake. But at the end of the day, if you wanna replicate 500 tables out of your ERP system and a couple of those are loaded via a truncate and reload, for example, well, then we shouldn't have to ask you to self select, like, okay, there's 2 different options here, and for those 2 tables you wanna use this option, and for the rest you wanna use this other option. We should be able to figure this out ourselves and make this so simple that, in the end, of course, we will always need to ask for credentials to the system, but beyond that, we absolutely minimize and simplify what that user experience looks like. And I think there is still room for improvement, and that's what we'll be looking at over the next few years, I imagine.
[00:47:48] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:48:04] Unknown:
I think it's related to visualization. Right? Like, from a visualization perspective, it's the ability to discover what would be the right visualization of a dataset. I think there are still missing technology components there that would essentially make the right choice of how we can visualize the results from a dataset. That's just not there. You have to know what you have to look for, and I think that's gonna be my submission to your question. Thank you. Alright. Well, thank you very much for taking the time today to join me and share the work that you've been doing on change data capture
[00:48:38] Unknown:
at Fivetran and definitely very interesting and constantly evolving space. I appreciate all the time and energy that you and your team are putting into that, and I hope you enjoy the rest of your day. Thank you. You too.
[00:48:56] Unknown:
Thank you for listening. Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast helps you go from idea to production with machine learning. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts at dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Sponsor Messages
Interview with Mark Van de Wiel: Introduction and Background
Notable Changes and Advancements at Fivetran
Change Data Capture and Industry Usage
Approaches to Change Data Capture
Real-Time Data and Event Streams
Analyzing Change Data Capture Feeds
Fivetran's Approach to Change Data Capture
Expanding Change Data Capture Beyond Databases
Transitioning from Batch to Change Data Capture
Impact of Change Data Capture on Organizational Practices
Interesting Use Cases of Change Data Capture
Future Plans for Change Data Capture at Fivetran
Biggest Gaps in Data Management Tooling
Closing Remarks