Summary
The extract and load pattern of data replication is the most commonly needed process in data engineering workflows. Because of the myriad sources and destinations that are available, it is also among the most difficult tasks that we encounter. Fivetran is a platform that does the hard work for you and replicates information from your source systems into whichever data warehouse you use. In this episode CEO and co-founder George Fraser explains how it is built, how it got started, and the challenges that creep in at the edges when dealing with so many disparate systems that need to be made to work together. This is a great conversation to listen to for a better understanding of the challenges inherent in synchronizing your data.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and Corinium Global Intelligence. Upcoming events include the O’Reilly AI Conference, the Strata Data Conference, and the combined events of the Data Architecture Summit and Graphorum. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
- Your host is Tobias Macey and today I’m interviewing George Fraser about Fivetran, a hosted platform for replicating your data from source to destination
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by describing the problem that Fivetran solves and the story of how it got started?
- Integration of multiple data sources (e.g. entity resolution)
- How is Fivetran architected and how has the overall system design changed since you first began working on it?
- monitoring and alerting
- Automated schema normalization. How does it work for customized data sources?
- Managing schema drift while avoiding data loss
- Change data capture
- What have you found to be the most complex or challenging data sources to work with reliably?
- Workflow for users getting started with Fivetran
- When is Fivetran the wrong choice for collecting and analyzing your data?
- What have you found to be the most challenging aspects of working in the space of data integrations?
- What have been the most interesting/unexpected/useful lessons that you have learned while building and growing Fivetran?
- What do you have planned for the future of Fivetran?
Contact Info
- @frasergeorgew on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Fivetran
- Ralph Kimball
- DBT (Data Build Tool)
- Looker
- Cron
- Kubernetes
- Postgres
- Oracle DB
- Salesforce
- Netsuite
- Marketo
- Jira
- Asana
- Cloudwatch
- Stackdriver
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform. If you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances.
Go to dataengineeringpodcast.com/linode, that's linode, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. And you listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, DataVersity, and the Open Data Science Conference with upcoming events, including the O'Reilly AI Conference, the Strata Data Conference, and the combined events of the Data Architecture Summit and Graphorum.
Go to dataengineeringpodcast.com/conferences to learn more and to take advantage of our partner discounts when you register. And go to the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers. Your host is Tobias Macey. And today, I'm interviewing George Fraser about Fivetran,
[00:01:49] Unknown:
a platform for shipping your data to data warehouses in a managed fashion. So, George, can you start by introducing yourself? Yeah. My name is George. I am the CEO of Fivetran, and I was one of two cofounders of Fivetran almost seven years ago when we started. And do you remember how you first got involved in the area of data management? Well, before Fivetran, I was actually a scientist, which is a bit of an unusual background for someone in data management. Although it was sort of an advantage for us that we were coming at it fresh. And so much has changed in the area of data management, particularly because of the new data warehouses that are so much faster and so much cheaper and so much easier to manage than the previous generation, that a fresh approach is really merited.
And so in a weird way, the fact that none of the founding team had a background in data management was kind of an advantage. And
[00:02:39] Unknown:
so can you start by describing a bit about the problem that Fivetran was built to solve and the overall story of how it got started and
[00:02:48] Unknown:
what motivated you to build a company around it? Well, I'll start with the story of how it got started. So, in late 2012, when we started the company, it was Taylor and I, and then Mel, who's now our VP of engineering, joined early in 2013. Fivetran was originally a vertically integrated data analysis tool. So it had a user interface that was sort of a super powered spreadsheet slash BI tool. It had a data warehouse on the inside, and it had a data pipeline that was feeding the data warehouse. And through many iterations of that idea, we discovered that the really valuable thing we had invented was actually the data pipeline that was part of that. And so we threw everything else away, and the data pipeline became the product. And the problem that Fivetran solves is the problem of getting all your company's data in one place.
So companies today use all kinds of tools to manage their business. You use CRM systems like Salesforce, you use payment systems like Stripe, support systems like Zendesk, finance systems like QuickBooks or Zuora. You have a production database somewhere. Maybe you have 20 production databases. And if you want to know what is happening in your business, the first step is usually to synchronize all of this data into a single database where an analyst can query it, and where you can build dashboards and BI tools on top of it. So that's the primary problem that Fivetran solves. People use Fivetran to do other things. Sometimes they use the data warehouse that we're syncing to as a production system. But the most common use case is they're just trying to understand what's going on in their business. And the first step in that is to sync all of that data into a single database.
[00:04:39] Unknown:
And in recent years, one of the prevalent approaches for being able to get all of the data into one location for being able to do analysis across it is to dump it all into a data lake, because of the fact that you don't need to do as much upfront schema management or data cleaning, and then you can experiment with everything that's available. And I'm wondering what your experience has been as far as the contrast between loading everything into a data warehouse for that purpose versus just using a data lake. Yeah. So in this area, I think that sometimes people present a bit of a false choice between you can either
[00:05:14] Unknown:
set up a data warehouse, do full on Kimball dimensional schema, data modeling, and Informatica with all of the upsides and downsides that come with that, or you can build a data lake, which is like a bunch of JSON and CSV files in S3. And I say false choice because I think the right approach is a happy medium where you don't go all the way to sticking raw JSON files and CSV files in S3. That's really unnecessary. Instead, you use a proper relational data store, but you exercise restraint in how much normalization and customization you do on the way in. So you say, I'm gonna make my first goal to create an accurate replica of all of the systems in one database, and then I'm gonna leave that alone. That's gonna be my sort of staging area, kinda like my data lake, except it lives in a regular relational data warehouse. And then I'm gonna build whatever transformations I want to do over that data on top of that data lake schema.
So another way of thinking about it is that I am advising that you should take a data lake type approach, but you shouldn't make your data lake a separate physical system. Instead, your data lake should just be a different logical system within the same database that you're using to analyze all your data and to support your BI tool. It's just a higher productivity,
[00:06:46] Unknown:
simpler workflow to do it that way. Yeah. And that's where the current trend towards moving the transformation step to after the data loading, the ELT pattern, has been coming from, because of the flexibility of these cloud data warehouses that you've mentioned as far as being able to consume semi structured and unstructured data while still being able to query across it and introspect it for the purposes of being able to join with other information that's already within that system. Yeah. The ELT pattern is really just a great way to get work done.
[00:07:17] Unknown:
It's simple. It allows you to recover from mistakes. So if you make a mistake in your transformations, and you will make mistakes in your transformations, or even if you just change your mind about how you wanna transform the data. The great advantage of the ELT pattern is that the original untransformed data is still sitting there side by side in the same database. So it's just really easy to iterate in a way that it isn't if you're transforming the data on the fly, or even if you have a data lake where you, like, store the API responses from all of your systems. That's still more complicated than if you just have this nice replica sitting in its own schema in your data warehouse.
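To make that layout concrete, here is a minimal sketch of a curation step built on top of an untouched raw replica, assuming a Postgres-compatible warehouse reachable with psycopg2; the schema, table, and column names and the connection string are hypothetical. Because the raw schemas are never modified, a mistake in the transformation only means fixing the SQL and re-running it.

```python
# A minimal sketch of the layout described above, assuming a Postgres-compatible
# warehouse reachable with psycopg2. The schema, table, and column names and the
# connection string are hypothetical. The raw replica schemas are never modified;
# the curated table is rebuilt from them, so a mistake only means re-running this.
import psycopg2

CURATE_ORDERS = """
CREATE SCHEMA IF NOT EXISTS analytics;

DROP TABLE IF EXISTS analytics.orders;
CREATE TABLE analytics.orders AS
SELECT
    o.id                 AS order_id,
    o.created_at::date   AS order_date,
    a.name               AS account_name,
    o.amount / 100.0     AS amount_usd
FROM raw_salesforce.opportunity o          -- untouched replica schema
JOIN raw_salesforce.account a ON a.id = o.account_id
WHERE o.is_deleted = FALSE;
"""

def main() -> None:
    conn = psycopg2.connect("dbname=warehouse user=analyst")  # hypothetical DSN
    try:
        with conn, conn.cursor() as cur:   # commits on successful exit
            cur.execute(CURATE_ORDERS)
    finally:
        conn.close()

if __name__ == "__main__":
    main()
```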
[00:08:00] Unknown:
And so one of the things that you pointed out is needing to be able to integrate across multiple different data sources that you might be using within a business. And you mentioned things like Salesforce for CRM or things like ticket tracking and user feedback such as Zendesk, etcetera. And I'm wondering what your experience has been as far as being able to map the logical entities across these different systems together to be able to effectively join and query across those datasets, given that they don't necessarily share common field names, to be able
[00:08:40] Unknown:
to map across those entities? Yeah. This is a really important step. And the first thing we always advise our customers to do, and even anyone who's building a data warehouse, I would advise to do this, is that you need to keep straight in your mind that there are really two problems here. The first problem is replicating all of the data, and the second problem is rationalizing all of the data into a single schema. And you need to think of these as two steps. You need to follow proper separation of concerns, just as you would in a software engineering project. So we really focus on that first step, on replication.
What we have found is that the approach that works really well for our customers for rationalizing all the data into a single schema is to use SQL. SQL is a great tool for unioning things, joining things, changing field names, filtering data, all the kinds of stuff you need to do to rationalize a bunch of different data sources into a single schema, and we find the most productive way to do that is to use a bunch of SQL queries that run inside your data warehouse. And do you have your own tooling and interfaces
[00:09:49] Unknown:
for being able to expose that process to your end users? Or do you also
[00:09:53] Unknown:
integrate with tools such as DBT for being able to have that overall process controlled by the end user? So we originally did not do anything in this area other than give advice. And we got the advantage that we got to sort of watch what our users did, in that context. And what we saw is that a lot of them set up cron to run SQL scripts on a regular schedule. A lot of them used Looker persistent derived tables. Some people used Airflow. They used Airflow in kind of a funny way. They didn't really use the Python parts of Airflow. They just used Airflow as a way to trigger SQL. And when dbt came out, we have a decent community of users who use dbt.
And we're supportive of whatever mechanism you want to use to transform your data. We do now have our own transformation tool, built into our UI. And it's the first version that you can use right now. It's basically a way that you can provide the SQL script, and you can trigger that SQL script when Fivetran delivers new data to your tables. And we've got lots of people using the first version of that. That's gonna continue to evolve over the rest of this year. It's gonna get a lot more sophistication, and it's gonna do a lot more to give you insight into the transforms that are running and how they all relate to each other. But the core idea of it is that SQL is the right tool for transforming data.
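As a point of reference for the cron-based workflow mentioned above, a scheduled transformation runner can be as small as the following sketch, which executes an ordered directory of SQL files against the warehouse; the paths, file layout, and connection string are hypothetical, and it again assumes a Postgres-compatible warehouse and psycopg2.

```python
# A small sketch of that cron-driven workflow: a script that a scheduler invokes to
# run a directory of transformation SQL files, in order, against the warehouse after
# Fivetran has delivered new data. Paths, file layout, and the connection string are
# hypothetical; assumes a Postgres-compatible warehouse and psycopg2.
#
# Example crontab entry (hourly):
#   0 * * * * /usr/bin/python3 /opt/analytics/run_transforms.py
import pathlib
import psycopg2

SQL_DIR = pathlib.Path("/opt/analytics/transforms")  # e.g. 010_staging.sql, 020_orders.sql

def main() -> None:
    conn = psycopg2.connect("dbname=warehouse user=transform_bot")  # hypothetical DSN
    try:
        for sql_file in sorted(SQL_DIR.glob("*.sql")):  # lexical order = run order
            with conn, conn.cursor() as cur:            # commit after each file
                cur.execute(sql_file.read_text())
            print(f"applied {sql_file.name}")
    finally:
        conn.close()

if __name__ == "__main__":
    main()
```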
[00:11:17] Unknown:
And before we get too far into the rest of the feature set and capabilities of Fivetran, I'm wondering if you can talk about how the overall system is architected and how the overall system design has evolved since you first began working on it. Yeah. So
[00:11:35] Unknown:
the overall architecture is fairly simple. The hard part of Fivetran is really not the sort of high class data problems, things like queues and streams and giant datasets flying around. The hard part of Fivetran is really all of the incidental complexity of all of these data sources. Understanding all of the small, sort of crazy rules that every API has. So most of our effort over the years has actually been devoted to hunting down all these little details of every single data source we support, and that's what makes our product really valuable. The architecture itself is fairly simple. The original architecture was essentially a bunch of EC2 instances with cron running a bunch of Java processes that were running on a fast batch cycle, syncing people's data.
Over the last year and a half, the engineering team has built a new architecture based on Kubernetes. There are many advantages of this new architecture for us internally. The biggest one is that it autoscales. But from the outside, you can't even tell when you migrate from the old architecture to the new architecture, other than you have to whitelist a new set of IPs. So, you know, it was a very simple architecture in the beginning. It's gotten somewhat more complex. But, really, the hard part of Fivetran is not the high class data engineering problems. It's the little details of every data source so that from the user's perspective, you just get this magical replica of all of your systems in a single database.
[00:13:17] Unknown:
And for being able to keep track of the overall health of your system and ensure that data is flowing from end to end for all of your different customers, I'm curious what you're using for monitoring and alerting strategy and any sort of ongoing continuous testing as well as advanced unit testing that you're using to make sure that all of your API interactions are consistent with what is necessary for the source systems that you're working with?
[00:13:43] Unknown:
Yeah. Well, first of all, there's several layers to that. The first one is actually the testing that we do on our end to validate that all of our sync strategies, all those little details I mentioned a minute ago, are actually working correctly. Our testing problem is quite difficult, because we interoperate with so many external systems. And in many cases, you really have to run the tests against the real system for the tests to be meaningful. And so our build architecture is actually one of the more complex parts of Fivetran. We use a build tool called Bazel, and we've done a lot of work, for example, to run all of the databases and FTP servers and things like that that we have to interact with in Docker containers, so that we can actually produce reproducible end-to-end tests.
So that actually is one of the more complex engineering problems at Fivetran. And if that sounds interesting to you, I encourage you to apply to our engineering team because we have lots more work to do on that. So that's the first layer is really all of those tests that we run to verify that our sync strategies are correct. The second layer is that, you know, is it working in production? Is the customer's data actually getting synced, and is it getting synced correctly? And one of the things we do there that may be a little unexpected to people who are accustomed to building data pipelines themselves is all of Fivetran's data pipelines are typically fail fast. That means if anything unexpected happens, if we see, you know, some event from an API endpoint that we don't recognize, we stop.
Now that's different than when you build data pipelines yourself. When you build data pipelines for your own company, usually, you will have them try to keep going no matter what. But Fivetran is a fully managed service, and we're monitoring it all the time. So we tend to make the opposite choice. If anything suspicious is going on, the correct thing to do is just stop and alert Fivetran. Hey. Go check out this customer's data pipeline. What the heck is going on? Something unexpected is happening, and we should make sure that our sync strategies are actually correct. And then that brings us to the last layer of this, which is alerting. So when data pipelines fail, we get alerted and the customer gets alerted at the same time.
And then we communicate with the customer, and we say, hey. We may need to go in and check something. Do I have permission to go, you know, look at what's going on in your data pipeline in order to figure out, what's going wrong? Because Fivetran is a fully managed service, and that is critical to making it work. When you do what we do and you say, we are gonna take responsibility for actually creating an accurate replica of all of your systems in your data warehouse, that means you're signing on to comprehend and fix every little detail of every data source that you support. And a lot of those little details only come up in production. When some customer shows up and they're using a feature of Salesforce that Salesforce hasn't sold for 5 years, but they've still got it and you've never seen it before.
A lot of those little things only come up in production. The nice thing is that that set of little things, while it is very large, is finite. And we only have to discover each problem once, and then every customer thereafter benefits from that fix.
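Here is a small sketch of that fail-fast choice, under stated assumptions: the connector object, event shapes, and alerting hook are hypothetical stand-ins rather than Fivetran's actual internals.

```python
# A sketch of that fail-fast choice: anything the sync strategy doesn't recognize
# stops the pipeline and raises an alert instead of being silently skipped. The
# connector object, event shapes, and alert hook are hypothetical stand-ins, not
# Fivetran's actual internals.
KNOWN_EVENT_TYPES = {"created", "updated", "deleted"}

class UnexpectedSourceData(Exception):
    """Raised when the source returns something the sync strategy doesn't understand."""

def sync_events(connector, alert) -> None:
    for event in connector.fetch_events():      # hypothetical connector API
        event_type = event.get("type")
        if event_type not in KNOWN_EVENT_TYPES:
            # Fail fast: page someone to check whether the sync strategy is still
            # correct, rather than dropping or mangling the unknown event.
            alert(f"unexpected event type {event_type!r} from {connector.name}")
            raise UnexpectedSourceData(event_type)
        connector.apply(event)                   # hypothetical: write to the warehouse
```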
[00:17:01] Unknown:
For the system itself, one of the things that I noticed is the automated schema normalization that you provide, and I'm wondering how that works and the overall logic flow that you have built in, whether it's just a static mapping that you have for each different data source or if there's some sort of more complex algorithm that's going on behind the scenes there, as well as how that works for any sort of customized data sources such as application databases that you're working with or maybe just JSON feeds or event streams?
[00:17:39] Unknown:
Sure. So the first thing you have to understand is that there's really two categories of data sources in terms of schema normalization. The first category is databases, like Oracle or MySQL or Postgres, and database-like systems. Like, NetSuite is really basically a database when you look at the API, and so is Salesforce. There's a bunch of systems that basically look like databases. They have arbitrary tables and columns. You can set any types you want in any column. What we do with those systems is we just create an exact one-to-one replica of the source schema. It's really as simple as that. So there's a lot of work to do because the change feeds are usually very complicated from those systems, and it's very complex to turn those change feeds back into the original schema, but it is automatable. So for databases and database-like systems, we just produce the exact same schema in your data warehouse as it was in the source.
For apps, for things like Stripe or Zendesk or GitHub or Jira, we do a lot of normalization of the data. So tools like that, when you look at the API responses, the API responses are very complex and usually very far from the original normalized schema that this data probably lived in inside the source database. And every time we add a new data source of that type, we study the data source. I joke that we reverse engineer the API. We basically figure out what was the schema in the database that this originally was, and we unwind all the API responses back into the normalized schema. These days, we often just get an engineer at the company that is that data source on the phone and ask them, you know, what is the real schema here?
We found that we can save ourselves a whole lot of work by doing that. But the goal is always to produce a normalized schema in the data warehouse. And the reason why we do that is because we just think if we put in that work upfront to normalize the data in your data warehouse, we can save every single one of our customers a whole bunch of time traipsing through the data, trying to figure out how to normalize it. So we figure it's worthwhile for us to put the effort in upfront so our customers don't have to.
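As a rough illustration of that unwinding, the sketch below flattens a made-up, nested invoice payload into normalized parent and child rows; the payload shape and resulting table layout are hypothetical, not any particular vendor's API or Fivetran's actual output.

```python
# A rough sketch of that unwinding: a made-up, nested invoice payload is flattened
# back into normalized parent and child rows that map onto warehouse tables. The
# payload shape and table layout are hypothetical, not any particular vendor's API
# or Fivetran's actual output.
from typing import Any

def normalize_invoice(payload: dict) -> dict[str, list[dict[str, Any]]]:
    invoice_row = {
        "id": payload["id"],
        "customer_id": payload["customer"]["id"],   # nested object becomes a foreign key
        "currency": payload["currency"],
        "created_at": payload["created"],
    }
    line_rows = [
        {
            "invoice_id": payload["id"],            # child rows carry the parent key
            "line_number": i,
            "product_id": line["product"]["id"],
            "quantity": line["quantity"],
            "unit_amount": line["unit_amount"],
        }
        for i, line in enumerate(payload["lines"], start=1)
    ]
    # One row set per entity, instead of one JSON blob per API response.
    return {"invoice": [invoice_row], "invoice_line": line_rows}
```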
[00:20:04] Unknown:
And one of the other issues that comes up with normalization, particularly for the source database systems that you're talking about, is the idea of schema drift, when new fields are added or removed, or data types change, or the sort of default data types change. I'm wondering how you manage schema drift overall in the data warehouse systems that you're loading into while preventing data loss, particularly in the cases where a column might be dropped or the data type changed?
[00:20:29] Unknown:
Yeah. So there's a core pipeline that all Fivetran connectors, databases, apps, everything, is written against that we use internally. And all of the rules of how to deal with schema drift are encoded there. So some cases are easy. Like, if you drop a column, then that data just isn't arriving anymore. We will leave that column in your data warehouse. We're not going to delete it in case there's something important in it. You can drop it in your data warehouse if you want to, but we're not going to. If you add a column, again, that's pretty easy. We add a column in your data warehouse. All of the old rows will have nulls in that column, obviously.
But then going forward, we will populate that column. The tricky cases are when you change the types. So when you alter the type of an existing column, that can be more difficult to deal with. There are two principles we follow. First of all, we're going to propagate that type change to your data warehouse. So we're gonna go and change the type of the column in your data warehouse to fit the new data. And the second principle we follow is that when you change types, sometimes you sort of contradict yourself. And we follow the rule of subtyping in handling that. If you think back to your undergraduate computer science classes, this is the good old concept of subtypes. For example, an int is a subtype of a real. A real is a subtype of a string, etcetera. So we look at all the data passing through the system and we infer what is the most specific type that can contain all of the values that we have seen. And then we alter the data warehouse to be that type so that we can actually fit the data into the data warehouse.
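A minimal sketch of that subtyping rule, with a deliberately simplified lattice of int, real, and string and illustrative values:

```python
# A sketch of that subtyping rule: given every value seen for a column, infer the
# most specific type in the lattice int <: real <: string that can hold all of them,
# then alter the warehouse column to that type. The lattice here is deliberately
# simplified and the example values are illustrative.
def most_specific_type(values) -> str:
    inferred = "int"                       # start at the most specific type
    for v in values:
        s = str(v)
        if inferred == "int":
            try:
                int(s)
            except ValueError:
                inferred = "real"          # widen: int -> real
        if inferred == "real":
            try:
                float(s)
            except ValueError:
                inferred = "string"        # widen: real -> string
        if inferred == "string":
            break                          # nothing wider to try
    return inferred

# most_specific_type(["1", "2", "3"])        -> "int"
# most_specific_type(["1", "2.5"])           -> "real"
# most_specific_type(["1", "2.5", "n/a"])    -> "string"
```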
[00:22:19] Unknown:
Another capability that you provide is change data capture for when you're loading from these relational database systems into the data warehouse. And that's a problem space that I've always been interested in as far as how you are able to capture the change logs within the data system and then be able to replay them effectively to reconstruct the current state of the database without just doing a straight SQL dump. And I'm wondering how you handle that in your platform.
[00:22:47] Unknown:
Yeah. It's very complicated. Most people who build in house data pipelines, as you say, they just do a dump and load of the entire table because the change logs are so complicated. And the problem with dump and load is that it requires huge bandwidth, which isn't always available, and it takes a long time. So you end up running it just once an hour if you're lucky, but for a lot of people, once a day. So we do change data capture. We read the change logs of each database. Each database has a different change log format. Most of them are extremely complicated. If you look at the MySQL change log format or the Oracle change log format, it is like going back in time to the history of MySQL. You can sort of see every architectural change in MySQL in the change log format.
The answer to how we do that, there's no trick. It's just a lot of work understanding all the possible corner cases of these change logs. It helps that we have many customers with each database, unlike when you're building a system just for yourself. Because we're building a product, we have lots of MySQL users. We have lots of Postgres users. And so over time, we see all the little corner cases and you eventually figure it out. You eventually find all the things, and you get a system that just works. But the short answer is there's really no trick. It's just a huge amount of effort by the databases team at Fivetran, who at this point has been working on it for years with, you know, hundreds of customers.
So at this point, you know, we've invested so much effort in tracking down all those little things that there's just, like, no hope that you could do better yourself building a change reader just for your own company.
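For orientation, here is a simplified sketch of the replay side of change data capture; the event format is a stand-in for the real log formats, which are far more involved in practice.

```python
# A sketch of the replay side of change data capture: rather than dumping and
# reloading whole tables, apply each logged insert, update, and delete to the
# replica, keyed by primary key. The event format is a simplified stand-in for the
# real log formats (MySQL binlog, Oracle LogMiner, Postgres logical decoding), which
# are far more involved in practice.
def apply_change_event(table: dict, event: dict) -> None:
    """table: in-memory stand-in for the warehouse copy, keyed by primary key."""
    op, pk = event["op"], event["pk"]
    if op in ("insert", "update"):
        table[pk] = event["row"]           # upsert the latest image of the row
    elif op == "delete":
        table.pop(pk, None)                # or set a soft-delete flag instead
    else:
        raise ValueError(f"unknown change operation: {op}")  # fail fast

# Replaying the feed in commit order reconstructs the current state of the table.
replica: dict = {}
feed = [
    {"op": "insert", "pk": 1, "row": {"id": 1, "status": "new"}},
    {"op": "update", "pk": 1, "row": {"id": 1, "status": "paid"}},
    {"op": "delete", "pk": 1},
]
for e in feed:
    apply_change_event(replica, e)
```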
[00:24:25] Unknown:
For the particular problem space that you're in, you have a sort of many-to-many issue where you're dealing with a lot of different types of data sources, and then you're loading it into a number of different options for data warehouses. And on the source side, I'm wondering what you have found to be some of the most complex or challenging sources to be able to work with reliably and some of the strategies that you have found to be effective for picking up a new source and being able to
[00:24:55] Unknown:
get it production ready in the shortest amount of time? Yeah. It's funny. You know, if you ask any engineer at Fivetran, they can all tell you what the most difficult data sources are because we've had to do so much work on them over the years. Undoubtedly, the most difficult data source is Marketo. Close seconds are Jira, Asana, and then probably NetSuite. So those APIs just have a ton of incidental complexity. It's really hard to get data out of them fast. We're working with some of these sources to try to help them improve their APIs, to make it easier to do replication.
But there's a handful of data sources that have required disproportionate work to get them working reliably. In general, one funny observation that we have seen over the years is that the companies with the best APIs tend, unfortunately, to be the least successful companies. It seems to be a general principle that companies which have really beautiful, well organized APIs tend to not be very successful businesses. I guess because they're just not focused enough on sales or something. We've seen it time and again where we integrate a new data source and we look at the API and we go, man, this API is great. I wish you had more customers so that we could sync data for them. The one exception I would say is Stripe, which has a great API and is a highly successful company. And that's probably because their API is their product. So there's definitely a spectrum of difficulty.
In general, the oldest, largest companies have the most complex
[00:26:32] Unknown:
APIs. Yeah. I wonder if there's some reverse incentive where they make their APIs obtuse and difficult to work with so that they can build up an ecosystem around them of contractors
[00:26:44] Unknown:
whose sole purpose is to be able to integrate them with other systems? You know, I think there's a little bit of that, but less than you would think. For example, the company that has, by far, the most extensive ecosystem of contractors helping people integrate their tool with other systems is Salesforce. And Salesforce's API is quite good. Salesforce is actually one of the simpler APIs out there. It was harder a few years ago when we first implemented it, but they made a lot of improvements.
[00:27:13] Unknown:
And it's actually one of the better APIs now. Yeah. I think that's probably coming off the tail of their acquisition of MuleSoft to sort of reformat their internal systems and data representation to make it easier to integrate. 'Cause I know beforehand, it was just a whole mess of XML.
[00:27:28] Unknown:
You know, it was really before the MuleSoft acquisition that a lot of the improvements in the Salesforce API happened. The Salesforce REST API was always pretty well structured and rational. Five years ago, it would fail a lot. You would send queries, and they would just not return when you had really big datasets, and now it's more performant. So I think it predates the MuleSoft acquisition. They just did the hard work to make all the corner cases work reliably and scale to large datasets. And Salesforce is now one of the easier data sources to actually sync. There are certain objects that have complicated roles. And I think the developers at Fivetran who work on Salesforce will get mad at me when they hear me say this.
[00:28:09] Unknown:
But compared to, like, NetSuite, it's pretty great. On the other side of the equation where you're loading data into the different target data warehouses, I'm wondering what your strategy is as far as being able to make the most effective use of the feature sets that are present, or do you just target the lowest common denominator of SQL representation for being able to load data in and then leave the complicated aspects of it to the end user for doing the transformations and analyses?
[00:28:37] Unknown:
So most of the code for doing the load side is shared between the data warehouses. The differences are not that great between different destinations, except BigQuery. BigQuery is a little bit of an unusual creature. So if you look at Fivetran's code base, there's actually a different implementation for BigQuery that shares very little with all of the other destinations. So the differences between destinations are not that big of a problem for us. There are certain things that do differ; you know, there are functions that have to be overridden for different destinations for things like the names of types. And there are some special cases around performance where our load strategies are slightly different, for example, between Snowflake and Redshift, just to get faster performance.
But in general, that actually is the easier side of the business, the destinations. And then in terms of transformations, it's really up to the user to write the SQL that transforms their data. And it is true that to write effective transformations, especially incremental transformations, you always have to use the proprietary features of the particular database that you're working on. On the incremental piece,
[00:29:49] Unknown:
I'm interested in how you address that for some of the different source systems. Because for the databases where you're doing change data capture, it's fairly obvious that you can take that approach for incremental data loading. But for some of the more API oriented systems, I'm wondering if there's a high degree of variability in being able to
[00:30:10] Unknown:
pull in just the objects that have changed since a certain last sync time or if there are a number of systems that will just give you absolutely everything every time, and then you have to do the diffing on your side. The complexity of those change logs, I know I mentioned this earlier, but it is staggering. But yes, on the API side, we're also doing change data capture of apps. It is different for every app. But just about every API we work with provides some kind of change feed mechanism. Now it is complicated. You often end up in a situation where the API will give you a change feed that's incremental, but then other endpoints are not incremental. So you have to do this thing where you read the change feed and you look at the individual events in the change feed, and then you go look up the related information from the other entity.
So you end up dealing with a bunch of extra complexity because of that. But as with all things at Fivetran, we have this advantage that we have many customers with each data source. So we can put in that disproportionate effort that you would never do if you were building it just for yourself, to make the change capture mechanism work properly. Because we just have to do it once, and then everyone who uses that data source can benefit from it.
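A small sketch of that pattern for app connectors, with a hypothetical client, endpoints, and cursor handling standing in for any specific vendor's API:

```python
# A sketch of that pattern for app connectors: page through an incremental change
# feed from a saved cursor and, for each event, fetch the full record from an
# endpoint that isn't incremental. The client object, endpoints, and cursor handling
# are hypothetical, not any specific vendor's API.
def sync_since(client, cursor):
    events, next_cursor = client.get_change_feed(since=cursor)   # hypothetical API
    for event in events:
        # The feed may only say "contact 123 changed", so look up the full record
        # from the non-incremental endpoint before loading it.
        record = client.get("contacts", event["entity_id"])
        client.load_to_warehouse("contact", record)               # hypothetical loader
    return next_cursor        # persist this so the next run stays incremental
```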
[00:31:16] Unknown:
For people who are getting onboarded onto the Fivetran system, I'm curious what the overall workflow looks like as far as the initial setup, and then what their workflow looks like as they're adding new sources or just interacting with their Fivetran account for being able to
[00:31:41] Unknown:
keep track of the overall health of their system or if it's largely just fire and forget, and they're only interacting with the data warehouse at the other side? It's pretty simple. The joke at Fivetran is that our demo takes about 15 seconds. Because we're so committed to automation and we're so committed to this idea that Fivetran's fundamental job is to replicate everything into your data warehouse, and then you can do whatever you want with it, it means that there's very little UI. The process of setting up a new data source is basically connect source, which for many sources is as simple as just going through an OAuth redirect. And you just click, you know, yes, Fivetran is allowed to access my data, and that's it. And connect destination, and now we're actually integrated with Snowflake and BigQuery, so you can just push a button in Snowflake or in BigQuery and create a Fivetran account that's pre connected to your data warehouse.
So the setup process is really simple. After setup, there is a bunch of UI around monitoring what's happening. We like to say that Fivetran is a glass box. It was originally a black box, and now it's a glass box. You can see exactly what it's doing. You can't change it, but you can see exactly what we're doing at all times. And, you know, part of that is in the UI and part of that is in emails you get when things go wrong or the sync finishes for the first time, that kind of thing. For part of that visibility, I also noticed that you will ship the transaction logs to the end user's log aggregation system. And I thought that was an interesting approach,
[00:33:10] Unknown:
as far as being able to give them a way to be able to access all of that information in one place without having to go to your platform just for that one-off case of trying to see what the transaction logs are and gain that extra piece of visibility. So I'm wondering what types of feedback you've got from users as far as the overall visibility into your systems and the ways that they're able to integrate it into their monitoring platforms.
[00:33:34] Unknown:
Yeah. So the logs you're talking about are the logs of every action Fivetran took. Like, Fivetran made this API call against Salesforce. Fivetran ran this LogMiner query against Oracle. And so we record all this metadata about everything we're doing. And then you can see that in the UI, but you can also ship that to your own logging system like CloudWatch or Stackdriver. Because a lot of companies, in the same way they have a centralized data warehouse, have a centralized logging system. It's mostly used by larger companies. Those are the ones who invest the effort in setting up those centralized logging systems.
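For illustration, the kind of per-action record described here could be emitted as JSON lines that a log-shipping agent forwards to a system like CloudWatch or Stackdriver; the field names below are illustrative, not Fivetran's actual log schema.

```python
# A sketch of per-action log records emitted as JSON lines so an agent can forward
# them to a centralized system such as CloudWatch or Stackdriver. Field names are
# illustrative, not Fivetran's actual log schema.
import json
import time

def log_action(connector: str, action: str, **details) -> None:
    record = {"timestamp": time.time(), "connector": connector, "action": action, **details}
    print(json.dumps(record))

log_action("salesforce", "api_call", endpoint="/services/data/query", rows=5000)
log_action("oracle", "logminer_query", duration_ms=1240)
```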
And it's actually the system we built first, before we built it into our own UI. And later, we found it's also important just to have it in our own UI, just so there's a quick way to view what's going on. And, yeah, I think people have appreciated that we're happy to support
[00:34:30] Unknown:
the systems they already have rather than try to build our own thing and force you to use that. I imagine that also plays into efforts within these organizations for being able to track data lineage and provenance for understanding the overall life cycle of their data as it spans across different systems.
[00:34:48] Unknown:
You know, that's not so much of a logging problem. That's more of like a metadata problem inside the data warehouse. When you're trying to track lineage to say like, this row in my data warehouse came from this transformation, which came from these 3 tables, and these tables came from Salesforce, and it was connected by this user, and it synced at this time, etcetera. That lineage problem is really more of a metadata problem. And that's kind of a greenfield in our area right now. There's a couple different companies that are trying to solve that problem. We're doing some interesting work on that in conjunction with our transformations.
I think it's a very important problem. There's still a lot of work to be done there.
[00:35:28] Unknown:
So on the sales side of things too, I know you said that your demo is about 15 seconds as far as, yes, you just do this, this, and then your data is in your data warehouse. But I'm wondering what you have found to be some of the common questions or common issues that people have that bring them to you as far as evaluating your platform for their use cases and just some of the overall user experience design that you have put into the platform as well to help ease that onboarding process? Yeah. So
[00:36:00] Unknown:
a lot of the discussions in the sales process really revolve around that ELT philosophy of Fivetran is gonna take care of replicating all of your data, and then you're gonna curate it non destructively using SQL, which for some people just seems like the obvious way to do it. But for others, this is a very shocking proposition. This idea that your data warehouse is going to have this comparatively uncurated schema that Fivetran is delivering data into, and then you're basically gonna make a second copy of everything. For a lot of people who have been doing this for a long time, that's a very surprising approach. And so a lot of the discussion in sales revolves around the trade offs of that, why we think that's the right answer for the data warehouses that exist today, which are just so much faster and so much cheaper that it makes sense to adopt that more human friendly workflow,
[00:36:51] Unknown:
than maybe it would have in the nineties. And what are the cases where Fivetran is the wrong choice for being able to replicate data or integrate it into a data warehouse? Well, if you already have a working system, you should keep using it.
[00:37:05] Unknown:
So we don't advise people to change things just for the sake of change. If you've set up a bunch of Python scripts that are syncing all your data sources and it's working, keep using it. What usually happens that causes people to take out a system is schema changes. Death by a thousand schema changes. So they find that the data sources upstream are changing. Their scripts that are syncing their data are constantly breaking. It's this huge effort to keep them alive. And so that's the situation where prospects will abandon an existing system and adopt Fivetran. But what I'll tell people is, you know, if your schema is not changing, if you're not having to go fix these pipes every week, don't change it. Just keep using it. And as far as the overall challenges or complexities
[00:37:54] Unknown:
of the problem space that you're working with, I'm wondering what you have found to be some of the most difficult to overcome or some of the ones that are most noteworthy and that you'd like to call out for anybody else who is either working in this space or considering building their own pipeline from scratch. Yeah. You know, I think that
[00:38:13] Unknown:
when we got our first customer in 2015, syncing Salesforce to Redshift, and two weeks later we got our second customer, syncing Salesforce and HubSpot and Stripe into Redshift, I sort of imagined that this sync problem was something we were gonna have solved in a year, and then we would go on and build a bunch of other related tools. And the sync problem is much harder than it looks at first. Getting all the little details right so that it just works is an astonishingly difficult problem. It is a parallel problem. You can have lots of developers working on different data sources, figuring out all those little details. We have accumulated general lessons that we've incorporated into our core code. So we've gotten better at doing this over the years. And it really works when you have multiple customers who have each data source. So it works a lot better as a product company than as someone building an in house data pipeline. But the level of complexity associated with just doing replication correctly was kind of astonishing for me. And I think it is astonishing for a lot of people who try to solve this problem. You know, you look at the API docs of a data source and you figure, oh, I think I know how I'm gonna sync this. And then you go into production with 10 customers and suddenly you find 10 different corner cases that you never thought of that are gonna make it harder than you expected to sync the data. So the level of difficulty of just that problem is kind of astonishing. But the value of solving just that problem is also kind of astonishing. On both the technical and business side, I'm also interested in
[00:39:50] Unknown:
understanding what you have found to be the most interesting or unexpected or useful lessons that you've learned in the overall process of building and growing Fivetran? Well, I've talked about some of the technical
[00:40:01] Unknown:
lessons, in terms of, you know, just solving that problem really well is both really hard and really valuable. In terms of the business lessons we've learned, you know, growing the company is like a coequal problem to growing the technology. I've been really pleased with how we've made a place where people seem to genuinely like to work, where a lot of people have been able to develop their careers in different ways. Different people have different career goals, and you need to realize that as someone leading a company. Not everyone at this company is like myself. They have different goals that they want to accomplish. So that problem of growing the company is just as important and just as complex as solving the technical problems and growing the product and growing the sales side and helping people find out that you have this great product that they should probably be using. So I think that has been a real lesson for me over the
[00:40:55] Unknown:
almost 7 years that we've been doing this now. For the future of Fivetran, what do you have planned both on the business roadmap as well as the feature sets that you're looking to integrate into Fivetran
[00:41:08] Unknown:
and just some of the overall goals that you have for the business as you look forward? Sure. So some of the most important stuff we're doing right now is on the sales and marketing side. We have done all of this work to solve this replication problem, which is very fundamental and very reusable. And I like to say, no one else should have to deal with all of these APIs, since we have done it. You should not need to write a bunch of Python scripts to sync your data or configure Informatica or anything like that. We've done it once so that you don't have to. And I guarantee you, it will cost you less to buy Fivetran than to have your own team basically build an in house data pipeline. So we're doing a lot of work on the sales and marketing side just to get the word out that Fivetran's out there and it might be something that's really useful to you. On the product side, we are doing a lot of work now in helping people manage those transformations in the data warehouse. So we have the first version of our transformations tool in our product.
There's gonna be a lot more sophistication getting added to that over the next year. We really view that as the next frontier for Fivetran: helping people manage the data after we've replicated it. Are there any other aspects
[00:42:20] Unknown:
of the Fivetran company and technical stack or the overall problem space of data synchronization that we didn't touch on that you'd like to cover before we close out the show? I don't think so. I think
[00:42:31] Unknown:
the thing that people tend to not realize, because they tend to just not talk about it as much, is that the real difficulty in this space is all of that incidental complexity of all the data sources. You know, Kafka is not gonna solve this problem for you. Spark is not gonna solve this problem for you. There is no fancy technical solution. Most of the difficulty
[00:42:51] Unknown:
of the data centralization problem is just in understanding and working around all of the incidental complexity of all of these data sources. For anybody who wants to get in touch with you or follow along with the work that you and Fivetran are doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. Yeah. I think that the biggest gap right now is in the tools that are available to analysts who are trying to curate the data
[00:43:24] Unknown:
after it arrives. So writing all of the SQL that curates the data into a format that's ready for the business users to attack with BI tools is a huge amount of work. It remains a huge amount of work. And if you look at the workflow of the typical analyst, they're writing a ton of SQL, and it's a very analogous problem to a developer writing code using Java or C Sharp. But the tools that analysts have to work with look like the tools developers had in, like, the eighties. I mean, they don't even really have autocomplete. So I think that is a really underinvested-in problem.
[00:44:01] Unknown:
Just the tooling for analysts to make them more productive in the exact same way as we've been building tooling for developers over the last 30 years. A lot of that needs to happen for analysts too, and I think it hasn't happened yet. Well, thank you very much for taking the time today to join me and discuss the work that you've been doing at Fivetran and some of the insights that you've gained in the process. It's definitely an interesting platform and an interesting problem space, and I can see that you're providing a lot of value. So I appreciate all of your efforts on that front, and I hope you enjoy the rest of your day. Thanks for having me on.
Introduction to George Fraser and Fivetran
The Origin and Evolution of Fivetran
Data Warehouses vs. Data Lakes
The ELT Pattern and Its Advantages
Fivetran's System Architecture
Schema Normalization and Drift Management
Change Data Capture in Fivetran
Onboarding and User Workflow
When Fivetran is the Right Choice
Future Plans for Fivetran