Summary
The ETL pattern that has become commonplace for integrating data from multiple sources has proven useful, but complex to maintain. For a small number of sources it is a tractable problem, but as the overall complexity of the data ecosystem continues to expand it may be time to identify new ways to tame the deluge of information. In this episode Tim Ward, CEO of CluedIn, explains the idea of eventual connectivity as a new paradigm for data integration. Rather than manually defining all of the mappings ahead of time, we can rely on the power of graph databases and some strategic metadata to allow connections to occur as the data becomes available. If you are struggling to maintain a tangle of data pipelines then you might find some new ideas for reducing your workload.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- To connect with the startups that are shaping the future and take advantage of the opportunities that they provide, check out AngelList where you can invest in innovative businesses, find a job, or post a position of your own. Sign up today at dataengineeringpodcast.com/angel and help support this show.
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Upcoming events include the O’Reilly AI Conference, the Strata Data Conference, and the combined events of the Data Architecture Summit and Graphorum. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
- Your host is Tobias Macey and today I’m interviewing Tim Ward about his thoughts on eventual connectivity as a new pattern to replace traditional ETL
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by discussing the challenges and shortcomings that you perceive in the existing practices of ETL?
- What is eventual connectivity and how does it address the problems with ETL in the current data landscape?
- In your white paper you mention the benefits of graph technology and how it solves the problem of data integration. Can you talk through an example use case?
- How do different implementations of graph databases impact their viability for this use case?
- Can you talk through the overall system architecture and data flow for an example implementation of eventual connectivity?
- How much up-front modeling is necessary to make this a viable approach to data integration?
- How do the volume and format of the source data impact the technology and architecture decisions that you would make?
- What are the limitations or edge cases that you have found when using this pattern?
- In modern ETL architectures there has been a lot of time and work put into workflow management systems for orchestrating data flows. Is there still a place for those tools when using the eventual connectivity pattern?
- What resources do you recommend for someone who wants to learn more about this approach and start using it in their organization?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Eventual Connectivity White Paper
- CluedIn
- Copenhagen
- Ewok
- Multivariate Testing
- CRM
- ERP
- ETL
- ELT
- DAG
- Graph Database
- Apache NiFi
- Apache Airflow
- BigQuery
- Redshift
- CosmosDB
- SAP HANA
- IOT == Internet of Things
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform. If you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode, that's linode, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show.
And to grow your professional network and find opportunities with the startups that are changing the world, AngelList is the place to go. Go to dataengineeringpodcast.com/angel to sign up today. And you listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Dataversity, and the Open Data Science Conference with upcoming events, including the O'Reilly AI Conference, the Strata Data Conference, and the combined events of the Data Architecture Summit and Graphorum.
Go to dataengineeringpodcast.com/conferences to learn more and to take advantage of our partner discounts when you register. And go to the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers. Your host is Tobias Macey. And today, I'm interviewing Tim Ward about his thoughts on eventual connectivity as a new pattern to replace traditional ETL. And just as a full disclosure, Tim is the CEO of CluedIn, which is a sponsor of the podcast. So, Tim, can you just start by introducing yourself? Yeah. Sure. My name is Tim Ward. As Tobias has said, I'm the CEO of a data platform
[00:02:16] Unknown:
called CluedIn. I'm based out of Copenhagen, Denmark. I have with me my wife, my little boy, Finn, and a little dog that looks like an Ewok, called Seymour.
[00:02:29] Unknown:
And do you remember how you first got involved in the area of data management?
[00:02:33] Unknown:
Yeah. So, I mean, I'm, I guess, a classically trained software engineer. I've been working in the software space for around 13, 14 years now. I've been predominantly working in the web space, but mostly for enterprise businesses. And around, I don't know, maybe 6 or 7 years ago, I was given a project which was in the space of what's called multivariate testing. It's the idea that if you've got a website, and maybe the home page of that website, and we make some changes or different variations, which variation works better for the amount of traffic that you're wanting to attract, or maybe the amount of purchases that a company makes on the website?
So that was my first foray into, okay, this involves me having to capture analytics data. It then took me down this rabbit hole of realizing, ah, got it. I have to not only get the analytics from the website, but I need to correlate this against back office systems, CRM systems, ERP systems, and PIM systems, and I kinda realized, oh god, this becomes quite tricky with the integration piece. And once I went down that rabbit hole, I realized, oh, for me to actually start doing something with this data, I need to clean it. I need to normalize it. And basically, I got to this point where I realized, well, data's kind of a hard thing to work with. It's not something you can pick up and just start getting value out of straight away. So that's kinda what led me, around 4 and a half, 5 years ago, to say, you know what? I'm gonna get into this data space. And ever since then, I've just enjoyed,
[00:04:26] Unknown:
immensely being able to help large enterprises in becoming a lot more data driven. And so to frame the discussion a bit, I'm wondering if you can just start by discussing some of the challenges and shortcomings that you have seen in the existing practices of ETL.
[00:04:43] Unknown:
Yeah. Sure. I mean, I guess I wanna start by not trying to be that grumpy old man that's yelling at old technologies. One thing I've learned in my career is that it's very rare that a particular technology or approach is right or wrong. It's just right for the right use case. And, I mean, you're also seeing a lot more patterns in integration emerge. Of course, we've got ETL, which has been around forever. You've got this ELT approach, which has been emerging over the last few years, and then you've kinda seen streaming platforms also take up the idea of joining streams of data instead of something that is done upfront.
And, you know, to be honest, I've always wondered with ETL: how on earth are people achieving this for an entire company? ETL for me has always been something that, if you've got 2, 3, 4 tools to integrate, is a fantastic kind of approach. Right? But now we're definitely dealing with a lot more data sources, and the demand for having free-flowing data available is becoming much more apparent. And it got to the point where I thought, am I the stupid one? If I have to use ETL to integrate data from multiple sources, as soon as we go over a certain number of data sources, the problem just becomes exponentially harder.
And I think the thing that I found interesting as well with this ETL approach is that, typically, once the data was processed through these classic designers, these workflow DAGs, directed acyclic graphs, the output of this process was typically, oh, I'm going to store this in a relational database. And therefore, you know, I could understand why ETL existed. I could understand that, yeah, if you know what you're going to do with your data after this ETL process, and classically it would go into something like a data warehouse, I can see why that existed.
And I think there are just different types of demands in the market today. There's much more need for flexibility and access to data, and not necessarily data that's been modeled as rigidly as you get in the kinda classical data warehouses. And I kind of thought, well, the relational database is not the only database available to us as engineers, and one of the ones that I've been focusing on for the last few years is the graph database. When you think about it, most problems that we're trying to solve in the modeling world today are actually a network. They are a graph. They're not necessarily a relational or a kind of flat document store. So I thought, you know, this seems more like the right store to be able to model the data. I think the second thing was that, just from being hands on, I found that this ETL process meant that when I was trying to solve problems and integrate data, upfront I had to know not only all the business rules that dictated how these systems integrate, but what dictated clean data. I mean, you're probably, Tobias, used to these ETL designers where I get these built-in steps to tokenize the text and things like that. And you think, yeah, but I need to know upfront what is considered a bad ID or a bad record. You're probably also used to seeing, we've got these IDs, and sometimes it's a beautiful looking ID, and sometimes it's negative 1 or N/A or a placeholder or a hyphen, and you think, I've got to define upfront, in the ETL world, what all those possibilities are before I run my ETL job. I just found this quite rigid in its approach. And I think the key game changer for me was that when I was using ETL and these classic designers to integrate more than 5 systems, I realized how long it took upfront: I needed to go around the different parts of the business and have them explain, okay, so how does the Salesforce lead table connect to the Marketo lead table? How does it do that? And then, time after time, after weeks of investigation, I would realize, oh, I have to jump to, I don't know, the Exchange server or the Active Directory to get the information that I need to join those 2 systems, and it just resulted in this spaghetti of point to point integrations. And I think that's one of the key things that ETL suffers from: it puts us in an architectural design thinking pattern of, oh, how am I going to map systems point to point? And I can tell you, after working in this industry for 5 years so far, that systems don't naturally
[00:10:02] Unknown:
blend together point to point. Yeah. Your point about needing to understand all the possible representations of a null value means that, in order for a pipeline to be sufficiently robust, you have to have a fair amount of quality testing built in: making sure that any values coming through the system map to the existing values that you're expecting, raising an alert when you see something outside of those bounds so that you can go fix it, and having some sort of dead letter queue or bad data queue for holding those records until you can reprocess them and back populate the data. So it definitely requires a lot of engineering effort to have something that is functional for all of the potential values. And there's also the aspect of schema evolution, and figuring out how to propagate that through your system and keep your logic flexible enough to handle different schema values for data flowing through at the transition boundary between the 2 schemas, so it's definitely a complicated issue. And so you recently released a white paper discussing some of your ideas on this concept of eventual connectivity. I'm wondering if you can describe your thoughts on that and touch on how it addresses some of the issues that you've seen with the more traditional ETL pattern. Yeah. Sure. I mean, I think 1 of the concepts
[00:11:41] Unknown:
behind this pattern we've named eventual connectivity is that there are a couple of fundamental things to understand. First of all, it's a pattern that essentially embraces the idea that we can throw data into a store, and as we continue to throw more data in, records will figure out for themselves how to be merged. It's the idea of being able to place records into this kind of central repository with little hooks, or flags, that are indicating, hey, I'm a record, and here are my unique references.
So, obviously, the idea is that, as we bring in more systems, those other records will say, hey, I actually have the same ID. Now that might not happen upfront. It might be after you've integrated systems 1, 2, 3, 4, 5, 6 that systems 2 and 4 are able to say, hey, I now have the missing pieces to be able to merge our records. So in an eventual connectivity world, what this really brings in advantages is that, first of all, if I'm trying to integrate systems, I only need to take 1 system at a time. I found it rare in the enterprise that I could find someone who understood the domain knowledge behind their Salesforce account and their Oracle account and their Marketo account; I would more often run into someone who completely understood the business domain behind just the Salesforce account. And the reason I'm using that as an example is because Salesforce is a system where you can do anything in it. You can add objects that are, you know, animals or dinosaurs, not just the ones that are out of the box. I don't know who's selling to dinosaurs, but, essentially, what this allows me to do is, when I walk into an integration job and that business says, hey, we have 3 systems, I say, got it.
And if they say, oh, sorry, that was actually 300 systems, I go, got it. It makes no difference to me. It's only a time based thing. The complexity doesn't grow because of this type of pattern that we're taking, and I'll explain the pattern. Essentially, you can conceptualize it as going through a system a record at a time, or an object at a time. Let's take something like leads or contacts. The pattern basically asks us to highlight what are unique references to that object. So in the case of a person, it might be something like a passport number. It might be, you know, a local personal identifier.
You know, in Denmark, we have what's called the CPR number, which is a unique reference to me. No one else in Denmark can have the same number. But then you get to things like emails, and what you discover pretty quickly in the enterprise data world is that email in no way is a unique identifier of an object. Right? We can have group emails that refer to multiple different people, and not all systems will specify whether this is a group email or an email referring to an individual. So the pattern asks us, or dictates to us, to mark those references as aliases: something that could allude to a unique identifier of an object.
And then we get to the referential pieces. So imagine that we have a contact that's associated with a company. You could probably imagine that there's a column in the contact table called company ID. And the key thing with the eventual connectivity pattern is that although I wanna highlight that as a unique reference to another object, I don't want to tell the integration pattern where that object exists. I don't want to tell it that it's in the Salesforce organization table, because, to be honest, if that's a unique reference, that unique reference might exist in other systems as well. And so what this means is that I can take an individual system at a time and not have to have the standard point to point type of relationship between data. And if I was to highlight the 3 main wins that you get out of this, I think the first is that it's quite powerful to walk into a large business and say, hey, how many systems do you have? Well, we have a thousand. And I think, good. When can we start?
Now if I was in the ETL approach, I'd be thinking, oh god, can we actually, honestly, do this? As you probably know yourself, Tobias, often we go into projects with big smiling faces, and then when you see the data, you realize, oh, this is gonna be a difficult project. So there's that advantage of being able to walk in and say, I don't care how many systems you have; it doesn't make a lot of complexity difference to me. I think the other piece is that the eventual connectivity pattern addresses the idea that you don't need to know upfront all the business rules of how systems connect to each other, and then what's considered bad data versus good data.
Rather, we let things happen and we have a much more reactive approach to be able to rectify them. And I think this is more representative of the world we live in today. Companies are wanting more real time data delivered to their consumers, or to the consumption technologies where we get the value, things like business intelligence, etcetera. And they're not willing to put up with these kinds of rigid approaches of, oh, the ETL process has broken down, I need to go back to our design, update that, run it through, and make sure that we guarantee the data is in perfect order before we actually do the merging.
And I think the final thing that has become obvious, time after time, where I've seen companies use this pattern, is that eventual connectivity will discover joins that would be really hard for you and me to just sit down and figure out. It comes back to this core idea that systems don't connect well point to point. There's not always a nice, ubiquitous ID with which we can just join 2 systems together. Often, we have to jump in between different data sources to be able to wrangle this into a unified set of data. Now at the same time, I can't deny that there's quite a lot of work going on in the field of ETL. You've got platforms like NiFi and Airflow, and you know what? Those are still very valid. They're very good at moving data around. They're fantastic at breaking down a workflow into these kind of discrete components that can, in some cases, play independently.
I think that the eventual connectivity pattern, time after time, has allowed us to blend systems together without this overhead of complexity. And, Tobias, there's not a big enough whiteboard in the world when it comes to integrating, you know, 50 systems. You just have to put yourself in that situation and realize, oh wow, the old ways of doing it are just not scaling. And as you're talking through this
[00:19:34] Unknown:
idea of eventual connectivity, I'm wondering how it ends up being materially different from a data lake, where you're able to do the more ELT-style pattern of shipping all of the data into one repository without having to worry about modeling it up front or understanding what all the mappings are, and then doing some exploratory analysis after the fact to create all of these connection points between the different data sources and do whatever cleaning happens after the fact.
[00:20:03] Unknown:
Yeah. I mean, one thing I've gathered in my career as well is that something like an overall data pipeline for a business is gonna be made up of so many different components. And in our world, in the eventual connectivity world, the lake still makes complete sense to have. I see the lake as a place to dump data where I can read it in a ubiquitous language. In most cases, that's SQL, but it's exposed. You know, I don't know a single person in our industry that doesn't know SQL to some extent, so it's fantastic to have that lake there. Where I see the problem often evolving is that the lake is obviously a place where we would typically store raw data. It's where we abstract away the complexity of, oh, if I need data from a SharePoint site, I have to learn the SharePoint API. No. The lake is there to basically say, that's already been done. I'm gonna give you SQL, and that's the way you're going to get this data. But what I find when I look at the companies that we work with is that there's a lot that needs to be done from the lake to where we can actually get the value. I think something like machine learning is a good example.
Time after time we hear, and it's true, that machine learning doesn't really work that well if you're not working with good quality, well integrated data that is complete, so free of nulls and empty columns and things like that. And in our industry we went through this period where we said, okay, the lake is gonna give the data science teams and the different teams direct access to the raw data. And what we found is that every time they tried to use that data, they went through these common practices of, now I need to blend it, now I need to catalog it, now I need to normalize it and clean it. And you can see that the eventual connectivity pattern is there to say, no, no, no, this is something that sits in between, something that matures the data from the lake to the point where it's already blended. And that's one of the biggest challenges I see there: if I get a couple of different files out of the lake and then go to investigate how they join together, I still have this experience of, oh, this doesn't easily blend together. So then I go on this exploratory, discovery phase of what other data sets I have to use to string these 2 systems together,
[00:22:44] Unknown:
and we would kinda just like to eliminate that. So to make this a bit more concrete for people who are listening and wondering how they can put this pattern into effect in their own systems, can you talk through an example system architecture and data flow for a use case that you have done, or at least experimented with, and how the overall architecture plays together to make this a viable approach and how those different connection points between the data systems end up manifesting? Yeah. Definitely. And so maybe it's good to use an example. Imagine you have
[00:23:23] Unknown:
3 or 4 data sources that you need to blend. You need to ingest them. You then usually need to merge the records into kind of a flat, unified dataset, and then you need to push this somewhere. It might be a data warehouse, something like BigQuery or Redshift, etcetera. And the fact is that, in today's world, that data also needs to be available for the data science team, and it needs to be available for things like exploratory business intelligence. So when you're building your integrations, I think architecturally, from a modeling perspective, the 3 things that you need to think about are what we call entity codes, aliases, and edges.
And those 3 pieces together are what we need to be able to map this properly into a graph store. So simply put, an entity code is kind of an absolute unique reference to an object. As I alluded to before, something like a passport number is a unique reference to an object, but by itself, just the passport number doesn't mean that it's unique across all the systems that you have at your workplace. The other is aliases. Aliases are more like an email, a phone number, a nickname. They're all alluding to some type of overlap between these records, but they're not something that we can honestly go ahead and 100% merge records based off. Having that, you then of course need to investigate things like inference engines to build up confidence: how confident can I be that a person's nickname is unique in the context of the data that we've plugged in, these 3 or 4 data sources that I'm talking about? And then finally, the edges: they're there to be able to build the referential integrity. But what I find architecturally is that when we're trying to solve data problems for companies, a majority of the time their model represents much more a network than the classic relational store or column database or document store.
And so when we look at the technology that's needed to support the system architecture, one of the key technologies at the heart of this is a graph database, and to be honest, it doesn't really matter which graph database you use. What we found important is that it needs to be a native graph store. There are triple stores out there, and there are multi-model databases like Cosmos DB and SAP HANA, but what we found is that you really need a native graph to be able to do this properly. So the way that you can conceptualize the pattern is that every record that we pull in from a system, or that you import, will go into this network or graph as a node, and every entity code for that record, i.e. the unique ID or multiple unique IDs of that record, will also go in as a node connected to that record.
Now, every alias will go in as a property on that original node, because we probably wanna do some processing later to figure out if we can turn that alias into one of these entity codes, these unique references. Now here's the interesting part. This is the part where the eventual connectivity pattern kicks in. Take the edges, i.e. if I was referencing a person to a company, that person works at a company. Those edges are placed into the graph, and a node is created but marked as hidden.
We call those shadow nodes. So you could imagine if we brought in a record on Barack Obama, and it had Barack's phone number. Now, that's not a unique reference. But what we would do is create a node in the graph that's referring to that phone number, link it to Obama, but mark the phone number node as hidden. As I said before, we call these shadow nodes. Essentially, you can see that as one of these hooks: if I ever get other records that come in later that also have an alias or an entity code that overlaps, that's where I need to start doing my merging work. And what we're hoping for, and this is what we see time after time as well, is that as we import system one's data, it'll start to come in and you'll see a lot of nodes that are shadow nodes, i.e., I have nothing to hook onto on the other side of this ID.
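To make the mechanics concrete, here is a minimal, illustrative sketch in Python of the clue, entity code, alias, edge, and shadow-node ideas described above. All names and structures are hypothetical; this is not CluedIn's implementation or API, and the composite (origin, type, value) entity code anticipates the uniqueness discussion later in the episode.

```python
# Hypothetical sketch of the "clue" / shadow-node mechanics described above.
from dataclasses import dataclass, field

# An entity code is only unique as the full (origin, type, value) triple,
# e.g. ("salesforce-account-456", "organization", "123"); origin names the
# system that issued the ID.
EntityCode = tuple[str, str, str]

@dataclass
class Clue:
    """One record pulled from a single source system."""
    entity_codes: set[EntityCode]                              # absolute unique references
    aliases: dict[str, str] = field(default_factory=dict)     # email, phone, nickname, ...
    edges: set[EntityCode] = field(default_factory=set)       # references to other objects
    properties: dict[str, str] = field(default_factory=dict)  # everything else

class GraphStore:
    """Toy in-memory graph: nodes keyed by integer id, plus an entity-code index."""

    def __init__(self) -> None:
        self.nodes: dict[int, dict] = {}
        self.code_index: dict[EntityCode, int] = {}
        self._next_id = 0

    def ingest(self, clue: Clue) -> int:
        # Any existing node (real or shadow) sharing one of our entity codes is a match.
        matches = {self.code_index[c] for c in clue.entity_codes if c in self.code_index}
        node_id = matches.pop() if matches else self._new_node(shadow=True)
        node = self.nodes[node_id]
        node["shadow"] = False                 # a real record now backs this node
        node["codes"] |= clue.entity_codes
        node["aliases"].update(clue.aliases)   # aliases stay as properties for later inference
        node["properties"].update(clue.properties)
        node["edges"] |= clue.edges            # edges point at entity codes, not at tables
        for other in matches:                  # further matches collapse into this node:
            self._merge(node_id, other)        # clues become facts
        # Every edge target never seen before gets a hidden shadow node, a hook
        # for records that may arrive from other systems later.
        for code in clue.edges:
            if code not in self.code_index:
                shadow_id = self._new_node(shadow=True)
                self.nodes[shadow_id]["codes"].add(code)
                self.code_index[code] = shadow_id
        for code in clue.entity_codes:
            self.code_index[code] = node_id
        return node_id

    def _new_node(self, shadow: bool) -> int:
        self._next_id += 1
        self.nodes[self._next_id] = {"shadow": shadow, "codes": set(), "aliases": {},
                                     "properties": {}, "edges": set()}
        return self._next_id

    def _merge(self, keep: int, drop: int) -> None:
        dropped = self.nodes.pop(drop)
        kept = self.nodes[keep]
        kept["codes"] |= dropped["codes"]
        kept["aliases"].update(dropped["aliases"])
        kept["properties"].update(dropped["properties"])
        kept["edges"] |= dropped["edges"]
        for code in dropped["codes"]:
            self.code_index[code] = keep
```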
And the analogy that we use for this shadow node is that records come in as, by default, a clue. A clue is in no way factual; we don't yet have any other records correlating to these same values, and our goal in this eventual connectivity pattern is to turn clues into facts. And what makes a fact is records that have the same entity codes existing across different systems. So the architectural key to this is that a graph store needs to be there to model our data, and here's one of the key reasons. If the landing zone of this integrated data were a relational database, I would need to have an upfront schema.
I would need to specify how these objects connect to each other. What I've always found in the past is that when I need to do this, it becomes quite rigid. Now, I'm a strong believer that every database needs a schema at some point or you can't scale these things. But what's nice about the graph is that one of the things it got really right was flexible data modeling. There is no necessarily more important object within the graph structure; they're all equal in their complexity, but also in their importance. And
[00:30:01] Unknown:
you can really pick and choose the graph database that you want, but it's one of the keys to this architectural path. So one of the things that you talked about in there is the fact that there's this flexible data model. And so I'm wondering what type of upfront modeling is necessary in order to make this approach viable. I know you talked about the idea of the entity codes and the aliases, but for somebody who is taking a source data system and trying to load it into this graph database in order to take advantage of this eventual connectivity pattern, what is the actual process of loading that information in and assigning the appropriate attributes to the different records and to the different attributes in the record? And then, also, I'm wondering if there are any limitations in terms of what the source format looks like, as far as the serialization format or the types of data that this approach is viable for.
[00:31:03] Unknown:
Sure. Good question. So, I mean, I think the first thing is to recognize that with the eventual connectivity pattern and modeling it in the graph, the key is that there will always be extra modeling that you do after this step. The reason why is that, if you think about the data structures that we have as engineers, the network or the graph is the highest fidelity data structure we have. It's a higher, more detailed structure than a tree. It's more structured than a hierarchy or a relational store, and we definitely have more structure or fidelity than something like a document. With this in mind, we use eventual connectivity to solve this piece of integrating data from different systems and modeling it, but we know we will always do better, purpose-fit modeling later. So it's worth highlighting that the value of the eventual connectivity pattern is that it makes the integration of data easier, but this will definitely not be the last modeling that you do. And that's what allows flexible modeling, because you always know, hey, if I'm trying to build a data warehouse based off the data that we've modeled in the graph, I'm always going to run extra processes after it to model it into, probably, the relational store for a data warehouse, or a column store.
You're gonna model it purpose fit to solve that problem. However, if what you're trying to achieve with your data is flexible access, to be able to feed it off to other systems, you want the highest fidelity and you want the highest flexibility in modeling. But the key is that if you were to drive your data warehouse directly off this graph, it would do a terrible job. That's not what the graph was purpose built for. The graph was always good at flexible data modeling. It's always good at being able to join records very fast, and I mean just as fast as doing an index lookup. That's how these native graph stores have been designed. And so it's important to highlight that the upfront modeling, really, is not a lot of upfront modeling.
Of course, we shouldn't do silly things, but I'll give you an example. If I was modeling a skill, a person, and a company, it's completely fine to have a graph where the skill points to the person and the person points to the organization, and it's also okay to have the person point to the skill and the skill point to the organization. That's not as important. What's important at this stage is that the eventual connectivity pattern allows us to integrate data more easily. Now, when I get to the point where I want to do something with that data, I might find that, yes, I actually do need an organization table which has a foreign key to person, and then person has a foreign key to skill. And that's because that's typically what a data warehouse is built to do: to model the data perfectly, so if I have a billion rows of data, the report still runs fast, but we lose that flexibility in the data modeling. Now, as for formats and things like that, what I found is that, to some degree, with the formatting and the source data, you could probably imagine the data is not created equally. Right? Many different systems will allow you to do absolutely anything you want, whereas the kind of ETL approach dictates that you capture these exceptions upfront: if I've got certain looking data coming in, how does it connect to the other systems? What eventual connectivity does is it just captures them later in the process. And my thought on this is that, to be honest, you will never know all these business rules upfront, and therefore let's embrace an integration pattern that says, hey, if the schema in the source or the format of the data changes, and you kind of alluded to this before as well, Tobias, then okay, got it. I want to be alerted that there is an issue with deserializing this data. I want to start queuing up the data in a message queue or maybe a stream, and I want to be able to fix that schema and have the platform say, got it, now that that's fixed, I'll continue on with deserializing the things that will now deserialize, and these kinds of things will happen all the time. I think I've referred to it before, and heard other people refer to it, as schema drift, and this will always happen in source and in target.
So what I found success with is embracing patterns where failure is going to happen all the time. When we look at the ETL approach, it's much more, when things go wrong, everything stops. Right? The different workflow stages that we've put into our kind of classical ETL designers all go red, red, red, red: I have no idea what to do, and I'm just going to fall over. What we would rather have is a pattern that says, got it, the schema has changed, I'm gonna log what you need to do until the point where you've changed that schema, and when you put that in place, I'll test it and say, yep, that schema, I can deserialize that now, I'll continue on.
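A rough sketch of that "park it and keep going" behavior: records that fail to deserialize go to a dead letter queue instead of stopping the pipeline, and are replayed once the schema mapping is fixed. The names here are hypothetical and not tied to any particular product.

```python
# Hypothetical sketch: failed records are parked, the pipeline keeps running,
# and the parked records are replayed after the schema mapping is updated.
from collections import deque

def make_parser(schema: dict[str, type]):
    """Return a parser that coerces a raw dict to the given column -> type schema."""
    def parse(raw: dict) -> dict:
        # KeyError / ValueError here is what "schema drift" looks like at runtime.
        return {col: typ(raw[col]) for col, typ in schema.items()}
    return parse

def run_pipeline(records, parse, dead_letters: deque, load):
    for raw in records:
        try:
            load(parse(raw))                           # happy path: deserialize and load
        except (KeyError, ValueError, TypeError) as err:
            dead_letters.append((raw, str(err)))       # park it and keep going

def replay(dead_letters: deque, parse, load):
    """Once the schema mapping has been fixed, try the parked records again."""
    still_failing = deque()
    while dead_letters:
        raw, _ = dead_letters.popleft()
        try:
            load(parse(raw))
        except (KeyError, ValueError, TypeError) as err:
            still_failing.append((raw, str(err)))
    dead_letters.extend(still_failing)
```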
And what I find is that if you don't embrace this, you spend most of your time just reprocessing
[00:37:12] Unknown:
data back through ETL. And so it seems that there actually is still a place for workflow engines, or some measure of ETL, where you're extracting the data from the source systems, but rather than loading it into your data lake or your data warehouse, you're adding it to the graph store to be able to pull these mappings out, and then also potentially going from the graph database where you have joined the data together and pulling it out from that using some sort of
[00:37:44] Unknown:
query to have the mapped data extracted and then load that into your eventual target. I mean, what you've just described there is a workflow, and therefore these workflow systems still make sense. It's very logical to look at these workflows and say, oh, that happens, then that happens, then that happens. They completely still make sense. And in some cases, I actually still use these ETL tools for very specific jobs. But what you can see is, if we were to use these kind of classical workflow systems, the eventual connectivity pattern, as you described, is just 1 step in that overall pattern.
But I think what I've found over time is that we use these workflow systems to be able to join data, and I would actually rather throw that to an individual step called eventual connectivity, where it does the joining and things like that for me, and then continue on from there. Very similar to the kind of example you gave, and that I've also been mentioning here as well, there will always be things you do after the graph, and that is something you could easily push into 1 of these workflow designers. Now, as for an example of the times when our company still uses these tools out at our customers, I think 1 of the ones that makes complete sense is IoT data.
And it's mainly because, at least in the cases that we've seen, there's typically not as much hassle in blending and cleaning that data. We see that more with operational data, things like transactions, customer data, and customer records. That's typically quite hard to blend. But when it comes to IoT data, if there's something wrong with the data such that it can't blend, it's often that, well, maybe it's a bad reader that we're reading off, instead of something that is actually dirty data. Now, of course, every now and then, if you've worked in that space, you'll realize that readers can lose the network and you can have holes in the data. But eventual connectivity would not solve that either. Right? Typically, in those cases, you'll do things like impute the values from historical and future data to fill in the gaps, and it's always a little bit of a guess; that's why we're imputing it. But to be honest, if it was my task to build a unified dataset from across different sources, I would just choose this eventual connectivity pattern every single time. If it was just a workflow of data processing where I know that data blends easily and there's not a data quality issue, fine. But where there is, and you need to jump across multiple different systems to merge data, I just time after time have found that these workflow systems reach their limit, where it just becomes too complex.
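For reference, the gap filling Tim mentions for sensor readings is often simple interpolation from the surrounding values; a minimal pandas sketch, assuming a timestamp-indexed series of readings with holes in it:

```python
# Minimal sketch of imputing gaps in sensor readings from surrounding values.
import pandas as pd

readings = pd.Series(
    [21.0, None, None, 22.5, 23.0],
    index=pd.date_range("2019-06-01 00:00", periods=5, freq="min"),
)
# Time-based linear interpolation guesses the missing points from the
# readings before and after the gap.
filled = readings.interpolate(method="time")
```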
[00:40:54] Unknown:
And for certain scales or varieties of data, I imagine that there are certain edge cases that come up when trying to load everything into the graph store. And so I'm wondering what you have run up against as far as limitations to this pattern or at least alterations to the pattern to be able to handle some of these larger volume tasks? I think I'll start with this. The graph is notoriously
[00:41:19] Unknown:
hard to scale, and with most of the graph databases that I've had experience with, you're essentially bound to 1 big graph. There's no real notion of clustering these data stores into sub graphs that you could query across. So scaling that is actually quite hard to start with. But the limitations of the pattern itself, there are many. I mean, it starts with the fact that you need to be careful. I'll give you a good example. I've seen many companies that use this pattern flag something like an email as unique, and then we realize later, no, no, no, it's not, and we have merged records that are not duplicates. And this means, of course, that you need support in the platform that you're utilizing for the ability to split these records, fix them, and reprocess them at a later point. But, I mean, these are also things that would be very hard to pick up in ETL or ELT types of patterns. I think 1 of the other downsides of this approach is that upfront, you don't know how many records will join. It's kind of like the name alludes to.
Eventually, you'll get joins or connectivity, and you can think of it as this pattern deciding how many records it will join for you, based off these entity codes or unique references, or the power of your inference engine when it comes to things that are a little bit fuzzy as an ID for someone, things like phone numbers. The great thing about this is it also means that you don't need to choose what type of join you're doing. In the relational world, you've got plenty of different types of joins: inner joins, outer joins, left and right outer joins, things like this. In a graph, there's 1 join. Right? And so with this pattern, it's not like you can pick the wrong join to go with; there's only 1 type. So it really becomes useful when it's, no, no, no, I'm just trying to merge these records, I don't need to hand hold how these joins will happen. I think 1 of the other downsides that I've had experience with is that, let's just say you have systems 1 and 2, what you'll often find is that when you integrate these 2 systems, you have a lot of these shadow nodes in the graph, or sometimes we call them floating edges: hey, I've got a reference to a company with an ID of 123, but I've never found the record on the other side with the same ID. So, in fact, I'm storing lots of extra information that I'm not actually utilizing. But the advantage is in saying, yeah, but you will integrate system 4, you will integrate system 5, where that data sits.
But the value is that you don't need to tell the systems how they join. You just need to flag these unique references. And I think the final limitation that I've found with these patterns is that you learn pretty quickly, as I alluded to before, that there are many records in your data sources where you think a column is unique, but it's not. It might be unique in your system, i.e. in Salesforce the ID is unique, but if you actually look across the other parts of the stack, you realize, no, no, no, there is another company in another system with the ID 123, and they have nothing to do with each other.
And so these entity codes that I'm talking about, they're made up of multiple different parts. They're made up of the ID, 123. They're made up of a type, something like organization. And they're made up of a source of origin, you know, Salesforce account 456. What this does is guarantee uniqueness if you added in 2 Salesforce accounts, or if you added in systems that have the same ID but it came from a different source. And as I said before, a good example would be the email. Even at our company, we use GitHub Enterprise to store our source code, and we have notifications that our engineers get when there's a pull request and things like this. And GitHub actually identifies each employee as notifications@github.com.
That's what that record sends us as its unique reference, and, of course, if I mark this as a unique reference, all of those employee records using this pattern would merge. However, what I like about this approach is that at least I'm given the tools to rectify the bad data when I see it. And to be honest, if companies are wanting to become much more data driven, as we aim to help our customers do, then I just believe we have to learn to accept that there's more risk that could happen, and ask: is that risk worth more, by having data readily available at the forefront, than the old approaches that we're taking? And for anybody who wants
[00:46:49] Unknown:
to dig deeper into this idea or learn more about your thoughts on that, or some of the adjacent technologies, what are some of the
[00:46:58] Unknown:
resources that you recommend they look to? Yeah. So, I mean, and Tobias, you and I have talked about this before, I think the first way to learn more about it is to get in contact and challenge us on this idea. When you're an engineer and you see a technology and go out and start using it, you have this tendency to gain a bias around it: time after time you see it working, and then you think, why is not everybody else doing this? And actually, the answer is quite clear. It's because things like graph databases were not as ubiquitous as they are right now. You know, you can get off the shelf free graph databases to use, and even 10 years ago this was just not the case. You would have to build these things yourself. So I think the first thing is, you can get in touch with me at tiw@cluedin.com if you're just interested in challenging this design pattern and really getting to the crux of: is this something that we can replace ETL with completely?
I think the other thing is the white paper that you alluded to; that's available from our website, so you can always jump over to cluedin.com to read it. It's completely open and free for everyone to read. And then we also have a couple of YouTube videos, if you just search for CluedIn I'm sure you'll find them, where we talk in depth about utilizing the graph to be able to merge different datasets. But, I mean, I always like to talk to other engineers and have them challenge me on this, so feel free to get in touch. And I guess if you're wanting to learn much more, we also have our developer training that we give here at CluedIn, in which we compare this pattern to other patterns that are out there, and you can get hands on experience with taking different data sources, trying the multiple different approaches that are out there as integration patterns, and really just seeing the one that works for you. Is there anything else about the ideas of eventual connectivity
[00:49:08] Unknown:
or ETL patterns that you have seen or the overall space of data integration that we didn't discuss yet that you'd like to cover before we close out the show? I think for
[00:49:18] Unknown:
me, I always like when I have more engineering patterns and tools on my tool belt. So I think the thing for listeners to take from this is to use this as an extra little piece on your tool belt. If you find that you walk into a company that you're helping and they say, hey, listen, we're really wanting to start to do things with our data, and we've got 300 systems, and to be honest, I've been given the direction to pull and wrangle this into something we can use; really think about this eventual connectivity pattern.
Really investigate it as a possible option. You'll be able to see it in the white paper, but to implement the pattern yourself is really not complex. Like I said before, one of the keys is just to embrace maybe a new database family to be able to model your data. And, yes, get in touch if you need any more information. And one follow on from that, I think, is the idea of migrating from an existing ETL workflow
[00:50:30] Unknown:
and into this eventual connectivity space. And it seems that the logical step would be to just replace your current target system with the graph database, add in the mapping for the entity IDs and the aliases, and then you're at least partway on your way to being able to take advantage of this, and then just add a new ETL workflow at the other end to pull the connected data out into what your original target systems were. Yeah. Exactly. I mean, it's
[00:51:04] Unknown:
it's quite often we walk into a business and they've already got many years of business logic baked into these ETL pipelines. And my idea on that is not to just throw these away. There's a lot of good stuff there. It's to really just complement it with this extra design pattern that's probably a little bit better at the whole merging
[00:51:28] Unknown:
and deduplication of data. Alright. Well, for anybody who wants to get in touch with you, I'll add your email and whatever other contact information to the show notes, and I've also got a link to the white paper that you mentioned. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:51:49] Unknown:
Well, it may be a little bit off topic, but I would actually say that I'm amazed how many companies I walk into that don't know the quality of the data they are working with. So I think one of the big gaps that needs to be fixed in the data management market is for businesses, as they integrate data from different sources, to be explicitly told via different metrics, the classic ones we're used to would be accuracy and completeness and things like this, what they are dealing with. Just that simple fact of knowing, hey, we're dealing with 34% accurate data, and guess what? That's what we're pushing to the data warehouse to build the reports that our management is making key decisions off of. So I think, first of all, the gap is in knowing what quality of data you're dealing with. And I think the second piece is in facilitating the process around how you increase that. A lot of these things can often be fixed by normalizing values. You know, if I've got 2 different names for a company, but they are the same record, which one do you choose? Do we normalize to the value that's upper case or lower case or title case, or the one that has, you know, "incorporated" at the end? And I think
[00:53:16] Unknown:
that part of the industry just needs to get better. Alright. Well, thank you very much for taking the time today to join me and discuss your thoughts on eventual connectivity and some of the ways that it can augment or replace some of the ETL patterns that we have been working with to date. I appreciate your thoughts on that, and I hope you enjoy the rest of your day. Thanks, Tobias.
Introduction to Eventual Connectivity
Tim Ward's Background and Journey
Challenges with Traditional ETL
Concept of Eventual Connectivity
Example System Architecture
Workflow Integration and ETL Tools
Limitations and Challenges
Resources and Further Learning
Biggest Gaps in Data Management Tooling