Summary
The ETL pattern that has become commonplace for integrating data from multiple sources has proven useful, but complex to maintain. For a small number of sources it is a tractable problem, but as the overall complexity of the data ecosystem continues to expand it may be time to identify new ways to tame the deluge of information. In this episode Tim Ward, CEO of CluedIn, explains the idea of eventual connectivity as a new paradigm for data integration. Rather than manually defining all of the mappings ahead of time, we can rely on the power of graph databases and some strategic metadata to allow connections to occur as the data becomes available. If you are struggling to maintain a tangle of data pipelines then you might find some new ideas for reducing your workload.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- To connect with the startups that are shaping the future and take advantage of the opportunities that they provide, check out AngelList where you can invest in innovative businesses, find a job, or post a position of your own. Sign up today at dataengineeringpodcast.com/angel and help support this show.
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Upcoming events include the O’Reilly AI Conference, the Strata Data Conference, and the combined events of the Data Architecture Summit and Graphorum. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
- Your host is Tobias Macey and today I’m interviewing Tim Ward about his thoughts on eventual connectivity as a new pattern to replace traditional ETL
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by discussing the challenges and shortcomings that you perceive in the existing practices of ETL?
- What is eventual connectivity and how does it address the problems with ETL in the current data landscape?
- In your white paper you mention the benefits of graph technology and how it solves the problem of data integration. Can you talk through an example use case?
- How do different implementations of graph databases impact their viability for this use case?
- Can you talk through the overall system architecture and data flow for an example implementation of eventual connectivity?
- How much up-front modeling is necessary to make this a viable approach to data integration?
- How do the volume and format of the source data impact the technology and architecture decisions that you would make?
- What are the limitations or edge cases that you have found when using this pattern?
- In modern ETL architectures there has been a lot of time and work put into workflow management systems for orchestrating data flows. Is there still a place for those tools when using the eventual connectivity pattern?
- What resources do you recommend for someone who wants to learn more about this approach and start using it in their organization?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Eventual Connectivity White Paper
- CluedIn
- Copenhagen
- Ewok
- Multivariate Testing
- CRM
- ERP
- ETL
- ELT
- DAG
- Graph Database
- Apache NiFi
- Apache Airflow
- BigQuery
- Redshift
- CosmosDB
- SAP HANA
- IOT == Internet of Things
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform. If you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode, that's linode, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show.
And to grow your professional network and find opportunities with the startups that are changing the world, AngelList is the place to go. Go to dataengineeringpodcast.com/angel to sign up today. And you listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Dataversity, and the Open Data Science Conference with upcoming events, including the O'Reilly AI Conference, the Strata Data Conference, and the combined events of the Data Architecture Summit and Graphorum.
Go to dataengineeringpodcast.com/conferences to learn more and to take advantage of our partner discounts when you register. And go to the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers. Your host is Tobias Macey. And today, I'm interviewing Tim Ward about his thoughts on eventual connectivity as a new pattern to replace traditional ETL. And just as a full disclosure, Tim is the CEO of CluedIn, which is a sponsor of the podcast. So, Tim, can you just start by introducing yourself? Yeah. Sure. My name is Tim Ward. As Tobias has said, I'm the CEO of a data platform
[00:02:16] Unknown:
called CluedIn. I'm based out of Copenhagen, Denmark. I have with me my wife, my little boy, Finn, and a little dog that looks like an Ewok, called Seymour.
[00:02:29] Unknown:
And do you remember how you first got involved in the area of data management?
[00:02:33] Unknown:
Yeah. So, I mean, I'm, I guess, a classically trained software engineer. I've been working in the software space for around 13, 14 years now. I've been predominantly working in the web space, but mostly for enterprise businesses. And around, I don't know, maybe 6 or 7 years ago, I was given a project which was in the space of what's called multivariate testing. It's the idea that if you've got a website, and maybe the home page of that website, and we make some changes or different variations, which variation works better for the amount of traffic that you're wanting to attract, or maybe the amount of purchases that a company makes on the website?
So that was my first foray into, okay, this involves me having to capture analytics data. It then took me down this rabbit hole of realizing, ah, got it. I have to not only get the analytics from the website, but I need to correlate this against back office systems, CRM systems, ERP systems, and PIM systems, and I kinda realized, oh god, this becomes quite tricky with the integration piece. And once I went down that rabbit hole, I realized, oh, for me to actually start doing something with this data, I need to clean it. I need to normalize it. And basically, I got to this point where I realized, well, data's kind of a hard thing to work with. It's not something you can pick up and just start getting value out of straight away. So that's kinda what led me, around 4 and a half, 5 years ago, to say, you know what? I'm gonna get into this data space. And ever since then, I've just enjoyed,
[00:04:26] Unknown:
immensely being able to help large enterprises in becoming a lot more data driven. And so to frame the discussion a bit, I'm wondering if you can just start by discussing some of the challenges and shortcomings that you have seen in the existing practices of ETL.
[00:04:43] Unknown:
Yeah. Sure. I mean, I guess I wanna start by not trying to be that grumpy old man that's yelling at old technologies. One thing I've learned in my career is that it's very rare that a particular technology or approach is right or wrong. It's just right for the right use case. And, I mean, you're also seeing a lot more patterns in integration emerge. Of course, we've got ETL, which has been around forever. You've got this ELT approach, which has been emerging over the last few years, and then you've kinda seen streaming platforms also take up the idea of joining streams of data instead of something that is done upfront.
And, you know, to be honest, I've always wondered with ETL: how on earth are people achieving this for an entire company? ETL for me has always been something that, if you've got 2, 3, 4 tools to integrate, is a fantastic kind of approach. Right? But now we're definitely dealing with a lot more data sources, and the demand for having free-flowing data available is becoming much more apparent. And it got to the point where I thought, am I the stupid one? If I have to use ETL to integrate data from multiple sources, as soon as we go over a certain number of data sources, the problem just becomes exponentially harder.
And I think the thing that I found interesting as well with this ETL approach is that, typically, once the data was processed through these classic designers, these workflow DAGs, directed acyclic graphs, the output of this process was typically, oh, I'm going to store this in a relational database. And therefore, you know, I could understand why ETL existed. I could understand that, yeah, if you know what you're going to do with your data after this ETL process, and classically it would go into something like a data warehouse, I can see why that existed.
And I think there are just different types of demands in the market today. There's much more need for flexibility and access to data, and not necessarily data that's been modeled as rigidly as you get in the kinda classical data warehouses. And I kind of thought, well, the relational database is not the only database available to us as engineers, and one of the ones that I've been focusing on for the last few years is the graph database. When you think about it, most problems that we're trying to solve in the modeling world today are actually a network. They are a graph. They're not necessarily a relational or a kind of flat document store. So I thought, you know, this seems more like the right store to be able to model the data. I think the second thing was that, just from being hands on, I found that this ETL process meant that when I was trying to solve problems and integrate data, upfront I had to know not only all the business rules that dictated how these systems integrate, but what dictated clean data. I mean, you're probably, Tobias, used to these ETL designers where I get these built-in steps to tokenize the text and things like that. And you think, yeah, but I need to know upfront what is considered a bad ID or a bad record. You're probably also used to seeing, we've got these IDs, and sometimes it's a beautiful looking ID, and sometimes it's negative 1 or N/A or a placeholder or a hyphen, and you think, I've got to define upfront, in the ETL world, what all those possibilities are before I run my ETL job. I just found this quite rigid in its approach. And I think the key game changer for me was that when I was using ETL and these classic designers to integrate more than 5 systems, I realized how long it took upfront: I needed to go around the different parts of the business and have them explain, okay, so how does the Salesforce lead table connect to the Marketo lead table? How does it do that? And then, time after time, after weeks of investigation, I would realize, oh, I have to jump to, I don't know, the Exchange server or the Active Directory to get the information that I need to join those 2 systems, and it just resulted in this spaghetti of point to point integrations. And I think that's one of the key things that ETL suffers from: it puts us in an architectural design thinking pattern of, oh, how am I going to map systems point to point? And I can tell you, after working in this industry for 5 years so far, that systems don't naturally
[00:10:02] Unknown:
blend together point to point. Yeah. Your point about needing to understand all the possible representations of a null value means that, in order for a pipeline to be sufficiently robust, you have to have a fair amount of quality testing built in: making sure that any values coming through the system map to the existing values that you're expecting, raising an alert when you see something outside of those bounds so that you can go fix it, and having some sort of dead letter queue or bad data queue for holding those records until you can reprocess them and back populate the data. So it definitely requires a lot of engineering effort to have something that is functional for all of the potential values. And there's also the aspect of schema evolution, and figuring out how to propagate that through your system and keep your logic flexible enough to handle different schema values for data flowing through at the transition boundary between the 2 schemas, so it's definitely a complicated issue. And so you recently released a white paper discussing some of your ideas on this concept of eventual connectivity. I'm wondering if you can describe your thoughts on that and touch on how it addresses some of the issues that you've seen with the more traditional ETL pattern. Yeah. Sure. I mean, I think 1 of the concepts
[00:11:41] Unknown:
behind this pattern we've named eventual connectivity is that there are a couple of fundamental things to understand. First of all, it's a pattern that essentially embraces the idea that we can throw data into a store, and as we continue to throw more data in, records will figure out for themselves how to be merged. It's the idea of being able to place records into this kind of central repository with little hooks, or flags, that are indicating, hey, I'm a record, and here are my unique references.
So, obviously, the idea is that, as we bring in more systems, those other records will say, hey, I actually have the same ID. Now that might not happen upfront. It might be after you've integrated systems 1, 2, 3, 4, 5, 6 that systems 2 and 4 are able to say, hey, I now have the missing pieces to be able to merge our records. So in an eventual connectivity world, what this really brings in advantages is that, first of all, if I'm trying to integrate systems, I only need to take 1 system at a time. I found it rare in the enterprise that I could find someone who understood the domain knowledge behind their Salesforce account and their Oracle account and their Marketo account; I would more often run into someone who completely understood the business domain behind just the Salesforce account. And the reason I'm using that as an example is because Salesforce is a system where you can do anything in it. You can add objects that are, you know, animals or dinosaurs, not just the ones that are out of the box. I don't know who's selling to dinosaurs, but, essentially, what this allows me to do is, when I walk into an integration job and that business says, hey, we have 3 systems, I say, got it.
And if they say, oh, sorry, that was actually 300 systems, I go, got it. It makes no difference to me. It's only a time based thing. The complexity doesn't grow because of this type of pattern that we're taking, and I'll explain the pattern. Essentially, you can conceptualize it as going through a system a record at a time, or an object at a time. Let's take something like leads or contacts. The pattern basically asks us to highlight what are unique references to that object. So in the case of a person, it might be something like a passport number. It might be, you know, a local personal identifier.
You know, in Denmark, we have what's called the CPR number, which is a unique reference to me. No one else in Denmark can have the same number. But then you get to things like emails, and what you discover pretty quickly in the enterprise data world is that email in no way is a unique identifier of an object. Right? We can have group emails that refer to multiple different people, and not all systems will specify whether this is a group email or an email referring to an individual. So the pattern asks us, or dictates to us, to mark those references as aliases: something that could allude to a unique identifier of an object.
And then we get to the referential pieces. So imagine that we have a contact that's associated with a company. You could probably imagine that there's a column in the contact table called company ID. And the key thing with the eventual connectivity pattern is that although I wanna highlight that as a unique reference to another object, I don't want to tell the integration pattern where that object exists. I don't want to tell it that it's in the Salesforce organization table, because, to be honest, if that's a unique reference, that unique reference might exist in other systems as well. And so what this means is that I can take an individual system at a time and not have to have the standard point to point type of relationship between data. And if I was to highlight the 3 main wins that you get out of this, I think the first is that it's quite powerful to walk into a large business and say, hey, how many systems do you have? Well, we have a thousand. And I think, good. When can we start?
Now if I was in the ETL approach, I'd be thinking, oh god, can we actually, honestly, do this? As you probably know yourself, Tobias, often we go into projects with big smiling faces, and then when you see the data, you realize, oh, this is gonna be a difficult project. So there's that advantage of being able to walk in and say, I don't care how many systems you have; it doesn't make a lot of complexity difference to me. I think the other piece is that the eventual connectivity pattern addresses the idea that you don't need to know upfront all the business rules of how systems connect to each other, and then what's considered bad data versus good data.
Rather, we let things happen and we have a much more reactive approach to be able to rectify them. And I think this is more representative of the world we live in today. Companies are wanting more real time data delivered to their consumers, or to the consumption technologies where we get the value, things like business intelligence, etcetera. And they're not willing to put up with these kinds of rigid approaches of, oh, the ETL process has broken down, I need to go back to our design, update that, run it through, and make sure that we guarantee the data is in perfect order before we actually do the merging.
And I think the final thing that has become obvious, time after time, where I've seen companies use this pattern, is that eventual connectivity will discover joins that would be really hard for you and me to just sit down and figure out. It comes back to this core idea that systems don't connect well point to point. There's not always a nice, ubiquitous ID with which we can just join 2 systems together. Often, we have to jump in between different data sources to be able to wrangle this into a unified set of data. Now at the same time, I can't deny that there's quite a lot of work going on in the field of ETL. You've got platforms like NiFi and Airflow, and you know what? Those are still very valid. They're very good at moving data around. They're fantastic at breaking down a workflow into these kind of discrete components that can, in some cases, play independently.
I think that the eventual connectivity pattern, time after time, has allowed us to blend systems together without this overhead of complexity. And, Tobias, there's not a big enough whiteboard in the world when it comes to integrating, you know, 50 systems. You just have to put yourself in that situation and realize, oh wow, the old ways of doing it are just not scaling. And as you're talking through this
[00:19:34] Unknown:
idea of eventual connectivity, I'm wondering how it ends up being materially different from a data lake, where you're able to do the more ELT-style pattern of shipping all of the data into one repository without having to worry about modeling it up front or understanding what all the mappings are, and then doing some exploratory analysis after the fact to create all of these connection points between the different data sources and do whatever cleaning happens after the fact.
[00:20:03] Unknown:
Yeah. I mean, one thing I've gathered in my career as well is that something like an overall data pipeline for a business is gonna be made up of so many different components. And in our world, in the eventual connectivity world, the lake still makes complete sense to have. I see the lake as a place to dump data where I can read it in a ubiquitous language. In most cases, that's SQL, but it's exposed. You know, I don't know a single person in our industry that doesn't know SQL to some extent, so it's fantastic to have that lake there. Where I see the problem often evolving is that the lake is obviously a place where we would typically store raw data. It's where we abstract away the complexity of, oh, if I need data from a SharePoint site, I have to learn the SharePoint API. No. The lake is there to basically say, that's already been done. I'm gonna give you SQL, and that's the way you're going to get this data. But what I find when I look at the companies that we work with is that there's a lot that needs to be done from the lake to where we can actually get the value. I think something like machine learning is a good example.
Time after time we hear, and it's true, that machine learning doesn't really work that well if you're not working with good quality, well integrated data that is complete, so free of nulls and empty columns and things like that. And in our industry we went through this period where we said, okay, the lake is gonna give the data science teams and the different teams direct access to the raw data. And what we found is that every time they tried to use that data, they went through these common practices of, now I need to blend it, now I need to catalog it, now I need to normalize it and clean it. And you can see that the eventual connectivity pattern is there to say, no, no, no, this is something that sits in between, something that matures the data from the lake to the point where it's already blended. And that's one of the biggest challenges I see there: if I get a couple of different files out of the lake and then go to investigate how they join together, I still have this experience of, oh, this doesn't easily blend together. So then I go on this exploratory, discovery phase of what other data sets I have to use to string these 2 systems together,
[00:22:44] Unknown:
and we would kinda just like to eliminate that. So to make this a bit more concrete for people who are listening and wondering how they can put this pattern into effect in their own systems, can you talk through an example system architecture and data flow for a use case that you have done, or at least experimented with, and how the overall architecture plays together to make this a viable approach and how those different connection points between the data systems end up manifesting? Yeah. Definitely. And so maybe it's good to use an example. Imagine you have
[00:23:23] Unknown:
3 or 4 data sources that you need to blend. You need to ingest them. You then usually need to merge the records into kind of a flat, unified dataset, and then you need to push this somewhere. It might be a data warehouse, something like BigQuery or Redshift, etcetera. And the fact is that, in today's world, that data also needs to be available for the data science team, and it needs to be available for things like exploratory business intelligence. So when you're building your integrations, I think architecturally, from a modeling perspective, the 3 things that you need to think about are what we call entity codes, aliases, and edges.
And those 3 pieces together are what we need to be able to map this properly into a graph store. So simply put, an entity code is kind of an absolute unique reference to an object. As I alluded to before, something like a passport number is a unique reference to an object, but by itself, just the passport number doesn't mean that it's unique across all the systems that you have at your workplace. The other is aliases. Aliases are more like an email, a phone number, a nickname. They're all alluding to some type of overlap between these records, but they're not something that we can honestly go ahead and 100% merge records based off. Having that, you then of course need to investigate things like inference engines to build up confidence: how confident can I be that a person's nickname is unique in the context of the data that we've plugged in, these 3 or 4 data sources that I'm talking about? And then finally, the edges: they're there to be able to build the referential integrity. But what I find architecturally is that when we're trying to solve data problems for companies, a majority of the time their model represents much more a network than the classic relational store or column database or document store.
And so when we look at the technology that's needed to support the system architecture, one of the key technologies at the heart of this is a graph database, and to be honest, it doesn't really matter which graph database you use. What we found important is that it needs to be a native graph store. There are triple stores out there, and there are multi-model databases like Cosmos DB and SAP HANA, but what we found is that you really need a native graph to be able to do this properly. So the way that you can conceptualize the pattern is that every record that we pull in from a system, or that you import, will go into this network or graph as a node, and every entity code for that record, i.e. the unique ID or multiple unique IDs of that record, will also go in as a node connected to that record.
Now, every alias will go in as a property on that original node, because we probably wanna do some processing later to figure out if we can turn that alias into one of these entity codes, these unique references. Now here's the interesting part. This is the part where the eventual connectivity pattern kicks in. Take the edges, i.e. if I was referencing a person to a company, that person works at a company. Those edges are placed into the graph, and a node is created but marked as hidden.
We call those shadow nodes. So you could imagine if we brought in a record on Barack Obama, and it had Barack's phone number. Now, that's not a unique reference. But what we would do is create a node in the graph that's referring to that phone number, link it to Obama, but mark the phone number node as hidden. As I said before, we call these shadow nodes. Essentially, you can see that as one of these hooks: if I ever get other records that come in later that also have an alias or an entity code that overlaps, that's where I need to start doing my merging work. And what we're hoping for, and this is what we see time after time as well, is that as we import system one's data, it'll start to come in and you'll see a lot of nodes that are shadow nodes, i.e., I have nothing to hook onto on the other side of this ID.
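To make the mechanics concrete, here is a minimal, illustrative sketch in Python of the clue, entity code, alias, edge, and shadow-node ideas described above. All names and structures are hypothetical; this is not CluedIn's implementation or API, and the composite (origin, type, value) entity code anticipates the uniqueness discussion later in the episode.

```python
# Hypothetical sketch of the "clue" / shadow-node mechanics described above.
from dataclasses import dataclass, field

# An entity code is only unique as the full (origin, type, value) triple,
# e.g. ("salesforce-account-456", "organization", "123"); origin names the
# system that issued the ID.
EntityCode = tuple[str, str, str]

@dataclass
class Clue:
    """One record pulled from a single source system."""
    entity_codes: set[EntityCode]                              # absolute unique references
    aliases: dict[str, str] = field(default_factory=dict)     # email, phone, nickname, ...
    edges: set[EntityCode] = field(default_factory=set)       # references to other objects
    properties: dict[str, str] = field(default_factory=dict)  # everything else

class GraphStore:
    """Toy in-memory graph: nodes keyed by integer id, plus an entity-code index."""

    def __init__(self) -> None:
        self.nodes: dict[int, dict] = {}
        self.code_index: dict[EntityCode, int] = {}
        self._next_id = 0

    def ingest(self, clue: Clue) -> int:
        # Any existing node (real or shadow) sharing one of our entity codes is a match.
        matches = {self.code_index[c] for c in clue.entity_codes if c in self.code_index}
        node_id = matches.pop() if matches else self._new_node(shadow=True)
        node = self.nodes[node_id]
        node["shadow"] = False                 # a real record now backs this node
        node["codes"] |= clue.entity_codes
        node["aliases"].update(clue.aliases)   # aliases stay as properties for later inference
        node["properties"].update(clue.properties)
        node["edges"] |= clue.edges            # edges point at entity codes, not at tables
        for other in matches:                  # further matches collapse into this node:
            self._merge(node_id, other)        # clues become facts
        # Every edge target never seen before gets a hidden shadow node, a hook
        # for records that may arrive from other systems later.
        for code in clue.edges:
            if code not in self.code_index:
                shadow_id = self._new_node(shadow=True)
                self.nodes[shadow_id]["codes"].add(code)
                self.code_index[code] = shadow_id
        for code in clue.entity_codes:
            self.code_index[code] = node_id
        return node_id

    def _new_node(self, shadow: bool) -> int:
        self._next_id += 1
        self.nodes[self._next_id] = {"shadow": shadow, "codes": set(), "aliases": {},
                                     "properties": {}, "edges": set()}
        return self._next_id

    def _merge(self, keep: int, drop: int) -> None:
        dropped = self.nodes.pop(drop)
        kept = self.nodes[keep]
        kept["codes"] |= dropped["codes"]
        kept["aliases"].update(dropped["aliases"])
        kept["properties"].update(dropped["properties"])
        kept["edges"] |= dropped["edges"]
        for code in dropped["codes"]:
            self.code_index[code] = keep
```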
And the analogy that we use for this shadow node is that records come in as, by default, a clue. A clue is in no way factual; we don't yet have any other records correlating to these same values, and our goal in this eventual connectivity pattern is to turn clues into facts. And what makes a fact is records that have the same entity codes existing across different systems. So the architectural key to this is that a graph store needs to be there to model our data, and here's one of the key reasons. If the landing zone of this integrated data were a relational database, I would need to have an upfront schema.
I would need to specify how these objects connect to each other. What I've always found in the past is that when I need to do this, it becomes quite rigid. Now, I'm a strong believer that every database needs a schema at some point or you can't scale these things. But what's nice about the graph is that one of the things it got really right was flexible data modeling. There is no necessarily more important object within the graph structure; they're all equal in their complexity, but also in their importance. And
[00:30:01] Unknown:
you can really pick and choose the graph database that you want, but it's one of the keys to this architectural path. So one of the things that you talked about in there is the fact that there's this flexible data model. And so I'm wondering what type of upfront modeling is necessary in order to make this approach viable. I know you talked about the idea of the entity codes and the aliases, but for somebody who is taking a source data system and trying to load it into this graph database in order to take advantage of this eventual connectivity pattern, what is the actual process of loading that information in and assigning the appropriate attributes to the different records and to the different attributes in the record? And then, also, I'm wondering if there are any limitations in terms of what the source format looks like, as far as the serialization format or the types of data that this approach is viable for.
[00:31:03] Unknown:
Sure. Good question. So, I mean, I think the first thing is to recognize that with the eventual connectivity pattern and modeling it in the graph, the key is that there will always be extra modeling that you do after this step. The reason why is that, if you think about the data structures that we have as engineers, the network or the graph is the highest fidelity data structure we have. It's a higher, more detailed structure than a tree. It's more structured than a hierarchy or a relational store, and we definitely have more structure or fidelity than something like a document. With this in mind, we use eventual connectivity to solve this piece of integrating data from different systems and modeling it, but we know we will always do better, purpose-fit modeling later. So it's worth highlighting that the value of the eventual connectivity pattern is that it makes the integration of data easier, but this will definitely not be the last modeling that you do. And that's what allows flexible modeling, because you always know, hey, if I'm trying to build a data warehouse based off the data that we've modeled in the graph, I'm always going to run extra processes after it to model it into, probably, the relational store for a data warehouse, or a column store.
You're gonna model it purpose fit to solve that problem. However, if what you're trying to achieve with your data is flexible access, to be able to feed it off to other systems, you want the highest fidelity and you want the highest flexibility in modeling. But the key is that if you were to drive your data warehouse directly off this graph, it would do a terrible job. That's not what the graph was purpose built for. The graph was always good at flexible data modeling. It's always good at being able to join records very fast, and I mean just as fast as doing an index lookup. That's how these native graph stores have been designed. And so it's important to highlight that the upfront modeling, really, is not a lot of upfront modeling.
Of course, we shouldn't do silly things, but I'll give you an example. If I was modeling a skill, a person, and a company, it's completely fine to have a graph where the skill points to the person and the person points to the organization, and it's also okay to have the person point to the skill and the skill point to the organization. That's not as important. What's important at this stage is that the eventual connectivity pattern allows us to integrate data more easily. Now, when I get to the point where I want to do something with that data, I might find that, yes, I actually do need an organization table which has a foreign key to person, and then person has a foreign key to skill. And that's because that's typically what a data warehouse is built to do: to model the data perfectly, so if I have a billion rows of data, the report still runs fast, but we lose that flexibility in the data modeling. Now, as for formats and things like that, what I found is that, to some degree, with the formatting and the source data, you could probably imagine the data is not created equally. Right? Many different systems will allow you to do absolutely anything you want, whereas the kind of ETL approach dictates that you capture these exceptions upfront: if I've got certain looking data coming in, how does it connect to the other systems? What eventual connectivity does is it just captures them later in the process. And my thought on this is that, to be honest, you will never know all these business rules upfront, and therefore let's embrace an integration pattern that says, hey, if the schema in the source or the format of the data changes, and you kind of alluded to this before as well, Tobias, then okay, got it. I want to be alerted that there is an issue with deserializing this data. I want to start queuing up the data in a message queue or maybe a stream, and I want to be able to fix that schema and have the platform say, got it, now that that's fixed, I'll continue on with deserializing the things that will now deserialize, and these kinds of things will happen all the time. I think I've referred to it before, and heard other people refer to it, as schema drift, and this will always happen in source and in target.
So what I found success with is embracing patterns where failure is going to happen all the time. When we look at the ETL approach, it's much more, when things go wrong, everything stops. Right? The different workflow stages that we've put into our kind of classical ETL designers all go red, red, red, red: I have no idea what to do, and I'm just going to fall over. What we would rather have is a pattern that says, got it, the schema has changed, I'm gonna log what you need to do until the point where you've changed that schema, and when you put that in place, I'll test it and say, yep, that schema, I can deserialize that now, I'll continue on.
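A rough sketch of that "park it and keep going" behavior: records that fail to deserialize go to a dead letter queue instead of stopping the pipeline, and are replayed once the schema mapping is fixed. The names here are hypothetical and not tied to any particular product.

```python
# Hypothetical sketch: failed records are parked, the pipeline keeps running,
# and the parked records are replayed after the schema mapping is updated.
from collections import deque

def make_parser(schema: dict[str, type]):
    """Return a parser that coerces a raw dict to the given column -> type schema."""
    def parse(raw: dict) -> dict:
        # KeyError / ValueError here is what "schema drift" looks like at runtime.
        return {col: typ(raw[col]) for col, typ in schema.items()}
    return parse

def run_pipeline(records, parse, dead_letters: deque, load):
    for raw in records:
        try:
            load(parse(raw))                           # happy path: deserialize and load
        except (KeyError, ValueError, TypeError) as err:
            dead_letters.append((raw, str(err)))       # park it and keep going

def replay(dead_letters: deque, parse, load):
    """Once the schema mapping has been fixed, try the parked records again."""
    still_failing = deque()
    while dead_letters:
        raw, _ = dead_letters.popleft()
        try:
            load(parse(raw))
        except (KeyError, ValueError, TypeError) as err:
            still_failing.append((raw, str(err)))
    dead_letters.extend(still_failing)
```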
And what I find is that if you don't embrace this, you spend most of your time just reprocessing
[00:37:12] Unknown:
data back through ETL. And so it seems that there actually is still a place for workflow engines, or some measure of ETL, where you're extracting the data from the source systems, but rather than loading it into your data lake or your data warehouse, you're adding it to the graph store to be able to pull these mappings out, and then also potentially going from the graph database where you have joined the data together and pulling it out from that using some sort of
[00:37:44] Unknown:
query to have the mapped data extracted and then load that into your eventual target. I mean, what you've just described there is a workflow, and therefore these workflow systems still make sense. It's very logical to look at these workflows and say, oh, that happens, then that happens, then that happens. They completely still make sense. And in some cases, I actually still use these ETL tools for very specific jobs. But what you can see is, if we were to use these kind of classical workflow systems, the eventual connectivity pattern, as you described, is just 1 step in that overall pattern.
But I think what I've found over time is that we use these workflow systems to be able to join data, and I would actually rather throw that to an individual step called eventual connectivity, where it does the joining and things like that for me, and then continue on from there. Very similar to the kind of example you gave, and that I've also been mentioning here as well, there will always be things you do after the graph, and that is something you could easily push into 1 of these workflow designers. Now, as for an example of the times when our company still uses these tools out at our customers, I think 1 of the ones that makes complete sense is IoT data.
And it's mainly because, at least in the cases that we've seen, there's typically not as much hassle in blending and cleaning that data. We see that more with operational data, things like transactions, customer data, and customer records. That's typically quite hard to blend. But when it comes to IoT data, if there's something wrong with the data such that it can't blend, it's often that, well, maybe it's a bad reader that we're reading off, instead of something that is actually dirty data. Now, of course, every now and then, if you've worked in that space, you'll realize that readers can lose the network and you can have holes in the data. But eventual connectivity would not solve that either. Right? Typically, in those cases, you'll do things like impute the values from historical and future data to fill in the gaps, and it's always a little bit of a guess; that's why we're imputing it. But to be honest, if it was my task to build a unified dataset from across different sources, I would just choose this eventual connectivity pattern every single time. If it was just a workflow of data processing where I know that data blends easily and there's not a data quality issue, fine. But where there is, and you need to jump across multiple different systems to merge data, I just time after time have found that these workflow systems reach their limit, where it just becomes too complex.
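For reference, the gap filling Tim mentions for sensor readings is often simple interpolation from the surrounding values; a minimal pandas sketch, assuming a timestamp-indexed series of readings with holes in it:

```python
# Minimal sketch of imputing gaps in sensor readings from surrounding values.
import pandas as pd

readings = pd.Series(
    [21.0, None, None, 22.5, 23.0],
    index=pd.date_range("2019-06-01 00:00", periods=5, freq="min"),
)
# Time-based linear interpolation guesses the missing points from the
# readings before and after the gap.
filled = readings.interpolate(method="time")
```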
[00:40:54] Unknown:
And for certain scales or varieties of data, I imagine that there are certain edge cases that come up when trying to load everything into the graph store. And so I'm wondering what you have run up against as far as limitations to this pattern or at least alterations to the pattern to be able to handle some of these larger volume tasks? I think I'll start with this. The graph is notoriously
[00:41:19] Unknown:
hard to scale, and with most of the graph databases that I've had experience with, you're essentially bound to 1 big graph. There's no real notion of clustering these data stores into sub graphs that you could query across. So scaling that is actually quite hard to start with. But the limitations of the pattern itself, there are many. I mean, it starts with the fact that you need to be careful. I'll give you a good example. I've seen many companies that use this pattern flag something like an email as unique, and then we realize later, no, no, no, it's not, and we have merged records that are not duplicates. And this means, of course, that you need support in the platform that you're utilizing for the ability to split these records, fix them, and reprocess them at a later point. But, I mean, these are also things that would be very hard to pick up in ETL or ELT types of patterns. I think 1 of the other downsides of this approach is that upfront, you don't know how many records will join. It's kind of like the name alludes to.
Eventually, you'll get joins or connectivity, and you can think of it as this pattern deciding how many records it will join for you, based off these entity codes or unique references, or the power of your inference engine when it comes to things that are a little bit fuzzy as an ID for someone, things like phone numbers. The great thing about this is it also means that you don't need to choose what type of join you're doing. In the relational world, you've got plenty of different types of joins: inner joins, outer joins, left and right outer joins, things like this. In a graph, there's 1 join. Right? And so with this pattern, it's not like you can pick the wrong join to go with; there's only 1 type. So it really becomes useful when it's, no, no, no, I'm just trying to merge these records, I don't need to hand hold how these joins will happen. I think 1 of the other downsides that I've had experience with is that, let's just say you have systems 1 and 2, what you'll often find is that when you integrate these 2 systems, you have a lot of these shadow nodes in the graph, or sometimes we call them floating edges: hey, I've got a reference to a company with an ID of 123, but I've never found the record on the other side with the same ID. So, in fact, I'm storing lots of extra information that I'm not actually utilizing. But the advantage is in saying, yeah, but you will integrate system 4, you will integrate system 5, where that data sits.
But the value is that you don't need to tell the systems how they join. You just need to flag these unique references. And I think the final limitation that I've found with these patterns is that you learn pretty quickly, as I alluded to before, that there are many records in your data sources where you think a column is unique, but it's not. It might be unique in your system, i.e. in Salesforce the ID is unique, but if you actually look across the other parts of the stack, you realize, no, no, no, there is another company in another system with the ID 123, and they have nothing to do with each other.
And so these entity codes that I'm talking about, they're made up of multiple different parts. They're made up of the ID, 123. They're made up of a type, something like organization. And they're made up of a source of origin, you know, Salesforce account 456. What this does is guarantee uniqueness if you added in 2 Salesforce accounts, or if you added in systems that have the same ID but it came from a different source. And as I said before, a good example would be the email. Even at our company, we use GitHub Enterprise to store our source code, and we have notifications that our engineers get when there's a pull request and things like this. And GitHub actually identifies each employee as notifications@github.com.
That's what that record sends us as its unique reference, and, of course, if I mark this as a unique reference, all of those employee records using this pattern would merge. However, what I like about this approach is that at least I'm given the tools to rectify the bad data when I see it. And to be honest, if companies are wanting to become much more data driven, as we aim to help our customers do, then I just believe we have to learn to accept that there's more risk that could happen, and ask: is that risk worth more, by having data readily available at the forefront, than the old approaches that we're taking? And for anybody who wants
[00:46:49] Unknown:
to dig deeper into this idea or learn more about your thoughts on that, or some of the adjacent technologies, what are some of the
[00:46:58] Unknown:
resources that you recommend they look to? Yeah. So, I mean, and Tobias, you and I have talked about this before, I think the first way to learn more about it is to get in contact and challenge us on this idea. When you're an engineer and you see a technology and go out and start using it, you have this tendency to gain a bias around it: time after time you see it working, and then you think, why is not everybody else doing this? And actually, the answer is quite clear. It's because things like graph databases were not as ubiquitous as they are right now. You know, you can get off the shelf free graph databases to use, and even 10 years ago this was just not the case. You would have to build these things yourself. So I think the first thing is, you can get in touch with me at tiw@cluedin.com if you're just interested in challenging this design pattern and really getting to the crux of: is this something that we can replace ETL with completely?
I think the other thing is the white paper that you alluded to; that's available from our website, so you can always jump over to cluedin.com to read it. It's completely open and free for everyone to read. And then we also have a couple of YouTube videos, if you just search for CluedIn I'm sure you'll find them, where we talk in depth about utilizing the graph to be able to merge different datasets. But, I mean, I always like to talk to other engineers and have them challenge me on this, so feel free to get in touch. And I guess if you're wanting to learn much more, we also have our developer training that we give here at CluedIn, in which we compare this pattern to other patterns that are out there, and you can get hands on experience with taking different data sources, trying the multiple different approaches that are out there as integration patterns, and really just seeing the one that works for you. Is there anything else about the ideas of eventual connectivity
[00:49:08] Unknown:
or ETL patterns that you have seen or the overall space of data integration that we didn't discuss yet that you'd like to cover before we close out the show? I think for
[00:49:18] Unknown:
me, I always like when I have more engineering patterns and tools on my tool belt. So I think the thing for listeners to take from this is to use this as an extra little piece on your tool belt. If you find that you walk into a company that you're helping and they say, hey, listen, we're really wanting to start to do things with our data, and we've got 300 systems, and to be honest, I've been given the direction to pull and wrangle this into something we can use; really think about this eventual connectivity pattern.
Really investigate it as a possible option. You'll be able to see it in the white paper, but to implement the pattern yourself is really not complex. Like I said before, one of the keys is just to embrace maybe a new database family to be able to model your data. And, yes, get in touch if you need any more information. And one follow on from that, I think, is the idea of migrating from an existing ETL workflow
[00:50:30] Unknown:
and into this eventual connectivity space. And it seems that the logical step would be to just replace your current target system with the graph database, add in the mapping for the entity IDs and the aliases, and then you're at least partway on your way to being able to take advantage of this, and then just add a new ETL workflow at the other end to pull the connected data out into what your original target systems were. Yeah. Exactly. I mean, it's
[00:51:04] Unknown:
it's quite often we walk into a business and they've already got many years of business logic baked into these ETL pipelines. And my idea on that is not to just throw these away. There's a lot of good stuff there. It's to really just complement it with this extra design pattern that's probably a little bit better at the whole merging
[00:51:28] Unknown:
and deduplication of data. Alright. Well, for anybody who wants to get in touch with you, I'll add your email and whatever other contact information to the show notes, and I've also got a link to the white paper that you mentioned. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:51:49] Unknown:
Well, it may be a little bit off topic, but I would actually say that I'm amazed how many companies I walk into that don't know the quality of the data they are working with. So I think one of the big gaps that needs to be fixed in the data management market is for businesses, as they integrate data from different sources, to be explicitly told via different metrics, the classic ones we're used to would be accuracy and completeness and things like this, what they are dealing with. Just that simple fact of knowing, hey, we're dealing with 34% accurate data, and guess what? That's what we're pushing to the data warehouse to build the reports that our management is making key decisions off of. So I think, first of all, the gap is in knowing what quality of data you're dealing with. And I think the second piece is in facilitating the process around how you increase that. A lot of these things can often be fixed by normalizing values. You know, if I've got 2 different names for a company, but they are the same record, which one do you choose? Do we normalize to the value that's upper case or lower case or title case, or the one that has, you know, "incorporated" at the end? And I think
[00:53:16] Unknown:
that part of the industry just needs to get better. Alright. Well, thank you very much for taking the time today to join me and discuss your thoughts on eventual connectivity and some of the ways that it can augment or replace some of the ETL patterns that we have been working with to date. I appreciate your thoughts on that, and I hope you enjoy the rest of your day. Thanks, Tobias.
Introduction to Eventual Connectivity
Tim Ward's Background and Journey
Challenges with Traditional ETL
Concept of Eventual Connectivity
Example System Architecture
Workflow Integration and ETL Tools
Limitations and Challenges
Resources and Further Learning
Biggest Gaps in Data Management Tooling