Summary
Data integration is one of the most challenging aspects of any data platform, especially as the variety of data sources and formats grows. Enterprise organizations feel this acutely because of the silos that naturally develop across business units. The CluedIn team experienced this problem first-hand in their previous roles, which led them to found a company building a managed data fabric for the enterprise. In this episode Tim Ward, CEO of CluedIn, joins me to explain how their platform is architected, how they manage the task of integrating with third-party platforms, how they automate entity extraction and master data management, and the work of providing multiple views of the same data for different use cases. I highly recommend listening closely to his explanation of how they maintain consistency of the data that they process across different storage backends.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- Managing and auditing access to your servers and databases is a problem that grows in difficulty alongside the growth of your teams. If you are tired of wasting your time cobbling together scripts and workarounds to give your developers, data scientists, and managers the permissions that they need then it’s time to talk to our friends at strongDM. They have built an easy to use platform that lets you leverage your company’s single sign on for your data platform. Go to dataengineeringpodcast.com/strongdm today to find out how you can simplify your systems.
- Alluxio is an open source, distributed data orchestration layer that makes it easier to scale your compute and your storage independently. By transparently pulling data from underlying silos, Alluxio unlocks the value of your data and allows for modern computation-intensive workloads to become truly elastic and flexible for the cloud. With Alluxio, companies like Barclays, JD.com, Tencent, and Two Sigma can manage data efficiently, accelerate business analytics, and ease the adoption of any cloud. Go to dataengineeringpodcast.com/alluxio today to learn more and thank them for their support.
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
- Your host is Tobias Macey and today I’m interviewing Tim Ward about CluedIn, an integration platform for implementing your company’s data fabric
Interview
- Introduction
- How did you get involved in the area of data management?
- Before we get started, can you share your definition of what a data fabric is?
- Can you explain what CluedIn is and share the story of how it started?
- Can you describe your ideal customer?
- What are some of the primary ways that organizations are using CluedIn?
- Can you give an overview of the system architecture that you have built and how it has evolved since you first began building it?
- For a new customer of CluedIn, what is involved in the onboarding process?
- What are some of the most challenging aspects of data integration?
- What is your approach to managing the process of cleaning the data that you are ingesting?
- How much domain knowledge from a business or industry perspective do you incorporate during onboarding and ongoing execution?
- How do you preserve and expose data lineage/provenance to your customers?
- How do you manage changes or breakage in the interfaces that you use for source or destination systems?
- What are some of the signals that you monitor to ensure the continued healthy operation of your platform?
- What are some of the most notable customer success stories that you have experienced?
- Are there any notable failures that you have experienced, and if so, what were the lessons learned?
- What are some cases where CluedIn is not the right choice?
- What do you have planned for the future of CluedIn?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- CluedIn
- Copenhagen, Denmark
- A/B Testing
- Data Fabric
- Dataiku
- RapidMiner
- Azure Machine Learning Studio
- CRM (Customer Relationship Management)
- Graph Database
- Data Lake
- GraphQL
- DGraph
- RabbitMQ
- GDPR (General Data Protection Regulation)
- Master Data Management
- OAuth
- Docker
- Kubernetes
- Helm
- DevOps
- DataOps
- DevOps vs DataOps Podcast Interview
- Kafka
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the project you hear about on the show, you'll need somewhere to deploy it. So check out our friends at Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform. And if you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances to ensure that you get the performance that you need.
Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of the show. And managing and auditing access to all of those servers and databases that you're running is a problem that grows in difficulty alongside the growth of your teams. If you're tired of wasting your time cobbling together scripts and workarounds to give your developers, data scientists, and managers the permissions that they need, then it's time to talk to our friends at StrongDM. They have built an easy to use platform that lets you leverage your company's single sign on for your data platform.
Go to dataengineeringpodcast.com/strongdm today to find out how you can simplify your systems. And Alluxio is an open source, distributed data orchestration layer that makes it easier to scale your compute and your storage independently. By transparently pulling data from underlying silos, Alluxio unlocks the value of your data and allows for modern computation-intensive workloads to become truly elastic and flexible for the cloud. With Alluxio, companies like Barclays, JD.com, Tencent, and Two Sigma can manage data efficiently, accelerate business analytics, and ease the adoption of any cloud. Go to dataengineeringpodcast.com/alluxio today to learn more and thank them for their support. You listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management.
For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Dataversity, and the Open Data Science Conference. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers. Your host is Tobias Macey, and today I'm interviewing Tim Ward about CluedIn, an integration platform for implementing your company's data fabric. So, Tim, could you start by introducing yourself? Yeah. Sure. And first of all, thanks for having me on, Tobias.
[00:02:51] Unknown:
Yeah. My name is Tim Ward. I've been working in the software engineering space for around the last 12 or 13 years, mainly focusing on the enterprise space. I'm based out of Copenhagen, Denmark, where I live with my wife, my little boy, and a little dog that looks like an Ewok. And do you remember how you first got involved in the area of data management? Yeah. Given that I've been working in software engineering for that long, it was around six years ago that I got my first glimpse into a new world. I'd been focusing mainly on the web space, building big websites for big brands, and I was given a project around how to optimize content on a website for the right audiences, and you can probably imagine this comes with a lot of data as well. In this industry of testing content, you would usually have a process called a split A/B test: you show two variations of a website and see which one is more popular. We had come up with this interesting idea of, well, should there really be one winner, or are both actually relevant to different audiences?
So this took me down the rabbit hole of data mining and clustering techniques with machine learning.
[00:04:14] Unknown:
And what got me into the data management piece is that I realized, oh god, I have to blend data from different systems, and it doesn't blend well because it's not clean, so I need to normalize the data. Fast forward to today, and I've been in this industry for the last six years. And so before we get to digging too deep into CluedIn in particular, can you start by sharing your definition of what a data fabric is? There are a lot of different resources that people might look to that might have conflicting definitions, and that way we can have a consistent view for the purpose of this conversation.
[00:04:48] Unknown:
Yeah. Well, first of all, I will apologize for introducing yet another data acronym into the data space. I don't think we need any more. We've got data lake and data warehouse and data marts, and now we've got data fabric. So for that, I apologize. But if you look up data fabric on the web and you find us, you'll see that we're talking about this idea of a foundation or a fabric across our content. It really comes down to this: when I first got into this industry, I realized how overwhelming the data space is. Just some of the keywords or buzzwords I mentioned before, like data integration and data governance and data warehouse, were overwhelming from a technology-stack perspective. When we started the platform, we saw this common theme where people were buying individual tools like a data warehouse and business intelligence tools, but really, where most of these projects failed was in the stitching of all of these different pieces together. So, suddenly, if we had bought a purpose-fit integration platform, and now we had purchased a best-in-breed data preparation tool, we had this huge challenge of how to make that a coherent and cohesive story. And so when we describe the data fabric, you can think of it more as this kind of plumbing or stitching, where we know that at the end of the day you might already have bought a data warehouse — it's probably been around for a very long time — but what you're missing is the core stitching and plumbing of an end-to-end story. So when we talk about fabric, it's really about how do I make sure that when I've got data coming in, it's giving value to me at the end.
And those end use cases will typically be things like visualizations in charts and BI tools, or figuring out patterns in your data using machine learning techniques and tools like Dataiku or Azure ML or RapidMiner. So, really, we're wanting to make this stitching solid,
[00:07:02] Unknown:
and you can plug in your parts where you see fit. And so that brings us to the work that you've been doing with CluedIn to
[00:07:09] Unknown:
make this whole process of stitching together different data sources easier. So can you give an explanation of what your mission is at CluedIn and some of the story of how it got started? Yeah. Sure. Really, we're interested in helping companies become data driven, and that means a lot of things as well — there's another data acronym for you. For me, the idea of being data driven is that I, as a company, am using data to make very decisive decisions. Not just a gut feeling or a hypothesis, but something where I can point at a screen and say, hey, here's the data that's allowed me to form my opinion or give this input. And we think it's so important to also be able to trace that data back: where did it come from, how was it cleaned over time, how stale is it, what's the quality, what's the accuracy? I saw all of these pieces as missing in the market and as something that I would want in the core fabric or the core plumbing.
So, like any good story, I guess it starts with a necessity. A couple of the engineers at CluedIn and I were working at a larger enterprise, and we were, at the time, a smaller core engineering group, and you can know a lot about what's happening in the business when you're that kind of small team. We knew who our customers were. We knew what was wrong with our platform. We knew what our customers were asking for. That kind of communication was there. And, as most businesses do, they grow. Within a year, we had grown from 5 core engineers to 125, and we lost this connection to why we were building this product. Why were we building what we were building? And, like any typical engineer would try to solve things, our CTO now, at the time, said, well, why don't I just build this kind of hub, and I can get everybody to plug in all the tools that they use, like the CRM and the marketing tools, and I'll just go there to figure things out instead of, god forbid, talking to people or taking meetings? So it has a very typical engineering start, but, really, what happened from there was this natural progression of: well, if I look up a particular customer, I get 10 different results, but they're all the same customer.
So, okay, I have to introduce maybe some deduplication or some merging engine. And then we realized, well, things don't always merge perfectly, so we need a fuzzy merging engine. And this would probably be represented well in a graph database instead of just one type of data format. And then you come to this natural step of, well, I can't form a graph properly if the data's not clean and normalized. So this really snowballed into this kind of self-fulfilling prophecy, and where it's landed today is as this core stitching of the data foundation of a business.
[00:10:14] Unknown:
And as far as the types of customers that you work with, can you describe what the ideal customer is — whether they're a technically oriented organization, or if your aim is to be sort of plug and play, where somebody just sends you data and then you automatically do the integration and send it to their destinations or integrate it with the various data streams that they're working with? Yeah. So, I mean, there are lots of good pieces in that question. I mean, the piece about "automatic", right? You can probably imagine there's no magic in our field, Tobias. There are rules, there are
[00:10:49] Unknown:
techniques, and there are things that fall through the cracks. So, funnily enough, our ideal customer is large enterprises that have a huge problem. Right? Thousands of tools. God forbid, you have to integrate these thousands of tools into one central hub and then clean the data and then invent all these processes yourself. And there are a lot of different things and techniques we can use in our field right now to automate some of the cleaning: things like standardizing on a date format, normalizing phone numbers into ISO formats, trying to do a relatively good job at normalizing addresses, and things like this. And there's really no science to some of these things, but there are things that fall through the cracks. So, really, our ideal customer would be: hey, you've got a mess with your data, it's not blending well together, and, therefore, when you see that data in your BI tools or your machine learning tools, where you're actually going to get value, you're not seeing that value. And our goal is to put in the processes and the pipeline to say, hey, we're facilitating a process for you to improve this quality over time. So it's really large enterprise customers. And, funnily enough, the majority of our customers are in the finance and the insurance industry, and a lot of this also comes down to regulation.
A lot of these industries are put under more strenuous regulations than, for example, technology companies today. And, living in Europe, there are strong policies now around data privacy and what data companies can hold on individuals. All of this plays a role in really focusing in on enterprise businesses.
[00:12:34] Unknown:
Yeah. And the point that you're making about regulation and compliance leads into what I wanted to ask about the architectural and system level: how you manage data privacy and security as data transits your system, and what the deployment environment looks like — whether you're running a hosted system that people send data through, or if you're co-located with the data to reduce some of the latency effects of having to cross network boundaries — and just some of the overall ways that you have built your company and the ways that the data flows through. Yeah. I mean, it's,
[00:13:12] Unknown:
you could probably gather that a lot of our customers are really interested in hosting this environment on their own premises, in their own virtual machine clusters, but a lot of them are moving into the cloud, and some of them even have a hybrid approach. So, really, with the architectures that we're seeing, first of all, we have to start with a good base. Right? If someone's going into the cloud, it's so important that we're adhering to the leading security practices for network and at-rest encryption.
So things like TLS, things like SSL, things like making sure that we're running all of this behind VPN infrastructure. Now, it's so important, of course, with the customer data that, as it's flowing through, if we're encrypting it at both the network and at-rest levels, we also need to make sure that we hold the highest level of encryption, at least the same as or higher than any of the sources. So, for example, if we plugged in something like the Office 365 stack and we were analyzing something like mail or calendar events, we would need to make sure that we're running AES-256 encryption at the data level, and that we're using industry-standard protocols for network transfer. So that's the security side of things, and then there's really the privacy side of things. One of the ways that we help with this is that within our platform, you can manage what's called the consent: what do you have consent for, for personally identifying data? And, in fact, one of the ways that we help with this is that we're cataloguing the data that comes through this pipe. We're looking at what the fields coming from Salesforce are, what the fields coming from the ERP system are. This allows you to have a direct mapping between, hey, for this individual, we have the consent to use their email address and their phone number and their first name and last name in these specific situations.
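To make that field-level consent mapping concrete, here is a minimal sketch — not CluedIn's actual API or data model, and all names are illustrative — of how catalogued source fields might be checked against recorded consent before a value is released for a given purpose:

```python
# Hypothetical illustration of field-level consent checks, not CluedIn's real data model.
from dataclasses import dataclass, field

@dataclass
class ConsentRecord:
    person_id: str
    # purpose -> set of canonical fields the person has consented to for that purpose
    allowed: dict = field(default_factory=dict)

CATALOG = {
    # catalogued mapping of canonical fields back to their source systems
    "email":      {"source": "salesforce", "source_field": "Contact.Email"},
    "phone":      {"source": "erp",        "source_field": "customer.phone_no"},
    "first_name": {"source": "salesforce", "source_field": "Contact.FirstName"},
}

def release_fields(record: dict, consent: ConsentRecord, purpose: str) -> dict:
    """Return only the catalogued fields this person consented to for the given purpose."""
    permitted = consent.allowed.get(purpose, set())
    return {k: v for k, v in record.items() if k in CATALOG and k in permitted}

consent = ConsentRecord("tobias", {"marketing": {"email", "first_name"}})
print(release_fields({"email": "t@example.com", "phone": "+4512345678",
                      "first_name": "Tobias"}, consent, "marketing"))
```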
But governance is also so much bigger than just compliance. There are also the policies around what we allow to flow through this stack. What if I see something like a credit card come through this pipe? How should I react to it? There are two different ways we support reacting to that. One is just to be notified and alerted that, hey, you've got this data and it's high risk. The other is, hey, I need you to stop. Right? Don't let this data go any further to upstream consumers like the data warehouse and machine learning platforms. We often see that this is something that is not really covered in data governance. The typical data governance is, hey, can we set up some rules around ownership of data, ownership of products? And I think one of the things that we miss in our industry is: how are we actually tracing that against the actual data? I think this is why this is a problem for most companies, and for some I think it's kind of unsolvable to really meet all these regulations. So we're making sure that we cover off these security blankets of, hey, if you've got data flowing through us, at least we've got you covered on these main pieces and concepts that you'll need. And as far as the overall life cycle of data as it traverses your system, can you just talk through the different components of how you're processing it, the systems that you're running, and some of the evolution that has gone on since you first began working on CluedIn? Yeah. Definitely. You'll probably agree that everything in our world starts with getting the data. And so the integration layer, that's really where it starts. I often like to say that platforms like CluedIn are useless if you can't put data through them. So the way that we help there is, first of all, we've got over 200 integrations to popular platforms like Salesforce and SQL Server and Oracle databases, Hadoop — the common kind of big-data environments as well. And one of the things we do first, as the data is flowing through — the absolute first thing — is we start to score the data raw.
So we start to look at the data on about 12 different metrics, and these are your classic data quality metrics like accuracy and completeness. And because we're using certain technology — one of the databases, which I'll probably touch on a few times throughout this, is the graph database — this allows us to also measure things like the connectivity of these records. And the reason why we bring it in raw is that one of the common architectures we see is that a lot of people are introducing this idea of a data lake, i.e., a place to dump all the data from all the different systems, and you get a ubiquitous language, typically via something like SQL, to be able to query this data in a common format. And that's what allows us to say, hey, that's great: if you've got that, instead of the data going directly from the sources to CluedIn, why don't we go through the data lake first, and then CluedIn will integrate with the data lake itself?
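As a rough illustration of the kind of raw scoring Tim describes, here is a sketch with a few simplified stand-in metrics — the 12 metrics CluedIn actually computes, and their definitions, are not spelled out here, so these formulas are purely illustrative:

```python
# Simplified stand-in for scoring a raw ingested record on a few quality metrics.
import re
from datetime import datetime, timezone

def score_record(record: dict, edges: list) -> dict:
    """Score one ingested record on a handful of illustrative quality metrics."""
    values = list(record.values())
    completeness = sum(v not in (None, "") for v in values) / max(len(values), 1)
    # crude accuracy proxy: does a typed-looking field match an expected pattern?
    email_ok = bool(re.match(r"[^@\s]+@[^@\s]+\.[^@\s]+$", record.get("email", "")))
    # connectivity: how many other records this one links to (a graph-style metric)
    connectivity = min(len(edges) / 10.0, 1.0)
    staleness_days = (datetime.now(timezone.utc)
                      - record.get("modified_at", datetime.now(timezone.utc))).days
    return {
        "completeness": round(completeness, 2),
        "accuracy_email": email_ok,
        "connectivity": round(connectivity, 2),
        "staleness_days": staleness_days,
    }

print(score_record({"email": "t@example.com", "phone": ""}, edges=["acct-1", "deal-7"]))
```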
And that pivotal step of allowing flexibility is also why we often refer to ourselves as this data fabric. But I think we would both agree, Tobias, that data by itself is really not that valuable — in some cases, especially with privacy, it's sometimes more of a liability. So by scoring this data on these different metrics, we're setting it up for a whole bunch of processing and tests that we're going to run on it. The next natural step in the pipeline is: hey, I need to prepare this data to do some analysis. One of the ways that we've tackled this is we actually use 5 different database families to store the same data. You can imagine that a relational database is very good at doing certain things. It's what typically runs a data warehouse. It's very good at aggregating data. It's been around for a very long time, so it's solid and it's robust. But it's just not good at doing things like modeling data that is more of a graph or network model, which, to be honest, is how most businesses actually are modeled. So we're taking these same records and persisting them into these different database families.
There's a lot of value we get out of this. First of all, we can ask questions about the data that we couldn't ask of a relational database — or, god forbid, shouldn't ask of a relational database. But the other great thing is when we want to get data out of our system. I have some pretty interesting querying techniques: I can do queries that are part graph and part search index and part column store and munge it all together into this Frankenstein type of query, but we can also optimize that query to run against the right databases. So once we've got the data in these databases, of course, we start applying the classic cleaning techniques, like normalization of dates and phone numbers and maybe even some simple things like genders — and, of course, things like normalizing types: hey, this is an integer, this is a float. Let's normalize this so people don't need to do it upstream themselves. But things fall through the cracks. Right? There's no magic in this industry. There are just some things that need a manual application.
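For a sense of what that automated cleaning layer does, here is a minimal sketch of date and phone normalization. It assumes dates are normalized to ISO 8601 and phone numbers to E.164-style strings; CluedIn's own rules are certainly more involved than this:

```python
# Illustrative normalizers, assuming ISO 8601 dates and E.164-style phone numbers.
import re
from datetime import datetime

def normalize_date(value: str) -> str:
    """Try a few common date layouts and emit ISO 8601 (YYYY-MM-DD)."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y", "%m/%d/%Y", "%d %b %Y"):
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {value!r}")  # falls through to manual cleaning

def normalize_phone(value: str, default_country: str = "+45") -> str:
    """Strip punctuation and prefix a country code when one is missing."""
    digits = re.sub(r"[^\d+]", "", value)
    if not digits.startswith("+"):
        digits = default_country + digits.lstrip("0")
    return digits

print(normalize_date("21 Mar 2019"))   # 2019-03-21
print(normalize_phone("12 34 56 78"))  # +4512345678
```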
So for that, that's when we bring CluedIn Clean into the whole pipeline. It's the ability for a data engineer to really solve the normalization and standardization issues that can't be automated very easily, or at least not with much statistical confidence. Once we've gone through these cleaning techniques, the next thing is that the governance piece is applied: are you even allowed to send this data upstream? Do you have any personal data coming through, or high-risk data that we should be alerting the system about? And, really, the endpoint of CluedIn — it's sometimes unfortunate for us — is that we stop right where the fun happens. At that point, we just say, hey, here is the data. It's been blended, it's cleaner, it's more accurate, I've traced where it came from so you can always audit it. Now go do something with it. And that's why we've brought
[00:21:34] Unknown:
GraphQL into the situation. Have you had a chance to play with GraphQL, Tobias? I haven't done much work with it, but I have had a number of conversations about it. I know that the Dgraph project uses a modified version of GraphQL as their query language, and on my other podcast I had an interview with a gentleman who has been building tools for simplifying the creation of GraphQL APIs in Python and in JavaScript. Got it. It's definitely an interesting approach to the overall problem of saying, I just need this data, I don't care how it gets to me, and then pushing the responsibility further down the stack. It makes things easier at some layers and more complicated at others, but I think that, in terms of overall system design, it simplifies things and adds a fairly useful abstraction layer. And that's the word: it's the abstraction. In engineering, when you dial up simplicity, you sometimes lose functionality, and I like to
[00:22:33] Unknown:
describe GraphQL more as, well, kind of a schema for a query language, and you can do what you want with it. If you want it just to talk to a relational database, go for it. But in our case, we're in this good position where we say, hey, I've actually got the same data in 5 different database families. So if I want a part of the GraphQL query to run against the search index — because maybe we're doing fuzzy searching, or we're searching in different languages, which is traditionally harder to achieve in the other types of database families — we can say, okay, why don't you do that part in the search index? And when you get to the part where you're getting the results out of GraphQL and you need edges, well, don't ask the relational database or the search index to do that; get the graph to do that piece. So that's why I sometimes describe it as this Frankenstein of languages, but, like you said, that complexity is abstracted away from you. In the end, all you get is the power of SQL, but really just the select part and the where part, because by that point the data's already been joined. So this whole complexity of, well, I need data from Salesforce and I need it from Dynamics, but now I need to figure out the inner joins or the outer joins to blend that data — that's already been done before this. That's why we also like to say that it makes this data much more accessible to the business, because you don't have to be a domain expert in one system or many systems to actually get the data that you want out. One of the things that I was curious about as you were discussing using these different storage engines and being able to leverage their different strengths and capabilities is
[00:24:16] Unknown:
how you ensure that you're able to reconcile the records as they get split out across these different systems, and keep track of them individually and in aggregate. And in the event that there's something like a GDPR right-to-be-forgotten request, how you're able to then go back through and delete them from all the different systems where they need to be removed — and just some of the overall record keeping that you use to manage all these different databases and ensure that there's some sort of consensus across them. Yeah. Exactly. And, just by asking that question, you've probably already done your research and thought, maybe this is something that's more eventual-consistency based, because
[00:24:58] Unknown:
when you've got — I mean, the backbone of CluedIn is a message queuing system called RabbitMQ, and all of these databases are essentially saying, hey, I'm just going to consume the queue called "create record". But even though the individual databases could be transactional, in no way do we wrap an entire transaction over 5 databases. That's just fraught with issues. Of course, we do have these mechanisms, but I don't think it would even work in reality: what happens if we have a lot of retry mechanisms where we insert into the first, second, and third, and the fourth fails? Do we go back and clean up? There's nothing transactional about that. So one of the ways that we've addressed this is that there is what you would call a write-ahead journal, like a log. All of the messages that are consumed off the queue coming from these different sources are actually just written to one log file — append only, with super fast write speed. And then what the databases do is basically read off those logs, and the great thing about that is that instead of this kind of "insert this, and then insert this next thing, and then this next thing", these systems can say, hey, go get me 1,000 lines off the log and process them all in one big call to the database. And that piece, of course, is transactional. So, in the end, what you get with CluedIn is a system that has eventual consistency across the stack. And to answer the question around the GDPR subject access request and right-to-portability part of the question: first of all, let's start with the fact that it's hard. It's a really hard problem to solve.
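A minimal sketch of that pattern — an append-only journal that each database family consumes in batches inside its own transaction. This is illustrative only: the RabbitMQ wiring, per-store offset tracking, and error handling are omitted, and none of it is CluedIn's actual code:

```python
# Illustrative append-only journal with per-store batch consumption.
import json
import sqlite3

JOURNAL = "journal.log"

def append_to_journal(message: dict) -> None:
    """Fast append-only write of one consumed queue message."""
    with open(JOURNAL, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(message) + "\n")

def consume_batch(offset: int, batch_size: int = 1000) -> tuple[list, int]:
    """Read up to batch_size journal lines starting at a per-store offset."""
    with open(JOURNAL, encoding="utf-8") as fh:
        lines = fh.readlines()[offset:offset + batch_size]
    return [json.loads(line) for line in lines], offset + len(lines)

def apply_batch(db: sqlite3.Connection, batch: list) -> None:
    """Apply one batch in a single transaction, so each store catches up atomically."""
    with db:  # commits on success, rolls back on error
        db.executemany("INSERT OR REPLACE INTO records(id, body) VALUES (?, ?)",
                       [(m["id"], json.dumps(m)) for m in batch])

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE records(id TEXT PRIMARY KEY, body TEXT)")
append_to_journal({"id": "person-1", "name": "Tobias"})
batch, new_offset = consume_batch(offset=0)
apply_batch(db, batch)
```

Each store keeps its own offset into the journal, which is why the stack as a whole converges to the same state without a distributed transaction.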
And sometimes people will talk about how one of the things they want to do with their data is the single view of the customer. I've never seen a system do this very well without having to do a lot of the work yourself. When you think about it, the single view of the customer and this right to portability are similar problems: you want everything connected to an individual, but you really want to just highlight the things that are personally identifying. That's more for the portability piece — and you've got to have the right to remove that data if necessary.
So you can probably start to see that a graph database seems to be a really good structure to host data connected to things — and in this case, it's a person. What we need to make sure we're doing is, as the data comes in, we are cataloguing it so we know, hey, this job title of CEO came from the Salesforce contact record. And what this lineage allows us to do, if we want to change that record in our system — you can think of it like a master data management type of component — is: hey, Tobias has rung up our company, he's told us we actually have some of his data wrong, like we've got the wrong job title and the wrong phone number. We can change that in the CluedIn platform, and then it will basically say, hey, got it. This is what we call our mesh API. It can unravel and say, here are the queries that you need to run against the source systems to be able to change those values in the source systems.
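A toy version of that idea — using the lineage catalogue to turn one change made in the hub into the per-source updates that would have to be issued. The "mesh API" name is Tim's; everything else here, including the field names and the SQL shapes, is a made-up illustration:

```python
# Toy sketch: translate a change made in the hub into per-source update statements.
LINEAGE = {
    # canonical field -> where it originally came from
    "job_title": {"system": "salesforce", "object": "Contact", "field": "Title",
                  "key_field": "Email"},
    "phone":     {"system": "dynamics",   "object": "contact", "field": "telephone1",
                  "key_field": "emailaddress1"},
}

def mesh_updates(person_key: str, changes: dict) -> list[str]:
    """Produce the statements a caller would need to run against each source system."""
    statements = []
    for field, new_value in changes.items():
        origin = LINEAGE[field]
        statements.append(
            f"-- {origin['system']}\n"
            f"UPDATE {origin['object']} SET {origin['field']} = '{new_value}' "
            f"WHERE {origin['key_field']} = '{person_key}'"
        )
    return statements

for stmt in mesh_updates("tobias@example.com",
                         {"job_title": "Host", "phone": "+4512345678"}):
    print(stmt)
```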
Now, there is a lot of complexity around that. Right? What happens if we want to delete a record, and when we look back at the database we realize that it needs cascading deletes? It's going to throw us an exception saying, you can't just delete the contact, because he or she has referential integrity to these other tables. So there are always complexities, and that's why a lot of the time we give this kind of skeleton and framework for our customers to be able to implement it. But, Tobias, there's also this whole issue of unstructured data, and even structured data that's not clean. We often say to our customers: really, if you're wanting these types of privacy solutions, just know that it's more about confidence that you have the right data, rather than something that can just be 100% solved. And particularly in the area of data cleaning and being able to reconcile records across these different systems, I'm wondering what you found to be some of the most challenging aspects of that integration step, and any
[00:29:23] Unknown:
specific requirements for domain knowledge or manual intervention across industries or business verticals, and any steps that you incorporate as part of the onboarding process and ongoing execution of these integration routines to make sure that your customers are able to ensure that these integrations and reconciliations are happening appropriately?
[00:29:46] Unknown:
Yeah. Well, I think it starts with this idea of what we call our core vocabulary. You can think of it as a schema. Basically, what we've done is we've said, okay, listen: a person is a person. Whether they exist in Salesforce or Dynamics or HubSpot, they're still a person. They have these generic properties that exist no matter what the source system is. Now, the systems might call them different things — it might be "last name" in one and "surname" in another — but, effectively, they're the same thing. So this idea of the core vocabulary is what gives us this pillar of structure in what could otherwise be very unstructured. And the complexities with integration really start with the conceptual complexities, i.e., I know you and I, Tobias, could probably sit in a room with a whiteboard, with 3 systems to connect, and we could probably draw some boxes and say, oh, you have to join on this table, and that'll give you the ID to join on to tool 3, and things like that. But if you look at the enterprise, as you can imagine, it's not 3 tools. It's 300. It's 3,000. In some of our customers' cases, it's over 10,000 different systems. A lot of this is due to things like acquisitions, right? You suddenly inherit this whole new technology stack, and what do you do about it? So one of the complexities is there. And one of the ways that we help address this is that, well, if I knew the end goal of this blended data was a relational database or a data warehouse, I'd have to be really careful about how I model this data. I wouldn't want to model it in a way where querying it would take too long, or where joining across too many tables would simply never scale. But the graph database is really good at flexible modeling. There are things it's not good at, but that's one of the things it's really good at. So what this allows us to do at the integration level is say, hey, let's not only take a system at a time, let's take just an object at a time, and I'm not interested at all in how the product table and the customer table join. I'm not interested in that integrity.
I'm more interested in what's a unique reference to this person or customer. Is it their email? Okay, let's flag that. Is it their phone number? Well, it's not unique, but it's a potential alias. And what that allows us to do, in what is typically an ETL world where you might use something like SSIS or other ETL tools, is be much more ELT: hey, I'm actually going to load in all the data first, and I'm going to reverse engineer how these systems are joined. So you can probably understand that when we plug in one system, we've got this customer in the graph, and he or she is just floating in the graph saying, hey, I don't connect to anything. It might be one month later, when we've plugged in system 15, that this other node says, hey, we've got the same ID — we should merge. So those kinds of design principles have allowed us to scale to these larger installations.
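Here is a small sketch of those two ideas together — mapping source fields onto a shared vocabulary and merging records that share a unique reference such as an email address. The field names and the last-writer-wins merge rule are invented for illustration; CluedIn's fuzzy merging is far more sophisticated than this:

```python
# Illustrative core-vocabulary mapping plus merge-on-unique-reference.
VOCABULARY = {
    # source system -> {source field: canonical field}
    "salesforce": {"LastName": "surname", "Email": "email", "Gender": "gender"},
    "dynamics":   {"surname": "surname", "emailaddress1": "email", "sex": "gender"},
}

def to_canonical(source: str, record: dict) -> dict:
    """Rename a source record's fields into the shared vocabulary."""
    mapping = VOCABULARY[source]
    return {mapping[k]: v for k, v in record.items() if k in mapping}

def merge_by_email(records: list[dict]) -> dict[str, dict]:
    """Group canonical records by their unique reference (email) and merge fields."""
    merged: dict[str, dict] = {}
    for rec in records:
        key = rec.get("email")
        if key is None:
            continue  # no unique reference yet; it may connect when another system arrives
        merged.setdefault(key, {}).update({k: v for k, v in rec.items() if v})
    return merged

a = to_canonical("salesforce", {"LastName": "Ward", "Email": "tim@example.com"})
b = to_canonical("dynamics", {"surname": "Ward", "emailaddress1": "tim@example.com",
                              "sex": "M"})
print(merge_by_email([a, b]))
```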
I think the final point around cleaning is that places like CluedIn, as I described before, are a data hub. They are the place to really standardize on how we're going to represent gender, or how we're going to represent these categories or these labels. I'm not so worried about changing them in the source systems, because they're probably like that in the source systems for a very good reason, but I do want to make sure that any upstream consumers of ours are getting consistency and standardization. I don't want them to have to deal with: yeah, but in Salesforce it's called gender, and in Dynamics it's called sex, and in our custom-built system it's in Danish or another language. I don't want them to think about that at all. I want to standardize on that at the CluedIn level. And with the classic cleaning challenges that we see, there are some really nice edge cases. Something as simple as: hey, let's go get the gender field, let's group the values or cluster them, and let's fix things like someone misspelling "male", or someone having entered it in Danish, so it's "mand", or the equivalent for a woman. Let's clean those types of things. But sometimes we run into situations where it's not so easy. For example, you might have a system that's trying to optimize for performance, so it stores gender as a 0 or a 1. Let's just say that 0 represents a woman and 1 represents a man. They might do that in one system, but in another system the business might have flipped those. So it's not always as easy as just saying, hey, turn all the zeros into women and all the ones into men. There are these business cleaning rules that come in. So those are some of the complexities that we run into. And as far as being able to
[00:34:52] Unknown:
manage the data lineage or data provenance, and then be able to expose it to the customer so that, if there are any issues with the cleaning or reconciliation, or if there's just some incorrect data in the record, they're able to trace it back to the source system and maybe fix how the data is being transferred to you, or fix the source system itself or how they're capturing the data originally — I'm just wondering what you're using to manage that and some of the challenges
[00:35:20] Unknown:
and benefits that you found in the process of building that system. Yeah. I think, first, it's probably a good idea to start with the object structure that we use. You can think of it as kind of like a git object. Right? I would refer to it as a versioned object graph, where you've got all the history and all the permutations in this kind of binary, and we've leveraged that idea, and we call this model our clue. The interesting thing behind the clue model is that, hey, just because you have the data doesn't mean it's true. Right? We need to be statistically confident, and becoming statistically confident is something you get by throwing clues through our pipeline. What this allows us to do is say, okay, I generically take in clues, and all of those clues have an origin. All of those origins also need a little bit more detail — they need the account they were added under as well. We want complete uniqueness of where the data came in. This allows us to, I guess you could say, unravel this giant object — in some cases an absolutely huge object. You can imagine the version history of a very old Excel sheet leading to a potentially large clue object. So we maintain all of that history, and why it's interesting is that, I guess, the GDPR example is a good one: if I ever needed to purge the data, what I can actually do is say, hey, go get this record for Tobias.
Unravel all the history, remove the 13th clue that we got from Salesforce, and then reprocess it as if we never saw that clue in the past. So that's one way that we help with the lineage piece. But the other complexity with lineage is, well, where is the data going? One of the ways that we help with that is that we have our GraphQL endpoint, and every time you run a query it will actually generate a new API token, so we can use that to trace where that particular API token was used. The other way that we expose our data is still through GraphQL, but instead of this classic paging of data, it will stream the data out of us. It uses this kind of GraphQL streaming technique to say, hey, I've seen a new record — it's Tobias — and you have a GraphQL query that is looking for people called Tobias. Hey, I would match that. Where should I send you? And that stream can either be a classic stream, maybe something like a Kafka stream or a Spark stream, or it could just be something as simple as, hey, do an HTTP post over to this endpoint. Those are some of the ways that we're able to trace that, hey, Tobias' data has come in from Salesforce and Dynamics, it's gone through the lineage and tracking in our system, and then we pushed it over to Power BI and to Azure ML, so you have that kind of end-to-end lineage.
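A toy model of that versioned "clue" history and a GDPR-style purge — remove one clue from a record's history and rebuild the merged view as if it had never arrived. This is entirely illustrative; CluedIn's clue format is not described in detail here, and all of the names below are invented:

```python
# Toy model of a versioned record history ("clues") with purge-and-reprocess.
from dataclasses import dataclass

@dataclass(frozen=True)
class Clue:
    clue_id: int
    origin: str             # e.g. "salesforce" or "dynamics"
    account: str             # which account/connection it was added under
    fields: dict

def reprocess(clues: list) -> dict:
    """Rebuild the merged view of a person by replaying the surviving clues in order."""
    merged: dict = {}
    for clue in sorted(clues, key=lambda c: c.clue_id):
        merged.update(clue.fields)
    return merged

def purge_clue(clues: list, clue_id: int) -> list:
    """Drop one clue from the history, e.g. to honor an erasure request."""
    return [c for c in clues if c.clue_id != clue_id]

history = [
    Clue(12, "dynamics", "acct-eu", {"name": "Tobias", "phone": "+4511111111"}),
    Clue(13, "salesforce", "acct-us", {"job_title": "Host"}),
]
print(reprocess(history))                  # full view
print(reprocess(purge_clue(history, 13)))  # view as if clue 13 never arrived
```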
[00:38:11] Unknown:
And on this problem of managing the destinations, and being able to push out to these systems that other people are using for analysis, but also on the other side being able to consume the data — given the number of integrations that you mentioned you have, I'm wondering how you manage any changes in the APIs or the data formats, or any breakage or failures that might occur because of network outages or system failures or maintenance on the third parties, and just your overall strategy for
[00:38:44] Unknown:
ensuring that your system remains operable in the face of all these potential edge cases and failure modes? Yeah. I will start by saying this: the hardest part of CluedIn is the maintenance of third-party integrations. I think one of the sanity points for us is that because our focus is on enterprise customers, they are well aware that this happens. Right? If we're going to plug Dynamics 2017 into our system, I know, in the world we live in, that people are going to introduce new objects in Dynamics, they're going to introduce new properties, and how do we handle those? Well, in a lot of cases — I'll take the Dynamics or Salesforce example — you can probably already see that you can do absolutely anything you want in those systems. You could branch out and not only use them for leads and contacts, but put in animals or hot dogs or any object you could think of. Now, a lot of these more enterprise tools — that was a silly example, I'm aware — a lot of these tools expose discovery endpoints where you can actually say, okay, instead of just guessing what your endpoints are, why don't you tell me what objects are available in your system? This really only happens in the more enterprise types of tools. Take something like HubSpot, which is a CRM targeted more toward small and medium businesses. It happens all the time — and you can probably imagine this as well — that they update their API and the documentation doesn't reflect that. So one of the interesting ideas and processes we had in the past was: well, why don't we just watch the API documentation pages? When there's a content change, it alerts our system and our team that, hey, something's changed on the page, and we would just use a standard diff tool to tell us what changed. And this is just flawed to start with, because the smaller companies are usually also innovating so fast — they want to get these things out — and what happens in the end is that there are times when this just breaks.
It just breaks. A good example would be HubSpot no longer giving you back a list of people and instead giving you back a dictionary of people, and the serialization in your crawler didn't cater for that. The good thing is that you get signals from the system immediately that you have this issue. And in a lot of other cases, platforms are getting better and better at versioning — semantic versioning, and making sure that they put versioning in their API. So they guarantee that if you're using version 2, it's never going to change; but if you want to move to version 3 with the new stuff, sorry, you're going to have to do some rework to deserialize that and move to the new models. So it does happen, and it's a big piece of the work that we do, but with proper alerting we're pretty quick to act. It's only in these cases where there have been such fundamental changes, and that doesn't happen often. Like I said before, one of the sanity checks for us is that our customers host this infrastructure themselves.
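That list-versus-dictionary breakage is easy to picture with a small defensive-parsing sketch — a generic illustration, not CluedIn's crawler code, and not HubSpot's actual payload shapes:

```python
# Defensive parsing for an upstream API that changed its payload shape.
def extract_people(payload) -> list:
    """Accept either the old shape (a bare list) or a newer dict wrapper."""
    if isinstance(payload, list):          # old API: [ {...}, {...} ]
        return payload
    if isinstance(payload, dict):          # new API: {"results": {"0": {...}, ...}}
        inner = payload.get("results", payload)
        return list(inner.values()) if isinstance(inner, dict) else list(inner)
    raise TypeError(f"unexpected payload type: {type(payload).__name__}")

old_style = [{"name": "Tim"}, {"name": "Tobias"}]
new_style = {"results": {"0": {"name": "Tim"}, "1": {"name": "Tobias"}}}
assert extract_people(old_style) == extract_people(new_style)
```

The `TypeError` branch is what surfaces the immediate signal Tim mentions: the crawler fails loudly rather than silently dropping records.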
And so, typically, they've got people dedicated to the systems they've plugged in — Oracle databases and Workday and other enterprise systems — and those vendors are usually a little bit more mature around their release policies. So I guess what I'm saying is: it happens, it's a very hard problem to solve, and I think the way to solve it is that businesses really need to adopt more standards. Right? We need to move towards much stronger standards of, hey, if you want to be a good business and expose your data, you need to adhere to at least these rules of the game. And, actually, one of the places we're seeing this, in an odd kind of space, is in the authentication space.
There are some proposals for standards out for the new OAuth. OAuth 2, of course, is one of the industry standards for authentication between systems. In the new version, they're talking about, hey, how can I just have applications talk directly to each other? How can I automatically spin up new applications without having to do it through some type of developer portal? So we are seeing these shifts towards more standardization, but I believe it's the only way that we'll actually get sanity and
[00:43:07] Unknown:
any control over the complexities of integration. And to your point about being able to alert on different failure cases, or being able to keep track of the overall health of the system, I'm wondering what types of metrics you're using to keep an eye on that. And in the environments where you are actually deploying the system to customers' infrastructure,
[00:43:29] Unknown:
I'm wondering how you manage the overall life cycle and deployment model to ensure that they're able to stay up to date with the system and that they don't have too much drift — where, you know, they may decide 5 versions from now that they're actually going to upgrade — versus staying current with the way that you're architecting the system. Yeah. I think there are a couple of technology choices that we bet on early that have turned out really well for us in this space. I guess the first thing is that, coming from an enterprise background, upgrading was always one of the big issues. How do you get your customers to upgrade? In many cases, it will cost that customer quite a lot of money to put in the effort to upgrade your product. And so, from day one, we bet on Docker. We bet on containerization.
Then we took a bet on these orchestration frameworks. And what our deployment to our customers looks like now is: hey, here are some Docker containers in a hub, here is Kubernetes, here are Helm charts. Feel free to use the services in the cloud to deploy those, or deploy this into your VMware environment. And when I'm thinking about signals and metrics, I've actually split them into two kinds of fields. It used to always be around DevOps types of metrics, but I actually think our signals are moving more into the DataOps side of things. For DevOps, of course, there are lots of metrics, and lots of third-party tools — a lot of them actually open source and fantastic — that just have native support for Kubernetes.
So spin up your Docker containers, spin up all your databases, put it into a Kubernetes cluster, orchestrate that with Helm charts as well. Oh, and let's bring in Grafana, because that comes free. Let's put in StatsD, because that's got native support for it. So for the DevOps side, we really just say, hey, we're going to stay out of it. Right? We're going to give you industry-standard ways of deployment, and then you pick your tool of choice. Now, the signals on the DataOps part become interesting, because that's kind of a new field. When I talk about DataOps, it's more that if you want to be data driven, it's about how you maintain a consistent flow of data throughout your business — and the thought that something as simple as a password change meant that data wasn't flowing from Salesforce for 24 hours or longer, I refer to that as DataOps. So some of the signals and metrics we're sending from our system are things like: you've got an expiry on a token soon; would you like to generate a new one and schedule for me to switch it in and out at a certain time? A lot of the other metrics, of course, I've already touched on, would be things like data quality and data accuracy.
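A tiny sketch of that kind of DataOps signal — checking connector credentials for upcoming expiry so data keeps flowing. This is illustrative only: the warning window, the connector structure, and the idea of printing rather than emitting a real metric are all assumptions, not CluedIn's implementation:

```python
# Illustrative DataOps-style check: warn before a connector's token expires.
from datetime import datetime, timedelta, timezone

CONNECTORS = [
    {"name": "salesforce", "token_expires_at": datetime.now(timezone.utc) + timedelta(days=3)},
    {"name": "dynamics",   "token_expires_at": datetime.now(timezone.utc) + timedelta(days=40)},
]

def expiring_tokens(connectors: list, within_days: int = 7) -> list:
    """Return connector names whose tokens expire within the warning window."""
    cutoff = datetime.now(timezone.utc) + timedelta(days=within_days)
    return [c["name"] for c in connectors if c["token_expires_at"] <= cutoff]

for name in expiring_tokens(CONNECTORS):
    # in a real deployment this would emit a metric or alert instead of printing
    print(f"data-flow risk: token for '{name}' expires soon; rotate it before it lapses")
```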
And that's kind of more — I guess it's not as much on the DataOps side, but it's still important. Because what I see becoming more important in fields like ours is: how do I have people monitoring that I'm sending good-quality data upstream? Everyone's relying on it, and, to be honest, if I fix a few things, there are multiple people who win from it. Right? We're all listening off the same pipe, and whether it's a BI system or an ML system, they all want clean, good-quality, enriched, complete data — and blended data is the other big thing as well. So I think the signals for us really break down into the DataOps part and the DevOps part. And
[00:46:59] Unknown:
what have been some of the most notable customer success stories that you've experienced
[00:47:04] Unknown:
or interesting or unexpected ways that people have used the CluedIn platform? Mhmm. The one that always stands out for me is one of our cases around data privacy. As a lot of businesses were, it was getting to that time in May 2018, and a lot of companies were scrambling: what do we do? We had a larger company come to us — they're just over 100,000 employees — and they said, hey, it's March. We have 124 applications, we're over 100,000 employees, we have 25 or 30 years of historical data, and we want to hit May 25th. What can we do about that? And once again, I think it came down to some core technology choices, like containerization and clear separation of concerns and backing the system with a service bus. So, hey, to scale this thing, you just introduce more machines, subscribe to the buses you want, and you're processing faster. Of course, the trick with that is that every time you scale, you move the bottleneck somewhere else. So it was an interesting challenge in getting the system to the point where, basically, the bottleneck is the sources: we can't pull any faster because they're throttled, or because of the quality of service that we need on the line, because these are production systems.
I think that's one of the great stories: just being able to integrate that amount of data within a few months, and to have them hit that date with high confidence that they could be compliant and fulfill the regulations. That was one. And I think, with some of the banks that we've been working with, a lot of people are suffering from this history of siloed data. One of our customers has over 10,000 systems, and we don't integrate into all of them — in fact, it's a smaller subset — but the part I like about it is that we gave them a view that this could be done. It's daunting to think how you would integrate 10,000 systems, and now, with this approach where you take one system at a time, and even one object at a time, I like the fact that it's giving them this visibility of: we could actually do this. And what are some cases where a customer was interested in using CluedIn and it either wasn't the right fit, or they tried to get started with it and they weren't able to meet the desired end goals that they set out with when they first started engaging with you? Yeah. Definitely. I mean, when you enter into the land of machine learning, you enter into the land of magic. So we've had some pretty fantastic examples of failures.
I think one of my favorites is when we brought on our first customer in the food industry. And, you know, as we were integrating their systems, there were a lot of PDF documents in file storage, and they were actually all the menus of all the different restaurants that they supported. They were kind of like an aggregator of restaurant data. And, well, due to the way that statistics work, and that's how a lot of these natural language processing techniques work as well, it's just based off statistics, it detected all these menu items, like spicy chicken as a person and lamb korma as a person, and our system would kind of pop up and say things like, hey, here's the phone number of Spicy Chicken, would you like to give them a call? And, you know, it's really hard.
It's really hard in those situations to explain that this is not done off rules, it's not done off business logic, it's done off statistics. We all had a laugh about it, but for me it showed the reality of machine learning and where it's useful and where it's not. So I think that's one of my favorite failure stories. I have to tell one more, because I always do this to embarrass our CTO, but I think you'll get a laugh out of it. Our CTO, Martin, kind of looks like your classic Norseman. He's got very long hair, wears death metal shirts, and has a beard. He kind of looks a little bit like Jesus, like the pictures and drawings of Jesus. So we were using image object recognition to basically scan through images and say, hey, there are some objects in this image. To test out the system, I placed a picture of myself, it was me at my brother's wedding, and it picked up glasses and suit and bow tie, and we thought, wow, this is amazing. Like, how does this work? And we put Martin's picture through the same engine about 30 seconds later, and with 90% confidence it had classified him as a fur coat. His name is Martin Hildall, and you really just need to look up a picture of him to get the full effect of the joke, but those are the kind of failures we run into. Yeah, I think I might have actually seen the example photo when I was looking through your blog to prepare for the show, and I got a good laugh out of it at the time as well.
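For a sense of what "done off statistics, not rules" looks like in code, here is a tiny sketch using spaCy's pretrained English model as a stand-in (the conversation doesn't name CluedIn's actual NLP stack). Whether a dish name like "Spicy Chicken" comes out tagged as a PERSON depends entirely on the statistics the model learned from its training data; there is no business rule anyone can point to and fix.

```python
# Small statistical named-entity-recognition sketch using spaCy as a
# stand-in; CluedIn's real NLP pipeline is not specified in the episode.
import spacy

# Assumes the small English model has been installed
# (python -m spacy download en_core_web_sm).
nlp = spacy.load("en_core_web_sm")

text = "Our favourites on the menu were Spicy Chicken and Lamb Korma."
doc = nlp(text)

for ent in doc.ents:
    # Each label is a statistical guess, not a lookup against a known list,
    # which is exactly why menu items can come out labelled as people.
    print(ent.text, ent.label_)
```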
[00:52:36] Unknown:
It's one of those things that never fails to make me laugh as well. And so are there any other cases where you found that CluedIn is not the right choice, and a company or organization is better off using a different system, whether it's because of the size of the organization and the complexity that they're dealing with, or because of the overall goals that they have for managing their data integrations, or any sorts of issues with control or visibility into the system? Yeah. I think one of the big things is that
[00:53:08] Unknown:
a lot of people are throwing this word around of, like, real-time data. Right? So we've got IoT data and signal data. I would say to a person that we play no role in that world. Right? Because it's kind of like this classic saying we have in engineering: you've got all these dials, right? Speed, cost, and so on. If you move one, the others move accordingly as well. And for me, IoT data is about, hey, I need to stream it in as fast as I can from the plane or the jet or the wind turbine, and I need to put it into a logging system that shows me real-time sensor data. And cleaning and accuracy and quality of data don't fall into that as much, because, well, if the data is incorrect, maybe that's just a faulty reader that you've bought. Right? Now, that's easier said than done, because people buy different readers and they all send data in different types of formats, but I would say that's one case where it's, let's just stay out of that. As soon as you don't want anything tracked, you don't want it cleaned, you don't want it blended, there's no need for you to put your data through our pipeline. But I think, overall, it's still nice to be able to set up an overall foundation for a company where you say, okay:
IoT data and signal data maybe stream into something like, I don't know, a Kafka topic, or maybe you're using IoT Hub or Event Hub in Azure, and you just want to push it directly to the logging application. CluedIn just really doesn't play a role in that story.
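As a rough illustration of the path Tim is describing, here is a minimal sketch that pushes a sensor reading straight onto a stream, using kafka-python as one possible transport; the topic name and the reading fields are hypothetical. The point is that this data goes directly to a logging or time-series system and never passes through a mastering pipeline like CluedIn's.

```python
# Minimal "stream it straight to the log" sketch for IoT/signal data,
# using kafka-python; topic name and reading fields are hypothetical.
import json
import time

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

reading = {
    "sensor_id": "turbine-42",   # hypothetical wind-turbine sensor
    "timestamp": time.time(),
    "rpm": 1480,
    "temperature_c": 61.3,
}

# Fire the reading directly at the stream; downstream it lands in a
# logging/time-series system rather than a data-mastering pipeline.
producer.send("sensor-readings", reading)
producer.flush()  # block until the reading has been handed to the broker
```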
[00:54:45] Unknown:
And what are your plans for the future of CluedIn, both from a technical and a business perspective?
[00:54:58] Unknown:
Robustness. So I think from a feature set and functionality perspective, we're really happy with where we are, and, really, it's more about making sure that we can adapt to new things that will come along in our industry. I mean, it's going so fast. I didn't know what a data lake was a few years ago, and now it's the kind of thing that everybody wants. And, you know, in two years there'll be new things, and it's really about how we make sure we're building in the robustness and flexibility to be able to adapt to those new requirements. So from a technical side, it's also making sure that we're always keeping up with industry standards and that we're always moving to frameworks that are better. For example, you know, we're mostly built in .NET, but we've moved to .NET Core to get the wins from that. From a business side, it's more about, you know, moving into other countries where they have these same issues. We recently have picked up customers in Australia and the UK and the US, and, you know, you can probably imagine these problems exist everywhere in businesses. So I think that's the plan from the business side. And are there any other aspects of CluedIn or data integration
[00:56:12] Unknown:
or data engineering that we didn't discuss yet that you'd like to cover before we close out the show? I think,
[00:56:18] Unknown:
I think I've used about as many data acronyms as anyone can take. So I think
[00:56:24] Unknown:
I think we've had a good discussion. Alright. Well, for anybody who wants to get in touch with you or follow along with the work that you and your company are doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:56:42] Unknown:
Yeah, good question. I mean, I think I'm naturally biased when I say this, but I think it's this stitching. Right? It's this, yeah, I can go out and I have a plethora of choices of technology, but how are these things going to work together? I mean, you know, we can't forget that integration is not just about getting from data sources into a pipe. It's from the pipe into the governance, and from governance into the lineage, and all those different components. So, you know, I'd like to see us, in a way, standardize on what is the plumbing that businesses need, and I think that's the biggest piece that is missing from the overall story, and why companies are finding it so complex to get value out of their data. Alright. Well, thank you very much for taking the time today to join me and discuss the work that you're doing at CluedIn. It's definitely a very interesting
[00:57:32] Unknown:
platform and a challenging problem domain, so it's always great to get a view of how different people are solving it. So I appreciate that, and I hope you enjoy the rest of your day. Thanks, Tobias. Pleasure.
Introduction and Welcome
Interview with Tim Ward: Introduction and Background
Tim Ward's Journey into Data Management
Defining Data Fabric
CluedIn's Mission and Origins
Ideal Customers and Use Cases
Data Privacy and Security in CluedIn
Data Processing Lifecycle in CluedIn
Data Lineage and Provenance
Challenges in Data Cleaning and Integration
Managing Data Lineage and Provenance
Handling API Changes and System Failures
Metrics and Monitoring in CluedIn
Customer Success Stories
Challenges and Failures in CluedIn
Future Plans for CluedIn
Closing Remarks and Final Thoughts