Summary
In this episode of the Data Engineering Podcast Dan Bruckner, co-founder and CTO of Tamr, talks about the application of machine learning (ML) and artificial intelligence (AI) in master data management (MDM). Dan shares his journey from working at CERN to becoming a data expert and discusses the challenges of reconciling large-scale organizational data. He explains how data silos arise from independent teams and highlights the importance of combining traditional techniques with modern AI to address the nuances of data reconciliation. Dan emphasizes the transformative potential of large language models (LLMs) in creating more natural user experiences, improving trust in AI-driven data solutions, and simplifying complex data management processes. He also discusses the balance between using AI for complex data problems and the necessity of human oversight to ensure accuracy and trust.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
- As a listener of the Data Engineering Podcast you clearly care about data and how it affects your organization and the world. For even more perspective on the ways that data impacts everything around us don't miss Data Citizens® Dialogues, the forward-thinking podcast brought to you by Collibra. You'll get further insights from industry leaders, innovators, and executives in the world's largest companies on the topics that are top of mind for everyone. In every episode of Data Citizens® Dialogues, industry leaders unpack data’s impact on the world; like in their episode “The Secret Sauce Behind McDonald’s Data Strategy”, which digs into how AI-driven tools can be used to support crew efficiency and customer interactions. In particular I appreciate the ability to hear about the challenges that enterprise scale businesses are tackling in this fast-moving field. The Data Citizens Dialogues podcast is bringing the data conversation to you, so start listening now! Follow Data Citizens Dialogues on Apple, Spotify, YouTube, or wherever you get your podcasts.
- Your host is Tobias Macey and today I'm interviewing Dan Bruckner about the application of ML and AI techniques to the challenge of reconciling data at the scale of business
- Introduction
- How did you get involved in the area of data management?
- Can you start by giving an overview of the different ways that organizational data becomes unwieldy and needs to be consolidated and reconciled?
- How does that reconciliation relate to the practice of "master data management"?
- What are the scaling challenges with the current set of practices for reconciling data?
- ML has been applied to data cleaning for a long time in the form of entity resolution, etc. How has the landscape evolved or matured in recent years?
- What (if any) transformative capabilities do LLMs introduce?
- What are the missing pieces/improvements that are necessary to make current AI systems usable out-of-the-box for data cleaning?
- What are the strategic decisions that need to be addressed when implementing ML/AI techniques in the data cleaning/reconciliation process?
- What are the risks involved in bringing ML to bear on data cleaning for inexperienced teams?
- What are the most interesting, innovative, or unexpected ways that you have seen ML techniques used in data resolution?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on using ML/AI in master data management?
- When is ML/AI the wrong choice for data cleaning/reconciliation?
- What are your hopes/predictions for the future of ML/AI applications in MDM and data cleaning?
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- Tamr
- Master Data Management
- CERN
- LHC
- Michael Stonebraker
- Conway's Law
- Expert Systems
- Information Retrieval
- Active Learning
[00:00:11]
Tobias Macey:
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. It's 2024. Why are we still doing data migrations by hand? Teams spend months, sometimes years, manually converting queries and validating data, burning resources and crushing morale. Datafold's AI-powered Migration Agent brings migrations into the modern era. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing.
Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today to learn how Datafold can automate your migration and ensure source to target parity. Your host is Tobias Macey, and today I'm interviewing Dan Bruckner about the application of ML and AI techniques to the challenge of reconciling data at the scale of business. So, Dan, can you start by introducing yourself?
[00:01:09] Daniel Bruckner:
Yeah. Thanks, Tobias. It's a pleasure to be here. I'm Dan Bruckner. I'm a cofounder and CTO at Tamr. I've been solving problems in this space for, I don't know, going on 15 years now. And we build solutions for master data management, using AI and machine learning to simplify MDM and make MDM projects successful.
[00:01:34] Tobias Macey:
And do you remember how you first got started working in data?
[00:01:37] Daniel Bruckner:
Yeah. It goes way back. So I actually started out as a physicist. My first job out of college was working at CERN on the LHC, and it was in the days before the LHC had actually started; it was just getting going. And so most of what I did was actually write code and solve computational problems. In those days, we were doing analysis over large volumes of simulated data, trying to model the system and get a handle on our expectations for what was going to happen. So I did that. And as I was doing it, with the system not yet running, I got more interested in the computational problems I was working on and the code that I was writing.
And so when I got back to the States, I decided to pivot and move into computer science. I started programming, and then got interested in computer science research. Because of my background, I naturally gravitated into data and large scale data processing and database systems, and eventually started working with Mike Stonebraker at MIT on research into large scale data integration, approached holistically and using machine learning techniques, applying those techniques in ways that scale to extremely large volumes of data.
[00:02:58] Tobias Macey:
And before we get too much into the application of ML techniques to that challenge of processing data, reconciling it, getting it into a usable state, I'm wondering if you can just start by giving a bit of an overview of some of the different ways that data at the organizational scale becomes unwieldy and some of the challenges that arise from that lack of reconciliation?
[00:03:26] Daniel Bruckner:
Yeah. It's a class of problem that I think is very common, taken for granted, and also not necessarily deeply understood. I like to start from an analogy to software engineering and Conway's Law. Are you familiar with Conway's Law?
[00:03:44] Tobias Macey:
I am. That the software design will eventually reflect the organizational communication patterns for better or worse.
[00:03:52] Daniel Bruckner:
That's exactly right. So the structure of your organization dictates the structure of your software architecture. And the same is true to a large extent in data and data management. The structure of data within a large organization is naturally going to reflect the structure of the teams and the groups and the divisions that created that data. And that can be a very good thing. It means individual teams can operate naturally and independently and use the data that they need to be successful and to do what needs doing. But it also creates big challenges and missed opportunities when you start to move up a level and want to reason about, change, and ask questions of the data across the whole organization.
Different teams are speaking fundamentally different languages. They often have redundant, duplicated data. And it can be very hard to actually use that data to communicate and to make high level decisions within the org. From a nuts and bolts perspective, what kinds of issues are we talking about? As basically a database guy, I come back to the kinds of problems we're interested in: fuzzy unions (putting together schemas across different databases), fuzzy joins, and fuzzy group-bys. So, essentially, cases where you would like to treat large sets of data as a coherent whole, a single database, but you don't have the keys. You don't have the common attributes.
You don't have the common identifiers, and so you're not actually able to just directly go and ask the questions you want to ask. First, you have this problem of mechanically getting all the data together, linking it up, and getting a coherent picture that you can query and use for applications, analytics, whatever it is you're trying to accomplish.
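To make the "fuzzy join" idea concrete, here is a minimal sketch: two toy customer tables with no shared key, linked by normalized-name similarity. The tables, the normalization rules, and the 0.85 threshold are all illustrative assumptions, not a description of any particular product.

```python
# A "fuzzy join": linking two tables that lack a shared key by comparing
# normalized names. Real systems add blocking, trained models, and richer
# features; this shows only the core idea.
from difflib import SequenceMatcher

crm = [
    {"id": "c1", "name": "Acme Corp.", "city": "Boston"},
    {"id": "c2", "name": "Globex LLC", "city": "Austin"},
]
billing = [
    {"id": "b7", "name": "ACME Corporation", "city": "Boston"},
    {"id": "b9", "name": "Initech", "city": "Denver"},
]

def normalize(name: str) -> str:
    # Lowercase, strip punctuation, and drop common legal suffixes.
    cleaned = "".join(ch for ch in name.lower() if ch.isalnum() or ch == " ")
    stop = {"corp", "corporation", "llc", "inc"}
    return " ".join(tok for tok in cleaned.split() if tok not in stop)

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

# Keep pairs whose names are similar enough; 0.85 is an arbitrary cutoff.
links = [
    (left["id"], right["id"], round(similarity(left["name"], right["name"]), 2))
    for left in crm
    for right in billing
    if similarity(left["name"], right["name"]) > 0.85
]
print(links)  # [('c1', 'b7', 1.0)]
```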
[00:05:53] Tobias Macey:
And given the reflection of Conway's Law in that data ecosystem for the business, what are some of the attributes of either scale or team dynamics that you see being the biggest contributors to that messiness and that lack of cohesion that brings out these problems?
[00:06:15] Daniel Bruckner:
Yeah. I mean, depending on the scale of the organization, there can be many. But the most common case is that datasets come from applications. They come from processes, whether software based or not, that are well established and designed not primarily to create data, but to solve some problem for the business. So sales, marketing, these basic things that companies do. As a side effect, they produce these piles of data. And the teams that work with those processes and applications are very vested in the way that things work. If you have other teams come in, data teams most frequently, to do analytics and look across different groups, different parts of the org, there's a natural conflict that arises: well, we would like it all to look this way; we think this would solve the problem for the whole organization better. And teams say, no, that's not how we operate. We can't do that. You can't just come in and change our process, change our data.
We've been doing this in this way forever. The problem gets worse the larger the organization gets, and especially for companies that grow through acquisitions and mergers. You start bringing in data that's arisen not just from different teams, but from completely different organizations, and start trying to put it together and consolidate. And those kinds of small inconsistencies can really start to undermine the process of finding a good way to operate and put all the data together coherently.
[00:07:49] Tobias Macey:
And so that process of reconciling data, bringing it together in a way that makes organizational sense so that you can start to ask those questions across the business is largely called master data management or building golden records. And I'm wondering if you can talk to some of the typical approaches that teams and organizations try to take to be able to actually embark upon that process of building those master records and reconciling that data and some of the scaling challenges that they run into, whether that's in terms of scaling at the compute level or scaling just in terms of time, effort, and human capacity?
[00:08:32] Daniel Bruckner:
Yeah. That's a good question, and a big question. So breaking that down a bit: master data management really does cover the heart of this problem of linking different datasets together. There are a number of stages in a successful master data management project that you have to move through. One stage even starts just ahead of getting into master data, which is physically getting the data together and treating the data quality problem: establishing a common level of quality, often pulling in third party source data or reference data to enrich it, and getting your base to a good spot.
Then, okay, you have a set of sources, different data tables, database systems. You put them in one physical place, and now you want to link them together. You want to create the point of reference across common records and solve that linkage problem, the entity resolution problem. Once you've done that, great, now we have a common identifier that we can use. You're going to draw in all the data from these systems and attempt to consolidate it and produce golden records. So now you have an identifier that links source data.
And for each identifier, you have a golden record: this is the truth about this customer or this supplier or this part in our organization. So you produce that record, and now you want to manage it over time. As you go farther, you're going to want to push more of that out to the source systems themselves and to downstream applications and analytic engines. So, essentially, you solve this problem of the coexistence of master data on one hand and all these operational and analytical datasets that exist everywhere in the organization. The physical problem of linking those things together and keeping them consistent becomes a big challenge as you start to operationalize the master data.
And in different scenarios, different use cases, different folks will focus on different parts of this journey through master data management. Maybe some projects only require getting that identifier: throw the data together, get the identifier, great, that's all we needed, we can run with it. Maybe you're just doing some analytics, so you do that as a one off. Every quarter, you produce a report. So we refresh our data, we get this high level of integrity with our master data, we generate our report, and we're good to go.
As you move farther along and want to actually take that data, operationalize it, use it on an ongoing basis, and keep it fresh constantly, so that as new data comes into operational systems it's immediately mastered and incorporated with the master data, you have to go farther in this journey of pulling together and closely integrating your master data system with your operational database systems and other applications.
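A toy illustration of the golden-record step described above: records that the linkage stage has already clustered under one entity are merged attribute by attribute under a survivorship policy. The "most recent non-null value wins" rule here is one assumed policy among many ("most trusted source wins" and "most frequent value wins" are equally common).

```python
# Toy golden-record consolidation for one entity cluster.
# Survivorship policy: prefer the most recently updated non-null value.
from datetime import date

cluster = [
    {"source": "erp", "name": "Jane Doe", "email": None,
     "address": "12 Elm St", "updated": date(2022, 3, 1)},
    {"source": "crm", "name": "Jane A. Doe", "email": "jane@example.com",
     "address": "98 Oak Ave", "updated": date(2024, 6, 9)},
]

def golden_record(records: list) -> dict:
    golden = {}
    for field in ("name", "email", "address"):
        candidates = [r for r in records if r[field] is not None]
        best = max(candidates, key=lambda r: r["updated"])  # most recent wins
        golden[field] = best[field]
    return golden

print(golden_record(cluster))
# {'name': 'Jane A. Doe', 'email': 'jane@example.com', 'address': '98 Oak Ave'}
```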
[00:11:39] Tobias Macey:
And the canonical example that's often brought to bear in this context is the customer record: this is our customer, and these are all of the attributes about them. Then there's the challenge of which system we actually trust the most to collect that information accurately, or that different systems collect different pieces of information. And when you're dealing with people, they change locations. So you have to make sure that you have the current address, but you also want to know their old addresses. So then you have the issue of historizing that data, and this applies across other business objects beyond just your customers.
And I'm wondering if you can talk to some of the people problems of figuring out what those decision points are: the ways we determine which place we actually trust the most for which pieces of that data, and then managing the merging of those attributes from multiple systems. Being able to say, this is the thing we trust the most; that other system over there has different information, so we're going to ignore it here, but over there we're going to use it. I'm just wondering about some of the ways that organizations have to wrestle with that kind of constant decision making about what data to use where, when, and how.
[00:13:04] Daniel Bruckner:
Yeah. I think what you're picking up on is a really key observation about master data management as a problem space. It's not just a technical problem. If it were just a technical problem, putting data together and creating a coherent knowledge graph, well, we know how to do that. We can do that. In real organizations, it's also a political problem. So you're not just trying to get the data to agree. You're actually trying to get these different teams to agree, coexist, and each have their own special view of the data. Because the reason the data silos were created in the first place was all of these teams operating independently and efficiently.
Pulling together those silos, you need to make sure that you don't actually interfere with the independent, happy, trustful operation of everyone who created them. And what it comes down to is solving the master data management problem less from a dictatorial "we will come up with the one standard that will work for everybody" kind of approach, and more by creating a repository for the linkage: a system and a common touchpoint for all of these different silos and applications to touch base and stay closely linked in a clean way. One of our early customers, a very large manufacturer, when we started working with them, essentially said: okay, here's our history in master data management. We're a company with many lines of business and different divisions.
We have 26 different major ERP systems. We have more, but the long tail is too much to worry about. All of our parts, all of our suppliers exist across all these 26 systems. We've had several efforts at master data management. And what happens is we go in, we pick some of these systems, the largest, most popular ones that we think are the most trustworthy. We collect the data. We consolidate it. We create this master. We have a new identifier. And at the end, no one wants to use it. We now have 27 systems for all of our supplier data and all of our parts data. And so if you do the technical work, but don't also do it in a way that meets the consumers of the data where they are, then the project can be a failure and essentially just make the problem worse.
So it's really critical to find the way to not just create a standard, but create a system that bridges the gap and maintains the connectivity between all these different consumers, and does it in a scalable way. If you take 3 of 20 systems and say, this is how we're consolidating, well, what about the 17 other teams? Their data's gone now. They don't know the new identifier. They have no frame of reference. So you need an approach that can scale to handle the whole problem.
[00:16:17] Tobias Macey:
The other interesting piece of this is that business intelligence and data warehousing have existed in some fashion for at least the past 30 years, give or take. And so you would think that given that time span, this is a problem that would have been solved at least reasonably well by now. And yet even today, it's still a challenge that organizations are tackling and starting new projects on today, tomorrow, next week. And I'm wondering what are some of the evolutionary aspects of the problem that lead us to keep revisiting it and resolving it, organization after organization, rather than it being a well established, well understood, more or less solved problem?
[00:17:06] Daniel Bruckner:
Yeah. It's a good question. I'd say master data management is going on about three decades old now, so companies have been building systems to solve this problem for a while. The traditional systems tend to focus on using sets of rules and strict data models to put together data from source systems, and they tend to focus more on the operational side. You do some basic data quality, you do some basic data integration, but, fundamentally, you get your set of golden records and then, okay, put that in a database, let's go and use that. They tend to focus on supporting applications downstream, but they don't necessarily do a great job of pulling in lots of data from the organization, linking it together coherently, cleaning it up, enriching it, and making sure that the master data itself is actually the best view there is of the data, has the highest possible quality, and has all of this linkage across the organization.
So what's happening currently is that the application of AI and machine learning to these problems actually unlocks much better solutions and the ability to tackle this problem much more holistically, and in a much higher fidelity way than has happened traditionally.
[00:18:34] Tobias Macey:
So to that point of the application of ML and AI in this ecosystem: machine learning in various forms has been used with various levels of success in this context. You mentioned rules based systems; that's maybe the expert systems era of AI, which we have largely moved past. And then there have been a lot of different natural language processing techniques used for trying to do some of that entity extraction and entity resolution. And I'm wondering if you can just talk to some of the evolutionary aspects of the application of ML and AI to the problem of master data.
[00:19:12] Daniel Bruckner:
Yeah. Absolutely. You're exactly right to start with the rules, because I don't want to say that rules are the wrong way to solve this problem. They're actually very good for the right use case. But just for context, the fundamental nature of dealing with dirty data is like the Tolstoy line: all happy families are alike, but every unhappy family is unhappy in its own way. It's true of data too. All bad data is bad in a different way. And so you need a lot of tools in your toolkit. Traditionally, rules were the approach. If you come up with a good data model, if we just model the problem well enough, if we have a customer model and schema that's good enough, then we can put all customer data into that schema, and we can define rules for how it should work.
The reality is that data can mean lots of subtly different things and be used in subtly different ways, and it pretty much always is. So you have to always be ready to account for these slight differences in granularity or slight differences in shade of meaning. Essentially, beyond rules, fuzzy matching becomes big, and that starts to lead into natural language processing techniques, and especially techniques from information retrieval. Applying scalable methods from text search goes a very long way in dealing with fuzziness and solving fuzzy matching problems.
Beyond that, you start to get into statistical techniques and traditional machine learning: building models to classify matches between data, to classify groups and taxonomies of data, and to look for different characteristics of the data to perform reconciliation and consolidation. And once you've entered into this statistics and machine learning world, the sky's the limit. Techniques from 30 years ago in information retrieval are great, but you can move all the way through to what we have today with large language models and generative AI, and apply that to the problem as well.
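One common way that information-retrieval lineage shows up in practice is TF-IDF over character n-grams: the same machinery as text search, repurposed to score record similarity cheaply at scale. A sketch using scikit-learn as one readily available implementation; the data and parameters are illustrative.

```python
# Fuzzy matching with text-search machinery: TF-IDF over character n-grams.
# Character n-grams tolerate typos and abbreviations better than whole words.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

names_a = ["International Business Machines", "Acme Corp", "Globex LLC"]
names_b = ["IBM - Intl Business Machines", "ACME Corporation", "Initech"]

vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3))
vectorizer.fit(names_a + names_b)  # one shared vocabulary so the spaces line up

scores = cosine_similarity(
    vectorizer.transform(names_a), vectorizer.transform(names_b)
)
for i, row in enumerate(scores):
    j = row.argmax()
    print(f"{names_a[i]!r} best matches {names_b[j]!r} (score {row[j]:.2f})")
```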
[00:21:56] Tobias Macey:
Large language models and generative AI have definitely occluded the overall landscape of ML in recent years, where they have, to some degree, become synonymous with AI even though that's not technically accurate. And I'm curious whether you see those capabilities as being a transformative shift in the space of master data management, record reconciliation, and entity extraction, or if it's largely an iterative step; maybe a large iteration, but not a wholly transformative piece, just a step change improvement in what we already had.
[00:22:35] Daniel Bruckner:
Yeah. I think this is sort of a lame answer, but it's a little of both. There are key ways that are incremental: in how you match records or enrich data or classify data or parse data, applying language models adds another really valuable tool in the toolkit. Well, actually, there are some scenarios where it does completely let you throw away a lot of techniques of the past. In schema mapping, for example, large language models without much training are very good at: I give you two tables, tell me how to align them, from a schema perspective.
So for some problems at a small scale, it just blows everything else out of the water. For larger scale problems, LLMs can give you a lot of subtlety that would be very difficult to get with traditional techniques. For example, language model embeddings are extremely good at capturing things like synonyms, synonymous meanings across different terms, and abbreviations, without having to build lookup tables and additional artifacts on the side. You get a lot of the richness of the meaning in the language, and how language works, for free. Except it's not actually free: there's a cost in terms of compute. So it occupies an interesting spot in the trade off space, where if you can figure out how to use it in a cost effective way alongside other cheaper, more scalable techniques, then you can get a tremendous amount of value.
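A small sketch of the embeddings point, using the open source sentence-transformers library and a small general-purpose model as assumed stand-ins: semantically equivalent variants score as similar without any hand-built lookup tables, at the cost of running a neural model over every string you embed.

```python
# Embeddings capture synonymy and abbreviations "for free": no lookup
# tables, but each encode() call costs real compute, which is why these
# are usually reserved for the ambiguous middle of the problem.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # small general-purpose model

pairs = [
    ("Intl Business Machines", "International Business Machines"),
    ("123 Main St.", "123 Main Street"),
    ("Robert Smith", "Bob Smith"),
]
for left, right in pairs:
    a, b = model.encode([left, right])
    cos = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    print(f"{left!r} vs {right!r}: cosine similarity {cos:.2f}")
```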
I think where it's really transformative is in creating more natural user experiences, actually working with the data and solving these problems for end users. One of the challenges that we've wrestled with at Tamr, and learned a ton about and gotten better and better at over our 12, 13 year existence, is taking complex data problems and complex machine learning concepts and encapsulating them in a simple user interface, making them understandable to end users who are not PhDs in machine learning, AI, and statistics.
And LLMs can actually come in and explain concepts and complex scenarios in straightforward ways. A very common situation for us is that our system is doing some record matching and consolidation and presents the user with an ambiguous case. We've clustered data from 40 different systems. Here are 40 different records describing what we think is the same actual person, one of your customers. And what do you think? Are these all in fact the same? Looking at a table with 40 records and a lot of columns is an overwhelming experience. It can be difficult for a human to parse that and decide: do they agree with the machine learning or not? Did it hallucinate this?
And what a language model can do is give you ways to summarize what's on screen, these very complex concepts, and draw user attention to the key signals. Hey, look: all 40 records here share the same last name. For the first name, there are only 3 different values, and this is a common nickname, so it looks legit. There are a few different addresses, but it seems like maybe they moved over time. And we can put that in context. We also know what the model on the back end was that produced this suggestion.
We can provide that context to a language model as well, and it can explain the reason these got pulled together: the weightings. Let's say this is a B2C customer mastering data product, and in this data product the model looks for and puts a lot of weight on commonality in the name and the address and the phone number. It gives users a way to engage with a hard problem, a very niche problem, but in a way that's easy to understand and accessible.
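The review-assist pattern described here might look roughly like the following: the clustered records and the model's match signals get packed into a prompt, and the LLM returns a plain-language summary for the data steward. The records, the signals, and the call_llm stub are hypothetical placeholders, not Tamr's actual interface.

```python
# Sketch of "explain this cluster" for a human reviewer. The prompt carries
# both the matched records and the signals the matching model relied on.
import json

cluster = [
    {"source": "crm", "first": "Bob", "last": "Smith",
     "address": "12 Elm St, Boston", "phone": "617-555-0101"},
    {"source": "billing", "first": "Robert", "last": "Smith",
     "address": "12 Elm Street, Boston MA", "phone": "617-555-0101"},
]
match_signals = {"last_name_exact": True, "phone_exact": True,
                 "first_name_nickname": True, "address_fuzzy": 0.93}

prompt = (
    "You are helping a data steward review an entity-resolution suggestion.\n"
    "These records were clustered as the same customer:\n"
    f"{json.dumps(cluster, indent=2)}\n"
    f"Model match signals: {json.dumps(match_signals)}\n"
    "In 2-3 sentences, explain in plain language why these records likely "
    "refer to one person, and flag anything the reviewer should double-check."
)

def call_llm(prompt: str) -> str:
    # Placeholder: wire up whatever chat-completion client you use.
    raise NotImplementedError

# print(call_llm(prompt))
```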
And why this is so important is trust. One of the interesting things since the launch of ChatGPT and the mainstreaming of AI and large language models is that it has brought this question of trust, can you trust the robots, into mainstream consciousness. We've been dealing with this problem from the start. We've been a machine learning company since the beginning, and everyone always loves to be smarter than the computer. So there's always a questioning of: I see the model is suggesting this; I want to make sure the model isn't doing something crazy.
With large language models now being mainstream, people know that these things hallucinate. They come up with nonsense if you push them a little too far. So there's this real questioning of: if you're using machine learning and AI techniques to make this data good, can I trust it? Is it real? And so being able to take our results, the results of the modeling, and communicate them in a way that puts the data front and center, but also provides the context, is really important for end users to feel like, yeah, this is actually solving the problem. This is real.
This is data I can trust. This is the truth about our customers, and we should adopt this and share this. So it's kind of ironic that the technology that's leading to this crisis of trust can also be a big part of the solution. But I think, as we move past the initial shock of artificial intelligence, that's what we're moving into.
[00:29:15] Tobias Macey:
As a listener of the Data Engineering Podcast, you clearly care about data and how it affects your organization and the world. For even more perspective on the ways that data impacts everything around us, don't miss Data Citizens Dialogues, the forward thinking podcast brought to you by Collibra. You'll get further insights from industry leaders, innovators, and executives in the world's largest companies on the topics that are top of mind for everyone. In every episode of Data Citizens Dialogues, industry leaders unpack data's impact on the world, like in their episode "The Secret Sauce Behind McDonald's Data Strategy", which digs into how AI driven tools can be used to support crew efficiency and customer interactions. In particular, I appreciate the ability to hear about the challenges that enterprise scale businesses are tackling in this fast moving field.
The Data Citizens Dialogues podcast is bringing the data conversation to you, so start listening now. Follow Data Citizens Dialogues on Apple, Spotify, YouTube, or wherever you get your podcasts. In that space of large language models and their application to the problem of master data management, what are some of the pieces that are missing out of the box? As in: I have a large language model, but now I actually have to stand up pieces x, y, and z before I can really start to bring it to bear on the problem, particularly in that context of hallucination and trust building.
I'm wondering how you have been working through that challenge of harnessing the capabilities while mitigating the inherent risks, in a problem where you're trying to build trustworthy data with an inherently unpredictable utility.
[00:31:04] Daniel Bruckner:
Yeah. Let's see, I'm going to go back a little to start. When we got started at Tamr, our initial vision and goal, this was in the days when the semantic web was really hot and knowledge graphs were becoming a big deal, was: we want a product that can efficiently produce the enterprise knowledge graph. So within your organization, you have this extremely high quality linked data describing everything you do, everything you care about, all the key entities: your customers, your suppliers, your employees, all the parts and products that you produce, this whole space.
And now, in this context where we are today with these large scale models, that idea of a correct knowledge graph is still the guiding principle. Or really, it's this idea of truth for the enterprise. To get real value out of large scale artificial intelligence, you need to find ways and architectures to tie it back to that truth. It's very good at syntax and articulating ideas. We just need to do a good job of giving it the right content and the right context, depending on what problem it's solving.
So there's an increased importance of data quality and data linkage and master data management, to be able to produce these common datasets and maintain them, so that you can point your GPT at them and get really good, high quality, trusted answers from the AI.
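That grounding idea is essentially retrieval-augmented generation over the mastered data: fetch the relevant golden records, then constrain the model to answer only from them. A sketch with assumed retrieve and call_llm hooks rather than any specific vendor API.

```python
# Retrieval-augmented answering over master data: the LLM supplies the
# language, the golden records supply the truth.

def answer_from_master_data(question: str, retrieve, call_llm) -> str:
    # retrieve() is assumed to look up relevant golden records, e.g. via
    # an embedding index over the mastered entities.
    records = retrieve(question, top_k=5)
    context = "\n".join(str(r) for r in records)
    prompt = (
        "Answer using ONLY the master data below. If the answer is not "
        "in the data, say you don't know.\n"
        f"Master data:\n{context}\n\nQuestion: {question}"
    )
    return call_llm(prompt)
```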
[00:33:03] Tobias Macey:
I imagine that the inclination for people who are thinking about bringing AI to bear on this problem is: oh, well, AI is very sophisticated, it has all of these nifty capabilities, I should be able to just set it loose on my data, it'll solve all of my problems, and I don't really have to worry about it, maybe just click yes or no a couple of times. Versus what I imagine to be the reality: you actually want to use all of those manual and statistical techniques that we have been relying on and developing for the past 30 years to do maybe the 80% case, and use the LLM for that 20% case that takes 80% of the time, to accelerate the process a bit. And I'm curious how you are guiding organizations on that strategic aspect of how, where, when, and why to actually bring these language models to bear on the problem in conjunction with all of the other techniques that have been developed and that we have established trust and confidence in.
[00:34:11] Daniel Bruckner:
Yeah. That question has become central for everyone these days: am I comfortable using generative AI with my data? The previous version of this was: am I comfortable putting our most business critical, important, high value data assets on the cloud? That's now shifted. Most organizations are comfortable with the cloud, but now it's: well, can machine learning look at it? What if, God forbid, someone trains a model on our data and shares that model?
I think there's a certain amount of just feeling out the right level of security around these things, and I don't want to go into that too deeply. But just for the purposes of solving problems in this space, there are big opportunities to improve the quality of matching and mastering using these new models. But they need to be harnessed. There's been a lot of research over the last few years applying large language models to these sorts of fuzzy database problems: fuzzy joins, group bys, schema mapping. And what they basically find is that large language models, with very little configuration and engineering, perform as well as a lot of the state of the art techniques that existed previously.
The challenge is putting those techniques into a system and a product where they're used intelligently in conjunction with other lower cost techniques from more traditional machine learning, and also in conjunction with rules and human feedback. One of our founding principles at Tamr is that the machine is never right 100% of the time. You need humans in the loop to review, to address complex cases, and to assess how well things are going. So you need a system that incorporates all of these pillars: human input, rules, and model predictions.
And users love rules. People love rules. If you just say a match means the Social Security number value is equal, everyone loves that. They understand what that means. Given the truth about data quality and how things exist in the real world, that rule might actually be wrong a large proportion of the time in a real, particular case. But people love that when it's wrong, they get why. Whereas if some machine learning model is wrong, and they want to know why, it's like: well, let me show you this random forest decision tree and talk about that.
[00:37:14] Tobias Macey:
Let me explain that carefully. Yeah.
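A toy triage sketch of the layering being described: a deterministic rule decides the easy cases outright, a model score decides the confident ones, and everything ambiguous goes to a human. The SSN rule and both thresholds are illustrative assumptions.

```python
# Rules first, model second, human last. Cheap and explainable cases never
# touch the expensive machinery; ambiguous cases never skip a person.

def match_decision(a: dict, b: dict, model_score: float) -> str:
    # Rule layer: an exact identifier match is accepted outright. Users
    # trust this, even though on dirty real-world data the rule itself
    # can be wrong (typos, reused identifiers).
    if a.get("ssn") and a.get("ssn") == b.get("ssn"):
        return "match (rule: equal SSN)"
    # Model layer: act only on confident predictions.
    if model_score >= 0.95:
        return "match (model, high confidence)"
    if model_score <= 0.05:
        return "non-match (model, high confidence)"
    # Everything in between goes to a human reviewer.
    return "send to human review"

print(match_decision({"ssn": "123-45-6789"}, {"ssn": "123-45-6789"}, 0.4))
print(match_decision({"ssn": None}, {"ssn": None}, 0.5))
```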
[00:37:18] Daniel Bruckner:
So you need all of these things, and they all come together to create a coherent solution that doesn't have an absolutely overwhelming cost. One thing we haven't talked much about is that if you want to solve these problems at scale, it can become very expensive very quickly. By their nature, these are all matching problems. They're quadratic: naively, you would be comparing everything to everything else. So if you take a naive approach, you're going to burn a lot of compute, you're going to spend a lot of money, and you probably won't get the best results. So you need a way to identify the easy parts of the problem and solve them easily.
The unsolvable parts of the problem, you send to a human. And then for everything in between, that's where we're seeing the biggest boosts from vector embeddings, large language models, and newer cutting edge techniques. They can dig into those ambiguous cases and get a lot of value.
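The standard answer to that quadratic blowup is blocking: only records that share a cheap key are ever compared, and just those candidate pairs flow on to the expensive matchers. A minimal sketch with a deliberately crude, assumed block key.

```python
# Naive entity resolution compares every record to every other: O(n^2).
# Blocking compares only records that share a cheap key.
from collections import defaultdict
from itertools import combinations

records = [
    {"id": 1, "name": "Acme Corp", "zip": "02110"},
    {"id": 2, "name": "ACME Corporation", "zip": "02110"},
    {"id": 3, "name": "Globex LLC", "zip": "73301"},
    {"id": 4, "name": "Acme Ltd", "zip": "73301"},
]

blocks = defaultdict(list)
for r in records:
    key = (r["name"].lower()[0], r["zip"])  # crude block key, for illustration
    blocks[key].append(r["id"])

candidate_pairs = [
    pair for ids in blocks.values() for pair in combinations(ids, 2)
]
print(candidate_pairs)  # [(1, 2)]: 1 candidate pair instead of all 6
# The trade-off is recall: record 4 ("Acme Ltd") lands in a different block
# and is never compared to records 1 and 2, so block keys must be chosen
# carefully (often several in parallel) to keep true matches together.
```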
[00:38:26] Tobias Macey:
Another complexity of this space, particularly when you're first embarking on the process of trying to reconcile all of your organizational data, is the level of expertise in the process of master data management, as well as the level of familiarity with the data itself. The person who created the data, who figured out what the schema should be and decided which attributes to pick, may not even be with the company anymore, so you don't have all of the context or all of the information. That's especially true when you have an inexperienced team who's just starting on this process, and then you say: hey, here, rub some machine learning on it, it's magical.
I'm wondering what are some of the potential pitfalls that you're setting them up for if they don't have an appropriate understanding of the actual capabilities and limitations of the techniques, so that they can be appropriately skeptical or appropriately confident where each applies.
[00:39:31] Daniel Bruckner:
Yeah. That touches back on the political problem of master data management. You have to get a lot of different kinds of people involved. The domain experts, the subject matter experts, are very rarely the people who own the project. They have to be drafted in and convinced to share their time to really make the project successful. And so, yeah, there does tend to be skepticism of what the system is doing. If it's an AI based system, the skepticism is increased. They'll look at the data and say, this doesn't make sense, why is it doing it this way? And so you need a workflow that embraces that uncertainty to an extent, or makes users comfortable with the fact that the data is bad and we're not going to fix it all at once. It's going to be a process.
Something interesting we learned over the years: one of our earlier products was very explicitly designed as a system for end users to train models to master data. There were workflows for end users to come in, and the system used active learning techniques to surface really high value examples. Users go in, they label the examples, and the model gets trained as quickly as possible. It made this ML practitioner experience available to non ML experts.
And as easy as we made that system, it was still hard to use, and you still had to understand machine learning. So for subject matter experts it was a challenge, and there was a lot of hand holding. So we've moved towards pretrained models: we have general models that can apply to your domain and start at a very high level, and then you tune them. But when we first did that, we no longer had this active feedback workflow, and what we found was that it actually damages trust with the end users. They want to give feedback to the machine, and they want to have that back and forth, have that conversation.
That's really important in gaining trust in the system. It was the system directing how you should be exploring the data, how you should be interacting with it, and how you should be understanding it. So we've brought that back. Even though now you're not training the model from scratch, you can still go in and interact with the system as if you're training it, and it's a positive user experience.
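The heart of that active-learning workflow fits in a few lines: rather than asking experts to label random pairs, surface the pairs the current model is least certain about, so each label moves the model as far as possible. The scores and the budget of three questions are made up for illustration.

```python
# Uncertainty sampling: scores near 0.5 are the model's "I don't know",
# so those pairs are the most informative ones to ask a human about.

def most_informative(scored_pairs: list, k: int = 3) -> list:
    return sorted(scored_pairs, key=lambda p: abs(p[1] - 0.5))[:k]

scored = [("pair-17", 0.97), ("pair-4", 0.52), ("pair-31", 0.08),
          ("pair-9", 0.44), ("pair-22", 0.61)]

for pair_id, score in most_informative(scored):
    print(f"ask the expert about {pair_id} (model score {score})")
    # label = get_label_from_user(pair_id)  # then retrain on the new label
```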
[00:42:26] Tobias Macey:
And in your experience of building Tamr, working with organizations to address this challenge of master data management, and incorporating ML and the newer generation of generative AI capabilities into that system, what are some of the most interesting or innovative or unexpected ways that you've seen ML and AI techniques used in that context of data resolution and master data management?
[00:42:56] Daniel Bruckner:
Yeah. It's a good question. Our architecture is fairly generic. We primarily work with structured enterprise data and relational database systems, but we can extend the system to work with more complex data types. A number of years ago, we added support for geodata and GeoJSON, essentially polygons. And the applications that users have for that are really quite interesting and surprising. Normally, you think about master data management as applying to a fairly narrow subset of data: customer data, organizations, people, parts, products.
That's kind of the sweet spot. But with some of these other data formats that we support, we've seen it applied to things like radar tracks: keeping track of fuzzy data related to planes and other kinds of aerial phenomena. It really gets out there. It's pretty cool to see.
[00:44:13] Tobias Macey:
In your work of building these systems, coming to grips with the constantly evolving landscape of ML and AI techniques, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process of building a product that harnesses that?
[00:44:31] Daniel Bruckner:
So a big challenge that we frequently thought we'd solved, but continue to find better solutions for every couple of years, each solution seems to reveal another side of the problem, is how to connect the two modes of solving these sorts of problems. We talked earlier about how you can take some data as a snapshot, as a one off project: run some large scale batch computation, put it all together, and then be done. But the data is not static. What happens when customers x, y, and z come in and someone moved, their address changed? Maybe some customers passed away and you no longer want to be sending them offers in the mail. Maybe two companies merge, or split up.
Things happen in the real world, and you need to manage this on an ongoing basis. Most applications in the real world aren't content with a one off solution and need to be able to solve the problem in a live, ongoing, updated in real time kind of fashion. And we've found it's a major challenge marrying the extremely efficient, high throughput, batch oriented solutions to these problems with a more operational, live database system that solves it on an ongoing basis, in real time or in a streaming, event driven fashion.
The trick is to take the core of the system, all these techniques that we've been talking about, rules based matching, lessons from natural language processing and information retrieval, and the latest AI has to offer, and apply them in consistent ways across two extremely different architectures, marry those together, and do it in a way that end users can actually use and transition from one to the other. So you can come into the system and load a whole bunch of data, process it in a big way, create that initial starting point for what your master data should look like, and then, boom, you're up and running. It's in a live database.
Now you can interact with it directly. You can start to point other systems at it and use it in a way where it's not a one off. It's not some extract that's going to be irrelevant tomorrow that no one's going to adopt. It becomes a living piece of the overall architecture within an organization.
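In miniature, that batch/streaming marriage might look like this: a batch pass bootstraps the golden records, then an incremental path matches each arriving record against them instead of recomputing everything. Here best_match is a stub standing in for the same matchers (rules, blocking, embeddings) the batch pass would use against an index of golden records.

```python
# Incremental mastering after a batch bootstrap: new records are matched
# against existing golden records and either merged or started as a new
# cluster. The threshold and the matcher itself are illustrative stubs.

golden_records = {"g1": {"name": "Jane Doe"}, "g2": {"name": "Acme Corp"}}

def best_match(record: dict):
    # Stub: a real system reuses the batch matchers against an index
    # of golden records and returns (golden_id, confidence).
    return None, 0.0

def on_new_record(record: dict) -> str:
    match_id, score = best_match(record)
    if match_id is not None and score > 0.9:
        return match_id  # merge into the existing golden record
    new_id = f"g{len(golden_records) + 1}"
    golden_records[new_id] = record  # start a new cluster
    return new_id

print(on_new_record({"name": "ACME Corporation"}))  # 'g3' with the stub matcher
```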
[00:47:31] Tobias Macey:
And for people who are addressing these challenges of master data management and data reconciliation, trying to figure out the cross cutting view of their business, what are the cases where ML and AI are the wrong choice for some or all of that problem?
[00:47:51] Daniel Bruckner:
Yeah. Good question. I think it comes down to the simplicity of the problem. AI and ML are bright and shiny, but for smaller scale problems, maybe you're just putting together a couple of sources, or you have a one off and you're trying to create a one time presentation. If you don't have a lot of complexity to the problem, then deterministic techniques are likely to win. There's nothing wrong with rules, and applying a rule can be much cheaper than applying AI/ML to the problem.
So if you have a low stakes scenario where you can just 80/20 it and get a good answer quickly, then, yeah, go crazy with deterministic solutions. That should really be the first step of any approach: pick the low hanging fruit, then get into the hard problem and make it really good.
[00:48:57] Tobias Macey:
And as you continue to invest in and keep tabs on this evolving space of large language models and generative AI and its application to the challenge of data cleaning, what are some of the hopes or predictions that you have, or any specific techniques or areas of effort that you're keeping a close watch on?
[00:49:22] Daniel Bruckner:
Yeah. I think the ability to build intelligent agents into existing user workflows is creating a big opportunity. The first wave of this AI rollout was: put a chatbot in it. You've got a product, put a chatbot in it, it's going to be amazing. And there are some good applications for that. But I think what's really coming up next is looking at the problems where LLMs are extremely well suited, and then applying those to actual key features within a product and deploying them. Really starting to think of it as a capability we can productionize.
How do we think about our product road map, where we build that, how we use it, how we adopt it? And I think the upshot is that a lot of the challenging work, not in solving the master data management problem itself, but in managing the system and its complexity, can be automated to a much larger extent. There's a lot of configuration that goes into pulling data from different systems, aligning all the schemas, figuring out how you want to enrich, how you want to apply data quality transformations, how you want to pull in third party source data; essentially creating that model of what you want your master data to look like, starting from what all your source data looks like.
And there are big opportunities for LLMs to go in and simplify that, turning it into a very straightforward, basic, wizard like experience for setting up this extremely complex machine to go and process all this data in complex ways. And then to manage it over time. Putting agents into the system can take the hardest parts of the user experience and either automate them away or turn them into a delight for end users. So we're focused a lot on really simplifying down that experience, and making master data management something that isn't this scary thing that sounds like it's doomed to fail and will be very expensive, but more like: no, this is something you need. If you're not doing this, you're crazy. All your data could be 10 times better, and you won't be tearing your hair out to get there.
[00:52:00] Tobias Macey:
I think that point too, of figuring out what that common cohesive schema is, what representation is going to be useful and applicable and easy to integrate, is one of the challenges as well, and maybe the LLMs can help set that initial pass of: here's something it could look like. Because at either end of the spectrum, you have either people who are unable to see the art of the possible because it looks too daunting, or people who ask for the impossible because they think it's easy.
[00:52:35] Daniel Bruckner:
Yeah. Absolutely. LLMs are really good at translating, right? You can speak different languages and they can act as an intermediary, and it just works somehow. I think there's a vision for a future here of: what if you did master data management and there wasn't even a single master data model? What if everyone got to keep the model they wanted from the beginning, and there's an LLM in the middle intelligently translating across these things? So everyone thinks they're speaking the same language, even though it's really a tower of Babel situation.
So that's kind of the promise here, and I think it's a big opportunity. There's still a lot of challenging engineering and product development to get there, but that's where we're headed.
[00:53:25] Tobias Macey:
Are there any other aspects of this overall space of master data management and the application of ML and AI to its execution and implementation that we didn't discuss yet that you'd like to cover before we close out the show?
[00:53:39] Daniel Bruckner:
I feel like I had something more to say about third party data, but, honestly, I think we might be good.
[00:53:46] Tobias Macey:
Fair enough. Yeah. Well, for anybody who wants to get in touch with you and follow along with the work that you and your team are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:54:04] Daniel Bruckner:
I think it comes back to a location challenge somehow. Maybe I'm just thinking about this because it's the problem I've been dealing with lately, but I feel like we haven't really solved the cross cloud problem. There are really good systems on different clouds, and they don't translate one to one. So there's a lot of essential technology that's locked up in different proprietary walled gardens. And it's now very easy to build extremely powerful, cutting edge data architectures for managing your data.
But you have to make some pretty big decisions at the outset and some pretty big bets on vendors and who you trust in the market. And it's gotten a lot harder to remain independent. On the other hand, it's also easier to remain independent: there are a lot of amazing tools breaking up the relational database into its component parts and using independent systems to put it back together. And at the same time, there are a lot of these amazing tools in the open source world.
But it's difficult for those worlds to collide and for all of it to come together into a coherent approach. And I feel like there's a little too much satisfaction with folks thinking that if you put all the data into a single physical place, all of your problems are solved, when really you're just kicking a bunch of problems down the road for 10 years, until you get sick of your vendor and need to go do something dramatically different.
[00:55:57] Tobias Macey:
Now the the data gravity problem is definitely real, and until we are able to circumvent physics, it won't go away. Yeah. Yeah. Alright. Well, thank you very much for taking the time today to join me and share your thoughts and experience on building these master data management workflows, bringing ML and AI to bear, and some of the ways that their current generation of LLMs and generative AI are adding new capabilities and techniques to that process. So they appreciate all the time and energy that you and your team are putting into bringing that to bear and making it more accessible and easier to apply to this challenge, and I hope you enjoy the rest of your day.
[00:56:38] Daniel Bruckner:
Thank you. Yeah. Thanks so much for having me. This has been fantastic.
[00:56:50] Tobias Macey:
Thank you for listening, and don't forget to check out our other shows. Podcast dotnet covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering Podcast is your guide to the fast moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email host at data engineering podcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. It's 2024. Why are we still doing data migrations by hand? Teams spend months, sometimes years, manually converting queries and validating data, burning resources and crushing morale. Datafold's AI-powered Migration Agent brings migrations into the modern era. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing.
Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today to learn how Datafold can automate your migration and ensure source-to-target parity. Your host is Tobias Macey, and today I'm interviewing Dan Bruckner about the application of ML and AI techniques to the challenge of reconciling data at the scale of business. So, Dan, can you start by introducing yourself?
[00:01:09] Daniel Bruckner:
Yeah. Thanks, Tobias. It's a pleasure to be here. I'm Dan Bruckner, a cofounder and CTO at Tamr. I've been solving problems in this space for, I don't know, going on 15 years now. We build solutions for master data management, using AI and machine learning to simplify MDM projects and make them successful.
[00:01:34] Tobias Macey:
And do you remember how you first got started working in data?
[00:01:37] Daniel Bruckner:
Yeah. It goes way back. I actually started out as a physicist. My first job out of college was working at CERN on the LHC, in the days before the LHC had actually started; it was just getting going. So most of what I did was write code and solve computational problems. In those days, we were doing analysis over large volumes of simulated data, trying to model the system and get a handle on our expectations for what was going to happen. And as I was doing that, with the system not yet running, I got more interested in the computational problems I was working on and the code I was writing.
So when I got back to the States, I decided to pivot into computer science, started programming, and then got interested in computer science research. Because of my background, I naturally gravitated toward data, large-scale data processing, and database systems, and eventually started working with Mike Stonebraker at MIT on research into large-scale data integration, approached holistically and approached using machine learning techniques, applied in ways that scale to extremely large volumes of data.
[00:02:58] Tobias Macey:
And before we get too much into the application of ML techniques to that challenge of processing data, reconciling it, getting it into a usable state, I'm wondering if you can just start by giving a bit of an overview of some of the different ways that data at the organizational scale becomes unwieldy and some of the challenges that arise from that lack of reconciliation?
[00:03:26] Daniel Bruckner:
Yeah. It's a class of problem that I think is very common and taken for granted, and also not necessarily deeply understood. I like to start from an analogy to software engineering and Conway's Law. Are you familiar with Conway's Law?
[00:03:44] Tobias Macey:
I am. That the software design will eventually reflect the organizational communication patterns for better or worse.
[00:03:52] Daniel Bruckner:
That's exactly right. So the structure of your organization dictates the structure of your software architecture. And the same is true, to a large extent, in data and data management. The structure of data within a large organization is naturally going to reflect the structure of the teams and groups and divisions that created that data. That can be a very good thing: it means individual teams can operate naturally and independently, and use the data they need to be successful and do what needs doing. But it also creates big challenges and missed opportunities when you move up a level and want to reason about, change, and ask questions of the data across the whole organization.
Different teams are speaking fundamentally different languages. They often have redundant, duplicated data, and it can be very hard to actually use that data to communicate and make high-level decisions within the org. From a nuts-and-bolts perspective, what kinds of issues are we talking about? As basically a database guy, I come back to fuzzy unions (putting together schemas across different databases), fuzzy joins, and fuzzy group-bys. Essentially, cases where you would like to treat large sets of data as a coherent whole, a single database, but you don't have the keys. You don't have the common attributes.
You don't have the common identifiers, so you're not able to just directly go and ask the questions you want to ask. First, you have the problem of mechanically getting all the data together, linking it up, and getting a coherent picture that you can query and use for applications, analytics, or whatever it is you're trying to accomplish.
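To make the fuzzy-join idea concrete, here is a minimal sketch using only Python's standard library and invented customer tables (the names, fields, and threshold are illustrative, not anything Tamr ships). Two silos with no shared key get linked on approximate name similarity:

```python
from difflib import SequenceMatcher

# Two hypothetical customer tables from different silos: no shared key,
# names entered inconsistently.
crm = [{"name": "Acme Corporation", "region": "NE"},
       {"name": "Globex, Inc.", "region": "SW"}]
billing = [{"name": "ACME Corp", "balance": 1200},
           {"name": "Globex Incorporated", "balance": 340}]

def similarity(a: str, b: str) -> float:
    # Character-level similarity in [0, 1]; real systems use richer
    # signals (tokens, phonetics, embeddings) and blocking for scale.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Naive fuzzy join: pair each CRM row with its best billing match above
# a threshold. This is O(n*m) comparisons: fine for a toy, not at scale.
THRESHOLD = 0.6
for c in crm:
    best = max(billing, key=lambda b: similarity(c["name"], b["name"]))
    if similarity(c["name"], best["name"]) >= THRESHOLD:
        print(c["name"], "<->", best["name"], best["balance"])
```

The same scoring idea extends to fuzzy group-bys (cluster rows whose keys score above the threshold) and fuzzy unions (score column names instead of values).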
[00:05:53] Tobias Macey:
And given the reflection of Conway's Law in that data ecosystem for the business, what are some of the attributes of either scale or team dynamics that you see being the biggest contributors to that messiness and that lack of cohesion that brings out these problems?
[00:06:15] Daniel Bruckner:
Yeah. Depending on the scale of the organization, there can be many. But the most common case is that datasets come from applications. They come from processes, whether software-based or not, that are well established and designed not primarily to create data, but to solve some problem for the business: sales, marketing, the basic things that companies do. As a side effect, they produce these piles of data. And the teams that work with those processes and applications are very vested in the way things work. If other teams come in, data teams most frequently, to do analytics across different groups and different parts of the org, there's a natural conflict that arises: "We would like it all to look this way. We think this would solve the problem for the whole organization better." And the teams say, "No, that's not how we operate. We can't do that. You can't just come in and change our process and change our data."
"We've been doing this this way forever." The problem gets worse the larger the organization gets, and especially for companies that grow through mergers and acquisitions. You start bringing in data that's arisen not just from different teams, but from completely different organizations, and start trying to put it together and consolidate. Those kinds of small inconsistencies can really start to undermine the process of finding a good way to operate and put all the data together coherently.
[00:07:49] Tobias Macey:
And so that process of reconciling data, bringing it together in a way that makes organizational sense so that you can start to ask those questions across the business is largely called master data management or building golden records. And I'm wondering if you can talk to some of the typical approaches that teams and organizations try to take to be able to actually embark upon that process of building those master records and reconciling that data and some of the scaling challenges that they run into, whether that's in terms of scaling at the compute level or scaling just in terms of time, effort, and human capacity?
[00:08:32] Daniel Bruckner:
Yeah. That's a good question, and a big question. Breaking it down a bit: master data management really does cover the heart of this problem of linking different datasets together. There are a number of stages in a successful master data management project that you have to move through. One stage starts just ahead of getting into master data, which is physically getting the data together and treating the data quality problem: getting to a common level of quality, often pulling in third-party source data and reference data to enrich it, and getting your base to a good spot.
Then you have a set of sources, different data tables and database systems. You put them in one physical place, and now you want to link them together. You want to create the point of reference across common records and solve that linkage problem, the entity resolution problem. Once you've done that, great, now you have a common identifier that you can use. You're going to draw in all the data from these systems and attempt to consolidate it and produce golden records.
So now you have an identifier that links source data, and for each identifier you have a golden record: this is the truth about this customer or this supplier or this part in our organization. You produce that record, and now you want to manage it over time. As you go farther, you're going to want to push more of that out to the source systems themselves and to downstream applications and analytic engines. Essentially, you solve the problem of the coexistence of master data on one hand and all the operational and analytical datasets that exist everywhere in the organization on the other. The physical problem of linking those things together and keeping them consistent becomes a big challenge as you start to operationalize the master data.
And in different scenarios and use cases, different folks will focus on different parts of this journey through master data management. Maybe some projects only require getting that identifier: throw the data together, get the identifier, great, that's all we needed, we can run with it. Maybe you're just doing some analytics as a one-off. Every quarter, you produce a report: we refresh our data, we get this high level of integrity with our master data, we generate our report, we're good to go.
As you move farther along and want to actually take that data, operationalize it, and use it on an ongoing basis, keeping it fresh constantly, so that as new data comes into operational systems it's immediately mastered and incorporated with the master data, you have to go farther in this journey of pulling together and closely integrating your master data system with your operational database systems and other applications.
[00:11:39] Tobias Macey:
And the canonical example that's often brought to bear in this context is the customer record: this is our customer, and these are all of the attributes about them. Then there's the challenge of which system we actually trust the most to collect that information accurately, or the fact that different systems collect different pieces of information. And when you're dealing with people, they change locations, so you have to make sure you have the current address, but you also want to know their old addresses. So you have the issue of historizing that data, and this applies across other business objects beyond just your customers.
I'm wondering if you can talk to some of the people problems of figuring out what those decision points are, the ways we determine which place we trust the most for which pieces of that data, and then actually managing the merging of those attributes from multiple systems to be able to say: this is the value we trust the most; that other system over there has different information, so we're going to ignore it, or we need to use it in that system but use this one over here. I'm wondering about some of the ways organizations have to wrestle with that kind of constant decision making about what data to use where, when, and how.
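One common way to encode those "which system do we trust for which attribute" decisions is a survivorship policy: per-attribute source precedence, falling back to recency. A minimal sketch over hypothetical sources and fields (the policy, names, and data are assumptions for illustration, not Tamr's implementation):

```python
from datetime import date

# Hypothetical records for one customer from three systems, each tagged
# with its source and last-updated date.
records = [
    {"source": "crm",     "updated": date(2024, 5, 1),
     "email": "dana@example.com", "address": "12 Oak St"},
    {"source": "billing", "updated": date(2024, 6, 10),
     "email": None,               "address": "98 Elm Ave"},
    {"source": "support", "updated": date(2023, 1, 3),
     "email": "d.smith@example.com", "address": None},
]

# Per-attribute trust policy: which source wins for each field.
PRECEDENCE = {"email": ["crm", "support", "billing"],
              "address": ["billing", "crm", "support"]}

def golden_record(recs):
    golden = {}
    for field, order in PRECEDENCE.items():
        # Walk sources in trust order; take the first non-null value.
        # Old values could be kept as a history list to "historize" them.
        for src in order:
            vals = [r[field] for r in recs if r["source"] == src and r[field]]
            if vals:
                golden[field] = vals[0]
                break
    return golden

print(golden_record(records))
# -> {'email': 'dana@example.com', 'address': '98 Elm Ave'}
```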
[00:13:04] Daniel Bruckner:
Yeah. I think what you're picking up on is a really key observation about master data management as a problem space: it's not just a technical problem. If it were just a technical problem, putting data together and creating a coherent knowledge graph, we know how to do that. In real organizations, it's also a political problem. You're not just trying to get the data to agree. You're trying to get these different teams to agree and coexist, each with their own special view of the data. Because the reason the data silos were created in the first place was all of these teams operating independently and efficiently.
In pulling together those silos, you need to make sure you don't interfere with the independent, happy, trusted operation of everyone who created them. What it comes down to is solving the master data management problem less from a dictatorial "we will come up with the one standard that works for everybody" approach, and more by creating a repository for the linkage: a system and a common touchpoint for all of these different silos and applications to touch base and stay closely linked in a clean way. One of our early customers, a very large manufacturer, essentially told us when we started working with them: here's our history with master data management. We're a company with many lines of business and different divisions.
"We have 26 major ERP systems. We have more, but the long tail is too much to worry about. All of our parts and all of our suppliers exist across these 26 systems. We've had several efforts at master data management. What happens is we go in, we pick some of these systems, the largest, most popular ones we think are the most trustworthy. We collect the data, we consolidate it, we create this master, we have a new identifier. And at the end, no one wants to use it. Now we have 27 systems for all of our supplier data and parts data." So if you do the technical work, but don't do it in a way that meets the consumers of the data where they are, the project can be a failure and essentially make the problem worse.
So it's really critical to find the way to not just create a standard, but create a system that bridges the gap and maintains the connectivity between all these different consumers, and does it in a scalable way. If you take 3 of 20 systems and say, this is how we're consolidating, well, what about the 17 other teams? Their data's gone now. They have no frame of reference. You need an approach that can scale to handle the whole problem.
[00:16:17] Tobias Macey:
The other interesting piece of this is that business intelligence and data warehousing have existed in some fashion for at least the past 30 years, give or take. So you would think that, given that time span, this is a problem that would have been solved at least reasonably well by now. And yet even today it's still a challenge that organizations are tackling, starting new projects on today, tomorrow, next week. I'm wondering what are some of the evolutionary aspects of the problem that lead us to keep revisiting it and resolving it across organization after organization, rather than it being a well-established, well-understood, more or less solved problem?
[00:17:06] Daniel Bruckner:
Yeah. It's a good question. I'd say master data management is going on about three decades old now, so companies have been building systems to solve this problem for a while. The traditional systems tend to focus on using sets of rules and strict data models to put together data from source systems, and they tend to focus more on the operational side. You do some basic data quality and some basic data integration, but fundamentally you get your set of golden records and then, okay, put that in a database and go use it. They tend to focus on supporting applications downstream, but they don't necessarily do a great job of pulling in lots of data from across the organization, linking it together coherently, cleaning it up, and enriching it: making sure that the master data itself is actually the best view there is of the data, has the highest possible quality, and has all of this linkage across the organization.
What's happening currently is that the application of AI and machine learning to these problems unlocks much better solutions and the ability to tackle the problem much more holistically, and with much higher fidelity than has been possible traditionally.
[00:18:34] Tobias Macey:
So to that point of the application of ML and AI in this ecosystem: machine learning in various forms has been used with various levels of success in this context. You mentioned rules-based systems; that's maybe the expert systems era of AI, which we have largely moved past. And there have been a lot of different natural language processing techniques used for trying to do some of that entity extraction and entity resolution. I'm wondering if you can talk to some of the evolutionary aspects of the application of ML and AI to the problem of master data.
[00:19:12] Daniel Bruckner:
Yeah. Absolutely. You're exactly right to start with the rules, and I don't want to say that rules are the wrong way to solve this problem; they're actually very good for the right use case. But for context, the fundamental nature of dealing with dirty data is like the Tolstoy line: all happy families are alike, but every unhappy family is unhappy in its own way. It's true of data too. All bad data is bad in a different way, so you need a lot of tools in your toolkit. Traditionally, rules were the approach: if we just come up with a good data model, if we model the problem well enough, if we have a customer model and schema that's good enough, then we can put all customer data into that schema and define rules for how it should work.
The reality is that data can mean lots of subtly different things and be used in subtly different ways, and it pretty much always is. So you always have to be ready to account for slight differences in granularity or slight differences in shade of meaning. Beyond rules, fuzzy matching becomes big, and that leads into natural language processing techniques, especially techniques from information retrieval. Applying scalable methods from text search goes a very long way in dealing with fuzziness and solving fuzzy matching problems.
Beyond that, you get into statistical techniques and traditional machine learning: building models to classify matches between data, to classify groups and taxonomies of data, and to look for different characteristics of the data to perform reconciliation and consolidation. And once you've entered this statistics and machine learning world, the sky's the limit. Techniques from 30 years ago in information retrieval are great, but you can move all the way up through what we have today with large language models and generative AI, and apply that to the problem as well.
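The "traditional machine learning" stage usually means turning record pairs into feature vectors and training a match/non-match classifier. A hedged sketch with made-up training pairs, assuming scikit-learn is available (the features and data are toy choices, not any specific product's model):

```python
from difflib import SequenceMatcher
from sklearn.linear_model import LogisticRegression

def features(a, b):
    # Simple pairwise features; production systems add phonetic codes,
    # token IDF weights, parsed addresses, and more.
    name_sim = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    zip_match = 1.0 if a["zip"] == b["zip"] else 0.0
    return [name_sim, zip_match]

# Tiny, hand-labeled training set of (record_a, record_b, is_match).
pairs = [
    ({"name": "Jon Smith",  "zip": "02139"}, {"name": "John Smith", "zip": "02139"}, 1),
    ({"name": "Acme Corp",  "zip": "10001"}, {"name": "ACME Corporation", "zip": "10001"}, 1),
    ({"name": "Jon Smith",  "zip": "02139"}, {"name": "Jane Doe", "zip": "94103"}, 0),
    ({"name": "Globex Inc", "zip": "60601"}, {"name": "Initech", "zip": "60601"}, 0),
]
X = [features(a, b) for a, b, _ in pairs]
y = [label for _, _, label in pairs]

model = LogisticRegression().fit(X, y)

candidate = ({"name": "Jhon Smith", "zip": "02139"},
             {"name": "John Smith", "zip": "02139"})
print(model.predict_proba([features(*candidate)])[0][1])  # P(match)
```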
[00:21:56] Tobias Macey:
Large language models and generative AI have definitely eclipsed the overall landscape of ML in recent years, to the point where they have become synonymous with AI, even though that's not technically accurate. I'm curious whether you see those capabilities as a transformative shift in the space of master data management, record reconciliation, and entity extraction, or if it's largely an iterative step: maybe a large iteration, but not a wholly transformative piece, just a step-change improvement on what we already had.
[00:22:35] Daniel Bruckner:
Yeah. I think this is sort of a lame answer, but it's a little of both. There are key ways it's incremental: in how you match records, enrich data, classify data, or parse data, applying language models adds another really valuable tool to the toolkit. Well, actually, there are some scenarios where it does completely let you throw away a lot of techniques of the past. In schema mapping, for example, large language models without much training are very good at: I give you two tables, tell me how to align them, from a schema perspective.
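For the schema-mapping case, the whole "configuration" can be as small as a prompt. A sketch using the OpenAI Python client; the model name and the column lists are placeholders, and this is an illustration of the idea, not Tamr's implementation:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical column lists from two silos that need to be aligned.
source_a = ["cust_nm", "addr_ln1", "postal_cd", "phone"]
source_b = ["customer_name", "street_address", "zip", "telephone_number"]

prompt = (
    "Align the columns of table A to table B. Reply as JSON mapping "
    f"each A column to its best B match, or null.\nA: {source_a}\nB: {source_b}"
)
resp = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; any capable chat model works
    messages=[{"role": "user", "content": prompt}],
)
print(resp.choices[0].message.content)
```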
So for some problems at a small scale, it just blows older techniques out of the water. For larger scale problems, LLMs can give you a lot of subtlety that would be very difficult to achieve with traditional techniques. For example, language model embeddings are extremely good at capturing things like synonyms, synonymous meanings across different terms, and abbreviations, without having to build lookup tables and additional artifacts on the side. You get a lot of the richness of meaning in the language, of how language works, for free. Except it's not actually free: there's a cost in terms of compute. So it occupies an interesting spot in the trade-off space, where if you can figure out how to use it in a cost-effective way alongside other cheaper, more scalable techniques, you can get a tremendous amount of value.
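The "synonyms and abbreviations without lookup tables" point is easy to see with off-the-shelf embeddings. A sketch assuming the sentence-transformers package and one of its published models; the example strings are invented:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # small published model

pairs = [
    ("International Business Machines", "IBM"),
    ("St. Mary's Hospital", "Saint Marys Hosp"),
    ("Acme Corporation", "Initech LLC"),  # unrelated; should score lower
]
for a, b in pairs:
    va, vb = model.encode([a, b])
    # Cosine similarity in embedding space captures meaning that exact
    # string comparison misses entirely.
    print(f"{a!r} vs {b!r}: {util.cos_sim(va, vb).item():.2f}")
```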
Where it's really transformative is in creating more natural user experiences for actually working with the data and solving these problems for end users. One of the challenges we've wrestled with at Tamr, and gotten better and better at over our 12, 13 year existence, is taking complex data problems and complex machine learning concepts, encapsulating them in a simple user interface, and making them understandable to end users who are not PhDs in machine learning, AI, and statistics.
LLMs can actually come in and explain concepts and complex scenarios in straightforward ways. A very common situation for us: our system is doing some record matching and consolidation and presents the user with an ambiguous case. We've clustered data from 40 different systems; here are 40 different records describing what we think is the same actual person, one of your customers. What do you think, are these all in fact the same? Looking at a table with 40 records and a lot of columns is an overwhelming experience. It can be difficult for a human to parse that and decide whether they agree with the machine learning or not. Did it hallucinate this?
What a language model can do is give you ways to summarize what's on screen, these very complex concepts, and draw the user's attention: look, all 40 records here share the same last name. For the first name, there are only 3 different values, and this one is a common nickname, so it looks legit. There are a few different addresses, but it seems like maybe they moved over time. We can put that in context, and we can also include what we know about the model on the back end that produced this suggestion.
We can provide that context to a language model as well, and it can explain: the reason these got pulled together is the weightings. Say this is a B2C customer mastering data product; in this data product, the model looks for and puts a lot of weight on commonality in the name, the address, and the phone number. That gives users a way to engage with a hard, very niche problem, but in a way that's easy to understand and accessible.
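That review experience can be approximated by handing the clustered records and the model's top-weighted signals to an LLM and asking for a plain-language summary. A hedged sketch, with the same client assumptions as the earlier snippet and entirely hypothetical records and signals:

```python
import json
from openai import OpenAI

client = OpenAI()

# Hypothetical cluster: records the matching model grouped as one person.
cluster = [
    {"source": "crm",     "first": "Bill",    "last": "Evans", "city": "Boston"},
    {"source": "billing", "first": "William", "last": "Evans", "city": "Boston"},
    {"source": "support", "first": "Bill",    "last": "Evans", "city": "Cambridge"},
]
signals = {"top_weighted_fields": ["last", "phone", "address"]}

prompt = (
    "These records were clustered as one customer. Summarize, for a "
    "non-technical reviewer, the evidence for and against the merge:\n"
    + json.dumps({"records": cluster, "model_signals": signals}, indent=2)
)
resp = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder
    messages=[{"role": "user", "content": prompt}],
)
print(resp.choices[0].message.content)
```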
And why this is so important is trust. One of the interesting things since the launch of ChatGPT and the mainstreaming of AI and large language models is that it has brought this question of trust, can you trust the robots, into mainstream consciousness. We've been dealing with this problem since the beginning; we've always been a machine learning company, and everyone always loves to be smarter than the computer. So there's always a questioning of: I see the model is suggesting this; I want to make sure the model isn't doing something crazy.
With large language models now being mainstream, people know that these things hallucinate. They come up with nonsense if you push them a little too far. So there's this real questioning of: if you're using machine learning and AI techniques to make this data good, can I trust it? Is it real? Being able to take our results, the results of the modeling, and communicate them in a way that puts the data front and center but also provides the context is really important for end users to feel like, yeah, this is actually solving the problem. This is real.
This is data I can trust. This is the truth about our customers, and we should adopt it and share it. It's kind of ironic that the technology that's leading to this crisis of trust can also be a big part of the solution. But I think, as we move past the initial shock of artificial intelligence, that's what we're moving into.
[00:29:15] Tobias Macey:
As a listener of the Data Engineering Podcast, you clearly care about data and how it affects your organization and the world. For even more perspective on the ways that data impacts everything around us, don't miss Data Citizens Dialogues, the forward-thinking podcast brought to you by Collibra. You'll get further insights from industry leaders, innovators, and executives in the world's largest companies on the topics that are top of mind for everyone. In every episode of Data Citizens Dialogues, industry leaders unpack data's impact on the world, like in their episode "The Secret Sauce Behind McDonald's Data Strategy", which digs into how AI-driven tools can be used to support crew efficiency and customer interactions. In particular, I appreciate the ability to hear about the challenges that enterprise scale businesses are tackling in this fast-moving field.
The Data Citizens Dialogues podcast is bringing the data conversation to you, so start listening now. Follow Data Citizens Dialogues on Apple, Spotify, YouTube, or wherever you get your podcasts. In that space of large language models and their application to the problem of master data management, what are some of the pieces that are missing from the out-of-the-box perspective of: I have a large language model; now I actually have to stand up pieces x, y, and z before I can even really start to bring it to bear on the problem, particularly in that context of hallucination and trust building?
I'm wondering how you have been working through that challenge of harnessing the capabilities while mitigating the inherent risks, in the problem of actually building trustworthy data with an inherently unpredictable utility.
[00:31:04] Daniel Bruckner:
Yeah. Let's see, I'm going to go back a little to start. When we got started at Tamr, our initial vision and our initial goal, this was in the days when the semantic web was really hot and knowledge graphs were becoming a big deal, was: we want a product that can efficiently produce the enterprise knowledge graph. So within your organization, you have this extremely high quality linked data describing everything you do and everything you care about, all the key entities: your customers, your suppliers, your employees, all the parts and products you produce, this whole space.
And in the context where we are today, with these large-scale models, that idea of a correct knowledge graph is still the guiding principle. It's really this idea of truth for the enterprise. To get real value out of large-scale artificial intelligence, you need to find ways and architectures to tie it back to that truth. It's very good at syntax and articulating ideas; we just need to do a good job of giving it the right content and the right context, depending on the problem being solved.
So there's an increased importance of data quality, data linkage, and master data management, to be able to produce these common datasets and maintain them, so that you can point your GPT at them and get really good, high quality, trusted answers from the AI.
[00:33:03] Tobias Macey:
I imagine the inclination for people who are thinking about bringing AI to bear on this problem is: AI is very sophisticated, it has all of these nifty capabilities, so I should be able to just set it loose on my data and it'll solve all of my problems without my having to worry about it, maybe just click yes or no a couple of times. Versus what I imagine to be the reality: you actually want to use all of those manual and statistical techniques that we have been relying on and developing for the past 30 years to handle maybe the 80% case, and use the LLM for the 20% case that takes 80% of the time, to accelerate the process a bit. I'm curious how you are guiding organizations on that strategic aspect of how, where, when, and why to actually bring these language models to bear on the problem, in conjunction with all of the other techniques that have been developed and in which we have established trust and confidence.
[00:34:11] Daniel Bruckner:
Yeah. That question has become central for everyone these days: am I comfortable using generative AI with my data? The previous version of this was: am I comfortable putting our most business-critical, high-value data assets on the cloud? That's now shifted. Most organizations are comfortable with the cloud, but now it's: can machine learning look at it? What if, God forbid, someone trains a model on our data and shares that model?
I think there's a certain amount of just feeling out the right level of security around these things, and I don't want to go into that too deeply. But for the purposes of solving problems in this space, there are big opportunities to improve the quality of matching and mastering using these new models. They just need to be harnessed. There's been a lot of research over the last few years applying large language models to these sorts of fuzzy database problems: fuzzy joins, group-bys, schema mapping. What it basically finds is that large language models, with very little configuration and engineering, perform as well as a lot of the state-of-the-art techniques that existed previously.
The challenge is putting those techniques into a system and a product where they're used intelligently in conjunction with other, lower cost techniques from more traditional machine learning, and also in conjunction with rules and human feedback. One of our founding principles at Tamr is that the machine is never right 100% of the time. You need humans in the loop to review, to address complex cases, and to assess how well things are going. So you need a system that incorporates all of these pillars: human input, rules-based input, and the models themselves.
And users love rules. If you just say a match means the Social Security number values are equal, everyone loves that. They understand what it means. Given the truth about data quality and how things exist in the real world, that rule might actually be wrong a large proportion of the time in a particular real case. But people love that when it's wrong, they get why, as opposed to some machine learning model being wrong, and when they want to know why, it's: well, let me show you this random forest decision tree and talk about that.
[00:37:14] Tobias Macey:
Let me explain carefully that. Yeah.
[00:37:18] Daniel Bruckner:
So you need all of these things, and they all come together to create a coherent solution that doesn't have an absolutely overwhelming cost. One thing we haven't talked much about is that if you want to solve these problems at scale, it can become very expensive very quickly. By their nature, these are all matching problems. They're quadratic: naively, you would be comparing everything to everything else. So if you take a naive approach, you're going to burn a lot of compute, spend a lot of money, and probably not get the best results. You need a way to identify the easy parts of the problem and solve them easily.
The unsolvable parts of the problem, send them to a human. And then there's everything in between, and that's where we're seeing the biggest boosts from vector embeddings, large language models, and newer cutting-edge techniques. They can dig into those ambiguous cases and get a lot of value.
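The standard answer to that quadratic blow-up is blocking: only records that share a cheap key (zip code, name prefix, a phonetic code) are ever compared by the expensive matcher. A minimal sketch with invented records and an assumed blocking key:

```python
from collections import defaultdict
from itertools import combinations

records = [
    {"id": 1, "name": "John Smith", "zip": "02139"},
    {"id": 2, "name": "Jon Smith",  "zip": "02139"},
    {"id": 3, "name": "Jane Doe",   "zip": "94103"},
    {"id": 4, "name": "J. Doe",     "zip": "94103"},
    {"id": 5, "name": "Acme Corp",  "zip": "10001"},
]

def blocking_key(r):
    # Cheap, deterministic key: zip plus first letter of the name.
    return (r["zip"], r["name"][0].upper())

blocks = defaultdict(list)
for r in records:
    blocks[blocking_key(r)].append(r)

# Only compare within blocks: 2 candidate pairs here instead of all 10.
candidates = [pair for block in blocks.values()
              for pair in combinations(block, 2)]
for a, b in candidates:
    print(a["id"], b["id"])  # send these to the expensive matcher / LLM
```

Real systems use several overlapping blocking schemes so that a typo in one key doesn't hide a true match.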
[00:38:26] Tobias Macey:
Another complexity of this space, particularly when you're first embarking on the process of trying to reconcile all of your organizational data, is the level of expertise in the process of master data management, as well as the level of familiarity with the data itself. The person who created the data, figured out what the schema should be, and decided what attributes to pick may not even be with the company anymore, so you don't have all of the context or all of the information. And especially when you have an inexperienced team that's just starting on this process, you say: hey, here, rub some machine learning on it, it's magical.
I'm wondering what potential pitfalls you're setting them up for if they don't have an appropriate understanding of the actual capabilities and limitations of the techniques, so that they can be appropriately skeptical or appropriately confident where each applies.
[00:39:31] Daniel Bruckner:
Yeah. That's touching back on the political problem of master data management. You have to get a lot of different kinds of people involved. The domain experts, the subject matter experts, are very rarely the people who own the project. They have to be drafted in and convinced to share their time to really make the project successful. So yes, there does tend to be skepticism of what the system is doing, and if it's an AI-based system, the skepticism is increased. They'll look at the data and say: this doesn't make sense. Why is it doing it this way? So you need a workflow that embraces that uncertainty to an extent, or makes users comfortable with the fact that the data is bad and we're not going to fix it all at once. It's going to be a process.
Something interesting we learned over the years: one of our earlier products was very explicitly designed as a system for end users to train models to master data. There were workflows for end users to come in, and the system used active learning techniques to surface really high value examples. Users go in, they label the examples, and the model gets trained as quickly as possible. It made this ML-practitioner experience available to non-ML experts.
And that system, as easy as we made it, was still hard to use, and you still had to understand machine learning to some degree. For subject matter experts that's a challenge, and there's a lot of hand-holding. So we moved toward pretrained models: we have general models that apply to your domain and start at a very high level, and then you tune them. And what we found when we first did that, having dropped the active feedback workflow, was that it actually damages trust with the end users. They want to give feedback to the machine. They want to have that back and forth, that conversation.
That's really important for gaining trust in the system: the system directing how you should be exploring the data, how you should be interacting with it, and how you should be understanding it. So we've brought that back. Even though you're not actually training the model now, you can still go in and interact with the system as if you were training it, and it's a positive user experience.
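The active-learning loop Dan describes often reduces to uncertainty sampling: surface the pairs the model is least sure about, because a human label there teaches it the most. A sketch, assuming the toy `features()` function and fitted `model` from the earlier classifier snippet (the review-UI call is hypothetical):

```python
def most_uncertain(model, candidate_pairs, k=5):
    """Pick the k candidate pairs whose match probability is closest
    to 0.5, i.e. where a human label is most informative."""
    scored = []
    for a, b in candidate_pairs:
        p = model.predict_proba([features(a, b)])[0][1]
        scored.append((abs(p - 0.5), (a, b, p)))
    scored.sort(key=lambda t: t[0])
    return [pair for _, pair in scored[:k]]

# Loop sketch: label the uncertain pairs, retrain, repeat.
# for a, b, p in most_uncertain(model, candidates):
#     label = ask_human(a, b)   # hypothetical review-UI call
#     pairs.append((a, b, label))
# ...then refit the model on the grown training set.
```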
[00:42:26] Tobias Macey:
And in your experience of building Tamr, working with organizations to address this challenge of master data management, and incorporating ML and the newer generation of generative AI capabilities into that system, what are some of the most interesting or innovative or unexpected ways that you've seen ML and AI techniques used in that context of data resolution and master data management?
[00:42:56] Daniel Bruckner:
Yeah. It's a good question. Our architecture is fairly generic. We primarily work with structured enterprise data and relational database systems, but we can extend the system to work with more complex data types. A number of years ago, we added support for geodata, GeoJSON, essentially polygons. And the applications users have for that are really quite interesting and surprising. Normally, you think of master data management as applying to a fairly narrow subset of data: customer data, organizations, people, parts, products.
That's the sweet spot. But with some of these other data formats we support, we've seen it applied to things like radar tracks, keeping track of fuzzy data related to planes and other kinds of aerial phenomena. It really gets out there. It's pretty cool to see.
[00:44:13] Tobias Macey:
In your work of building these systems and coming to grips with the constantly evolving landscape of ML and AI techniques, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process of building a product that harnesses them?
[00:44:31] Daniel Bruckner:
A big challenge that we have frequently thought we'd solved, but where we keep finding better solutions every couple of years as each solution reveals another side of the problem, is connecting the two modes of solving these sorts of problems. We talked earlier about taking some data as a snapshot, as a one-off project: you run some large-scale batch computation, you put it all together, and then you're done. But the data is not static. What happens when customers x, y, and z all come in and someone moved, their address changed? Maybe some customers passed away and you no longer want to be sending them offers in the mail. Maybe two companies merged, or split up.
Things happen in the real world, and you need to manage this on an ongoing basis. Most applications in the real world aren't content with a one-off solution; they need the problem solved in a live, ongoing, updated-in-real-time fashion. And we found it's a major challenge marrying the extremely efficient, high-throughput, batch-oriented solutions to these problems with a more operational, live database system that solves it on an ongoing basis, in real time or in a streaming, event-driven fashion.
It means taking the core of the system, all the techniques we've been talking about, rules-based matching, lessons from natural language processing and information retrieval, and the latest AI has to offer, and applying them in consistent ways across two extremely different architectures, marrying those together, and doing it in a way that end users can actually use, transitioning from one to the other. So you can come into the system, load a whole bunch of data, process it in a big way, create that initial starting point for what your master data should look like, and then, boom, you're up and running. It's in a live database.
Now you can interact with it directly. You can start to point other systems at it and use it in a way where it's not a one-off. It's not some extract that's going to be irrelevant tomorrow that no one adopts. It becomes a living piece of the overall architecture within the organization.
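One way to picture the two harnesses sharing one core: a bulk job builds the initial golden records, then an event-driven path scores each arriving record against them with the same logic. A toy sketch of the incremental half (the similarity function, threshold, and records are assumptions for illustration):

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Golden records produced by the initial batch run (hypothetical).
golden = {"G1": {"name": "John Smith"}, "G2": {"name": "Acme Corporation"}}
next_id = 3

def on_new_record(rec, threshold=0.75):
    """Event-driven path: attach the incoming record to an existing
    cluster, or mint a new golden record. Using the same scoring as the
    batch job keeps the two modes consistent."""
    global next_id
    best_id, best_score = None, 0.0
    for gid, g in golden.items():
        s = similarity(rec["name"], g["name"])
        if s > best_score:
            best_id, best_score = gid, s
    if best_score >= threshold:
        return best_id                   # merge into the existing cluster
    gid = f"G{next_id}"; next_id += 1
    golden[gid] = rec                    # new entity
    return gid

print(on_new_record({"name": "Jon Smith"}))     # likely attaches to G1
print(on_new_record({"name": "Globex, Inc."}))  # mints a new golden record
```

At scale, the loop over all golden records would be replaced by a blocking index lookup, as in the earlier sketch.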
[00:47:31] Tobias Macey:
And for people who are addressing these challenges of master data management and data reconciliation, trying to figure out the cross-cutting view of their business, what are the cases where ML and AI are the wrong choice for some or all of that problem?
[00:47:51] Daniel Bruckner:
Yeah. Good question. I think it comes down to the simplicity of the problem. AI and ML are bright and shiny, but for smaller scale problems, maybe you're just putting together a couple of sources, or you have a one-off, you're trying to create a one-time presentation. If you don't have a lot of complexity in the problem, then deterministic techniques are likely to win. There's nothing wrong with rules, and applying a rule can be much cheaper than applying AI and ML to the problem.
So if you have a low-stakes scenario where you can just 80/20 it and get a good answer quickly, then, yeah, go crazy with deterministic solutions. That should really be the first step of any approach: pick the low-hanging fruit, then get into the hard problem and make it really good.
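That "low-hanging fruit first" ordering can be made literal as a cascade: try the cheap deterministic rule, then a fuzzy score, and only escalate the ambiguous middle to expensive models or humans. A minimal sketch with hypothetical records and thresholds:

```python
from difflib import SequenceMatcher

def decide(a, b):
    # Tier 1: deterministic rule -- cheap, explainable, widely trusted.
    if a.get("ssn") and a.get("ssn") == b.get("ssn"):
        return "match (rule: equal SSN)"
    # Tier 2: fuzzy score on names -- still cheap, handles typos.
    s = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    if s >= 0.9:
        return f"match (fuzzy {s:.2f})"
    if s <= 0.3:
        return f"non-match (fuzzy {s:.2f})"
    # Tier 3: the ambiguous middle -- escalate to an ML model, an LLM,
    # or a human reviewer.
    return "escalate"

print(decide({"ssn": "123-45-6789", "name": "J Smith"},
             {"ssn": "123-45-6789", "name": "John Smith"}))  # rule match
print(decide({"ssn": None, "name": "Jon Smith"},
             {"ssn": None, "name": "John Smith"}))           # fuzzy match
```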
[00:48:57] Tobias Macey:
And as you continue to invest in and keep tabs on this evolving space of large language models and generative AI and its application to the challenge of data cleaning, what are some of the hopes or predictions that you have, or any specific techniques or areas of effort that you're keeping a close watch on?
[00:49:22] Daniel Bruckner:
Yeah. I think the ability to build intelligent agents into existing user workflows is creating a big opportunity. The first wave of this AI rollout was "put a chatbot in it": you've got a product, put a chatbot in it, it's going to be amazing. And there are some good applications for that. But what's really coming next is looking at the problems where LLMs are extremely well suited, and then applying those to actual key features within a product and deploying that. Really starting to think of it as a capability we can productionize.
How do we think about our product roadmap, where we build that, how we use it, and how we adopt it? I think the upshot is that a lot of the challenging work, not in solving the master data management problem itself, but in managing the system and its complexity, can be automated to a much larger extent. There's a lot of configuration that goes into pulling data from different systems, aligning all the schemas, figuring out how you want to enrich, what data quality transformations you want to apply, and how you want to pull in third-party source data: creating that model of what you want your master data to look like, starting from what all your source data looks like.
There are big opportunities for LLMs to go in and simplify that, turning it into a very straightforward, basic, wizard-like experience for setting up this extremely complex machine that goes and processes all this data in complex ways, and then for managing it over time. Putting agents into the system can take the hardest parts of the user experience and either automate them away or turn them into a delight for end users. So we're focused a lot on simplifying that experience and making master data management something that isn't this scary thing that sounds doomed to fail and will be very expensive, but more like: no, this is something you need. If you're not doing this, you're crazy. All your data could be 10 times better, and you won't be tearing your hair out to get there.
[00:52:00] Tobias Macey:
I think that point, too, of figuring out what that common cohesive schema is, what representation is going to be useful and applicable and easy to integrate, is one of the challenges as well, and maybe the LLMs can help set that initial pass of "here is something it could look like." Because at either end of the spectrum, you have people who are unable to see the art of the possible because it looks too daunting, or people at the other end who ask for the impossible because they think it's easy.
[00:52:35] Daniel Bruckner:
Yeah. Absolutely. LLMs are really good at translating. You can speak different languages, and they can act as an intermediary, and it just works somehow. So there's a vision for a future here: what if you did master data management and there wasn't even a single master data model? What if everyone got to keep the model they wanted from the beginning, and there's an LLM in the middle intelligently translating across these things? Everyone thinks they're speaking the same language, but it's really a Tower of Babel situation.
That's the promise here, and I think it's a big opportunity. There's still a lot of challenging engineering and product development to get there, but that's where we're headed.
[00:53:25] Tobias Macey:
Are there any other aspects of this overall space of master data management and the application of ML and AI to its execution and implementation that we didn't discuss yet that you'd like to cover before we close out the show?
[00:53:39] Daniel Bruckner:
I feel like I had something more to say about third-party data, but, honestly, I think we might be good.
[00:53:46] Tobias Macey:
Fair enough. Yeah. Well, for anybody who wants to get in touch with you and follow along with the work that you and your team are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:54:04] Daniel Bruckner:
It feels like it comes back to a location challenge somehow. Maybe I'm just thinking about this because it's the problem I've been dealing with lately, but I feel like we haven't really solved the cross-cloud problem. There are really good systems on different clouds, and they don't translate one to one. So there's a lot of essential technology that's locked up in different proprietary walled gardens. It's now very easy to build extremely powerful, cutting-edge data architectures for managing your data.
But you have to make some pretty big decisions at the outset, and some pretty big bets on vendors and who you trust in the market. It's gotten a lot harder to remain independent. On the other hand, it's also easier to remain independent: there are a lot of amazing tools breaking up the relational database into its component parts and using independent systems to put it back together, and at the same time, a lot of these amazing tools live in the open source world.
But it's difficult for those worlds to collide and to put it all together into a coherent approach. And I feel like there's a little too much satisfaction with folks thinking that if you put all the data into a single physical place, all of your problems are solved, when really you're just kicking a bunch of problems down the road for 10 years, until you get sick of your vendor and need to go do something dramatically different.
[00:55:57] Tobias Macey:
Now, the data gravity problem is definitely real, and until we're able to circumvent physics, it won't go away. Alright. Well, thank you very much for taking the time today to join me and share your thoughts and experience on building these master data management workflows, bringing ML and AI to bear, and some of the ways that the current generation of LLMs and generative AI are adding new capabilities and techniques to that process. I appreciate all the time and energy that you and your team are putting into making that more accessible and easier to apply to this challenge, and I hope you enjoy the rest of your day.
[00:56:38] Daniel Bruckner:
Thank you. Yeah. Thanks so much for having me. This has been fantastic.
[00:56:50] Tobias Macey:
Thank you for listening, and don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering Podcast is your guide to the fast-moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Overview
Guest Introduction: Dan Bruckner
Challenges in Data Management
Master Data Management Approaches
Scaling Challenges in MDM
Evolution of MDM and AI
Impact of Large Language Models
Trust and AI in Data Management
Human Interaction and Machine Learning
Operational Challenges in MDM
Future of AI in Data Management