Summary
With the proliferation of data sources that give a more comprehensive view of the information critical to your business, it is even more important to have a canonical view of the entities that you care about. Is customer number 342 in your ERP the same as Bob Smith on Twitter? Using master data management to build a data catalog helps you answer these questions reliably and simplifies the process of building your business intelligence reports. In this episode the head of product at Tamr, Mark Marinelli, discusses the challenges of building a master data set, why you should have one, and some of the techniques that modern platforms and systems provide for maintaining it.
Preamble
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
- You work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning lifecycle. Skafos maximizes interoperability with your existing tools and platforms, and offers real-time insights and the ability to be up and running with cloud-based production scale infrastructure instantaneously. Request a demo at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
- Your host is Tobias Macey and today I’m interviewing Mark Marinelli about data mastering for modern platforms
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by establishing a definition of data mastering that we can work from?
- How does the master data set get used within the overall analytical and processing systems of an organization?
- What is the traditional workflow for creating a master data set?
- What has changed in the current landscape of businesses and technology platforms that makes that approach impractical?
- What are the steps that an organization can take to evolve toward an agile approach to data mastering?
- At what scale of company or project does it make sense to start building a master data set?
- What are the limitations of using ML/AI to merge data sets?
- What are the limitations of a golden master data set in practice?
- Are there particular formats of data or types of entities that pose a greater challenge when creating a canonical format for them?
- Are there specific problem domains that are more likely to benefit from a master data set?
- Once a golden master has been established, how are changes to that information handled in practice? (e.g. versioning of the data)
- What storage mechanisms are typically used for managing a master data set?
- Are there particular security, auditing, or access concerns that engineers should be considering when managing their golden master that go beyond the rest of their data infrastructure?
- How do you manage latency issues when trying to reference the same entities from multiple disparate systems?
- What have you found to be the most common stumbling blocks for a group that is implementing a master data platform?
- What suggestions do you have to help prevent such a project from being derailed?
- What resources do you recommend for someone looking to learn more about the theoretical and practical aspects of data mastering for their organization?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Tamr
- Multi-Dimensional Database
- Master Data Management
- ETL
- EDW (Enterprise Data Warehouse)
- Waterfall Development Method
- Agile Development Method
- DataOps
- Feature Engineering
- Tableau
- Qlik
- Data Catalog
- PowerBI
- RDBMS (Relational Database Management System)
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline, you'll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40 gigabit network, all controlled by a brand new API, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And you work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning life cycle. Skafos maximizes interoperability with your existing tools and platforms and offers real-time insights and the ability to be up and running with cloud-based production-scale infrastructure instantaneously.
Request a demo today at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science. And if you're attending the Strata Data Conference in New York in September, then come say hi to Metis Machine at booth P16. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch, and join the discussion at dataengineeringpodcast.com/chat. Your host is Tobias Macey, and today I'm interviewing Mark Marinelli about data mastering for modern platforms. So, Mark, could you start by introducing yourself? Sure. My name is Mark Marinelli. I head product here at Tamr.
[00:01:40] Unknown:
I've been in the data management space for about 20-some-odd years now, having cut my teeth back when multidimensional databases and MOLAP were a thing, and then spending quite some time in what is now the self-service data prep world. And now here at Tamr, I am working on technologies that are really bringing machine learning and AI into the fold, as, over time, we've progressively gotten better at automating a lot of those data management pipelines.
[00:02:14] Unknown:
And do you remember how you first got involved in the space?
[00:02:18] Unknown:
Yeah. I've always been fascinated by data. Undergrad, it was computer science, and the data stuff was more interesting than the network stuff or anything else. So when I first started as a software developer, I was working on a data management platform, as I said, in the multidimensional database space back in the late nineties. And ever since, I've just been really interested in this variety of technologies that I've outlined, and in where the rubber meets the road, where those data actually get into people's hands, and in facilitating getting useful data into people's hands as quickly as possible, as opposed to the sort of big-iron ETL stuff. That's where self-service data prep, or now this sort of modern, machine-learning-driven data unification, comes in: how quickly can we derive value from those data and get them to people so they can do their business or their data science or whatever it is. And so before we get too deep into the topic, can you start by sharing what your definition is of a master dataset and data mastering so that we can use that to build off of? Sure. Yeah. So data mastering, or maybe we would say at Tamr, data unification, is a little bit broader than the loaded term of mastering. But, essentially, you're taking data from a variety of different systems which are all trying to describe the same thing, you know, the same entity, like a person or an invoice or whatever. We're taking data from across all of those systems, which were not designed to be interoperable.
And unifying the data: putting them all in one place where they're aligned to a common set of attributes, where we've discovered linkage among the data, either because records are linked in some way or duplicated, and then rendering out canonical versions of each of these things, you know, a single view of the customer or a single view of a supplier or whatever, as what people would often call a golden record, so that the downstream analytical use cases, or whatever the consumption use cases are, can work with the latest and greatest, and most comprehensive, set of each of the customers, suppliers, invoices, whatever,
[00:04:33] Unknown:
that they need to. And can you discuss a bit more about how that master dataset or the unified data records get used within the overall
[00:04:42] Unknown:
processing and analytical systems within an organization? Yeah, sure. So the attributes of a problem set where mastering is essential are where you do have a variety of different views of what's essentially the same thing. A great example of something we see a lot is when you're trying to do customer analysis so that you can track your customer journey to upsell, cross-sell, you know, monetize your customers as best as you possibly can. If I've got 52 different variants of Mark Marinelli, the customer, across all of my channels where I'm collecting customer data, it's very, very difficult for me to bring all of those data together to have that unified picture so that I can say, wow, he's buying stuff more online than he used to, and that informs the way that I'm going to manage that customer. So anywhere there's that disparity of data that are all describing the same thing, Customer 360, as a lot of people would call it, is really popular.
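To make the Customer 360 example above concrete, here is a minimal sketch of collapsing several variants of the same customer into a single golden record. The field names and records are hypothetical, and the survivorship rule (most recently updated non-null value wins) is just one simple policy for illustration, not Tamr's actual approach.

```python
from datetime import date

# Hypothetical customer records for the same person, pulled from different channels.
records = [
    {"source": "erp", "name": "Mark Marinelli", "email": None,
     "city": "Cambridge", "updated": date(2017, 3, 1)},
    {"source": "web", "name": "M. Marinelli", "email": "mark@example.com",
     "city": None, "updated": date(2018, 6, 15)},
    {"source": "crm", "name": "Mark Marinelli", "email": "mark@example.com",
     "city": "Boston", "updated": date(2018, 1, 20)},
]

def golden_record(cluster):
    """Survivorship rule: for each attribute, keep the most recently
    updated non-null value found across the clustered records."""
    golden = {}
    for field in ("name", "email", "city"):
        candidates = [r for r in cluster if r[field] is not None]
        if candidates:
            golden[field] = max(candidates, key=lambda r: r["updated"])[field]
    return golden

print(golden_record(records))
# {'name': 'M. Marinelli', 'email': 'mark@example.com', 'city': 'Boston'}
```

In practice the survivorship policy would usually be configurable per attribute and per consumer, which comes up later in the conversation.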
Another one, a really different area of a business but with a lot of the same attributes, would be dealing with your suppliers. If you're a big global multinational and you are buying a lot of your components to build your hardware or software or whatever from a variety of different suppliers, those are all probably going to exist in a lot of different systems. A lot of the data that you receive from your suppliers are totally not under your control, so they're all describing the supplies that they're selling to you in different ways. It's essential for you to have a single view across all of that procurement or spend footprint so that you know that what one person describes completely differently from someone else is actually, in both cases, 8.5 by 11 inch printer paper. If I can't unify and deduplicate and master these data, then I don't know how much money I'm spending on each of these products with each of these suppliers.
So it makes it very difficult for me to do good inventory management. Once I've gotten to the other side of this, if I do have a single view of each of my suppliers, and I'm no longer treating the 15 affiliates of a supplier as 15 different suppliers because I didn't have this visibility but am now treating them as one, then I can see that my aggregate spend with them is much higher than I thought it was, and I should renegotiate my contract or, you know, change my purchasing behavior. So those are a couple of examples where, when the data are really dispersed across systems, bringing it all together and having a single way to treat your customers, your business partners, or whatever, is essential
[00:07:22] Unknown:
to getting the right analytical outcomes. And one of the upfront challenges that I was thinking about as you were discussing that is how you would sort of plan ahead to have some sort of unifying attribute that you can use to create this single record of a given entity, whether it's a customer or a vendor or a particular unit of, you know, resource that you're getting from that vendor. I'm sure that there is, in a lot of cases, some measure of manual effort involved there, but I'm curious if you can talk a bit about some of the practical issues and strategies around being able to plan ahead, in either your data model or the way that you're collecting the data, to make it easier to create that unified record?
[00:08:09] Unknown:
Sure, yeah. We see a variety of different techniques there. I would quibble a little bit with "plan ahead" because we talk a lot about agile mastering, and we see our customers sort of embracing this. We'll plan a little bit ahead, and we'll start with something, but know that what this unified attribute set looks like nine months from now is not necessarily what it's gonna look like three weeks from now, and not try to solve that entire problem a priori, but rather build on something that's a quick win. There's a few different ways to do this. One, there are off-the-shelf schemas for a lot of different entities. There's not a lot that differs for a human being from, you know, company to company if you're dealing with customers. You can get stuff like that from, you know, public websites. There's also probably already a master somewhere in your organization. You've probably already tried to solve this problem, and maybe you're struggling with it, but you've got a pretty decent definition of what a customer looks like that everybody has already started to plug into their analytical applications.
Use some subset of that and then map all of the other data sources, which have a variety of different ways to describe those data, to that same attribute set. That itself can be really cumbersome, because you have rule sets that are doing these mappings. And then a new dataset arrives, because you've got a supplier that's sending you a different type of data, or you've just bought a new company and now they have a system that has all of these data. As soon as that stuff arrives, it's gonna break all your rules. So you have to go back and retrofit those rules so that you can map those source datasets to that unified schema.
That's actually an area where we've done a lot of research, and we've got software that can automate that, because a machine can actually figure that out pretty well and be more resilient in the face of changing data on both sides, both the source and the destination schema, than a rule set that somebody's gotta build and maintain and that will constantly be behind the 8 ball. So, in brief, it's finding something off the shelf that already describes the relevant entity or domain.
There's out-of-the-box stuff that a lot of platforms bring you that is already encapsulating some of those end states. There's something maybe sitting around your business that is already good enough. And in any one of those scenarios, you're gonna wanna be able to modify that schema without incurring a capital-D, capital-M Data Modeling exercise where now we have to retrofit some entity relationship diagram or anything like that. You wanna be as lightweight and agile as possible.
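As a rough illustration of the schema mapping problem described above, here is a sketch that maps a new source's column names onto a small unified schema by name similarity rather than hand-written per-source rules. The schemas are hypothetical, and this is far cruder than the machine learning approach described in the conversation, which can also learn from the data values themselves.

```python
from difflib import SequenceMatcher

# Hypothetical unified schema and an incoming source with its own naming.
unified_schema = ["customer_name", "email_address", "postal_code", "phone_number"]
incoming_columns = ["CUST_NM", "EMAIL_ADDR", "ZIP", "PHONE_NO"]

def normalize(name):
    """Lowercase and strip separators so 'CUST_NM' and 'customer_name' compare fairly."""
    return name.lower().replace("_", " ").replace("-", " ")

def best_match(column, targets, threshold=0.4):
    """Return the unified attribute whose name is most similar, or None if nothing is close."""
    scored = [(SequenceMatcher(None, normalize(column), normalize(t)).ratio(), t)
              for t in targets]
    score, target = max(scored)
    return target if score >= threshold else None

mapping = {col: best_match(col, unified_schema) for col in incoming_columns}
print(mapping)
# 'ZIP' falls below the name-similarity threshold and maps to None, which is exactly
# the kind of case where learning from the data values themselves, not just the
# column names, does better than simple rules.
```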
[00:11:11] Unknown:
So when you're deciding on the records to canonically describe a given entity, it sounds like you want to establish the minimum number of attributes that you can possibly get away with on that record, so that you don't have this issue of the evolving data model then breaking any references to that entity from other systems.
[00:11:30] Unknown:
Absolutely. It's always gonna be this common subset that's just enough. I mean, I think that just sort of philosophically infuses a lot of what I'd have to say. Just get to that small, I will say minimum viable, schema that is the common subset across all of these datasets and that is useful immediately for analytical or operational purposes, and then build on top of that as the community that starts to use these master data proliferates and applies more requirements. And it's also not one size fits all. Every single one of these destination attribute sets is gonna be contextual. That's true for golden records; it's true for just the model that you're using for the datasets. One person is going to want different things from your customer set than another. And so it's really important for you not to try to just push out one sort of least common denominator to everybody, but to have a one-to-N relationship between your mastering logic and the way that golden record, or any of the data, are surfaced to different constituents,
[00:12:35] Unknown:
which itself could be really tricky. So one strategy seems like you might want to have the capability of layering additional attributes, where you have your sort of core minimal set of attributes, or maybe you just have a unifying ID, and then you have other systems that are adding degrees of information. So if somebody wants to have all of the records, they would maybe go to one interface, whereas if they want to infer their own information, they would just go to that sort of minimal set of records, to be able to reduce the amount of churn at the core set but layer in additional information as you go further afield from that? Yeah. It's a subset of these records, it's a subset of these attributes, and sometimes it's a different
[00:13:20] Unknown:
set of connections among those records. A common thing that we see is one use case defining individuals versus households. Two datasets are gonna have exactly the same column set, but one of them is gonna consider my father and myself as two different people, and another one's gonna consider us as one household if we live together. So have each analytical endpoint or consumer constituency be able to impose their own model on this stuff and their own, say, rule set, though oftentimes it's a machine learning model. Push that control over how the data are consumed as far out to the consumers as you possibly can, which is really not the way that we've been doing this for the better part of 30 years. Instead, you sit down with the business user and they try to explain what they're looking for, and you codify their business logic in a series of, you know, ETL transformations or MDM logic or an EDW.
And then you get some of it wrong because you have to make some assumptions, and then you go back to them, and rinse, repeat; there's a really protracted cycle there where you're working on their behalf. If, and a lot of the modern toolset allows this, these individuals can have a bit more agency over how the data are mastered and consumed, that's the best way to do it. Of course, that has governance implications, to make sure that not every single individual is going off, you know, creating their own view. But there are ways to balance that need for autonomy versus the chaos that can ensue if everybody's looking at things entirely
[00:15:02] Unknown:
differently. So earlier I said "plan ahead" and you took issue with that, and what you're describing here is this sort of waterfall process of defining the master schema and the way that this information is going to be codified and captured, contrasted with the more agile methodology that you're promoting and that's become more popular in software engineering as well. I'm curious: what about the current landscape of both businesses and the technology platforms that they're running on top of makes that waterfall approach impractical and enables us to move towards this more agile approach to managing these golden records? Well, a tough one for any waterfall
[00:15:47] Unknown:
environment is the shifting sands of data, and so much data is being collected now, a lot of it externally to companies, where they have no control over the format and quality of those data. The speed with which data are arriving, and the variety of formats in which these data need to be assimilated so that businesses can leverage all of these data for their competitive advantage or whatever their analytical use cases are, that just breaks the waterfall model. You can't wait two months for someone to incorporate a dataset that may be ephemeral, and you can't have to bring in your data engineering staff every time a new dataset arrives that's a little bit, or maybe appreciably, different from the stuff that you've seen before. So I think that the industry has focused a lot of its fire in the last decade, let's say, on solving the volume problem: big data, how to do Spark. That's all wonderful.
Not so much on the variety problem, or applying as much automation and simplicity to that. So there are areas where waterfall, I think, is just fine. If you've got a credit card transaction processing ETL pipeline that's just humming along, and the format of those data isn't gonna change all that much, then fine, do a yearly release of your update of that platform. But when you're trying to be really nimble and constantly accumulate new data and validate whether those data are even useful, if you have to stand up a year-long project where you don't even know if it's cost-justified because you're not sure if these data are useful, you're just not gonna do it. And so for somebody who is trying to move toward having a
[00:17:40] Unknown:
centralized sort of master set of records for being able to model these entities that are important to their business if they don't already have a system in place to be able to capture and integrate that information, and they don't want to get caught in this trap of having this multi month or multi year project to build it up. What are some of the steps that they can take to move toward having that capability without having to do a sort of stop the world approach of not doing any other feature development or halting other projects where they can just sort of integrate this into their existing workflows?
[00:18:15] Unknown:
Yeah, I think there's kind of three different layers to adopting an agile approach. One of them is mindset, an agile mindset. Everybody in the organization needs to be okay with some of the risk of doing things faster and maybe not as completely as they would in a waterfall project, knowing that they're going to realize immediate or short-term benefit that is going to be important to their business. There are certain areas where waterfall is the way to go. I'm happy that the people who build large commercial airplanes are not, you know, testing an MVP on me as a passenger, and that they took that approach. And so there are things in data management where the accuracy of those data has to be so complete and trustworthy that you may build in a lot of these checks and a lot of the big-iron scaffolding. There are other areas, say we're gonna run a marketing campaign for our customers, where if I received the wrong mailer that was intended for someone else because they got my golden record wrong or they incorrectly mastered me, so what? If that's happening 2 percent of the time, but in order to get the system stood up in six weeks and do something productive we had to cut that corner, everybody's gotta be okay with that, or at least judicious in how they take on projects with this agile mindset, so they know where agility and probabilistic, potentially a little bit incorrect, outcomes are acceptable versus the other stuff that needs more rigor. So there's a mindset component. That's the most important thing.
Then there's a skill set component: as you're starting to do this work, make sure that you've got these squads of folks where you have the consumers of the data, the data engineers, and the brokers of these data from these systems all working very closely together, in as close to a sort of agile scrum methodology as you can get, in rapid iterations, collaboratively working to build up incrementally. And then the last component, after mindset and skill set, is tool set. You can have a wonderful mindset and want to adopt all of this agility and be open to it, but if you're using legacy systems that really cannot adapt, cannot introduce new functionality easily, you're just not gonna get anywhere. So we at Tamr definitely have been proponents of what we call a DataOps approach to this, where instead of taking one monolithic tool that supplies the entire data supply chain, from raw data out through analytics-ready data,
you just know from the beginning that you should be pulling together a suite of interoperable technologies that are best of breed. Each one of them is going to contribute the best of breed: one will be the best of breed for mastering, another will be the best of breed for cataloging, another one for governance. And you layer those capabilities on, because you don't have to get all of them working in their fullness before any useful data drips out the other side. So start with a small mastering project, with technologies like our own, and I'll put in a plug for Tamr here: we can get very far very quickly because we're offloading a lot of the historical data modeling and rules creation onto machine learning models. We can get there really quickly if you take on something where your risk profile is okay: take in all of your customer data, start doing interesting things with the customer data, and then layer in another technology for cataloging these data now that you've built a lot of these relationships and a lot of this metadata. And then layer in something for governance so that you have controlled consumption and controlled access to these data. Just don't try to take it on all at once, and make sure that the tools that you choose are each very interoperable with the others. It's gonna be very API-first design. You're gonna have to do some work to stitch them together, but you don't have to do it all upfront. And the benefit of layering this stuff in and keeping it disaggregated and decoupled is that if somebody comes along and builds a better mousetrap for any one of these components, you toss your existing one out and throw the other one in with minimal disruption. And is there
[00:22:43] Unknown:
a particular size or scale of a company or project at which there's a tipping point where it makes sense to start building the master dataset, or do you think that it's something that anybody at any scale can start integrating into their platform?
[00:22:58] Unknown:
The latter. If you've got more than one platform or, you know, one dataset that contains overlapping or similar data, you've gotta do mastering. So the important thing is to make it economical for people to be able to do mastering at small scale, and not only when you've already got 14 ERP systems that you've gotta stitch together. If you've got a Salesforce instance, or two Salesforce instances, and some CRM data in some on-prem database, you should be mastering.
[00:23:31] Unknown:
And so one of the things that you were discussing, as far as being able to accelerate and simplify the effort of building this master dataset, is integrating machine learning and artificial intelligence methods. I'm curious what you have found to be the limitations of that approach for merging the datasets, and where it's necessary to have human intervention to build up these catalogs?
[00:23:57] Unknown:
So, in reverse order, the human intervention is a feature, not a bug, in that with the machine learning approaches we see, it's supervised machine learning; the machine really needs to learn from the human understanding of the data. Where the machine is exceedingly helpful is that it learns pretty quickly, and over a few iterations of training and correcting some of the suggestions that the machine learning provides, you can get something very robust and trustworthy, versus the sort of long process of codifying the rule set that would be the alternative to these model-based approaches.
The limitation to that is that if a human being can't figure out whether two records belong together, or a human being can't figure out certain things about how these golden records should be constructed, a model is not gonna be able to figure it out either. So there needs to be some baseline of recognizable attributes and structure to the data; in sort of data science terms, we may have to do some feature engineering and feature extraction. So if a bunch of unstructured data comes in, you may have to turn that into some attributes on the data that can be recognized both by the humans and the machines. So there's oftentimes some preprocessing work and transformation necessary to make it human-ready and machine-ready.
And if that's not the case, rule sets aren't really gonna get that far either. But I definitely say that at this point, it's a supervised machine learning approach, not a completely unsupervised "just give the model your data and it's gonna figure it out." It's still supervised, and thus is predicated on a human's ability to make these distinctions, and then the machine can do it at scale.
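A minimal sketch of the supervised approach being described: engineer a few similarity features for pairs of records, then let a simple classifier learn from human-labeled pairs whether two records refer to the same entity. The records, features, and labels here are invented for illustration, and scikit-learn's logistic regression stands in for whatever richer models a real platform would use.

```python
from difflib import SequenceMatcher
from sklearn.linear_model import LogisticRegression

def features(a, b):
    """Turn a pair of records into numeric similarity features."""
    name_sim = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    same_email = 1.0 if a["email"] and a["email"] == b["email"] else 0.0
    same_zip = 1.0 if a["zip"] == b["zip"] else 0.0
    return [name_sim, same_email, same_zip]

# Hypothetical pairs labeled by a human reviewer: 1 = same entity, 0 = different.
labeled_pairs = [
    ({"name": "Mark Marinelli", "email": "mark@example.com", "zip": "02139"},
     {"name": "M. Marinelli", "email": "mark@example.com", "zip": "02139"}, 1),
    ({"name": "Mark Marinelli", "email": "mark@example.com", "zip": "02139"},
     {"name": "Mary Martin", "email": "mary@example.com", "zip": "10001"}, 0),
    ({"name": "Acme Corp", "email": None, "zip": "60606"},
     {"name": "ACME Corporation", "email": None, "zip": "60606"}, 1),
    ({"name": "Acme Corp", "email": None, "zip": "60606"},
     {"name": "Apex Partners", "email": None, "zip": "94105"}, 0),
]

X = [features(a, b) for a, b, _ in labeled_pairs]
y = [label for _, _, label in labeled_pairs]

model = LogisticRegression().fit(X, y)

# Score a new candidate pair; the human review of low-confidence pairs is the
# ongoing training loop described in the conversation.
candidate = ({"name": "Marc Marinelli", "email": "mark@example.com", "zip": "02139"},
             {"name": "Mark Marinelli", "email": "mark@example.com", "zip": "02139"})
print(model.predict_proba([features(*candidate)])[0][1])  # probability the pair matches
```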
[00:25:45] Unknown:
And do you think that it's practical for somebody to build up their own AI models on their own datasets? Or do you think that you at Tamr and similar companies have the advantage of being able to run these models against multiple customers' data for being able to determine what some common approaches and common formats of some of these attributes are for being able to more intelligently merge these disparate datasets?
[00:26:09] Unknown:
Yeah, that's a great question, and it's one that we get from our customers and prospects all the time: "I just hired 50 data scientists from the best programs in the country. These algorithms are well understood, you know, 20 years old now for some of this stuff. So I'm just gonna do this myself." My reaction to that is that there are a few different aspects of a product that does this that are gonna be superior to a project that does this. You're gonna stand up a project, you get your big-brain data scientists to go and apply some of these models, and we've seen it: then they move on to the next project. New data arrives and it perturbs the model, or the model degrades for some other reason; where are they now? Okay, we're gonna bring them back off of that project, and they're gonna hopefully remember what they did in this thing and have really good development practices so they can quickly update the model. But we're gonna be sort of single-threaded through them, because it is only on our behalf as data consumers that those data scientists were able to build this model.
Better would be a product that allows direct contribution to the upkeep and the quality of those models, and the continual training of those models, from the people who are consuming the data. So one aspect of product versus project is the way that you continually solicit the training that you need to keep those models trustworthy. Another is stitching what you've done in with the rest of your ecosystem of platforms, as I was describing before. How do we get plugged into our catalog? How do we get plugged into our governance platform? And how do we get plugged into, you know, Tableau or Qlik? Having a project build APIs and interoperability and the right layer of connectivity is something that most folks are just not gonna take on, because they just wanna get it done, and they're not thinking of this as a general-purpose component within their data landscape.
So that's another place where product is going to be superior. And the last is just the particular ensemble of models, or algorithms in the model, that make these models accurate and also computationally acceptable. We've got a lot of compute out there, but these are really n-squared, very computationally expensive practices. And there are ways that you can pull together the sequence of processing, and ways that you can constrain the problem, that can actually give a far more acceptable compute footprint and things like real-time or low-latency response.
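One common way to constrain that n-squared pairwise comparison problem is blocking: only records that share a cheap key are ever compared against each other. A toy sketch, with a hypothetical blocking key:

```python
from collections import defaultdict
from itertools import combinations

records = [
    {"id": 1, "name": "Mark Marinelli", "zip": "02139"},
    {"id": 2, "name": "M. Marinelli", "zip": "02139"},
    {"id": 3, "name": "Acme Corp", "zip": "60606"},
    {"id": 4, "name": "ACME Corporation", "zip": "60606"},
    {"id": 5, "name": "Apex Partners", "zip": "94105"},
]

def blocking_key(record):
    """Cheap key: first letter of the name plus zip code. Only records sharing
    a key are compared, instead of all n*(n-1)/2 pairs."""
    return (record["name"][0].upper(), record["zip"])

blocks = defaultdict(list)
for r in records:
    blocks[blocking_key(r)].append(r)

candidate_pairs = [pair for block in blocks.values() if len(block) > 1
                   for pair in combinations(block, 2)]

print(len(candidate_pairs), "pairs to score instead of", len(records) * (len(records) - 1) // 2)
# Each surviving candidate pair would then go to the (expensive) matching model.
```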
That's something where I'd rather be working with a company that's been doing this for many years across dozens of customers than stubbing my toe trying to figure that out myself as a data scientist or a data engineer. And in terms of the
[00:29:07] Unknown:
golden master dataset, once it's been created, what have you found to be some of the limitations there in terms of being able to integrate it with other analytical systems, or issues with being able to create a canonical entity
[00:29:24] Unknown:
for a particular problem domain? Yeah, that can be kind of tricky, because it's an area where it's always going to be contextual, and it's fluid at the same time. If the contributions to that golden record are changing on a near real-time basis, or, you know, with any frequency, the folks who are consuming these golden records need to know that, and they need some agency over that. They need to be able to say: freeze this golden record, so that even as new data arrive and the underlying rules may suggest that a different record is going to contribute an address to this record, I don't want it right now. I know that's the more current record, but I don't want it right now, because I needed to sort of freeze things as they were. I also need to know some of the history of how we got here when these things change. Let's say I didn't freeze it, but I need to know the sort of lineage of how this golden record has changed over time, because in my context that may be really, really important in its own right. So I think the way I'd answer your question is that it's essential for golden records, the construction of golden records, the mechanism by which they're built, and the tools by which they are maintained, to be really transparent in all of the underpinnings of how this stuff is done, and to have both a concrete and user-configurable survivorship capability, giving all of the information, all the context, that allows the consumers to set the right rules and to make sure that they're not working with data that are stale unless they want to. And that brings me to another question that I had in terms of maintaining
[00:31:12] Unknown:
the historical context of a record as it evolves. Are there practices or platforms that you have worked with that allow for easy versioning of that data, so that you can, for instance, go back in time if you need to rerun a historical analysis and say, you know, this customer's address at that point in time was in Nevada versus where they are now in Virginia?
[00:31:37] Unknown:
Sure, yeah. There are both implicit and explicit ways to do that. Implicitly, you just keep versioning the data as new data arrive and anything changes, you know, the deltas from one golden record to another or from one cluster of records to another. You do that implicitly and provide it to people. But you also want to provide some explicit versioning and tagging so that folks, as I said, can freeze this thing, or say, I wanna publish these golden records right now, or I wanna publish this rule set and have that be the development rule set and have a different rule set be the production rule set. There needs to be a lot of configurability around how this stuff is done. And then, yeah, there are absolutely tools, increasingly, that are providing more of this end-user-friendly management of this stuff. Catalogs also help a ton in retaining, and making very useful, a lot of this metadata around lineage and history of data, graphically and in other ways, where people can really intuitively figure out what the life cycle of these data components is and how they can traverse back to a different time when the data were more relevant or better for whatever their purpose is.
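The explicit versioning, tagging, and freezing being described could look something like the sketch below, which uses a hypothetical in-memory history; a real platform would persist this and expose it through its catalog and lineage tooling.

```python
from datetime import datetime, timezone

class GoldenRecordHistory:
    """Keeps every published version of a golden record plus tags like
    'production', so consumers can pin a version or roll back."""

    def __init__(self, entity_id):
        self.entity_id = entity_id
        self.versions = []      # list of (timestamp, attributes) tuples
        self.tags = {}          # tag name -> version index
        self.frozen_at = None   # version index a consumer pinned, if any

    def publish(self, attributes, tag=None):
        self.versions.append((datetime.now(timezone.utc), dict(attributes)))
        if tag:
            self.tags[tag] = len(self.versions) - 1
        return len(self.versions) - 1

    def freeze(self, version_index):
        """Pin consumers to a specific version even as new data keep arriving."""
        self.frozen_at = version_index

    def current(self):
        index = self.frozen_at if self.frozen_at is not None else len(self.versions) - 1
        return self.versions[index][1]

history = GoldenRecordHistory("customer-342")
v0 = history.publish({"name": "Mark Marinelli", "state": "NV"}, tag="production")
history.freeze(v0)
history.publish({"name": "Mark Marinelli", "state": "VA"})
print(history.current())  # still the Nevada version, because the consumer froze it
```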
Again, in any business there are so many different constituents for these data, each wanting a different thing, that you need pretty good workflow and collaboration and configurability around how each of them is going to get what they want. Otherwise, they're gonna take a snapshot, they're gonna put it in Excel, and now they've stranded this data off and it's never gonna get better. And that's the last thing that we want people doing. Just lower the friction for somebody getting what they want at that point in time, with all of the context they need to interpret it properly.
[00:33:25] Unknown:
And when it comes to the storage of the master records, is there a common sort of platform type? Is it generally just a relational database system, or is the way that the information is stored as diverse as the people who are using it, based on whatever they're using it for? I see more of the latter these days, publishing these datasets out into a lake, because there are going to be myriad different datasets and myriad consumers
[00:33:53] Unknown:
of them. You know, on the other side of these datasets is one group that's using Tableau and another group that's using Power BI, and they're doing totally different things with the data. Trying to push them back into an RDBMS, at least these final golden record datasets, can be kind of overly expensive, and you run the risk of trying to conform everyone to a single least-common-denominator set of records, rather than having the diversity of outputs that's necessary to support a diversity of use cases. But as you're doing this work, there's a great forensic outcome: as you're building these golden records and you're deduplicating and clustering all of these records together, you're gonna find out that some of these sources are of relatively poor quality.
You can do upstream correction in those data sources and backfill them with some of the information that came out of the golden record, and affiliate these clusters of records across different datasets by applying a new ID that this process of mastering has divined or derived from the data, so that people can do federated queries across the datasets. So there are also ways to retrofit the source systems where this stuff comes from with the information that you've gained in the process, and make them more useful as well. But I do see more of that being pumped out as analytical datasets in a lake for, as I said, widely different
[00:35:27] Unknown:
use cases downstream. Yeah. And one of the other questions I had, which that partially answers, is how you manage the subsequent integration of the master records with the rest of your datasets, particularly when trying to reduce latency issues from having to call out from multiple systems to this central location. Whereas if it's all in the lake, then there's a higher probability that it'll be more directly integrated into the existing data.
[00:35:55] Unknown:
Yeah. It's important that any interoperability between a mastering system and the rest of those systems not only supports incremental and sort of bulk processing, but also low-latency processing, when you want to apply the golden record in real time. A great example of this would be: I'm in Salesforce, and I'm about to enter a new customer record. I should be able to hit my mastering system, and maybe another system, in real time, so that I can return the existing potential match for this data and surface it to that user: hey, are you sure you wanna enter another Mark Marinelli? Because I've already got 14 of them in here, and what you're about to enter looks an awful lot like that. That satisfies a set of requirements not just in the analytical consumption of these data, but in preventing the production of more bad data that you end up having to master.
So I think it's really important for all of these systems to be able to support the real-time application of mastering, or whatever they contribute to that whole data supply chain.
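A sketch of that real-time duplicate check, using assumed names and an in-process index for brevity; a real integration would call the mastering system's API before the new record is saved.

```python
from difflib import SequenceMatcher

# Hypothetical already-mastered customer index, grouped by a cheap blocking key.
mastered = {
    ("M", "02139"): [
        {"id": "cust-342", "name": "Mark Marinelli", "email": "mark@example.com"},
        {"id": "cust-981", "name": "M. Marinelli", "email": "mmarinelli@example.com"},
    ],
}

def likely_matches(new_record, threshold=0.75):
    """Low-latency lookup: check only the new record's block and return
    existing customers similar enough to warrant an 'are you sure?' prompt."""
    key = (new_record["name"][0].upper(), new_record["zip"])
    hits = []
    for existing in mastered.get(key, []):
        score = SequenceMatcher(None, new_record["name"].lower(),
                                existing["name"].lower()).ratio()
        if score >= threshold:
            hits.append((score, existing))
    return sorted(hits, key=lambda h: h[0], reverse=True)

new_entry = {"name": "Marc Marinelli", "email": "mark@example.com", "zip": "02139"}
for score, match in likely_matches(new_entry):
    print(f"Possible duplicate of {match['id']} ({match['name']}), similarity {score:.2f}")
```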
[00:37:13] Unknown:
And in terms of the security and auditing and access control concerns for the master records, do you think that they should be a step above what's implemented for the rest of the data platform, or do you think that they should be treated as homogeneously as the rest of your systems and not have to constrain access or capacity for those instances?
[00:37:40] Unknown:
Yeah, governance is tough. It's sort of a tax you have to pay. We certainly wanna make sure that no one is making bad rules about how to construct our golden records or do our mastering or anything like that. So there need to be some countermeasures against mistakes that will end up causing bad downstream data. But that needs to be balanced against wanting tens or hundreds of people to participate in the construction and curation of the rule sets that we use to build all of these outcomes. I don't have a hard and fast rule from our experience there, but I'd just say that, based on use case and based on the folks who are working on the data, some degree of governance is going to have to be applied. Just make sure that the governance doesn't get in your way. It should be collaborate first, govern later, rather than govern first and then hope that people are going to be able to collaborate.
That's important. And as I said, it's a tax that you pay on any of this work, but you just have to make sure that for the vast majority of use cases, where governance is not going to be as necessary, you're not applying the same governance regime that you would have for, say, sensitive patient data to something like a marketing campaign for doctors, if you're in a healthcare domain.
[00:39:17] Unknown:
And for cases where you have worked with companies and individuals who are trying to build a system and a process for creating and managing these master datasets,
[00:39:30] Unknown:
What have you found to be some of the most common stumbling blocks that they come up against? The biggest thing is not getting quick wins, because so many of these legacy platforms and legacy approaches adhere to a waterfall model. Everybody gets together, the steering committee meets, we list out our most valuable datasets, we start building this infrastructure, and then months and months go by before any useful data comes out. That's where people just start to fall off. They think: alright, this is potentially gonna be a white elephant; I'm not getting any value from this thing yet; I'm gonna stop showing up at the steering committee meetings, or we're gonna start forking off our own mastering, or some adjacent-to-mastering capability, instead of waiting for the big public works project to get us what we need. And that's how these things fail, and they often do fail before they can recognize any value, or certainly not the ROI, on what is oftentimes a resource-intensive and expensive project. Quick wins are the way to stop that from happening: pick small problems that are still valuable, measured in a time span of weeks, be able to produce something useful, then build on that, go into another, adjacent domain using the technologies and the techniques that you've learned from the first one, and score a quick win there. And then over time you take on some of the more strategic, but maybe harder, problems.
But it's really just keeping everybody in the boat by showing, as quickly and as frequently as possible, the value that they can derive from the mastering process.
[00:41:11] Unknown:
And are there any particular references or resources that you recommend people look at when they're starting on the path of building these master datasets, or the systems to manage them, from either the theoretical or the practical aspects? Yeah, certainly. I mean, there are great
[00:41:29] Unknown:
organizations like TDWI that really sit on top of a lot of great information and industry expertise. They've got shows and a lot of publications; that's one. The vendors themselves, all of us that are software vendors of any of this stuff, big or small, will have good stuff on our websites about the variety of different techniques and the value in attaining them. You can go to any one of these vendors' solutions pages, read some of the case studies there, and say: wow, alright, this case study from one of their customers said that they were able to get on top of their supplier management in the span of a few months; that sounds like me; I wanna do that. Even if that's not the vendor you go with, there's just a lot of general-purpose thought leadership information out there about how to apply the techniques of mastering to your data for quick benefit and huge benefit.
Millions and millions of dollars can be saved by having better, cleaner, more accessible data, and doing so without an army of data engineers to build the infrastructure to support it. And are there any other aspects of data mastering
[00:42:46] Unknown:
or applying machine learning to this problem domain, or anything else along those lines that we didn't cover yet, that you think we should discuss before we close out the show? Yeah, I think it's important,
[00:42:57] Unknown:
as we talk to the market and we talk to prospective customers and our customers, there's a lot of skepticism remaining on AI and ML, partially because the term machine learning has been so diluted by a lot of marketing, where everybody wants to have some machine learning in their technology, and then people see what they mean by machine learning, and they don't really mean machine learning models. On the other side, there are really sophisticated machine learning or AI platforms, and people have been so inculcated with a rules-based, deterministic approach to solving this problem that they look at that and think maybe it's magic: there's no way that could possibly work; I mean, how could you do this in a tenth of the time that I'm doing it now? I really hope that people are willing to give some of these approaches a shot and try them out. It's not hard to do this, and you can experiment.
And we just need, generationally, I think, the folks in the DataOps and IT departments right now to embrace this as a better, faster, cheaper alternative to the ways that we've been working before, to understand that there are conditions under which those legacy platforms and approaches are superior, but really to lean into this new breed of technologies. And very quickly, once you've started working with one of these tools or approaches, you'll say: oh, well, that's not magic at all. Now I understand how this works, and I understand its limitations, but I also understand its superiority. That's what I hope everyone will do: rather than just reading about it, dive in, stand up a project, and realize these benefits as quickly, and failing as fast, as possible. Otherwise the alternative, the opportunity cost of continuing to do what you're doing with some of these legacy technologies, is really high.
[00:45:07] Unknown:
Right. So for anybody who wants to follow the work that you're up to or get in touch, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:45:26] Unknown:
I'd say it's the feedback loop: how the people consuming the data can influence the quality of that data. You build this big infrastructure and out comes data, and then somebody starts working with it, let's say in Tableau, and they say: wow, this is wrong; I think there's a problem here. How do they get that feedback to someone who could do something about it? There are ways to do this, right? Drop an email to whoever brokered these data and then hope that they're going to fix it. In the sky-blue future, while you're sitting in that tool, you should be able to flag something as being wrong, flag a record or a column as being wrong, and have that potentially be fed into the rule set or the machine learning model that's producing these data to make it better, you know, as training data in a model, or at least have it enter into a structured sort of backlog for your data engineers and IT professionals to go through and remediate.
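That feedback loop could start as simply as capturing structured "this looks wrong" flags from the BI layer into a backlog that data engineers, or the matching model's training set, can consume. A hypothetical sketch:

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class DataQualityFlag:
    """A structured complaint raised from a consuming tool (e.g. a BI dashboard)."""
    dataset: str
    record_id: str
    field: str
    reported_by: str
    comment: str
    raised_at: str = ""

    def __post_init__(self):
        if not self.raised_at:
            self.raised_at = datetime.now(timezone.utc).isoformat()

def submit_flag(flag, backlog_path="dq_backlog.jsonl"):
    """Append the flag to a JSON-lines backlog. Downstream, these could be triaged
    by engineers or turned into labeled examples for the matching model."""
    with open(backlog_path, "a") as backlog:
        backlog.write(json.dumps(asdict(flag)) + "\n")

submit_flag(DataQualityFlag(
    dataset="golden_customers",
    record_id="cust-342",
    field="mailing_address",
    reported_by="analyst@example.com",
    comment="Two different customers appear to be merged into this record.",
))
```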
I think that's the biggest thing right now. We're getting better at producing pretty good data, but we haven't gotten much better at incorporating the knowledge of when things are broken, systemically or systematically, into making those data better. That feedback loop, I haven't seen much of it, and that's really, as I said, the sky-blue future for all of us. Alright, well, thank you very much for taking the time today to talk about the problems and approaches towards managing master datasets. It's definitely
[00:46:58] Unknown:
an important issue, and so I appreciate that. And thank you for taking the time, and I hope you enjoy the rest of your day. Thank you very much. Appreciate you having me on. Take
[00:47:11] Unknown:
care.
Introduction to Data Mastering
Defining Master Datasets
Challenges in Data Unification
Agile vs. Waterfall Approaches
Scalability and Practical Steps
Machine Learning in Data Mastering
Real-Time Data Integration
Future of Data Management