Summary
With the proliferation of data sources that give a more comprehensive view of the information critical to your business, it is even more important to have a canonical view of the entities that you care about. Is customer number 342 in your ERP the same as Bob Smith on Twitter? Using master data management to build a data catalog helps you answer these questions reliably and simplifies the process of building your business intelligence reports. In this episode the head of product at Tamr, Mark Marinelli, discusses the challenges of building a master data set, why you should have one, and some of the techniques that modern platforms and systems provide for maintaining it.
Preamble
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
- You work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning lifecycle. Skafos maximizes interoperability with your existing tools and platforms, and offers real-time insights and the ability to be up and running with cloud-based production scale infrastructure instantaneously. Request a demo at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
- Your host is Tobias Macey and today I’m interviewing Mark Marinelli about data mastering for modern platforms
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by establishing a definition of data mastering that we can work from?
- How does the master data set get used within the overall analytical and processing systems of an organization?
- What is the traditional workflow for creating a master data set?
- What has changed in the current landscape of businesses and technology platforms that makes that approach impractical?
- What are the steps that an organization can take to evolve toward an agile approach to data mastering?
- At what scale of company or project does it make sense to start building a master data set?
- What are the limitations of using ML/AI to merge data sets?
- What are the limitations of a golden master data set in practice?
- Are there particular formats of data or types of entities that pose a greater challenge when creating a canonical format for them?
- Are there specific problem domains that are more likely to benefit from a master data set?
- Once a golden master has been established, how are changes to that information handled in practice? (e.g. versioning of the data)
- What storage mechanisms are typically used for managing a master data set?
- Are there particular security, auditing, or access concerns that engineers should be considering when managing their golden master that go beyond the rest of their data infrastructure?
- How do you manage latency issues when trying to reference the same entities from multiple disparate systems?
- What have you found to be the most common stumbling blocks for a group that is implementing a master data platform?
- What suggestions do you have to help prevent such a project from being derailed?
- What resources do you recommend for someone looking to learn more about the theoretical and practical aspects of data mastering for their organization?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Tamr
- Multi-Dimensional Database
- Master Data Management
- ETL
- EDW (Enterprise Data Warehouse)
- Waterfall Development Method
- Agile Development Method
- DataOps
- Feature Engineering
- Tableau
- Qlik
- Data Catalog
- PowerBI
- RDBMS (Relational Database Management System)
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline, you'll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40 gigabit network, all controlled by a brand new API, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And you work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning life cycle. Skafos maximizes interoperability with your existing tools and platforms and offers real-time insights and the ability to be up and running with cloud-based production-scale infrastructure instantaneously.
Request a demo today at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science. And if you're attending the Strata Data Conference in New York in September, then come say hi to Metis Machine at booth P16. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch, and join the discussion at dataengineeringpodcast.com/chat. Your host is Tobias Macey, and today I'm interviewing Mark Marinelli about data mastering for modern platforms. So, Mark, could you start by introducing yourself? Sure. My name is Mark Marinelli. I head product here at Tamr.
[00:01:40] Unknown:
I've been in the data management space for about 20-some-odd years now, having cut my teeth back when multidimensional databases and MOLAP were a thing, and then spending quite some time in what is now the self-service data prep world. And now here at Tamr, I am working on technologies that are really bringing machine learning and AI into the fold, as, over time, we've progressively gotten better at automating a lot of those data management pipelines.
[00:02:14] Unknown:
And do you remember how you first got involved in the space?
[00:02:18] Unknown:
Yeah. I've always been fascinated by data. Undergrad, it was computer science, and the data stuff was more interesting than the network stuff or anything else. So when I first started as a software developer, I was working on a data management platform, as I said, in the multidimensional database space back in the late nineties. And ever since, I've just been really interested in this variety of technologies that I've outlined, and in where the rubber meets the road, where those data actually get into people's hands, and in facilitating getting useful data into people's hands as quickly as possible, as opposed to the sort of big-iron ETL stuff. That's where self-service data prep, or now this sort of modern, machine-learning-driven data unification, comes in: how quickly can we derive value from those data and get them to people so they can do their business or their data science or whatever it is. And so before we get too deep into the topic, can you start by sharing what your definition is of a master dataset and data mastering so that we can use that to build off of? Sure. Yeah. So data mastering, or maybe we would say at Tamr, data unification, is a little bit broader than the loaded term of mastering. But, essentially, you're taking data from a variety of different systems which are all trying to describe the same thing, you know, the same entity, like a person or an invoice or whatever. We're taking data from across all of those systems, which were not designed to be interoperable.
And unifying the data: putting them all in one place where they're aligned to a common set of attributes, where we've discovered linkage among the data, either because records are linked in some way or duplicated, and then rendering out canonical versions of each of these things, you know, a single view of the customer or a single view of a supplier or whatever, as what people would often call a golden record, so that the downstream analytical use cases, or whatever the consumption use cases are, can work with the latest and greatest, and most comprehensive, set of each of the customers, suppliers, invoices, whatever,
[00:04:33] Unknown:
that they need to. And can you discuss a bit more about how that master dataset or the unified data records get used within the overall
[00:04:42] Unknown:
processing and analytical systems within an organization? Yeah, sure. So the attributes of a problem set where mastering is essential are where you do have a variety of different views of what's essentially the same thing. A great example of something we see a lot is when you're trying to do customer analysis so that you can track your customer journey to upsell, cross-sell, you know, monetize your customers as best as you possibly can. If I've got 52 different variants of Mark Marinelli, the customer, across all of my channels where I'm collecting customer data, it's very, very difficult for me to bring all of those data together to have that unified picture so that I can say, wow, he's buying stuff more online than he used to, and that informs the way that I'm going to manage that customer. So anywhere there's that disparity of data that are all describing the same thing, Customer 360, as a lot of people would call it, is really popular.
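To make the Customer 360 example above concrete, here is a minimal sketch of collapsing several variants of the same customer into a single golden record. The field names and records are hypothetical, and the survivorship rule (most recently updated non-null value wins) is just one simple policy for illustration, not Tamr's actual approach.

```python
from datetime import date

# Hypothetical customer records for the same person, pulled from different channels.
records = [
    {"source": "erp", "name": "Mark Marinelli", "email": None,
     "city": "Cambridge", "updated": date(2017, 3, 1)},
    {"source": "web", "name": "M. Marinelli", "email": "mark@example.com",
     "city": None, "updated": date(2018, 6, 15)},
    {"source": "crm", "name": "Mark Marinelli", "email": "mark@example.com",
     "city": "Boston", "updated": date(2018, 1, 20)},
]

def golden_record(cluster):
    """Survivorship rule: for each attribute, keep the most recently
    updated non-null value found across the clustered records."""
    golden = {}
    for field in ("name", "email", "city"):
        candidates = [r for r in cluster if r[field] is not None]
        if candidates:
            golden[field] = max(candidates, key=lambda r: r["updated"])[field]
    return golden

print(golden_record(records))
# {'name': 'M. Marinelli', 'email': 'mark@example.com', 'city': 'Boston'}
```

In practice the survivorship policy would usually be configurable per attribute and per consumer, which comes up later in the conversation.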
Another one, a really different area of a business but with a lot of the same attributes, would be dealing with your suppliers. If you're a big global multinational and you are buying a lot of your components to build your hardware or software or whatever from a variety of different suppliers, those are all probably going to exist in a lot of different systems. A lot of the data that you receive from your suppliers are totally not under your control, so they're all describing the supplies that they're selling to you in different ways. It's essential for you to have a single view across all of that procurement or spend footprint so that you know that what one person describes completely differently from someone else is actually, in both cases, 8.5 by 11 inch printer paper. If I can't unify and deduplicate and master these data, then I don't know how much money I'm spending on each of these products with each of these suppliers.
So it makes it very difficult for me to do good inventory management. Once I've gotten to the other side of this, if I do have a single view of each of my suppliers, and I'm no longer treating the 15 affiliates of a supplier as 15 different suppliers because I didn't have this visibility but am now treating them as one, then I can see that my aggregate spend with them is much higher than I thought it was, and I should renegotiate my contract or, you know, change my purchasing behavior. So those are a couple of examples where, when the data are really dispersed across systems, bringing it all together and having a single way to treat your customers, your business partners, or whatever, is essential
[00:07:22] Unknown:
to getting the right analytical outcomes. And one of the upfront challenges that I was thinking about as you were discussing that is how you would sort of plan ahead to have some sort of unifying attribute that you can use to create this single record of a given entity, whether it's a customer or a vendor or a particular unit of, you know, resource that you're getting from that vendor. I'm sure that there is, in a lot of cases, some measure of manual effort involved there, but I'm curious if you can talk a bit about some of the practical issues and strategies around being able to plan ahead, in either your data model or the way that you're collecting the data, to make it easier to create that unified record?
[00:08:09] Unknown:
Sure, yeah. We see a variety of different techniques there. I would quibble a little bit with "plan ahead" because we talk a lot about agile mastering, and we see our customers sort of embracing this. We'll plan a little bit ahead, and we'll start with something, but know that what this unified attribute set looks like nine months from now is not necessarily what it's gonna look like three weeks from now, and not try to solve that entire problem a priori, but rather build on something that's a quick win. There's a few different ways to do this. One, there are off-the-shelf schemas for a lot of different entities. There's not a lot that differs for a human being from, you know, company to company if you're dealing with customers. You can get stuff like that from, you know, public websites. There's also probably already a master somewhere in your organization. You've probably already tried to solve this problem, and maybe you're struggling with it, but you've got a pretty decent definition of what a customer looks like that everybody has already started to plug into their analytical applications.
Use some subset of that and then map all of the other data sources, which have a variety of different ways to describe those data, to that same attribute set. That itself can be really cumbersome, because you have rule sets that are doing these mappings. And then a new dataset arrives, because you've got a supplier that's sending you a different type of data, or you've just bought a new company and now they have a system that has all of these data. As soon as that stuff arrives, it's gonna break all your rules. So you have to go back and retrofit those rules so that you can map those source datasets to that unified schema.
That's actually an area where we've done a lot of research, and we've got software that can automate that, because a machine can actually figure that out pretty well and be more resilient in the face of changing data on both sides, both the source and the destination schema, than a rule set that somebody's gotta build and maintain and that will constantly be behind the 8 ball. So, in brief, it's finding something off the shelf that already describes the relevant entity or domain.
There's out-of-the-box stuff that a lot of platforms bring you that is already encapsulating some of those end states. There's something maybe sitting around your business that is already good enough. And in any one of those scenarios, you're gonna wanna be able to modify that schema without incurring a capital-D, capital-M Data Modeling exercise where now we have to retrofit some entity relationship diagram or anything like that. You wanna be as lightweight and agile as possible.
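As a rough illustration of the schema mapping problem described above, here is a sketch that maps a new source's column names onto a small unified schema by name similarity rather than hand-written per-source rules. The schemas are hypothetical, and this is far cruder than the machine learning approach described in the conversation, which can also learn from the data values themselves.

```python
from difflib import SequenceMatcher

# Hypothetical unified schema and an incoming source with its own naming.
unified_schema = ["customer_name", "email_address", "postal_code", "phone_number"]
incoming_columns = ["CUST_NM", "EMAIL_ADDR", "ZIP", "PHONE_NO"]

def normalize(name):
    """Lowercase and strip separators so 'CUST_NM' and 'customer_name' compare fairly."""
    return name.lower().replace("_", " ").replace("-", " ")

def best_match(column, targets, threshold=0.4):
    """Return the unified attribute whose name is most similar, or None if nothing is close."""
    scored = [(SequenceMatcher(None, normalize(column), normalize(t)).ratio(), t)
              for t in targets]
    score, target = max(scored)
    return target if score >= threshold else None

mapping = {col: best_match(col, unified_schema) for col in incoming_columns}
print(mapping)
# 'ZIP' falls below the name-similarity threshold and maps to None, which is exactly
# the kind of case where learning from the data values themselves, not just the
# column names, does better than simple rules.
```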
[00:11:11] Unknown:
So when you're deciding on the records to canonically describe a given entity, it sounds like you want to establish the minimum number of attributes that you can possibly get away with on that record, so that you don't have this issue of the evolving data model then breaking any references to that entity from other systems.
[00:11:30] Unknown:
Absolutely. It's always gonna be this common subset that's just enough. I mean, I think that just sort of philosophically infuses a lot of what I'd have to say. Just get to that small, I will say minimum viable, schema that is the common subset across all of these datasets and that is useful immediately for analytical or operational purposes, and then build on top of that as the community that starts to use these master data proliferates and applies more requirements. And it's also not one size fits all. Every single one of these destination attribute sets is gonna be contextual. That's true for golden records; it's true for just the model that you're using for the datasets. One person is going to want different things from your customer set than another. And so it's really important for you not to try to just push out one sort of least common denominator to everybody, but to have a one-to-N relationship between your mastering logic and the way that golden record, or any of the data, are surfaced to different constituents,
[00:12:35] Unknown:
which itself could be really tricky. So one strategy seems like you might want to have the capability of layering additional attributes, where you have your sort of core minimal set of attributes, or maybe you just have a unifying ID, and then you have other systems that are adding degrees of information. So if somebody wants to have all of the records, they would maybe go to one interface, whereas if they want to infer their own information, they would just go to that sort of minimal set of records, to be able to reduce the amount of churn at the core set but layer in additional information as you go further afield from that? Yeah. It's a subset of these records, it's a subset of these attributes, and sometimes it's a different
[00:13:20] Unknown:
set of connections among those records. A common thing that we see is one use case defining individuals versus households. Two datasets are gonna have exactly the same column set, but one of them is gonna consider my father and myself as two different people, and another one's gonna consider us as one household if we live together. So have each analytical endpoint or consumer constituency be able to impose their own model on this stuff and their own, say, rule set, though oftentimes it's a machine learning model. Push that control over how the data are consumed as far out to the consumers as you possibly can, which is really not the way that we've been doing this for the better part of 30 years. Instead, you sit down with the business user and they try to explain what they're looking for, and you codify their business logic in a series of, you know, ETL transformations or MDM logic or an EDW.
And then you get some of it wrong because you have to make some assumptions, and then you go back to them, and rinse, repeat; there's a really protracted cycle there where you're working on their behalf. If, and a lot of the modern toolset allows this, these individuals can have a bit more agency over how the data are mastered and consumed, that's the best way to do it. Of course, that has governance implications, to make sure that not every single individual is going off, you know, creating their own view. But there are ways to balance that need for autonomy versus the chaos that can ensue if everybody's looking at things entirely
[00:15:02] Unknown:
differently. So earlier I said "plan ahead" and you took issue with that, and what you're describing here is this sort of waterfall process of defining the master schema and the way that this information is going to be codified and captured, contrasted with the more agile methodology that you're promoting and that's become more popular in software engineering as well. I'm curious: what about the current landscape of both businesses and the technology platforms that they're running on top of makes that waterfall approach impractical and enables us to move towards this more agile approach to managing these golden records? Well, a tough one for any waterfall
[00:15:47] Unknown:
environment is the shifting sands of data, and so much data is being collected now, a lot of it externally to companies, where they have no control over the format and quality of those data. The speed with which data are arriving, and the variety of formats in which these data need to be assimilated so that businesses can leverage all of these data for their competitive advantage or whatever their analytical use cases are, that just breaks the waterfall model. You can't wait two months for someone to incorporate a dataset that may be ephemeral, and you can't have to bring in your data engineering staff every time a new dataset arrives that's a little bit, or maybe appreciably, different from the stuff that you've seen before. So I think that the industry has focused a lot of its fire in the last decade, let's say, on solving the volume problem: big data, how to do Spark. That's all wonderful.
Not so much on the variety problem, or applying as much automation and simplicity to that. So there are areas where waterfall, I think, is just fine. If you've got a credit card transaction processing ETL pipeline that's just humming along, and the format of those data isn't gonna change all that much, then fine, do a yearly release of your update of that platform. But when you're trying to be really nimble and constantly accumulate new data and validate whether those data are even useful, if you have to stand up a year-long project where you don't even know if it's cost-justified because you're not sure if these data are useful, you're just not gonna do it. And so for somebody who is trying to move toward having a
[00:17:40] Unknown:
centralized sort of master set of records for being able to model these entities that are important to their business if they don't already have a system in place to be able to capture and integrate that information, and they don't want to get caught in this trap of having this multi month or multi year project to build it up. What are some of the steps that they can take to move toward having that capability without having to do a sort of stop the world approach of not doing any other feature development or halting other projects where they can just sort of integrate this into their existing workflows?
[00:18:15] Unknown:
Yeah, I think there's kind of three different layers to adopting an agile approach. One of them is mindset, an agile mindset. Everybody in the organization needs to be okay with some of the risk of doing things faster and maybe not as completely as they would in a waterfall project, knowing that they're going to realize immediate or short-term benefit that is going to be important to their business. There are certain areas where waterfall is the way to go. I'm happy that the people who build large commercial airplanes are not, you know, testing an MVP on me as a passenger, and that they took that approach. And so there are things in data management where the accuracy of those data has to be so complete and trustworthy that you may build in a lot of these checks and a lot of the big-iron scaffolding. There are other areas, say we're gonna run a marketing campaign for our customers, where if I received the wrong mailer that was intended for someone else because they got my golden record wrong or they incorrectly mastered me, so what? If that's happening 2 percent of the time, but in order to get the system stood up in six weeks and do something productive we had to cut that corner, everybody's gotta be okay with that, or at least judicious in how they take on projects with this agile mindset, so they know where agility and probabilistic, potentially a little bit incorrect, outcomes are acceptable versus the other stuff that needs more rigor. So there's a mindset component. That's the most important thing.
Then there's a skill set component: as you're starting to do this work, make sure that you've got these squads of folks where you have the consumers of the data, the data engineers, and the brokers of these data from these systems all working very closely together, in as close to a sort of agile scrum methodology as you can get, in rapid iterations, collaboratively working to build up incrementally. And then the last component, after mindset and skill set, is tool set. You can have a wonderful mindset and want to adopt all of this agility and be open to it, but if you're using legacy systems that really cannot adapt, cannot introduce new functionality easily, you're just not gonna get anywhere. So we at Tamr definitely have been proponents of what we call a DataOps approach to this, where instead of taking one monolithic tool that supplies the entire data supply chain, from raw data out through analytics-ready data,
you just know from the beginning that you should be pulling together a suite of interoperable technologies that are best of breed. Each one of them is going to contribute the best of breed: one will be the best of breed for mastering, another will be the best of breed for cataloging, another one for governance. And you layer those capabilities on, because you don't have to get all of them working in their fullness before any useful data drips out the other side. So start with a small mastering project, with technologies like our own, and I'll put in a plug for Tamr here: we can get very far very quickly because we're offloading a lot of the historical data modeling and rules creation onto machine learning models. We can get there really quickly if you take on something where your risk profile is okay: take in all of your customer data, start doing interesting things with the customer data, and then layer in another technology for cataloging these data now that you've built a lot of these relationships and a lot of this metadata. And then layer in something for governance so that you have controlled consumption and controlled access to these data. Just don't try to take it on all at once, and make sure that the tools that you choose are each very interoperable with the others. It's gonna be very API-first design. You're gonna have to do some work to stitch them together, but you don't have to do it all upfront. And the benefit of layering this stuff in and keeping it disaggregated and decoupled is that if somebody comes along and builds a better mousetrap for any one of these components, you toss your existing one out and throw the other one in with minimal disruption. And is there
[00:22:43] Unknown:
a particular size or scale of a company or project at which there's a tipping point where it makes sense to start building the master dataset, or do you think that it's something that anybody at any scale can start integrating into their platform?
[00:22:58] Unknown:
The latter. If you've got more than one platform or, you know, one dataset that contains overlapping or similar data, you've gotta do mastering. So the important thing is to make it economical for people to be able to do mastering at small scale, and not only when you've already got 14 ERP systems that you've gotta stitch together. If you've got a Salesforce instance, or two Salesforce instances, and some CRM data in some on-prem database, you should be mastering.
[00:23:31] Unknown:
And so one of the things that you were discussing, as far as being able to accelerate and simplify the effort of building this master dataset, is integrating machine learning and artificial intelligence methods. I'm curious what you have found to be the limitations of that approach for merging the datasets, and where it's necessary to have human intervention to build up these catalogs?
[00:23:57] Unknown:
So, in reverse order, the human intervention is a feature, not a bug, in that with the machine learning approaches we see, it's supervised machine learning; the machine really needs to learn from the human understanding of the data. Where the machine is exceedingly helpful is that it learns pretty quickly, and over a few iterations of training and correcting some of the suggestions that the machine learning provides, you can get something very robust and trustworthy, versus the sort of long process of codifying the rule set that would be the alternative to these model-based approaches.
The limitation to that is that if a human being can't figure out whether two records belong together, or a human being can't figure out certain things about how these golden records should be constructed, a model is not gonna be able to figure it out either. So there needs to be some baseline of recognizable attributes and structure to the data; in sort of data science terms, we may have to do some feature engineering and feature extraction. So if a bunch of unstructured data comes in, you may have to turn that into some attributes on the data that can be recognized both by the humans and the machines. So there's oftentimes some preprocessing work and transformation necessary to make it human-ready and machine-ready.
And if that's not the case, rule sets aren't really gonna get that far either. But I definitely say that at this point, it's a supervised machine learning approach, not a completely unsupervised "just give the model your data and it's gonna figure it out." It's still supervised, and thus is predicated on a human's ability to make these distinctions, and then the machine can do it at scale.
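A minimal sketch of the supervised approach being described: engineer a few similarity features for pairs of records, then let a simple classifier learn from human-labeled pairs whether two records refer to the same entity. The records, features, and labels here are invented for illustration, and scikit-learn's logistic regression stands in for whatever richer models a real platform would use.

```python
from difflib import SequenceMatcher
from sklearn.linear_model import LogisticRegression

def features(a, b):
    """Turn a pair of records into numeric similarity features."""
    name_sim = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    same_email = 1.0 if a["email"] and a["email"] == b["email"] else 0.0
    same_zip = 1.0 if a["zip"] == b["zip"] else 0.0
    return [name_sim, same_email, same_zip]

# Hypothetical pairs labeled by a human reviewer: 1 = same entity, 0 = different.
labeled_pairs = [
    ({"name": "Mark Marinelli", "email": "mark@example.com", "zip": "02139"},
     {"name": "M. Marinelli", "email": "mark@example.com", "zip": "02139"}, 1),
    ({"name": "Mark Marinelli", "email": "mark@example.com", "zip": "02139"},
     {"name": "Mary Martin", "email": "mary@example.com", "zip": "10001"}, 0),
    ({"name": "Acme Corp", "email": None, "zip": "60606"},
     {"name": "ACME Corporation", "email": None, "zip": "60606"}, 1),
    ({"name": "Acme Corp", "email": None, "zip": "60606"},
     {"name": "Apex Partners", "email": None, "zip": "94105"}, 0),
]

X = [features(a, b) for a, b, _ in labeled_pairs]
y = [label for _, _, label in labeled_pairs]

model = LogisticRegression().fit(X, y)

# Score a new candidate pair; the human review of low-confidence pairs is the
# ongoing training loop described in the conversation.
candidate = ({"name": "Marc Marinelli", "email": "mark@example.com", "zip": "02139"},
             {"name": "Mark Marinelli", "email": "mark@example.com", "zip": "02139"})
print(model.predict_proba([features(*candidate)])[0][1])  # probability the pair matches
```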
[00:25:45] Unknown:
And do you think that it's practical for somebody to build up their own AI models on their own datasets? Or do you think that you at Tamr and similar companies have the advantage of being able to run these models against multiple customers' data for being able to determine what some common approaches and common formats of some of these attributes are for being able to more intelligently merge these disparate datasets?
[00:26:09] Unknown:
Yeah, that's a great question, and it's one that we get from our customers and prospects all the time: "I just hired 50 data scientists from the best programs in the country. These algorithms are well understood, you know, 20 years old now for some of this stuff. So I'm just gonna do this myself." My reaction to that is that there are a few different aspects of a product that does this that are gonna be superior to a project that does this. You're gonna stand up a project, you get your big-brain data scientists to go and apply some of these models, and we've seen it: then they move on to the next project. New data arrives and it perturbs the model, or the model degrades for some other reason; where are they now? Okay, we're gonna bring them back off of that project, and they're gonna hopefully remember what they did in this thing and have really good development practices so they can quickly update the model. But we're gonna be sort of single-threaded through them, because it is only on our behalf as data consumers that those data scientists were able to build this model.
Better would be a product that allows direct contribution to the upkeep and the quality of those models, and the continual training of those models, from the people who are consuming the data. So one aspect of product versus project is the way that you continually solicit the training that you need to keep those models trustworthy. Another is stitching what you've done in with the rest of your ecosystem of platforms, as I was describing before. How do we get plugged into our catalog? How do we get plugged into our governance platform? And how do we get plugged into, you know, Tableau or Qlik? Having a project build APIs and interoperability and the right layer of connectivity is something that most folks are just not gonna take on, because they just wanna get it done, and they're not thinking of this as a general-purpose component within their data landscape.
So that's another place where product is going to be superior. And the last is just the particular ensemble of models, or algorithms in the model, that make these models accurate and also computationally acceptable. We've got a lot of compute out there, but these are really n-squared, very computationally expensive practices. And there are ways that you can pull together the sequence of processing, and ways that you can constrain the problem, that can actually give a far more acceptable compute footprint and things like real-time or low-latency response.
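One common way to constrain that n-squared pairwise comparison problem is blocking: only records that share a cheap key are ever compared against each other. A toy sketch, with a hypothetical blocking key:

```python
from collections import defaultdict
from itertools import combinations

records = [
    {"id": 1, "name": "Mark Marinelli", "zip": "02139"},
    {"id": 2, "name": "M. Marinelli", "zip": "02139"},
    {"id": 3, "name": "Acme Corp", "zip": "60606"},
    {"id": 4, "name": "ACME Corporation", "zip": "60606"},
    {"id": 5, "name": "Apex Partners", "zip": "94105"},
]

def blocking_key(record):
    """Cheap key: first letter of the name plus zip code. Only records sharing
    a key are compared, instead of all n*(n-1)/2 pairs."""
    return (record["name"][0].upper(), record["zip"])

blocks = defaultdict(list)
for r in records:
    blocks[blocking_key(r)].append(r)

candidate_pairs = [pair for block in blocks.values() if len(block) > 1
                   for pair in combinations(block, 2)]

print(len(candidate_pairs), "pairs to score instead of", len(records) * (len(records) - 1) // 2)
# Each surviving candidate pair would then go to the (expensive) matching model.
```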
That's something where I'd rather be working with a company that's been doing this for many years across dozens of customers than stubbing my toe trying to figure that out myself as a data scientist or a data engineer. And in terms of the
[00:29:07] Unknown:
golden master dataset, once it's been created, what have you found to be some of the limitations there in terms of being able to integrate it with other analytical systems, or issues with being able to create a canonical entity
[00:29:24] Unknown:
for a particular problem domain? Yeah, that can be kind of tricky, because it's an area where it's always going to be contextual, and it's fluid at the same time. If the contributions to that golden record are changing on a near real-time basis, or, you know, with any frequency, the folks who are consuming these golden records need to know that, and they need some agency over that. They need to be able to say: freeze this golden record, so that even as new data arrive and the underlying rules may suggest that a different record is going to contribute an address to this record, I don't want it right now. I know that's the more current record, but I don't want it right now, because I needed to sort of freeze things as they were. I also need to know some of the history of how we got here when these things change. Let's say I didn't freeze it, but I need to know the sort of lineage of how this golden record has changed over time, because in my context that may be really, really important in its own right. So I think the way I'd answer your question is that it's essential for golden records, the construction of golden records, the mechanism by which they're built, and the tools by which they are maintained, to be really transparent in all of the underpinnings of how this stuff is done, and to have both a concrete and user-configurable survivorship capability, giving all of the information, all the context, that allows the consumers to set the right rules and to make sure that they're not working with data that are stale unless they want to. And that brings me to another question that I had in terms of maintaining
[00:31:12] Unknown:
the historical context of a record as it evolves. Are there practices or platforms that you have worked with that allow for easy versioning of that data, so that you can, for instance, go back in time if you need to rerun a historical analysis and say, you know, this customer's address at that point in time was in Nevada versus where they are now in Virginia?
[00:31:37] Unknown:
Sure, yeah. There are both implicit and explicit ways to do that. Implicitly, you just keep versioning the data as new data arrive and anything changes, you know, the deltas from one golden record to another or from one cluster of records to another. You do that implicitly and provide it to people. But you also want to provide some explicit versioning and tagging so that folks, as I said, can freeze this thing, or say, I wanna publish these golden records right now, or I wanna publish this rule set and have that be the development rule set and have a different rule set be the production rule set. There needs to be a lot of configurability around how this stuff is done. And then, yeah, there are absolutely tools, increasingly, that are providing more of this end-user-friendly management of this stuff. Catalogs also help a ton in retaining, and making very useful, a lot of this metadata around lineage and history of data, graphically and in other ways, where people can really intuitively figure out what the life cycle of these data components is and how they can traverse back to a different time when the data were more relevant or better for whatever their purpose is.
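The explicit versioning, tagging, and freezing being described could look something like the sketch below, which uses a hypothetical in-memory history; a real platform would persist this and expose it through its catalog and lineage tooling.

```python
from datetime import datetime, timezone

class GoldenRecordHistory:
    """Keeps every published version of a golden record plus tags like
    'production', so consumers can pin a version or roll back."""

    def __init__(self, entity_id):
        self.entity_id = entity_id
        self.versions = []      # list of (timestamp, attributes) tuples
        self.tags = {}          # tag name -> version index
        self.frozen_at = None   # version index a consumer pinned, if any

    def publish(self, attributes, tag=None):
        self.versions.append((datetime.now(timezone.utc), dict(attributes)))
        if tag:
            self.tags[tag] = len(self.versions) - 1
        return len(self.versions) - 1

    def freeze(self, version_index):
        """Pin consumers to a specific version even as new data keep arriving."""
        self.frozen_at = version_index

    def current(self):
        index = self.frozen_at if self.frozen_at is not None else len(self.versions) - 1
        return self.versions[index][1]

history = GoldenRecordHistory("customer-342")
v0 = history.publish({"name": "Mark Marinelli", "state": "NV"}, tag="production")
history.freeze(v0)
history.publish({"name": "Mark Marinelli", "state": "VA"})
print(history.current())  # still the Nevada version, because the consumer froze it
```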
Again, in any business there are so many different constituents for these data, each wanting a different thing, that you need pretty good workflow and collaboration and configurability around how each of them is going to get what they want. Otherwise, they're gonna take a snapshot, they're gonna put it in Excel, and now they've stranded this data off and it's never gonna get better. And that's the last thing that we want people doing. Just lower the friction for somebody getting what they want at that point in time, with all of the context they need to interpret it properly.
[00:33:25] Unknown:
And when it comes to the storage of the master records, is there a common sort of platform type? Is it generally just a relational database system, or is the way that the information is stored as diverse as the people who are using it, based on whatever they're using it for? I see more of the latter these days, publishing these datasets out into a lake, because there are going to be myriad different datasets and myriad consumers
[00:33:53] Unknown:
of them. You know, on the other side of these datasets is one group that's using Tableau and another group that's using Power BI, and they're doing totally different things with the data. Trying to push them back into an RDBMS, at least these final golden record datasets, can be kind of overly expensive, and you run the risk of trying to conform everyone to a single least-common-denominator set of records, rather than having the diversity of outputs that's necessary to support a diversity of use cases. But as you're doing this work, there's a great forensic outcome: as you're building these golden records and you're deduplicating and clustering all of these records together, you're gonna find out that some of these sources are of relatively poor quality.
You can do upstream correction in those data sources and backfill them with some of the information that came out of the golden record, and affiliate these clusters of records across different datasets by applying a new ID that this process of mastering has divined or derived from the data, so that people can do federated queries across the datasets. So there are also ways to retrofit the source systems where this stuff comes from with the information that you've gained in the process, and make them more useful as well. But I do see more of that being pumped out as analytical datasets in a lake for, as I said, widely different
[00:35:27] Unknown:
use cases downstream. Yeah. And one of the other questions I had, which that partially answers, is how you manage the subsequent integration of the master records with the rest of your datasets, particularly when trying to reduce latency issues from having to call out from multiple systems to this central location. Whereas if it's all in the lake, then there's a higher probability that it'll be more directly integrated into the existing data.
[00:35:55] Unknown:
Yeah. It's important that any interoperability between a mastering system and the rest of those systems not only supports incremental and sort of bulk processing, but also low-latency processing, when you want to apply the golden record in real time. A great example of this would be: I'm in Salesforce, and I'm about to enter a new customer record. I should be able to hit my mastering system, and maybe another system, in real time, so that I can return the existing potential match for this data and surface it to that user: hey, are you sure you wanna enter another Mark Marinelli? Because I've already got 14 of them in here, and what you're about to enter looks an awful lot like that. That satisfies a set of requirements not just in the analytical consumption of these data, but in preventing the production of more bad data that you end up having to master.
So I think it's really important for all of these systems to be able to support the real-time application of mastering, or whatever they contribute to that whole data supply chain.
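A sketch of that real-time duplicate check, using assumed names and an in-process index for brevity; a real integration would call the mastering system's API before the new record is saved.

```python
from difflib import SequenceMatcher

# Hypothetical already-mastered customer index, grouped by a cheap blocking key.
mastered = {
    ("M", "02139"): [
        {"id": "cust-342", "name": "Mark Marinelli", "email": "mark@example.com"},
        {"id": "cust-981", "name": "M. Marinelli", "email": "mmarinelli@example.com"},
    ],
}

def likely_matches(new_record, threshold=0.75):
    """Low-latency lookup: check only the new record's block and return
    existing customers similar enough to warrant an 'are you sure?' prompt."""
    key = (new_record["name"][0].upper(), new_record["zip"])
    hits = []
    for existing in mastered.get(key, []):
        score = SequenceMatcher(None, new_record["name"].lower(),
                                existing["name"].lower()).ratio()
        if score >= threshold:
            hits.append((score, existing))
    return sorted(hits, key=lambda h: h[0], reverse=True)

new_entry = {"name": "Marc Marinelli", "email": "mark@example.com", "zip": "02139"}
for score, match in likely_matches(new_entry):
    print(f"Possible duplicate of {match['id']} ({match['name']}), similarity {score:.2f}")
```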
[00:37:13] Unknown:
And in terms of the security and auditing and access control concerns for the master records, do you think that they should be a step above what's implemented for the rest of the data platform, or do you think that they should be treated as homogeneously as the rest of your systems and not have to constrain access or capacity for those instances?
[00:37:40] Unknown:
Yeah, governance is tough. It's sort of a tax you have to pay. We certainly wanna make sure that no one is making bad rules about how to construct our golden records or do our mastering or anything like that. So there need to be some countermeasures against mistakes that will end up causing bad downstream data. But that needs to be balanced against wanting tens or hundreds of people to participate in the construction and curation of the rule sets that we use to build all of these outcomes. I don't have a hard and fast rule from our experience there, but I'd just say that, based on use case and based on the folks who are working on the data, some degree of governance is going to have to be applied. Just make sure that the governance doesn't get in your way. It should be collaborate first, govern later, rather than govern first and then hope that people are going to be able to collaborate.
That's important. And as I said, it's a tax that you pay on any of this work, but you just have to make sure that for the vast majority of use cases, where governance is not going to be as necessary, you're not applying the same governance regime that you would have for, say, sensitive patient data to something like a marketing campaign for doctors, if you're in a healthcare domain.
[00:39:17] Unknown:
And for cases where you have worked with companies and individuals who are trying to build a system and a process for creating and managing these master datasets,
[00:39:30] Unknown:
What have you found to be some of the most common stumbling blocks that they come up against? The biggest thing is not getting quick wins, because so many of these legacy platforms and legacy approaches adhere to a waterfall model. Everybody gets together, the steering committee meets, we list out our most valuable datasets, we start building this infrastructure, and then months and months go by before any useful data comes out. That's where people just start to fall off. They think: alright, this is potentially gonna be a white elephant; I'm not getting any value from this thing yet; I'm gonna stop showing up at the steering committee meetings, or we're gonna start forking off our own mastering, or some adjacent-to-mastering capability, instead of waiting for the big public works project to get us what we need. And that's how these things fail, and they often do fail before they can recognize any value, or certainly not the ROI, on what is oftentimes a resource-intensive and expensive project. Quick wins are the way to stop that from happening: pick small problems that are still valuable, measured in a time span of weeks, be able to produce something useful, then build on that, go into another, adjacent domain using the technologies and the techniques that you've learned from the first one, and score a quick win there. And then over time you take on some of the more strategic, but maybe harder, problems.
But it's really just keeping everybody in the boat by showing, as quickly and as frequently as possible, the value that they can derive from the mastering process.
[00:41:11] Unknown:
And are there any particular references or resources that you recommend people look at when they're starting on the path of building these master datasets, or the systems to manage them, from either the theoretical or the practical aspects? Yeah, certainly. I mean, there are great
[00:41:29] Unknown:
organizations like TDWI that really sit on top of a lot of great information and industry expertise. They've got shows and a lot of publications; that's one. The vendors themselves, all of us that are software vendors of any of this stuff, big or small, will have good stuff on our websites about the variety of different techniques and the value in attaining them. You can go to any one of these vendors' solutions pages, read some of the case studies there, and say: wow, alright, this case study from one of their customers said that they were able to get on top of their supplier management in the span of a few months; that sounds like me; I wanna do that. Even if that's not the vendor you go with, there's just a lot of general-purpose thought leadership information out there about how to apply the techniques of mastering to your data for quick benefit and huge benefit.
Millions and millions of dollars can be saved by having better, cleaner, more accessible data, and doing so without an army of data engineers to build the infrastructure to support it. And are there any other aspects of data mastering
[00:42:46] Unknown:
or applying machine learning to this problem domain, or anything else along those lines that we didn't cover yet, that you think we should discuss before we close out the show? Yeah, I think it's important,
[00:42:57] Unknown:
as we talk to the market and we talk to prospective customers and our customers, there's a lot of skepticism remaining on AI and ML, partially because the term machine learning has been so diluted by a lot of marketing, where everybody wants to have some machine learning in their technology, and then people see what they mean by machine learning, and they don't really mean machine learning models. On the other side, there are really sophisticated machine learning or AI platforms, and people have been so inculcated with a rules-based, deterministic approach to solving this problem that they look at that and think maybe it's magic: there's no way that could possibly work; I mean, how could you do this in a tenth of the time that I'm doing it now? I really hope that people are willing to give some of these approaches a shot and try them out. It's not hard to do this, and you can experiment.
And we just need, generationally, I think, the folks in the DataOps and IT departments right now to embrace this as a better, faster, cheaper alternative to the ways that we've been working before, to understand that there are conditions under which those legacy platforms and approaches are superior, but really to lean into this new breed of technologies. And very quickly, once you've started working with one of these tools or approaches, you'll say: oh, well, that's not magic at all. Now I understand how this works, and I understand its limitations, but I also understand its superiority. That's what I hope everyone will do: rather than just reading about it, dive in, stand up a project, and realize these benefits as quickly, and failing as fast, as possible. Otherwise the alternative, the opportunity cost of continuing to do what you're doing with some of these legacy technologies, is really high.
[00:45:07] Unknown:
Right. So for anybody who wants to follow the work that you're up to or get in touch, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:45:26] Unknown:
I'd say it's the feedback loop: how the people consuming the data can influence the quality of that data. You build this big infrastructure and out comes data, and then somebody starts working with it, let's say in Tableau, and they say: wow, this is wrong; I think there's a problem here. How do they get that feedback to someone who could do something about it? There are ways to do this, right? Drop an email to whoever brokered these data and then hope that they're going to fix it. In the sky-blue future, while you're sitting in that tool, you should be able to flag something as being wrong, flag a record or a column as being wrong, and have that potentially be fed into the rule set or the machine learning model that's producing these data to make it better, you know, as training data in a model, or at least have it enter into a structured sort of backlog for your data engineers and IT professionals to go through and remediate.
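That feedback loop could start as simply as capturing structured "this looks wrong" flags from the BI layer into a backlog that data engineers, or the matching model's training set, can consume. A hypothetical sketch:

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class DataQualityFlag:
    """A structured complaint raised from a consuming tool (e.g. a BI dashboard)."""
    dataset: str
    record_id: str
    field: str
    reported_by: str
    comment: str
    raised_at: str = ""

    def __post_init__(self):
        if not self.raised_at:
            self.raised_at = datetime.now(timezone.utc).isoformat()

def submit_flag(flag, backlog_path="dq_backlog.jsonl"):
    """Append the flag to a JSON-lines backlog. Downstream, these could be triaged
    by engineers or turned into labeled examples for the matching model."""
    with open(backlog_path, "a") as backlog:
        backlog.write(json.dumps(asdict(flag)) + "\n")

submit_flag(DataQualityFlag(
    dataset="golden_customers",
    record_id="cust-342",
    field="mailing_address",
    reported_by="analyst@example.com",
    comment="Two different customers appear to be merged into this record.",
))
```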
I think that's the biggest thing right now. We're getting better at producing pretty good data, but we haven't gotten much better at incorporating the knowledge of when things are broken, systemically or systematically, into making those data better. That feedback loop, I haven't seen much of it, and that's really, as I said, the sky-blue future for all of us. Alright, well, thank you very much for taking the time today to talk about the problems and approaches towards managing master datasets. It's definitely
[00:46:58] Unknown:
an important issue, and so I appreciate that. And thank you for taking the time, and I hope you enjoy the rest of your day. Thank you very much. Appreciate you having me on. Take
[00:47:11] Unknown:
care.
Introduction to Data Mastering
Defining Master Datasets
Challenges in Data Unification
Agile vs. Waterfall Approaches
Scalability and Practical Steps
Machine Learning in Data Mastering
Real-Time Data Integration
Future of Data Management