Summary
In this episode of the Data Engineering Podcast Dan Bruckner, co-founder and CTO of Tamr, talks about the application of machine learning (ML) and artificial intelligence (AI) in master data management (MDM). Dan shares his journey from working at CERN to becoming a data expert and discusses the challenges of reconciling large-scale organizational data. He explains how data silos arise from independent teams and highlights the importance of combining traditional techniques with modern AI to address the nuances of data reconciliation. Dan emphasizes the transformative potential of large language models (LLMs) in creating more natural user experiences, improving trust in AI-driven data solutions, and simplifying complex data management processes. He also discusses the balance between using AI for complex data problems and the necessity of human oversight to ensure accuracy and trust.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
- As a listener of the Data Engineering Podcast you clearly care about data and how it affects your organization and the world. For even more perspective on the ways that data impacts everything around us don't miss Data Citizens® Dialogues, the forward-thinking podcast brought to you by Collibra. You'll get further insights from industry leaders, innovators, and executives in the world's largest companies on the topics that are top of mind for everyone. In every episode of Data Citizens® Dialogues, industry leaders unpack data’s impact on the world; like in their episode “The Secret Sauce Behind McDonald’s Data Strategy”, which digs into how AI-driven tools can be used to support crew efficiency and customer interactions. In particular I appreciate the ability to hear about the challenges that enterprise scale businesses are tackling in this fast-moving field. The Data Citizens Dialogues podcast is bringing the data conversation to you, so start listening now! Follow Data Citizens Dialogues on Apple, Spotify, YouTube, or wherever you get your podcasts.
- Your host is Tobias Macey and today I'm interviewing Dan Bruckner about the application of ML and AI techniques to the challenge of reconciling data at the scale of business
- Introduction
- How did you get involved in the area of data management?
- Can you start by giving an overview of the different ways that organizational data becomes unwieldy and needs to be consolidated and reconciled?
- How does that reconciliation relate to the practice of "master data management"?
- What are the scaling challenges with the current set of practices for reconciling data?
- ML has been applied to data cleaning for a long time in the form of entity resolution, etc. How has the landscape evolved or matured in recent years?
- What (if any) transformative capabilities do LLMs introduce?
- What are the missing pieces/improvements that are necessary to make current AI systems usable out-of-the-box for data cleaning?
- What are the strategic decisions that need to be addressed when implementing ML/AI techniques in the data cleaning/reconciliation process?
- What are the risks involved in bringing ML to bear on data cleaning for inexperienced teams?
- What are the most interesting, innovative, or unexpected ways that you have seen ML techniques used in data resolution?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on using ML/AI in master data management?
- When is ML/AI the wrong choice for data cleaning/reconciliation?
- What are your hopes/predictions for the future of ML/AI applications in MDM and data cleaning?
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- Tamr
- Master Data Management
- CERN
- LHC
- Michael Stonebraker
- Conway's Law
- Expert Systems
- Information Retrieval
- Active Learning
[00:00:11]
Tobias Macey:
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. It's 2024. Why are we still doing data migrations by hand? Teams spend months, sometimes years, manually converting queries and validating data, burning resources and crushing morale. Datafold's AI-powered Migration Agent brings migrations into the modern era. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing.
Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today to learn how Datafold can automate your migration and ensure source to target parity. Your host is Tobias Macey, and today I'm interviewing Dan Bruckner about the application of ML and AI techniques to the challenge of reconciling data at the scale of business. So, Dan, can you start by introducing yourself?
[00:01:09] Daniel Bruckner:
Yeah. Thanks, Tobias. It's a pleasure to be here. I'm Dan Bruckner. I'm a cofounder and CTO at Tamr. I've been solving problems in this space for, I don't know, going on 15 years now. And we build solutions for master data management, using AI and machine learning to simplify MDM and make MDM projects successful.
[00:01:34] Tobias Macey:
And do you remember how you first got started working in data?
[00:01:37] Daniel Bruckner:
Yeah. It goes way back. So I actually started out as a physicist. My first job out of college was working at CERN on the LHC, and it was in the days before the LHC had actually started; it was just getting going. And so most of what I did was actually write code and solve computational problems. In those days, we were doing analysis over large volumes of simulated data, trying to model the system and get a handle on our expectations for what was going to happen. So I did that. And as I was doing it, with the system not yet running, I got more interested in the computational problems I was working on and the code that I was writing.
And so when I got back to the States, I decided to pivot and move into computer science. I started programming, and then got interested in computer science research. Because of my background, I naturally gravitated into data and large scale data processing and database systems, and eventually started working with Mike Stonebraker at MIT on research into large scale data integration, approached holistically and using machine learning techniques, applying those techniques in ways that scale to extremely large volumes of data.
[00:02:58] Tobias Macey:
And before we get too much into the application of ML techniques to that challenge of processing data, reconciling it, getting it into a usable state, I'm wondering if you can just start by giving a bit of an overview of some of the different ways that data at the organizational scale becomes unwieldy and some of the challenges that arise from that lack of reconciliation?
[00:03:26] Daniel Bruckner:
Yeah. It's a class of problem that I think is very common, taken for granted, and also not necessarily deeply understood. I like to start from an analogy to software engineering and Conway's Law. Are you familiar with Conway's Law?
[00:03:44] Tobias Macey:
I am. That the software design will eventually reflect the organizational communication patterns for better or worse.
[00:03:52] Daniel Bruckner:
That's exactly right. So the structure of your organization dictates the structure of your software architecture. And the same is true to a large extent in data and data management. The structure of data within a large organization is naturally going to reflect the structure of the teams and the groups and the divisions that created that data. And that can be a very good thing. It means individual teams can operate naturally and independently and use the data that they need to be successful and to do what needs doing. But it also creates big challenges and missed opportunities when you start to move up a level and want to reason about, change, and ask questions of the data across the whole organization.
Different teams are speaking fundamentally different languages. They often have redundant, duplicated data. And it can be very hard to actually use that data to communicate and to make high level decisions within the org. From a nuts and bolts perspective, what kinds of issues are we talking about? As basically a database guy, I come back to the kinds of problems we're interested in: fuzzy unions (putting together schemas across different databases), fuzzy joins, and fuzzy group-bys. So, essentially, cases where you would like to treat large sets of data as a coherent whole, a single database, but you don't have the keys. You don't have the common attributes.
You don't have the common identifiers, and so you're not actually able to just directly go and ask the questions you want to ask. First, you have this problem of mechanically getting all the data together, linking it up, and getting a coherent picture that you can query and use for applications, analytics, whatever it is you're trying to accomplish.
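To make the "fuzzy join" idea concrete, here is a minimal sketch: two toy customer tables with no shared key, linked by normalized-name similarity. The tables, the normalization rules, and the 0.85 threshold are all illustrative assumptions, not a description of any particular product.

```python
# A "fuzzy join": linking two tables that lack a shared key by comparing
# normalized names. Real systems add blocking, trained models, and richer
# features; this shows only the core idea.
from difflib import SequenceMatcher

crm = [
    {"id": "c1", "name": "Acme Corp.", "city": "Boston"},
    {"id": "c2", "name": "Globex LLC", "city": "Austin"},
]
billing = [
    {"id": "b7", "name": "ACME Corporation", "city": "Boston"},
    {"id": "b9", "name": "Initech", "city": "Denver"},
]

def normalize(name: str) -> str:
    # Lowercase, strip punctuation, and drop common legal suffixes.
    cleaned = "".join(ch for ch in name.lower() if ch.isalnum() or ch == " ")
    stop = {"corp", "corporation", "llc", "inc"}
    return " ".join(tok for tok in cleaned.split() if tok not in stop)

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

# Keep pairs whose names are similar enough; 0.85 is an arbitrary cutoff.
links = [
    (left["id"], right["id"], round(similarity(left["name"], right["name"]), 2))
    for left in crm
    for right in billing
    if similarity(left["name"], right["name"]) > 0.85
]
print(links)  # [('c1', 'b7', 1.0)]
```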
[00:05:53] Tobias Macey:
And given the reflection of Conway's Law in that data ecosystem for the business, what are some of the attributes of either scale or team dynamics that you see being the biggest contributors to that messiness and that lack of cohesion that brings out these problems?
[00:06:15] Daniel Bruckner:
Yeah. I mean, depending on the scale of the organization, there can be many. But the most common case is that datasets come from applications. They come from processes, whether software based or not, that are well established and designed not primarily to create data, but to solve some problem for the business. So sales, marketing, these basic things that companies do. As a side effect, they produce these piles of data. And the teams that work with those processes and applications are very vested in the way that things work. If you have other teams come in, data teams most frequently, to do analytics and look across different groups, different parts of the org, there's a natural conflict that arises: well, we would like it all to look this way; we think this would solve the problem for the whole organization better. And teams say, no, that's not how we operate. We can't do that. You can't just come in and change our process, change our data.
We've been doing this in this way forever. The problem gets worse the larger the organization gets, and especially for companies that grow through acquisitions and mergers. You start bringing in data that's arisen not just from different teams, but from completely different organizations, and start trying to put it together and consolidate. And those kinds of small inconsistencies can really start to undermine the process of finding a good way to operate and put all the data together coherently.
[00:07:49] Tobias Macey:
And so that process of reconciling data, bringing it together in a way that makes organizational sense so that you can start to ask those questions across the business is largely called master data management or building golden records. And I'm wondering if you can talk to some of the typical approaches that teams and organizations try to take to be able to actually embark upon that process of building those master records and reconciling that data and some of the scaling challenges that they run into, whether that's in terms of scaling at the compute level or scaling just in terms of time, effort, and human capacity?
[00:08:32] Daniel Bruckner:
Yeah. That's a good question, and a big question. So breaking that down a bit: master data management really does cover the heart of this problem of linking different datasets together. There are a number of stages in a successful master data management project that you have to move through. One stage even starts just ahead of getting into master data, which is physically getting the data together and treating the data quality problem: establishing a common level of quality, often pulling in third party source data or reference data to enrich it, and getting your base to a good spot.
Then, okay, you have a set of sources, different data tables, database systems. You put them in one physical place, and now you want to link them together. You want to create the point of reference across common records and solve that linkage problem, the entity resolution problem. Once you've done that, great, now we have a common identifier that we can use. You're going to draw in all the data from these systems and attempt to consolidate it and produce golden records. So now you have an identifier that links source data.
And for each identifier, you have a golden record: this is the truth about this customer or this supplier or this part in our organization. So you produce that record, and now you want to manage it over time. As you go farther, you're going to want to push more of that out to the source systems themselves and to downstream applications and analytic engines. So, essentially, you solve this problem of the coexistence of master data on one hand and all these operational and analytical datasets that exist everywhere in the organization. The physical problem of linking those things together and keeping them consistent becomes a big challenge as you start to operationalize the master data.
And in different scenarios, different use cases, different folks will focus on different parts of this journey through master data management. Maybe some projects only require getting that identifier: throw the data together, get the identifier, great, that's all we needed, we can run with it. Maybe you're just doing some analytics, so you do that as a one off. Every quarter, you produce a report. So we refresh our data, we get this high level of integrity with our master data, we generate our report, and we're good to go.
As you move farther along and want to actually take that data, operationalize it, use it on an ongoing basis, and keep it fresh constantly, so that as new data comes into operational systems it's immediately mastered and incorporated with the master data, you have to go farther in this journey of pulling together and closely integrating your master data system with your operational database systems and other applications.
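A toy illustration of the golden-record step described above: records that the linkage stage has already clustered under one entity are merged attribute by attribute under a survivorship policy. The "most recent non-null value wins" rule here is one assumed policy among many ("most trusted source wins" and "most frequent value wins" are equally common).

```python
# Toy golden-record consolidation for one entity cluster.
# Survivorship policy: prefer the most recently updated non-null value.
from datetime import date

cluster = [
    {"source": "erp", "name": "Jane Doe", "email": None,
     "address": "12 Elm St", "updated": date(2022, 3, 1)},
    {"source": "crm", "name": "Jane A. Doe", "email": "jane@example.com",
     "address": "98 Oak Ave", "updated": date(2024, 6, 9)},
]

def golden_record(records: list) -> dict:
    golden = {}
    for field in ("name", "email", "address"):
        candidates = [r for r in records if r[field] is not None]
        best = max(candidates, key=lambda r: r["updated"])  # most recent wins
        golden[field] = best[field]
    return golden

print(golden_record(cluster))
# {'name': 'Jane A. Doe', 'email': 'jane@example.com', 'address': '98 Oak Ave'}
```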
[00:11:39] Tobias Macey:
And the canonical example that's often brought to bear in this context is the customer record: this is our customer, and these are all of the attributes about them. Then there's the challenge of which system we actually trust the most to collect that information accurately, or that different systems collect different pieces of information. And when you're dealing with people, they change locations. So you have to make sure that you have the current address, but you also want to know their old addresses. So then you have the issue of historizing that data, and this applies across other business objects beyond just your customers.
And I'm wondering if you can talk to some of the people problems of figuring out what those decision points are: the ways we determine which place we actually trust the most for which pieces of that data, and then managing the merging of those attributes from multiple systems. Being able to say, this is the thing we trust the most; that other system over there has different information, so we're going to ignore it here, but over there we're going to use it. I'm just wondering about some of the ways that organizations have to wrestle with that kind of constant decision making about what data to use where, when, and how.
[00:13:04] Daniel Bruckner:
Yeah. I think what you're picking up on is a really key observation about master data management as a problem space. It's not just a technical problem. If it were just a technical problem, putting data together and creating a coherent knowledge graph, well, we know how to do that. We can do that. In real organizations, it's also a political problem. So you're not just trying to get the data to agree. You're actually trying to get these different teams to agree, coexist, and each have their own special view of the data. Because the reason the data silos were created in the first place was all of these teams operating independently and efficiently.
Pulling together those silos, you need to make sure that you don't actually interfere with the independent, happy, trustful operation of everyone who created them. And what it comes down to is solving the master data management problem less from a dictatorial "we will come up with the one standard that will work for everybody" kind of approach, and more by creating a repository for the linkage: a system and a common touchpoint for all of these different silos and applications to touch base and stay closely linked in a clean way. One of our early customers, a very large manufacturer, when we started working with them, essentially said: okay, here's our history in master data management. We're a company with many lines of business and different divisions.
We have 26 different major ERP systems. We have more, but the long tail is too much to worry about. All of our parts, all of our suppliers exist across all these 26 systems. We've had several efforts at master data management. And what happens is we go in, we pick some of these systems, the largest, most popular ones that we think are the most trustworthy. We collect the data. We consolidate it. We create this master. We have a new identifier. And at the end, no one wants to use it. We now have 27 systems for all of our supplier data and all of our parts data. And so if you do the technical work, but don't also do it in a way that meets the consumers of the data where they are, then the project can be a failure and essentially just make the problem worse.
So it's really critical to find the way to not just create a standard, but create a system that bridges the gap and maintains the connectivity between all these different consumers, and does it in a scalable way. If you take 3 of 20 systems and say, this is how we're consolidating, well, what about the 17 other teams? Their data's gone now. They don't know the new identifier. They have no frame of reference. So you need an approach that can scale to handle the whole problem.
[00:16:17] Tobias Macey:
The other interesting piece of this is that business intelligence and data warehousing have existed in some fashion for at least the past 30 years, give or take. And so you would think that given that time span, this is a problem that would have been solved at least reasonably well by now. And yet even today, it's still a challenge that organizations are tackling and starting new projects on today, tomorrow, next week. And I'm wondering what are some of the evolutionary aspects of the problem that lead us to keep revisiting it and resolving it, organization after organization, rather than it being a well established, well understood, more or less solved problem?
[00:17:06] Daniel Bruckner:
Yeah. It's a good question. I'd say master data management is going on about three decades old now, so companies have been building systems to solve this problem for a while. The traditional systems tend to focus on using sets of rules and strict data models to put together data from source systems, and they tend to focus more on the operational side. You do some basic data quality, you do some basic data integration, but, fundamentally, you get your set of golden records and then, okay, put that in a database, let's go and use that. They tend to focus on supporting applications downstream, but they don't necessarily do a great job of pulling in lots of data from the organization, linking it together coherently, cleaning it up, enriching it, and making sure that the master data itself is actually the best view there is of the data, has the highest possible quality, and has all of this linkage across the organization.
So what's happening currently is that the application of AI and machine learning to these problems actually unlocks much better solutions and the ability to tackle this problem much more holistically, and in a much higher fidelity way than has happened traditionally.
[00:18:34] Tobias Macey:
So to that point of the application of ML and AI in this ecosystem: machine learning in various forms has been used with various levels of success in this context. You mentioned rules based systems; that's maybe the expert systems era of AI, which we have largely moved past. And then there have been a lot of different natural language processing techniques used for trying to do some of that entity extraction and entity resolution. And I'm wondering if you can just talk to some of the evolutionary aspects of the application of ML and AI to the problem of master data.
[00:19:12] Daniel Bruckner:
Yeah. Absolutely. You're exactly right to start with the rules, because I don't want to say that rules are the wrong way to solve this problem. They're actually very good for the right use case. But just for context, the fundamental nature of dealing with dirty data is like the Tolstoy line: all happy families are alike, but every unhappy family is unhappy in its own way. It's true of data too. All bad data is bad in a different way. And so you need a lot of tools in your toolkit. Traditionally, rules were the approach. If you come up with a good data model, if we just model the problem well enough, if we have a customer model and schema that's good enough, then we can put all customer data into that schema, and we can define rules for how it should work.
The reality is that data can mean lots of subtly different things and be used in subtly different ways, and it pretty much always is. So you have to always be ready to account for these slight differences in granularity or slight differences in shade of meaning. Essentially, beyond rules, fuzzy matching becomes big, and that starts to lead into natural language processing techniques, and especially techniques from information retrieval. Applying scalable methods from text search goes a very long way in dealing with fuzziness and solving fuzzy matching problems.
Beyond that, you start to get into statistical techniques and traditional machine learning: building models to classify matches between data, to classify groups and taxonomies of data, and to look for different characteristics of the data to perform reconciliation and consolidation. And once you've entered into this statistics and machine learning world, the sky's the limit. Techniques from 30 years ago in information retrieval are great, but you can move all the way through to what we have today with large language models and generative AI, and apply that to the problem as well.
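One common way that information-retrieval lineage shows up in practice is TF-IDF over character n-grams: the same machinery as text search, repurposed to score record similarity cheaply at scale. A sketch using scikit-learn as one readily available implementation; the data and parameters are illustrative.

```python
# Fuzzy matching with text-search machinery: TF-IDF over character n-grams.
# Character n-grams tolerate typos and abbreviations better than whole words.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

names_a = ["International Business Machines", "Acme Corp", "Globex LLC"]
names_b = ["IBM - Intl Business Machines", "ACME Corporation", "Initech"]

vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3))
vectorizer.fit(names_a + names_b)  # one shared vocabulary so the spaces line up

scores = cosine_similarity(
    vectorizer.transform(names_a), vectorizer.transform(names_b)
)
for i, row in enumerate(scores):
    j = row.argmax()
    print(f"{names_a[i]!r} best matches {names_b[j]!r} (score {row[j]:.2f})")
```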
[00:21:56] Tobias Macey:
Large language models and generative AI have definitely occluded the overall landscape of ML in recent years, where they have, to some degree, become synonymous with AI even though that's not technically accurate. And I'm curious whether you see those capabilities as being a transformative shift in the space of master data management, record reconciliation, and entity extraction, or if it's largely an iterative step; maybe a large iteration, but not a wholly transformative piece, just a step change improvement in what we already had.
[00:22:35] Daniel Bruckner:
Yeah. I think this is sort of a lame answer, but it's a little of both. There are key ways that are incremental: in how you match records or enrich data or classify data or parse data, applying language models adds another really valuable tool in the toolkit. Well, actually, there are some scenarios where it does completely let you throw away a lot of techniques of the past. In schema mapping, for example, large language models without much training are very good at: I give you two tables, tell me how to align them, from a schema perspective.
So for some problems at a small scale, it just blows everything else out of the water. For larger scale problems, LLMs can give you a lot of subtlety that would be very difficult to get with traditional techniques. For example, language model embeddings are extremely good at capturing things like synonyms, synonymous meanings across different terms, and abbreviations, without having to build lookup tables and additional artifacts on the side. You get a lot of the richness of the meaning in the language, and how language works, for free. Except it's not actually free: there's a cost in terms of compute. So it occupies an interesting spot in the trade off space, where if you can figure out how to use it in a cost effective way alongside other cheaper, more scalable techniques, then you can get a tremendous amount of value.
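A small sketch of the embeddings point, using the open source sentence-transformers library and a small general-purpose model as assumed stand-ins: semantically equivalent variants score as similar without any hand-built lookup tables, at the cost of running a neural model over every string you embed.

```python
# Embeddings capture synonymy and abbreviations "for free": no lookup
# tables, but each encode() call costs real compute, which is why these
# are usually reserved for the ambiguous middle of the problem.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # small general-purpose model

pairs = [
    ("Intl Business Machines", "International Business Machines"),
    ("123 Main St.", "123 Main Street"),
    ("Robert Smith", "Bob Smith"),
]
for left, right in pairs:
    a, b = model.encode([left, right])
    cos = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    print(f"{left!r} vs {right!r}: cosine similarity {cos:.2f}")
```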
I think where it's really transformative is in creating more natural user experiences, actually working with the data and solving these problems for end users. One of the challenges that we've wrestled with at Tamr, and learned a ton about and gotten better and better at over our 12, 13 year existence, is taking complex data problems and complex machine learning concepts and encapsulating them in a simple user interface, making them understandable to end users who are not PhDs in machine learning, AI, and statistics.
And LLMs can actually come in and explain concepts and complex scenarios in straightforward ways. A very common situation for us is that our system is doing some record matching and consolidation and presents the user with an ambiguous case. We've clustered data from 40 different systems. Here are 40 different records describing what we think is the same actual person, one of your customers. And what do you think? Are these all in fact the same? Looking at a table with 40 records and a lot of columns is an overwhelming experience. It can be difficult for a human to parse that and decide: do they agree with the machine learning or not? Did it hallucinate this?
And what a language model can do is give you ways to summarize what's on screen, these very complex concepts, and draw user attention to the key signals. Hey, look: all 40 records here share the same last name. For the first name, there are only 3 different values, and this is a common nickname, so it looks legit. There are a few different addresses, but it seems like maybe they moved over time. And we can put that in context. We also know what the model on the back end was that produced this suggestion.
We can provide that context to a language model as well, and it can explain the reason these got pulled together: the weightings. Let's say this is a B2C customer mastering data product, and in this data product the model looks for and puts a lot of weight on commonality in the name and the address and the phone number. It gives users a way to engage with a hard problem, a very niche problem, but in a way that's easy to understand and accessible.
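The review-assist pattern described here might look roughly like the following: the clustered records and the model's match signals get packed into a prompt, and the LLM returns a plain-language summary for the data steward. The records, the signals, and the call_llm stub are hypothetical placeholders, not Tamr's actual interface.

```python
# Sketch of "explain this cluster" for a human reviewer. The prompt carries
# both the matched records and the signals the matching model relied on.
import json

cluster = [
    {"source": "crm", "first": "Bob", "last": "Smith",
     "address": "12 Elm St, Boston", "phone": "617-555-0101"},
    {"source": "billing", "first": "Robert", "last": "Smith",
     "address": "12 Elm Street, Boston MA", "phone": "617-555-0101"},
]
match_signals = {"last_name_exact": True, "phone_exact": True,
                 "first_name_nickname": True, "address_fuzzy": 0.93}

prompt = (
    "You are helping a data steward review an entity-resolution suggestion.\n"
    "These records were clustered as the same customer:\n"
    f"{json.dumps(cluster, indent=2)}\n"
    f"Model match signals: {json.dumps(match_signals)}\n"
    "In 2-3 sentences, explain in plain language why these records likely "
    "refer to one person, and flag anything the reviewer should double-check."
)

def call_llm(prompt: str) -> str:
    # Placeholder: wire up whatever chat-completion client you use.
    raise NotImplementedError

# print(call_llm(prompt))
```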
And why this is so important is trust. One of the interesting things since the launch of ChatGPT and the mainstreaming of AI and large language models is that it has brought this question of trust, can you trust the robots, into mainstream consciousness. We've been dealing with this problem from the start. We've been a machine learning company since the beginning, and everyone always loves to be smarter than the computer. So there's always a questioning of: I see the model is suggesting this; I want to make sure the model isn't doing something crazy.
With large language models now being mainstream, people know that these things hallucinate. They come up with nonsense if you push them a little too far. So there's this real questioning of: if you're using machine learning and AI techniques to make this data good, can I trust it? Is it real? And so being able to take our results, the results of the modeling, and communicate them in a way that puts the data front and center, but also provides the context, is really important for end users to feel like, yeah, this is actually solving the problem. This is real.
This is data I can trust. This is the truth about our customers, and we should adopt this and share this. So it's kind of ironic that the technology that's leading to this crisis of trust can also be a big part of the solution. But I think, as we move past the initial shock of artificial intelligence, that's what we're moving into.
[00:29:15] Tobias Macey:
As a listener of the Data Engineering Podcast, you clearly care about data and how it affects your organization and the world. For even more perspective on the ways that data impacts everything around us, don't miss Data Citizens Dialogues, the forward thinking podcast brought to you by Collibra. You'll get further insights from industry leaders, innovators, and executives in the world's largest companies on the topics that are top of mind for everyone. In every episode of Data Citizens Dialogues, industry leaders unpack data's impact on the world, like in their episode "The Secret Sauce Behind McDonald's Data Strategy", which digs into how AI driven tools can be used to support crew efficiency and customer interactions. In particular, I appreciate the ability to hear about the challenges that enterprise scale businesses are tackling in this fast moving field.
The Data Citizens Dialogues podcast is bringing the data conversation to you, so start listening now. Follow Data Citizens Dialogues on Apple, Spotify, YouTube, or wherever you get your podcasts. In that space of large language models and their application to the problem of master data management, what are some of the pieces that are missing out of the box? As in: I have a large language model, but now I actually have to stand up pieces x, y, and z before I can really start to bring it to bear on the problem, particularly in that context of hallucination and trust building.
I'm wondering how you have been working through that challenge of harnessing the capabilities while mitigating the inherent risks, in a problem where you're trying to build trustworthy data with an inherently unpredictable utility.
[00:31:04] Daniel Bruckner:
Yeah. Let's see, I'm going to go back a little to start. When we got started at Tamr, our initial vision and goal, this was in the days when the semantic web was really hot and knowledge graphs were becoming a big deal, was: we want a product that can efficiently produce the enterprise knowledge graph. So within your organization, you have this extremely high quality linked data describing everything you do, everything you care about, all the key entities: your customers, your suppliers, your employees, all the parts and products that you produce, this whole space.
And now, in this context where we are today with these large scale models, that idea of a correct knowledge graph is still the guiding principle. Or really, it's this idea of truth for the enterprise. To get real value out of large scale artificial intelligence, you need to find ways and architectures to tie it back to that truth. It's very good at syntax and articulating ideas. We just need to do a good job of giving it the right content and the right context, depending on what problem it's solving.
So there's an increased importance of data quality and data linkage and master data management, to be able to produce these common datasets and maintain them, so that you can point your GPT at them and get really good, high quality, trusted answers from the AI.
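That grounding idea is essentially retrieval-augmented generation over the mastered data: fetch the relevant golden records, then constrain the model to answer only from them. A sketch with assumed retrieve and call_llm hooks rather than any specific vendor API.

```python
# Retrieval-augmented answering over master data: the LLM supplies the
# language, the golden records supply the truth.

def answer_from_master_data(question: str, retrieve, call_llm) -> str:
    # retrieve() is assumed to look up relevant golden records, e.g. via
    # an embedding index over the mastered entities.
    records = retrieve(question, top_k=5)
    context = "\n".join(str(r) for r in records)
    prompt = (
        "Answer using ONLY the master data below. If the answer is not "
        "in the data, say you don't know.\n"
        f"Master data:\n{context}\n\nQuestion: {question}"
    )
    return call_llm(prompt)
```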
[00:33:03] Tobias Macey:
I imagine that the inclination for people who are thinking about bringing AI to bear on this problem is: oh, well, AI is very sophisticated, it has all of these nifty capabilities, I should be able to just set it loose on my data, it'll solve all of my problems, and I don't really have to worry about it, maybe just click yes or no a couple of times. Versus what I imagine to be the reality: you actually want to use all of those manual and statistical techniques that we have been relying on and developing for the past 30 years to do maybe the 80% case, and use the LLM for that 20% case that takes 80% of the time, to accelerate the process a bit. And I'm curious how you are guiding organizations on that strategic aspect of how, where, when, and why to actually bring these language models to bear on the problem in conjunction with all of the other techniques that have been developed and that we have established trust and confidence in.
[00:34:11] Daniel Bruckner:
Yeah. That question has become central for everyone these days: am I comfortable using generative AI with my data? The previous version of this was: am I comfortable putting our most business critical, important, high value data assets on the cloud? That's now shifted. Most organizations are comfortable with the cloud, but now it's: well, can machine learning look at it? What if, God forbid, someone trains a model on our data and shares that model?
I think there's a certain amount of just feeling out the right level of security around these things, and I don't want to go into that too deeply. But just for the purposes of solving problems in this space, there are big opportunities to improve the quality of matching and mastering using these new models. But they need to be harnessed. There's been a lot of research over the last few years applying large language models to these sorts of fuzzy database problems: fuzzy joins, group bys, schema mapping. And what they basically find is that large language models, with very little configuration and engineering, perform as well as a lot of the state of the art techniques that existed previously.
The challenge is putting those techniques into a system and a product where they're used intelligently in conjunction with other lower cost techniques from more traditional machine learning, and also in conjunction with rules and human feedback. One of our founding principles at Tamr is that the machine is never right 100% of the time. You need humans in the loop to review, to address complex cases, and to assess how well things are going. So you need a system that incorporates all of these pillars: human input, rules, and model predictions.
And users love rules. People love rules. If you just say a match means the Social Security number value is equal, everyone loves that. They understand what that means. Given the truth about data quality and how things exist in the real world, that rule might actually be wrong a large proportion of the time in a real, particular case. But people love that when it's wrong, they get why. Whereas if some machine learning model is wrong, and they want to know why, it's like: well, let me show you this random forest decision tree and talk about that.
[00:37:14] Tobias Macey:
Let me explain that carefully. Yeah.
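A toy triage sketch of the layering being described: a deterministic rule decides the easy cases outright, a model score decides the confident ones, and everything ambiguous goes to a human. The SSN rule and both thresholds are illustrative assumptions.

```python
# Rules first, model second, human last. Cheap and explainable cases never
# touch the expensive machinery; ambiguous cases never skip a person.

def match_decision(a: dict, b: dict, model_score: float) -> str:
    # Rule layer: an exact identifier match is accepted outright. Users
    # trust this, even though on dirty real-world data the rule itself
    # can be wrong (typos, reused identifiers).
    if a.get("ssn") and a.get("ssn") == b.get("ssn"):
        return "match (rule: equal SSN)"
    # Model layer: act only on confident predictions.
    if model_score >= 0.95:
        return "match (model, high confidence)"
    if model_score <= 0.05:
        return "non-match (model, high confidence)"
    # Everything in between goes to a human reviewer.
    return "send to human review"

print(match_decision({"ssn": "123-45-6789"}, {"ssn": "123-45-6789"}, 0.4))
print(match_decision({"ssn": None}, {"ssn": None}, 0.5))
```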
[00:37:18] Daniel Bruckner:
So you need all of these things, and they all come together to create a coherent solution that doesn't have an absolutely overwhelming cost. One thing we haven't talked much about is that if you want to solve these problems at scale, it can become very expensive very quickly. By their nature, these are all matching problems. They're quadratic: naively, you would be comparing everything to everything else. So if you take a naive approach, you're going to burn a lot of compute, you're going to spend a lot of money, and you probably won't get the best results. So you need a way to identify the easy parts of the problem and solve them easily.
The unsolvable parts of the problem, you send to a human. And then for everything in between, that's where we're seeing the biggest boosts from vector embeddings, large language models, and newer cutting edge techniques. They can dig into those ambiguous cases and get a lot of value.
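The standard answer to that quadratic blowup is blocking: only records that share a cheap key are ever compared, and just those candidate pairs flow on to the expensive matchers. A minimal sketch with a deliberately crude, assumed block key.

```python
# Naive entity resolution compares every record to every other: O(n^2).
# Blocking compares only records that share a cheap key.
from collections import defaultdict
from itertools import combinations

records = [
    {"id": 1, "name": "Acme Corp", "zip": "02110"},
    {"id": 2, "name": "ACME Corporation", "zip": "02110"},
    {"id": 3, "name": "Globex LLC", "zip": "73301"},
    {"id": 4, "name": "Acme Ltd", "zip": "73301"},
]

blocks = defaultdict(list)
for r in records:
    key = (r["name"].lower()[0], r["zip"])  # crude block key, for illustration
    blocks[key].append(r["id"])

candidate_pairs = [
    pair for ids in blocks.values() for pair in combinations(ids, 2)
]
print(candidate_pairs)  # [(1, 2)]: 1 candidate pair instead of all 6
# The trade-off is recall: record 4 ("Acme Ltd") lands in a different block
# and is never compared to records 1 and 2, so block keys must be chosen
# carefully (often several in parallel) to keep true matches together.
```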
[00:38:26] Tobias Macey:
Another complexity of this space, particularly when you're first embarking on the process of trying to reconcile all of your organizational data, is the level of expertise in the process of master data management, as well as the level of familiarity with the data itself. The person who created the data, who figured out what the schema should be and decided which attributes to pick, may not even be with the company anymore, so you don't have all of the context or all of the information. That's especially true when you have an inexperienced team who's just starting on this process, and then you say: hey, here, rub some machine learning on it, it's magical.
I'm wondering what are some of the potential pitfalls that you're setting them up for if they don't have an appropriate understanding of the actual capabilities and limitations of the techniques, so that they can be appropriately skeptical or appropriately confident where each applies.
[00:39:31] Daniel Bruckner:
Yeah. That touches back on the political problem of master data management. You have to get a lot of different kinds of people involved. The domain experts, the subject matter experts, are very rarely the people who own the project. They have to be drafted in and convinced to share their time to really make the project successful. And so, yeah, there does tend to be skepticism of what the system is doing. If it's an AI based system, the skepticism is increased. They'll look at the data and say, this doesn't make sense, why is it doing it this way? And so you need a workflow that embraces that uncertainty to an extent, or makes users comfortable with the fact that the data is bad and we're not going to fix it all at once. It's going to be a process.
Something interesting we learned over the years: one of our earlier products was very explicitly designed as a system for end users to train models to master data. There were workflows for end users to come in, and the system used active learning techniques to surface really high value examples. Users go in, they label the examples, and the model gets trained as quickly as possible. It made this ML practitioner experience available to non ML experts.
And as easy as we made that system, it was still hard to use, and you still had to understand machine learning. So for subject matter experts it was a challenge, and there was a lot of hand holding. So we've moved towards pretrained models: we have general models that can apply to your domain and start at a very high level, and then you tune them. But when we first did that, we no longer had this active feedback workflow, and what we found was that it actually damages trust with the end users. They want to give feedback to the machine, and they want to have that back and forth, have that conversation.
That's really important in gaining trust in the system. It was the system directing how you should be exploring the data, how you should be interacting with it, and how you should be understanding it. So we've brought that back. Even though now you're not training the model from scratch, you can still go in and interact with the system as if you're training it, and it's a positive user experience.
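The heart of that active-learning workflow fits in a few lines: rather than asking experts to label random pairs, surface the pairs the current model is least certain about, so each label moves the model as far as possible. The scores and the budget of three questions are made up for illustration.

```python
# Uncertainty sampling: scores near 0.5 are the model's "I don't know",
# so those pairs are the most informative ones to ask a human about.

def most_informative(scored_pairs: list, k: int = 3) -> list:
    return sorted(scored_pairs, key=lambda p: abs(p[1] - 0.5))[:k]

scored = [("pair-17", 0.97), ("pair-4", 0.52), ("pair-31", 0.08),
          ("pair-9", 0.44), ("pair-22", 0.61)]

for pair_id, score in most_informative(scored):
    print(f"ask the expert about {pair_id} (model score {score})")
    # label = get_label_from_user(pair_id)  # then retrain on the new label
```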
[00:42:26] Tobias Macey:
And in your experience of building Tamr, working with organizations to address this challenge of master data management, and incorporating ML and the newer generation of generative AI capabilities into that system, what are some of the most interesting or innovative or unexpected ways that you've seen ML and AI techniques used in that context of data resolution and master data management?
[00:42:56] Daniel Bruckner:
Yeah. It's a good question. Our architecture is fairly generic. We primarily work with structured enterprise data and relational database systems, but we can extend the system to work with more complex data types. A number of years ago, we added support for geodata and GeoJSON, essentially polygons. And the applications that users have for that are really quite interesting and surprising. Normally, you think about master data management as applying to a fairly narrow subset of data: customer data, organizations, people, parts, products.
That's kind of the sweet spot. But with some of these other data formats that we support, we've seen it applied to things like radar tracks: keeping track of fuzzy data related to planes and other kinds of aerial phenomena. It really gets out there. It's pretty cool to see.
[00:44:13] Tobias Macey:
In your work of building these systems, coming to grips with the constantly evolving landscape of ML and AI techniques, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process of building a product that harnesses that?
[00:44:31] Daniel Bruckner:
So a big challenge that we frequently thought we'd solved, but continue to find better solutions for every couple of years, each solution seems to reveal another side of the problem, is how to connect the two modes of solving these sorts of problems. We talked earlier about how you can take some data as a snapshot, as a one off project: run some large scale batch computation, put it all together, and then be done. But the data is not static. What happens when customers x, y, and z come in and someone moved, their address changed? Maybe some customers passed away and you no longer want to be sending them offers in the mail. Maybe two companies merge, or split up.
Things happen in the real world, and you need to manage this on an ongoing basis. Most applications in the real world aren't content with a one off solution and need to be able to solve the problem in a live, ongoing, updated in real time kind of fashion. And we've found it's a major challenge marrying the extremely efficient, high throughput, batch oriented solutions to these problems with a more operational, live database system that solves it on an ongoing basis, in real time or in a streaming, event driven fashion.
The trick is to take the core of the system, all these techniques that we've been talking about, rules based matching, lessons from natural language processing and information retrieval, and the latest AI has to offer, and apply them in consistent ways across two extremely different architectures, marry those together, and do it in a way that end users can actually use and transition from one to the other. So you can come into the system and load a whole bunch of data, process it in a big way, create that initial starting point for what your master data should look like, and then, boom, you're up and running. It's in a live database.
Now you can interact with it directly. You can start to point other systems at it and use it in a way where it's not a one off. It's not some extract that's going to be irrelevant tomorrow that no one's going to adopt. It becomes a living piece of the overall architecture within an organization.
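In miniature, that batch/streaming marriage might look like this: a batch pass bootstraps the golden records, then an incremental path matches each arriving record against them instead of recomputing everything. Here best_match is a stub standing in for the same matchers (rules, blocking, embeddings) the batch pass would use against an index of golden records.

```python
# Incremental mastering after a batch bootstrap: new records are matched
# against existing golden records and either merged or started as a new
# cluster. The threshold and the matcher itself are illustrative stubs.

golden_records = {"g1": {"name": "Jane Doe"}, "g2": {"name": "Acme Corp"}}

def best_match(record: dict):
    # Stub: a real system reuses the batch matchers against an index
    # of golden records and returns (golden_id, confidence).
    return None, 0.0

def on_new_record(record: dict) -> str:
    match_id, score = best_match(record)
    if match_id is not None and score > 0.9:
        return match_id  # merge into the existing golden record
    new_id = f"g{len(golden_records) + 1}"
    golden_records[new_id] = record  # start a new cluster
    return new_id

print(on_new_record({"name": "ACME Corporation"}))  # 'g3' with the stub matcher
```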
[00:47:31] Tobias Macey:
And for people who are addressing these challenges of master data management and data reconciliation, trying to figure out the cross cutting view of their business, what are the cases where ML and AI are the wrong choice for some or all of that problem?
[00:47:51] Daniel Bruckner:
Yeah. Good question. I think it comes down to the simplicity of the problem. AI and ML are bright and shiny, but for smaller scale problems, maybe you're just putting together a couple of sources, or you have a one off and you're trying to create a one time presentation. If you don't have a lot of complexity to the problem, then deterministic techniques are likely to win. There's nothing wrong with rules, and applying a rule can be much cheaper than applying AI/ML to the problem.
So if you have a low stakes scenario where you can just 80/20 it and get a good answer quickly, then, yeah, go crazy with deterministic solutions. That should really be the first step of any approach: pick the low hanging fruit, then get into the hard problem and make it really good.
[00:48:57] Tobias Macey:
And as you continue to invest in and keep tabs on this evolving space of large language models and generative AI and its application to the challenge of data cleaning, what are some of the hopes or predictions that you have, or any specific techniques or areas of effort that you're keeping a close watch on?
[00:49:22] Daniel Bruckner:
Yeah. I think the ability to build intelligent agents into existing user workflows is creating a big opportunity. The first wave of this AI rollout was: put a chatbot in it. You've got a product, put a chatbot in it, it's going to be amazing. And there are some good applications for that. But I think what's really coming up next is looking at the problems where LLMs are extremely well suited, and then applying those to actual key features within a product and deploying them. Really starting to think of it as a capability we can productionize.
How do we think about our product road map, where we build that, how we use it, how we adopt it? And I think the upshot is that a lot of the challenging work, not in solving the master data management problem itself, but in managing the system and its complexity, can be automated to a much larger extent. There's a lot of configuration that goes into pulling data from different systems, aligning all the schemas, figuring out how you want to enrich, how you want to apply data quality transformations, how you want to pull in third party source data; essentially creating that model of what you want your master data to look like, starting from what all your source data looks like.
And there are big opportunities for LLMs to go in and simplify that, turning it into a very straightforward, basic, wizard like experience for setting up this extremely complex machine to go and process all this data in complex ways. And then to manage it over time. Putting agents into the system can take the hardest parts of the user experience and either automate them away or turn them into a delight for end users. So we're focused a lot on really simplifying down that experience, and making master data management something that isn't this scary thing that sounds like it's doomed to fail and will be very expensive, but more like: no, this is something you need. If you're not doing this, you're crazy. All your data could be 10 times better, and you won't be tearing your hair out to get there.
[00:52:00] Tobias Macey:
I think that point too, of figuring out what that common cohesive schema is, what representation is going to be useful and applicable and easy to integrate, is one of the challenges as well, and maybe the LLMs can help set that initial pass of: here's something it could look like. Because at either end of the spectrum, you have either people who are unable to see the art of the possible because it looks too daunting, or people who ask for the impossible because they think it's easy.
[00:52:35] Daniel Bruckner:
Yeah. Absolutely. LLMs are really good at translating, right? You can speak different languages and they can act as an intermediary, and it just works somehow. I think there's a vision for a future here of: what if you did master data management and there wasn't even a single master data model? What if everyone got to keep the model they wanted from the beginning, and there's an LLM in the middle intelligently translating across these things? So everyone thinks they're speaking the same language, even though it's really a tower of Babel situation.
So that's kind of the promise here, and I think it's a big opportunity. There's still a lot of challenging engineering and product development to get there, but that's where we're headed.
[00:53:25] Tobias Macey:
Are there any other aspects of this overall space of master data management and the application of ML and AI to its execution and implementation that we didn't discuss yet that you'd like to cover before we close out the show?
[00:53:39] Daniel Bruckner:
I feel like I had something more to say about third party data, but, honestly, I think we might be good.
[00:53:46] Tobias Macey:
Fair enough. Yeah. Well, for anybody who wants to get in touch with you and follow along with the work that you and your team are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:54:04] Daniel Bruckner:
I think it comes back to a location challenge somehow. Maybe I'm just thinking about this because it's the problem I've been dealing with lately, but I feel like we haven't really solved the cross cloud problem. There are really good systems on different clouds, and they don't translate one to one. So there's a lot of essential technology that's locked up in different proprietary walled gardens. And it's now very easy to build extremely powerful, cutting edge data architectures for managing your data.
But you have to make some pretty big decisions at the outset and some pretty big bets on vendors and who you trust in the market. And it's gotten a lot harder to remain independent. On the other hand, it's also easier to remain independent: there are a lot of amazing tools breaking up the relational database into its component parts and using independent systems to put it back together. And at the same time, there are a lot of these amazing tools in the open source world.
But it's difficult for those worlds to collide and for all of it to come together into a coherent approach. And I feel like there's a little too much satisfaction with folks thinking that if you put all the data into a single physical place, all of your problems are solved, when really you're just kicking a bunch of problems down the road for 10 years, until you get sick of your vendor and need to go do something dramatically different.
[00:55:57] Tobias Macey:
Now the the data gravity problem is definitely real, and until we are able to circumvent physics, it won't go away. Yeah. Yeah. Alright. Well, thank you very much for taking the time today to join me and share your thoughts and experience on building these master data management workflows, bringing ML and AI to bear, and some of the ways that their current generation of LLMs and generative AI are adding new capabilities and techniques to that process. So they appreciate all the time and energy that you and your team are putting into bringing that to bear and making it more accessible and easier to apply to this challenge, and I hope you enjoy the rest of your day.
[00:56:38] Daniel Bruckner:
Thank you. Yeah. Thanks so much for having me. This has been fantastic.
[00:56:50] Tobias Macey:
Thank you for listening, and don't forget to check out our other shows. Podcast dotnet covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering Podcast is your guide to the fast moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email host at data engineering podcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. It's 2024. Why are we still doing data migrations by hand? Teams spend months, sometimes years, manually converting queries and validating data, burning resources and crushing morale. Datafold's AI-powered Migration Agent brings migrations into the modern era. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing.
Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today to learn how Datafold can automate your migration and ensure source-to-target parity. Your host is Tobias Macey, and today I'm interviewing Dan Bruckner about the application of ML and AI techniques to the challenge of reconciling data at the scale of business. So, Dan, can you start by introducing yourself?
[00:01:09] Daniel Bruckner:
Yeah. Thanks, Tobias. It's a pleasure to be here. I'm Dan Bruckner, a cofounder and CTO at Tamr. I've been solving problems in this space for, I don't know, going on 15 years now. We build solutions for master data management, using AI and machine learning to simplify MDM projects and make them successful.
[00:01:34] Tobias Macey:
And do you remember how you first got started working in data?
[00:01:37] Daniel Bruckner:
Yeah. It goes way back. I actually started out as a physicist. My first job out of college was working at CERN on the LHC, in the days before the LHC had actually started; it was just getting going. So most of what I did was write code and solve computational problems. In those days, we were doing analysis over large volumes of simulated data, trying to model the system and get a handle on our expectations for what was going to happen. And as I was doing that, with the system not yet running, I got more interested in the computational problems I was working on and the code I was writing.
So when I got back to the States, I decided to pivot into computer science, started programming, and then got interested in computer science research. Because of my background, I naturally gravitated toward data, large-scale data processing, and database systems, and eventually started working with Mike Stonebraker at MIT on research into large-scale data integration, approached holistically and approached using machine learning techniques, applied in ways that scale to extremely large volumes of data.
[00:02:58] Tobias Macey:
And before we get too much into the application of ML techniques to that challenge of processing data, reconciling it, getting it into a usable state, I'm wondering if you can just start by giving a bit of an overview of some of the different ways that data at the organizational scale becomes unwieldy and some of the challenges that arise from that lack of reconciliation?
[00:03:26] Daniel Bruckner:
Yeah. It's a class of problem that I think is very common and taken for granted, and also not necessarily deeply understood. I like to start from an analogy to software engineering and Conway's Law. Are you familiar with Conway's Law?
[00:03:44] Tobias Macey:
I am. That the software design will eventually reflect the organizational communication patterns for better or worse.
[00:03:52] Daniel Bruckner:
That's exactly right. So the structure of your organization dictates the structure of your software architecture. And the same is true, to a large extent, in data and data management. The structure of data within a large organization is naturally going to reflect the structure of the teams and groups and divisions that created that data. That can be a very good thing: it means individual teams can operate naturally and independently, and use the data they need to be successful and do what needs doing. But it also creates big challenges and missed opportunities when you move up a level and want to reason about, change, and ask questions of the data across the whole organization.
Different teams are speaking fundamentally different languages. They often have redundant, duplicated data, and it can be very hard to actually use that data to communicate and make high-level decisions within the org. From a nuts-and-bolts perspective, what kinds of issues are we talking about? As basically a database guy, I come back to fuzzy unions (putting together schemas across different databases), fuzzy joins, and fuzzy group-bys. Essentially, cases where you would like to treat large sets of data as a coherent whole, a single database, but you don't have the keys. You don't have the common attributes.
You don't have the common identifiers, so you're not able to just directly go and ask the questions you want to ask. First, you have the problem of mechanically getting all the data together, linking it up, and getting a coherent picture that you can query and use for applications, analytics, or whatever it is you're trying to accomplish.
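To make the fuzzy-join idea concrete, here is a minimal sketch using only Python's standard library and invented customer tables (the names, fields, and threshold are illustrative, not anything Tamr ships). Two silos with no shared key get linked on approximate name similarity:

```python
from difflib import SequenceMatcher

# Two hypothetical customer tables from different silos: no shared key,
# names entered inconsistently.
crm = [{"name": "Acme Corporation", "region": "NE"},
       {"name": "Globex, Inc.", "region": "SW"}]
billing = [{"name": "ACME Corp", "balance": 1200},
           {"name": "Globex Incorporated", "balance": 340}]

def similarity(a: str, b: str) -> float:
    # Character-level similarity in [0, 1]; real systems use richer
    # signals (tokens, phonetics, embeddings) and blocking for scale.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Naive fuzzy join: pair each CRM row with its best billing match above
# a threshold. This is O(n*m) comparisons: fine for a toy, not at scale.
THRESHOLD = 0.6
for c in crm:
    best = max(billing, key=lambda b: similarity(c["name"], b["name"]))
    if similarity(c["name"], best["name"]) >= THRESHOLD:
        print(c["name"], "<->", best["name"], best["balance"])
```

The same scoring idea extends to fuzzy group-bys (cluster rows whose keys score above the threshold) and fuzzy unions (score column names instead of values).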
[00:05:53] Tobias Macey:
And given the reflection of Conway's Law in that data ecosystem for the business, what are some of the attributes of either scale or team dynamics that you see being the biggest contributors to that messiness and that lack of cohesion that brings out these problems?
[00:06:15] Daniel Bruckner:
Yeah. Depending on the scale of the organization, there can be many. But the most common case is that datasets come from applications. They come from processes, whether software-based or not, that are well established and designed not primarily to create data, but to solve some problem for the business: sales, marketing, the basic things that companies do. As a side effect, they produce these piles of data. And the teams that work with those processes and applications are very vested in the way things work. If other teams come in, data teams most frequently, to do analytics across different groups and different parts of the org, there's a natural conflict that arises: "We would like it all to look this way. We think this would solve the problem for the whole organization better." And the teams say, "No, that's not how we operate. We can't do that. You can't just come in and change our process and change our data."
"We've been doing this this way forever." The problem gets worse the larger the organization gets, and especially for companies that grow through mergers and acquisitions. You start bringing in data that's arisen not just from different teams, but from completely different organizations, and start trying to put it together and consolidate. Those kinds of small inconsistencies can really start to undermine the process of finding a good way to operate and put all the data together coherently.
[00:07:49] Tobias Macey:
And so that process of reconciling data, bringing it together in a way that makes organizational sense so that you can start to ask those questions across the business is largely called master data management or building golden records. And I'm wondering if you can talk to some of the typical approaches that teams and organizations try to take to be able to actually embark upon that process of building those master records and reconciling that data and some of the scaling challenges that they run into, whether that's in terms of scaling at the compute level or scaling just in terms of time, effort, and human capacity?
[00:08:32] Daniel Bruckner:
Yeah. That's a good question, and a big question. Breaking it down a bit: master data management really does cover the heart of this problem of linking different datasets together. There are a number of stages in a successful master data management project that you have to move through. One stage starts just ahead of getting into master data, which is physically getting the data together and treating the data quality problem: getting to a common level of quality, often pulling in third-party source data and reference data to enrich it, and getting your base to a good spot.
Then you have a set of sources, different data tables and database systems. You put them in one physical place, and now you want to link them together. You want to create the point of reference across common records and solve that linkage problem, the entity resolution problem. Once you've done that, great, now you have a common identifier that you can use. You're going to draw in all the data from these systems and attempt to consolidate it and produce golden records.
So now you have an identifier that links source data, and for each identifier you have a golden record: this is the truth about this customer or this supplier or this part in our organization. You produce that record, and now you want to manage it over time. As you go farther, you're going to want to push more of that out to the source systems themselves and to downstream applications and analytic engines. Essentially, you solve the problem of the coexistence of master data on one hand and all the operational and analytical datasets that exist everywhere in the organization on the other. The physical problem of linking those things together and keeping them consistent becomes a big challenge as you start to operationalize the master data.
And in different scenarios and use cases, different folks will focus on different parts of this journey through master data management. Maybe some projects only require getting that identifier: throw the data together, get the identifier, great, that's all we needed, we can run with it. Maybe you're just doing some analytics as a one-off. Every quarter, you produce a report: we refresh our data, we get this high level of integrity with our master data, we generate our report, we're good to go.
As you move farther along and want to actually take that data, operationalize it, and use it on an ongoing basis, keeping it fresh constantly, so that as new data comes into operational systems it's immediately mastered and incorporated with the master data, you have to go farther in this journey of pulling together and closely integrating your master data system with your operational database systems and other applications.
[00:11:39] Tobias Macey:
And the canonical example that's often brought to bear in this context is the customer record: this is our customer, and these are all of the attributes about them. Then there's the challenge of which system we actually trust the most to collect that information accurately, or the fact that different systems collect different pieces of information. And when you're dealing with people, they change locations, so you have to make sure you have the current address, but you also want to know their old addresses. So you have the issue of historizing that data, and this applies across other business objects beyond just your customers.
I'm wondering if you can talk to some of the people problems of figuring out what those decision points are, the ways we determine which place we trust the most for which pieces of that data, and then actually managing the merging of those attributes from multiple systems to be able to say: this is the value we trust the most; that other system over there has different information, so we're going to ignore it, or we need to use it in that system but use this one over here. I'm wondering about some of the ways organizations have to wrestle with that kind of constant decision making about what data to use where, when, and how.
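One common way to encode those "which system do we trust for which attribute" decisions is a survivorship policy: per-attribute source precedence, falling back to recency. A minimal sketch over hypothetical sources and fields (the policy, names, and data are assumptions for illustration, not Tamr's implementation):

```python
from datetime import date

# Hypothetical records for one customer from three systems, each tagged
# with its source and last-updated date.
records = [
    {"source": "crm",     "updated": date(2024, 5, 1),
     "email": "dana@example.com", "address": "12 Oak St"},
    {"source": "billing", "updated": date(2024, 6, 10),
     "email": None,               "address": "98 Elm Ave"},
    {"source": "support", "updated": date(2023, 1, 3),
     "email": "d.smith@example.com", "address": None},
]

# Per-attribute trust policy: which source wins for each field.
PRECEDENCE = {"email": ["crm", "support", "billing"],
              "address": ["billing", "crm", "support"]}

def golden_record(recs):
    golden = {}
    for field, order in PRECEDENCE.items():
        # Walk sources in trust order; take the first non-null value.
        # Old values could be kept as a history list to "historize" them.
        for src in order:
            vals = [r[field] for r in recs if r["source"] == src and r[field]]
            if vals:
                golden[field] = vals[0]
                break
    return golden

print(golden_record(records))
# -> {'email': 'dana@example.com', 'address': '98 Elm Ave'}
```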
[00:13:04] Daniel Bruckner:
Yeah. I think what you're picking up on is a really key observation about master data management as a problem space: it's not just a technical problem. If it were just a technical problem, putting data together and creating a coherent knowledge graph, we know how to do that. In real organizations, it's also a political problem. You're not just trying to get the data to agree. You're trying to get these different teams to agree and coexist, each with their own special view of the data. Because the reason the data silos were created in the first place was all of these teams operating independently and efficiently.
In pulling together those silos, you need to make sure you don't interfere with the independent, happy, trusted operation of everyone who created them. What it comes down to is solving the master data management problem less from a dictatorial "we will come up with the one standard that works for everybody" approach, and more by creating a repository for the linkage: a system and a common touchpoint for all of these different silos and applications to touch base and stay closely linked in a clean way. One of our early customers, a very large manufacturer, essentially told us when we started working with them: here's our history with master data management. We're a company with many lines of business and different divisions.
"We have 26 major ERP systems. We have more, but the long tail is too much to worry about. All of our parts and all of our suppliers exist across these 26 systems. We've had several efforts at master data management. What happens is we go in, we pick some of these systems, the largest, most popular ones we think are the most trustworthy. We collect the data, we consolidate it, we create this master, we have a new identifier. And at the end, no one wants to use it. Now we have 27 systems for all of our supplier data and parts data." So if you do the technical work, but don't do it in a way that meets the consumers of the data where they are, the project can be a failure and essentially make the problem worse.
So it's really critical to find the way to not just create a standard, but create a system that bridges the gap and maintains the connectivity between all these different consumers, and does it in a scalable way. If you take 3 of 20 systems and say, this is how we're consolidating, well, what about the 17 other teams? Their data's gone now. They have no frame of reference. You need an approach that can scale to handle the whole problem.
[00:16:17] Tobias Macey:
The other interesting piece of this is that business intelligence and data warehousing have existed in some fashion for at least the past 30 years, give or take. So you would think that, given that time span, this is a problem that would have been solved at least reasonably well by now. And yet even today it's still a challenge that organizations are tackling, starting new projects on today, tomorrow, next week. I'm wondering what are some of the evolutionary aspects of the problem that lead us to keep revisiting it and resolving it across organization after organization, rather than it being a well-established, well-understood, more or less solved problem?
[00:17:06] Daniel Bruckner:
Yeah. It's a good question. I'd say master data management is going on about three decades old now, so companies have been building systems to solve this problem for a while. The traditional systems tend to focus on using sets of rules and strict data models to put together data from source systems, and they tend to focus more on the operational side. You do some basic data quality and some basic data integration, but fundamentally you get your set of golden records and then, okay, put that in a database and go use it. They tend to focus on supporting applications downstream, but they don't necessarily do a great job of pulling in lots of data from across the organization, linking it together coherently, cleaning it up, and enriching it: making sure that the master data itself is actually the best view there is of the data, has the highest possible quality, and has all of this linkage across the organization.
What's happening currently is that the application of AI and machine learning to these problems unlocks much better solutions and the ability to tackle the problem much more holistically, and with much higher fidelity than has been possible traditionally.
[00:18:34] Tobias Macey:
So to that point of the application of ML and AI in this ecosystem: machine learning in various forms has been used with various levels of success in this context. You mentioned rules-based systems; that's maybe the expert systems era of AI, which we have largely moved past. And there have been a lot of different natural language processing techniques used for trying to do some of that entity extraction and entity resolution. I'm wondering if you can talk to some of the evolutionary aspects of the application of ML and AI to the problem of master data.
[00:19:12] Daniel Bruckner:
Yeah. Absolutely. You're exactly right to start with the rules, and I don't want to say that rules are the wrong way to solve this problem; they're actually very good for the right use case. But for context, the fundamental nature of dealing with dirty data is like the Tolstoy line: all happy families are alike, but every unhappy family is unhappy in its own way. It's true of data too. All bad data is bad in a different way, so you need a lot of tools in your toolkit. Traditionally, rules were the approach: if we just come up with a good data model, if we model the problem well enough, if we have a customer model and schema that's good enough, then we can put all customer data into that schema and define rules for how it should work.
The reality is that data can mean lots of subtly different things and be used in subtly different ways, and it pretty much always is. So you always have to be ready to account for slight differences in granularity or slight differences in shade of meaning. Beyond rules, fuzzy matching becomes big, and that leads into natural language processing techniques, especially techniques from information retrieval. Applying scalable methods from text search goes a very long way in dealing with fuzziness and solving fuzzy matching problems.
Beyond that, you get into statistical techniques and traditional machine learning: building models to classify matches between data, to classify groups and taxonomies of data, and to look for different characteristics of the data to perform reconciliation and consolidation. And once you've entered this statistics and machine learning world, the sky's the limit. Techniques from 30 years ago in information retrieval are great, but you can move all the way up through what we have today with large language models and generative AI, and apply that to the problem as well.
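The "traditional machine learning" stage usually means turning record pairs into feature vectors and training a match/non-match classifier. A hedged sketch with made-up training pairs, assuming scikit-learn is available (the features and data are toy choices, not any specific product's model):

```python
from difflib import SequenceMatcher
from sklearn.linear_model import LogisticRegression

def features(a, b):
    # Simple pairwise features; production systems add phonetic codes,
    # token IDF weights, parsed addresses, and more.
    name_sim = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    zip_match = 1.0 if a["zip"] == b["zip"] else 0.0
    return [name_sim, zip_match]

# Tiny, hand-labeled training set of (record_a, record_b, is_match).
pairs = [
    ({"name": "Jon Smith",  "zip": "02139"}, {"name": "John Smith", "zip": "02139"}, 1),
    ({"name": "Acme Corp",  "zip": "10001"}, {"name": "ACME Corporation", "zip": "10001"}, 1),
    ({"name": "Jon Smith",  "zip": "02139"}, {"name": "Jane Doe", "zip": "94103"}, 0),
    ({"name": "Globex Inc", "zip": "60601"}, {"name": "Initech", "zip": "60601"}, 0),
]
X = [features(a, b) for a, b, _ in pairs]
y = [label for _, _, label in pairs]

model = LogisticRegression().fit(X, y)

candidate = ({"name": "Jhon Smith", "zip": "02139"},
             {"name": "John Smith", "zip": "02139"})
print(model.predict_proba([features(*candidate)])[0][1])  # P(match)
```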
[00:21:56] Tobias Macey:
Large language models and generative AI have definitely eclipsed the overall landscape of ML in recent years, to the point where they have become synonymous with AI, even though that's not technically accurate. I'm curious whether you see those capabilities as a transformative shift in the space of master data management, record reconciliation, and entity extraction, or if it's largely an iterative step: maybe a large iteration, but not a wholly transformative piece, just a step-change improvement on what we already had.
[00:22:35] Daniel Bruckner:
Yeah. I think this is sort of a lame answer, but it's a little of both. There are key ways it's incremental: in how you match records, enrich data, classify data, or parse data, applying language models adds another really valuable tool to the toolkit. Well, actually, there are some scenarios where it does completely let you throw away a lot of techniques of the past. In schema mapping, for example, large language models without much training are very good at: I give you two tables, tell me how to align them, from a schema perspective.
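For the schema-mapping case, the whole "configuration" can be as small as a prompt. A sketch using the OpenAI Python client; the model name and the column lists are placeholders, and this is an illustration of the idea, not Tamr's implementation:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical column lists from two silos that need to be aligned.
source_a = ["cust_nm", "addr_ln1", "postal_cd", "phone"]
source_b = ["customer_name", "street_address", "zip", "telephone_number"]

prompt = (
    "Align the columns of table A to table B. Reply as JSON mapping "
    f"each A column to its best B match, or null.\nA: {source_a}\nB: {source_b}"
)
resp = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; any capable chat model works
    messages=[{"role": "user", "content": prompt}],
)
print(resp.choices[0].message.content)
```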
So for some problems at a small scale, it just blows older techniques out of the water. For larger scale problems, LLMs can give you a lot of subtlety that would be very difficult to achieve with traditional techniques. For example, language model embeddings are extremely good at capturing things like synonyms, synonymous meanings across different terms, and abbreviations, without having to build lookup tables and additional artifacts on the side. You get a lot of the richness of meaning in the language, of how language works, for free. Except it's not actually free: there's a cost in terms of compute. So it occupies an interesting spot in the trade-off space, where if you can figure out how to use it in a cost-effective way alongside other cheaper, more scalable techniques, you can get a tremendous amount of value.
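The "synonyms and abbreviations without lookup tables" point is easy to see with off-the-shelf embeddings. A sketch assuming the sentence-transformers package and one of its published models; the example strings are invented:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # small published model

pairs = [
    ("International Business Machines", "IBM"),
    ("St. Mary's Hospital", "Saint Marys Hosp"),
    ("Acme Corporation", "Initech LLC"),  # unrelated; should score lower
]
for a, b in pairs:
    va, vb = model.encode([a, b])
    # Cosine similarity in embedding space captures meaning that exact
    # string comparison misses entirely.
    print(f"{a!r} vs {b!r}: {util.cos_sim(va, vb).item():.2f}")
```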
Where it's really transformative is in creating more natural user experiences for actually working with the data and solving these problems for end users. One of the challenges we've wrestled with at Tamr, and gotten better and better at over our 12, 13 year existence, is taking complex data problems and complex machine learning concepts, encapsulating them in a simple user interface, and making them understandable to end users who are not PhDs in machine learning, AI, and statistics.
LLMs can actually come in and explain concepts and complex scenarios in straightforward ways. A very common situation for us: our system is doing some record matching and consolidation and presents the user with an ambiguous case. We've clustered data from 40 different systems; here are 40 different records describing what we think is the same actual person, one of your customers. What do you think, are these all in fact the same? Looking at a table with 40 records and a lot of columns is an overwhelming experience. It can be difficult for a human to parse that and decide whether they agree with the machine learning or not. Did it hallucinate this?
What a language model can do is give you ways to summarize what's on screen, these very complex concepts, and draw the user's attention: look, all 40 records here share the same last name. For the first name, there are only 3 different values, and this one is a common nickname, so it looks legit. There are a few different addresses, but it seems like maybe they moved over time. We can put that in context, and we can also include what we know about the model on the back end that produced this suggestion.
We can provide that context to a language model as well, and it can explain: the reason these got pulled together is the weightings. Say this is a B2C customer mastering data product; in this data product, the model looks for and puts a lot of weight on commonality in the name, the address, and the phone number. That gives users a way to engage with a hard, very niche problem, but in a way that's easy to understand and accessible.
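That review experience can be approximated by handing the clustered records and the model's top-weighted signals to an LLM and asking for a plain-language summary. A hedged sketch, with the same client assumptions as the earlier snippet and entirely hypothetical records and signals:

```python
import json
from openai import OpenAI

client = OpenAI()

# Hypothetical cluster: records the matching model grouped as one person.
cluster = [
    {"source": "crm",     "first": "Bill",    "last": "Evans", "city": "Boston"},
    {"source": "billing", "first": "William", "last": "Evans", "city": "Boston"},
    {"source": "support", "first": "Bill",    "last": "Evans", "city": "Cambridge"},
]
signals = {"top_weighted_fields": ["last", "phone", "address"]}

prompt = (
    "These records were clustered as one customer. Summarize, for a "
    "non-technical reviewer, the evidence for and against the merge:\n"
    + json.dumps({"records": cluster, "model_signals": signals}, indent=2)
)
resp = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder
    messages=[{"role": "user", "content": prompt}],
)
print(resp.choices[0].message.content)
```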
And why this is so important is trust. One of the interesting things since the launch of ChatGPT and the mainstreaming of AI and large language models is that it has brought this question of trust, can you trust the robots, into mainstream consciousness. We've been dealing with this problem since the beginning; we've always been a machine learning company, and everyone always loves to be smarter than the computer. So there's always a questioning of: I see the model is suggesting this; I want to make sure the model isn't doing something crazy.
With large language models now being mainstream, people know that these things hallucinate. They come up with nonsense if you push them a little too far. So there's this real questioning of: if you're using machine learning and AI techniques to make this data good, can I trust it? Is it real? Being able to take our results, the results of the modeling, and communicate them in a way that puts the data front and center but also provides the context is really important for end users to feel like, yeah, this is actually solving the problem. This is real.
This is data I can trust. This is the truth about our customers, and we should adopt it and share it. It's kind of ironic that the technology that's leading to this crisis of trust can also be a big part of the solution. But I think, as we move past the initial shock of artificial intelligence, that's what we're moving into.
[00:29:15] Tobias Macey:
As a listener of the Data Engineering Podcast, you clearly care about data and how it affects your organization and the world. For even more perspective on the ways that data impacts everything around us, don't miss Data Citizens Dialogues, the forward-thinking podcast brought to you by Collibra. You'll get further insights from industry leaders, innovators, and executives in the world's largest companies on the topics that are top of mind for everyone. In every episode of Data Citizens Dialogues, industry leaders unpack data's impact on the world, like in their episode "The Secret Sauce Behind McDonald's Data Strategy", which digs into how AI-driven tools can be used to support crew efficiency and customer interactions. In particular, I appreciate the ability to hear about the challenges that enterprise scale businesses are tackling in this fast-moving field.
The Data Citizens Dialogues podcast is bringing the data conversation to you, so start listening now. Follow Data Citizens Dialogues on Apple, Spotify, YouTube, or wherever you get your podcasts. In that space of large language models and their application to the problem of master data management, what are some of the pieces that are missing from the out-of-the-box perspective of: I have a large language model; now I actually have to stand up pieces x, y, and z before I can even really start to bring it to bear on the problem, particularly in that context of hallucination and trust building?
I'm wondering how you have been working through that challenge of harnessing the capabilities while mitigating the inherent risks, in the problem of actually building trustworthy data with an inherently unpredictable utility.
[00:31:04] Daniel Bruckner:
Yeah. Let's see, I'm going to go back a little to start. When we got started at Tamr, our initial vision and our initial goal, this was in the days when the semantic web was really hot and knowledge graphs were becoming a big deal, was: we want a product that can efficiently produce the enterprise knowledge graph. So within your organization, you have this extremely high quality linked data describing everything you do and everything you care about, all the key entities: your customers, your suppliers, your employees, all the parts and products you produce, this whole space.
And in the context where we are today, with these large-scale models, that idea of a correct knowledge graph is still the guiding principle. It's really this idea of truth for the enterprise. To get real value out of large-scale artificial intelligence, you need to find ways and architectures to tie it back to that truth. It's very good at syntax and articulating ideas; we just need to do a good job of giving it the right content and the right context, depending on the problem being solved.
So there's an increased importance of data quality, data linkage, and master data management, to be able to produce these common datasets and maintain them, so that you can point your GPT at them and get really good, high quality, trusted answers from the AI.
[00:33:03] Tobias Macey:
I imagine the inclination for people who are thinking about bringing AI to bear on this problem is: AI is very sophisticated, it has all of these nifty capabilities, so I should be able to just set it loose on my data and it'll solve all of my problems without my having to worry about it, maybe just click yes or no a couple of times. Versus what I imagine to be the reality: you actually want to use all of those manual and statistical techniques that we have been relying on and developing for the past 30 years to handle maybe the 80% case, and use the LLM for the 20% case that takes 80% of the time, to accelerate the process a bit. I'm curious how you are guiding organizations on that strategic aspect of how, where, when, and why to actually bring these language models to bear on the problem, in conjunction with all of the other techniques that have been developed and in which we have established trust and confidence.
[00:34:11] Daniel Bruckner:
Yeah. That question has become central for everyone these days: am I comfortable using generative AI with my data? The previous version of this was: am I comfortable putting our most business-critical, high-value data assets on the cloud? That's now shifted. Most organizations are comfortable with the cloud, but now it's: can machine learning look at it? What if, God forbid, someone trains a model on our data and shares that model?
I think there's a certain amount of just feeling out the right level of security around these things, and I don't want to go into that too deeply. But for the purposes of solving problems in this space, there are big opportunities to improve the quality of matching and mastering using these new models. They just need to be harnessed. There's been a lot of research over the last few years applying large language models to these sorts of fuzzy database problems: fuzzy joins, group-bys, schema mapping. What it basically finds is that large language models, with very little configuration and engineering, perform as well as a lot of the state-of-the-art techniques that existed previously.
The challenge is putting those techniques into a system and a product where they're used intelligently in conjunction with other, lower cost techniques from more traditional machine learning, and also in conjunction with rules and human feedback. One of our founding principles at Tamr is that the machine is never right 100% of the time. You need humans in the loop to review, to address complex cases, and to assess how well things are going. So you need a system that incorporates all of these pillars: human input, rules-based input, and the models themselves.
And users love rules. If you just say a match means the Social Security number values are equal, everyone loves that. They understand what it means. Given the truth about data quality and how things exist in the real world, that rule might actually be wrong a large proportion of the time in a particular real case. But people love that when it's wrong, they get why, as opposed to some machine learning model being wrong, and when they want to know why, it's: well, let me show you this random forest decision tree and talk about that.
[00:37:14] Tobias Macey:
Let me explain carefully that. Yeah.
[00:37:18] Daniel Bruckner:
So you need all of these things, and they all come together to create a coherent solution that doesn't have an absolutely overwhelming cost. One thing we haven't talked much about is that if you want to solve these problems at scale, it can become very expensive very quickly. By their nature, these are all matching problems. They're quadratic: naively, you would be comparing everything to everything else. So if you take a naive approach, you're going to burn a lot of compute, spend a lot of money, and probably not get the best results. You need a way to identify the easy parts of the problem and solve them easily.
The unsolvable parts of the problem, send them to a human. And then there's everything in between, and that's where we're seeing the biggest boosts from vector embeddings, large language models, and newer cutting-edge techniques. They can dig into those ambiguous cases and get a lot of value.
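The standard answer to that quadratic blow-up is blocking: only records that share a cheap key (zip code, name prefix, a phonetic code) are ever compared by the expensive matcher. A minimal sketch with invented records and an assumed blocking key:

```python
from collections import defaultdict
from itertools import combinations

records = [
    {"id": 1, "name": "John Smith", "zip": "02139"},
    {"id": 2, "name": "Jon Smith",  "zip": "02139"},
    {"id": 3, "name": "Jane Doe",   "zip": "94103"},
    {"id": 4, "name": "J. Doe",     "zip": "94103"},
    {"id": 5, "name": "Acme Corp",  "zip": "10001"},
]

def blocking_key(r):
    # Cheap, deterministic key: zip plus first letter of the name.
    return (r["zip"], r["name"][0].upper())

blocks = defaultdict(list)
for r in records:
    blocks[blocking_key(r)].append(r)

# Only compare within blocks: 2 candidate pairs here instead of all 10.
candidates = [pair for block in blocks.values()
              for pair in combinations(block, 2)]
for a, b in candidates:
    print(a["id"], b["id"])  # send these to the expensive matcher / LLM
```

Real systems use several overlapping blocking schemes so that a typo in one key doesn't hide a true match.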
[00:38:26] Tobias Macey:
Another complexity of this space, particularly when you're first embarking on the process of trying to reconcile all of your organizational data, is the level of expertise in the process of master data management, as well as the level of familiarity with the data itself. The person who created the data, figured out what the schema should be, and decided what attributes to pick may not even be with the company anymore, so you don't have all of the context or all of the information. And especially when you have an inexperienced team that's just starting on this process, you say: hey, here, rub some machine learning on it, it's magical.
I'm wondering what potential pitfalls you're setting them up for if they don't have an appropriate understanding of the actual capabilities and limitations of the techniques, so that they can be appropriately skeptical or appropriately confident where each applies.
[00:39:31] Daniel Bruckner:
Yeah. That's touching back on the political problem of master data management. You have to get a lot of different kinds of people involved. The domain experts, the subject matter experts, are very rarely the people who own the project. They have to be drafted in and convinced to share their time to really make the project successful. So yes, there does tend to be skepticism of what the system is doing, and if it's an AI-based system, the skepticism is increased. They'll look at the data and say: this doesn't make sense. Why is it doing it this way? So you need a workflow that embraces that uncertainty to an extent, or makes users comfortable with the fact that the data is bad and we're not going to fix it all at once. It's going to be a process.
Something interesting we learned over the years: one of our earlier products was very explicitly designed as a system for end users to train models to master data. There were workflows for end users to come in, and the system used active learning techniques to surface really high value examples. Users go in, they label the examples, and the model gets trained as quickly as possible. It made this ML-practitioner experience available to non-ML experts.
And that system, as easy as we made it, was still hard to use, and you still had to understand machine learning to some degree. For subject matter experts that's a challenge, and there's a lot of hand-holding. So we moved toward pretrained models: we have general models that apply to your domain and start at a very high level, and then you tune them. And what we found when we first did that, having dropped the active feedback workflow, was that it actually damages trust with the end users. They want to give feedback to the machine. They want to have that back and forth, that conversation.
That's really important for gaining trust in the system: the system directing how you should be exploring the data, how you should be interacting with it, and how you should be understanding it. So we've brought that back. Even though you're not actually training the model now, you can still go in and interact with the system as if you were training it, and it's a positive user experience.
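The active-learning loop Dan describes often reduces to uncertainty sampling: surface the pairs the model is least sure about, because a human label there teaches it the most. A sketch, assuming the toy `features()` function and fitted `model` from the earlier classifier snippet (the review-UI call is hypothetical):

```python
def most_uncertain(model, candidate_pairs, k=5):
    """Pick the k candidate pairs whose match probability is closest
    to 0.5, i.e. where a human label is most informative."""
    scored = []
    for a, b in candidate_pairs:
        p = model.predict_proba([features(a, b)])[0][1]
        scored.append((abs(p - 0.5), (a, b, p)))
    scored.sort(key=lambda t: t[0])
    return [pair for _, pair in scored[:k]]

# Loop sketch: label the uncertain pairs, retrain, repeat.
# for a, b, p in most_uncertain(model, candidates):
#     label = ask_human(a, b)   # hypothetical review-UI call
#     pairs.append((a, b, label))
# ...then refit the model on the grown training set.
```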
[00:42:26] Tobias Macey:
And in your experience of building Tamr, working with organizations to address this challenge of master data management, and incorporating ML and the newer generation of generative AI capabilities into that system, what are some of the most interesting or innovative or unexpected ways that you've seen ML and AI techniques used in that context of data resolution and master data management?
[00:42:56] Daniel Bruckner:
Yeah. It's a good question. Our architecture is fairly generic. We primarily work with structured enterprise data and relational database systems, but we can extend the system to work with more complex data types. A number of years ago, we added support for geodata, GeoJSON, essentially polygons. And the applications users have for that are really quite interesting and surprising. Normally, you think of master data management as applying to a fairly narrow subset of data: customer data, organizations, people, parts, products.
That's the sweet spot. But with some of these other data formats we support, we've seen it applied to things like radar tracks, keeping track of fuzzy data related to planes and other kinds of aerial phenomena. It really gets out there. It's pretty cool to see.
[00:44:13] Tobias Macey:
In your work of building these systems and coming to grips with the constantly evolving landscape of ML and AI techniques, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process of building a product that harnesses them?
[00:44:31] Daniel Bruckner:
A big challenge that we have frequently thought we'd solved, but where we keep finding better solutions every couple of years as each solution reveals another side of the problem, is connecting the two modes of solving these sorts of problems. We talked earlier about taking some data as a snapshot, as a one-off project: you run some large-scale batch computation, you put it all together, and then you're done. But the data is not static. What happens when customers x, y, and z all come in and someone moved, their address changed? Maybe some customers passed away and you no longer want to be sending them offers in the mail. Maybe two companies merged, or split up.
Things happen in the real world, and you need to manage this on an ongoing basis. Most applications in the real world aren't content with a one-off solution; they need the problem solved in a live, ongoing, updated-in-real-time fashion. And we found it's a major challenge marrying the extremely efficient, high-throughput, batch-oriented solutions to these problems with a more operational, live database system that solves it on an ongoing basis, in real time or in a streaming, event-driven fashion.
It means taking the core of the system, all the techniques we've been talking about, rules-based matching, lessons from natural language processing and information retrieval, and the latest AI has to offer, and applying them in consistent ways across two extremely different architectures, marrying those together, and doing it in a way that end users can actually use, transitioning from one to the other. So you can come into the system, load a whole bunch of data, process it in a big way, create that initial starting point for what your master data should look like, and then, boom, you're up and running. It's in a live database.
Now you can interact with it directly. You can start to point other systems at it and use it in a way where it's not a one-off. It's not some extract that's going to be irrelevant tomorrow that no one adopts. It becomes a living piece of the overall architecture within the organization.
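One way to picture the two harnesses sharing one core: a bulk job builds the initial golden records, then an event-driven path scores each arriving record against them with the same logic. A toy sketch of the incremental half (the similarity function, threshold, and records are assumptions for illustration):

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Golden records produced by the initial batch run (hypothetical).
golden = {"G1": {"name": "John Smith"}, "G2": {"name": "Acme Corporation"}}
next_id = 3

def on_new_record(rec, threshold=0.75):
    """Event-driven path: attach the incoming record to an existing
    cluster, or mint a new golden record. Using the same scoring as the
    batch job keeps the two modes consistent."""
    global next_id
    best_id, best_score = None, 0.0
    for gid, g in golden.items():
        s = similarity(rec["name"], g["name"])
        if s > best_score:
            best_id, best_score = gid, s
    if best_score >= threshold:
        return best_id                   # merge into the existing cluster
    gid = f"G{next_id}"; next_id += 1
    golden[gid] = rec                    # new entity
    return gid

print(on_new_record({"name": "Jon Smith"}))     # likely attaches to G1
print(on_new_record({"name": "Globex, Inc."}))  # mints a new golden record
```

At scale, the loop over all golden records would be replaced by a blocking index lookup, as in the earlier sketch.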
[00:47:31] Tobias Macey:
And for people who are addressing these challenges of master data management and data reconciliation, trying to figure out the cross-cutting view of their business, what are the cases where ML and AI are the wrong choice for some or all of that problem?
[00:47:51] Daniel Bruckner:
Yeah. Good question. I think it comes down to the simplicity of the problem. AI and ML are bright and shiny, but for smaller scale problems, maybe you're just putting together a couple of sources, or you have a one-off, you're trying to create a one-time presentation. If you don't have a lot of complexity in the problem, then deterministic techniques are likely to win. There's nothing wrong with rules, and applying a rule can be much cheaper than applying AI and ML to the problem.
So if you have a low-stakes scenario where you can just 80/20 it and get a good answer quickly, then, yeah, go crazy with deterministic solutions. That should really be the first step of any approach: pick the low-hanging fruit, then get into the hard problem and make it really good.
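That "low-hanging fruit first" ordering can be made literal as a cascade: try the cheap deterministic rule, then a fuzzy score, and only escalate the ambiguous middle to expensive models or humans. A minimal sketch with hypothetical records and thresholds:

```python
from difflib import SequenceMatcher

def decide(a, b):
    # Tier 1: deterministic rule -- cheap, explainable, widely trusted.
    if a.get("ssn") and a.get("ssn") == b.get("ssn"):
        return "match (rule: equal SSN)"
    # Tier 2: fuzzy score on names -- still cheap, handles typos.
    s = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    if s >= 0.9:
        return f"match (fuzzy {s:.2f})"
    if s <= 0.3:
        return f"non-match (fuzzy {s:.2f})"
    # Tier 3: the ambiguous middle -- escalate to an ML model, an LLM,
    # or a human reviewer.
    return "escalate"

print(decide({"ssn": "123-45-6789", "name": "J Smith"},
             {"ssn": "123-45-6789", "name": "John Smith"}))  # rule match
print(decide({"ssn": None, "name": "Jon Smith"},
             {"ssn": None, "name": "John Smith"}))           # fuzzy match
```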
[00:48:57] Tobias Macey:
And as you continue to invest in and keep tabs on this evolving space of large language models and generative AI and its application to the challenge of data cleaning, what are some of the hopes or predictions that you have, or any specific techniques or areas of effort that you're keeping a close watch on?
[00:49:22] Daniel Bruckner:
Yeah. I think the ability to build intelligent agents into existing user workflows is creating a big opportunity. The first wave of this AI rollout was "put a chatbot in it": you've got a product, put a chatbot in it, it's going to be amazing. And there are some good applications for that. But what's really coming next is looking at the problems where LLMs are extremely well suited, and then applying those to actual key features within a product and deploying that. Really starting to think of it as a capability we can productionize.
How do we think about our product roadmap, where we build that, how we use it, and how we adopt it? I think the upshot is that a lot of the challenging work, not in solving the master data management problem itself, but in managing the system and its complexity, can be automated to a much larger extent. There's a lot of configuration that goes into pulling data from different systems, aligning all the schemas, figuring out how you want to enrich, what data quality transformations you want to apply, and how you want to pull in third-party source data: creating that model of what you want your master data to look like, starting from what all your source data looks like.
There are big opportunities for LLMs to go in and simplify that, turning it into a very straightforward, basic, wizard-like experience for setting up this extremely complex machine that goes and processes all this data in complex ways, and then for managing it over time. Putting agents into the system can take the hardest parts of the user experience and either automate them away or turn them into a delight for end users. So we're focused a lot on simplifying that experience and making master data management something that isn't this scary thing that sounds doomed to fail and will be very expensive, but more like: no, this is something you need. If you're not doing this, you're crazy. All your data could be 10 times better, and you won't be tearing your hair out to get there.
[00:52:00] Tobias Macey:
I think that point, too, of figuring out what that common cohesive schema is, what representation is going to be useful and applicable and easy to integrate, is one of the challenges as well, and maybe the LLMs can help set that initial pass of "here is something it could look like." Because at either end of the spectrum, you have people who are unable to see the art of the possible because it looks too daunting, or people at the other end who ask for the impossible because they think it's easy.
[00:52:35] Daniel Bruckner:
Yeah. Absolutely. LLMs are really good at translating. You can speak different languages, and they can act as an intermediary, and it just works somehow. So there's a vision for a future here: what if you did master data management and there wasn't even a single master data model? What if everyone got to keep the model they wanted from the beginning, and there's an LLM in the middle intelligently translating across these things? Everyone thinks they're speaking the same language, but it's really a Tower of Babel situation.
That's the promise here, and I think it's a big opportunity. There's still a lot of challenging engineering and product development to get there, but that's where we're headed.
[00:53:25] Tobias Macey:
Are there any other aspects of this overall space of master data management and the application of ML and AI to its execution and implementation that we didn't discuss yet that you'd like to cover before we close out the show?
[00:53:39] Daniel Bruckner:
I feel like I had something more to say about third-party data, but, honestly, I think we might be good.
[00:53:46] Tobias Macey:
Fair enough. Yeah. Well, for anybody who wants to get in touch with you and follow along with the work that you and your team are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:54:04] Daniel Bruckner:
It feels like it comes back to a location challenge somehow. Maybe I'm just thinking about this because it's the problem I've been dealing with lately, but I feel like we haven't really solved the cross-cloud problem. There are really good systems on different clouds, and they don't translate one to one. So there's a lot of essential technology that's locked up in different proprietary walled gardens. It's now very easy to build extremely powerful, cutting-edge data architectures for managing your data.
But you have to make some pretty big decisions at the outset, and some pretty big bets on vendors and who you trust in the market. It's gotten a lot harder to remain independent. On the other hand, it's also easier to remain independent: there are a lot of amazing tools breaking up the relational database into its component parts and using independent systems to put it back together, and at the same time, a lot of these amazing tools live in the open source world.
But it's difficult for those worlds to collide and to put it all together into a coherent approach. And I feel like there's a little too much satisfaction with folks thinking that if you put all the data into a single physical place, all of your problems are solved, when really you're just kicking a bunch of problems down the road for 10 years, until you get sick of your vendor and need to go do something dramatically different.
[00:55:57] Tobias Macey:
Now, the data gravity problem is definitely real, and until we're able to circumvent physics, it won't go away. Alright. Well, thank you very much for taking the time today to join me and share your thoughts and experience on building these master data management workflows, bringing ML and AI to bear, and some of the ways that the current generation of LLMs and generative AI are adding new capabilities and techniques to that process. I appreciate all the time and energy that you and your team are putting into making that more accessible and easier to apply to this challenge, and I hope you enjoy the rest of your day.
[00:56:38] Daniel Bruckner:
Thank you. Yeah. Thanks so much for having me. This has been fantastic.
[00:56:50] Tobias Macey:
Thank you for listening, and don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering Podcast is your guide to the fast-moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Overview
Guest Introduction: Dan Bruckner
Challenges in Data Management
Master Data Management Approaches
Scaling Challenges in MDM
Evolution of MDM and AI
Impact of Large Language Models
Trust and AI in Data Management
Human Interaction and Machine Learning
Operational Challenges in MDM
Future of AI in Data Management