Summary
Data governance is a complex endeavor, and scaling it to meet the needs of a complex or globally distributed organization requires a well-considered and coherent strategy. In this episode Tim Ward describes an architecture that he has used successfully with multiple organizations to scale compliance. By treating it as a graph problem, where each hub in the network has localized control and inherits higher-level controls, the approach reduces overhead and provides greater flexibility. Tim provides useful examples for understanding how to adopt this approach in your own organization, including some technology recommendations for making it maintainable and scalable. If you are struggling to scale data quality controls and governance requirements, then this interview will provide some useful ideas to incorporate into your roadmap.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
- Your host is Tobias Macey and today I’m interviewing Tim Ward about using an architectural pattern called data hub that allows for scaling data management across global businesses
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by giving an overview of the goals of a data hub architecture?
- What are the elements of a data hub architecture and how do they contribute to the overall goals?
- What are some of the patterns or reference architectures that you drew on to develop this approach?
- What are some signs that an organization should implement a data hub architecture?
- What is the migration path for an organization who has an existing data platform but needs to scale their governance and localize storage and access?
- What are the features or attributes of an individual hub that allow for them to be interconnected?
- What is the interface presented between hubs to allow for accessing information across these localized repositories?
- What is the process for adding a new hub and making it discoverable across the organization?
- How is discoverability of data managed within and between hubs?
- If someone wishes to access information between hubs or across several of them, how do you prevent data proliferation?
- If data is copied between hubs, how are record updates accounted for to ensure that they are replicated to the hubs that hold a copy of that entity?
- How are access controls and data masking managed to ensure that various compliance regimes are honored?
- In addition to compliance issues, another challenge of distributed data repositories is the question of latency. How do you mitigate the performance impacts of querying across multiple hubs?
- Given that different hubs can have differing rules for quality, cleanliness, or structure of a given record how do you handle transformations of data as it traverses different hubs?
- How do you address issues of data loss or corruption within those transformations?
- How is the topology of a hub infrastructure arranged and how does that impact questions of data loss through multiple zone transformations, latency, etc.?
- How do you manage tracking and reporting of data lineage within and across hubs?
- For an organization that is interested in implementing their own instance of a data hub architecture, what are the necessary components of an individual hub?
- What are some of the considerations and useful technologies that would assist in creating and connecting hubs?
- Should the hubs be implemented in a homogeneous fashion, or is there room for heterogeneity in their infrastructure as long as they expose the appropriate interface?
- When is a data hub architecture the wrong approach?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- CluedIn
- Futurama
- Kubernetes
- Zookeeper
- Data Governance
- Data Lineage
- Data Sovereignty
- Graph Database
- Helm Chart
- Application Container
- Docker Compose
- LinkedIn DataHub
- Udemy
- PluralSight
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, a 40 gigabit public network, fast object storage, and a brand new managed Kubernetes platform, you've got everything you need to run a fast, reliable, and bulletproof data platform. And for your machine learning workloads, they've got dedicated CPU and GPU instances.
Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. You listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Corinium Global Intelligence, ODSC, and Data Council.
Upcoming events include Strata Data in San Jose and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events and to take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey, and today I'm interviewing Tim Ward about using an architectural pattern called data hubs that allows for scaling data management across global businesses. So, Tim, can you start by introducing yourself?
[00:01:36] Unknown:
Sure. My name is Tim Ward. I have had the pleasure of being on this podcast before, but my background is mainly in software engineering. For the last six years, I've been focusing more on data engineering. So combined, that's about fourteen years of working in the software space. I am based out of Copenhagen, Denmark. I have my wife here, I have a little boy called Finn, and I have a dog called Seymour that looks like Chewbacca. So for those fans of Futurama, you'll know where the reference for that dog comes from; it looks exactly like Seymour from the Futurama episodes. So that's me.
[00:02:17] Unknown:
And so you mentioned that you've only recently been involved in data management. So I'm wondering if you can just give a bit of a background in terms of how you got involved in that space. Yeah. So at a previous job, I was working for quite a large vendor.
[00:02:31] Unknown:
It wasn't in the data space, but it turned out that the last project I worked on there was, funnily enough, about how we could integrate our customers' third-party systems into the particular platform that we were building. And it turned out that the majority of the team I was working with were all tasked with a very similar project, but with different customers. We went through that rabbit hole of, okay, it looks like we need to buy a couple of different pieces here. We need a data integration piece. Okay, I've heard from one of my other colleagues that they're investing in a data governance platform; that makes sense for my customer as well. And after you go down that rabbit hole of the umbrella of data management, you've got, you know, potentially ten different products that you've got to stitch together.
And what got me into data management was that, really, every single one of these projects, by the way, didn't go so well, and where they failed was actually stitching these different products together. In some cases they were, you know, from different vendors. Not with my particular customer, but some of my colleagues were also having the same struggles even with the same vendor's products, just following, or at least identifying and going after, the different pillars of data management. So I think what got me into this was that idea of: these all didn't go well, maybe there's something to this. Maybe there's a need to investigate more modern techniques for data management, because it did seem like this was a common request from most enterprise-sized businesses anyway. And so as I mentioned at the beginning, we're talking about some of the work that you've done in terms of
[00:04:31] Unknown:
using this data hub architecture. And I'm wondering if you can just start by giving a bit of an overview of the goals that the architectural pattern is intending to achieve.
[00:04:40] Unknown:
Yeah, sure. I mean, I think I'll start with the fact that there are multiple different interpretations of what a data hub is. If you go to, for example, some of the analyst firms in our space, the way that they've interpreted it is with an analogy to something like the subways in London, where, you know, you've got stations that take people from one side of London to the other. There are stops along the way where people can get off, but essentially there's this whole spaghetti of hubs going around London to take people from one place to the other. And really, what those analyst firms are saying is, well, you can apply the same thing to data, where instead of transporting people, you're essentially taking data from a source system and making it available in some type of target system. For example, your BI platform of choice, maybe your data science platform of choice, you name the tool. But typically, wherever you want to do something with the data once it's prepared.
Now our interpretation of a data hub is still similar, but I would say it has its differences. So really, the goals of the data hub architecture came from a simple question that we had from one of our customers, which was: we are a globally distributed business. We have a headquarters, we have regional offices, we have localized offices. And this is probably a similar case with a majority of enterprise businesses, especially those that are globally distributed. And the simple question was: I need to put in a data governance strategy.
And how on earth am I going to manage, from a top-down approach, all the possible permutations of policies and rules and regulations for localized or regional offices, and also take care of the headquarters as well? And as you're probably aware as well, Tobias, what we typically do in the engineering field when we have a big problem like that, the first thing we look to is: how do we break this down into smaller challenges? And when you do that, you've got to have some way to instrument it, orchestrate it, and bring it together once those components have been broken down into smaller chunks. So in fact, our interpretation, the way that we've been working with the data hub architecture with our customers and with our platform, has really been focused on: how on earth, even for one of the pillars of data management like data governance, are you going to manage this from a top-down approach, with what could be localized data policies on what data they are allowed to have, and localized retention policies on how long they can have it for? And then you start to add in the other pillars of data management, like data quality, where the quality levels or metrics or KPIs for one region might be different from another's. And this is what forced us to explore a new way, a new architecture, to support this complexity of globalized rules.
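To make those per-hub rules concrete, here is a minimal sketch of what a localized policy might look like, with each office declaring only its own rules and pointing at a parent hub. The class and field names are hypothetical illustrations, not CluedIn's actual data model:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class HubPolicy:
    """One hub's localized rules; higher-level rules come from its parent."""
    hub_id: str
    parent: Optional[str] = None        # e.g. a regional or global hub
    retention_days: int = 365           # how long this hub may keep records
    allowed_data_classes: set = field(default_factory=set)
    min_completeness: float = 0.0       # localized data quality KPI

# Each office declares only what it is responsible for.
global_hq = HubPolicy("global", retention_days=730,
                      allowed_data_classes={"company", "contact"})
emea = HubPolicy("emea", parent="global", retention_days=365,
                 allowed_data_classes={"company"}, min_completeness=0.9)
```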
[00:07:59] Unknown:
And so in terms of the actual architecture itself, what are some of the core elements and abstractions that are incorporated into it to allow for this global distribution of data with localized control? And what are some of the other patterns or reference architectures that you drew on in order to,
[00:08:21] Unknown:
arrive at this approach? Yeah. So, I mean, our team has been working with many different types of technologies. One of the ones that we've been focusing on, and use quite heavily within our platform, is a graph database. So there's really this concept of a network. And I think we've been spoiled by working with it because, essentially, it offers us the highest-fidelity data structure that we have to be able to map data. And playing with that gave us some inspiration on what the elements of this hub architecture are. And so the first obvious one would be your hubs.
So in a graph world, these would be nodes. Those hubs are essentially responsible for managing, as you put it, the localized rules. So they're not interested in knowing what global thinks about this, what that other local hub thinks about this, or what my parent hubs, like a region hub, think. No, they want to manage the specific rules for their location. And so when you have these hubs, you obviously need some way for hubs to talk to each other, and that's where this concept, this element of a hub orchestrator, comes in. And it's actually not too hard to see this analogy played out in other areas. A good example would be that we use Kubernetes to orchestrate our containers.
In the Apache world, or the open source world, we often use ZooKeeper to orchestrate our different data stores, or to keep things in some type of sync between different environments. So there are many other places where this type of architecture pattern is, in fact, already available to us. So, obviously, if you've got these hubs, the hub orchestrator is responsible for: okay, well, how do I talk from one hub to another? And there's a really good analogy I like to use here, which is travel. Now, when I travel from Copenhagen to the US, there is a border set up which essentially says: for you to enter, you need to do these things. Well, first of all, you need a valid passport. Okay?
You potentially need a visa. You potentially need to have a work visa, for example, depending on what you're doing. And you're not allowed to be on any of these lists, such as any blacklists or embargo lists that we have in our country. And then if you check all those boxes, feel free to come in. And you can really start to apply that same kind of concept, but replace people with data. So how does data travel from one hub to another? Well, the localized hub sets up rules, such as: if I'm going to talk with another hub, I need to make sure that the data completeness, the completeness of records, is at a certain level. Maybe, just for an example, say 90%.
So if you want to share your records with my hub, you need to make sure your records are 90% complete. And if they're not, the hub orchestrator is responsible for saying: I'll give you tools to elevate that level of completeness. Whether that's plugging in enrichment services; whether, for example, if we're missing addresses, we might plug in something like the Google Places API; or if we're missing company data, we might plug in Dun & Bradstreet or OpenCorporates or, you know, name the local business registries that exist in a majority of countries. But I'm going to give you that tooling so that you can enter. And so that type of pattern was actually a huge inspiration and reference: an architecture that is already working in another part of the world. Now, whether you'd like to think it works or not, that's completely up to you, and I'm sure many people will complain about parts of it; I'm not speaking from experience, but essentially the concept works.
And the great thing is, often when you bring in concepts from the real world and then say, okay, let's put it into a machine to handle this, a lot of the issues that you find with, for example, scaling humans to solve a problem can actually be removed when we move it into a much more technical product, a technical architecture. So those are some of the patterns that we drew upon to develop this approach, but, I mean, let's be honest, I've already mentioned two examples where orchestration is common and that pretty much most enterprises are using: Kubernetes to orchestrate deployment of containers, along with scalability and health checks and liveness checks.
And then you've got tools like ZooKeeper that have been around for years and years to orchestrate, you know, syncing of environments, syncing of dictionaries and configurations. So it's not a far stretch to actually see how this pattern really came together.
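As a rough illustration of that border-control idea, a completeness gate at a hub's boundary might look something like the following sketch. The function names and record shape are made up for illustration; this isn't an actual CluedIn or hub-orchestrator API:

```python
def completeness(record: dict) -> float:
    """Fraction of a record's fields that are actually populated."""
    if not record:
        return 0.0
    filled = sum(1 for value in record.values() if value not in (None, ""))
    return filled / len(record)

def admit(records: list, threshold: float = 0.9):
    """Border control: records below the threshold are turned away
    and sent back for enrichment before they may enter the hub."""
    admitted = [r for r in records if completeness(r) >= threshold]
    rejected = [r for r in records if completeness(r) < threshold]
    return admitted, rejected

records = [
    {"name": "Acme Ltd", "address": "London", "phone": None},       # 2/3 complete
    {"name": "Initech", "address": "Copenhagen", "phone": "+45 ..."},
]
ok, needs_enrichment = admit(records)
```

A rejected record would then be routed through whatever enrichment services the orchestrator offers, such as an address lookup, before trying the border again.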
[00:13:57] Unknown:
And in terms of determining when it's worth going down this path of establishing this architecture, with these various hubs, the orchestration between them, and the rules that determine what data is allowed to egress and in what fashion: what are some of the signs that an organization should actually go down that road and pay the complexity cost of implementing it and adding all these different constraints to their different localized data regimes?
[00:14:28] Unknown:
Yeah, it's a really good way to put it because, you know, one of the side effects, or I guess potentially a disadvantage, of an architecture like this is: are you over-engineering for the situation? One of the big signs immediately would be: are you even a globally distributed business? You don't have to set up your hubs in a geographical architecture; it can break down by different components, but essentially that's the commonality that we typically see with our customers. So the first thing you would need to ask yourself is: are you a globally distributed business? The reason why I use geography as an example is that often, when we're talking about banks or insurance companies, they're regulated differently in different countries and locations, and therefore the rules around what data they can have and how long they can retain it for are a good common case for setting up this type of architecture.
So the next thing you've got to ask yourself is: are you actually operationally a complex business? If not, this, I would argue, is overkill, but that doesn't mean we should necessarily shut off the option of growing your business. I mean, I guess that's really the goal of a lot of businesses: I want to expand into other locations, and typically that comes with complexity. So I want an architecture pattern where I don't need to subscribe to the complexity up front, but I also don't want to shut off that path. And the data hub architecture, I would argue, has this designed into it. In fact, you can imagine that globally distributed businesses are often growing into new locations quite often, whether through mergers and acquisitions or just natural growth of their business. And therefore, the data hub architecture not only is designed to help with this, but what we're not wanting is: if we've got 300 of these hubs, as soon as we add another hub, we don't want to say, that's a whole new set of pairwise links that I have to sync up on rules, and you can imagine the complexity of how those rules overlap. Do they contradict each other? That's just never going to work. So the data hub architecture instinctively avoids that, and I think it's the use of the graph and network model that helped with this. You know, you can start small. You can start with one hub, the global headquarters. And as you grow out, you might say, okay, well, we just have an office in London.
I don't want to put in the infrastructure cost of setting up new technology in a new data center or a new region in our cloud provider; let's just shove that all within the headquarters. Now, of course, what that means is that once you do move to that hub architecture, the complexity would be splitting that up, and that can be complex. So really, I guess, the sign that you would need to look at is growth. Do you see it in the foreseeable future? And if so, maybe start out with the concept of not closing yourself off from an architecture like this.
[00:17:54] Unknown:
And for an organization that already has an existing data platform, what are some of the migration strategies that they might look to for being able to move into a data hub architecture and open up their data to be integrated or distributed across these localized hubs? And then also, one of the other possible uses for a data hub architecture that I can envision is if there's some sort of a data collective and being able to share data across organizations
[00:18:25] Unknown:
or being able to open up certain datasets for public access? Yeah, good question. So, I mean, this question reminds me of another thing that we've done in software engineering over the last few years, which is moving off the monolith to microservices. And what did we do? What did every business do? Essentially, we said, okay, let's break up the monolith into components, but then I need to put something in between as a kind of mesh. One of the examples that was very common was an enterprise service bus: putting in an enterprise service bus where we pass around discrete, small little messages so that these different components can talk to each other in a little bit of a decoupled way. And so, really, this is kind of similar to what the migration path looks like for moving off a classic data platform into a little bit more of a distributed governance environment.
And I'll just reiterate, governance is just one of those pillars, right? And I think the reason why I'm choosing it is because it's usually one of the first things that people want to do: put governance strategies in place in their business. And you can also just imagine that, okay, this can get really complex as I add the extra pillars of data management in. But essentially, the experience with the migration, as I alluded to before, is a gradual move. It's really saying, okay, how are we going to start splitting up this monolith? And typically, the way we did it in software engineering was to say: let's identify the main components of the platform.
So we would split logging out to its own service. We would split maybe jobs out to their own service. We would split something like data access potentially out to its own service. Now, other architectures in microservices would say, no, no, no, split the service out to have all of its components sitting behind it. So logging will have a database, it'll have a data layer, because that's all it's responsible for. Now, whichever way you did it, it's still going to help in this example, but really the approach that we've seen our customers take, because the majority of our customers started out with the monolith, is they said, okay, we've got one central hub of data.
It's got all of our data in one place. We've got all the governance rules in there. Oh, got it; I didn't realize how complex the governance rules were across the different businesses. There has to be some way to split this up. So I would say the migration path, if you've done that before in breaking up the monolith, is very similar to that type of situation. Now, to answer the second part, which is about making this data available to the rest of the business, whether that's just internally or potentially to the public: once again, there are multiple parts to that. Governance plays into it, of course, and access control and things like that play into it as well. But what the hub architecture basically does, like it does with the other components, is take what is typically a top-down approach, which is: how, from my white ivory tower, do I set the overarching rules for my global business on how we share data? Now, one of the beauties of the hub architecture is that it instinctively has a hierarchy within it. A graph is just a higher-fidelity version of a tree. So, for example, if I'm wanting that top-down approach where global says: listen, we have a global policy on sharing data that we can't give out anything that's personal. In fact, the hub orchestrator can be responsible for saying: got it. So there are some policies that the local hubs don't have any control over. They've inherited these rules from the global headquarters or from the regional hub, and they're enforced by default onto the localized hub. Now, whether that localized hub can then say, I'd like to override those rules, that's kind of the responsibility of the hub architecture to know if that's even allowed or not.
But what this means is that at least the higher-level parents can say: listen, you might have some extra ways that you want to share data in your local hub, but from a global business, we're handing down these rules. Whether you've got rules to add on top of that, or to override, it's up to the hub orchestrator to figure out whether those rules are actually compliant with each other or not.
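That inheritance-with-overrides idea can be sketched in a few lines. In this toy version, invented purely for illustration, a parent can mark a rule as locked so a child hub cannot override it, while open rules resolve to the most local value:

```python
# Toy hub tree and policies: a child may override a rule only if no
# ancestor has locked it. Names and values are illustrative only.
HUB_TREE = {"london": "emea", "emea": "global", "global": None}

POLICIES = {
    "global": {"share_personal_data": (False, "locked")},
    "emea":   {"retention_days": (365, "open")},
    "london": {"retention_days": (90, "open"),
               "share_personal_data": (True, "open")},  # attempted override
}

def effective_policy(hub):
    chain = []
    while hub is not None:           # walk up to the root
        chain.append(hub)
        hub = HUB_TREE[hub]
    rules = {}
    for h in reversed(chain):        # apply global first, local last
        for key, (value, mode) in POLICIES.get(h, {}).items():
            if key in rules and rules[key][1] == "locked":
                continue             # an ancestor locked it; keep inherited value
            rules[key] = (value, mode)
    return {key: value for key, (value, _) in rules.items()}

print(effective_policy("london"))
# {'share_personal_data': False, 'retention_days': 90}
```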
[00:23:26] Unknown:
And one of the other interesting things in this architectural approach is understanding what the useful topologies are. Is it something where you would have a flat network where maybe every hub is interconnected with each other, or more of a hub-and-spoke model, or something more along the lines of a DAG where you might have different levels of linking between different nodes and, maybe, self-contained subnetworks that all interact with a hub network, and how the overall transit across those different hubs works? And particularly if you have a complex topology where there might be multiple different nodes of traversal, how does that work into things like latency and issues of being able to discover and query across all of these different data repositories?
[00:24:15] Unknown:
Mhmm. It's one of the things that often comes up when we separate data out. I mean, one of the ironies here is you could argue: did we just bring all the data in to then just silo it again? And of course, that's not the intention. In fact, the hub orchestrator really acts as a global journal of what data is available across the entire network. And then, of course, it will give you control in saying: well, hub one and hub two are allowed access to this, but if dictated, I can inform them that, you know, there are these other datasets that are available, and all they need to do is ask to be able to get access to that data. So essentially, the hub orchestrator is kind of like a transaction log or journal: I'm writing everything about every hub, and I'm going to orchestrate who has access to what, as you mentioned, in what hub, but also potentially in what subtrees of that network. And so some of these architectures, or at least these relationships, can get pretty complex.
What we're really recommending is to use the analogy I mentioned before of countries and cities and states, and use that kind of structure to build up the network. The reason why this works is that, typically, that's how a lot of businesses are distributed as well: via locations. Therefore, it makes it a little bit easier to manage. But I won't take away from the fact that, yeah, these architectures could get quite complex in how these relationships work and how data is shared between them. So it's definitely, I would say, one of the discovery points with our customers: how far are they taking this, and where is the complexity actually starting to arise that up front we wouldn't have realized would come up. And then another
[00:26:24] Unknown:
issue, particularly if you have a deeply layered topology, is how you handle the transformations between hubs where they have different rules in terms of how the records should be represented or data quality or cleanliness issues, and being able to handle issues of potential data loss across those different nodes. Yeah. Exactly. So I'll give you a couple of good examples.
[00:26:50] Unknown:
So, you know, using that analogy from before of border control, it's a good way to conceptualize what's happening here. Each hub is responsible for saying: if you're going to talk to me, you need to meet this checklist. Right? Not the other way around. So it's not responsible for saying: hey, if I give my data to you, it needs to meet these standards. No, that's the part where it wouldn't scale, because suddenly you're taking a top-down approach where you're saying: I'm now responsible for my own hub, but also every other hub in the network as well. So when it comes to things like data quality levels, when it comes to things like transforms, let's just take a simple example. You've got one hub that says: hey, my data looks like this. It's a nested JSON object, and I've been instructed by the hub orchestrator to take data from the London hub and transfer it to the global hub. And the global hub has a rule that says: no, no, no, I need flat data. That's what I'm wanting. And, for lack of a better word, the names of properties or columns or attributes, whatever you want to call them, I need them to be called in a specific way. Right? That's up to the hub orchestrator to figure out. Now, figuring that out is not the hardest part, because essentially, on entry, or potential entry, into your target, you're just analyzing a list of rules that are only local to that hub. So from a management perspective, you don't need to know about the whole world.
But what you do need to do is give those hub owners tools: well, I'm not here to just tell you that we can't talk; it's my goal to give you tools to address that issue. And whether that's giving them a tool that transforms nested JSON objects into flat representations, maybe that's one approach. I'll give you an example with the analogy I'm using. You know, if I want to travel into the US, it's not like I get there and someone says: I can't help you; I don't know what to do for you to enter this country. They say: here is a form.
Fill it out, and that's how you get into this country. So, really, it's about facilitating. The hub orchestrator identifies: okay, there's an issue, there's a clash, we can't talk. And then there's the role of, maybe you could call it, a facilitator, which says: and here is the form you need to fill in to make it so we can talk.
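For the nested-JSON-to-flat case Tim describes, the "form" the orchestrator hands over might amount to a small transform like this sketch; the function is a generic illustration, not the actual tooling:

```python
def flatten(obj, prefix="", sep="."):
    """Recursively flatten a nested JSON-like dict into dotted keys."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}{sep}{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat

record = {"name": "Acme Ltd", "address": {"city": "London", "country": "UK"}}
print(flatten(record))
# {'name': 'Acme Ltd', 'address.city': 'London', 'address.country': 'UK'}
```

Renaming properties to the target hub's expected attribute names could then be a second, equally mechanical step driven by that hub's local rule list.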
[00:29:33] Unknown:
And then, to use an example, say you have a name field in a particular record, and in one region it's just a flat text field: you put in whatever characters you want, whatever types of spacing or hyphenation. But then in another hub, you have a name field that requires a first and a last name, which obviously doesn't fit a number of different localities and their particular approaches to how naming is handled. So how do you approach that type of challenge, where one hub is enforcing just a flat text field that supports Unicode, and the other one expects, you know, first and last name and maybe only accepts ASCII, and being able to handle transformations? Is it a case where you can say: I refuse to make this transformation, and so this data won't be transmitted? And if so, how do you signal that to the person who's trying to consume that data for a particular analysis?
[00:30:30] Unknown:
Exactly. It's a really good example. So, first of all, I guess there would be an attempt by the hub orchestrator to say: okay, I'm going to give you a tool that allows you to convert between character sets. Now, there are some character sets that have such low fidelity that you can't increase them. For example, if you've used an encoding that ruins your accented characters, like we have here in Denmark, then it's sometimes hard to reconstruct that using tools. So you've already answered the question, which is: there are some times where it'll just say, no, I'm sorry, but we can't do this. Right? I can't give you the tool to do it. You've got to manually address it yourself; I'm just going to reject this. I mean, using my analogy again with travel, I'm sure there are many times where a person travels into a country and when they get there, it's just: I can't help you. There's no form you can fill out.
It's just: we're not compatible. I can't take data from you. And a good example of this: I would imagine, and I have experienced at least, that most large enterprises, and you can actually expand this out to most businesses in general, probably want a global view of their data. They want to say: yes, even though I'm across twenty countries, I need to do global reporting. So you could imagine the global hub would then say: okay, all hubs below me, start to send me your data. And the hub orchestrator will say: okay, got it. I can go and help you with the looking up of that data and the transport. But the global hub is going to have some rules and policies set. A good example would be data quality scores: I'm not going to accept anything below 90% completeness. I'm not going to accept anything below 95% accuracy.
And if the data coming in doesn't meet that, I will then give those hubs the tools to be able to get to that level, and that will take time. But essentially, if I'm going to accept that data, it needs to at least meet these levels. And you can probably imagine the same could be applied to things like data policies for governance or compliance, such as: if you send me data and it's personally identifying, I'm just not going to take it. So I'm going to give you tools that allow you to not only identify why I'm not taking that data, but to be able to rectify the situation as well. So that's probably a good example of a case where, many times, in that communication or orchestration that the global headquarters is asking for, you get to a point where it just says: nope, I'm sorry, I can't help you. I will just downright reject that data. And in this architecture, is it primarily just a means of storing and communicating about and transmitting data? Or do the individual hubs also provide capacity for computation
[00:33:40] Unknown:
where, in the instance that we were discussing of, for instance, the name field, where there is no way to cleanly convert between one representation and the other, you can push your analysis down to that hub to perform the computation that you want and then return the results back in sort of a scatter-gather approach? Yeah. I mean,
[00:34:00] Unknown:
I'll be honest, I haven't even thought about that. But when you think about the architectures that we use, in CluedIn we use Kubernetes as our orchestration framework for our containers and for deployment, and all the goodness that comes with that. And so it's not a hard stretch to then say: well, can all of those nodes just represent a node pool? And can I use Kubernetes to register those as node pools? And, you know, if I need that extra resource, I could look out to Kubernetes, and to be honest, Kubernetes does this for us, and say: what pods are available? I've got someone demanding four gigs of RAM and two CPUs.
Who's got it? And you could, in a way, utilize that as a farm to distribute out, or at least balance, the load of infrastructure, and try to get efficiencies not only on the infrastructure, but on the costs associated with it. So I'm sure you could bend this to represent something that could do something like that. And another question
[00:35:07] Unknown:
about this data hub architecture that can potentially lead to challenges in data governance and management is the question of data proliferation. If you're copying information between the different hubs, how do you track what records are being used where? And particularly in the case where you have an update to a set of information, how do you then replicate that across all the different areas where it has been copied? Or, in the event that a record is deemed false, or under GDPR where somebody has requested deletion of their data, how do you handle removing that information from all the different places where it's been replicated to? Let's just start with this: it's really complex. Right? It's a really
[00:35:52] Unknown:
complex challenge, because now we're really pushing this architecture to its limits; I wouldn't even say these are edge cases. So, you know, for example, in CluedIn we have this idea of a mesh API, which essentially says: hey, if I've pulled in all the data from all the operational systems, and then someone identifies that Tobias's job title is wrong, well, because I know where it came from, I also have that lineage of: okay, if I did change it, what records in the source systems need to be updated? Now, the way our mesh API works is in two ways. First of all, you can automate that process behind a workflow. So, of course, you can imagine if I've got a record in from, I don't know, a CRM system like Salesforce, it has an ID associated with it. Great, that's been ingested into our platform, and hey, I've got another record on Tobias from our HR system, Workday.
It's got an ID as well. So in the CluedIn platform, you know, we've figured out these are duplicates; let's merge them. So now I've got one record on Tobias, but two pointers to the source systems. So that's in one hub. Then, really, to span the multiple hubs, essentially what you're doing is just adding extra layers of mapping. So instead of saying: hey, I'm updating the job title of Tobias, and there are two source keys, one from Salesforce and one from Workday, that need to be updated, what you're doing is putting an extra layer of abstraction, or indirection, in there, which is: hey, I'm over here, called the hub orchestrator.
I actually know where all the data is located. Now, each hub won't have access to this journal, right? They'll only be able to look up what they've got access to. Essentially, it's the journal that says: got it, so these hubs are actually where I originally got that data from. So essentially, when you're integrating data, you're mapping up to some type of core, what we like to call a vocabulary or schema. And then, on the reverse, you're unraveling that. You're saying: cool, my goal across the business is that I don't really care or want to know that in Salesforce we call it the job title and in Workday it's called job position. I want to map that up to some type of standardized vocabulary or lexicon, and this is quite normal in the data governance area. But really, when we're starting to write back, or we're needing to delete records, Tobias comes in and says: delete everything across all hubs.
Essentially, what you're doing is unraveling that vocabulary mapping that you've done back to the source hubs to delete those records, and it's really hard. There are so many challenges involved in it. I'll give you another good example. There's this concept of data sovereignty, which is essentially: where is the data located? Typically, depending on who you talk to, they'll also include: okay, from a location perspective, where is that data also traveling? So obviously, the hub concept, or the network, kind of instinctively has this in its design, in that a hub would usually be set up in the region that you're wanting to host that hub in. So in something like AWS, you would go into us-east or us-west, and the same in Asia, for example. And what becomes complex with this are these rules of sovereignty.
I'll give you one country that's quite complex, and that's Germany. Germany is very strict on where it can host data. So, for example, if I was a hub in the US and I said, give me the data from Germany, you can't just easily or legally transfer that data to US servers. So then the data hub framework becomes really interesting, because we've got the backing of something like Kubernetes from our infrastructure perspective that says: hey, if you need any more hubs, I can spin those up. That's just an extra node pool, and, you know, your Helm charts have already given me instructions on what a node pool looks like. You can start to think of situations where you might spin up new hubs on demand and tear them down when the processing has been done. So, for example, there might be a rule that says: yeah, you can send data to Germany, but we can't do it the other way around. Got it. So the hub orchestrator knows that rule, or at least you map that into the hub orchestrator.
Let's say that's the direction these hubs can actually talk. So if global, let's just say they are based in the US, says: I need to globally report. It says: not a problem. But to crunch the German data, you either need to use the German infrastructure to do that, and send data from the US to the German infrastructure hub, or, and we've seen other examples of this before, maybe you have to spin up infrastructure in a sovereign country, i.e., take something like Switzerland. So instead of the US sending to Germany and Germany sending to the US, we spin up infrastructure in Switzerland. Both sources send their data to Switzerland, that data is crunched, and potentially that hub is completely torn down. And as you probably already know, Tobias, this is something that Kubernetes does well: spinning up different environments and tearing them down when they're not necessary.
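The write-back scenario Tim walks through, unraveling the vocabulary mapping back to each source system, could be sketched roughly like this. The journal structure, the vocabulary table, and the identifiers are all invented for illustration:

```python
# A merged "golden" record keeps pointers to every hub and source system
# it was built from, so an update or deletion can fan out to each origin.
JOURNAL = {
    "person:tobias": [
        {"hub": "us-east", "system": "salesforce", "key": "0035f000..."},
        {"hub": "emea",    "system": "workday",    "key": "WD-1142"},
    ]
}

# Per-system vocabulary: standardized attribute -> local field name.
VOCABULARY = {
    "salesforce": {"person.jobTitle": "Title"},
    "workday":    {"person.jobTitle": "Job_Position"},
}

def update_everywhere(entity_id, attribute, value):
    """Translate a standardized attribute back to each source system's
    local field name and fan the update out to every holding hub."""
    for pointer in JOURNAL.get(entity_id, []):
        local_field = VOCABULARY[pointer["system"]][attribute]
        # A real system would call the hub's API here; we just show the target.
        print(f"in hub {pointer['hub']}: set {local_field}={value!r} "
              f"on {pointer['system']} record {pointer['key']}")

update_everywhere("person:tobias", "person.jobTitle", "Principal Engineer")
# Deletion is the same walk without the vocabulary step: issue a delete
# for each (hub, system, key) pointer, then remove the journal entry.
```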
[00:41:41] Unknown:
And for somebody who's actually interested in building out their own implementation of this data hub architecture, what are some of the technologies that are useful for implementing things like the hub orchestrator, the data storage layer, and some of these automated transformations? And what are some of the considerations or edge cases that they should be aware of as they're starting to plan out some sort of deployment like that? Yeah, good question. So I'll start with this: having a hub architecture
[00:42:11] Unknown:
in a traditional on-premise environment is quite hard, as you can imagine, especially if we're doing location-based hubs. And, you know, most enterprise businesses, even though they might be in the cloud, will quite often still have an internal VMware cluster or some VM environment set up. And so one of the technology baselines would be that if you are a completely on-premise business, the data hub architecture becomes a little bit more complex to achieve. Not unachievable, but much more complex. I've already mentioned a couple of the core elements or technologies that you would want to think about. Kubernetes, of course, and, coming with that, containers in general is one of them. Why? Because you might want to spin up different environments, or, as in the example you used of what you could use this architecture for: hey, I'm a hub in London and I'm doing absolutely nothing, but Germany is processing data and it's overloaded.
Give me your work. And obviously, if those sovereignty rules are set up such that we're allowed to do that, we're allowed to transfer data between those hubs, then fantastic. So the thing is that if you've played with Kubernetes and containers before, you're actually a majority of the way there. All we're really doing is applying the same methodologies, the same ideas, to the hosting of data, where a container is not necessarily a product. So we're not talking about a container like Spark or Kafka; actually, you could represent the container as a hub, where it's not just about technology, but also the data. And it's about the policies of how different Docker containers work with each other. Isn't that why we have Docker Compose? Docker Compose is there to say: this is how I compose my hub network. And then we use something like Helm to say: oh, and this is how all of the charts are set up for Kubernetes.
So, you know, things like the amount of RAM and CPU that we allocate to the different nodes in the cluster, or the pods. So then you really start to see that, if you can understand that, it's very similar to the data hub architecture itself. And then the other question
[00:44:38] Unknown:
of the hubs and how they relate to each other is whether you find it useful or necessary for each of the different hubs to be implemented in a homogeneous fashion where everything is using the same set of technologies? Or because of the abstraction layer of the hub orchestrator, is it possible to have more of a heterogeneous implementation where each of the different hubs can use the technologies that are most well suited for the particular teams who are maintaining them?
[00:45:08] Unknown:
So, being completely transparent, I have not even humored the thought. But when you think about it, if I use the technologies that I was talking about in the previous questions, you could argue that if your application is built well and the proper abstraction layers have been put in, you could say: hey, at any point, if I want to switch out SQL Server for MySQL, I should be able to do that. Which kind of plays into: got it, so if I've got a hub that's set up with my own custom implementation, all I'm really doing is using the hub architecture as interfaces, some type of schema, some type of agreement, that says: listen, to play in this network, to play in this hub, you need to fill in the following things. You need to tell me rules on data quality. You need to give me policies. And you can imagine that the hub orchestrator that we use is very technology agnostic.
It's essentially YAML files that have policies. So there's no vendor-specific or technology binding besides, I guess you could say, that ubiquitous language, or at least something that's not bound to a particular database type or a particular programming language. So then it makes it a little bit easier to grasp the fact that these hubs don't need to be the same technology. Rather, they just need to adhere to what the hub orchestrator is telling them to adhere to, so that it can do its job and tell these different systems how to do their job. Now, here's where it gets a little bit more complex: it's also the role of the hub orchestrator to say, I'm going to give you the tooling so that we can talk. I.e., your data is not complete enough; your data is not accurate enough.
And so, you know, good luck with that. So then you would prompt the question: okay, is it the job of the hubs to have those tools, or is it the job of the hub orchestrator to provide kind of ubiquitous tools that also aren't necessarily technology specific? And I find that potentially hard to achieve, because I would argue that if you're not binding yourself at least to some consistent tooling around hubs talking to each other, you run the risk of saying there are twenty different ways to solve this challenge, and every time I want to talk to a new hub, I'm given a new tool to do it.
So it feels like one of those things that needs to have more discovery around it. And maybe, really, the hub orchestrator is responsible for giving that consistent tooling so hubs can chat.
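Since those hub policies are described as essentially YAML files, here is a made-up example of what such a policy document and its parsing could look like. The schema is entirely hypothetical, not CluedIn's actual format, and the snippet assumes the third-party pyyaml package:

```python
import yaml  # third-party: pip install pyyaml

# Hypothetical policy document a hub might publish to the orchestrator.
POLICY_YAML = """
hub: emea
inherits: global
entry_rules:
  min_completeness: 0.9
  min_accuracy: 0.95
  reject_personal_data: true
"""

policy = yaml.safe_load(POLICY_YAML)
rules = policy["entry_rules"]
print(f"hub {policy['hub']} admits records with "
      f">= {rules['min_completeness']:.0%} completeness")
```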
[00:48:10] Unknown:
And when is the data hub approach the wrong idea, and when would it be easier to use just a different style overall, such as a monolithic approach?
[00:48:24] Unknown:
So, you know, the funny thing is we work with a lot of customers who are large, and that can be measured in multiple ways, revenue or employee count; in this case, I'm talking about employee count. And, you know, when they talk to us about their business, some of the first things they'll say is: Tim, we're not actually a complex business. Right? For example, we are in facility services, where we clean people's offices or we do the catering at people's offices. So inherently, we're maybe not a complex business like shipping, which requires lots of logistics and good timing and schedules and things like that. And so I would definitely argue that with any win that you take from engineering, you always take some losses. So really, if you find that your business is not complex, something like this is overkill or over-engineering. And, of course, as I said before, I would argue the goal of a business is not necessarily to become more complex; it's definitely the goal of businesses to expand and grow and hopefully go into other countries, and there's inherently some complexity that comes with that. But if you find that you're not a complex business, that there aren't, you know, hundreds and hundreds of rules that change per region, and maybe change in different departments,
something like this seems like it would be absolute overkill.
[00:49:57] Unknown:
And are there any other aspects of the data hub architecture or some of the ways that it's being used or the benefits that it provides that we didn't discuss yet that you think we should cover before
[00:50:09] Unknown:
before we close out the show? I think the one thing I'd like to just reiterate is that if you were to Google "data hub," there are multiple different interpretations. And if you see our interpretation, you could have easily called it, oh, that's a data network, or that's a data mesh, or that's a data service mesh, and I would back that up. I would say, yeah, those are two other possible ways of interpreting what a data hub is for. So I would say, when you're doing your exploration, just be wary that people are interpreting this quite differently. And I believe that there's also a
[00:50:49] Unknown:
recent project out of LinkedIn called DataHub that they're using for their data discovery engine as well, just to muddy the waters a bit further. Yeah, exactly. And it makes sense, doesn't it? Like, the hub, that's where I go to discover where the data is. And so,
[00:51:03] Unknown:
yeah, I agree. It's maybe a term that in two or three years we'll all congregate on one interpretation
[00:51:10] Unknown:
of it. And I think that would help clear up the situation. Well, for anybody who wants to follow along with the work that you're doing or get in touch, I'll have you add your preferred contact information to the show notes. And as a final question, I would just like to get your perspective on what you see as being the biggest gap in the tooling or technology for data management today. So I
[00:51:31] Unknown:
think education. Education is a big thing. And here's why I say it: I see a lot of businesses taking the traditional approaches to data. It makes sense, right? It's what worked at my last workplace; let's just do that same thing again, maybe with a different vendor, maybe with a little bit more modern technology. One of the things I think is really missing is education on the data management space itself. And, of course, you've got great services like Udemy and Pluralsight that are going out there to train people. But I actually think we don't have enough education that's cracked this concept of: well, how do you solve some of the real complexities like we talked about today? You know, we often see things like: hey, here's how you build a pipeline.
Here's how you migrate data from source to target. But what you don't really see are deep dives into: yeah, let's go after the big challenges. Because quite often what we're hearing from our customers is: I'm okay with solving the challenges of data if it's two systems, but, I mean, that was me fifteen years ago. I've got 200 systems; I've got 300 systems that I know of. Education around how to solve those large data management challenges is where I think there's a huge gap. Well, thank you very much for taking the time today to join me and share your thoughts on the approach that you're using for being able to enable
[00:52:57] Unknown:
data management across global businesses and handling these issues of regional compliance and the challenges of data sovereignty and things like that. It's definitely a big problem, as you put it, and something that is worth exploring a bit deeper. So I appreciate all of your time and efforts on that front, and I hope you enjoy the rest of your day. Perfect. Thanks, Tobias. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction to Tim Ward and Data Hubs
Tim's Journey into Data Management
Challenges in Data Management Projects
Goals of Data Hub Architecture
Core Elements of Data Hub Architecture
When to Implement Data Hub Architecture
Migration Strategies to Data Hub Architecture
Topologies and Latency in Data Hub Architecture
Handling Transformations and Data Quality
Computation and Data Proliferation Challenges
Technologies for Implementing Data Hub Architecture
When Data Hub Architecture is Overkill
Different Interpretations of Data Hub
Biggest Gap in Data Management Technology