Summary
Data governance is a complex endeavor, and scaling it to meet the needs of a complex or globally distributed organization requires a well-considered and coherent strategy. In this episode Tim Ward describes an architecture that he has used successfully with multiple organizations to scale compliance. By treating it as a graph problem, where each hub in the network has localized control and inherits higher-level controls, the approach reduces overhead and provides greater flexibility. Tim provides useful examples for understanding how to adopt this approach in your own organization, including some technology recommendations for making it maintainable and scalable. If you are struggling to scale data quality controls and governance requirements, then this interview will provide some useful ideas to incorporate into your roadmap.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
- Your host is Tobias Macey and today I’m interviewing Tim Ward about using an architectural pattern called data hub that allows for scaling data management across global businesses
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by giving an overview of the goals of a data hub architecture?
- What are the elements of a data hub architecture and how do they contribute to the overall goals?
- What are some of the patterns or reference architectures that you drew on to develop this approach?
- What are some signs that an organization should implement a data hub architecture?
- What is the migration path for an organization who has an existing data platform but needs to scale their governance and localize storage and access?
- What are the features or attributes of an individual hub that allow for them to be interconnected?
- What is the interface presented between hubs to allow for accessing information across these localized repositories?
- What is the process for adding a new hub and making it discoverable across the organization?
- How is discoverability of data managed within and between hubs?
- If someone wishes to access information between hubs or across several of them, how do you prevent data proliferation?
- If data is copied between hubs, how are record updates accounted for to ensure that they are replicated to the hubs that hold a copy of that entity?
- How are access controls and data masking managed to ensure that various compliance regimes are honored?
- In addition to compliance issues, another challenge of distributed data repositories is the question of latency. How do you mitigate the performance impacts of querying across multiple hubs?
- Given that different hubs can have differing rules for quality, cleanliness, or structure of a given record how do you handle transformations of data as it traverses different hubs?
- How do you address issues of data loss or corruption within those transformations?
- How is the topology of a hub infrastructure arranged and how does that impact questions of data loss through multiple zone transformations, latency, etc.?
- How do you manage tracking and reporting of data lineage within and across hubs?
- For an organization that is interested in implementing their own instance of a data hub architecture, what are the necessary components of an individual hub?
- What are some of the considerations and useful technologies that would assist in creating and connecting hubs?
- Should the hubs be implemented in a homogeneous fashion, or is there room for heterogeneity in their infrastructure as long as they expose the appropriate interface?
- When is a data hub architecture the wrong approach?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- CluedIn
- Futurama
- Kubernetes
- Zookeeper
- Data Governance
- Data Lineage
- Data Sovereignty
- Graph Database
- Helm Chart
- Application Container
- Docker Compose
- LinkedIn DataHub
- Udemy
- PluralSight
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, a 40 gigabit public network, fast object storage, and a brand new managed Kubernetes platform, you've got everything you need to run a fast, reliable, and bulletproof data platform. And for your machine learning workloads, they've got dedicated CPU and GPU instances.
Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. You listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Corinium Global Intelligence, ODSC, and Data Council.
Upcoming events include Strata Data in San Jose and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events and to take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey, and today I'm interviewing Tim Ward about using an architectural pattern called data hubs that allows for scaling data management across global businesses. So, Tim, can you start by introducing yourself?
[00:01:36] Unknown:
Sure. My name is Tim Ward. I have had the pleasure of being on this podcast before, but my background is mainly in software engineering. For the last six years, I've been focusing more on data engineering. So combined, that's about fourteen years of working in the software space. I am based out of Copenhagen, Denmark. I have my wife here, I have a little boy called Finn, and I have a dog called Seymour that looks like Chewbacca. So for those fans of Futurama, you'll know where the reference for that dog comes from; it looks exactly like Seymour from the Futurama episodes. So that's me.
[00:02:17] Unknown:
And so you mentioned that you've only recently been involved in data management. So I'm wondering if you can just give a bit of a background in terms of how you got involved in that space. Yeah. So at a previous job, I was working for quite a large vendor.
[00:02:31] Unknown:
It wasn't in the data space, but it turned out that the last project I worked on there was, funnily enough, about how we could integrate our customers' third-party systems into the particular platform that we were building. And it turned out that the majority of the team I was working with were all tasked with a very similar project, but with different customers. We went through that rabbit hole of, okay, it looks like we need to buy a couple of different pieces here. We need a data integration piece. Okay, I've heard from one of my other colleagues that they're investing in a data governance platform; that makes sense for my customer as well. And after you go down that rabbit hole of the umbrella of data management, you've got, you know, potentially ten different products that you've got to stitch together.
And what got me into data management was that, really, every single one of these projects, by the way, didn't go so well, and where they failed was actually stitching these different products together. In some cases they were, you know, from different vendors. Not with my particular customer, but some of my colleagues were also having the same struggles even with the same vendor's products, just following, or at least identifying and going after, the different pillars of data management. So I think what got me into this was that idea of: these all didn't go well, maybe there's something to this. Maybe there's a need to investigate more modern techniques for data management, because it did seem like this was a common request from most enterprise-sized businesses anyway. And so as I mentioned at the beginning, we're talking about some of the work that you've done in terms of
[00:04:31] Unknown:
using this data hub architecture. And I'm wondering if you can just start by giving a bit of an overview of the goals that the architectural pattern is intending to achieve.
[00:04:40] Unknown:
Yeah, sure. I mean, I think I'll start with the fact that there are multiple different interpretations of what a data hub is. If you go to, for example, some of the analyst firms in our space, the way that they've interpreted it is with an analogy to something like the subways in London, where, you know, you've got stations that take people from one side of London to the other. There are stops along the way where people can get off, but essentially there's this whole spaghetti of hubs going around London to take people from one place to the other. And really, what those analyst firms are saying is, well, you can apply the same thing to data, where instead of transporting people, you're essentially taking data from a source system and making it available in some type of target system. For example, your BI platform of choice, maybe your data science platform of choice, you name the tool. But typically, wherever you want to do something with the data once it's prepared.
Now our interpretation of a data hub is still similar, but I would say it has its differences. So really, the goals of the data hub architecture came from a simple question that we had from one of our customers, which was: we are a globally distributed business. We have a headquarters, we have regional offices, we have localized offices. And this is probably a similar case with a majority of enterprise businesses, especially those that are globally distributed. And the simple question was: I need to put in a data governance strategy.
And how on earth am I going to manage, from a top-down approach, all the possible permutations of policies and rules and regulations for localized or regional offices, and also take care of the headquarters as well? And as you're probably aware as well, Tobias, what we typically do in the engineering field when we have a big problem like that, the first thing we look to is: how do we break this down into smaller challenges? And when you do that, you've got to have some way to instrument it, orchestrate it, and bring it together once those components have been broken down into smaller chunks. So in fact, our interpretation, the way that we've been working with the data hub architecture with our customers and with our platform, has really been focused on: how on earth, even for one of the pillars of data management like data governance, are you going to manage this from a top-down approach, with what could be localized data policies on what data they are allowed to have, and localized retention policies on how long they can have it for? And then you start to add in the other pillars of data management, like data quality, where the quality levels or metrics or KPIs for one region might be different from another's. And this is what forced us to explore a new way, a new architecture, to support this complexity of globalized rules.
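To make those per-hub rules concrete, here is a minimal sketch of what a localized policy might look like, with each office declaring only its own rules and pointing at a parent hub. The class and field names are hypothetical illustrations, not CluedIn's actual data model:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class HubPolicy:
    """One hub's localized rules; higher-level rules come from its parent."""
    hub_id: str
    parent: Optional[str] = None        # e.g. a regional or global hub
    retention_days: int = 365           # how long this hub may keep records
    allowed_data_classes: set = field(default_factory=set)
    min_completeness: float = 0.0       # localized data quality KPI

# Each office declares only what it is responsible for.
global_hq = HubPolicy("global", retention_days=730,
                      allowed_data_classes={"company", "contact"})
emea = HubPolicy("emea", parent="global", retention_days=365,
                 allowed_data_classes={"company"}, min_completeness=0.9)
```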
[00:07:59] Unknown:
And so in terms of the actual architecture itself, what are some of the core elements and abstractions that are incorporated into it to allow for this global distribution of data with localized control? And what are some of the other patterns or reference architectures that you drew on in order to,
[00:08:21] Unknown:
arrive at this approach? Yeah. So, I mean, our team has been working with many different types of technologies. One of the ones that we've been focusing on, and use quite heavily within our platform, is a graph database. So there's really this concept of a network. And I think we've been spoiled by working with it because, essentially, it offers us the highest-fidelity data structure that we have to be able to map data. And playing with that gave us some inspiration on what the elements of this hub architecture are. And so the first obvious one would be your hubs.
So in a graph world, these would be nodes. Those hubs are essentially responsible for managing, as you put it, the localized rules. So they're not interested in knowing what global thinks about this, what that other local hub thinks about this, or what my parent hubs, like a region hub, think. No, they want to manage the specific rules for their location. And so when you have these hubs, you obviously need some way for hubs to talk to each other, and that's where this concept, this element of a hub orchestrator, comes in. And it's actually not too hard to see this analogy played out in other areas. A good example would be that we use Kubernetes to orchestrate our containers.
In the Apache world, or the open source world, we often use ZooKeeper to orchestrate our different data stores, or to keep things in some type of sync between different environments. So there are many other places where this type of architecture pattern is, in fact, already available to us. So, obviously, if you've got these hubs, the hub orchestrator is responsible for: okay, well, how do I talk from one hub to another? And there's a really good analogy I like to use here, which is travel. Now, when I travel from Copenhagen to the US, there is a border set up which essentially says: for you to enter, you need to do these things. Well, first of all, you need a valid passport. Okay?
You potentially need a visa. You potentially need to have a work visa, for example, depending on what you're doing. And you're not allowed to be on any of these lists, such as any blacklists or embargo lists that we have in our country. And then if you check all those boxes, feel free to come in. And you can really start to apply that same kind of concept, but replace people with data. So how does data travel from one hub to another? Well, the localized hub sets up rules, such as: if I'm going to talk with another hub, I need to make sure that the data completeness, the completeness of records, is at a certain level. Maybe, just for an example, say 90%.
So if you want to share your records with my hub, you need to make sure your records are 90% complete. And if they're not, the hub orchestrator is responsible for saying: I'll give you tools to elevate that level of completeness. Whether that's plugging in enrichment services; whether, for example, if we're missing addresses, we might plug in something like the Google Places API; or if we're missing company data, we might plug in Dun & Bradstreet or OpenCorporates or, you know, name the local business registries that exist in a majority of countries. But I'm going to give you that tooling so that you can enter. And so that type of pattern was actually a huge inspiration and reference: an architecture that is already working in another part of the world. Now, whether you'd like to think it works or not, that's completely up to you, and I'm sure many people will complain about parts of it; I'm not speaking from experience, but essentially the concept works.
And the great thing is, often when you bring in concepts from the real world and then say, okay, let's put it into a machine to handle this, a lot of the issues that you find with, for example, scaling humans to solve a problem can actually be removed when we move it into a much more technical product, a technical architecture. So those are some of the patterns that we drew upon to develop this approach, but, I mean, let's be honest, I've already mentioned two examples where orchestration is common and that pretty much most enterprises are using: Kubernetes to orchestrate deployment of containers, along with scalability and health checks and liveness checks.
And then you've got tools like ZooKeeper that have been around for years and years to orchestrate, you know, syncing of environments, syncing of dictionaries and configurations. So it's not a far stretch to actually see how this pattern really came together.
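As a rough illustration of that border-control idea, a completeness gate at a hub's boundary might look something like the following sketch. The function names and record shape are made up for illustration; this isn't an actual CluedIn or hub-orchestrator API:

```python
def completeness(record: dict) -> float:
    """Fraction of a record's fields that are actually populated."""
    if not record:
        return 0.0
    filled = sum(1 for value in record.values() if value not in (None, ""))
    return filled / len(record)

def admit(records: list, threshold: float = 0.9):
    """Border control: records below the threshold are turned away
    and sent back for enrichment before they may enter the hub."""
    admitted = [r for r in records if completeness(r) >= threshold]
    rejected = [r for r in records if completeness(r) < threshold]
    return admitted, rejected

records = [
    {"name": "Acme Ltd", "address": "London", "phone": None},       # 2/3 complete
    {"name": "Initech", "address": "Copenhagen", "phone": "+45 ..."},
]
ok, needs_enrichment = admit(records)
```

A rejected record would then be routed through whatever enrichment services the orchestrator offers, such as an address lookup, before trying the border again.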
[00:13:57] Unknown:
And in terms of determining when it's worth going down this path of establishing this architecture, with these various hubs, the orchestration between them, and the rules that determine what data is allowed to egress and in what fashion: what are some of the signs that an organization should actually go down that road and pay the complexity cost of implementing it and adding all these different constraints to their different localized data regimes?
[00:14:28] Unknown:
Yeah, it's a really good way to put it because, you know, one of the side effects, or I guess potentially a disadvantage, of an architecture like this is: are you over-engineering for the situation? One of the big signs immediately would be: are you even a globally distributed business? You don't have to set up your hubs in a geographical architecture; it can break down by different components, but essentially that's the commonality that we typically see with our customers. So the first thing you would need to ask yourself is: are you a globally distributed business? The reason why I use geography as an example is that often, when we're talking about banks or insurance companies, they're regulated differently in different countries and locations, and therefore the rules around what data they can have and how long they can retain it for are a good common case for setting up this type of architecture.
So the next thing you've got to ask yourself is: are you actually operationally a complex business? If not, this, I would argue, is overkill, but that doesn't mean we should necessarily shut off the option of growing your business. I mean, I guess that's really the goal of a lot of businesses: I want to expand into other locations, and typically that comes with complexity. So I want an architecture pattern where I don't need to subscribe to the complexity up front, but I also don't want to shut off that path. And the data hub architecture, I would argue, has this designed into it. In fact, you can imagine that globally distributed businesses are often growing into new locations quite often, whether through mergers and acquisitions or just natural growth of their business. And therefore, the data hub architecture not only is designed to help with this, but what we're not wanting is: if we've got 300 of these hubs, as soon as we add another hub, we don't want to say, that's a whole new set of pairwise links that I have to sync up on rules, and you can imagine the complexity of how those rules overlap. Do they contradict each other? That's just never going to work. So the data hub architecture instinctively avoids that, and I think it's the use of the graph and network model that helped with this. You know, you can start small. You can start with one hub, the global headquarters. And as you grow out, you might say, okay, well, we just have an office in London.
I don't want to put in the infrastructure cost of setting up new technology in a new data center or a new region in our cloud provider; let's just shove that all within the headquarters. Now, of course, what that means is that once you do move to that hub architecture, the complexity would be splitting that up, and that can be complex. So really, I guess, the sign that you would need to look at is growth. Do you see it in the foreseeable future? And if so, maybe start out with the concept of not closing yourself off from an architecture like this.
[00:17:54] Unknown:
And for an organization that already has an existing data platform, what are some of the migration strategies that they might look to for being able to move into a data hub architecture and open up their data to be integrated or distributed across these localized hubs? And then also, one of the other possible uses for a data hub architecture that I can envision is if there's some sort of a data collective and being able to share data across organizations
[00:18:25] Unknown:
or being able to open up certain datasets for public access? Yeah, good question. So, I mean, this question reminds me of another thing that we've done in software engineering over the last few years, which is moving off the monolith to microservices. And what did we do? What did every business do? Essentially, we said, okay, let's break up the monolith into components, but then I need to put something in between as a kind of mesh. One of the examples that was very common was an enterprise service bus: putting in an enterprise service bus where we pass around discrete, small little messages so that these different components can talk to each other in a little bit of a decoupled way. And so, really, this is kind of similar to what the migration path looks like for moving off a classic data platform into a little bit more of a distributed governance environment.
And I'll just reiterate, governance is just one of those pillars, right? And I think the reason why I'm choosing it is because it's usually one of the first things that people want to do: put governance strategies in place in their business. And you can also just imagine that, okay, this can get really complex as I add the extra pillars of data management in. But essentially, the experience with the migration, as I alluded to before, is a gradual move. It's really saying, okay, how are we going to start splitting up this monolith? And typically, the way we did it in software engineering was to say: let's identify the main components of the platform.
So we would split logging out to its own service. We would split maybe jobs out to their own service. We would split something like data access potentially out to its own service. Now, other architectures in microservices would say, no, no, no, split the service out to have all of its components sitting behind it. So logging will have a database, it'll have a data layer, because that's all it's responsible for. Now, whichever way you did it, it's still going to help in this example, but really the approach that we've seen our customers take, because the majority of our customers started out with the monolith, is they said, okay, we've got one central hub of data.
It's got all of our data in one place. We've got all the governance rules in there. Oh, got it; I didn't realize how complex the governance rules were across the different businesses. There has to be some way to split this up. So I would say the migration path, if you've done that before in breaking up the monolith, is very similar to that type of situation. Now, to answer the second part, which is about making this data available to the rest of the business, whether that's just internally or potentially to the public: once again, there are multiple parts to that. Governance plays into it, of course, and access control and things like that play into it as well. But what the hub architecture basically does, like it does with the other components, is take what is typically a top-down approach, which is: how, from my white ivory tower, do I set the overarching rules for my global business on how we share data? Now, one of the beauties of the hub architecture is that it instinctively has a hierarchy within it. A graph is just a higher-fidelity version of a tree. So, for example, if I'm wanting that top-down approach where global says: listen, we have a global policy on sharing data that we can't give out anything that's personal. In fact, the hub orchestrator can be responsible for saying: got it. So there are some policies that the local hubs don't have any control over. They've inherited these rules from the global headquarters or from the regional hub, and they're enforced by default onto the localized hub. Now, whether that localized hub can then say, I'd like to override those rules, that's kind of the responsibility of the hub architecture to know if that's even allowed or not.
But what this means is that at least the higher-level parents can say: listen, you might have some extra ways that you want to share data in your local hub, but from a global business, we're handing down these rules. Whether you've got rules to add on top of that, or to override, it's up to the hub orchestrator to figure out whether those rules are actually compliant with each other or not.
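That inheritance-with-overrides idea can be sketched in a few lines. In this toy version, invented purely for illustration, a parent can mark a rule as locked so a child hub cannot override it, while open rules resolve to the most local value:

```python
# Toy hub tree and policies: a child may override a rule only if no
# ancestor has locked it. Names and values are illustrative only.
HUB_TREE = {"london": "emea", "emea": "global", "global": None}

POLICIES = {
    "global": {"share_personal_data": (False, "locked")},
    "emea":   {"retention_days": (365, "open")},
    "london": {"retention_days": (90, "open"),
               "share_personal_data": (True, "open")},  # attempted override
}

def effective_policy(hub):
    chain = []
    while hub is not None:           # walk up to the root
        chain.append(hub)
        hub = HUB_TREE[hub]
    rules = {}
    for h in reversed(chain):        # apply global first, local last
        for key, (value, mode) in POLICIES.get(h, {}).items():
            if key in rules and rules[key][1] == "locked":
                continue             # an ancestor locked it; keep inherited value
            rules[key] = (value, mode)
    return {key: value for key, (value, _) in rules.items()}

print(effective_policy("london"))
# {'share_personal_data': False, 'retention_days': 90}
```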
[00:23:26] Unknown:
And one of the other interesting things in this architectural approach is understanding what the useful topologies are. Is it something where you would have a flat network where maybe every hub is interconnected with each other, or more of a hub-and-spoke model, or something more along the lines of a DAG where you might have different levels of linking between different nodes and, maybe, self-contained subnetworks that all interact with a hub network, and how the overall transit across those different hubs works? And particularly if you have a complex topology where there might be multiple different nodes of traversal, how does that work into things like latency and issues of being able to discover and query across all of these different data repositories?
[00:24:15] Unknown:
Mhmm. It's one of the things that often comes up when we separate data out. I mean, one of the ironies here is you could argue: did we just bring all the data in to then just silo it again? And of course, that's not the intention. In fact, the hub orchestrator really acts as a global journal of what data is available across the entire network. And then, of course, it will give you control in saying: well, hub one and hub two are allowed access to this, but if dictated, I can inform them that, you know, there are these other datasets that are available, and all they need to do is ask to be able to get access to that data. So essentially, the hub orchestrator is kind of like a transaction log or journal: I'm writing everything about every hub, and I'm going to orchestrate who has access to what, as you mentioned, in what hub, but also potentially in what subtrees of that network. And so some of these architectures, or at least these relationships, can get pretty complex.
What we're really recommending is to use the analogy I mentioned before of countries and cities and states, and use that kind of structure to build up the network. The reason why this works is that, typically, that's how a lot of businesses are distributed as well: via locations. Therefore, it makes it a little bit easier to manage. But I won't take away from the fact that, yeah, these architectures could get quite complex in how these relationships work and how data is shared between them. So it's definitely, I would say, one of the discovery points with our customers: how far are they taking this, and where is the complexity actually starting to arise that up front we wouldn't have realized would come up. And then another
[00:26:24] Unknown:
issue, particularly if you have a deeply layered topology, is how you handle the transformations between hubs where they have different rules in terms of how the records should be represented or data quality or cleanliness issues, and being able to handle issues of potential data loss across those different nodes. Yeah. Exactly. So I'll give you a couple of good examples.
[00:26:50] Unknown:
So, you know, using that analogy from before of border control, it's a good way to conceptualize what's happening here. Each hub is responsible for saying: if you're going to talk to me, you need to meet this checklist. Right? Not the other way around. So it's not responsible for saying: hey, if I give my data to you, it needs to meet these standards. No, that's the part where it wouldn't scale, because suddenly you're taking a top-down approach where you're saying: I'm now responsible for my own hub, but also every other hub in the network as well. So when it comes to things like data quality levels, when it comes to things like transforms, let's just take a simple example. You've got one hub that says: hey, my data looks like this. It's a nested JSON object, and I've been instructed by the hub orchestrator to take data from the London hub and transfer it to the global hub. And the global hub has a rule that says: no, no, no, I need flat data. That's what I'm wanting. And, for lack of a better word, the names of properties or columns or attributes, whatever you want to call them, I need them to be called in a specific way. Right? That's up to the hub orchestrator to figure out. Now, figuring that out is not the hardest part, because essentially, on entry, or potential entry, into your target, you're just analyzing a list of rules that are only local to that hub. So from a management perspective, you don't need to know about the whole world.
But what you do need to do is give those hub owners tools: well, I'm not here to just tell you that we can't talk; it's my goal to give you tools to address that issue. And whether that's giving them a tool that transforms nested JSON objects into flat representations, maybe that's one approach. I'll give you an example with the analogy I'm using. You know, if I want to travel into the US, it's not like I get there and someone says: I can't help you; I don't know what to do for you to enter this country. They say: here is a form.
Fill it out, and that's how you get into this country. So, really, it's about facilitating. The hub orchestrator identifies: okay, there's an issue, there's a clash, we can't talk. And then there's the role of, maybe you could call it, a facilitator, which says: and here is the form you need to fill in to make it so we can talk.
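For the nested-JSON-to-flat case Tim describes, the "form" the orchestrator hands over might amount to a small transform like this sketch; the function is a generic illustration, not the actual tooling:

```python
def flatten(obj, prefix="", sep="."):
    """Recursively flatten a nested JSON-like dict into dotted keys."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}{sep}{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat

record = {"name": "Acme Ltd", "address": {"city": "London", "country": "UK"}}
print(flatten(record))
# {'name': 'Acme Ltd', 'address.city': 'London', 'address.country': 'UK'}
```

Renaming properties to the target hub's expected attribute names could then be a second, equally mechanical step driven by that hub's local rule list.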
[00:29:33] Unknown:
And then, to use an example, say you have a name field in a particular record, and in one region it's just a flat text field: you put in whatever characters you want, whatever types of spacing or hyphenation. But then in another hub, you have a name field that requires a first and a last name, which obviously doesn't fit a number of different localities and their particular approaches to how naming is handled. So how do you approach that type of challenge, where one hub is enforcing just a flat text field that supports Unicode, and the other one expects, you know, first and last name and maybe only accepts ASCII, and being able to handle transformations? Is it a case where you can say: I refuse to make this transformation, and so this data won't be transmitted? And if so, how do you signal that to the person who's trying to consume that data for a particular analysis?
[00:30:30] Unknown:
Exactly. It's a really good example. So, first of all, I guess there would be an attempt by the hub orchestrator to say: okay, I'm going to give you a tool that allows you to convert between character sets. Now, there are some character sets that have such low fidelity that you can't increase them. For example, if you've used an encoding that ruins your accented characters, like we have here in Denmark, then it's sometimes hard to reconstruct that using tools. So you've already answered the question, which is: there are some times where it'll just say, no, I'm sorry, but we can't do this. Right? I can't give you the tool to do it. You've got to manually address it yourself; I'm just going to reject this. I mean, using my analogy again with travel, I'm sure there are many times where a person travels into a country and when they get there, it's just: I can't help you. There's no form you can fill out.
It's just: we're not compatible. I can't take data from you. And a good example of this: I would imagine, and I have experienced at least, that most large enterprises, and you can actually expand this out to most businesses in general, probably want a global view of their data. They want to say: yes, even though I'm across twenty countries, I need to do global reporting. So you could imagine the global hub would then say: okay, all hubs below me, start to send me your data. And the hub orchestrator will say: okay, got it. I can go and help you with the looking up of that data and the transport. But the global hub is going to have some rules and policies set. A good example would be data quality scores: I'm not going to accept anything below 90% completeness. I'm not going to accept anything below 95% accuracy.
And if the data coming in doesn't meet that, I will then give those hubs the tools to be able to get to that level, and that will take time. But essentially, if I'm going to accept that data, it needs to at least meet these levels. And you can probably imagine the same could be applied to things like data policies for governance or compliance, such as: if you send me data and it's personally identifying, I'm just not going to take it. So I'm going to give you tools that allow you to not only identify why I'm not taking that data, but to be able to rectify the situation as well. So that's probably a good example of a case where, many times, in that communication or orchestration that the global headquarters is asking for, you get to a point where it just says: nope, I'm sorry, I can't help you. I will just downright reject that data. And in this architecture, is it primarily just a means of storing and communicating about and transmitting data? Or do the individual hubs also provide capacity for computation
[00:33:40] Unknown:
where, in the instance that we were discussing of, for instance, the name field, where there is no way to cleanly convert between one representation and the other, you can push your analysis down to that hub to perform the computation that you want and then return the results back in sort of a scatter-gather approach? Yeah. I mean,
[00:34:00] Unknown:
I'll be honest, I haven't even thought about that. But when you think about the architectures that we use, in CluedIn we use Kubernetes as our orchestration framework for our containers and for deployment, and all the goodness that comes with that. And so it's not a hard stretch to then say: well, can all of those nodes just represent a node pool? And can I use Kubernetes to register those as node pools? And, you know, if I need that extra resource, I could look out to Kubernetes, and to be honest, Kubernetes does this for us, and say: what pods are available? I've got someone demanding four gigs of RAM and two CPUs.
Who's got it? And you could, in a way, utilize that as a farm to distribute out, or at least balance, the load of infrastructure, and try to get efficiencies not only on the infrastructure, but on the costs associated with it. So I'm sure you could bend this to represent something that could do something like that. And another question
[00:35:07] Unknown:
about this data hub architecture that can potentially lead to challenges in data governance and management is the question of data proliferation. If you're copying information between the different hubs, how do you track what records are being used where? And particularly in the case where you have an update to a set of information, how do you then replicate that across all the different areas where it has been copied? Or, in the event that a record is deemed false, or under GDPR where somebody has requested deletion of their data, how do you handle removing that information from all the different places where it's been replicated to? Let's just start with this: it's really complex. Right? It's a really
[00:35:52] Unknown:
complex challenge, because now we're really pushing this architecture to its limits; I wouldn't even say these are edge cases. So, you know, for example, in CluedIn we have this idea of a mesh API, which essentially says: hey, if I've pulled in all the data from all the operational systems, and then someone identifies that Tobias's job title is wrong, well, because I know where it came from, I also have that lineage of: okay, if I did change it, what records in the source systems need to be updated? Now, the way our mesh API works is in two ways. First of all, you can automate that process behind a workflow. So, of course, you can imagine if I've got a record in from, I don't know, a CRM system like Salesforce, it has an ID associated with it. Great, that's been ingested into our platform, and hey, I've got another record on Tobias from our HR system, Workday.
It's got an ID as well. So in the CluedIn platform, you know, we've figured out these are duplicates; let's merge them. So now I've got one record on Tobias, but two pointers to the source systems. So that's in one hub. Then, really, to span the multiple hubs, essentially what you're doing is just adding extra layers of mapping. So instead of saying: hey, I'm updating the job title of Tobias, and there are two source keys, one from Salesforce and one from Workday, that need to be updated, what you're doing is putting an extra layer of abstraction, or indirection, in there, which is: hey, I'm over here, called the hub orchestrator.
I actually know where all the data is located. Now, each hub won't have access to this journal, right? They'll only be able to look up what they've got access to. Essentially, it's the journal that says: got it, so these hubs are actually where I originally got that data from. So essentially, when you're integrating data, you're mapping up to some type of core, what we like to call a vocabulary or schema. And then, on the reverse, you're unraveling that. You're saying: cool, my goal across the business is that I don't really care or want to know that in Salesforce we call it the job title and in Workday it's called job position. I want to map that up to some type of standardized vocabulary or lexicon, and this is quite normal in the data governance area. But really, when we're starting to write back, or we're needing to delete records, Tobias comes in and says: delete everything across all hubs.
Essentially, what you're doing is unraveling that vocabulary mapping that you've done back to the source hubs to delete those records, and it's really hard. There are so many challenges involved in it. I'll give you another good example. There's this concept of data sovereignty, which is essentially: where is the data located? Typically, depending on who you talk to, they'll also include: okay, from a location perspective, where is that data also traveling? So obviously, the hub concept, or the network, kind of instinctively has this in its design, in that a hub would usually be set up in the region that you're wanting to host that hub in. So in something like AWS, you would go into us-east or us-west, and the same in Asia, for example. And what becomes complex with this are these rules of sovereignty.
I'll give you one country that's quite complex, and that's Germany. Germany is very strict on where it can host data. So, for example, if I was a hub in the US and I said, give me the data from Germany, you can't just easily or legally transfer that data to US servers. So then the data hub framework becomes really interesting, because we've got the backing of something like Kubernetes from our infrastructure perspective that says: hey, if you need any more hubs, I can spin those up. That's just an extra node pool, and, you know, your Helm charts have already given me instructions on what a node pool looks like. You can start to think of situations where you might spin up new hubs on demand and tear them down when the processing has been done. So, for example, there might be a rule that says: yeah, you can send data to Germany, but we can't do it the other way around. Got it. So the hub orchestrator knows that rule, or at least you map that into the hub orchestrator.
Let's say that's the direction these hubs can actually talk. So if global, let's just say they are based in the US, says: I need to globally report. It says: not a problem. But to crunch the German data, you either need to use the German infrastructure to do that, and send data from the US to the German infrastructure hub, or, and we've seen other examples of this before, maybe you have to spin up infrastructure in a sovereign country, i.e., take something like Switzerland. So instead of the US sending to Germany and Germany sending to the US, we spin up infrastructure in Switzerland. Both sources send their data to Switzerland, that data is crunched, and potentially that hub is completely torn down. And as you probably already know, Tobias, this is something that Kubernetes does well: spinning up different environments and tearing them down when they're not necessary.
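The write-back scenario Tim walks through, unraveling the vocabulary mapping back to each source system, could be sketched roughly like this. The journal structure, the vocabulary table, and the identifiers are all invented for illustration:

```python
# A merged "golden" record keeps pointers to every hub and source system
# it was built from, so an update or deletion can fan out to each origin.
JOURNAL = {
    "person:tobias": [
        {"hub": "us-east", "system": "salesforce", "key": "0035f000..."},
        {"hub": "emea",    "system": "workday",    "key": "WD-1142"},
    ]
}

# Per-system vocabulary: standardized attribute -> local field name.
VOCABULARY = {
    "salesforce": {"person.jobTitle": "Title"},
    "workday":    {"person.jobTitle": "Job_Position"},
}

def update_everywhere(entity_id, attribute, value):
    """Translate a standardized attribute back to each source system's
    local field name and fan the update out to every holding hub."""
    for pointer in JOURNAL.get(entity_id, []):
        local_field = VOCABULARY[pointer["system"]][attribute]
        # A real system would call the hub's API here; we just show the target.
        print(f"in hub {pointer['hub']}: set {local_field}={value!r} "
              f"on {pointer['system']} record {pointer['key']}")

update_everywhere("person:tobias", "person.jobTitle", "Principal Engineer")
# Deletion is the same walk without the vocabulary step: issue a delete
# for each (hub, system, key) pointer, then remove the journal entry.
```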
[00:41:41] Unknown:
And for somebody who's actually interested in building out their own implementation of this data hub architecture, what are some of the technologies that are useful for implementing things like the hub orchestrator, the data storage layer, and some of these automated transformations? And what are some of the considerations or edge cases that they should be aware of as they're starting to plan out some sort of deployment like that? Yeah, good question. So I'll start with this: having a hub architecture
[00:42:11] Unknown:
in a traditional on-premise environment is quite hard, as you can imagine, especially if we're doing location-based hubs. And, you know, most enterprise businesses, even though they might be in the cloud, will quite often still have an internal VMware cluster or some VM environment set up. And so one of the technology baselines would be that if you are a completely on-premise business, the data hub architecture becomes a little bit more complex to achieve. Not unachievable, but much more complex. I've already mentioned a couple of the core elements or technologies that you would want to think about. Kubernetes, of course, and, coming with that, containers in general is one of them. Why? Because you might want to spin up different environments, or, as in the example you used of what you could use this architecture for: hey, I'm a hub in London and I'm doing absolutely nothing, but Germany is processing data and it's overloaded.
Give me your work. And obviously, if those sovereignty rules are set up such that we're allowed to do that, we're allowed to transfer data between those hubs, then fantastic. So the thing is that if you've played with Kubernetes and containers before, you're actually a majority of the way there. All we're really doing is applying the same methodologies, the same ideas, to the hosting of data, where a container is not necessarily a product. So we're not talking about a container like Spark or Kafka; actually, you could represent the container as a hub, where it's not just about technology, but also the data. And it's about the policies of how different Docker containers work with each other. Isn't that why we have Docker Compose? Docker Compose is there to say: this is how I compose my hub network. And then we use something like Helm to say: oh, and this is how all of the charts are set up for Kubernetes.
So, you know, things like the amount of RAM and CPU that we allocate to the different nodes in the cluster, or the pods. So then you really start to see that, if you can understand that, it's very similar to the data hub architecture itself. And then the other question
[00:44:38] Unknown:
of the hubs and how they relate to each other is whether you find it useful or necessary for each of the different hubs to be implemented in a homogeneous fashion where everything is using the same set of technologies? Or because of the abstraction layer of the hub orchestrator, is it possible to have more of a heterogeneous implementation where each of the different hubs can use the technologies that are most well suited for the particular teams who are maintaining them?
[00:45:08] Unknown:
So, being completely transparent, I have not even humored the thought. But when you think about it, if I use the technologies that I was talking about in the previous questions, you could argue that if your application is built well and the proper abstraction layers have been put in, you could say: hey, at any point, if I want to switch out SQL Server for MySQL, I should be able to do that. Which kind of plays into: got it, so if I've got a hub that's set up with my own custom implementation, all I'm really doing is using the hub architecture as interfaces, some type of schema, some type of agreement, that says: listen, to play in this network, to play in this hub, you need to fill in the following things. You need to tell me rules on data quality. You need to give me policies. And you can imagine that the hub orchestrator that we use is very technology agnostic.
It's essentially YAML files that have policies. So there's no vendor-specific or technology binding besides, I guess you could say, that ubiquitous language, or at least something that's not bound to a particular database type or a particular programming language. So then it makes it a little bit easier to grasp the fact that these hubs don't need to be the same technology. Rather, they just need to adhere to what the hub orchestrator is telling them to adhere to, so that it can do its job and tell these different systems how to do their job. Now, here's where it gets a little bit more complex: it's also the role of the hub orchestrator to say, I'm going to give you the tooling so that we can talk. I.e., your data is not complete enough; your data is not accurate enough.
And so, you know, good luck with that. So then you would prompt the question: okay, is it the job of the hubs to have those tools, or is it the job of the hub orchestrator to provide kind of ubiquitous tools that also aren't necessarily technology specific? And I find that potentially hard to achieve, because I would argue that if you're not binding yourself at least to some consistent tooling around hubs talking to each other, you run the risk of saying there are twenty different ways to solve this challenge, and every time I want to talk to a new hub, I'm given a new tool to do it.
So it feels like one of those things that needs to have more discovery around it. And maybe, really, the hub orchestrator is responsible for giving that consistent tooling so hubs can chat.
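Since those hub policies are described as essentially YAML files, here is a made-up example of what such a policy document and its parsing could look like. The schema is entirely hypothetical, not CluedIn's actual format, and the snippet assumes the third-party pyyaml package:

```python
import yaml  # third-party: pip install pyyaml

# Hypothetical policy document a hub might publish to the orchestrator.
POLICY_YAML = """
hub: emea
inherits: global
entry_rules:
  min_completeness: 0.9
  min_accuracy: 0.95
  reject_personal_data: true
"""

policy = yaml.safe_load(POLICY_YAML)
rules = policy["entry_rules"]
print(f"hub {policy['hub']} admits records with "
      f">= {rules['min_completeness']:.0%} completeness")
```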
[00:48:10] Unknown:
And when is the data hub approach the wrong idea, and when would it be easier to use just a different style overall, such as a monolithic approach?
[00:48:24] Unknown:
So, you know, the funny thing is we work with a lot of customers who are large, and that can be measured in multiple ways, revenue or employee count; in this case, I'm talking about employee count. And, you know, when they talk to us about their business, some of the first things they'll say is: Tim, we're not actually a complex business. Right? For example, we are in facility services, where we clean people's offices or we do the catering at people's offices. So inherently, we're maybe not a complex business like shipping, which requires lots of logistics and good timing and schedules and things like that. And so I would definitely argue that with any win that you take from engineering, you always take some losses. So really, if you find that your business is not complex, something like this is overkill or over-engineering. And, of course, as I said before, I would argue the goal of a business is not necessarily to become more complex; it's definitely the goal of businesses to expand and grow and hopefully go into other countries, and there's inherently some complexity that comes with that. But if you find that you're not a complex business, that there aren't, you know, hundreds and hundreds of rules that change per region, and maybe change in different departments,
something like this seems like it would be absolute overkill.
[00:49:57] Unknown:
And are there any other aspects of the data hub architecture or some of the ways that it's being used or the benefits that it provides that we didn't discuss yet that you think we should cover before
[00:50:09] Unknown:
before we close out the show? I think the one thing I'd like to just reiterate is that if you were to Google "data hub," there are multiple different interpretations. And if you see our interpretation, you could have easily called it, oh, that's a data network, or that's a data mesh, or that's a data service mesh, and I would back that up. I would say, yeah, those are two other possible ways of interpreting what a data hub is for. So I would say, when you're doing your exploration, just be wary that people are interpreting this quite differently. And I believe that there's also a
[00:50:49] Unknown:
recent project out of LinkedIn called DataHub that they're using for their data discovery engine as well, just to muddy the waters a bit further. Yeah, exactly. And it makes sense, doesn't it? Like, the hub, that's where I go to discover where the data is. And so,
[00:51:03] Unknown:
yeah, I agree. It's maybe a term that in two or three years we'll all congregate on one interpretation
[00:51:10] Unknown:
of it. And I think that would help clear up the situation. Well, for anybody who wants to follow along with the work that you're doing or get in touch, I'll have you add your preferred contact information to the show notes. And as a final question, I would just like to get your perspective on what you see as being the biggest gap in the tooling or technology for data management today. So I
[00:51:31] Unknown:
think education. Education is a big thing. And here's why I say it: I see a lot of businesses taking the traditional approaches to data. It makes sense, right? It's what worked at my last workplace; let's just do that same thing again, maybe with a different vendor, maybe with a little bit more modern technology. One of the things I think is really missing is education on the data management space itself. And, of course, you've got great services like Udemy and Pluralsight that are going out there to train people. But I actually think we don't have enough education that's cracked this concept of: well, how do you solve some of the real complexities like we talked about today? You know, we often see things like: hey, here's how you build a pipeline.
Here's how you migrate data from source to target. But what you don't really see are deep dives into: yeah, let's go after the big challenges. Because quite often what we're hearing from our customers is: I'm okay with solving the challenges of data if it's two systems, but, I mean, that was me fifteen years ago. I've got 200 systems; I've got 300 systems that I know of. Education around how to solve those large data management challenges is where I think there's a huge gap. Well, thank you very much for taking the time today to join me and share your thoughts on the approach that you're using for being able to enable
[00:52:57] Unknown:
data management across global businesses and handling these issues of regional compliance and the challenges of data sovereignty and things like that. It's definitely a big problem, as you put it, and something that is worth exploring a bit deeper. So I appreciate all of your time and efforts on that front, and I hope you enjoy the rest of your day. Perfect. Thanks, Tobias. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction to Tim Ward and Data Hubs
Tim's Journey into Data Management
Challenges in Data Management Projects
Goals of Data Hub Architecture
Core Elements of Data Hub Architecture
When to Implement Data Hub Architecture
Migration Strategies to Data Hub Architecture
Topologies and Latency in Data Hub Architecture
Handling Transformations and Data Quality
Computation and Data Proliferation Challenges
Technologies for Implementing Data Hub Architecture
When Data Hub Architecture is Overkill
Different Interpretations of Data Hub
Biggest Gap in Data Management Technology