Summary
The data mesh is a thesis that was presented to address the technical and organizational challenges that businesses face in managing their analytical workflows at scale. Zhamak Dehghani introduced the concepts behind this architectural pattern in 2019, and since then it has been gaining popularity, with many companies adopting some version of it in their systems. In this episode Zhamak re-joins the show to discuss the real world benefits that have been seen, the lessons that she has learned while working with her clients and the community, and her vision for the future of the data mesh.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
- Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold.
- Your host is Tobias Macey and today I’m welcoming back Zhamak Dehghani to talk about her work on the data mesh book and the lessons learned over the past 2 years
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by giving a brief recap of the principles of the data mesh and the story behind it?
- How has your view of the principles of the data mesh changed since our conversation in July of 2019?
- What are some of the ways that your work on the data mesh book influenced your thinking on the practical elements of implementing a data mesh?
- What do you view as the as-yet-unknown elements of the technical and social design constructs that are needed for a sustainable data mesh implementation?
- In the opening of your book you state that "Data Mesh is a new approach in sourcing, managing, and accessing data for analytical use cases at scale". As with everything, scale is subjective, but what are some of the heuristics that you rely on for determining when a data mesh is an appropriate solution?
- What are some of the ways that data mesh concepts manifest at the boundaries of organizations?
- While the idea of federated access to data product quanta reduces the amount of coordination necessary at the organizational level, it raises the spectre of more complex logic required for consumers of multiple quanta. How can data mesh implementations mitigate the impact of this problem?
- What are some of the technical components that you have found to be best suited to the implementation of data elements within a mesh?
- What are the technological components that are still missing for a mesh-native data platform?
- How should an organization that wishes to implement a mesh style architecture think about the roles and skills that they will need on staff?
- How can vendors factor into the solution?
- What is the role of application developers in a data mesh ecosystem and how do they need to change their thinking around the interfaces that they provide in their products?
- What are the most interesting, innovative, or unexpected ways that you have seen data mesh principles used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on data mesh implementations?
- When is a data mesh the wrong approach?
- What do you think the future of the data mesh will look like?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Data Engineering Podcast Data Mesh Interview
- Data Mesh Book
- Thoughtworks
- Expert Systems
- OpenLineage
- Data Mesh Learning
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Have you ever woken up to a crisis because a number on a dashboard is broken and no one knows why? Or sent out frustrating Slack messages trying to find the right dataset? Or tried to understand what a column name means? Our friends at Atlan started out as a data team themselves and faced all this collaboration chaos. They started building Atlan as an internal tool for themselves. Atlan is a collaborative workspace for data driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets and code, Atlan enables teams to create a single source of truth for all of their data assets and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker, and more.
Go to dataengineeringpodcast.com/atlan today. That's a t l a n, and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $3,000 on an annual subscription. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.
Go to dataengineeringpodcast.com/linode today. That's l i n o d e, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey. And today, I'm welcoming back Zhamak Dehghani to talk about her work on the data mesh book and some of the lessons that she's learned over the past couple of years since she first introduced the idea. So Zhamak, for anybody who hasn't already listened to the past episode that we did, can you just give a bit of an introduction?
[00:02:12] Unknown:
Hi, Tobias. It's great to be back. Yeah. I work at Thoughtworks as the director of emerging technologies. For the last few years, I've been busy initially hypothesizing about data mesh, and then over the course of the last few years, implementing it, refining it, and validating it with our clients globally. Do you remember how you first got involved in working in the area of data? I had a bit of a chance, I guess, to think about where it was when I first, you know, started with data, and it was when it wasn't cool at all. So the first time was probably about a couple of decades ago, embarrassingly. At university, my major was expert systems, like AI based expert systems, which looked very different from what we call AI today. And then, I guess, mid career, I had an opportunity to work on a fascinating distributed system product.
Basically, what you would call today a data intensive application. It was gathering information from all sorts of systems across large networks, from routers and databases and ATMs and whatever you can imagine, to be able to monitor them. So I got engaged in building time series databases, streaming, building a query language on top, an aggregation system, analytics on top. So the full stack was, you know, developed greenfield, and that's before the cloud and all of the cloud services we have. That was a fascinating experience, which gave me a sense of the depth of what is involved in working with data intensive applications. And more recently, the last few years, my focus has been on how we do data sharing at a large scale in the complex environments of the large organizations we work in today.
[00:03:47] Unknown:
And so for people who aren't familiar with the ideas of the data mesh and haven't listened to the episode that we did almost 2 years ago now and haven't read the book, can you just give a bit of a recap about the principles of the data mesh and some of the story behind how you came across the ideas and principles and some of the potential benefits that could be realized by organizations
[00:04:08] Unknown:
who implement it. Well, data mesh was born really as a hypothesis, as an answer to the challenges I was observing, you know, a few years back working with large technology and data forward clients at the time. The challenges were, you know, in the discord between kind of the problem space and solution space. What do I mean by that? I was working with organizations that were complex. They had many functions, many domains. They had, you know, big data aspirations. So they had a large number of use cases for how they wanted to use data in ML. And because of their complexity, they had data coming from many different sources, from many touch points with their, you know, customers, partners, ecosystem players.
So the solutions that they had weren't really meeting the needs and aspirations and the complexity of their environment. And the solutions back then, you know, a few years back, it was still early days of a lot of companies moving to cloud. So some of them had already moved out. They had data lakes or data warehouses on cloud. Some were still running their solutions in house. So I looked at the solution space to see what were the bottlenecks that were stopping these organizations from getting value from their data. So on one hand, you know, they were spending a large amount of investment, but on the other hand, they weren't really getting results. So data mesh at a very high level is a decentralized socio technical approach to accessing, managing, and sharing data, essentially for ML and analytics use cases.
Then you asked about the principles. So if I reduce it to its 4 principles, there are basically 4 very generic principles underpinning that. Of course, each of those principles leads to very specific implementations and, you know, a technical kind of manifestation we can talk about. But at a high level, the principles are, first and foremost, the idea of distribution of data ownership and data sharing, and an architecture for data sharing, to independent, autonomous, domain oriented teams. Basically, following the seams of your organization, the way that the organization and business decomposes itself to get scale; usually these are your business domains, business functions.
And following that with teams, technical and data, I suppose, kind of informed or capable teams that not only build the applications, microservices, or legacy systems that support those business functions, but also the technology required to share data for analytical use cases. So then if you follow that principle, the principle of domain oriented ownership, you may say that, well, that may not look very nice because, you know, the teams may end up kind of siloing their data in their own databases. How is that gonna work? So then the second principle of data as a product tries to change our relationship with the data to say, well, data is not for you to just hoard and collect in your little database for your own use cases. Data is there to share as a product with the rest of the organization.
And then, you know, if I follow that principle, you might say, well, how is it possible that each of these domain teams can have the capabilities to build all of this data infrastructure that's needed to share data at scale, or to use data peer to peer at scale for their analytics purposes? And that's the principle of a new look at the self serve infrastructure and the platform to give autonomy to these teams, to make it feasible and cost effective for generalist kind of tech developers or app developers to be able to work with data. So self serve data infrastructure is the 3rd principle.
And the 4th principle was introduced later on because, you know, data governance fears chaos. So how do we make sure in this decentralized world, there's still some sort of a global harmony and interoperability between these data products that are being developed by different teams? How do we make sure that privacy is still respected and legal compliance is still applied? So the fourth principle of federated kind of computational governance tries to introduce a governing model, an operating model, as well as kind of technological solutions to allow embedding policies and policy execution in every single data product. So we can still have that, you know, balance and equilibrium between the autonomy of the teams sharing data or using data and the global interoperability and the harmony of the policies that need to be applied cross cutting to all of them. Sorry, that was a long, long explanation, but these are the 4 principles underpinning data mesh.
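To make the computational side of that fourth principle a little more concrete, here is a minimal, hypothetical sketch of what embedding policy execution in a data product's output port might look like. The policy names, fields, and the `serve` helper are illustrative assumptions, not part of any specific data mesh platform.

```python
# Hypothetical sketch: governance policies packaged as code and executed
# locally by every data product before records are served.
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Policy:
    name: str
    check: Callable[[dict], bool]  # returns True when a record is compliant


# Example federated policies: defined once by the governance group,
# applied computationally inside each data product.
PII_FIELDS = {"email", "phone_number"}

POLICIES = [
    Policy("no-raw-pii", lambda record: not (PII_FIELDS & record.keys())),
    Policy("has-owning-domain", lambda record: "domain" in record),
]


def serve(records: Iterable[dict]) -> List[dict]:
    """Apply every embedded policy to each outgoing record."""
    compliant = []
    for record in records:
        violations = [p.name for p in POLICIES if not p.check(record)]
        if violations:
            raise ValueError(f"policy violations {violations} in {record}")
        compliant.append(record)
    return compliant


if __name__ == "__main__":
    print(serve([{"order_id": 42, "domain": "orders", "total": 99.5}]))
```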
[00:09:00] Unknown:
And so when we first talked, it was in July of 2019, and it was, I believe, fairly close to when you had first posted and posited these ideas and this thesis about the data mesh and some of the potential benefits of being able to manage this domain ownership of these data products. And now that there's been more time for you to be able to reflect on it, and for people in the ecosystem to be able to experiment with the ideas and implement them and figure out how the tooling aligns to these principles, what are some of the ways that your thinking around the problem has shifted or evolved? And what do you see as some of the kind of core elements that have been proven out, and what are the pieces that you see as being in need of further refinement?
[00:09:50] Unknown:
Yes. When we talked, I think you were probably one of the very first people that I talked to after I wrote the first article. So I'm grateful for your platform, letting my voice be heard, I suppose. I guess at the principle level, the main change from the first article I wrote was the introduction of the 4th principle. Like, very early on, I think the idea of decentralization of data ownership as a principle hasn't changed. The idea of the data as a product and sharing didn't change. And then the platform at the principle level, those 3 existed in the very first writing. And the reason they lasted through the test of time was that these are not novel new ideas. These are the principles underpinning scaled solutions that we have created in the operational space for the last 2 decades to respond to the complexity of digitalization of organizations.
So I didn't do anything novel there. I just thought, how would we apply those to the data world? But the principle that I had to introduce, and I introduced later on in a second article I wrote around kind of principles and logical architecture, was this idea of the governance. Like, what does governance look like? And as a technologist, of course, I was very much focused on automation, automation, automation. I initially said, oh, let's make governance invisible and automate everything. And that's part of the kind of computational aspect of the governance: how can we think about, you know, the infrastructure that's running these data products? So that part kind of evolved. And then I realized that one of the concerns and challenges with governance is actually to do with the operating model. The roles, the responsibilities, who does what, what's the role of governance.
And I guess, we live in the US; we have somewhat of a federated governance model. Or if you look at Europe, the same thing, or the UN. So I hypothesized on the application of the federated decision making model that we've seen, I guess, in the world to an organization. And again, I think these are areas to be yet refined further. But the federated operating model and then computational governance became its own, I guess, important enough to be a principle. On the technical front, a lot has changed because we've been building this for the last few years, and every implementation of it looks different. So, yeah, I think these principles are just getting more refined and well understood as we implement them.
[00:12:35] Unknown:
And one of the core kind of pain points that you called out in the initial posting and that has held true is the fact that data teams have typically been very sort of underleveraged, but overworked, because of the fact that it's generally oriented around a centralized team of people who are responsible for all data across the organization. And so you lose a lot of the context and domain expertise as it traverses these various boundaries, and you have these very labyrinthine point to point connections between various systems. One of the things that we talked about in the previous episode was the idea of still having a kind of data platform team that's responsible for the technical implementation, but building it in a way that each of the different domain owners is able to self serve on that. And one of the biggest changes that's been happening over the past couple of years is this rise of the so called modern data stack, where democratization of access and giving everybody the power to have some measure of control or participation in the data of the business has become kind of paramount. And I'm wondering what you see as some of the ways that this modern data stack and the decomposition of all of the different layers has either enabled or potentially hindered people who are trying to realize this idea of the data mesh in their own organizations?
[00:13:59] Unknown:
I think the way I see it is that, of course, all of the advances we've had in the modern data stack have been great. Right? They try to make the life of a data engineer easier. So as a result, it's great for data mesh. What I would say is that we are still missing a very crucial element in really being able to mobilize the largest population of technologists, the app developers that are essentially the source of the data and the end consumers of the data, right? In many of these kind of data driven organizations today, their applications are augmented with ML trained models and so on. So they need to have access to the data.
And many of these applications are generating the data that finally trains those models. So the feedback loop between the app development and analytics has become tighter and tighter. However, much of the data tooling that I see today is still assuming a division of responsibility and a division of infrastructure between what we call a modern app development stack and a modern data stack. And some of the tools are actually increasing that gap between the 2, and some are closing the gap. So for data mesh to really take off, I think one of the pieces is a kind of platform. And I know we use the platform as one thing, but it's not one thing. Right? It's a collection of tools that play really nicely with each other. I think the collection of tools needs to imagine the life cycle of a data product from generation and creation to consumption, to creation of ML models, and then consumption back into applications as one feedback loop, as one journey lifecycle.
And look at kind of how the platform that enables this, now, cross functional team needs to play nicely with the modern application stack, needs to integrate nicely with how data is emitted from these applications or how it is fed back to these applications. The solutions we have today have made an assumption, rightly so, because that's how the world was when they were created, that, you know, application data will get to us somehow. We build these pipelines, and it will get to us. That's not the focus. The focus is, once we got this data, what are we gonna do with it? How are we gonna model it? How are we gonna enable the downstream, that last mile to the data scientists and data analysts? So I think the solutions or platform capabilities that work nicely with data mesh are the ones that close that gap. A very simple example, and it's no criticism of any particular vendor. A lot of application developers these days are very well familiar with, you know, containerization and running a Kubernetes cluster, not running the cluster itself, but running their applications on such a cluster, monitoring their applications with certain observability tools. But if this cross functional team now needs to provide data products, they need to completely shift in the other direction and go and, I don't know, run a VM based cluster of Spark and then run Spark jobs somewhere else and then monitor those somewhere else.
And even simple standards, like OpenLineage, and the folks that are part of that group, forget that this tracing that happens for the data needs to start from the application. So we've got a set of open tracing standards for application tracing and another standard, OpenLineage, for data lineage. Are we thinking about connecting the two? Perhaps not as much as we should. So, yes, I think it's great we're investing a lot of money in creating data platforms, modern data stacks, but not paying enough attention: how does that play nicely with the
[00:18:01] Unknown:
modern app development stack? Yeah. I definitely agree that the continued division of software and application development and delivery from the development of data pipelines and data analytics is still too segmented, and there needs to be a much more native integration between the 2, where, you know, one of the challenges of application development is that it has become increasingly sophisticated and complex, and so the responsibilities of application developers are continually growing. To then add another responsibility onto their pile, that as you're designing and building the application, you need to be considering as a first class concern how you are actually going to expose the analytical information that's necessary for other consumers. And so I think that that needs to become part of the standard set of requirements for the application delivery and the definition of done before we can really be able to fully realize the proper value of data within our organization.
Otherwise, we're in the situation where we are now, where the application developers build the applications, they generate all of this data, and then they just kind of drop it in the database and say, good luck. Not my problem anymore. I did what I'm supposed to do. I have an application that my end users can take advantage of. But I think that as the users of these applications become more sophisticated and data literate as well, that also continues to drive the need for these analytical products, which then feeds back into the need for these applications to be able to embed that as part of the experience. And we've been seeing that with some of these embedded analytics solutions. I recently talked to the folks from Cube.js, which is a way to be able to actually expose an API of your analytical data that you can, you know, power some of these charting libraries with. You know, there's the idea of reverse ETL or operational analytics, where you need to feed all of the information from your data warehouse back into the SaaS platforms that your business users are relying on. So I do think that there is that foundational need of software systems to have analytical use cases embedded as a first class priority in their design and delivery.
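Picking up the earlier point about OpenLineage and application tracing needing to meet, here is a hedged sketch of one way a data product's lineage event could carry the application's trace ID so the two views can be joined later. The event shape, the custom facet, and the `emit_lineage_event` helper are assumptions for illustration; they are not the actual OpenLineage or OpenTelemetry APIs.

```python
# Hypothetical sketch: propagate an application trace id into a data
# lineage event so app-side tracing and data-side lineage can be stitched.
import json
import uuid
from datetime import datetime, timezone


def emit_lineage_event(job: str, inputs: list, outputs: list, trace_id: str) -> dict:
    """Build a lineage-style event that keeps a link back to the app trace."""
    event = {
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {
            "runId": str(uuid.uuid4()),
            # assumption: a custom facet carrying the originating trace id
            "facets": {"appTrace": {"traceId": trace_id}},
        },
        "job": {"namespace": "orders-domain", "name": job},
        "inputs": [{"name": name} for name in inputs],
        "outputs": [{"name": name} for name in outputs],
    }
    print(json.dumps(event))  # in practice, send this to a lineage backend
    return event


# Inside the application, the current request's trace id (from whatever
# tracing library is in use) is handed to the data product side:
emit_lineage_event(
    "orders_to_order_events",
    inputs=["orders_db.orders"],
    outputs=["order_events.daily"],
    trace_id="4bf92f3577b34da6a3ce929d0e0e4736",
)
```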
[00:20:13] Unknown:
Absolutely. You said the right word. Like, I love your thoughts here, because unless that day comes, and that day only comes when analytics and ML are embedded into the application, embedded in every business function, none of this will work. We will be in this, you know, hypocrisy of putting data driven all over our mission statements, and yet externalizing the responsibility of anything to do with data to a different team away from where the action really is happening. Right? So I completely agree with your statement about embedding intelligence, whether it's analytics or ML, into everything we do, including the applications we build.
[00:21:10] Unknown:
And so the idea of productization of data is at the core of the idea of the data mesh. You know, software applications and products are intrinsically delivered as an overall experience. But the kind of driving force that initially led to this schism between online transactional systems and sophisticated analytics is the fact that the data technologies that application developers rely on in order to be able to have responsiveness and flexibility in how they can build and deliver the experience are not conducive to being able to execute these heavyweight analytical processes on them. So that's why we ended up with the, you know, row oriented relational databases and column oriented data warehouses for these 2 different concerns.
And I'm wondering what you see as the potential for being able to bridge that kind of fundamental divide in terms of the data access requirements, while being able to still design and build systems that are able to natively work across those boundaries without having to have this very painful and error prone system of point to point integrations that are the responsibility of, you know, some third party, whether it's external or within the organization?
[00:22:30] Unknown:
That's a really good question. I think where we are in the arc of kind of innovation, database innovation, I completely agree with you that the underpinning kind of technology for OLTP or transactional systems and the modes of access for building a database for a microservice that's running your website, which has lots and lots of reads and writes, small reads and writes, versus, you know, building a storage technology that runs analytical workflows, perhaps not as many, but 1 or 2 reads, not many writes, but very heavy reads, large scale reads. At a physical level, there are 2 different sets of technologies that support those. And I do not think that, oh, we have to have one universal database to solve our analytical data problem.
I think what we can do, though, is still respect those differences. And in my mind, data mesh, at this point in time, respects those differences. Say, at the physical level, at the infrastructure level, yes, you do have different data storage. Even data modeling, the modeling of the data for your, you know, ecommerce application, as you said, row based, relational databases for atomic transactions. That database should be optimized for that application; it should not be optimized for an analytical workload. But how can we enable extension of these applications in a way that now they do externalize data and access modes for analytical purposes?
So at this point in time, I think what we have implemented and what I have kind of hypothesized is this idea of data quantums. You know, an architectural component that encapsulates the storage, the modeling, the serving of the data, as well as the policies that govern that data, designed for analytical access. And that kind of data quantum sits next to or close to the source applications. In some cases, data quantums are very close to the application that is the source of their data. And in some cases, they're not. You know, so in some cases, they're downstream aggregates of multiple upstream data quantums. Or even further downstream, they are providing the output of a machine learning model, perhaps. So in the case that we were just discussing, the case where data quantums or data products are more aligned to the source being an application, then how can we assure that the integration between those 2 is a bit closer, and we don't have a forest of pipelines pushing data through, far away from the application developers? And I think the very first step is to have the same team be responsible for both of these. Just by the fact that the people are sitting next to each other and have the same objective; the objective is serving this domain's function, and part of the domain's function is sharing its data.
Sitting together, I think, is the first step to get the knowledge of data modeling and data sharing close to the knowledge of the data modeling for the application, which is the source. And then from the technical perspective, I think the technology is already there. Like, I mean, we have event streaming, which can, you know, stream the events out of the system as it's updating its own data storage. It can provide domain events that then the data quantum will capture, will summarize, will transform into whatever, you know, columnar format or whatever format the modes of access it supplies require.
If it's a legacy application, you know, we have change data capture tools to do that. So I think the integration between those 2, once it's done by one team and the life cycle of that integration is managed by one team, the technology exists. I mean, it can certainly be improved, but the technology exists.
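As an illustration of that last point, here is a small, hypothetical sketch of a source-aligned data product owned by the same team as the application: it consumes the domain events the application already emits and materializes them into a columnar file as one analytical output port. The `consume_domain_events` generator stands in for whatever event streaming or change data capture mechanism the team actually uses.

```python
# Hypothetical sketch: a source-aligned data product that turns the
# application's domain events into a columnar, analytics-friendly dataset.
import pyarrow as pa
import pyarrow.parquet as pq


def consume_domain_events():
    """Stand-in for an event stream or CDC feed owned by the same domain team."""
    yield {"order_id": 1, "status": "placed", "amount": 42.0}
    yield {"order_id": 1, "status": "shipped", "amount": 42.0}
    yield {"order_id": 2, "status": "placed", "amount": 17.5}


def materialize(output_path: str) -> None:
    """Capture, lightly transform, and serve the events in columnar form."""
    rows = [
        {**event, "amount_cents": int(event["amount"] * 100)}
        for event in consume_domain_events()
    ]
    table = pa.Table.from_pylist(rows)  # column-oriented representation
    pq.write_table(table, output_path)  # one analytical output port


if __name__ == "__main__":
    materialize("order_events.parquet")
```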
[00:26:31] Unknown:
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it's often too late and the damage is done. DataFold's proactive approach to data quality helps data teams gain visibility and confidence in the quality of their analytical data through data profiling, column level lineage, and intelligent anomaly detection. DataFold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. DataFold integrates with all major data warehouses as well as frameworks such as Airflow and dbt and seamlessly plugs into CI workflows.
Visit dataengineeringpodcast.com/datafold today to book a demo with DataFold. Yeah. I think that, to your point, it doesn't need to be all within the confines of the application that the analytics is also done, but we do need to have much more well designed and well enforced contracts at these boundary layers with the application and the downstream analytics consumers. And, you know, the database schema is not a firm enough contract because it's subject to change at the needs of the application developer, or you're in the business of restricting the amount of change that's possible because it will break other downstream systems, which hampers innovation in the application.
And so as we were just discussing here, one of the things that comes to mind as to how to make this an easier lift for application developers is, you know, as software engineers, we rely a lot on different frameworks to handle a lot of the, you know, boilerplate common requirements. So Django, Rails, what have you, have become widely popular because of the fact that they handle all of the concerns of being able to terminate web requests, handle, you know, cookies and sessions and query parameters, database connections. And I think that one of the ways that this can become an easier solution for people is if some of these frameworks, or new frameworks, evolve that have these analytical outputs as a first class concern within them, so that it's just a natural part of the development cycle, so that I don't have to think about, okay, I need to build a new API, and I need to think about, okay, what are the access patterns? I need to be able to do both bulk reads and incremental reads and be able to maintain the watermark of when this client last read so that I know what new data to send to them; having that just be baked in as part of the kind of standard boilerplate, you get this out of the box kind of a thing. But to the point too about these sort of enforceable contracts, I think that that's one of the pieces that's still very weak in the data ecosystem, and it's becoming more prevalent, but it's not easy enough yet. It requires a lot of upfront work and design and, you know, collaboration across the organizations. And it's not just a framework where you say, okay, this is what I'm going to do, this generates the contract, so now as a downstream consumer, I say, okay, this is the shape of the data. I can generate the schema from that. Now I'm going to transform it, and, you know, this is the new contract for this other downstream consumer.
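To make the contract idea a little more tangible, here is a minimal sketch of the kind of explicit, versioned data contract an application framework could generate and a downstream consumer could validate against. The field names and the `validate` helper are purely illustrative assumptions.

```python
# Hypothetical sketch: an explicit, versioned contract for the analytical
# output of an application, checked against every outgoing record.
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FieldSpec:
    name: str
    dtype: type
    nullable: bool = False


# Version 1 of the contract for an imaginary "order events" output port.
ORDER_EVENTS_CONTRACT_V1 = [
    FieldSpec("order_id", int),
    FieldSpec("status", str),
    FieldSpec("amount_cents", int),
    FieldSpec("customer_segment", str, nullable=True),
]


def validate(record: dict, contract: List[FieldSpec]) -> List[str]:
    """Return a list of contract violations for a single record."""
    errors = []
    for field in contract:
        if field.name not in record or record[field.name] is None:
            if not field.nullable:
                errors.append(f"missing required field {field.name!r}")
            continue
        if not isinstance(record[field.name], field.dtype):
            errors.append(f"{field.name!r} is not of type {field.dtype.__name__}")
    return errors


print(validate({"order_id": 7, "status": "placed", "amount_cents": 4200},
               ORDER_EVENTS_CONTRACT_V1))  # -> []
```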
[00:29:37] Unknown:
I completely agree with you. I mean, I love both of those points that you raised. One was the extension to the kind of application development frameworks that allow at least emitting and externalizing the application data to this collaborative kind of data quantum that then can consume that and turn that into an analytical data contract. Right? I think that's a fabulous idea. Hopefully, someone is listening right now, and they go and build this thing, or the first generations of it. So I think that's absolutely one of the missing pieces. Because when I think about kind of the parts of the solution that can be productized, and I do think about this a lot, the part that I always come back to is, okay, how do you bootstrap data mesh? And bootstrapping data mesh in enterprises has many different entry points. But one of those entry points is the applications. Right? The application developers start kind of emitting their data products, or providing at least what I call source aligned data products.
And to bootstrap that, it's not that easy to just buy a product and plug it in off the shelf. Right? So you need to change the application logic. You need to work closely. Or those application developers need to work on their system to be able to externalize the data that then can be consumed by the data product and then, you know, changed into analytical data. And if we can reduce that overhead, if we can reduce the overhead of an application developer now providing those new contracts, I think we have a much better chance of getting adoption of data mesh, because you are accelerating bootstrapping something. And I know that I've talked to a lot of executives, and, like, one of the suggestions I've heard from some of them is that we want to inject data product or data quantum creation into the build pipeline of an application.
So at some point in that build to deploy pipeline, you create that. And I think, well, that's a good idea for some applications, but it wouldn't be the first thing I would focus on. Because as an organization that's trying to kind of figure out how this data mesh thing works for you, probably you don't want to focus on a factory like generation of lots of data products that are hardly used. Perhaps it's an optimization. At the end of it, I think it's an optimization that we can work on once the organization has figured out what's my operating model, what does data product mean to me. And then you can kind of try to, I guess, put machinery in place that makes the data product creation faster and bootstrap that. And the other end that you mentioned, which was the contracts for data sharing, I think that is such a weak area, and I don't think it's the job of one vendor or the other to invent one, because I think these are the glues that connect these pieces of the mesh together.
And these glues need to be standardized, de facto standards or otherwise. These standards should be open. And these standards should respect the nature of the analytical data sharing and the native modes of accessing analytical data, which is not one mode today. Like, you have the analyst, the evergreen SQL like kind of modes of access. You've got the data scientist, as you mentioned, the kind of columnar, feature based data access. You've got, you know, the event folks and the data intensive kind of event based access. So then what are the standards for each of these native modes of access for sharing data? But not only sharing data, but also being able to run analytical workloads on the data storage, on the source. Right? How do they express what workload they want to run? How do they express what's the identity or the permissions? The identity of the agent, the client agent for access. So I think we have a lot of work to do in standardizing. And if I go back to the history of, like, why microservices took off, microservices was also this complex, hairy, you know, system of decomposing solutions across hundreds and thousands of services.
Why did that take off? Well, it came at a very interesting moment in time when we had standardized on basic Internet protocols as a way of communicating. We had, you know, kind of converged on REST back then, and, of course, later on GraphQL and others. But that convergence and those de facto standards that we started kind of adopting created the interoperability across these kind of disparate services and composed them into beautiful and complex solutions. And I think unless we have that in place for analytical data sharing, we will keep
[00:34:18] Unknown:
building systems that are full of friction. I like that you brought us around to microservices because that is definitely the very native allegory in the application development space. And I think it was in the opening sentence of your book that you wrote, data mesh is a new approach in sourcing, managing, and accessing data for analytical use cases at scale, with scale being emphasized. And as with everything, scale is very subjective. And in the microservices ecosystem, you know, you never really want to go directly to microservices, because you need to build the monolith first to see what are the actual natural boundaries of the logic and the problem domains.
And I'm wondering what you see as some of the useful heuristics for determining when a data mesh becomes an appropriate solution for an organization or for a problem space, and some of the ways to be able to effectively identify those boundary points where it makes sense to actually draw those dividing lines to break out these data quantums?
[00:35:18] Unknown:
Yeah. That's a really good question. Definitely, data mesh is not the right solution for everyone. You've got to have the problem of scale. And the problem of scale that data mesh tries to address is the scale of the complexity of an organization. So if you're an organization that has a lot of different business functions, you're a multinational, you know, retailer or a giant tech producer with a lot of different products from mobile phones to, I don't know, laptops and so on. If you are a health care institute, whether you're a provider or a payer, you have to get data from so many different touch points and so many sources, and you have many domains.
And at the same time, you have the aspirations, the use cases, the capability to actually use that data in many different diverse solutions and use cases. Like, if you're a health care, let's say, provider, you want to do population health analysis. You want to, you know, personalize your care based on the cohorts of different patients that you have and their very specific personalized needs. Like, if you have all of these aspirations, and you have the scale, the sources, and the complexity of the organization, and you're also doing mergers and acquisitions, and you're growing, you very likely have been blocked or hindered by your previous centralized solutions.
And if you have the pain of, you know, your data team is unhappy, nobody's happy. Everybody has, like, a frowny face on because, you know, the data scientists that want the data are complaining that data is no longer available. The data engineers are, you know, overworked and under a lot of pressure and underappreciated. They're not really incentivized. They don't know what they're doing in the middle. The application developers are kind of oblivious, but also frustrated that they can't, you know, build those intelligent solutions in, and they're complaining about the data scientists. So if you have those blockers, you know, centralized bottlenecks, and you have the scale, then think about data mesh.
The other question is, okay, at what point? You know, I'm still exploring my business, and now I'm going through that, you know, kind of massive scale of growth, I'm going through some hypergrowth, and now I've grown and I'm exploiting. Where in that curve should I think about data mesh? I don't have a specific answer to say where and when, because it depends on many, many different factors. To give you an example, you might say, okay, you know, the size of the organization and diversity of domains, that's probably a good one to look at. And if your business is growing, that's the one.
You might actually be a health care startup where, from the get go, you want to consume data from many different partners, from many different sources. And you probably just provide one or 2 very hyper specialized things, let's say cancer detection in images or something along those lines. But from very early on, you have this diversity of sources where the sources of data are going to continuously change, change on a different life cycle, a different cadence. So maybe even from early on, because the scale of access to reliable sources is a differentiating, you know, strategic kind of benefit to you, you may wanna think about it earlier, as opposed to maybe an organization that doesn't have that business model. Maybe you wanna think about that once you go through that hypergrowth. So it's a really hard question to answer, but simply, do you have the pain of a bottleneck, a centralized bottleneck? And if you do, data mesh might be useful to look into.
[00:38:54] Unknown:
And I think another interesting element of the idea of the data quantum being the building block of the mesh is that it doesn't necessarily have to be that the mesh exists entirely within the boundaries of your organization. So for a smaller company where it doesn't make sense to split out these domains internally, you might still have this data quantum that is at the boundary of your organization so that you can provide data as a service to other organizations that you're doing business with or to your consumers. And so then the data mesh becomes more than just what is within the walls of your company, but it becomes this ecosystem of data products that is composable and consumable across these business boundaries.
[00:39:37] Unknown:
I completely agree with you. The way I kind of imagine the data quantum, and I know it's one of those big words and it sometimes alienates people, but there wasn't any other word, so I just use this one. The way I imagine it is that it's the unit of value exchange. So if data is the thing that's valuable, the thing that is a product and we share, this is the unit of exchange. And if you think about it, if your mesh is within a large enterprise, then you're exchanging value between different parts of your organization, peer to peer.
But if your system is an ecosystem of partners, and there are a lot of closed ecosystems, or open ones with your partners, you know, then you have some sort of contractual agreements around data sharing. It's the same concept. You are sharing that unit of value, but this time, you're sharing it across a trust boundary. So then when you think backwards, hey, I'm now doing data sharing across a trust boundary, what is this data quantum thing that I need to share? Well, it can't just simply be those columns or rows of data anymore. It must also encapsulate, in my mind, the logic and the transformation that keeps this data alive in a way.
Also, the policies that govern this data: the privacy of that data doesn't change just because we shook hands and exchanged this value with somebody else. So it has to bundle those policies within it. I think that's a fantastic use case to take us to that extreme case where we are exchanging data across trust boundaries and saying, what constitutes data? And I don't call it data. I call it data product or data quantum just to force us to expand our thinking beyond just bits and bytes of, you know, information about Tobias or, you know, whoever we're sharing information about, to what else needs to be bundled within this thing so I can autonomously get value out of it and share it. The other part of it is that, okay, if I want to do this data sharing across trust boundaries, a lot of these big platforms fall apart. I mean, I have seen so many, you know, presentations on data marketplaces, where you share your data products on this marketplace. And they do not respect the fact that these are open.
I don't even like to use the word marketplace because, you know, that evokes different kinds of emotions around data sharing. But these are open ecosystems where the identity of the person who would be, you know, using that data, or the system, cannot be locked within a single platform. They are running a different platform. They're running a different identity system. So those sorts of data sharing standards that you and I just talked about a minute ago need to include identity systems that go beyond the bounds of a single organization, need to include standards that go beyond the bounds of a single, you know, data storage.
I think we have a lot to build on because we have built internet scale solutions and APIs. We just have to extend those with respect to very different modes of access to data.
[00:42:43] Unknown:
Going back again to the analogy of microservices, another problem that becomes manifest when you are dealing with all these decoupled components is understanding what are all the interconnections between them, but also, you know, if one of those elements in the system becomes overloaded, then it can become problematic, or it may end up becoming a supernode. Or there's the problem where I have all these microservices, so now I need to query 15 different systems to get all the pieces of data I need to be able to fulfill this request. And so particularly in the context of machine learning, where I might be pulling in lots of different data from lots of different sources to be able to build some sort of a composite model, I might need to interact with 15 different data quantums. And so that becomes potentially problematic as I try to figure out what are all the data sources available, how do I wanna compose them together, maybe they're not all providing consistent interfaces, which I know is sort of one of the requirements that you put forth for being considered a full fledged data product, that they all expose the same interfaces. But, you know, as a consumer of all of these quantums, how do I make sure that I can find them in the first place, and how do I make sure that they're all able to sort of give me data in consistent formats and in a performant manner so that I can make sure that I'm upholding my sort of service agreements to my downstream consumers?
[00:44:05] Unknown:
That's all engineering. That's all engineering. Like, I don't think it's an unsolvable problem. I think you mentioned a few of those. I completely agree that, you know, in the case of, perhaps, APIs and microservices, until you get to that top layer, right, the top front end where you're creating a journey that stitches APIs from many different services, you don't have that kind of, as you said, dependency on lots and lots of services. But then in the mesh, you do. And I do appreciate that in the analytical data world, most of the value that we get from that data, most of the interesting use cases, are stitching data and looking at the data across many, many different nodes. So there are a few things that must be in place. And I know this is the part of the conversation where I lose all of the friends that I make early in the conversation by just talking about principles. Everybody's like, you know, yeah, we're all friends. We all agree on these vague principles that you have. But when it comes to actual implementations and the hard engineering disciplines that we have to put in place to actually make it work, I think that's where I lose a lot of my friends. But let's talk about some of those hard engineering kind of disciplines that we have to put in place.
Some of them are around the standardization of what's in the gaps, which is standardization of the APIs. Not only just data sharing APIs, but also standardization of observability and discoverability. This data quantum, what information does it need to emit for a discovery tool? And again, my language is different. I don't say, like, a data catalog, because I want us to think about a new generation of solutions. A search tool that can discover every data product, discover, index, and allow searching every data product that exists on this mesh. Because they all self registered themselves the moment they were created, and they're providing continuously up to date information around what data they're providing, what's the timeliness, what's the completeness, and a bunch of other metrics.
So that, as you said, a data user, a data scientist coming up with a hypothesis that they want to validate, the first thing they need to know is what data they have access to, to be able to exploit the patterns that might exist within it. Right? They have a place to go to. They have a way of searching. They have a discovery tool that allows them to search the whole mesh. And once they search the whole mesh, they can see information that tells them which one is the right data provider to even, you know, connect to and start exploring. Who is using it? What is the documentation?
What's the schema around it? And you don't want every single data product to use a different way of documenting itself or use a different language for modeling. You really wanna standardize some of these aspects so the consumer has this kind of consistent experience of the mesh regardless of 10 or 50 or 20 or 100 data products. They look and feel the same, even though internally they're modeling a different kind of data. So for that, you know, vision to come true, to have a discovery that gives us consistent dimensions to be able to compare data products, which one is the right one for me, there needs to be observability, there needs to be discoverability, there need to be, you know, blueprints of the data product that kind of embed all of these abilities into every data product right from the moment you initialize or create one. There are mesh level experience capabilities, like a discovery tool or observability tool.
There is a lot to be done. And then your point around, okay, if I want to query across many disparate data products, is that efficient? I think, again, the way we should think about performance is that we should delineate between, physically, how the data might get stored, indexed, searched, and, logically, how we allow each of these data quantums to have a different life cycle. Right? So logically, they are completely independent, different schemas. They can be independently changed, modified, evolved, and controlled by different teams.
But we may very well choose the same underlying storage, the same overall indexing for search, caching, whatever is needed to optimize. So the separation of kind of physical and logical layers to give an autonomous experience to the providers and users, while having the optimization of kind of storage and access, would be useful. Again, it's an engineering problem.
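As one way to picture the self-registration and continuously refreshed metrics described above, here is a hedged sketch of the metadata a data product might publish to a mesh-wide discovery service when it is deployed. The endpoint, payload shape, and metric names are assumptions for illustration only.

```python
# Hypothetical sketch: a data product registering itself with a mesh-wide
# discovery/search service the moment it is created, including the
# observability metrics that would be refreshed continuously afterwards.
import json
import urllib.request

DISCOVERY_ENDPOINT = "https://discovery.example.internal/data-products"  # assumed


def register_data_product() -> None:
    payload = {
        "name": "orders.order_events",
        "domain": "orders",
        "owner": "orders-team@example.com",
        "output_ports": ["parquet://lake/orders/order_events"],
        "schema_version": "1.2.0",
        "documentation": "https://wiki.example.internal/orders/order-events",
        "metrics": {  # refreshed on a schedule after registration
            "timeliness_minutes": 15,
            "completeness_pct": 99.2,
            "user_rating": 4.6,
        },
    }
    request = urllib.request.Request(
        DISCOVERY_ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        print("registered with status", response.status)


if __name__ == "__main__":
    register_data_product()
```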
[00:48:36] Unknown:
Yes. The hardest problems are always the social ones. Engineering problems, you just need to throw enough thinking at it, and it'll figure itself out eventually.
[00:48:45] Unknown:
And money. And money too.
[00:48:47] Unknown:
Absolutely. And money. There's certainly plenty of that flowing around right now. And so in terms of the sort of implementations of data mesh that you've seen, as far as the kind of technical elements and the products that are available in the ecosystem and some of the organizational constructs that people have built up, what are some of the most useful and well thought through examples that you have seen in the time from when you first introduced this idea to where we are now, now that people have started to latch on to this idea as it has become much more popular, and people are actually starting to think about how to actually make it work at larger and larger scales? We talked about technology a little bit. I touched on the social part. I think
[00:49:33] Unknown:
the organizations, you know, that are the early adopters of data mesh are all kinds and forms and colors and shapes. They're not just the scale ups or, you know, just enterprises. It's across the spectrum. And you see a very different kind of starting point between an organization that had a traditional, you know, chief data analytics officer with governance and data science and engineering under it, versus an organization that is more nimble, kind of smaller, digital native. So you do see different behaviors in those. And I think the ones that are most successful are the ones that can really challenge their own biases and assumptions and bring new thinking. So maybe I'll just give some examples here.
The larger, more traditional organizations that have had well established data governance, and let's talk about governance for a minute, right? They try to solve the problem that we just talked about. It's a complex system. It's a chaotic system. You know, how do you prevent this chaos? They try to solve the challenges of this kind of independent data sharing with the old methods that they are very much used to. So the old methods are putting controls in place. So I love the, you know, systems thinking and kind of the work of Donella Meadows and, you know, the likes of her on systems thinking. And if you think about this as a complex system, the traditional organizations that have adopted data mesh still try to fit it into the systems thinking that they had for their organization. So they introduce bottlenecks, basically. So they are worried about, let's say, duplication of the data products, and how are we going to prevent this chaos of people building these data products? And that comes from, you know, years of having scars of people copying data into different databases.
What they think about is, well, we're gonna put a certification and validation and a manual kind of quality control in the pipeline of the data products. That just creates synchronization points and a bottleneck. It doesn't work. So I know you asked for good practices. I'm just telling you some of the bad ones. But just contrasting that with kind of companies that have automation, heavily relying on automation and platform solutions, and they're comfortable with chaos to a degree, and find different ways of managing just the same problem, the problem of, let's say, duplicated data products we don't need.
And the thinking that I've worked on with some of the clients and kind of seen around is, well, how do you apply systems thinking again to a complex system to avoid duplicates? Well, you introduce feedback loops. Right? So you introduce a positive or a negative feedback loop, in the sense that you want, for example, in a mesh, to expose all of the information about your data products in a centralized, I know I said the word centralized, but centralized global discovery or search tool. And this search tool will give a higher ranking to the data products that have, you know, better user satisfaction. They have more stars. People kind of like them more or use them more. And it gives a lower ranking to the ones that seem to be looking exactly the same as the other ones, but they don't have as much usage. So it's self balancing. The mesh tries to self balance itself by just this simple ranking, positive or negative feedback loops, which is a very different approach, which of course relies on observability and automation and all of that. But it's a very different social system design approach to solve a problem without putting bottlenecks in place.
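A tiny, illustrative sketch of the kind of feedback-loop ranking described here, where usage and satisfaction push well-liked data products up in discovery results and let unused near-duplicates sink. The signals and weights are assumptions, not a prescribed formula.

```python
# Hypothetical sketch: rank data products in the discovery tool so that
# well-used, well-liked products rise and unused near-duplicates sink.
def rank_score(monthly_consumers: int, avg_rating: float, staleness_hours: float) -> float:
    """Higher usage and satisfaction raise the rank; staleness lowers it."""
    usage_signal = min(monthly_consumers / 100, 1.0)    # saturates at 100 consumers
    satisfaction_signal = avg_rating / 5.0              # ratings on a 0-5 scale
    staleness_penalty = min(staleness_hours / 168, 1.0) # a week old = max penalty
    return 0.5 * usage_signal + 0.4 * satisfaction_signal - 0.1 * staleness_penalty


catalog = {
    "orders.order_events": rank_score(80, 4.7, 2),
    "orders.order_events_copy": rank_score(3, 3.1, 90),
}
for name, score in sorted(catalog.items(), key=lambda item: item[1], reverse=True):
    print(f"{score:.2f}  {name}")
```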
The other interesting aspect that I have seen is around education, empowering domain teams and application teams, and data literacy. I think HelloFresh had a presentation on that, which I adored: how they gamified the education around data and created different reward, incentive, and educational programs to get everybody on board in terms of valuing data as a product that they build. I'm still discovering and learning, participating in creating the kind of social systems that work, and there's still a lot to be done.
[00:53:55] Unknown:
One of the other interesting things that the idea of the data mesh, and some of the ways that it manifests in a company, can bring about is the variance in the types of roles and skills that are necessary. This has already been going through an evolution, because the idea of a data engineer is still relatively new in terms of human history. You know, it started off, we had database administrators, and we had business intelligence engineers, and then the rise of the data scientists made everybody realize we needed data engineers to be able to hand over data sources that were well groomed and maintained and up to date for the data scientists to work from. And now, as that has become a more recognized job description, we've also been seeing the rise of the analytics engineer, and now there's the data product engineer, the machine learning engineer, and this, you know, continuing proliferation of titles that all fall within the general category of data professional.
And I'm wondering what you see as some of the impact of data mesh, where these data product concerns are being brought into the application development cycle along with the domain expertise that's necessary. How will that influence the types of jobs and positions that organizations need in order to realize the full benefits of the mesh as they scale up and scale out and increase their level of sophistication of data usage?
[00:55:26] Unknown:
What I'm going to say might be a bit controversial, and I do not intend to undermine anybody's skills or talents or, you know, contribution. But if data mesh is successful at this vision I had, there would just be engineers, Tobias. There wouldn't be, you know, all of this rainbow of engineers, because we introduced those as intermediary roles. And every time we introduce an intermediary role, we're creating a gap between the producers and the consumers, and we're creating accidental complexity in that system, as opposed to thinking about how to close that gap and get these people to talk to each other directly, right?
So I think those intermediary boundary roles of analytics engineer and data engineer were needed at the time. But if organizations are serious about being data driven, or data informed, however you want to call it, as in embedding what you just discussed a minute ago, embedding intelligence in every decision and in every aspect of their applications, or at least in many of them, they have to get rid of these boundary roles. And they have to find a way to upskill and cross-skill their engineers to work with data, to work with ML. Of course, they will still have some specialized roles.
And those specialized roles are, for example, the PhD graduates of data science who are really working on the science part. I do think a lot of the work that we call data science is actually feature engineering; it's not really data science, but there is still a science part. So I think we will still have a smaller number of specialists and a large portion of generalists who, at one point in time, choose to focus more on the data side and, at a different point in their career, maybe choose to focus on the application part, as opposed to creating these fragmented boundary roles. And the purpose of the next generation of self-serve platforms is closing that gap, right? To enable what we call the generalist technologists. And I know there's no such thing as a generalist technologist; generalist technologists are technologists who choose to be experts in different things at different points in time. But it's possible to move between these areas of expertise because the learning curve is not as steep as it is for some of these specialties.
Yeah. So I think the purpose of that kind of self-serve platform, and of thinking about these platform capabilities and raising the abstraction, is to not require so many specializations.
[00:57:58] Unknown:
In terms of being able to raise that floor of complexity and, you know, reduce the amount of specialized knowledge that's necessary to be able to work at a surface level with these different data technologies, what do you see as the role of vendors and service companies in being able to help realize the potential for this future state?
[00:58:23] Unknown:
The list is long. So maybe I'll share some of the things at the top of my wish list for vendors. I think it, like any good product design, starts with focusing on the user experience and creating these new personas, which are your generalist developers. Think about the generalist engineer, or generalist expert engineer, whatever we're gonna call them. Think about where they come from, what sort of skill sets they have, what their experience is, and what's the most seamless experience for them to close that cycle of intelligence, right, from application development to intelligence, to deployment of that intelligence into the application. And again, I'm being a little cheeky here, but that's just good product design. Right? Think about the experience of this new persona of users that you have, as opposed to solutions that are closer to the metal, closer to the machine, and optimized for the machine. I think a lot of our data solutions, and for good reason, for the last two decades have mostly been optimizing for what you just talked about, for performance.
Right? For separation of the data from the compute so we can scale each out differently. Now that we've solved that problem, we need to raise the bar and focus on optimizing the experience, a connected experience, away from data movement and back to the full cycle of intelligent, you know, digital solution development. So that would be just the starting point: who we optimize our solutions for. And then the second part of it is that when I look at the data landscape, the data solution or vendor landscape today, I see two main categories.
I see a fragmented world of tiny little solutions, where every startup tries to build and capture the market for a small section. And very few of these startups start with, how does my solution fit in an ecosystem? Right? How does this connect with the rest of an ecosystem? So ecosystem thinking, building solutions that interconnect nicely with each other, is the second item on my wish list. And then the other camp is the big platforms: I'm gonna give you the soup-to-nuts of everything, just buy one solution, because you're gonna get everything you need. And that doesn't lead to a future that I want to be part of. It doesn't necessarily lead to a future that is conducive to innovation by smaller players, by disruptive players. So I guess it's a hard ask, but it's an ask that even if you are a big platform company, you still try to be a good citizen of an ecosystem with other players within it. And that means product design with interoperability and connectivity in mind from day one.
[01:01:16] Unknown:
And in terms of your experience of working with your clients at Thoughtworks and working with people in the community and writing the book and just interacting with people in general on this idea of the data mesh, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[01:01:33] Unknown:
So I guess the most interesting and exhilarating one was the surprise I got when I published this content, the conference talks and the writing. My expectation, and maybe I mentioned this last time, was sharp objects being thrown at me and me ducking. But to my surprise, I said what a lot of people were already thinking about or implementing. I just voiced something that was perhaps obvious but not always spoken of, and tried to put a system around it in that explanation. So the most surprising element is how the industry has embraced the topic.
I'm very, very happy about that. Related to that, the most surprising and yet challenging thing is that the vendors play a big role in this. And while I'm happily surprised that many vendors have embraced the approach, I still see a challenge in that I haven't really seen data-mesh-native solutions being created. As you mentioned earlier, I think the technology challenges are solvable engineering problems, but there's still quite a bit of a gap, so every project that starts requires investment from the adopters to bridge the gap until it is filled. But the social aspect of it is still a big challenge.
I thought that, you know, the domain application developers would be more welcoming of embracing data. And I just realized what a big gap there is between the company's mission of being data driven at the executive level and the reality on the ground, how compartmentalized data still is from the reality of the business and applications. That's the biggest challenge we have to solve, and that's what we talked about: what are those intrinsic motivations to embed intelligence and, you know, make this new age of AI become real, not at the executive level as a mission statement, but at the grassroots with engineers, business people, BAs, application developers.
[01:03:53] Unknown:
We touched on this a little bit already, but what are the cases where data mesh is the wrong approach and somebody might be better suited with just, you know, throwing everything into the data lake or the data warehouse and building these different, you know, point-to-point solutions?
[01:04:06] Unknown:
I guess I'd add something to your question. I would say, where are the places where data mesh is the wrong approach today? Because today versus five years from now, we may make decisions very differently. I think today, if your organization, as we discussed, doesn't have the scale, the scale of sources, the scale of use cases, and you have very modest use cases and a modest number of domains, I don't think data mesh is for you. Again, today, if you don't fit into the innovator or early adopter part of the adoption curve for a new innovation, as in you're not risk taking, you're not comfortable as an organization with ambiguity, and you don't have an experimental attitude toward developing and walking through the unknown, perhaps you've got to wait, because there is a level of experimentation.
There is a level of unknown and ambiguity and refinement that needs to happen that innovators and early adopters are okay with, but laggards are not. So if you're traditionally a laggard, then I don't think it's the right time for you. And the third piece I would say is that, because of where we are today, there are not many off-the-shelf technologies for you to get and simply integrate, so there is a fair bit of investment in building things out. And if you're not a company that respects or embraces technology at its core, as the main enabler or even the driver of the business, then now may not be the right time, because you would need that kind of technical foundation to build and operate a data mesh.
[01:05:47] Unknown:
And as you continue to work with companies and help to formalize these ideas around the Data Mesh, what do you see as the future for the principles and your overall involvement in helping to drive it forward? And at which point do you think it will become just the community's responsibility to adopt and push forward these ideas?
[01:06:10] Unknown:
I would like to see the community's involvement sooner rather than later. I'm actually very grateful; there are folks in the community that have led initiatives like, you know, the data mesh learning group, which flourished over the course of a couple of months from nobody to 1,000. So I think the community is forming, and there's a lot of information out there, and we're still figuring out what's good information and what's bad information, so there's a lot of misinformation as well. But the community is evolving, and I'm actually very happy and grateful for the participation of folks in the market.
As for my role, I think up to now, and until this book is out early next year, I've still been acting as an evangelist, trying to be ahead of the curve; I've been ahead of the curve with Thoughtworks for a few years. So I'll be engaged in the projects and the challenges we see on the ground, and come back and share those challenges and learnings with the larger community. I think Thoughtworks has traditionally had the spirit and culture of sharing. We did that with microservices, we did that with continuous delivery, and with a few other, I guess, major shifts that we've seen in our industry.
[01:07:20] Unknown:
Are there any other aspects of the ideas around data mesh and its manifestations and some of the work that needs to be done that we didn't discuss yet that you'd like to cover before we close out the show?
[01:07:39] Unknown:
I think at the high level, people kind of understand the principles and the motivations behind it. The book, in fact, is structured around why data mesh, so you have a way of justifying whether this is for you and why it should or shouldn't matter to you, and then what it is at the level of principles. Almost all of those principles are in the early release, if people want to access it. But the piece that we still have to figure out, and haven't figured out yet, is really the technical gaps we have to close to bring a distributed data sharing model for analytical use cases to life at scale. I have some opinions, some learnings, some hopes, and these are in part 3 of the book, which is about how to build the architecture.
Some of it is proven; some is totally hypothetical and yet to be proven. So I think we still have to work on that. And, as we discussed, where does data mesh fit into an organization-wide data strategy and execution? That's part 4 of the book, and I think we still need to learn and discuss more there. There won't be time for us to go into the details of those, but I think the actual technical architecture and implementation needs a refreshed and perhaps outsider perspective, and the execution of data mesh at the organizational level, the strategy and execution, those are discussions and topics still to be had.
[01:08:59] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[01:09:14] Unknown:
I'm probably gonna give you the same answer I gave you last time. Standards, standards, standards. Right? Whether these are standards like the Parquet file format and so on or not, standardizing the gap: the gap where data is shared across different, what we call them, data product quantums, or data sources. We still need to standardize analytical data sharing quite a bit more. I see some work being done around data sharing, some small steps that the industry is taking, but we need a lot more of those. And we need vendors to be incentivized, and hopefully data mesh is a catalyst to incentivize vendors to share data, which then hopefully leads to establishing the interoperability standards.
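As a purely illustrative example of the kind of standardization being wished for here, the sketch below shows what a minimal, self-describing output port descriptor for a data product might look like. The field names, the JSON layout, and the choice of Parquet as the physical format are assumptions made for this sketch; they are not an existing standard and are not attributed to anyone in the conversation.

```python
# Hypothetical sketch of a self-describing output port descriptor for a
# data product. Field names, the bucket path, and the SLO keys are all
# illustrative assumptions, not an established interoperability standard.
import json

output_port = {
    "data_product": "customer-orders",
    "domain": "sales",
    "port": "daily-orders",
    "format": "parquet",  # physical encoding of the shared data
    "location": "s3://example-bucket/sales/daily-orders/",  # placeholder path
    "schema": [
        {"name": "order_id", "type": "string"},
        {"name": "order_date", "type": "date"},
        {"name": "total_amount", "type": "decimal(10,2)"},
    ],
    "slo": {"freshness_hours": 24},  # guarantees a consumer can rely on
    "owner": "sales-domain-team",
}

# A discovery tool or a consumer could read a descriptor like this without
# any vendor-specific client, which is the sort of interoperability the
# standards wish above points at.
print(json.dumps(output_port, indent=2))
```

The value of agreeing on something like this is that any catalog, governance tool, or consuming team can interpret the same contract, regardless of which vendor produced the underlying data.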
[01:10:09] Unknown:
Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Sponsor Messages
Guest Introduction: Zhamak Dehghani
Zhamak's Journey into Data
Recap of Data Mesh Principles
Evolution of Data Mesh Thinking
Modern Data Stack and Data Mesh
Productization of Data
Frameworks and Contracts in Data Mesh
When to Implement Data Mesh
Data Mesh Beyond Organizational Boundaries
Challenges in Data Mesh Implementation
Successful Data Mesh Implementations
Impact on Data Roles and Skills
Role of Vendors and Service Companies
Lessons Learned from Data Mesh
When Data Mesh is the Wrong Approach
Future of Data Mesh and Community Involvement
Technical Gaps and Organizational Strategy
Closing Remarks and Contact Information