Summary
The data mesh is a thesis that was presented to address the technical and organizational challenges that businesses face in managing their analytical workflows at scale. Zhamak Dehghani introduced the concepts behind this architectural pattern in 2019, and since then it has been gaining popularity, with many companies adopting some version of it in their systems. In this episode Zhamak re-joins the show to discuss the real world benefits that have been seen, the lessons that she has learned while working with her clients and the community, and her vision for the future of the data mesh.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
- Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold.
- Your host is Tobias Macey and today I’m welcoming back Zhamak Dehghani to talk about her work on the data mesh book and the lessons learned over the past 2 years
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by giving a brief recap of the principles of the data mesh and the story behind it?
- How has your view of the principles of the data mesh changed since our conversation in July of 2019?
- What are some of the ways that your work on the data mesh book influenced your thinking on the practical elements of implementing a data mesh?
- What do you view as the as-yet-unknown elements of the technical and social design constructs that are needed for a sustainable data mesh implementation?
- In the opening of your book you state that "Data Mesh is a new approach in sourcing, managing, and accessing data for analytical use cases at scale". As with everything, scale is subjective, but what are some of the heuristics that you rely on for determining when a data mesh is an appropriate solution?
- What are some of the ways that data mesh concepts manifest at the boundaries of organizations?
- While the idea of federated access to data product quanta reduces the amount of coordination necessary at the organizational level, it raises the spectre of more complex logic required for consumers of multiple quanta. How can data mesh implementations mitigate the impact of this problem?
- What are some of the technical components that you have found to be best suited to the implementation of data elements within a mesh?
- What are the technological components that are still missing for a mesh-native data platform?
- How should an organization that wishes to implement a mesh style architecture think about the roles and skills that they will need on staff?
- How can vendors factor into the solution?
- What is the role of application developers in a data mesh ecosystem and how do they need to change their thinking around the interfaces that they provide in their products?
- What are the most interesting, innovative, or unexpected ways that you have seen data mesh principles used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on data mesh implementations?
- When is a data mesh the wrong approach?
- What do you think the future of the data mesh will look like?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Data Engineering Podcast Data Mesh Interview
- Data Mesh Book
- Thoughtworks
- Expert Systems
- OpenLineage
- Data Mesh Learning
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Have you ever woken up to a crisis because a number on a dashboard is broken and no one knows why? Or sent out frustrating Slack messages trying to find the right dataset? Or tried to understand what a column name means? Our friends at Atlan started out as a data team themselves and faced all this collaboration chaos. They started building Atlan as an internal tool for themselves. Atlan is a collaborative workspace for data driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets and code, Atlan enables teams to create a single source of truth for all of their data assets and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker, and more.
Go to dataengineeringpodcast.com/atlan today. That's a t l a n, and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $3,000 on an annual subscription. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.
Go to dataengineeringpodcast.com/linode today. That's l i n o d e, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey. And today, I'm welcoming back Zhamak Dehghani to talk about her work on the data mesh book and some of the lessons that she's learned over the past couple of years since she first introduced the idea. So Zhamak, for anybody who hasn't already listened to the past episode that we did, can you just give a bit of an introduction?
[00:02:12] Unknown:
Hi, Tobias. It's great to be back. Yeah. I work at Thoughtworks as the director of emerging technologies. For the last few years, I've been busy initially hypothesizing about data mesh, and then over the course of the last few years, implementing it, refining it, and validating it with our clients globally. Do you remember how you first got involved in working in the area of data? I had a bit of a chance, I guess, to think about where it was when I first, you know, started with data, and it was when it wasn't cool at all. So the first time was probably about a couple of decades ago, embarrassingly. At university, my major was expert systems, like AI based expert systems, which looked very different from what we call AI today. And then, I guess, mid career, I had an opportunity to work on a fascinating distributed system product.
Basically, what you would call today a data intensive application. It was gathering information from all sorts of systems across large networks, from routers and databases and ATMs and whatever you can imagine, to be able to monitor them. So I got engaged in building time series databases, streaming, building a query language on top, an aggregation system, analytics on top. So the full stack was, you know, developed greenfield, and that's before the cloud and all of the cloud services we have. That was a fascinating experience, which gave me a sense of the depth of what is involved in working with data intensive applications. And more recently, the last few years, my focus has been on how we do data sharing at a large scale in the complex environments of the large organizations we work in today.
[00:03:47] Unknown:
And so for people who aren't familiar with the ideas of the data mesh and haven't listened to the episode that we did almost 2 years ago now and haven't read the book, can you just give a bit of a recap about the principles of the data mesh and some of the story behind how you came across the ideas and principles and some of the potential benefits that could be realized by organizations
[00:04:08] Unknown:
who implement it. Well, data mesh was born really as a hypothesis, as an answer to the challenges I was observing, you know, a few years back working with large technology and data forward clients at the time. The challenges were, you know, in the discord between kind of the problem space and solution space. What do I mean by that? I was working with organizations that were complex. They had many functions, many domains. They had, you know, big data aspirations. So they had a large number of use cases for how they wanted to use data in ML. And because of their complexity, they had data coming from many different sources, from many touch points with their, you know, customers, partners, ecosystem players.
So the solutions that they had weren't really meeting the needs and aspirations and the complexity of their environment. And the solutions back then, you know, a few years back, it was still early days of a lot of companies moving to cloud. So some of them had already moved out. They had data lakes or data warehouses on cloud. Some were still running their solutions in house. So I looked at the solution space to see what were the bottlenecks that were stopping these organizations from getting value from their data. So on one hand, you know, they were spending a large amount of investment, but on the other hand, they weren't really getting results. So data mesh at a very high level is a decentralized socio technical approach to accessing, managing, and sharing data, essentially for ML and analytics use cases.
Then you asked about the principles. So if I reduce it to its 4 principles, there are basically 4 very generic principles underpinning that. Of course, each of those principles leads to very specific implementations and, you know, a technical kind of manifestation we can talk about. But at a high level, the principles are, first and foremost, the idea of distribution of data ownership and data sharing, and an architecture for data sharing, to independent, autonomous, domain oriented teams. Basically, following the seams of your organization, the way that the organization and business decomposes itself to get scale; usually these are your business domains, business functions.
And following that with teams, technical and data, I suppose, kind of informed or capable teams that not only build the applications, microservices, or legacy systems that support those business functions, but also the technology required to share data for analytical use cases. So then if you follow that principle, the principle of domain oriented ownership, you may say that, well, that may not look very nice because, you know, the teams may end up kind of siloing their data in their own databases. How is that gonna work? So then the second principle of data as a product tries to change our relationship with the data to say, well, data is not for you to just hoard and collect in your little database for your own use cases. Data is there to share as a product with the rest of the organization.
And then, you know, if I follow that principle, you might say, well, how is it possible that each of these domain teams can have the capabilities to build all of this data infrastructure that's needed to share data at scale, or to use data peer to peer at scale for their analytics purposes? And that's the principle of a new look at the self serve infrastructure and the platform to give autonomy to these teams, to make it feasible and cost effective for generalist kind of tech developers or app developers to be able to work with data. So self serve data infrastructure is the 3rd principle.
And the 4th principle was introduced later on because, you know, data governance fears chaos. So how do we make sure in this decentralized world, there's still some sort of a global harmony and interoperability between these data products that are being developed by different teams? How do we make sure that privacy is still respected and legal compliance is still applied? So the fourth principle of federated kind of computational governance tries to introduce a governing model, an operating model, as well as kind of technological solutions to allow embedding policies and policy execution in every single data product. So we can still have that, you know, balance and equilibrium between the autonomy of the teams sharing data or using data and the global interoperability and the harmony of the policies that need to be applied cross cutting to all of them. Sorry, that was a long, long explanation, but these are the 4 principles underpinning data mesh.
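To make the computational side of that fourth principle a little more concrete, here is a minimal, hypothetical sketch of what embedding policy execution in a data product's output port might look like. The policy names, fields, and the `serve` helper are illustrative assumptions, not part of any specific data mesh platform.

```python
# Hypothetical sketch: governance policies packaged as code and executed
# locally by every data product before records are served.
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Policy:
    name: str
    check: Callable[[dict], bool]  # returns True when a record is compliant


# Example federated policies: defined once by the governance group,
# applied computationally inside each data product.
PII_FIELDS = {"email", "phone_number"}

POLICIES = [
    Policy("no-raw-pii", lambda record: not (PII_FIELDS & record.keys())),
    Policy("has-owning-domain", lambda record: "domain" in record),
]


def serve(records: Iterable[dict]) -> List[dict]:
    """Apply every embedded policy to each outgoing record."""
    compliant = []
    for record in records:
        violations = [p.name for p in POLICIES if not p.check(record)]
        if violations:
            raise ValueError(f"policy violations {violations} in {record}")
        compliant.append(record)
    return compliant


if __name__ == "__main__":
    print(serve([{"order_id": 42, "domain": "orders", "total": 99.5}]))
```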
[00:09:00] Unknown:
And so when we first talked, it was in July of 2019, and it was, I believe, fairly close to when you had first posted and posited these ideas and this thesis about the data mesh and some of the potential benefits of being able to manage this domain ownership of these data products. And now that there's been more time for you to be able to reflect on it, and for people in the ecosystem to be able to experiment with the ideas and implement them and figure out how the tooling aligns to these principles, what are some of the ways that your thinking around the problem has shifted or evolved? And what do you see as some of the kind of core elements that have been proven out, and what are the pieces that you see as being in need of further refinement?
[00:09:50] Unknown:
Yes. When we talked, I think you were probably one of the very first people that I talked to after I wrote the first article. So I'm grateful for your platform, letting my voice be heard, I suppose. I guess at the principle level, the main change from the first article I wrote was the introduction of the 4th principle. Like, very early on, I think the idea of decentralization of data ownership as a principle hasn't changed. The idea of the data as a product and sharing didn't change. And then the platform at the principle level, those 3 existed in the very first writing. And the reason they lasted through the test of time was that these are not novel new ideas. These are the principles underpinning scaled solutions that we have created in the operational space for the last 2 decades to respond to the complexity of digitalization of organizations.
So I didn't do anything novel there. I just thought, how would we apply those to the data world? But the principle that I had to introduce, and I introduced later on in a second article I wrote around kind of principles and logical architecture, was this idea of the governance. Like, what does governance look like? And as a technologist, of course, I was very much focused on automation, automation, automation. I initially said, oh, let's make governance invisible and automate everything. And that's part of the kind of computational aspect of the governance: how can we think about, you know, the infrastructure that's running these data products? So that part kind of evolved. And then I realized that one of the concerns and challenges with governance is actually to do with the operating model. The roles, the responsibilities, who does what, what's the role of governance.
And I guess, we live in the US; we have somewhat of a federated governance model. Or if you look at Europe, the same thing, or the UN. So I hypothesized on the application of the federated decision making model that we've seen, I guess, in the world to an organization. And again, I think these are areas to be yet refined further. But the federated operating model and then computational governance became its own, I guess, important enough to be a principle. On the technical front, a lot has changed because we've been building this for the last few years, and every implementation of it looks different. So, yeah, I think these principles are just getting more refined and well understood as we implement them.
[00:12:35] Unknown:
And one of the core kind of pain points that you called out in the initial posting and that has held true is the fact that data teams have typically been very sort of underleveraged, but overworked, because of the fact that it's generally oriented around a centralized team of people who are responsible for all data across the organization. And so you lose a lot of the context and domain expertise as it traverses these various boundaries, and you have these very labyrinthine point to point connections between various systems. One of the things that we talked about in the previous episode was the idea of still having a kind of data platform team that's responsible for the technical implementation, but building it in a way that each of the different domain owners is able to self serve on that. And one of the biggest changes that's been happening over the past couple of years is this rise of the so called modern data stack, where democratization of access and giving everybody the power to have some measure of control or participation in the data of the business has become kind of paramount. And I'm wondering what you see as some of the ways that this modern data stack and the decomposition of all of the different layers has either enabled or potentially hindered people who are trying to realize this idea of the data mesh in their own organizations?
[00:13:59] Unknown:
I think the way I see it is that, of course, all of the advances we've had in the modern data stack have been great. Right? They try to make the life of a data engineer easier. So as a result, it's great for data mesh. What I would say is that we are still missing a very crucial element in really being able to mobilize the largest population of technologists, the app developers that are essentially the source of the data and the end consumers of the data, right? In many of these kind of data driven organizations today, their applications are augmented with ML trained models and so on. So they need to have access to the data.
And many of these applications are generating the data that finally trains those models. So the feedback loop between the app development and analytics has become tighter and tighter. However, much of the data tooling that I see today is still assuming a division of responsibility and a division of infrastructure between what we call a modern app development stack and a modern data stack. And some of the tools are actually increasing that gap between the 2, and some are closing the gap. So for data mesh to really take off, I think one of the pieces is a kind of platform. And I know we use the platform as one thing, but it's not one thing. Right? It's a collection of tools that play really nicely with each other. I think the collection of tools needs to imagine the life cycle of a data product from generation and creation to consumption, to creation of ML models, and then consumption back into applications as one feedback loop, as one journey lifecycle.
And look at kind of how the platform that enables this, now, cross functional team needs to play nicely with the modern application stack, needs to integrate nicely with how data is emitted from these applications or how it is fed back to these applications. The solutions we have today have made an assumption, rightly so, because that's how the world was when they were created, that, you know, application data will get to us somehow. We build these pipelines, and it will get to us. That's not the focus. The focus is, once we got this data, what are we gonna do with it? How are we gonna model it? How are we gonna enable the downstream, that last mile to the data scientists and data analysts? So I think the solutions or platform capabilities that work nicely with data mesh are the ones that close that gap. A very simple example, and it's no criticism of any particular vendor. A lot of application developers these days are very well familiar with, you know, containerization and running a Kubernetes cluster, not running the cluster itself, but running their applications on such a cluster, monitoring their applications with certain observability tools. But if this cross functional team now needs to provide data products, they need to completely shift in the other direction and go and, I don't know, run a VM based cluster of Spark and then run Spark jobs somewhere else and then monitor those somewhere else.
And even simple standards, like OpenLineage, and the folks that are part of that group, forget that this tracing that happens for the data needs to start from the application. So we've got a set of open tracing standards for application tracing and another standard, OpenLineage, for data lineage. Are we thinking about connecting the two? Perhaps not as much as we should. So, yes, I think it's great we're investing a lot of money in creating data platforms, modern data stacks, but not paying enough attention: how does that play nicely with the
[00:18:01] Unknown:
modern app development stack? Yeah. I definitely agree that the continued division of software and application development and delivery from the development of data pipelines and data analytics is still too segmented, and there needs to be a much more native integration between the 2, where, you know, one of the challenges of application development is that it has become increasingly sophisticated and complex, and so the responsibilities of application developers are continually growing. To then add another responsibility onto their pile, that as you're designing and building the application, you need to be considering as a first class concern how you are actually going to expose the analytical information that's necessary for other consumers. And so I think that that needs to become part of the standard set of requirements for the application delivery and the definition of done before we can really be able to fully realize the proper value of data within our organization.
Otherwise, we're in the situation where we are now, where the application developers build the applications, they generate all of this data, and then they just kind of drop it in the database and say, good luck. Not my problem anymore. I did what I'm supposed to do. I have an application that my end users can take advantage of. But I think that as the users of these applications become more sophisticated and data literate as well, that also continues to drive the need for these analytical products, which then feeds back into the need for these applications to be able to embed that as part of the experience. And we've been seeing that with some of these embedded analytics solutions. I recently talked to the folks from Cube.js, which is a way to be able to actually expose an API of your analytical data that you can, you know, power some of these charting libraries with. You know, there's the idea of reverse ETL or operational analytics, where you need to feed all of the information from your data warehouse back into the SaaS platforms that your business users are relying on. So I do think that there is that foundational need of software systems to have analytical use cases embedded as a first class priority in their design and delivery.
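Picking up the earlier point about OpenLineage and application tracing needing to meet, here is a hedged sketch of one way a data product's lineage event could carry the application's trace ID so the two views can be joined later. The event shape, the custom facet, and the `emit_lineage_event` helper are assumptions for illustration; they are not the actual OpenLineage or OpenTelemetry APIs.

```python
# Hypothetical sketch: propagate an application trace id into a data
# lineage event so app-side tracing and data-side lineage can be stitched.
import json
import uuid
from datetime import datetime, timezone


def emit_lineage_event(job: str, inputs: list, outputs: list, trace_id: str) -> dict:
    """Build a lineage-style event that keeps a link back to the app trace."""
    event = {
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {
            "runId": str(uuid.uuid4()),
            # assumption: a custom facet carrying the originating trace id
            "facets": {"appTrace": {"traceId": trace_id}},
        },
        "job": {"namespace": "orders-domain", "name": job},
        "inputs": [{"name": name} for name in inputs],
        "outputs": [{"name": name} for name in outputs],
    }
    print(json.dumps(event))  # in practice, send this to a lineage backend
    return event


# Inside the application, the current request's trace id (from whatever
# tracing library is in use) is handed to the data product side:
emit_lineage_event(
    "orders_to_order_events",
    inputs=["orders_db.orders"],
    outputs=["order_events.daily"],
    trace_id="4bf92f3577b34da6a3ce929d0e0e4736",
)
```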
[00:20:13] Unknown:
Absolutely. You said the right word. Like, I love your thoughts here, because unless that day comes, and that day only comes when analytics and ML are embedded into the application, embedded in every business function, none of this will work. We will be in this, you know, hypocrisy of putting data driven all over our mission statements, and yet externalizing the responsibility of anything to do with data to a different team away from where the action really is happening. Right? So I completely agree with your statement about embedding intelligence, whether it's analytics or ML, into everything we do, including the applications we build.
[00:21:10] Unknown:
And so the idea of productization of data is at the core of the idea of the data mesh. You know, software applications and products are intrinsically delivered as an overall experience. But the kind of driving force that initially led to this schism between online transactional systems and sophisticated analytics is the fact that the data technologies that application developers rely on in order to be able to have responsiveness and flexibility in how they can build and deliver the experience are not conducive to being able to execute these heavyweight analytical processes on them. So that's why we ended up with the, you know, row oriented relational databases and column oriented data warehouses for these 2 different concerns.
And I'm wondering what you see as the potential for being able to bridge that kind of fundamental divide in terms of the data access requirements, while being able to still design and build systems that are able to natively work across those boundaries without having to have this very painful and error prone system of point to point integrations that are the responsibility of, you know, some third party, whether it's external or within the organization?
[00:22:30] Unknown:
That's a really good question. I think where we are in the arc of kind of innovation, database innovation, I completely agree with you that the underpinning kind of technology for OLTP or transactional systems and the modes of access for building a database for a microservice that's running your website, which has lots and lots of reads and writes, small reads and writes, versus, you know, building a storage technology that runs analytical workflows, perhaps not as many, but 1 or 2 reads, not many writes, but very heavy reads, large scale reads. At a physical level, there are 2 different sets of technologies that support those. And I do not think that, oh, we have to have one universal database to solve our analytical data problem.
I think what we can do, though, is still respect those differences. And in my mind, data mesh, at this point in time, respects those differences. Say, at the physical level, at the infrastructure level, yes, you do have different data storage. Even data modeling, the modeling of the data for your, you know, ecommerce application, as you said, row based, relational databases for atomic transactions. That database should be optimized for that application; it should not be optimized for an analytical workload. But how can we enable extension of these applications in a way that now they do externalize data and access modes for analytical purposes?
So at this point in time, I think what we have implemented and what I have kind of hypothesized is this idea of data quantums. You know, an architectural component that encapsulates the storage, the modeling, the serving of the data, as well as the policies that govern that data, designed for analytical access. And that kind of data quantum sits next to or close to the source applications. In some cases, data quantums are very close to the application that is the source of their data. And in some cases, they're not. You know, so in some cases, they're downstream aggregates of multiple upstream data quantums. Or even further downstream, they are providing the output of a machine learning model, perhaps. So in the case that we were just discussing, the case where data quantums or data products are more aligned to the source being an application, then how can we assure that the integration between those 2 is a bit closer, and we don't have a forest of pipelines pushing data through, far away from the application developers? And I think the very first step is to have the same team be responsible for both of these. Just by the fact that the people are sitting next to each other and have the same objective; the objective is serving this domain's function, and part of the domain's function is sharing its data.
Sitting together, I think, is the first step to get the knowledge of data modeling and data sharing close to the knowledge of the data modeling for the application, which is the source. And then from the technical perspective, I think the technology is already there. Like, I mean, we have event streaming, which can, you know, stream the events out of the system as it's updating its own data storage. It can provide domain events that then the data quantum will capture, will summarize, will transform into whatever, you know, columnar format or whatever format the modes of access it supplies require.
If it's a legacy application, you know, we have change data capture tools to do that. So I think the integration between those 2, once it's done by one team and the life cycle of that integration is managed by one team, the technology exists. I mean, it can certainly be improved, but the technology exists.
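As an illustration of that last point, here is a small, hypothetical sketch of a source-aligned data product owned by the same team as the application: it consumes the domain events the application already emits and materializes them into a columnar file as one analytical output port. The `consume_domain_events` generator stands in for whatever event streaming or change data capture mechanism the team actually uses.

```python
# Hypothetical sketch: a source-aligned data product that turns the
# application's domain events into a columnar, analytics-friendly dataset.
import pyarrow as pa
import pyarrow.parquet as pq


def consume_domain_events():
    """Stand-in for an event stream or CDC feed owned by the same domain team."""
    yield {"order_id": 1, "status": "placed", "amount": 42.0}
    yield {"order_id": 1, "status": "shipped", "amount": 42.0}
    yield {"order_id": 2, "status": "placed", "amount": 17.5}


def materialize(output_path: str) -> None:
    """Capture, lightly transform, and serve the events in columnar form."""
    rows = [
        {**event, "amount_cents": int(event["amount"] * 100)}
        for event in consume_domain_events()
    ]
    table = pa.Table.from_pylist(rows)  # column-oriented representation
    pq.write_table(table, output_path)  # one analytical output port


if __name__ == "__main__":
    materialize("order_events.parquet")
```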
[00:26:31] Unknown:
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it's often too late and the damage is done. DataFold's proactive approach to data quality helps data teams gain visibility and confidence in the quality of their analytical data through data profiling, column level lineage, and intelligent anomaly detection. DataFold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. DataFold integrates with all major data warehouses as well as frameworks such as Airflow and dbt and seamlessly plugs into CI workflows.
Visit dataengineeringpodcast.com/datafold today to book a demo with DataFold. Yeah. I think that, to your point, it doesn't need to be all within the confines of the application that the analytics is also done, but we do need to have much more well designed and well enforced contracts at these boundary layers with the application and the downstream analytics consumers. And, you know, the database schema is not a firm enough contract because it's subject to change at the needs of the application developer, or you're in the business of restricting the amount of change that's possible because it will break other downstream systems, which hampers innovation in the application.
And so as we were just discussing here, one of the things that comes to mind as to how to make this an easier lift for application developers is, you know, as software engineers, we rely a lot on different frameworks to handle a lot of the, you know, boilerplate common requirements. So Django, Rails, what have you, have become widely popular because of the fact that they handle all of the concerns of being able to terminate web requests, handle, you know, cookies and sessions and query parameters, database connections. And I think that one of the ways that this can become an easier solution for people is if some of these frameworks, or new frameworks, evolve that have these analytical outputs as a first class concern within them, so that it's just a natural part of the development cycle, so that I don't have to think about, okay, I need to build a new API, and I need to think about, okay, what are the access patterns? I need to be able to do both bulk reads and incremental reads and be able to maintain the watermark of when this client last read so that I know what new data to send to them; having that just be baked in as part of the kind of standard boilerplate, you get this out of the box kind of a thing. But to the point too about these sort of enforceable contracts, I think that that's one of the pieces that's still very weak in the data ecosystem, and it's becoming more prevalent, but it's not easy enough yet. It requires a lot of upfront work and design and, you know, collaboration across the organizations. And it's not just a framework where you say, okay, this is what I'm going to do, this generates the contract, so now as a downstream consumer, I say, okay, this is the shape of the data. I can generate the schema from that. Now I'm going to transform it, and, you know, this is the new contract for this other downstream consumer.
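To make the contract idea a little more tangible, here is a minimal sketch of the kind of explicit, versioned data contract an application framework could generate and a downstream consumer could validate against. The field names and the `validate` helper are purely illustrative assumptions.

```python
# Hypothetical sketch: an explicit, versioned contract for the analytical
# output of an application, checked against every outgoing record.
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FieldSpec:
    name: str
    dtype: type
    nullable: bool = False


# Version 1 of the contract for an imaginary "order events" output port.
ORDER_EVENTS_CONTRACT_V1 = [
    FieldSpec("order_id", int),
    FieldSpec("status", str),
    FieldSpec("amount_cents", int),
    FieldSpec("customer_segment", str, nullable=True),
]


def validate(record: dict, contract: List[FieldSpec]) -> List[str]:
    """Return a list of contract violations for a single record."""
    errors = []
    for field in contract:
        if field.name not in record or record[field.name] is None:
            if not field.nullable:
                errors.append(f"missing required field {field.name!r}")
            continue
        if not isinstance(record[field.name], field.dtype):
            errors.append(f"{field.name!r} is not of type {field.dtype.__name__}")
    return errors


print(validate({"order_id": 7, "status": "placed", "amount_cents": 4200},
               ORDER_EVENTS_CONTRACT_V1))  # -> []
```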
[00:29:37] Unknown:
I completely agree with you. I mean, I love both of those points that you raised. One was the extension to the kind of application development frameworks that allow at least emitting and externalizing the application data to this collaborative kind of data quantum that then can consume that and turn that into an analytical data contract. Right? I think that's a fabulous idea. Hopefully, someone is listening right now, and they go and build this thing, or the first generations of it. So I think that's absolutely one of the missing pieces. Because when I think about kind of the parts of the solution that can be productized, and I do think about this a lot, the part that I always come back to is, okay, how do you bootstrap data mesh? And bootstrapping data mesh in enterprises has many different entry points. But one of those entry points is the applications. Right? The application developers start kind of emitting their data products, or providing at least what I call source aligned data products.
And to bootstrap that, it's not that easy to just buy a product and plug it in off the shelf. Right? So you need to change the application logic. You need to work closely. Or those application developers need to work on their system to be able to externalize the data that then can be consumed by the data product and then, you know, changed into analytical data. And if we can reduce that overhead, if we can reduce the overhead of an application developer now providing those new contracts, I think we have a much better chance of getting adoption of data mesh, because you are accelerating bootstrapping something. And I know that I've talked to a lot of executives, and, like, one of the suggestions I've heard from some of them is that we want to inject data product or data quantum creation into the build pipeline of an application.
So at some point in that build to deploy pipeline, you create that. And I think, well, that's a good idea for some applications, but it wouldn't be the first thing I would focus on. Because as an organization that's trying to kind of figure out how this data mesh thing works for you, probably you don't want to focus on a factory like generation of lots of data products that are hardly used. Perhaps it's an optimization. At the end of it, I think it's an optimization that we can work on once the organization has figured out what's my operating model, what does data product mean to me. And then you can kind of try to, I guess, put machinery in place that makes the data product creation faster and bootstrap that. And the other end that you mentioned, which was the contracts for data sharing, I think that is such a weak area, and I don't think it's the job of one vendor or the other to invent one, because I think these are the glues that connect these pieces of the mesh together.
And these glues need to be standardized, de facto standards or otherwise. These standards should be open. And these standards should respect the nature of the analytical data sharing and the native modes of accessing analytical data, which is not one mode today. Like, you have the analyst, the evergreen SQL like kind of modes of access. You've got the data scientist, as you mentioned, the kind of columnar, feature based data access. You've got, you know, the event folks and the data intensive kind of event based access. So then what are the standards for each of these native modes of access for sharing data? But not only sharing data, but also being able to run analytical workloads on the data storage, on the source. Right? How do they express what workload they want to run? How do they express what's the identity or the permissions? The identity of the agent, the client agent for access. So I think we have a lot of work to do in standardizing. And if I go back to the history of, like, why microservices took off, microservices was also this complex, hairy, you know, system of decomposing solutions across hundreds and thousands of services.
Why did that take off? Well, it came at a very interesting moment in time when we had standardized on basic Internet protocols as a way of communicating. We had, you know, kind of converged on REST back then, and, of course, later on GraphQL and others. But that convergence and those de facto standards that we started kind of adopting created the interoperability across these kind of disparate services and composed them into beautiful and complex solutions. And I think unless we have that in place for analytical data sharing, we will keep
[00:34:18] Unknown:
building systems that are full of friction. I like that you brought us around to microservices because that is definitely the very native allegory in the application development space. And I think it was in the opening sentence of your book that you wrote, data mesh is a new approach in sourcing, managing, and accessing data for analytical use cases at scale, with scale being emphasized. And as with everything, scale is very subjective. And in the microservices ecosystem, you know, you never really want to go directly to microservices, because you need to build the monolith first to see what are the actual natural boundaries of the logic and the problem domains.
And I'm wondering what you see as some of the useful heuristics for determining when a data mesh becomes an appropriate solution for an organization or for a problem space, and some of the ways to be able to effectively identify those boundary points where it makes sense to actually draw those dividing lines to break out these data quantums?
[00:35:18] Unknown:
Yeah. That's a really good question. Definitely, data mesh is not the right solution for everyone. You've got to have the problem of scale. And the problem of scale that data mesh tries to address is the scale of the complexity of an organization. So if you're an organization that has a lot of different business functions, you're a multinational, you know, retailer or a giant tech producer with a lot of different products from mobile phones to, I don't know, laptops and so on. If you are a health care institute, whether you're a provider or a payer, you have to get data from so many different touch points and so many sources, and you have many domains.
And at the same time, you have the aspirations, the use cases, the capability to actually use that data in many different diverse solutions and use cases. Like, if you're a health care, let's say, provider, you want to do population health analysis. You want to, you know, personalize your care based on the cohorts of different patients that you have and their very specific personalized needs. Like, if you have all of these aspirations, and you have the scale, the sources, and the complexity of the organization, and you're also doing mergers and acquisitions, and you're growing, you very likely have been blocked or hindered by your previous centralized solutions.
And if you have the pain of, you know, your data team is unhappy, nobody's happy. Everybody has, like, a frowny face on because, you know, the data scientists that want the data are complaining that data is no longer available. The data engineers are, you know, overworked and under a lot of pressure and underappreciated. They're not really incentivized. They don't know what they're doing in the middle. The application developers are kind of oblivious, but also frustrated that they can't, you know, build those intelligent solutions in, and they're complaining about the data scientists. So if you have those blockers, you know, centralized bottlenecks, and you have the scale, then think about data mesh.
The other question is, okay, at what point? You know, I'm still exploring my business, and now I'm going through that, you know, kind of massive scale of growth, I'm going through some hypergrowth, and now I've grown and I'm exploiting. Where in that curve should I think about data mesh? I don't have a specific answer to say where and when, because it depends on many, many different factors. To give you an example, you might say, okay, you know, the size of the organization and diversity of domains, that's probably a good one to look at. And if your business is growing, that's the one.
You might actually be a health care startup where, from the get go, you want to consume data from many different partners, from many different sources. And you probably just provide one or 2 very hyper specialized things, let's say cancer detection in images or something along those lines. But from very early on, you have this diversity of sources where the sources of data are going to continuously change, change on a different life cycle, a different cadence. So maybe even from early on, because the scale of access to reliable sources is a differentiating, you know, strategic kind of benefit to you, you may wanna think about it earlier, as opposed to maybe an organization that doesn't have that business model. Maybe you wanna think about that once you go through that hypergrowth. So it's a really hard question to answer, but simply, do you have the pain of a bottleneck, a centralized bottleneck? And if you do, data mesh might be useful to look into.
[00:38:54] Unknown:
And I think another interesting element of the idea of the data quantum being the building block of the mesh is that it doesn't necessarily have to be that the mesh exists entirely within the boundaries of your organization. So for a smaller company where it doesn't make sense to split out these domains internally, you might still have this data quantum that is at the boundary of your organization so that you can provide data as a service to other organizations that you're doing business with or to your consumers. And so then the data mesh becomes more than just what is within the walls of your company, but it becomes this ecosystem of data products that is composable and consumable across these business boundaries.
[00:39:37] Unknown:
I completely agree with you. The way I kind of imagine the data quantum, and I know it's one of those big words and it sometimes alienates people, but there wasn't any other word, so I just use this one. The way I imagine it is that it's the unit of value exchange. So if data is the thing that's valuable, the thing that is a product and we share, this is the unit of exchange. And if you think about it, if your mesh is within a large enterprise, then you're exchanging value between different parts of your organization, peer to peer.
But if your system is an ecosystem of partners, and there are a lot of closed ecosystems, or open ones with your partners, you know, then you have some sort of contractual agreements around data sharing. It's the same concept. You are sharing that unit of value, but this time, you're sharing it across a trust boundary. So then when you think backwards, hey, I'm now doing data sharing across a trust boundary, what is this data quantum thing that I need to share? Well, it can't just simply be those columns or rows of data anymore. It must also encapsulate, in my mind, the logic and the transformation that keeps this data alive in a way.
Also, the policies that govern this data: the privacy of that data doesn't change just because we shook hands and exchanged this value with somebody else. So it has to bundle those policies within it. I think that's a fantastic use case to take us to that extreme case where we are exchanging data across trust boundaries and saying, what constitutes data? And I don't call it data. I call it data product or data quantum just to force us to expand our thinking beyond just bits and bytes of, you know, information about Tobias or, you know, whoever we're sharing information about, to what else needs to be bundled within this thing so I can autonomously get value out of it and share it. The other part of it is that, okay, if I want to do this data sharing across trust boundaries, a lot of these big platforms fall apart. I mean, I have seen so many, you know, presentations on data marketplaces, where you share your data products on this marketplace. And they do not respect the fact that these are open.
I don't even like to use the word marketplace because, you know, that evokes different kinds of emotions around data sharing. But these are open ecosystems where the identity of the person who would be, you know, using that data, or the system, cannot be locked within a single platform. They are running a different platform. They're running a different identity system. So those sorts of data sharing standards that you and I just talked about a minute ago need to include identity systems that go beyond the bounds of a single organization, need to include standards that go beyond the bounds of a single, you know, data storage.
I think we have a lot to build on because we have built internet scale solutions and APIs. We just have to extend those with respect to very different modes of access to data.
[00:42:43] Unknown:
Going back again to the analogy of microservices, another problem that becomes manifest when you are dealing with all these decoupled components is understanding what are all the interconnections between them, but also, you know, if one of those elements in the system becomes overloaded, then it can become problematic, or it may end up becoming a supernode. Or there's the problem where I have all these microservices, so now I need to query 15 different systems to get all the pieces of data I need to be able to fulfill this request. And so particularly in the context of machine learning, where I might be pulling in lots of different data from lots of different sources to be able to build some sort of a composite model, I might need to interact with 15 different data quantums. And so that becomes potentially problematic as I try to figure out what are all the data sources available, how do I wanna compose them together, maybe they're not all providing consistent interfaces, which I know is sort of one of the requirements that you put forth for being considered a full fledged data product, that they all expose the same interfaces. But, you know, as a consumer of all of these quantums, how do I make sure that I can find them in the first place, and how do I make sure that they're all able to sort of give me data in consistent formats and in a performant manner so that I can make sure that I'm upholding my sort of service agreements to my downstream consumers?
[00:44:05] Unknown:
That's all engineering. That's all engineering. Like, I don't think it's an unsolvable problem. I think you mentioned a few of those. I completely agree that, you know, in the case of, perhaps, APIs and microservices, until you get to that top layer, right, the top front end where you're creating a journey that stitches APIs from many different services, you don't have that kind of, as you said, dependency on lots and lots of services. But then in the mesh, you do. And I do appreciate that in the analytical data world, most of the value that we get from that data, most of the interesting use cases, are stitching data and looking at the data across many, many different nodes. So there are a few things that must be in place. And I know this is the part of the conversation where I lose all of the friends that I make early in the conversation by just talking about principles. Everybody's like, you know, yeah, we're all friends. We all agree on these vague principles that you have. But when it comes to actual implementations and the hard engineering disciplines that we have to put in place to actually make it work, I think that's where I lose a lot of my friends. But let's talk about some of those hard engineering kind of disciplines that we have to put in place.
Some of them are around the standardization of what's in the gaps, which is standardization of the APIs. Not only just data sharing APIs, but also standardization of observability and discoverability. This data quantum, what information does it need to emit for a discovery tool? And again, my language is different. I don't say, like, a data catalog, because I want us to think about a new generation of solutions. A search tool that can discover every data product, discover, index, and allow searching every data product that exists on this mesh. Because they all self registered themselves the moment they were created, and they're providing continuously up to date information around what data they're providing, what's the timeliness, what's the completeness, and a bunch of other metrics.
So that, as you said, a data user, a data scientist coming up with a hypothesis that they want to validate, the first thing they need to know is what data they have access to, to be able to exploit the patterns that might exist within it. Right? They have a place to go to. They have a way of searching. They have a discovery tool that allows them to search the whole mesh. And once they search the whole mesh, they can see information that tells them which one is the right data provider to even, you know, connect to and start exploring. Who is using it? What is the documentation?
What's the schema around it? And you don't want every single data product to use a different way of documenting itself or use a different language for modeling. You really wanna standardize some of these aspects so the consumer has this kind of consistent experience of the mesh regardless of 10 or 50 or 20 or 100 data products. They look and feel the same, even though internally they're modeling a different kind of data. So for that, you know, vision to come true, to have a discovery that gives us consistent dimensions to be able to compare data products, which one is the right one for me, there needs to be observability, there needs to be discoverability, there need to be, you know, blueprints of the data product that kind of embed all of these abilities into every data product right from the moment you initialize or create one. There are mesh level experience capabilities, like a discovery tool or observability tool.
There is a lot to be done. And then your point around, okay, if I want to query across many disparate data products, is that efficient? I think, again, the way we should think about performance is that we should delineate between, physically, how the data might get stored, indexed, searched, and, logically, how we allow each of these data quantums to have a different life cycle. Right? So logically, they are completely independent, different schemas. They can be independently changed, modified, evolved, and controlled by different teams.
But we may very well choose the same underlying storage, the same overall indexing for search, caching, whatever is needed to optimize. So the separation of kind of physical and logical layers to give an autonomous experience to the providers and users, while having the optimization of kind of storage and access, would be useful. Again, it's an engineering problem.
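As one way to picture the self-registration and continuously refreshed metrics described above, here is a hedged sketch of the metadata a data product might publish to a mesh-wide discovery service when it is deployed. The endpoint, payload shape, and metric names are assumptions for illustration only.

```python
# Hypothetical sketch: a data product registering itself with a mesh-wide
# discovery/search service the moment it is created, including the
# observability metrics that would be refreshed continuously afterwards.
import json
import urllib.request

DISCOVERY_ENDPOINT = "https://discovery.example.internal/data-products"  # assumed


def register_data_product() -> None:
    payload = {
        "name": "orders.order_events",
        "domain": "orders",
        "owner": "orders-team@example.com",
        "output_ports": ["parquet://lake/orders/order_events"],
        "schema_version": "1.2.0",
        "documentation": "https://wiki.example.internal/orders/order-events",
        "metrics": {  # refreshed on a schedule after registration
            "timeliness_minutes": 15,
            "completeness_pct": 99.2,
            "user_rating": 4.6,
        },
    }
    request = urllib.request.Request(
        DISCOVERY_ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        print("registered with status", response.status)


if __name__ == "__main__":
    register_data_product()
```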
[00:48:36] Unknown:
Yes. The hardest problems are always the social ones. Engineering problems, you just need to throw enough thinking at it, and it'll figure itself out eventually.
[00:48:45] Unknown:
And money. And money too.
[00:48:47] Unknown:
Absolutely. And money. There's certainly plenty of that flowing around right now. And so in terms of the sort of implementations of data mesh that you've seen, as far as the kind of technical elements and the products that are available in the ecosystem and some of the organizational constructs that people have built up, what are some of the most useful and well thought through examples that you have seen in the time from when you first introduced this idea to where we are now, now that people have started to latch on to this idea as it has become much more popular, and people are actually starting to think about how to actually make it work at larger and larger scales? We talked about technology a little bit. I touched on the social part. I think
[00:49:33] Unknown:
the organizations, you know, that are the early adopters of data mesh are all kinds and forms and colors and shapes. They're not just the scale ups or, you know, just enterprises. It's across the spectrum. And you see a very different kind of starting point between an organization that had a traditional, you know, chief data analytics officer with governance and data science and engineering under it, versus an organization that is more nimble, kind of smaller, digital native. So you do see different behaviors in those. And I think the ones that are most successful are the ones that can really challenge their own biases and assumptions and bring new thinking. So maybe I'll just give some examples here.
The larger, more traditional organizations that have had well established data governance, and let's talk about governance for a minute, right? They try to solve the problem that we just talked about. It's a complex system. It's a chaotic system. You know, how do you prevent this chaos? They try to solve the challenges of this kind of independent data sharing with the old methods that they are very much used to. So the old methods are putting controls in place. So I love the, you know, systems thinking and kind of the work of Donella Meadows and, you know, the likes of her on systems thinking. And if you think about this as a complex system, the traditional organizations that have adopted data mesh still try to fit it into the systems thinking that they had for their organization. So they introduce bottlenecks, basically. So they are worried about, let's say, duplication of the data products, and how are we going to prevent this chaos of people building these data products? And that comes from, you know, years of having scars of people copying data into different databases.
What they think about is, well, we're gonna put a certification and validation and a manual kind of quality control in the pipeline of the data products. That just creates synchronization points and a bottleneck. It doesn't work. So I know you asked for good practices. I'm just telling you some of the bad ones. But just contrasting that with kind of companies that have automation, heavily relying on automation and platform solutions, and they're comfortable with chaos to a degree, and find different ways of managing just the same problem, the problem of, let's say, duplicated data products we don't need.
And the thinking that I've worked on with some of the clients and kind of seen around is, well, how do you apply systems thinking again to a complex system to avoid duplicates? Well, you introduce feedback loops. Right? So you introduce a positive or a negative feedback loop, in the sense that you want, for example, in a mesh, to expose all of the information about your data products in a centralized, I know I said the word centralized, but centralized global discovery or search tool. And this search tool will give a higher ranking to the data products that have, you know, better user satisfaction. They have more stars. People kind of like them more or use them more. And it gives a lower ranking to the ones that seem to be looking exactly the same as the other ones, but they don't have as much usage. So it's self balancing. The mesh tries to self balance itself by just this simple ranking, positive or negative feedback loops, which is a very different approach, which of course relies on observability and automation and all of that. But it's a very different social system design approach to solve a problem without putting bottlenecks in place.
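A tiny, illustrative sketch of the kind of feedback-loop ranking described here, where usage and satisfaction push well-liked data products up in discovery results and let unused near-duplicates sink. The signals and weights are assumptions, not a prescribed formula.

```python
# Hypothetical sketch: rank data products in the discovery tool so that
# well-used, well-liked products rise and unused near-duplicates sink.
def rank_score(monthly_consumers: int, avg_rating: float, staleness_hours: float) -> float:
    """Higher usage and satisfaction raise the rank; staleness lowers it."""
    usage_signal = min(monthly_consumers / 100, 1.0)    # saturates at 100 consumers
    satisfaction_signal = avg_rating / 5.0              # ratings on a 0-5 scale
    staleness_penalty = min(staleness_hours / 168, 1.0) # a week old = max penalty
    return 0.5 * usage_signal + 0.4 * satisfaction_signal - 0.1 * staleness_penalty


catalog = {
    "orders.order_events": rank_score(80, 4.7, 2),
    "orders.order_events_copy": rank_score(3, 3.1, 90),
}
for name, score in sorted(catalog.items(), key=lambda item: item[1], reverse=True):
    print(f"{score:.2f}  {name}")
```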
The other interesting aspect that I have seen is around education, empowering domain teams and application teams, and data literacy. I think HelloFresh had a presentation on that, which I adored: how they gamified the education around data and created different reward, incentive, and educational programs to get everybody on board in terms of valuing data as a product that they build. I'm still discovering and learning, participating in creating the kind of social systems that work, and there's still a lot to be done.
[00:53:55] Unknown:
One of the other interesting things that the idea of the data mesh, and some of the ways that it manifests in a company, can bring about is the variance in the types of roles and skills that are necessary. This has already been going through an evolution, because the idea of a data engineer is still relatively new in terms of human history. You know, it started off, we had database administrators, and we had business intelligence engineers, and then the rise of the data scientists made everybody realize we needed data engineers to be able to hand over data sources that were well groomed and maintained and up to date for the data scientists to work from. And now, as that has become a more recognized job description, we've also been seeing the rise of the analytics engineer, and now there's the data product engineer, the machine learning engineer, and this, you know, continuing proliferation of titles that all fall within the general category of data professional.
And I'm wondering what you see as some of the impact of data mesh, where these data product concerns are being brought into the application development cycle along with the domain expertise that's necessary. How will that influence the types of jobs and positions that organizations need in order to realize the full benefits of the mesh as they scale up and scale out and increase their level of sophistication of data usage?
[00:55:26] Unknown:
What I'm going to say might be a bit controversial, and I do not intend to undermine anybody's skills or talents or, you know, contribution. But if data mesh is successful at this vision I had, there would just be engineers, Tobias. There wouldn't be, you know, all of this rainbow of engineers, because we introduced those as intermediary roles. And every time we introduce an intermediary role, we're creating a gap between the producers and the consumers, and we're creating accidental complexity in that system, as opposed to thinking about how to close that gap and get these people to talk to each other directly, right?
So I think those intermediary boundary roles of analytics engineer and data engineer were needed at the time. But if organizations are serious about being data driven, or data informed, however you want to call it, as in embedding what you just discussed a minute ago, embedding intelligence in every decision and in every aspect of their applications, or at least in many of them, they have to get rid of these boundary roles. And they have to find a way to upskill and cross-skill their engineers to work with data, to work with ML. Of course, they will still have some specialized roles.
And those specialized roles are, for example, the PhD graduates of data science who are really working on the science part. I do think a lot of the work that we call data science is actually feature engineering; it's not really data science, but there is still a science part. So I think we will still have a smaller number of specialists and a large portion of generalists who, at one point in time, choose to focus more on the data side and, at a different point in their career, maybe choose to focus on the application part, as opposed to creating these fragmented boundary roles. And the purpose of the next generation of self-serve platforms is closing that gap, right? To enable what we call the generalist technologists. And I know there's no such thing as a generalist technologist; generalist technologists are technologists who choose to be experts in different things at different points in time. But it's possible to move between these areas of expertise because the learning curve is not as steep as it is for some of these specialties.
Yeah. So I think the purpose of that kind of self-serve platform, and of thinking about these platform capabilities and raising the abstraction, is to not require so many specializations.
[00:57:58] Unknown:
In terms of being able to raise that floor of complexity and, you know, reduce the amount of specialized knowledge that's necessary to be able to work at a surface level with these different data technologies, what do you see as the role of vendors and service companies in being able to help realize the potential for this future state?
[00:58:23] Unknown:
The list is long. So maybe I'll share some of the things at the top of my wish list for vendors. I think it, like any good product design, starts with focusing on the user experience and creating these new personas, which are your generalist developers. Think about the generalist engineer, or generalist expert engineer, whatever we're gonna call them. Think about where they come from, what sort of skill sets they have, what their experience is, and what's the most seamless experience for them to close that cycle of intelligence, right, from application development to intelligence, to deployment of that intelligence into the application. And again, I'm being a little cheeky here, but that's just good product design. Right? Think about the experience of this new persona of users that you have, as opposed to solutions that are closer to the metal, closer to the machine, and optimized for the machine. I think a lot of our data solutions, and for good reason, for the last two decades have mostly been optimizing for what you just talked about, for performance.
Right? For separation of the data from the compute so we can scale each out differently. Now that we've solved that problem, we need to raise the bar and focus on optimizing the experience, a connected experience, away from data movement and back to the full cycle of intelligent, you know, digital solution development. So that would be just the starting point: who we optimize our solutions for. And then the second part of it is that when I look at the data landscape, the data solution or vendor landscape today, I see two main categories.
I see a fragmented world of tiny little solutions, where every startup tries to build and capture the market for a small section. And very few of these startups start with, how does my solution fit in an ecosystem? Right? How does this connect with the rest of an ecosystem? So ecosystem thinking, building solutions that interconnect nicely with each other, is the second item on my wish list. And then the other camp is the big platforms: I'm gonna give you the soup-to-nuts of everything, just buy one solution, because you're gonna get everything you need. And that doesn't lead to a future that I want to be part of. It doesn't necessarily lead to a future that is conducive to innovation by smaller players, by disruptive players. So I guess it's a hard ask, but it's an ask that even if you are a big platform company, you still try to be a good citizen of an ecosystem with other players within it. And that means product design with interoperability and connectivity in mind from day one.
[01:01:16] Unknown:
And in terms of your experience of working with your clients at Thoughtworks and working with people in the community and writing the book and just interacting with people in general on this idea of the data mesh, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[01:01:33] Unknown:
So I guess the most interesting and exhilarating one was the surprise I got when I published this content, the conference talks and the writing. My expectation, and maybe I mentioned this last time, was sharp objects being thrown at me and me ducking. But to my surprise, I said what a lot of people were already thinking about or implementing. I just voiced something that was perhaps obvious but not always spoken of, and tried to put a system around it in that explanation. So the most surprising element is how the industry has embraced the topic.
I'm very, very happy about that. Related to that, the most surprising and yet challenging thing is that the vendors play a big role in this. And while I'm happily surprised that many vendors have embraced the approach, I still see a challenge in that I haven't really seen data-mesh-native solutions being created. As you mentioned earlier, I think the technology challenges are solvable engineering problems, but there's still quite a bit of a gap, so every project that starts requires investment from the adopters to bridge the gap until it is filled. But the social aspect of it is still a big challenge.
I thought that, you know, the domain application developers would be more welcoming of embracing data. And I just realized what a big gap there is between the company's mission of being data driven at the executive level and the reality on the ground, how compartmentalized data still is from the reality of the business and applications. That's the biggest challenge we have to solve, and that's what we talked about: what are those intrinsic motivations to embed intelligence and, you know, make this new age of AI become real, not at the executive level as a mission statement, but at the grassroots with engineers, business people, BAs, application developers.
[01:03:53] Unknown:
We touched on this a little bit already, but what are the cases where data mesh is the wrong approach and somebody might be better suited with just, you know, throwing everything into the data lake or the data warehouse and building these different, you know, point-to-point solutions?
[01:04:06] Unknown:
I guess I'd add something to your question. I would say, where are the places where data mesh is the wrong approach today? Because today versus five years from now, we may make decisions very differently. I think today, if your organization, as we discussed, doesn't have the scale, the scale of sources, the scale of use cases, and you have very modest use cases and a modest number of domains, I don't think data mesh is for you. Again, today, if you don't fit into the innovator or early adopter part of the adoption curve for a new innovation, as in you're not risk taking, you're not comfortable as an organization with ambiguity, and you don't have an experimental attitude toward developing and walking through the unknown, perhaps you've got to wait, because there is a level of experimentation.
There is a level of unknown and ambiguity and refinement that needs to happen that innovators and early adopters are okay with, but laggards are not. So if you're traditionally a laggard, then I don't think it's the right time for you. And the third piece I would say is that, because of where we are today, there are not many off-the-shelf technologies for you to get and simply integrate, so there is a fair bit of investment in building things out. And if you're not a company that respects or embraces technology at its core, as the main enabler or even the driver of the business, then now may not be the right time, because you would need that kind of technical foundation to build and operate a data mesh.
[01:05:47] Unknown:
And as you continue to work with companies and help to formalize these ideas around the Data Mesh, what do you see as the future for the principles and your overall involvement in helping to drive it forward? And at which point do you think it will become just the community's responsibility to adopt and push forward these ideas?
[01:06:10] Unknown:
I would like to see the community's involvement sooner rather than later. I'm actually very grateful; there are folks in the community that have led initiatives like, you know, the data mesh learning group, which flourished over the course of a couple of months from nobody to 1,000. So I think the community is forming, and there's a lot of information out there, and we're still figuring out what's good information and what's bad information, so there's a lot of misinformation as well. But the community is evolving, and I'm actually very happy and grateful for the participation of folks in the market.
As for my role, I think up to now, and until this book is out early next year, I've still been acting as an evangelist, trying to be ahead of the curve; I've been ahead of the curve with Thoughtworks for a few years. So I'll be engaged in the projects and the challenges we see on the ground, and come back and share those challenges and learnings with the larger community. I think Thoughtworks has traditionally had the spirit and culture of sharing. We did that with microservices, we did that with continuous delivery, and with a few other, I guess, major shifts that we've seen in our industry.
[01:07:20] Unknown:
Are there any other aspects of the ideas around data mesh and its manifestations and some of the work that needs to be done that we didn't discuss yet that you'd like to cover before we close out the show?
[01:07:39] Unknown:
I think at the high level, people kind of understand the principles and the motivations behind it. The book, in fact, is structured around why data mesh, so you have a way of justifying whether this is for you and why it should or shouldn't matter to you, and then what it is at the level of principles. Almost all of those principles are in the early release, if people want to access it. But the piece that we still have to figure out, and haven't figured out yet, is really the technical gaps we have to close to bring a distributed data sharing model for analytical use cases to life at scale. I have some opinions, some learnings, some hopes, and these are in part 3 of the book, which is about how to build the architecture.
Some of it is proven; some is totally hypothetical and yet to be proven. So I think we still have to work on that. And, as we discussed, where does data mesh fit into an organization-wide data strategy and execution? That's part 4 of the book, and I think we still need to learn and discuss more there. There won't be time for us to go into the details of those, but I think the actual technical architecture and implementation needs a refreshed and perhaps outsider perspective, and the execution of data mesh at the organizational level, the strategy and execution, those are discussions and topics still to be had.
[01:08:59] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[01:09:14] Unknown:
I'm probably gonna give you the same answer I gave you last time. Standards, standards, standards. Right? Whether these are standards like the Parquet file format and so on or not, standardizing the gap: the gap where data is shared across different, what we call them, data product quantums, or data sources. We still need to standardize analytical data sharing quite a bit more. I see some work being done around data sharing, some small steps that the industry is taking, but we need a lot more of those. And we need vendors to be incentivized, and hopefully data mesh is a catalyst to incentivize vendors to share data, which then hopefully leads to establishing the interoperability standards.
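As a purely illustrative example of the kind of standardization being wished for here, the sketch below shows what a minimal, self-describing output port descriptor for a data product might look like. The field names, the JSON layout, and the choice of Parquet as the physical format are assumptions made for this sketch; they are not an existing standard and are not attributed to anyone in the conversation.

```python
# Hypothetical sketch of a self-describing output port descriptor for a
# data product. Field names, the bucket path, and the SLO keys are all
# illustrative assumptions, not an established interoperability standard.
import json

output_port = {
    "data_product": "customer-orders",
    "domain": "sales",
    "port": "daily-orders",
    "format": "parquet",  # physical encoding of the shared data
    "location": "s3://example-bucket/sales/daily-orders/",  # placeholder path
    "schema": [
        {"name": "order_id", "type": "string"},
        {"name": "order_date", "type": "date"},
        {"name": "total_amount", "type": "decimal(10,2)"},
    ],
    "slo": {"freshness_hours": 24},  # guarantees a consumer can rely on
    "owner": "sales-domain-team",
}

# A discovery tool or a consumer could read a descriptor like this without
# any vendor-specific client, which is the sort of interoperability the
# standards wish above points at.
print(json.dumps(output_port, indent=2))
```

The value of agreeing on something like this is that any catalog, governance tool, or consuming team can interpret the same contract, regardless of which vendor produced the underlying data.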
[01:10:09] Unknown:
Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Sponsor Messages
Guest Introduction: Zhamak Dehghani
Zhamak's Journey into Data
Recap of Data Mesh Principles
Evolution of Data Mesh Thinking
Modern Data Stack and Data Mesh
Productization of Data
Frameworks and Contracts in Data Mesh
When to Implement Data Mesh
Data Mesh Beyond Organizational Boundaries
Challenges in Data Mesh Implementation
Successful Data Mesh Implementations
Impact on Data Roles and Skills
Role of Vendors and Service Companies
Lessons Learned from Data Mesh
When Data Mesh is the Wrong Approach
Future of Data Mesh and Community Involvement
Technical Gaps and Organizational Strategy
Closing Remarks and Contact Information