Summary
The current trend in data management is to centralize the responsibilities of storing and curating the organization’s information to a data engineering team. This organizational pattern is reinforced by the architectural pattern of data lakes as a solution for managing storage and access. In this episode Zhamak Dehghani shares an alternative approach in the form of a data mesh. Rather than connecting all of your data flows to one destination, empower your individual business units to create data products that can be consumed by other teams. This was an interesting exploration of a different way to think about the relationship between how your data is produced, how it is used, and how to build a technical platform that supports the organizational needs of your business.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- And to grow your professional network and find opportunities with the startups that are changing the world then Angel List is the place to go. Go to dataengineeringpodcast.com/angel to sign up today.
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Upcoming events include the O’Reilly AI Conference, the Strata Data Conference, and the combined events of the Data Architecture Summit and Graphorum. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
- Your host is Tobias Macey and today I’m interviewing Zhamak Dehghani about building a distributed data mesh for a domain oriented approach to data management
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by providing your definition of a "data lake" and discussing some of the problems and challenges that they pose?
- What are some of the organizational and industry trends that tend to lead to this solution?
- You have written a detailed post outlining the concept of a "data mesh" as an alternative to data lakes. Can you give a summary of what you mean by that phrase?
- In a domain oriented data model, what are some useful methods for determining appropriate boundaries for the various data products?
- What are some of the challenges that arise in this data mesh approach and how do they compare to those of a data lake?
- One of the primary complications of any data platform, whether distributed or monolithic, is that of discoverability. How do you approach that in a data mesh scenario?
- A corollary to the issue of discovery is that of access and governance. What are some strategies for making that scalable and maintainable across different data products within an organization?
- Who is responsible for implementing and enforcing compliance regimes?
- One of the intended benefits of data lakes is the idea that data integration becomes easier by having everything in one place. What has been your experience in that regard?
- How do you approach the challenge of data integration in a domain oriented approach, particularly as it applies to aspects such as data freshness, semantic consistency, and schema evolution?
- Has latency of data retrieval proven to be an issue in your work?
- When it comes to the actual implementation of a data mesh, can you describe the technical and organizational approach that you recommend?
- How do team structures and dynamics shift in this scenario?
- What are the necessary skills for each team?
- Who is responsible for the overall lifecycle of the data in each domain, including modeling considerations and application design for how the source data is generated and captured?
- Is there a general scale of organization or problem domain where this approach would generate too much overhead and maintenance burden?
- For an organization that has an existing monolithic architecture, how do you suggest they approach decomposing their data into separately managed domains?
- Are there any other architectural considerations that data professionals should be considering that aren’t yet widespread?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh
- Thoughtworks
- Technology Radar
- Data Lake
- Data Warehouse
- James Dixon
- Azure Data Lake
- "Big Ball Of Mud" Anti-Pattern
- ETL
- ELT
- Hadoop
- Spark
- Kafka
- Event Sourcing
- Airflow
- Data Catalog
- Master Data Management
- Polyseme
- REST
- CNCF (Cloud Native Computing Foundation)
- Cloud Events Standard
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform. If you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances.
Go to data engineering podcast.com/linode, that's linode, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. And to grow your professional network and find opportunities with the startups that are changing the world, then AngelList is the place to go. Go to data engineering podcast.com/angel to sign up today. And you listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season.
We have partnered with organizations such as O'Reilly Media, Dataversity, and the Open Data Science Conference with upcoming events including the O'Reilly AI Conference, the Strata Data Conference, and the combined events of the Data Architecture Summit and Graphorum. Go to data engineering podcast.com/conferences to learn more and to take advantage of our partner discounts when you register. And go to the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
[00:01:55] Unknown:
Your host is Tobias Macey. And today, I'm interviewing Zhamak Dehghani about building a distributed data mesh for a domain oriented approach to data management. So, Zhamak, can you start by introducing yourself? Hi, Tobias.
[00:02:06] Unknown:
I am Zhamak Dehghani, and I am a technical director at Thoughtworks, a consulting company. We do software development for many clients across different industries. I'm also part of our technology advisory board, which gives me, I guess, the privilege and the vantage point to see all the projects and technology that we use globally across our clients, digest information from that, and publish it as part of our Technology Radar twice a year. And do you remember how you got involved in the area of data management? Sure. So earlier in my career, I built systems where part of what they had to do, as part of kind of network management systems many years back, was collect, you know, real time data from a diverse set of devices across hundreds of nodes, and kind of manage that, digest that, act on it. It was part of some sort of network management observability.
However, post that experience, most of my focus has been on distributed architecture at scale, and mostly on operational systems. So data was something that was hidden, you know, inside operational services in a distributed system, to support what those operational systems need to do. In the last few years, both being on the tech advisory board for Thoughtworks and also working with clients on the West Coast of the US where I'm located, I had the opportunity of working with teams that were building data platforms. So I support different technical teams that we have at different clients, and sometimes I go really deep with 1 client, and sometimes I stay shallower across multiple projects.
So I came at this from slightly left field, from, you know, distributed systems and operational systems, to people who were struggling building kind of the next generation data platform for a large retailer, or working with teams who are building, you know, the next generation analytical platform for 1 of the tech giants here in San Francisco and struggling to scale that. So I started, I guess, in the last couple of years, working with teams that were heavily involved in recovering from past failures of, you know, data warehousing and data platforms, building kind of the next generation, and seeing the challenges and the experiences that they were going through. I'm sorry, I can't name many of these clients that I work with, but they often fall into the category of, you know, large organizations with fairly rich domains and rich datasets, where the state of the data is relatively unpredictable or has poor quality, because the organizations have been around for a long time and they've been trying to kind of harness the data and use it for many years. And so you recently ended up writing a post
[00:05:21] Unknown:
that details some of the challenges associated with data lakes and the monolithic nature of their architectural pattern and proposing a new approach that you call a data mesh. So can you start by providing what your definition is of a data lake since there are a lot of disagreements about that in the industry and discussing some of the problems and challenges that they pose as an architectural and organizational pattern? Sure. So maybe
[00:05:49] Unknown:
I'll give a little bit of a historical perspective on the definition, where it started, and what we see today as the general patterns. So I think, you know, the term data lake was coined by James Dixon in 2010. And what he envisioned at the time was an alternative to the data warehousing approach. What he suggested was that a data lake is a place for a single data source to provide its data in its raw format, so that other people, whoever the consumers are, can decide how they want to use it, how they want to model it. The contrast was with, you know, the data warehouse, in that the data warehouse was producing pristine, bottled data that were well modeled. They had very strong schemas and were designed to address very specific, I guess, use cases and needs around business intelligence and reporting and so on.
The data lake, based on James Dixon's initial, I guess, definition, was a place where a single data source provides unbottled data, raw data, that other people can use. Later on, he actually corrected and enhanced that definition, saying what he intended was for the data lake to be 1 place for raw data from a single data source, and an aggregate of raw data from multiple data sources into 1 place is what he called a water garden. But I think his, you know, thinking around raw data versus bottled data stirred people's imagination. What the data lake turned out to be is a place that aggregates raw data from all of the data sources, you know, all corners of the organization, into 1 place.
And then, on top of that, to allow mostly analytical usage, you know, data scientists kind of diving in and finding out what they need to find out. And it has evolved even from then onwards. Like, if you look, for example, at different data lake solution providers and their websites, whether they are, you know, our famous cloud providers like Azure, Google, and so on, or other service providers, the current, I guess, reincarnation of a data lake is not only a place where raw data from all the sources in the organization would land, but also a single place where, you know, the pipelines to cleanse and transform and provide different views and different access on that data also exist.
I use the data lake kind of metaphorically. What I mean by data lake is really the incarnation of the current data lake: a single, centralized, and kind of monolithic data platform providing data for a variety of use cases, anything from, you know, business intelligence to analytical to machine learning based, accumulating data from all the domains and all the sources in the organization. So based on that definition, we can, I guess, jump into the second part of your question about some of the underlying characteristics that these sorts of centralized data platforms
[00:09:18] Unknown:
share? Yeah. So I'm interested in understanding some of the organizational and technical challenges that they pose, particularly in the current incarnation, where everybody understands a data lake as being essentially a dumping ground for raw data that you don't necessarily know what to do with yet, or that you just wanna be able to archive essentially forever. And also some of the organizational and industry trends that have led us to this current situation?
[00:09:46] Unknown:
Yeah. Absolutely. I can go into that in a minute, but I wanna preface it with kind of the litmus test that this article has given me to validate whether the challenges that I had seen were widely applicable, you know, globally, or not. So I'll go into the challenges in a minute, but I wanna, I guess, give a little bit of information on what has happened since I've written this article. So the challenges that I have observed, technical and organizational, that we can share in a minute, I observed working, you know, with a handful of clients, and also seeing secondhand from the projects that we're running globally.
However, since I've written the article, there are tens of calls that I've received from different people, different organizations, kind of sharing the challenges that they have and how much this had resonated with them. And it was a confirmation that it wasn't just a problem in a pocket that I was exposed to; it's more widely spread. I think some of the underlying challenges are related. So 1 is the symptoms that we see. Right? The symptoms are mostly around, how do I bootstrap building a data lake? It's a big part of your infrastructure architecture. It's a central piece of your architecture, but yet it needs to facilitate collaboration across such a diverse set of stakeholders in your organization. People that are building your operational systems, that are collecting the source data, you know, generating essentially the source data as the byproduct of what they're doing. And those stakeholders really have no incentive in providing that byproduct, that data, for other people to consume, because there is no direct incentive or feedback cycle into where the sources are. So this central piece of architecture needs to collaborate with a diverse set of people that represent the source data, and also a diverse set of teams and business units that represent the consumers of the data and all the, you know, possible use cases that you can imagine for a company to become data driven, essentially, to make intelligent decisions based on the data. So the symptoms that we see are either, you know, people are stuck bootstrapping, building such a monolith and creating all those points of integration and points of collaboration.
Or people that have succeeded in creating some generation or some form of that data lake or data platform have failed to scale, either onboarding, you know, new data sources and dealing with the proliferation of the data sources in the organization, or they have failed to scale responding to the consumption models, to the different access points and consumption models for that data. So you end up with this kind of centralized, rather stale, incomplete dataset that really doesn't solve, you know, a diverse set of use cases. It might solve a narrow set of use cases. And the 4th, I think, kind of failure symptom that I see is that, okay, I've got a data platform in place, but how do I change my organization to actually work differently and use data for, you know, intelligent decision making, you know, augment our applications to become smarter and use that data. I'll just put that 4th failure mode aside, because there are a lot of cultural issues associated with it, and I wanna focus on perhaps more the architecture. I know you can't really separate architecture from organization, because they kind of mirror each other. And focus on, what are the root causes? What are the fundamental characteristics that any centralized data solution, whether it's a warehouse or a lake or your next generation big cloud based data platform, shares that leads to those symptoms? And I think the first and foremost is this assumption that data needs to be in 1 place. And when you put things in 1 place, you create 1 team around managing it, organizing it, you know, taking it in, digesting it, and serving it. And that fundamental centralized view and centralized implementation is by default a bottleneck for any sort of scale of operations. So that limits how much organizations can really scale, operationally, the use and intake of the data once you have a centralized team and a centralized architecture in place.
The other characteristic that I think is leading to those problems, first and foremost, is centralization that conflates different capabilities. When you create 1 piece of, you know, central architecture, and especially in the big data space, I feel there are 2 separate concerns that get conflated as 1 thing and have to be owned by 1 team. 1 is the infrastructure, the underlying data infrastructure that serves, you know, hosting or transforming and providing access to the big data at scale, and the other concern is the domain. What is the domain data, in raw format or in aggregate or modeled format? What are these domains that we actually try to put together in 1 place? Those concepts, the domains, and also the separation of infrastructure from the data itself, get conflated in 1 place, and that becomes another, you know, point of contention and lack of scale.
And that leads to other challenges around, you know, siloing people and siloing skill sets, which has its own impact and leads to the kind of unfulfilled promise of big data: siloing people based on technical skills, you know, a group of data engineers or ML engineers, just because they know a certain toolset around managing data, away from the domains where the data actually gets produced and the meaning of that data is known, and separating them from the domains that are actually consuming that data and are more intimately familiar with how they wanna use the data. Separating them into a silo, that's another point of pressure for the system, which also leads to other, you know, I guess, human impacts, like the lack of satisfaction, the pressure that these teams are under. They don't really understand the domains, and yet they're subjected to providing support and consuming, ingesting this data, making sense of it. And how fragile that interface between the operational systems and the data lake, the big data platform, is, because those operational systems and the life cycle of that data change very differently, based on the needs of those operational domains, from the data that the team is consuming. That becomes, you know, a fragile point where the data team is continuously playing catch up with the changes that are happening to the data. And the frustration of the consumers, because the data scientists or ML engineers or the business analysts that wanna use the data are fully dependent on the siloed data platform engineers to provide the data that they need in the way that they need it.
So you have, you know, a set of frustrated consumers on the other side and poor data engineers in the middle trying to work under this pressure. So I think there's a human aspect to that as well. Yeah. It's definitely interesting seeing the parallels between the monolithic approach of the data lake
[00:17:25] Unknown:
and some of the software architectural patterns that have been evolving as people try to move away from the big ball of mud, because it's harder to integrate and maintain. You have a similar paradigm in the data lake, where it's hard to integrate all the different data sources in the 1 system, but also between other systems that might want to use it downstream of the data lake, because it's hard to separate out the different models, or version the datasets separately, or treat them separately, since they're all located in this 1 area.
[00:17:57] Unknown:
I agree. I have a hypothesis as to why this is happening, why we are not taking the learnings from monolithic, you know, operational systems, their lack of scale and the human impact of that, and bringing those learnings to the data space. I definitely see, as you mentioned, parallels between the 2, and that's where this came from. I came at this kind of from left field. And to be honest, when I wrote this article, when I first started talking about it, I had no idea whether I was gonna, you know, offend a large community and get sharp objects thrown at me, or whether it was gonna resonate. And luckily, it's been on, you know, the latter side, with more positive feedback. So I think there are definitely parallels between the 2. My hypothesis on why data engineering, or the big data platform, has been stuck at least 6 to 7 years behind what we've learned in distributed systems architecture and more complex system design is that the evolutionary, you know, thinking and approach to improving the data platform is still coming from the same community and the same frame of thinking that the data warehouse was built upon. Even though we have made improvements. Right? We have changed ETL, extract transform load, to ELT, extract load transform, essentially with the data lake. But we are still fundamentally using the same constructs.
If you zoom into what the data warehouse stages were, you know, in that generation of data platform, and you look into, you know, even the Hadoop based or data lake based models of today, they still have similar fundamental constructs, such as pipelines, you know, ingestion, cleansing, transformation, serving, as the major first level architectural components of the big data platform. We have seen layered, you know, functionally divided architectural components before, back in the day, 10 years ago, when we tried to decouple monoliths.
The very first approach that enterprises took back in the operational systems, when they were thinking about how the heck am I gonna break down this monolith into some sort of architectural quantum that I can somehow independently evolve, was, well, I'm gonna layer this thing. I'm gonna put a UI on top, a business, you know, kind of process layer in the middle, and data underneath. And I'm gonna bring all my DBAs together to own this and manage this central database that is, you know, centrally managing data from different domains. And I'm gonna organizationally structure, you know, my organization in layers.
And that really didn't work, because the change doesn't happen in 1 layer. How often do you replace your database? The change happens orthogonally to these layers, across them. Right? When you build a feature, you probably need to change your UI and your business model and your data at the same time. And that led to this thinking that, you know what? In a modern digital world, the business is moving fast and the change is localized to those business operations. So let's bring this kind of domain driven thinking and build these microservices around the domain where the change is localized, so we can independently make that change.
I feel like we are kind of following in the same footsteps. So we've come from the data warehouse and a big data, you know, place, with 1 team to rule them all. And then we're scratching our heads to say, okay, how am I gonna turn this into architectural pieces so I can somehow scale it? And, well, flip the layered model 90 degrees and tilt your head, and what you see is a data pipeline. I'm gonna create a bunch of services to ingest, and a bunch of services to transform, and a bunch of services to serve, and have these data marts and so on. But how does that scale? It doesn't really scale, because you still have those points of handshake and points of friction between these layers to actually meaningfully create new datasets, create new access points, create new, you know, features in your data. You're introducing new data products, in a way.
And hopefully, we can create some cross pollination from the thinking that happened in operations and bring it into the data world. And that's what I'm hoping to do with the data mesh: to bring those paradigms together and create this new model, which happens at the intersection of the disciplines we applied in operational domains
[00:22:21] Unknown:
to the world of data so we can scale it up. And another thing that's worth pointing out from what you were saying earlier is the fact that the centralized data lake architecture is also oriented around a centralized team of data engineers and data professionals. And part of that is because of the fact that, you know, within the past few years, it's been difficult to find people who have the requisite skill set, partly because we've been introducing a lot of new different tool chains and terminologies, but also because we're trying to rebrand the role definitions.
And so we're not necessarily as eager to hire people who do have some of the skill sets but maybe have gone under a different position title, whether it's a DBA or, you know, maybe a systems administrator, and to convert them into these new role types. And so I think that that's another part of what you're advocating for with this data mesh: the realignment of how the team is organized in more of a distributed and domain oriented fashion, being embedded with the people who are the initial progenitors of the data and the business knowledge that's necessary for structuring it in a way that makes it useful to the organization as a whole. So, if you can, talk through what your ideas are in terms of how you would organizationally structure the data mesh approach, both in terms of the skill sets and the team members, and also from a technical perspective, and how that contributes to a better approach for evolutionary design of the data systems themselves and the data architecture.
[00:24:00] Unknown:
Sure. I think the point that you made around, you know, skill sets: we haven't really created career paths for software generalists to gain the knowledge of all the tooling required to, you know, perform operations on data and provide data as a first class asset, and we have siloed people into a particular role, like data engineers who have come to that path perhaps from a DBA or data warehousing background and haven't had the opportunity to really put the hat of a software engineer on. That has resulted in a, you know, silo of skill sets, but also a lack of maturity. So I see a lot of data engineers that are, you know, wonderful engineers. They know the tools that they're using, but they haven't really adopted the best practices of software engineering, like versioning of the data, continuous delivery of the data. These concepts are not well established or well understood, because they basically evolved in the operational domain, not in the data domain. And many, you know, wonderful software engineers haven't had the opportunity to learn Spark and Kafka and, you know, stream processing and so on, so that they can add those to their tool belt. And I think for the future generation of software engineers, hopefully not so far away, in the next few years, any generalist kind of full stack software engineer will have, you know, the toolkits that they need to know and have expertise in to manipulate data in their tool belt.
I'm really hopeful for that. There is an interesting statistic from LinkedIn, if I remember it correctly, from 2016, and I doubt that has changed much since then. They had 65,000 people who had declared themselves as data engineers on their site, and there were 60,000 jobs looking for data engineers in the Bay Area alone. So that shows the discrepancy between what the industry needs and how we are, you know, enabling our engineers to fulfill those roles. Going back to your question; sorry, that was my little rant about career paths and developing data engineers. I personally, with my teams, welcome the data enthusiasts that want to become data engineers, and provide career paths and opportunities for them to evolve in that role. And I hope other companies would do the same.
So going back to the question around kind of, what's a data mesh? What are those fundamental organizational and technical constructs that we need to put in place? I'm gonna talk about the fundamental principles, and then, hopefully, I can bring it together in 1 cohesive sentence to describe it. The first fundamental principle behind the data mesh is that data is owned and organized through the domains. For example, let's say you are in the health insurance domain, and you have the claims and all the operational systems, and you probably have many of them, that generate raw data about the claims that the members have put through.
That raw data should become a first class concern in your architecture. So the domain data, the data constructed or generated around domain concepts such as claims, such as members, you know, such as clinical visits, these are the first class concerns in your structure and in your architecture, which I call domain data products, in a way. And that comes from, you know, domain driven kind of distributed architecture. So what does that mean? That means that at the point of origin, the systems that are generating the data are representing the facts of the business as we are operating the business, such as events around claims, or perhaps even historical snapshots of the claims, or some current state of those claims.
The teams that are most intimately familiar with that data are responsible for providing that data in a self serve, consumable way to the rest of the organization. So 1 of the constructs or principles is that the ownership of data is now distributed and is given to the people who are best suited to know and own that data. And that ownership can happen at multiple places. Right? You might have your source operational systems that would now own a new dataset or streams of data, whatever format is most suitable to represent that domain. And I have to clarify, for that raw data that those, you know, systems of origin generate, we're not talking about their internal operational database, because internal operational databases and datasets are designed for a different purpose and intent, which is: make my system work really well. They're not designed for other people to get a view of what's happening in the business in that domain, capturing those facts and realities.
So it's a different dataset. Whether it's a stream, very likely, or a time series, or whatever format is most suitable, these are native datasets that are owned by the systems of origin and the people who are operating those systems. And then you have datasets that are maybe more aggregate views. For example, again with health insurance as an example, you might wanna have predictive points of intervention, or, you know, critical points of contact, where you want the care provider to make, you know, contact with a member to make sure that they're getting the health care that they need at the right point in time, so that they don't get sick and make a lot of claims on the insurance at the end of the day. So the domain that is responsible for making those contacts, making those smart decisions, and predicting when and where to contact a member might have historical, you know, records of all the visits that the member has made and all the claims. It's a joined, aggregate view of the data. But that data might be useful not only for their domain, but for other domains.
So that becomes another domain driven dataset that the team is providing for the rest of the organization. That's the distributed domain aspect of it. The second principle is that for data to be really treated as an asset, for it to be, in a distributed fashion, consumed by multiple people, and still be joined together and filtered and meaningfully aggregated and used in a self serve way, I think we need to bring product thinking to the world of data. So that's why I call these things kind of domain driven data products, or domain data products.
Product thinking in a technology space, what does that mean? That means I need to put all the technology, you know, characteristics and tooling in place so that I can provide a delightful experience for the people who wanna access the data. And you might have different types of consumers. They might be data scientists. Maybe they just wanna analyze the data and run some queries to see what's in it, or maybe they wanna convert that data to some other, you know, easy to understand form, kind of a spreadsheet. So you have this diverse set of consumers for your datasets.
But for this dataset to be considered a product, a data product, and to bring product thinking to it, you need to think about, okay, how am I gonna provide discoverability? So that's the first step. How can somebody find my data product? That's discoverability. How can they address it so they can programmatically use it? That's addressability. I need to make sure that I put enough security around it so that whoever is authorized to use it can use it, and they can see things that they should see and not see things they shouldn't see. So there's the security around it. And, well, for this to be self serve as a product, I need to really have good documentation, maybe example datasets that they can just play with to see what's in the data. I need to provide the schema so they can see structurally what it looks like. So there's all of the tooling around kind of self describing data and supporting, again, the understanding of what the data is. So there is a set of practices that you need to apply to data for data to become an asset in itself. And finally, the third discipline, whose intersection with the others would create the data mesh, is the platform thinking part of it. At this point of the conversation, usually people tell me, hold on a minute. You've asked me now to have independent teams, all of my operational teams, own their data and also serve it in such a self serve, you know, easy to use way.
That's a lot of repeatable, you know, menial work that these teams have to do to actually get to the point where they can provide the data. Like all the pipelines that they need to build internally so that they can maybe extract data from the databases, or have a new, you know, event sourcing mechanism in place, so that it leads into, you know, transmitting or emitting the events. There is a lot of work that needs to go into documentation, discoverability. How can they do this? This is a lot of cost. Right? So that's where kind of the data infrastructure, or platform thinking, comes into play.
I see a very specific role in a data mesh, in this mesh of, you know, domain oriented datasets, for infrastructure, for what I call self serve data infrastructure: all the tooling that we can put in place on top of our raw data infrastructure. The raw data infrastructure is, you know, your storage and your backbone messaging or backbone event logs and so on. But on top of it, you need to have a set of tooling that these data product teams can easily use so that they can build those data products quickly. So when we build the data mesh, we put a lot of capabilities into that layer, a self serve infrastructure layer supporting discoverability, supporting, you know, documentation, supporting secure access, and so on. And the success criterion for that data infrastructure tooling is decreased lead time for a data product team or an operational team to get their dataset onto this kind of mesh platform.
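To make the product thinking and self serve platform ideas concrete, here is a minimal sketch in Python. The `DataProductDescriptor` fields and the `MeshPlatform` API are hypothetical illustrations of the capabilities described above, not a real library.

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class DataProductDescriptor:
    """Self-describing metadata a domain team publishes alongside its data product."""
    domain: str          # owning domain, e.g. "claims"
    name: str            # product name, e.g. "claim-events"
    owner: str           # the data product owner to contact
    address: str         # addressability: a stable URI consumers can program against
    schema_ref: str      # where consumers find the schema of the dataset
    docs_ref: str        # documentation and sample data for self serve use
    slo: Dict[str, str] = field(default_factory=dict)      # e.g. freshness guarantees
    quality: Dict[str, str] = field(default_factory=dict)  # e.g. known duplicate events

class MeshPlatform:
    """Hypothetical self serve data infrastructure API.

    Its success criterion is decreased lead time: a domain team should be able
    to provision storage and publish a discoverable data product quickly,
    without caring where the data physically lives.
    """

    def provision_stream(self, domain: str, name: str) -> str:
        # The platform, not the domain team, decides the physical location
        # (object store, Kafka topic, ...) and hands back a logical address.
        return f"mesh://{domain}/{name}"

    def register(self, product: DataProductDescriptor) -> None:
        # Registers the product with the global discovery catalog.
        print(f"registered {product.domain}/{product.name} at {product.address}")

# Example: a claims team publishing claim events as a data product.
platform = MeshPlatform()
address = platform.provision_stream("claims", "claim-events")
platform.register(DataProductDescriptor(
    domain="claims",
    name="claim-events",
    owner="claims-team@example.com",
    address=address,
    schema_ref="schemas/claim-event-v1.json",
    docs_ref="docs/claims/claim-events.md",
    slo={"freshness": "near-real-time"},
    quality={"events": "at-least-once, duplicates possible"},
))
```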
And that decreased lead time is, I think, a big part of it. At this point, it's easy to imagine that, okay, if I now have domain oriented data products, clearly I need to have people who can bring product thinking to those domains. So you have people that play kind of the data product owner role, and they might be the same person as the tech lead, or the same person as the, you know, product owner for your operational system. But what they care about is the data as a product they provide to the rest of the organization. So the life cycle of that data, its versioning, what sort of, you know, elements they want to put into it. What is the kind of service level objective, in a way? What are the quality metrics that we support? You know, if this data is near real time, it probably has some missing or duplicate events in there, but maybe that's, you know, an accepted kind of service level agreement in terms of the data quality. And they think about, you know, all the stakeholders for that data. And similarly, now our software engineers who are building operational systems also have skill sets around using Spark or Airflow or all of the tooling that they need to implement their data products. And similarly, the infrastructure folks that are often, you know, dealing with your compute systems and your storage and your build pipelines, providing those as, you know, self serve tooling.
Now they're also thinking about big data storage, and they're thinking about, okay, data discovery. And that's an evolution that we've seen, kind of, when the API, you know, revolution happened. A whole lot of technology came into infrastructure, like the API gateways and API documentation and so on, that became part of the service infrastructure, in a way. And hopefully, I see the same kind of change happening with infrastructure folks supporting
[00:37:05] Unknown:
a distributed data mesh. Yeah. I think that, you know, when I was reading through your post, 1 of the things that came to mind is a parallel with what's been going on in technical operations, where you have an infrastructure operations team that's responsible for ensuring that there is some baseline capacity for providing access to compute and storage and networking. And then you have layered on top of that, in some organizations, the idea of the site reliability engineer that acts as a consultant to all the different product teams to make sure that they're aware of and thinking about all of the different operational aspects of the software that they're trying to deliver. And drawing a parallel with the data mesh, where you have this data platform team that's providing access to the underlying resources for being able to store and process and distribute the data, and then having a set of people who are either embedded within all of the different product teams, or acting as consultants to the different product teams, to provide education and information about how they can leverage all these different services to make sure that the data that they're providing and generating within their systems is usable downstream by other teams and other organizations, either for direct integration or for being able to provide secondary data products from that, in
[00:38:25] Unknown:
a reusable and repeatable manner? Yes. There are definitely parallels, and I think we can learn from those lessons. You know, we have gone through the dev and ops separation, and, you know, with devops, we brought them together. And as a side effect of that, we created a wonderful generation of engineers called SREs. So I think absolutely the same parallels can be drawn and those learnings applied to the world of data. 1 thing I would say, though: going into this, you know, distributed model with distributed ownership around data domains, with different, you know, infrastructure folks supporting that, bringing product thinking and platform thinking and all of that together, it's a hard journey, and it's counterculture, to a large degree, to what we have today in organizations.
You know, today, the data platform, the data lake, is a separate organization all the way to the side, really not embedded into the operational systems. What I hope to see is that data becomes the fabric of the organization. Today, we do not argue about having APIs, you know, serving capabilities to the rest of the organization from services that are owned by different people. But for some reason, we argue that that's not a good model for sharing data, which I'm still puzzled by. So data right now is separated. People are still thinking, you know, there's a separate team to deal with my data. In the operational systems, folks are not really incentivized to own or even provide, you know, kind of trustworthy data. So I think we still have a long way to go for organizations to reshape themselves, and there is a lot of friction and challenge between the operational systems and the data that needs to be unlocked, and the consumption of that data in a meaningful way.
I definitely am working with, and also have seen, pockets of this being implemented. But for people who get started, and where I'm starting also with a lot of our clients, day 1 doesn't look like distributed ownership and distributed teams. So on day 1, we in fact do start with maybe 1 physical team that brings people from those operational domains in as SMEs, and brings infrastructure folks into the team. But though it's 1 physical team, the underlying tech we use, like the repos for the code that generates data for the different data domains, are separate repos. They're separate pipelines.
The, you know, infrastructure has a separate backlog for itself. So virtually, internally, we have separate teams, but on day 1 they're still working under 1 program of work. And hopefully, as we go through the maturity of, you know, building this in a distributed fashion, those internal virtual teams, which have been working on their separate repos with a very explicit domain that they're looking after, once we have enough of the infrastructure in place to support those, you know, domains becoming completely autonomous, can go out and, you know, turn into long standing teams that are responsible for the data of that domain. So though that's a target state, I do wanna, I guess, mention that there is a maturity curve and there is a journey that we have to go through, and we'll probably make some mistakes along the way and learn and correct, to get to that point at scale. But I'm very hopeful, because I have seen this happen, as you mentioned, in parallel in the operational world. And I think we have a few missing pieces for it to happen in the data world, but I'm hopeful. Based on the conversations that I've had with companies since this started, I've heard a lot of, you know, horror stories and failure stories and the need for change. So the need is there. I've also been talking to many companies that have come forward and said, we are actually doing this, let's talk. So maybe they haven't publicly published and talked about their approach, but internally they're at the kind of leading front of changing the way and challenging the status quo around data management.
[00:42:40] Unknown:
Yeah. And 1 of the issues with data lakes is that, at face value, the benefit that they provide is that all of the data is collocated, so you reduce latencies when you're trying to join across multiple datasets, but then you potentially lose some of the domain knowledge of the source teams or source systems that were generating the information in the first place. And so now we're moving in the other direction, of trying to bring a lot of the domain expertise to the data and providing that in a product form. But as part of that, we then create other issues in terms of discoverability of all the different datasets and consistency across different schemas for being able to join across them, where, if you leave everybody up to their own devices, they might have different schema formats. And then you're back in the area of trying to generate master data management systems and spending a lot of energy and time trying to coerce all of these systems to have a common view of the information that's core to your overall business domain. And so when I was reading through the post initially, I was starting to think that, you know, we're trading off 1 set of problems for another. But I think that by having the underlying strata of the data platform team, where the actual data itself is being managed somewhat centrally from the platform perspective, but separately from the domain and expertise perspective, we're starting to reach a hybrid where we're optimizing for all the right things and not necessarily having too many trade offs in that space. So I'm curious what your perspective is in terms of those areas of things like data discovery and schema consistency, and where the responsibilities lie between the different teams and organizationally, for ensuring that you don't end up optimizing too far in 1 direction or the other? Absolutely. I think you
[00:44:37] Unknown:
made a very good point and a very good observation. We definitely don't wanna swing back into data silos and, you know, inconsistency. The fundamental principle behind any distributed system working is interoperability. So you've mentioned a few things. 1 is the data locality for performance, like, you know, data gravity, where the computation happens close to the data. For example, your pipelines accessing that data are running close to your data, and the data is collocated. Those underlying physical implementations, we should bring those learnings from data lake design into the data mesh. They should absolutely be there. A distributed ownership model does not mean a distributed physical location of the data. Because, as you mentioned, the data infrastructure team's job is providing that, you know, location agnostic, or fit for purpose, storage as a service to those data product teams. So it's not every team willy nilly deciding where or how I'm gonna store my data. There is a, you know, infrastructure that is set up to incorporate all of those performance concerns and scalability concerns and the colocation concerns of the data and computation, and to decide, in a somewhat agnostic way for those data product teams, where the data should be located. So, physically, whether I'm on the same, you know, bunch of S3 buckets, or on an instance of Kafka, or wherever it is, as a data product team, I shouldn't care about that. I should go to my data infrastructure to get that service from them, and they are best positioned to tell me, given my needs, obviously, and the scale of my data and the nature of it, where my data goes. So we can still bring all of those best practices around the location of the data and so on into the infrastructure.
Discoverability is another aspect of it. I think 1 of the most important properties of a data mesh is having some sort of a global, you know, single point of discovery system. So that as someone who wants to generate a new solution, say I wanna make a smart decision and get some insight about my patients, in my health insurance example, I need to be able to go to 1 place, which is my, you know, data catalog, for lack of a better word, or data discovery system, where I can see all of the data products that are available for me to use. I can see who owns them, you know, what is the meta information about that data, what's the quality of it, when was the last time it got updated, what's the cadence of it being updated?
I can get, you know, sample data to play with. So that discoverability kind of function, I see as a centralized and global function where, in a federated way, you know, every data product can register itself with that discovery tool. And we have primitive implementations of that, which are just Confluence pages, up to maybe more kind of advanced implementations. But that's a globally accessible and centralized, I think, solution. And 1 other thing that you mentioned was, well, now you leave the decision around how the data should be provided and what its schema should be to those teams. That hasn't changed from the lake, because the lake implementation, at least in its pure initial definition, also leaves that to the data sources.
But as I mentioned, interoperability and standardization is a key element. Otherwise, we end up with a bunch of domain data that there is no way I can join together. For example, there are some concerns, some entities, like a member. Right? The member in a health insurance claims system probably has its own internal IDs, and, I don't know, the health care provider system has its own internal IDs. The member is a polyseme, an entity that crosses different domains. So if we don't have a standardization around mapping those internal member IDs to some sort of a global member ID where I can join the data across different domains, these disparate domain datasets are just good by themselves. They don't allow me to do, you know, merging and filtering and searching.
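As a small illustration of that polyseme problem, the sketch below joins 2 domain datasets on a governed, global member ID; all field names and values are made up for the example.

```python
# Each domain keeps its own internal member ID, but a governed, global member ID
# is published with each data product, making cross-domain joins possible.
claims = [
    {"claims_member_id": "C-17", "global_member_id": "M-001", "open_claims": 2},
    {"claims_member_id": "C-42", "global_member_id": "M-002", "open_claims": 0},
]
clinical_visits = [
    {"provider_member_id": "P-93", "global_member_id": "M-001", "visits_this_year": 5},
]

# Join the 2 disparate domain datasets on the standardized global ID.
claims_by_member = {row["global_member_id"]: row for row in claims}
joined = [
    {**claims_by_member[visit["global_member_id"]], **visit}
    for visit in clinical_visits
    if visit["global_member_id"] in claims_by_member
]
print(joined)  # 1 merged record for member M-001 across both domains
```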
So the global governance and standardization is absolutely key. And that governance and standardization sets, you know, standards around anything that allows interoperability: from maybe the formats of certain fields, how we represent date, time, and, you know, time zones, to how we handle polysemes and federated identity management, to what is the unique way of securely accessing data, so we don't have 5 different security implementations or secure access models. All of those concerns, I think, are part of that global governance, and the implementation of that, as you mentioned, goes into your data infrastructure as tooling that people can easily use, as shared tooling. The API revolution wouldn't have happened if we didn't have HTTP as a baseline standardization, REST as a standardization.
These standards have allowed us to decentralize our monolithic systems. We had, you know, API gateways and API documentation, a centralized place to find things, like what are the APIs I need to use. So I think those same concerns and the best practices out of the lake should come into the data mesh and not get lost.
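A minimal sketch of that single point of discovery, assuming a hypothetical in-memory catalog: registration stays federated with each data product team, while the catalog itself stays global.

```python
from typing import Dict, List

class DiscoveryCatalog:
    """Hypothetical global data catalog: 1 place to find every data product."""

    def __init__(self) -> None:
        self._products: Dict[str, dict] = {}

    def register(self, entry: dict) -> None:
        # Called by each data product team (federated), not by a central data team.
        self._products[f'{entry["domain"]}/{entry["name"]}'] = entry

    def search(self, keyword: str) -> List[dict]:
        # The global function: find owners, update cadence, and other metadata.
        return [
            p for p in self._products.values()
            if keyword in p["name"] or keyword in p["domain"]
        ]

catalog = DiscoveryCatalog()
catalog.register({
    "domain": "claims",
    "name": "claim-events",
    "owner": "claims-team@example.com",
    "last_updated": "2019-07-01T00:00:00Z",  # ISO 8601, per the governance standards
})
print(catalog.search("claims"))
```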
[00:50:15] Unknown:
And from an organizational perspective, particularly for any groups that already have an existing data lake implementation, talking through this, I'm sure it sounds very appealing, but it can also sound fairly frightening because of the amount of work that's necessary. And as you mentioned, the necessary step is to take a slow and iterative approach to going from point a to point b. But there are also scales of organization where they might not have the capacity to even begin thinking about having separate domains for their data, because everybody is just 1 group. And so I'm curious what your perspective is on how this plays out for smaller organizations, both in terms of the organizational patterns, but also in terms of the amount of overhead that is necessary for this approach, and whether there is a certain tipping point where it makes more sense to go from a data lake to a data mesh or vice versa? Yeah. That's a very good question.
[00:51:10] Unknown:
For the smaller scale organizations: if they have a data lake, if they have a centralized team that is ingesting data from many different systems, and this team is separate from the operational teams, but somehow, because of the size, they can manage, and they have closed the loop. And by closed the loop, I mean it's not just that they, you know, consume data and put it in 1 place, but they have also satisfied the requirements of the use cases for the data, and they have become a data driven company. If you're there, and you have managed to, you know, build a closed, tight loop between your operational systems providing the data and the, you know, intelligent services or ML based services that are sitting on top of the data and consuming it, and you have a fast, you know, fast cycle of iteration to update the data and also to update your models, and that's working well for you, there's no need to change. A lot of startups and scale ups, you know, are still using the, say, Rails application that they built on the first day, and that is still okay. But if you have pain, and if the pain you're feeling is around ingestion of the data and keeping that data up to date, understanding the data, you see fragility, if that's a word.
You see the fragile points of connectivity, where the source changes and your pipeline falls apart, or you are struggling to respond to the innovation agenda of your organization. We talk about build, test, and learn, which requires a lot of experimentation, and if your team is struggling to change the data to support that experimentation, then I think that's an inflection point to start asking: how can I decouple this? How can I decouple the responsibility? A very simple starting point is to decouple your infrastructure from the data on top of it. Your data engineers today are probably looking after both the data and the infrastructure pieces, and the infrastructure is cross-domain: it really doesn't matter which data domain it is serving or transforming.
Separate that infrastructure into its own team, and design success criteria for that team that are aimed at enabling the people who want to provide data on top of the infrastructure; separate that layer first. When you come to the data itself, find the domains that are going through change most often, the domains that are changing continuously. Often those are not your customer-facing applications or the operational systems at the point of origin; those systems may change, but the data they emit often doesn't, though sometimes it does. More often, the rapidly changing domains are the ones sitting in the middle, the aggregates. Either way, find out which data domain is changing most rapidly.
Separate that one first. Even if it's only a logical separation, maybe your data team is still the same team, but within that team some people become the owners or custodians of that particular domain. Bring those people together with the actual systems emitting the data, if it's a system of origin, or with the teams aggregating the data in the middle, if it's aggregate data, and start experimenting with building that dataset as a separate data product, and see how that works. I think that's where I would start in a smaller organization.
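To make "building that dataset as a separate data product" a little more concrete, one hedged sketch is a small descriptor that the custodian team could publish alongside the data itself. Every field and value here is hypothetical; it simply illustrates the product-thinking attributes discussed in this episode: ownership, an output port, a versioned schema, and a measurable quality attribute.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class DataProductDescriptor:
    """Hypothetical metadata a domain team might publish so its
    dataset is discoverable and usable as a product."""
    name: str                   # e.g. "claims.settled-events"
    domain: str                 # owning business domain
    owners: List[str]           # the custodians within the team
    output_port: str            # where consumers read the data from
    schema_version: str         # versioned, backwards-compatible schema
    freshness_sla_minutes: int  # measurable quality attribute
    docs_url: str               # documentation for consumers

claims_events = DataProductDescriptor(
    name="claims.settled-events",
    domain="claims",
    owners=["claims-data-custodians"],
    output_port="s3://mesh/claims/settled-events/",
    schema_version="1.2.0",
    freshness_sla_minutes=60,
    docs_url="https://wiki.example.com/claims/settled-events",
)
```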
For organizations that already have many data warehouses, multiple iterations or reincarnations of them, somewhat disconnected, and probably a lake that isn't really working for them: again, look for the failure symptoms I mentioned at the beginning of this conversation. You can't scale, you can't bootstrap new sources, you haven't become data-driven. Start by finding use cases. I always take a value-driven, use-case-driven approach. Find a few use cases where you can really benefit from data that comes directly from the source, data that is timely and rich and changing often. And find use cases for that data, not just the data itself, whether they are ML use cases or BI use cases. Group them together, identify the real sources of the data, not the intermediary data warehouses, and start building out the data mesh one iteration, one use case, a few data sources at a time. That's exactly what I'm doing right now, because the organizations I work with mostly fall into that category: the kind of hairy data estate that needs to change incrementally. And there is still a place for a lake, and still a place for a data warehouse. A lot of BI use cases do require a well-defined multidimensional schema, but it doesn't need to be the centerpiece of this architecture.
They become a node on your mesh, mostly closer to the consumption side, because they satisfy a very specific set of use cases around business intelligence. So they become consumers of your mesh, not the central piece for providing the data.
[00:56:37] Unknown:
And in your experience of working with different organizations and through different problem domains and business domains, I'm wondering if there are any other architectural patterns, anti-patterns, or design principles that you think data professionals should be taking a look at that aren't necessarily widespread within that community?
[00:56:52] Unknown:
Yes. There are a lot of practices we take somewhat for granted today when building a modern digital platform or a modern web application, and I think there is space to bring those practices to both data and ML. Continuous delivery is an example: all the code, and also the data itself, and also the models you create, are under the control of your continuous delivery system. They're versioned, they have integrity tests and other tests around them, and deploying to production means releasing something so that it's accessible to the rest of the organization.
I would say that basic good engineering hygiene around continuous delivery is not a well-understood or well-established concept in either data or ML. A lot of ML engineers and data scientists don't even know what continuous delivery for ML would look like, because between making a model in R on your laptop and putting something into production, there are months of work and translation, somebody else rewriting the code for you, and the production system using the stale data the model was trained on. There is no concept of "my data changed, now I need to kick off another build cycle to retrain my machine learning models." So continuous delivery in ML, and in the data world as well, things like data integrity tests and testing for data, is not a well-understood practice. Versioning data is not a well-understood practice.
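As a sketch of what "testing for data" in a continuous delivery pipeline could look like, here is a minimal integrity gate that might run on every build before a dataset is released. The file layout, field names, and expectations are all invented for illustration; it only uses the Python standard library.

```python
import csv
from datetime import datetime

def check_claims_extract(path: str) -> None:
    """Hypothetical integrity gate: fail the pipeline run if the
    candidate dataset violates its declared expectations."""
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))

    assert rows, "dataset is empty"
    for row in rows:
        # Required fields are present and non-empty.
        assert row["member_id"], f"missing member_id: {row}"
        # Dates parse under the assumed ISO 8601 standard.
        datetime.fromisoformat(row["settled_at"])
        # Amounts are non-negative numbers.
        assert float(row["amount"]) >= 0, f"negative amount: {row}"

# In CI this would run against the candidate output; a failure blocks
# the release, the same way a failing unit test blocks a code deploy.
```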
So things like schema versioning and backwards compatibility would be some of the early technical concerns and capabilities that I would introduce into data engineering and ML engineering teams.
[00:58:50] Unknown:
And are there any other aspects of your thinking around data mesh and the challenges of data lakes, or anything else pertaining to the overall use of data within organizations, that we didn't discuss yet and you'd like to cover before we close out the show?
[00:59:03] Unknown:
I think we've pretty much covered everything. I would maybe re-emphasize a couple of points. Making data a first-class concern, an asset, structured around your domains, does not mean you have to have well-modeled data. It could be your raw data, the raw events generated at the points of origin. But there is product thinking around it: self-serve access, some understood and measured form of quality, and good documentation, so that other people can use it and you treat it as a product. It doesn't necessarily mean doing a lot of modeling of the data. The other thing I would mention, and we've already talked about it, is governance and standardization. I would love to see more standardization applied to data, the same way we saw with the web and with APIs. We have a lot of different open source tools and a lot of different proprietary tools, but there isn't a standardization that allows me, for example, to run distributed SQL queries across a diverse set of datasets. The cloud providers are in a race to provide all sorts of wonderful data management capabilities on their platforms, and I hope that race leads to some form of standardization that allows distributed systems to work. And a lot of the technologies we see today, even around data discovery, are based on the assumption that data is operational data hidden in some database in a corner of the organization, not intended for sharing, and that we need to go find it and extract it. I think that's a flawed premise. We need to think about tooling for intentionally shared, diverse sets of datasets. There's a huge opportunity for toolmakers here, a big white space to build the next generation of tools that are not designed to fight bad data hidden somewhere, but designed for data that is intentionally shared and treated as an asset: discoverable, accessible, mergeable, and queryable, with distributed ownership. So those are the few final points to re-emphasize.
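Federated query engines such as Presto/Trino gesture at the kind of distributed SQL she is asking for. A hedged sketch of what that could look like, assuming a hypothetical Trino deployment where two independently owned data products are exposed as separate catalogs (all catalog, schema, and table names are invented), using the `trino` Python client:

```python
import trino  # Trino Python client; assumes a reachable coordinator

conn = trino.dbapi.connect(host="trino.example.com", port=8080, user="analyst")
cur = conn.cursor()

# One query spanning two catalogs, each backed by a different domain's
# independently owned data product.
cur.execute("""
    SELECT c.member_id,
           sum(c.amount)     AS total_paid,
           count(v.visit_id) AS visits
    FROM claims.events.settled c
    JOIN provider.events.visits v
      ON c.member_id = v.member_id  -- joinable thanks to the global member ID
    GROUP BY c.member_id
""")
for member_id, total_paid, visits in cur.fetchall():
    print(member_id, total_paid, visits)
```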
[01:01:50] Unknown:
Alright. And for anybody who wants to follow along with the work that you're doing or get in touch, I'll have you add your preferred contact information to the show notes. And we've talked a bit about it, but I'd also like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[01:02:08] Unknown:
Standardization. If I could wish for one thing, it would be a bit of convergence and standardization that still allows a polyglot world. You can still have a polyglot world, but I want to see something like the convergence that happened around Kubernetes in the infrastructure and operational world, or the standardization we had with HTTP and REST or gRPC, brought to the world of data, so that we can support a polyglot ecosystem.
So I'm looking for tools that are ecosystem players, that support a distributed and polyglot data world, not data that can be managed only because we put it in one database or one data store, or because it's owned and ruled by one particular tool. Open standardization around data is what I'm looking for. There are some small movements, although unfortunately they're not coming from the data world. For example, the work CNCF did off the back of serverless, treating events as one of the fundamental concepts and proposing CloudEvents as a standard way of describing them. But that's coming from left field again: from the operational world trying to play in an ecosystem, not from the data world. I hope we can get more of that from the data world.
[01:03:46] Unknown:
Well, thank you very much for taking the time today to join me and discuss the work that you've been doing and your thoughts on how we can make more scalable and maintainable data systems in our organizations and as an industry. It's definitely a lot to think about, and it mirrors a lot of my own thinking in terms of the operational characteristics. So it's been great to get your thoughts on the matter. Thank you for that, and I hope you enjoy the rest of your day.
[01:03:54] Unknown:
Thank you, Tobias, and thank you for having me. I've been a big fan of your work and what you're doing, getting information about this siloed world of data management out to everyone, and really making that information available, to me at least, since I became a data enthusiast. Thank you for having me. I'm happy to help. Have a good day.
Introduction to Zhamak Dehghani and Thoughtworks
Zhamak's Journey into Data Management
Challenges with Data Lakes
Organizational and Technical Challenges of Data Lakes
Parallels with Monolithic Software Architecture
Data Mesh: Organizational and Technical Structure
Parallels with Technical Operations
Balancing Discoverability and Consistency in Data Mesh
Adopting Data Mesh in Smaller Organizations
Architectural Patterns and Anti-Patterns in Data Management
Final Thoughts on Data Mesh and Data Management