Summary
Data mesh is a frequent topic of conversation in the data community, with many debates about how and when to employ this architectural pattern. The team at AgileLab have first-hand experience helping large enterprise organizations evaluate and implement their own data mesh strategies. In this episode Paolo Platter shares the lessons they have learned in that process, the Data Mesh Boost platform that they have built to reduce some of the boilerplate required to make it successful, and some of the considerations to make when deciding if a data mesh is the right choice for you.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
- Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.
- Prefect is the modern Dataflow Automation platform for the modern data stack, empowering data practitioners to build, run and monitor robust pipelines at scale. Guided by the principle that the orchestrator shouldn’t get in your way, Prefect is the only tool of its kind to offer the flexibility to write code as workflows. Prefect specializes in gluing together the disparate pieces of a pipeline, and integrating with modern distributed compute libraries to bring power where you need it, when you need it. Trusted by thousands of organizations and supported by over 20,000 community members, Prefect powers over 100MM business critical tasks a month. For more information on Prefect, visit dataengineeringpodcast.com/prefect.
- The only thing worse than having bad data is not knowing that you have it. With Bigeye’s data observability platform, if there is an issue with your data or data pipelines you’ll know right away and can get it fixed before the business is impacted. Bigeye lets data teams measure, improve, and communicate the quality of your data to company stakeholders. With complete API access, a user-friendly interface, and automated yet flexible alerting, you’ve got everything you need to establish and maintain trust in your data. Go to dataengineeringpodcast.com/bigeye today to sign up and start trusting your analyses.
- Your host is Tobias Macey and today I’m interviewing Paolo Platter about Agile Lab’s lessons learned through helping large enterprises establish their own data mesh
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you share your experiences working with data mesh implementations?
- What were the stated goals of project engagements that led to data mesh implementations?
- What are some examples of projects where you explored data mesh as an option and decided that it was a poor fit?
- What are some of the technical and process investments that are necessary to support a mesh strategy?
- When implementing a data mesh what are some of the common concerns/requirements for building and supporting data products?
- What is the general shape that a product will take in a mesh environment?
- What are the features that are necessary for a product to be an effective component in the mesh?
- What are some of the aspects of a data product that are unique to a given implementation?
- You built a platform for implementing data meshes. Can you describe the technical elements of that system?
- What were the primary goals that you were addressing when you decided to invest in building Data Mesh Boost?
- How does Data Mesh Boost help in the implementation of a data mesh?
- Code review is a common practice in construction and maintenance of software systems. How does that activity map to data systems/products?
- What are some of the challenges that you have encountered around CI/CD for data products?
- What are the persistent pain points involved in supporting pre-production validation of changes to data products?
- Beyond the initial work of building and deploying a data product there is the ongoing lifecycle management. How do you approach refactoring old data products to match updated practices/templates?
- What are some of the indicators that tell you when an organization is at a level of sophistication that can support a data mesh approach?
- What are the most interesting, innovative, or unexpected ways that you have seen Data Mesh Boost used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Data Mesh Boost?
- When is Data Mesh (Boost) the wrong choice?
- What do you have planned for the future of Data Mesh Boost?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
- AgileLab
- Spark
- Cloudera
- Zhamak Dehghani
- Data Mesh
- Data Fabric
- Data Virtualization
- CUE lang
- Data Mesh Boost
- Data Mesh Marketplace
- Sourcegraph
- OpenMetadata
- Egeria
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan's active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today, that's a t l a n, to learn more about how Atlan's active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork, and Unilever achieve extraordinary things with metadata. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends at Linode.
With their new managed database service, you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes with automated backups, 40 gigabit connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey. And today, I'm interviewing Paolo Platter about Agile Lab's lessons learned through helping large enterprises establish their own data mesh. So, Paolo, can you start by introducing yourself?
[00:01:38] Unknown:
I'm Paolo Platter. I'm the CTO and cofounder of Agile Lab. I've been in data management since 2013, and our mission as a company is to build products and projects that elevate the data engineering game for our customers. We believe that data engineering should include all the aspects of data management, like software engineering does, because in software we don't see documentation specialists or software stewards or anything like that. So when we talk about the data engineering practice, we talk about data management.
[00:02:16] Unknown:
Do you remember how you first got started working in data?
[00:02:19] Unknown:
At the beginning, when Alberto Firpo and I started Agile Lab, we searched for an innovative area where we could create an impact without fighting against a well established competitive landscape. So at the beginning, we focused on Apache Spark, because at that time distributed technologies were really tough and hard to learn. We started out as Spark experts, then we moved toward Cloudera and similar platforms. And in the end, we dropped the specific technologies and embraced the data management practice as a whole. It was quite a natural evolution, let's say.
[00:03:00] Unknown:
In some of your recent work, along with the rise of data mesh as a concept and a particular pattern that folks are trying to adopt, I know that you've had some experience with helping organizations through that journey. I'm wondering if you can just talk a bit about some of the experience that you've had working with some of these data mesh implementations and what the initial goal was that you were trying to achieve that ended up culminating in data mesh being the appropriate solution.
[00:03:27] Unknown:
We had the first opportunity to work on something similar to data mesh in 2019, because a customer was asked to create a data platform on top of a domain and microservice oriented architecture, respecting a set of principles that were coming from the operational plane. And after a while, we discovered Zhamak's storytelling about the data mesh, so we decided to fully embrace it. Now we are lucky enough to observe a data mesh implementation in a really advanced state, because it has been running for more than 2 years now. Along the way, we also had the opportunity to face a lot of technical and organizational challenges.
We also failed several times. So now we are bringing this kind of experience to several other customers, and currently we are driving 5 data mesh implementations around the world, all of them in large enterprises. As for the goals that lead to a data mesh implementation, I think the first is to improve the time to market for the creation of new data initiatives, together with the willingness to find a better trade off between governance needs and the autonomy of the business in creating new insights starting from the data of the company.
This is typically what leads to a data mesh implementation. On the other hand, when a company is trying to resolve data consumption problems, like self-service BI reporting or the creation of a unique data marketplace and so on, data mesh may not be the best fit for the case. Yes, data mesh brings benefits in the data consumption area, but I would say those are more side effects, because data mesh is focused on solving data production problems in terms of scaling, time to market, and so on. So if you don't want to challenge the data production processes in your organization, but you just want to create benefits for the data consumers, there are other patterns that may be more suitable, like data virtualization, data fabric, or others.
[00:05:56] Unknown:
In terms of the data mesh approach, as you mentioned, it's not always the right answer. And I'm curious if there are cases where you started to go down the path of exploring the possibilities of implementing a data mesh with a given organization and some of the cases where you decided not to continue with that approach and use 1 of the other options that you mentioned such as data virtualization or building a data fabric?
[00:06:19] Unknown:
1 of the cases was the creation of a data marketplace unifying multiple data platforms that represented the legacy data platforms of this organization. And digging into the problems that they wanted to solve, I discovered that they were aiming to solve a data consumption problem: where are the data currently, across all those data platforms? Can we create a unified catalog to let people understand where to search for data, and so on? So in that case, there was no willingness to improve the data production process. So we shifted everything towards a data fabric implementation, and it was meeting the goal.
[00:07:06] Unknown:
Right. For the cases where you do decide to go down the path of building out a data mesh, what are some of the building blocks that are necessary to have in place before you go down that path, and some of the technical and organizational/process capabilities that are necessary to invest in to be able to make that effort successful?
[00:07:31] Unknown:
From a technical standpoint, it's really important to create the right decoupling from the technology, because I strongly believe that data mesh is a technology agnostic paradigm. Also in Zhamak's storytelling, the utility plane is totally abstracting the underlying technology. So the first step in order to embrace the data mesh is the creation of this interoperability plane, and interoperability should act across multiple technologies, and it should act at the data level, but also at the process and metadata level. So we really need to decouple the practice and the processes from the technology.
From a process standpoint, instead, the biggest shift is about moving the data management responsibilities into the data product team. So you will have multiple data product teams taking care of the entire life cycle of the data, instead of relying on a layered team structure that may also be orchestrated by complex processes. This aspect forces you to rethink the entire delivery process, also putting in crisis a lot of tools that focus on the classical workflow to deliver data in a large organization. For example, all-in-one platforms for data management or for data governance are typically designed to allow a data governance team to take and keep control over the process of data delivery.
But instead, in the data mesh paradigm, we need to shift those parts of the process into the delivery team. So it will be the data product team that takes ownership of data quality, metadata, lineage, data contracts, and so on. And this must be part of the development life cycle, which I think is the big missing step also from a technological standpoint. Then from an organizational standpoint, there are a lot of challenges, also cultural challenges, such as data literacy and the organization of business units. Most of the large enterprises we talk with are still divided into business and IT, and they have a customer-supplier relationship that doesn't deal in the right way with data mesh principles.
[00:10:09] Unknown:
In that example of the larger organization still having that customer relationship between the IT teams and the business users, what are some of the social evolutions that you've had to help work through to support that kind of collaborative aspect of building these data products, and to build and maintain velocity in building them out, without having it get bogged down in bureaucracy or the interdepartmental issues that a lot of people have hoped we've moved beyond?
[00:10:49] Unknown:
Yes. This is probably the biggest challenge in the data mesh, because a lot of people start looking at data mesh for the benefits that it brings, also from a business standpoint, but then they don't get the organizational shift that is needed to cope with and successfully implement the data mesh. As a company, we can't act at the organizational level; it's the customer that should be able to reorganize itself. What we have seen is that a domain driven organization is really effective in data mesh scenarios. So basically, unifying IT and business under a certain domain hat.
This unifies the goals of the business and the IT. There's no more budget exchange between business and IT, but most of the organizations are not ready to do this. So as a company, as a system integrator, we try to help the IT department to create a safe environment to onboard the business, and also to better understand and gain the benefits coming from the data mesh principles. So we try to create a platform that provides facilities and enables data product teams to deliver the value that is promised by the data mesh in a very effective and efficient way.
[00:12:29] Unknown:
1 of the core elements of data mesh is the idea of a data product and that being the individual unit that comprises the separate nodes in that mesh. And as you're working with organizations to build out this capacity for producing data products, what are some of the common requirements and common capabilities that you found are necessary to be able to build out these data products and then integrate them throughout this mesh without having to do a bunch of bespoke development each time you want to bring a new product to market?
[00:13:07] Unknown:
At the beginning, the main concern is where do we start. So the blank page syndrome is the most common problem for data product teams, because it's clear that building data products is not what we did until yesterday. We are pushing a lot of duties and responsibilities onto the data product teams. They need to pay attention to the semantics, the quality, the documentation, and so on. In addition to this, a data product should be compliant with the interoperability standards, the security standards, and so on. So at the beginning, it is not clear for the data product team how to cope with all these requirements. What we do is to create templates through the platform. The platform team should create templates that help the data product team to onboard the technologies, practices, standards, and so on. So they don't start from a blank page. They can start by cloning those templates, which already embed best practices.
And this helps them a lot to develop the first data product, which is 1 of the goals that most organizations have in the initial phase: okay, let's try to build an MVP, let's try to build our first data product. But if the creation of the first data product requires 6 months, because first of all you need to create the automation, all the standards, all the best practices, and so on, the general feeling will not be so good. So our goal is to create platform facilities that help a company create its first data product in a very short time, while being compliant with all the duties that I mentioned before.
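To make the scaffolding idea concrete, here is a minimal sketch of how cloning a template might work. The template name, parameter names, and directory layout are hypothetical illustrations, not Data Mesh Boost's actual format.

```python
# Minimal scaffolding sketch: copy a template directory and fill in
# data-product-specific parameters. Template layout and parameter names
# are hypothetical, for illustration only.
from pathlib import Path
from string import Template

def scaffold_data_product(template_dir: str, target_dir: str, params: dict) -> None:
    """Copy every file in the template, substituting ${placeholders}."""
    src, dst = Path(template_dir), Path(target_dir)
    for file in src.rglob("*"):
        if file.is_file():
            rendered = Template(file.read_text()).safe_substitute(params)
            out = dst / file.relative_to(src)
            out.parent.mkdir(parents=True, exist_ok=True)
            out.write_text(rendered)

# Example: a new data product cloned from a (hypothetical) batch-output template.
scaffold_data_product(
    "templates/batch-data-product",
    "products/customer-orders",
    {"domain": "sales", "product_name": "customer-orders", "owner": "sales-data-team"},
)
```

Because the team owns the cloned copy, they can extend or modify any of the generated files, which is the flexibility discussed next.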
[00:14:59] Unknown:
As you are working with these organizations to define these individual data products and how they're going to be used throughout the organization, what are some of the bespoke requirements that have come up, where you maybe have a common core of how the product is built and deployed, but you need to add in special capabilities to account for a particular use case or a particular data format or a downstream integration that you're looking to work with? And what are some of the ways that you factor that into a common template while still allowing for this flexibility?
[00:15:36] Unknown:
Templates provide the right flexibility, because on one side they are capable of standardizing specific behaviors, contracts, technologies, and so on. But on the other side, once the data product team clones the template, it has full flexibility to modify the template, to extend it, and so on. This is quite different compared with what happens with frameworks. If you have a framework, then the framework is extensible only by the platform team. Instead, if you rely on templates, you can still provide the best practices and standards, but they remain open for full extensibility.
So the data product team can put in their business logic, but they can also modify a certain set of features and behaviors. Then what is really important is to regain control when it comes to deployment, because we need to ensure that all the data products respect a set of principles. We call them computational policies. So the templates are fully customizable, but then when it comes to deployment, the artifacts that the data product team produces will be verified against computational policies to check whether they are compliant with the required standards and so on.
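As a hedged illustration of that deploy-time verification, here is a minimal policy-as-code sketch. The descriptor structure and the two example policies are assumptions for the sake of the example, not the actual Data Mesh Boost implementation.

```python
# Hypothetical deploy-time check: validate a data product descriptor against
# a set of computational policies before allowing the deployment to proceed.
from typing import Callable

Descriptor = dict
Policy = Callable[[Descriptor], list[str]]  # each policy returns a list of violations

def owner_is_set(dp: Descriptor) -> list[str]:
    return [] if dp.get("owner") else ["data product must declare an owner"]

def output_ports_have_contracts(dp: Descriptor) -> list[str]:
    return [
        f"output port '{port.get('name')}' is missing a data contract"
        for port in dp.get("output_ports", [])
        if "contract" not in port
    ]

POLICIES: list[Policy] = [owner_is_set, output_ports_have_contracts]

def validate(dp: Descriptor) -> list[str]:
    violations: list[str] = []
    for policy in POLICIES:
        violations.extend(policy(dp))
    return violations

# Example descriptor for an imaginary data product.
descriptor = {
    "name": "customer-orders",
    "owner": "sales-data-team",
    "output_ports": [{"name": "orders_daily", "contract": {"schema": ["order_id", "amount"]}}],
}
problems = validate(descriptor)
print("deploy allowed" if not problems else problems)
```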
[00:17:06] Unknown:
Prefect is the data flow automation platform for the modern data stack, empowering data practitioners to build, run and monitor robust pipelines at scale. Guided by the principle that the orchestrator shouldn't get in your way, Prefect is the only tool of its kind to offer the flexibility to write workflows as code. Prefect specializes in gluing together the disparate pieces of a pipeline and integrating with modern distributed compute libraries to bring power where you need it, when you need it. Trusted by thousands of organizations and supported by over 20,000 community members, Prefect powers over 100,000,000 business critical tasks a month. For more information on Prefect, go to dataengineeringpodcast.com/prefect today.
That's prefect. As far as having those checks built in to ensure that a given product is meeting those compliance requirements, how do you factor that enforcement into the life cycle of building and iterating on these data products, and being able to factor that into maybe a continuous integration, continuous deployment scenario?
[00:18:15] Unknown:
The provisioning of data products is a really important topic, also because in the literature we say that a data product should be an independent unit of deployment. But the data product is a complex system. It's a composition of infrastructure, artifacts, APIs, data transformations, metadata, and so on. So the really important part is to create a guardrail, let's say, a guided workflow that all the data products should follow in order to deploy themselves. And in this deployment pipeline, we can use CI/CD pipelines or control planes or other instruments, and we can embed computational policies. Those computational policies will be automatically applied to all the data products going down the line.
We can apply computational policies at deploy time if we are able to apply them at a declarative level. So if we are able to describe the data product, then we are also able to inspect it and understand what we are deploying. And then we have, instead, computational policies that must be embedded in each data product and run at run time. Zhamak calls them local policies. For us, those are behaviors that we inject into all the data products, and we verify that all the data products include those elements and that they are running in production, checking the data product continuously.
Those local policies can refer to encryption, GDPR, data deletion, stuff that should happen directly on the data. Instead, if we are talking about the deploy time checks, it's really important to be able to describe in a standard and formalized way what a data product is and how it is composed, and then we are able to apply the policy-as-code pattern at deploy time. Typically, we rely on CUE lang as a standard way to define policies for a data product check.
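As a rough sketch of what an injected run-time "local policy" might look like, here is a hypothetical retention check that a platform could embed in every data product; the field names and the retention window are made up for illustration.

```python
# Hypothetical run-time local policy: a non-customizable check injected into
# a data product that continuously verifies a GDPR-style retention rule.
import datetime as dt

RETENTION_DAYS = 365  # assumed policy parameter

def retention_policy_check(records: list[dict]) -> bool:
    """Return True if no record is older than the allowed retention window."""
    cutoff = dt.datetime.now(dt.timezone.utc) - dt.timedelta(days=RETENTION_DAYS)
    stale = [r for r in records if r["created_at"] < cutoff]
    if stale:
        # In a real setup this would raise an alert or fail an observability probe.
        print(f"retention violation: {len(stale)} records older than {RETENTION_DAYS} days")
        return False
    return True

# Example run against a toy batch of records.
now = dt.datetime.now(dt.timezone.utc)
batch = [
    {"id": 1, "created_at": now - dt.timedelta(days=30)},
    {"id": 2, "created_at": now - dt.timedelta(days=400)},
]
retention_policy_check(batch)
```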
[00:20:38] Unknown:
In terms of the definition and application of these policies, there's also the question of who's responsible for establishing what those parameters need to be, and how do you make sure that the parameters meet the different needs of the various stakeholders? That kind of plays into the idea of code review, and I'm wondering what that looks like in this data context, where it's not just the software that you're dealing with reviewing. You're also dealing with these policies and the shape of the data and how you're manipulating it and how that plays into some of the downstream uses.
[00:21:13] Unknown:
Yes. The global policies in the data mesh are usually defined by the federated governance. Those policies also act as a safeguard for interoperability across data products, because in a world where we distribute the accountability of data production to different domains and different stakeholders, we really need to take care of interoperability, and we need to pay attention not to create breaking changes down the line. Because if a data product creates a breaking change, it could generate a snowball effect in terms of change management on other domains as well. So code reviews and data reviews will become more and more important for this reason, because the data product team is becoming accountable for defining a contract and protecting that contract.
The semantics that you use to describe fields, the normalization that you apply to your data, the tests that guarantee the quality of the data: all those elements are really becoming important in order not to generate impacts on downstream data products. So I think that data review, as I call it, should become a strong practice in data product teams, because you are really responsible for not generating impact on other teams. Like when you write software, you don't want to break interfaces that might generate impact on other modules and so on.
[00:23:06] Unknown:
The downstream impact also plays into the question of how elements like data quality, lineage, and service level agreements factor into a successful mesh implementation and some of the, maybe, technological solutions that you rely on to support them.
[00:23:15] Unknown:
Those elements are becoming a duty of the data product team. So each data product team should be accountable for implementing those elements, but also for providing transparency about them. I need to demonstrate to the other data product teams that I'm implementing data quality in the right way, that I'm exposing the lineage, that I am accountable for the service level agreements that I expose, because maybe we have a use case that spans multiple data products, and we have a global cutoff in terms of service level agreements, but each data product is playing its role in this distributed pipeline.
For us, the key is observability. Observability is 1 of the standards that a company should define in the data mesh. So when we talk about defining standards, an observability API is definitely 1 of them. There is no global standard in the industry for an observability API at the moment, but every company should be able to define one. What is really important is that the observability information is used to orchestrate the overall process. So each consumer, instead of relying on a global schedule, a global pipeline, and a global scheduling chain, should decide when it is the moment to consume other data products based on the observability information.
So: if the data has been refreshed, if the quality is in good shape, and so on. This is the pattern that we see as very important to manage this distributed orchestration of pipelines and data transformations.
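A minimal sketch of that consumption pattern, assuming a hypothetical observability endpoint and payload shape (the URL, field names, and thresholds are not from the source), could look like this:

```python
# Hypothetical consumer-side check: poll an upstream data product's
# observability API and only consume when freshness and quality look good.
import datetime as dt
import requests

def ready_to_consume(observability_url: str, max_staleness_hours: int = 24,
                     min_quality_score: float = 0.95) -> bool:
    info = requests.get(observability_url, timeout=10).json()
    # "last_refreshed_at" is assumed to be an ISO-8601 timestamp with timezone.
    last_refresh = dt.datetime.fromisoformat(info["last_refreshed_at"])
    fresh = dt.datetime.now(dt.timezone.utc) - last_refresh < dt.timedelta(hours=max_staleness_hours)
    good_quality = info["quality_score"] >= min_quality_score
    return fresh and good_quality

# Each downstream data product decides for itself, instead of following a global schedule.
if ready_to_consume("https://example.internal/data-products/customer-orders/observability"):
    print("consume the upstream output port")
else:
    print("skip this cycle and retry later")
```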
[00:25:08] Unknown:
The lessons that you've learned and some of the work that you've done in building these solutions have led you to compile some of that information and some of those experiences into a product that you're building to help simplify this overall process. I'm wondering if you can just talk through some of the technical elements of the Data Mesh Boost system that you're offering.
[00:25:30] Unknown:
We are building a platform to help companies implement a proper data mesh in a healthy and sustainable way. So first of all, Data Mesh Boost is not coupled to any kind of technology aiming to process, store, or serve the data, and it is not operating in that area. We consider Data Mesh Boost a framework to implement a data mesh with the technology and the standards that you prefer. So it's highly customizable; you can inject whatever technology and standard you want to adopt. And we consider it like, you know, the internal software development platforms. We call it an internal data engineering platform, and it is entirely designed with extensibility in mind, because we really believe that each company will have its own standards and its own technology to implement a data mesh. The main technical elements are what we call the data product builder, which is a powerful template and scaffolding mechanism that allows the platform team to spread best practices and onboard new technologies, respecting all the interoperability constraints and so on. And on the other side, the data product team can leverage those templates as a starter kit to implement the data products.
Then we have the data product provisioner, which is a component aiming to deploy a data product as a single unit of deployment. This component applies computational policies and then orchestrates the process to create all the components that are part of the data product, taking care of dependencies. So if you need to create storage, a table, and so on, it manages the dependencies between them. It's capable of creating a dynamic execution plan, and it acts more or less like a control plane rather than a CI/CD tool. And finally, we have the data product marketplace, where all the information going through the deployment step flows into a marketplace that is really specific to data mesh. So it's not a data catalog.
We typically create a higher level abstraction around such information, and then we integrate with data catalogs for metadata management. In this marketplace, it's possible to discover and observe the data products. It's possible to integrate custom observability information, and it's possible to plug in change management mechanisms. For example, if a data product is going to be deprecated, as a downstream data product owner, I want to be aware of this. I want to be notified that a data product will be deprecated. So we take care of the interaction between data product owners and consumers.
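To illustrate the "dynamic execution plan" idea, here is a hedged sketch of how a provisioner might order the components of a data product by their declared dependencies; the component names and the shape of the dependency map are hypothetical.

```python
# Hypothetical provisioning plan: topologically sort a data product's
# components by their declared dependencies before creating them.
from graphlib import TopologicalSorter

# Each key is a component; its value lists the components it depends on.
components = {
    "storage_bucket": [],
    "orders_table": ["storage_bucket"],
    "ingest_job": ["orders_table"],
    "output_port_api": ["orders_table"],
}

plan = list(TopologicalSorter(components).static_order())
print("execution plan:", plan)
# e.g. ['storage_bucket', 'orders_table', 'ingest_job', 'output_port_api']
```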
[00:28:37] Unknown:
In your work of helping different organizations build the data mesh, I'm wondering what were the motivating factors for deciding to invest in building this as a product to help other organizations with the same processes, and some of the design questions and design elements about how to go about implementing this as a product suite that would be usable and flexible without getting into the space of having it be overly flexible, where you might as well just build it from scratch again every time?
[00:29:07] Unknown:
What we observed is that in order to create real data products with proper governance, automation, and interoperability, and then demonstrate the return on investment, you first need to create platform facilities to enable such capabilities. The effectiveness of data products in a company will be evaluated in terms of time to market, usability, and sustainability, and without a platform this is really hard to demonstrate. If you try to build a data product with all the features in place but without a supporting platform, it will result in a huge amount of stuff to do, and this is not scalable.
Our first goal is to drastically reduce the time that is needed to demonstrate the effectiveness of data mesh principles. Because in a large organization, in order to start a data mesh initiative, you need broad consensus, and not all the people have the capabilities and the skills to understand the vision, specifically in the business part of the company. So you need to provide some kind of evidence about the business benefits that data mesh will bring, so autonomy, time to market, and so on. But you also need to demonstrate that, from a practice standpoint, this is sustainable. So the first goal for us is to reduce the time needed to demonstrate the effectiveness of data mesh.
And the second goal is to build the right guardrails at the process and practice level to make this sustainable over time. So we really need to create a new way of working to allow multiple teams to create value starting from data, in a technology agnostic way where the practice comes before the technology. This gap is what convinced us to build the platform.
[00:31:16] Unknown:
Bigeye is an industry leading data observability platform that gives data engineering and data science teams the tools they need to ensure their data is always fresh, accurate, and reliable. Companies like Instacart, Clubhouse, and Udacity use Bigeye's automated data quality monitoring, ML powered anomaly detection, and granular root cause analysis to proactively detect and resolve issues before they impact the business. Go to dataengineeringpodcast.com/bigeye today to learn more and keep an eye on your data. As far as the pain points that you and the organizations you've worked with run into when building out these data mesh implementations, what are some of the common challenges that you've run across, and some of the ways that your Data Mesh Boost product is aiming to address those challenges?
[00:32:09] Unknown:
Data Mesh Boost provides an end to end data engineering practice. We call it decentralized data engineering. It helps companies adopt data mesh in the following areas. The first 1 is providing a standard way to create templates that are usable across the entire organization, and this speeds up the time to market to build the data products. Those data products automatically embed standards, best practices, and automatic provisioning, because all these parts are part of the template itself. The second area where Data Mesh Boost helps companies in adopting data mesh is computational governance, because it provides a very easy way to define computational policies and guarantees that they will be applied to all the data products that you're going to deploy in your environment.
And finally, it creates a place where data product teams can collaborate and interoperate effectively. This is something that is really in high demand when your data mesh is scaling, because at the beginning the focus is on data product creation, but after a while change management will be a really big problem, because you need to boost the communication between data product teams and data consumers, since they no longer have a master plan to work with. This decentralized data engineering practice helps you build a practice that values automation, technology independence, and team autonomy, and it also helps you embed the best practices in your data teams with almost zero effort.
All those elements are key elements for a successful data mesh implementation.
[00:34:12] Unknown:
1 of the interesting things that's always worth digging into is cases where you have gone through the process of working with an organization, determining that data mesh is a viable option for solving the problems that they're experiencing, and getting that into production, and then cases where, after it's in production, it either fails after a period of time or the organization doesn't invest in the upkeep. What are some of the ways that those projects can start to go off the rails after they've already been put into production?
[00:34:47] Unknown:
Yeah. There are 2 different phases, in my opinion. The first 1 is how we can deal with changes, when we need to apply changes to the data products and we need to evaluate the impacts of those changes. The biggest pain point around data product change management is breaking changes. What we always suggest is to have a data contract first approach, where it's super clear which changes are allowed because they don't break the consumers, and which ones are not allowed. A data product team must be accountable for reducing breaking changes and providing backwards compatibility.
And, typically, there are huge discussions about what a breaking change is. Do we consider a change in the data pipeline a breaking change, or do we consider only schema changes, output schema changes, as breaking changes? It's really hard to say. What we can say is that if you are able to code what a breaking change is, then you are able to apply a very good computational policy. In our framework, we have an extension point where a company can define, through code, the rules to automatically identify a breaking change. This brings to the picture that you need to adopt data contracts.
You need to define in detail which changes are allowed for a specific data contract, and you must be able to code them in a computational policy. This removes 90% of the discussions.
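As a hedged sketch of coding "what a breaking change is", here is a hypothetical check that compares the previous and proposed output schemas of a data contract. The rules shown (dropped fields and type changes are breaking, new fields are allowed) are just examples of what a team might codify, not a prescribed standard.

```python
# Hypothetical breaking-change policy: compare the old and new output schemas
# of a data contract. Removing a field or changing its type is flagged as
# breaking; adding a new field is allowed.
def find_breaking_changes(old_schema: dict[str, str], new_schema: dict[str, str]) -> list[str]:
    breaking = []
    for field, old_type in old_schema.items():
        if field not in new_schema:
            breaking.append(f"field '{field}' was removed")
        elif new_schema[field] != old_type:
            breaking.append(f"field '{field}' changed type {old_type} -> {new_schema[field]}")
    return breaking

old = {"order_id": "string", "amount": "decimal", "created_at": "timestamp"}
new = {"order_id": "string", "amount": "float", "customer_id": "string"}

issues = find_breaking_changes(old, new)
print(issues or "no breaking changes, deploy can proceed")
```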
[00:36:29] Unknown:
Another interesting element of the life cycle management question is that when you have an initial approach to building out a data product, it will be useful up to a certain point in time, and then you discover, okay, this is a useful new attribute or a new capability or an updated pattern that we want to use. So you get a new product out into usage with that new pattern or policy or approach. And then there's the question of how you go through the process of codifying that as a template for future products, but also backfilling it into existing environments without breaking them and while maintaining compatibility with their consumers. What does that overall life cycle management approach look like as you start to build out multiple products that are being used by different consumers, and how do you manage consistency across that product suite?
[00:37:20] Unknown:
There are typically 2 different cases where an already deployed data product should be reworked. The first case is when you add a computational policy or you change a computational policy. And the second 1 is when you change a template because you want to evolve a best practice or a contract or something like that. Our vision on this is that the platform should automatically detect these 2 conditions and notify the data product owners that a policy has changed, or that a new template is available, and so on. Then the data product team is fully accountable for reworking the data product according to those new standards. But the platform, in general, should provide some facilities to do this.
Because we leverage the template pattern, not a framework oriented pattern, what is possible to do is to create facilities that analyze the code of the data product, compare it with the new computational policy or the new template, and suggest to the data product team which changes are needed. There are tools that support this, like Sourcegraph, which are capable of analyzing multiple repositories and providing a batch of changes in an automatic way. There is always human intervention, but anyway, we consider the data product team accountable for reworking their products.
[00:38:58] Unknown:
In your experience of working with organizations and building these data mesh implementations, and of building the Data Mesh Boost product and figuring out how to make it usable and accessible for the different clients that you're working with, what are some of the most interesting or innovative or unexpected approaches that you've seen, either in building a data mesh from whole cloth or in ways that you've seen your Data Mesh Boost product used?
[00:39:24] Unknown:
What we are learning is that data mesh principles are transforming the data engineering practices in several unexpected ways, and many trending topics currently running in the data engineering landscape are strongly connected with the data mesh principles. For example, metadata interoperability and the rise of open standards, like OpenMetadata and Egeria, are quite key elements to provide self-service provisioning capabilities and to bring the metadata into the development life cycle. Because, as I said, if we want to shift all those duties onto the data product team, we need to shift the process into the development life cycle as well. Another important shift that we are observing in the data engineering area is the data contract first approach.
The data contract first approach is strongly needed in the data mesh, because we need to segregate the responsibilities of domains and we need to boost their observability capabilities. And the data contract first approach is also creating a demand for declarative data transformations, because we need to provide transparency to the consumers about what transformations we are applying. We need to be able to work in a cross functional team with domain experts and data engineers so that they are able to understand each other. So this is another need that is created by the data mesh itself.
Following these reasons, I have seen 2 of our customers using Data Mesh Boost only for data engineering purposes, not specifically for the data mesh. They don't have a data mesh initiative in place, but basically their aim is to create a solid data engineering practice, allowing process decentralization and computational governance. And this kind of initiative can be driven by the IT and data departments alone. It could be a first milestone before moving towards the data mesh. So first of all, you set up a very good data engineering practice with all these aspects in place.
At that stage, it will be easier to convince business stakeholders about the overall value that you can bring with data mesh, because you already have concrete facilities and benefits that make the road towards the data mesh easier.
[00:42:05] Unknown:
We've talked a little bit about cases where building a data mesh isn't the right choice. I'm curious, for people who have gone through that process of evaluating options and decided that data mesh is a viable approach for the problems that they're trying to solve, when is the Data Mesh Boost product that you're building the wrong choice for helping them actually go through that implementation?
[00:42:27] Unknown:
Yeah. Data mesh, for sure, is the wrong choice for all the companies that are not feeling the pain of scaling and improving their data production practices. There are a lot of other technological and methodological solutions, like data warehousing and data lakes, that can fix problems around metadata management, data consumption, or a lack of data engineering skills, for example with the rise of low code, no code tools. So I see the data mesh practice as a solution for very large enterprise companies that have problems with scaling their data practices.
And Data Mesh Boost, as well, is the right choice to set up a practice over technology mindset in a company, because it's tech agnostic, but it is definitely not the right fit for companies that are instead looking for an all-in-one data platform. So if you are looking to create 1 unified, technology driven platform, then Data Mesh Boost is definitely not going to help you.
[00:43:38] Unknown:
And so as you continue to work on the Data Mesh Boost product and help your customers through their own journeys of implementing Data Mesh, what are some of the things you have planned for the near to medium term or any particular projects or areas of interest that you're excited to dig into?
[00:43:55] Unknown:
We are investing heavily to create more value for our customers, creating a new way to structure a data management practice as a whole, composing reusable components, and solving tough problems. There are 2 areas where we are investing a lot. The first is how to understand if 2 data products are too similar, because when you distribute the ownership of data production, you could end up with a kind of jungle. So we want to provide some guidance to the data product owners not to duplicate data products too much. And the second area of investment is about local policies, local computational policies.
We would like to find a way to have transparent local policies. Currently, basically, we help data product teams to embed those behaviors as additional templates that are not customizable and must be present inside the data product, but this is not enough. So we are working to create some transparent mechanisms and protocols to apply computational policies about encryption and other topics in a totally transparent way, like you do with sidecars in a Kubernetes environment. They are totally transparent to the application and business logic, acting under the hood. In the data engineering area, there's no such kind of mechanism at the moment. So it's a technological challenge, let's say.
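As a very rough sketch of the "transparent" idea, analogous to a Kubernetes sidecar but expressed here as a simple in-process wrapper (an illustration under assumed names, not the mechanism being built):

```python
# Hypothetical "transparent" local policy: a platform-applied decorator that
# enforces an encryption rule around the data product's own write logic,
# without the business logic being aware of it.
import functools

def enforce_encryption(write_fn):
    @functools.wraps(write_fn)
    def wrapper(records, *args, **kwargs):
        # Platform-injected check: refuse to write sensitive records in clear text.
        for record in records:
            if record.get("sensitive") and not record.get("encrypted"):
                raise ValueError("local policy violation: sensitive record is not encrypted")
        return write_fn(records, *args, **kwargs)
    return wrapper

@enforce_encryption  # applied by the platform, transparent to the team owning write_output
def write_output(records):
    print(f"writing {len(records)} records")

write_output([{"id": 1, "sensitive": True, "encrypted": True}])
```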
[00:45:36] Unknown:
Are there any other aspects of the overall implementation of data mesh, your experiences working with organizations through that journey, and the work that you're doing on Data Mesh Boost that we didn't discuss yet that you'd like to cover before we close out the show?
[00:45:49] Unknown:
I see a huge gap in the market: all those tools and platforms that are not considering data management as a software development practice. For this reason, those platforms and tools are not integrated with the software development life cycle. This creates a huge impedance, because if on 1 side we want to bring all those accountabilities into the data product team, but then we are not able to provide them tools to do that in the normal development life cycle, we have a huge roadblock. And the other gap that I see is around the focus on interoperability. Most of the tools and technologies in the data management landscape are trying to solve the whole spectrum of problems for the data teams. Those platforms are really coupled with the current standard way of working, but they do not allow you to create and adopt new practices.
They do not allow you to integrate new technologies and new methodologies. You are very, very locked in by those platforms. So what I would like to see is a future where composability and interoperability are key elements also for technology providers.
[00:47:13] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you and your team are doing, I'll have you add your preferred contact information to the show notes. And I appreciate you taking the time today to join me and share your experiences of helping companies with their data mesh implementations and some of the ways that you have found to factor those lessons into a platform.
[00:47:50] Unknown:
Thank you for listening. Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Guest Introduction
Paolo Platter's Journey in Data Management
First Experiences with Data Mesh
Building Blocks for Data Mesh
Organizational Challenges in Data Mesh
Data Product Requirements and Templates
Ensuring Compliance and Policies
Technical Elements of Data Mesh Boost
Common Challenges in Data Mesh Implementation
Innovative Approaches and Use Cases
Future Plans for Data Mesh Boost