Summary
Data platforms are characterized by a complex web of connections that are subject to constantly evolving requirements. In order to make this a tractable problem it is necessary to define boundaries for communication between concerns, which brings with it the need to establish interface contracts for communicating across those boundaries. The recent move toward the data mesh as a formalized architecture that builds on this design provides the language that data teams need to make this a more organized effort. In this episode Abhi Sivasailam shares his experience designing and implementing a data mesh solution with his team at Flexport, and the importance of defining and enforcing data contracts that are implemented at those domain boundaries.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
- StreamSets DataOps Platform is the world’s first single platform for building smart data pipelines across hybrid and multi-cloud architectures. Build, run, monitor and manage data pipelines confidently with an end-to-end data integration platform that’s built for constant change. Amp up your productivity with an easy-to-navigate interface and 100s of pre-built connectors. And, get pipelines and new hires up and running quickly with powerful, reusable components that work across batch and streaming. Once you’re up and running, your smart data pipelines are resilient to data drift. Those ongoing and unexpected changes in schema, semantics, and infrastructure. Finally, one single pane of glass for operating and monitoring all your data pipelines. The full transparency and control you desire for your data operations. Get started building pipelines in minutes for free at dataengineeringpodcast.com/streamsets. The first 10 listeners of the podcast that subscribe to StreamSets’ Professional Tier, receive 2 months free after their first month.
- Your host is Tobias Macey and today I’m interviewing Abhi Sivasailam about the different social and technical interfaces available for defining and enforcing data contracts
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by explaining what your working definition of a "data contract" is?
- What are the goals and purpose of these contracts?
- What are the locations and methods of defining a data contract?
- What kind of information needs to be encoded in a contract definition?
- How do you manage enforcement of contracts?
- manifestations of contracts in data mesh implementation
- ergonomics (technical and social) of data contracts and how to prevent them from inhibiting productivity
- What are the most interesting, innovative, or unexpected approaches to data contracts that you have seen?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on data contract implementation?
- When are data contracts the wrong choice?
Contact Info
- @_abhisivasailam on Twitter
- Website
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
- Flexport
- Debezium
- Data Mesh At Flexport Presentation
- Data Mesh
- Column Names As Contracts podcast episode with Emily Riederer
- dbtplyr
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Have you ever woken up to a crisis because a number on a dashboard is broken and no one knows why? Or sent out frustrating Slack messages trying to find the right dataset? Or tried to understand what a column name means? Our friends at Atlan started out as a data team themselves and faced all this collaboration chaos. They started building Atlan as an internal tool for themselves. Atlan is a collaborative workspace for data driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets and code, Atlan enables teams to create a single source of truth for all of their data assets and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker, and more.
Go to dataengineeringpodcast.com/atlan today. That's a t l a n, and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $3,000 on an annual subscription. When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's l i n o d e, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey. And today, I'm interviewing Abhi Sivasailam about the different social and technical interfaces available for defining and enforcing data contracts. So, Abhi, can you start by introducing yourself?
[00:02:07] Unknown:
My name is Abhi. I head growth and analytics at Flexport.
[00:02:12] Unknown:
And do you remember how you first started working in the area of data?
[00:02:15] Unknown:
Yeah. It's been a winding road. So I started my career in the policy world doing policy analysis, doing polling, really more on the statistical side, went into management consulting after that, focusing on really Fortune 50 predictive marketing analytics, and then found myself in startup land as a data scientist. And that's where the through line became clear of the work I've done so far, which is that most of the value I was creating, most of the value there even was to create with respect to data, was really in the structuring of data. And so today, I still am involved in, you know, many data functions across the kind of functional landscape, the data science, the data analytics, etcetera, and also in other business functions like growth. But what I'm most passionate about is data engineering and data management as the common through line.
[00:03:00] Unknown:
And so as I mentioned at the opening, the topic that we're covering today is some of your experience of working with the idea of data contracts. And before we get too far into that, I'm wondering if you can just give your working definition of what a data contract is and some of the goals and purpose of defining and enforcing contracts?
[00:03:22] Unknown:
So for me, data contracts are guarantees on 2 things. They're guarantees on correctness, number 1, and accuracy, number 2. So what are those? Correctness is really, is the data structured as expected? That means in a way that's consistent and also, in some sense, right. Is it an appropriate, is it a usable, is it a useful representation of the data? Accuracy then is really 2 overlapping considerations. Number 1, quality. Does the data reflect reality? Is it ontologically correct? Is the data generating process stable? And 2, semantics. When you say customer, does that still mean what you said it meant last week, you know, a year ago, a month ago, and does it mean what I think it means? So data contracts are really a way of trying to enforce guarantees around this notion of correctness and accuracy.
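To make those two guarantees concrete, here is a minimal sketch of what a contract check along those lines could look like, separating structural (correctness) checks from semantic (accuracy) rules. The record shape, field names, and rules are illustrative assumptions, not anything from Flexport's implementation.

```python
from datetime import date

# Hypothetical contract for a "customer" record. Field names, types, and
# semantic rules are illustrative assumptions, not an actual Flexport schema.
CUSTOMER_CONTRACT = {
    "fields": {"customer_id": str, "signed_up_at": date, "lifetime_value_usd": float},
    "rules": [
        lambda r: r["lifetime_value_usd"] >= 0,        # quality: plausible values
        lambda r: r["signed_up_at"] <= date.today(),   # quality: no future signups
    ],
}

def check_contract(record: dict, contract: dict) -> list:
    """Return violations of correctness (structure) and accuracy (semantic rules)."""
    violations = []
    for name, expected_type in contract["fields"].items():
        if name not in record:
            violations.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            violations.append(f"wrong type for {name}: {type(record[name]).__name__}")
    if not violations:  # only run semantic rules on structurally valid records
        for i, rule in enumerate(contract["rules"]):
            if not rule(record):
                violations.append(f"semantic rule {i} failed")
    return violations

print(check_contract(
    {"customer_id": "c-1", "signed_up_at": date(2021, 6, 1), "lifetime_value_usd": -10.0},
    CUSTOMER_CONTRACT,
))  # -> ['semantic rule 0 failed']
```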
[00:04:08] Unknown:
And so in terms of the sort of contracts and the idea of being able to enforce semantics and correctness, there's a lot of sort of gray area there, because the semantics and correctness are, in some cases, subjective, particularly in the business semantics of, you know, what does a conversion mean? Is it just when somebody fills out the form? Is it when they fill out the form and give us a credit card number? Is it when they have been a customer for 6 months? And so as you're going through the exercise of deciding what are useful contracts to define, sort of how do you deal with some of the fuzziness that comes in when you're dealing with these business level and semantic aspects of the types of
[00:04:59] Unknown:
sort of validation that you want to do on the information that you're working with. You know, before we get there, maybe it's helpful to take a step back and talk a little bit more about the failure case here and, you know, why we do contracts and why contracts don't happen by default. So what's the purpose here? The purpose of data contracts is to make explicit what's implicit. The classical flow of data in most companies, you know, kinda reductively looks something like this. Number 1, an engineering team or some kind of data producer, they produce some new data or they change some old data as part of some, you know, product development process. Number 2, that data then lands in some kind of a lake, some kind of a, you know, central clearing house.
In that number 2, the best case scenario is that, great, all the data that was produced upstream, it lands in that lake. The worst case scenario is that it doesn't, that some of that new data that's produced or changed doesn't even land and no one knows about it. Once it lands or doesn't land in the data lake, step 3 is basically that data is then processed somehow into data models, dashboards, you know, what have you, these data artifacts. And there, the best case scenario is, great. No breaking changes. Everything is good. Things, you know, flow seamlessly in with everything that we had before. The worst case scenario is there are breaking changes. There are breaking changes, and things that are being produced now are not compatible with things that were being produced before, and downstream artifacts are gonna break. In practice, most data organizations, at least the ones that I've worked with, spend a lot of time dealing with the worst case scenario of both step 2 and step 3. And this happens because in most companies, data is governed by implicit contracts. Data producers, which is generally domain engineering teams, simply don't know how data is used downstream, and so they can't possibly be expected to prevent those breaking changes. Certainly not in a way that wouldn't totally cripple their ability to define their own product, to evolve their own product in isolation.
And, you know, at Flexport, the company I work at, you know, we're a very operationally driven company. And what that means for us is we have, like many operationally driven companies, thousands and thousands of dashboards that are what I would call operational analytics that actually govern, hey, I am an operator, here's what I need to do next. That essentially represents, in a real sense, shadow product that is built on top of these implicit contracts we're talking about. What that means and what that meant for us at Flexport is product teams upstream will evolve their products as they should. They'll evolve their representations of data as they should. And because of these implicit contracts, because they don't have the context downstream, that leads to breaking changes. That leads to silent failures. That leads to highly brittle logic, thousands of lines of case when statements, etcetera, to patch together implicit versions of these implicit contracts over time. So that's really kind of the animating purpose here, which I think is helpful to ground when we think about, like, what are we covering when we talk about data contracts? Well, it's not actually just, you know, maybe the semantic entities that would govern, let's say, metrics. In many companies, in many contexts, in many even application domains, the notion of implicit contracts actually breaks production applications or production processes, which is, you know, a deeper problem yet. So with that, maybe I'll jump into a little bit of how we think about defining data contracts. So I'll talk a little bit about the conceptual, and then I'll go into maybe kinda technical implementation details.
So starting on the conceptual, it's very important when you think about data contracts to take lessons from the microservices world. Now the minute I say microservices, I've immediately lost half your audience. But it's important to recognize that the minute you have a data system of any kind, you immediately live in a multiservice world whether you like it or not. The minute you have some kind of data warehouse, some kind of lake, some kind of lakehouse, and you have a production system, even if it's just 1 production system, you immediately live in a multiservice world. And when you live in a multiservice world, you have good reason now to look at some of the tenets of what multiservice systems look like. And 1 of the key tenets of multiservice systems, and microservices in particular, is that you don't couple to the implementation details of another service, and you don't force other services to couple to your implementation details. This is really important. If, in microservices, you force every service to couple to the database representation that another service uses, services can never evolve their databases, right, which, of course, is absolutely crippling to product development.
The very first thing to recognize here when you think about contracts, how we think about contracts, is to carry through on that service oriented and that microservices tenet here, where we do not want to couple to those implementation details. Now how do we approach that in a microservices world? The primary way is through a layer of indirection, where services will have a layer of indirection between their internal representation of data, right, how I structure data in a database that I own, versus my external representation of data, how I expose data to, you know, the outside world. In SOA and microservices, typically, this distinction is referred to as essentially a persistence model, how I represent data locally, versus a domain model, how I reason about my data, how other systems can reason about my data.
And in practice, the persistence model can change often, will change often. Right? I am creating a new feature. I'm gonna blow up my persistence model. I'm gonna shard it, denormalize it or normalize it, or do whatever. I'm gonna create caching layers. A persistence model will change. I might use a graph database today and a DocumentDB tomorrow. My domain model should be relatively stable, should be more stable, and should represent those more abstract, stable business entities, those business realities. So the very first step here to defining data contracts, the very first step here for me, is to define what is your domain model. Domain by domain, service by service, application boundary by application boundary, what is that domain model?
And in doing that, what I generally recommend and what we're following at Flexport is a domain driven design paradigm. That's an approach that I think lends itself well to the data community. So what is domain driven design? You know, that's a topic in itself. The long and short of it, though, is for folks to consider this notion of aggregates, entities, and value objects. So the example I often give is that of a billing line item on a bill. A billing line item, as a kind of a data concept, as a data entity, that is an entity. It's a thing, a billing line item. We would model that. We would reason about that as a thing, and that thing has properties. Those properties might be the dollar value. Those properties might be, you know, the actual item name. Those properties are what we refer to as value objects. Now great. So we have a billing line item. That's an entity. We have these properties. Those are value objects.
The problem with looking at billing line items, though, is that billing line items don't make a lot of sense completely on their own. We can reason about them as a discrete thing in the world, but we would never look at just a line item in isolation. We would always look at it as part of the bill. And so enter the notion of an aggregate, which is that billing line items are parts of bills. And so when something happens to a billing line item, really, what we care about is what happens to the bill overall. So this little digression into domain driven design, into aggregates, entities, and value objects is important, especially for us, because the way that we think about contracts is that 90% of the time, contracts should be based on the domain model, which means you need a domain model first. 90% of the time, the contract that you have with other services, with data systems, is based on the domain model. And, typically, the interface between you and another system is that aggregate. Right? When something happens to a value object, you basically expose the aggregate in total, and that's how systems can communicate with each other. There is a 10% kinda long tail as well where, even if you have a reasonable spanning set of these aggregates, you also sometimes need to, you know, throw in extra business logic. But either way, that 90 and 10 together, that's the contract. Right? That's kinda how we define a contract. The what of that contract then is everything about the structure of that aggregate, that entity, and those value objects. So, you know, what are the kind of fields here associated with that aggregate and entity? What are those value objects? What is the semantic meaning of those value objects? What is the data type of those value objects as we send them?
If our domain model has evolved, what is the version number? Is this the 5th time we've evolved this notion of our domain model? Is it the second time? And what does that prior history look like? You know, what did this aggregate look like last month, you know, 2 months before? So on the conceptual side, you know, and we can get into technical details, but on the conceptual side, this kinda long exposition is really, you know, contracts are not as simple as, oh, yeah, we need a stable reference. Contracts need a philosophy behind them for us. And that philosophy and that framework, we borrow from the service oriented world, in really thinking about how can we stably represent this notion of a domain model, decouple that from the persistence model, and then really bind our contracts to that domain model representation.
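A minimal sketch of how the bill example above might be expressed in code, with the aggregate, entity, and value object distinction called out. The class and field names are assumptions drawn from the example, not Flexport's actual domain model.

```python
from dataclasses import dataclass, field
from decimal import Decimal

@dataclass(frozen=True)
class Money:                      # value object: defined entirely by its value, no identity
    amount: Decimal
    currency: str = "USD"

@dataclass
class BillingLineItem:            # entity: has identity and properties (value objects)
    line_item_id: str
    item_name: str
    amount: Money

@dataclass
class Bill:                       # aggregate root: the unit other systems reason about
    bill_id: str
    customer_id: str
    line_items: list = field(default_factory=list)
    version: int = 1              # domain model / contract version exposed to consumers

    def total(self) -> Money:
        return Money(sum((li.amount.amount for li in self.line_items), Decimal("0")))

    def add_line_item(self, item: BillingLineItem) -> None:
        # A change to a line item is exposed to the outside world as an update
        # to the whole Bill aggregate, not as a change to the line item alone.
        self.line_items.append(item)

bill = Bill(bill_id="b-123", customer_id="c-456")
bill.add_line_item(BillingLineItem("li-1", "Ocean freight", Money(Decimal("1999.00"))))
print(bill.total())
```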
[00:14:21] Unknown:
Yeah. The point about versioning is definitely very interesting because in the data world, most people think about that in terms of the schema of the parquet file or the database table, etcetera. But to your point of the microservices analogy, there is more than just the schema information that you need to encode, and so you need to be able to understand what are all of the details that I'm committing to when I speak to this interface that is defined by this contract definition, and what are the things that I need to be able to treat as invariants in this communication. And so I'm interested in digging a bit more into the kind of types of information that you need to be able to encode in these contracts and some of the ways that that might manifest at both the technical and social level?
[00:15:15] Unknown:
Let's maybe start with the technical, and then we can go to the social. The question is, we have this conceptual model, you know, you have these aggregates, entities, and value objects. Now how do you get that into a data system? So 1 of the trends that I think we'll see more of in the coming years is a move away from simple CDC in data systems. Typically, most data orgs are still replicating. They're still coupled to the persistence model. Right? They're still replicating databases. The paradigm I encourage folks to move to is more of a transactional outbox pattern. So, technically, how we implement, you know, this notion of these interfaces, of these data contracts, is we persist those data contracts into an outbox. A transactional outbox, you know, very simply is, hey, I have transactions happening on my database side. When those transactions happen, before I actually close the transaction, write something to this outbox table. And what you write to that outbox table is your contract, is, in our case, the update to the aggregate, essentially. Right? The rebroadcast of that aggregate.
So why do I mention that? Because the notion of versioning here is really supported by the implementation details within that outbox. Specifically, what I generally recommend to companies is the outbox should contain representations in a versioned data format. So for instance, you know, protobuf binaries. What do you write inside the outbox? Well, if what you write is really that aggregate encoded as, let's say, a protobuf binary, then now you can do things that protobufs allow you to do, like version. You can have, you know, version 1, version 2, version 3, etcetera, over time. And all of that semantic context, all of that definitional context, can be managed along with this evolving data format in that protobuf, and can then be governed and whatnot downstream. So what I really recommend to folks on the technical location side, and the approach that we're taking at Flexport, is the interface should be your domain model. That domain model should be essentially propagated out from an outbox, not from, you know, tables that you're CDCing from, but an outbox, which you can then CDC from if you like. You know, we use Debezium for that at Flexport.
And then if the way you encode that domain model in that outbox is in a versioned, evolving data format, you know, for us, protobuf, then you can capture all of those semantic changes. You capture all of those structural changes, and you can allow consumers downstream to manage those changes gracefully as they listen to and consume those events.
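A minimal, self-contained sketch of the transactional outbox pattern described here, using SQLite and a JSON payload so it runs standalone. The real pattern as described uses the production database, protobuf-encoded aggregates, and Debezium streaming the outbox table; the table and column names below are assumptions for illustration.

```python
import json
import sqlite3
import uuid
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE bills (bill_id TEXT PRIMARY KEY, customer_id TEXT, total_usd REAL);
    -- Outbox table that a CDC tool (e.g. Debezium) would stream downstream.
    CREATE TABLE outbox (
        event_id     TEXT PRIMARY KEY,
        aggregate    TEXT,     -- e.g. 'bill'
        aggregate_id TEXT,
        version      INTEGER,  -- domain model / contract version
        payload      TEXT,     -- JSON here; a protobuf binary in the real pattern
        created_at   TEXT
    );
""")

def update_bill(bill_id: str, customer_id: str, total_usd: float) -> None:
    """Update the persistence model and write the full aggregate to the outbox
    in the same transaction, so consumers never see one without the other."""
    with conn:  # single transaction: both writes commit or roll back together
        conn.execute(
            "INSERT OR REPLACE INTO bills VALUES (?, ?, ?)",
            (bill_id, customer_id, total_usd),
        )
        payload = json.dumps(
            {"bill_id": bill_id, "customer_id": customer_id, "total_usd": total_usd}
        )
        conn.execute(
            "INSERT INTO outbox VALUES (?, ?, ?, ?, ?, ?)",
            (str(uuid.uuid4()), "bill", bill_id, 1, payload,
             datetime.now(timezone.utc).isoformat()),
        )

update_bill("b-123", "c-456", 1999.00)
print(conn.execute("SELECT aggregate, aggregate_id, version FROM outbox").fetchall())
```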
[00:17:47] Unknown:
And so this idea of the outbox as the sort of output from the source systems, that's definitely useful for the applications that you control where you're defining the data model. But how do you handle the similar situation when you're working with maybe third party SaaS platforms and you're using a Fivetran or a Singer to be able to do the data extraction and then still being able to enforce the appropriate contracts of these are the types of records that I expect to be coming from this system. These are the formats that I want to encode it in to make sure that my downstream processing is going to be able to execute appropriately on them and being able to
[00:18:32] Unknown:
then also maybe have some contracts about the, you know, volume of data that I expect in each of these batch runs or anything like that. These third party systems, you know, you're hitting on kind of a core problem that is more acutely felt for companies with a data footprint that's primarily from these third party platforms. There isn't a good fix here, because you don't have complete control of the domain model representation, but you do have some control. And, you know, for us, the way we approach this dovetails with kind of the social aspect that you're talking about, which is, you know, how do we design contracts? How do we help teams design contracts in a way that, you know, speaks to, you know, those downstream use cases? And for us, that is anchored on this role that's very important to our data ecosystem called the analytics engineer. So the analytics engineer, you know, has, I think, been very popularized by dbt and the Fishtown community. And I think our definition of this analytics engineer role is a little bit different than maybe how others look at it in the industry. I think the way it's often looked at is analysts that write SQL or analysts that, you know, build data models and manage data pipelines. And for us, analytics engineers are, you know, many things, but they're really this kinda core data modeling expert, this data modeling SME. You know, we think of them as kind of the central hub of meaning making throughout the entire enterprise. How do we reason about the semantics enterprise wide, the core concepts that kinda govern our business? The reason I'm saying all of this is coming back to your question here about Salesforce and NetSuite, etcetera.
Well, how do we ensure that the Salesforce and the NetSuite, the CRM, the ERP, the HCM have, you know, reasonably stable contracts, and that those contracts are informed by downstream use cases? Well, we embed analytics engineers. So our analytics engineers are actually embedded widely into data producers, and this is key for us. This is the social part. Right? We talked about the technical part. The social part that governs our entire contract apparatus is the role of analytics engineering. For all data producers, our goal is to have an analytics engineer that serves as data steward. And that data steward helps the data producers to collect context from data consumers and to bake that context in, to bake those users and use cases in upstream in a way that allows them to encode those into explicit contracts that are actually expressive and useful. And so the HCM, the ERP, the CRM all have an analytics engineer supporting them in how do we structure the data effectively for internal users, but also external uses outside of those teams. And likewise with all data producers, all data producers have access to an analytics engineer that really specializes in the art and science of how do we structure data effectively. How do we, you know, make these contracts that are in turn based on domain models? How do we make those domain models really expressive?
[00:21:25] Unknown:
Another interesting element of the idea of these data contracts, 1 of the things that you mentioned is that the kind of boundary layers are defined by your business domains. And so these are the points at which you want to define and enforce these contracts. But the actual enforcement piece is another interesting layer where you're mentioning that for this outbox pattern, you've got these protobuf schemas. But at the social level where maybe you don't have a strictly technically defined, you know, mechanism for this handoff and still being able to, at the social level, say, I'm sorry, but I can't accept this piece of information from you because it doesn't adhere to, you know, these expectations or I'm sorry, but you're not allowed to propagate that piece of information because it contains PII in it or anything along those lines and just some of those aspects of enforcement at the social level?
[00:22:18] Unknown:
Yeah. So that social level, everything about the social level, again, just comes down to this analytics engineering role. You know, for us, this notion of data mesh at Flexport is unworkable without a central governance group, and that central governance group for us is that analytics engineer. How do we ensure interoperability between the contracts that are emitted from, you know, domain A and the contracts that are emitted or consumed by domain B? There needs to be some kind of governance layer, a procedural governance layer, a social governance layer, and that is driven by analytics engineers that are a part of 1 center of excellence but are matrixed into these different groups, and that then come together in these rituals where they say, hey, you know, I want to expose, you know, this new aggregate. Does that play well with, you know, the ecosystem that, you know, y'all are developing in this other domain? Does that meet the downstream use cases and the users of your domain? So they're really meant to be that kinda central meaning making facilitator.
And without that, you know, it's really hard for me to imagine how, you know, at scale, things like the data mesh or, you know, even notions of a data contract would work.
[00:23:23] Unknown:
This aspect of the data mesh is definitely something that is gaining a lot of attention and interest because of the fact that it does allow us to decompose these monolithic data problems into more bounded domains that we can manage and control in a similar manner to microservices, being able to say, this business team is responsible for these elements of information, and they're responsible for defining and enforcing and providing these contracts and interfaces for consuming and providing data between these boundaries. And I'm interested in digging a bit more into the manifestation of contracts at those boundaries where maybe I have an application database, and so I want to provide data as an output to analytics consumers and understanding what are the types of interfaces that are necessary to be able to make that consumable and then maybe how that plays into the kind of platform layer of, you know, as a product team, I have this information. I want to provide it to you. And at the organizational layer, what is required from a platform capability to be able to make that a tractable problem?
[00:24:44] Unknown:
So we simplify this problem. So, you know, we take this notion of data mesh and, you know, we simplify it in 2 ways to make it workable for us. 1 is we simply standardize the platform on Snowflake. You know, I think Zhamak's original piece on data mesh embraces polyglot representation of the data. Right? Some domains should be able to expose their data contracts in graph databases. Some should be able to do so in, you know, in document stores or caching layers or what have you. And in our internal data mesh, we explicitly reject polyglot representation and standardize on, look, these are, you know, relational, SQL style tables, and all data products in the data mesh are built with 1 standard transform framework, which is dbt, 1 standard set of governance frameworks that are baked into dbt, and exposed in 1 standard platform, which is Snowflake. So for us, the data mesh is really Snowflake as a mesh of marts, where all of those different domains, those bounded contexts, have essentially their own marts, where there's kind of a data lake layer within each mart and a data products layer where they expose these domain data products, these datasets, these contracts. And then, you know, our platform layer is much simplified. Every mesh consumer is consuming from the same place that they're also contributing to, which is 1 of these marts.
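One way to picture the "mesh of marts" convention described above is as a database-and-schema layout per bounded context, with a lake layer and a data products layer in each mart. The sketch below just generates illustrative Snowflake DDL; the domain names and the exact naming convention are assumptions, not Flexport's actual setup.

```python
# Sketch: generate Snowflake DDL for a "mesh of marts" layout, one mart per
# bounded context, each with a lake (landing) layer and a data products layer.
# Domain names and naming convention are illustrative assumptions.
DOMAINS = ["billing", "shipments", "customs"]

def mart_ddl(domain: str) -> list:
    return [
        f"CREATE DATABASE IF NOT EXISTS {domain}_mart;",
        f"CREATE SCHEMA IF NOT EXISTS {domain}_mart.lake;      -- contract events landed via CDC",
        f"CREATE SCHEMA IF NOT EXISTS {domain}_mart.products;  -- dbt-built domain data products",
    ]

for domain in DOMAINS:
    for stmt in mart_ddl(domain):
        print(stmt)  # in practice these would be run via the Snowflake connector or Terraform
```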
[00:26:01] Unknown:
With this ability to create these marts for these various data products, it does give that standardized interface. Before you settled on that approach, what were some of the other design choices that you played with to understand how can we create this platform layer to support this mesh concept and make it a scalable and maintainable solution? And also some of the requirements that went into your selection of a sort of common core data warehouse with these marts that are contained within that warehouse platform.
[00:26:36] Unknown:
We considered essentially various virtualization options and API options. You know, what you can think of as a kind of a basic virtualization option that I think a lot of organizations are contemplating is essentially a GraphQL layer on top of, you know, federated data stores. And that's a fine way to go. It just entails overhead, entails, you know, more management, more maintenance. For us, you know, we were on Snowflake. We were using dbt. It's a standard that worked. It's a standard that was intuitive and easy to democratize and easy to maintain. That's the standard we ultimately ended on.
[00:27:09] Unknown:
The other aspect of that is the sort of ergonomics of the solution where, as you said, you're already using Snowflake. You already have some tooling around it. It's a fairly standardized interface. You don't necessarily have to do a bunch of custom application development to consume a GraphQL endpoint. You can just use SQL. You know, there's a wealth of tools available for specific applications of that. And I'm interested in talking about some of the ergonomics of using data contracts in this method and some of the ways that they may potentially inhibit productivity and prevent the sort of free flow of information through the system, of course, adhering to, you know, issues of regulatory or compliance questions.
[00:27:57] Unknown:
Yeah. So, you know, I'll address that by talking about how we try and mitigate the impact on, essentially, developer productivity and the speed, the throughput, by which new data gets into the mesh and gets exposed as a data product. There's 2 core things here. 1, you know, my favorite topic is, again, the notion of the analytics engineer. The analytics engineer here too is key. You know, I think modern software development has gotten away from the core of what an application is, which is a UI on top of data. And frameworks like, you know, model view controller are replete with this notion of, you know, fat models, skinny controllers. Most of what an application is and most of where logic is, it should be bound at the model level. And yet, you know, most software engineers, most modern domain engineers, aren't so focused or specialized or trained or interested in maintaining those data models. What they're incentivized to focus on and where the state of tooling, you know, tells them to be focused is on, you know, maybe DevOps or, you know, other kinds of considerations.
So placing an analytics engineer that is actually a specialist in thinking about the domain model, in getting the domain model right the first time, allows us to accelerate how quickly we can assimilate new changes into that data model, into that domain model, and then push those changes downstream. So 1 key aspect of driving speed here, driving agility, is, again, that role of analytics engineer, absolutely crucial. The second is creating layers of indirection. Right? In the same way that we create a layer of indirection between the persistence model within a service and the domain model of that service, there are continuing layers of indirection as data enters and propagates through the data mesh. So the data is exposed from, let's say, the service. Right? The domain model, you know, emitted a contract.
That data then lands in Snowflake. That lands in the corresponding, you know, producer mart, into kind of the lake layer. Well, there's another layer of indirection. On top of that, that data is processed into data products, another layer of indirection. And all of this indirection serves to allow the layers below to move more quickly. Because there are these layers of indirection above, we can insulate changes downstream, or rather, we can insulate users downstream from changes upstream. So that's really how we kind of approach speed. 1 is through, essentially, this indirection approach, where these are all layers of data that are stacked on top of each other, and that allows lower layers to move quickly. And 2, again, with the use of an analytics engineer to really build data models the right way the first time, which allows them to scale and then evolve data models with the skill of someone focused on the art and science of data modeling.
[00:30:33] Unknown:
Another interesting element of these contractual concepts and obligations is the question of security and access control and how you're able to either define what the expectations are or maybe defining in some of these interfaces what the regulatory burdens are within the confines of that business domain. And particularly then as you move into the data as a product layer where you're maybe presenting some of this analyzed information to your customers to be able to say, okay. We in the case of Flexport, maybe we've analyzed all of the shipping logistics for this, you know, naval route. And based on this information, we're able to predict that if you put a container on a ship in Hong Kong, it's going to reach San Francisco in x number of days. And, you know, these are some of the, you know, cost benefits of being able to use, you know, this shipping line, for instance. And I'm just curious how you think about the, you know, security and access control implications of those different boundaries.
[00:31:43] Unknown:
Yeah. Security and access control is the topic that looms large in my mind. The simple answer here is I think of security in each mart as having 2 primary parties. 1 is a data owner, and the other is a data steward. So the data owner tends to be, you know, a product manager or, you know, a business owner that's really accountable to the data generating process, whether that's an application or a third party system or, you know, some other kind of offline process. They are typically steeped in the, quote, unquote, business requirements. They're typically closer to the kinda legal and security requirements, the ethical concerns, and they set the access and authorization rules. You know, every data product that's created, every data product that evolves, has these access and authorization rules. It can be exposed to these roles within the data mesh, these domain based roles, you know, these kinds of special users, these kinds of service accounts, etcetera. Who enforces that that does happen, that those roles are actually implemented, is the data steward. And the data steward for each mart is typically an analytics engineer. You know, in keeping with the data mesh philosophy, the analytics engineers aren't the only ones that are doing modeling in SQL and dbt and Snowflake. That is democratized to that whole domain, but there is still a steward that ensures the quality constraints, and in this case the security constraints, are actually implemented per the guidelines of the data owner.
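As a rough illustration of the owner and steward split described here, a data owner might declare access rules for a data product and the steward might translate them into Snowflake grants. Everything in this sketch, the rule format, role names, and object names, is assumed for illustration and is not Flexport's actual process.

```python
# Hypothetical access rules a data owner might declare for a data product.
ACCESS_RULES = {
    "billing_mart.products.invoices": {
        "read": ["ROLE_BILLING_ANALYST", "ROLE_FINANCE"],
        "pii": False,
    },
}

def grants_for(product: str, rules: dict) -> list:
    """Data-steward side: translate the declared rules into GRANT statements."""
    stmts = [f"GRANT SELECT ON VIEW {product} TO ROLE {role};" for role in rules["read"]]
    if rules["pii"]:
        # Placeholder: PII handling (e.g. masking policies) would be applied here.
        stmts.append(f"-- apply masking policy to PII columns of {product}")
    return stmts

for product, rules in ACCESS_RULES.items():
    print("\n".join(grants_for(product, rules)))
```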
[00:33:10] Unknown:
StreamSets DataOps Platform is the world's first single platform for building smart data pipelines across hybrid and multi cloud architectures. Build, run, monitor, and manage data pipelines confidently with an end to end data integration platform that's built for constant change. Amp up your productivity with an easy to navigate interface and hundreds of prebuilt connectors, and get pipelines and new hires up and running quickly with powerful, reusable components that work across batch and streaming. Once you're up and running, your smart data pipelines are resilient to data drift, those ongoing and unexpected changes in schema, semantics, and infrastructure.
Finally, 1 single pane of glass for operating and monitoring all of your data pipelines, the full transparency and control you desire for your data operations. Get started building pipelines in minutes for free at dataengineeringpodcast.com/streamsets. The first 10 listeners that subscribe to StreamSets' professional tier will receive 2 months free after their first month. In terms of the kind of biggest challenges that you've been facing in being able to understand at an organizational level how to define and adopt these contractual boundaries and then implement the technical requirements and build a platform that supports all of these interfaces, what are some of the, I guess, biggest pain points that you've encountered and some of the ways that you've been able to work through them and understand sort of what are the trade offs of each of these decisions.
[00:34:44] Unknown:
So I think the biggest challenge is, you know, we talk about data systems and this data mesh as kind of an analogy to service oriented architecture. And 1 of the biggest problems architecturally, right, conceptually, not the implementation details, but kind of philosophically, early on with SOA is if you define the domain boundaries incorrectly, you're gonna have a bad time. And the same is true with data mesh. The same is true with, you know, these kinds of bounded contexts that we have developed in the data layer. If the way we think about bounded contexts is wrong to begin with, and the bounded contexts themselves are going to evolve rapidly, then this entire architecture that we built around those bounded contexts becomes a little bit more brittle. And there's really no way to completely solve for that. Fast growing companies, and, you know, Flexport is 1 of them, a hypergrowth company that has evolved very rapidly and will continue to evolve very rapidly in terms of its business complexity, will face changing, shifting bounded contexts. The biggest challenge for us going into all of this, thinking about these contracts, thinking about the interfaces between groups, is what should those groups be, and how do we, you know, avoid Conway's law, for instance, in how we partition out these bounded contexts.
So there's really no solve for that. Things will evolve. I would say, you know, what we did was 2 things. 1 is we recognized that things will evolve, and where they need to evolve, we try and handle those evolutions first in the downstream layers of indirection, or rather the upstream layers of indirection here, which is, you know, we do it in Snowflake. First and foremost, before we move the bounded contexts, you know, in the actual service layer, we change them in Snowflake. We try and reason about, hey, there seems to be a lot of coupling, there seems to be a lot of overlap, you know, between domain A and domain B. You know, perhaps we need to, you know, reconfigure what this bounded context landscape looks like. And so we do that there, and that gives us kind of a safe place to play. This indirection gives us this opportunity to, you know, have a 2 way door in how we model, in kind of a safe playground. And then we bring those learnings upstream, and then we rejigger what the bounded contexts look like upstream. And the second approach is we just took a lot of time thinking about what those bounded contexts should be in the first place. If you don't take that time, if you don't really think about what those domain models are and how they relate all at once, you're basically setting yourself up for these models evolving so rapidly that it's gonna be very hard to keep up. Another interesting element of
[00:37:10] Unknown:
these contracts, in, you know, considering maybe the schema elements, is how you're able to establish feedback loops where you say, at this point in the life cycle of data, I expect these records to be in this structure. And so in order for that to happen, I want them to originate from the generating system in this schema. And so now I want to build in some developer tooling so that anytime I'm generating an event from this application, I can validate as part of my CI/CD or maybe my linting checks to say that this structure is being applied and enforced properly and that the semantics of the information are matching my expectations that I want them to adhere to at this downstream point in the life cycle of that data. And I'm just wondering if that's something that you have been able to start to tackle and any considerations that you have about how to manage some of that feedback loop in the full life cycle of data from the origination to the downstream system, and then as you're evolving that contract, having it feed back into, you know, the information generation
[00:38:16] Unknown:
stage? Yeah. So, you know, feedback loops are something we think a lot about. I think this is a place where there's opportunity for better tooling in the space. Today, feedback loops are, you know, 2 things. 1, you already know what I'm gonna say, which is, again, the analytics engineer, right, the key to the whole thing, and I'm the biggest fanboy of this notion of the analytics engineer, where, you know, they're embedded into these data producer teams, but their mandate is to look at consumption patterns downstream, to actually talk to consumers downstream, to be more of this kind of data product manager, so to speak, that's accountable for the quality and ergonomics of that product. Right? So here again, analytics engineering. 2 is, well, we should also support those analytics engineers with, you know, some kind of tooling to help them understand what those downstream use cases look like. And here, they have essentially 2 options today, and, you know, we hope that this will get better over time. 1 option is, well, again, remember that a key part of our data mesh philosophy is tooling standardization.
I explicitly reject polyglot representation. I also explicitly, and even more strikingly, reject a variety of transformation tooling, and so we've standardized on dbt. And so analytics engineers have very clear visibility, in terms of models that are created downstream, on what are those antecedent nodes that those models rely on. So anytime an analytics engineer has a data product in their domain that they're the data steward for, they can very clearly see this node, this product that we created, is the parent node to, you know, a 1000 other products, and here are those other products. Here's the transform steps along the way. I have full visibility into them. Now let me go and try and understand what those use cases, what those downstream transformations, imply for me as the data steward of this domain and as the architect of this domain. And number 2, we're also looking at, well, sometimes downstream use cases are not persisted as pipelines in dbt. Sometimes they're just consumption. You know, I'm just writing a query. I'm just building a dashboard. And so we're also looking at, essentially, query parsing and column level lineage based on usage to inform how are these data products being used and what does that mean for how they should be constructed in the future.
[00:40:17] Unknown:
Working in the dbt space, I'm curious if there are any practices or specific technical implementations that you've been able to lean on to ensure that the different stages of the DBT pipeline are adhering to those different contracts as you propagate across those different business domains?
[00:40:37] Unknown:
Yeah. I think leaning on a couple of things is what I'd recommend to folks in the space. So the first thing I recommend to folks in the space is, you know, I think you did an episode recently with someone on column name contracts. Right? I think that was Emily, and that is a fantastic approach. It's basically trying to create a strongly typed world within dbt and Snowflake and SQL based data models. I absolutely recommend that. If you can standardize on these kind of strong naming conventions as kind of pseudo types, then those pseudo types, everywhere they exist, will have the same level of guarantees. That's 1 thing I'd recommend. You know, beyond that, various packages to try and standardize on types of analysis, types of transformation, and then, of course, we also have the analytics engineers, that backstop for ensuring common governance standards, common implementation standards.
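A rough sketch of the column-names-as-contracts idea referenced here: column name prefixes act as pseudo-types that can be checked mechanically. The prefix conventions and the check below are assumptions for illustration and are not the dbtplyr API.

```python
from datetime import date

# Hypothetical prefix -> expected pseudo-type conventions (assumed, not dbtplyr's).
PREFIX_RULES = {
    "id_": str,      # identifiers
    "dt_": "date",   # dates
    "n_":  int,      # counts
    "is_": bool,     # booleans
}

def check_columns(columns: dict) -> list:
    """columns maps column name -> sampled value; returns pseudo-type violations."""
    violations = []
    for name, value in columns.items():
        for prefix, expected in PREFIX_RULES.items():
            if name.startswith(prefix):
                if expected == "date":
                    ok = hasattr(value, "year") and hasattr(value, "month")
                else:
                    ok = isinstance(value, expected)
                if not ok:
                    violations.append(f"{name}: expected {expected} by prefix '{prefix}'")
    return violations

print(check_columns({"id_customer": "c-1", "n_shipments": "7", "dt_created": date.today()}))
# -> flags n_shipments, since the prefix promises an integer count
```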
[00:41:24] Unknown:
In your own work at Flexport, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process of defining and building out this data mesh implementation and understanding how best to define and apply these contractual obligations between business teams?
[00:41:40] Unknown:
Yeah. I think the hardest part of all of this that I wanna emphasize to folks is the importance, and people that have worked in microservices know this, the importance of domain modeling, the importance of, in the application footprint especially, really reasoning about your domain in a way that is reasonably stable through time. A well designed domain model is a domain model that will, a, be stable, and, b, be expressive enough that it can adapt to the variety of downstream use cases that people would ask of that domain model. So really putting enough effort into stress testing that domain model, I think, is the biggest pain. I can't say that we've completely solved that either. It's a challenge that everyone that's ever worked in service oriented architecture will face, and that's what I've impressed on people. That's the hard part that's really important to get right. For people who are
[00:42:27] Unknown:
starting to explore the space of data mesh and data contracts and business domain modeling, what are some of the cases where it's the wrong choice and it adds too much overhead and maybe they're best suited with just doing a more sort of ad hoc approach to how they manage these handoffs or maybe just being a bit more, I guess, forgiving in terms of the interfaces that they provide and
[00:42:53] Unknown:
consume? I think what we've talked about in terms of having data contracts, as well as having the separation between a domain model and a persistence model, I think, is never the wrong choice. You know, I've talked to a variety of companies. I've talked to some earlier stage companies that have, you know, brought up, this sounds like a lot of overhead. You know? Can I really commit to this? Can I afford to invest in this now? And, look, having worked in a lot of early stage companies, I'd say early stage companies can least afford not to, because early stage companies are most likely to have explosive change in their domain models, explosive change in their persistence models, that will completely wreck downstream use cases that are dependent on these implicit contracts. So I think there's never a wrong time to implement this, these kind of basic notions of, hey, think about what your domain model is, abstract that persistence model, don't couple to your persistence model, and try and reason about these data quantums in ways that are reasonably bounded and separated from each other. I would say that what will make it very hard is if you don't have specialists in the room like analytics engineers. So if you don't have, as very early hires, if you don't have something like this analytics engineering role, that can be more of that meaning making central data specialist that specializes in structuring data models at every layer of the stack, right, not just in SQL, not just in Snowflake, but also upstream, if you don't have that, things are gonna be very difficult. So I would say never a wrong time. But if you don't have a role like an analytics engineer, that is the first thing I'd prioritize as quickly as possible so that you can start paying down these investments in tech
[00:44:26] Unknown:
debt. As you continue to work with your team and evolve these domain models and expand on the investment that you've made in these contractual boundaries, what are some of the things you have planned for the near to medium term or any particular projects or improvements that you're excited to dig into?
[00:44:42] Unknown:
We are rolling out a data catalog. We're rolling out Stemma, and that's gonna be integral to our enterprise's successful use of the data mesh and the contents of Snowflake. What we'd like to expand on is the contracts that are actually emitted from the domains themselves being represented in the catalog, and that's something where, you know, we're looking at the best way to do that and to try and combine that with our catalog on Snowflake. Really, we wanna make sure we have enterprise wide visibility into kinda every layer of indirection along the data stack. So we wanna start with that terminal layer, but really we wanna move our way up the stack as well. Are there any other aspects of the work that you're doing on data mesh and data contracts at Flexport or the overall space of how to
[00:45:27] Unknown:
enforce these interface definitions in the data ecosystem that we didn't discuss yet that you'd like to cover before we close out the show? Yeah. I think there's 2 things here that are kinda open questions, open opportunities for tooling and technology.
[00:45:42] Unknown:
The first is what, in my talk with Scott Hirleman on his Data Mesh Radio podcast, he coined as data product marketing, which I think is a very important notion. So taking a step back, the notion of thinking of data assets as data products is easily 1 of the buzziest concepts in the space. It's probably the most attractive principle behind the notion of data mesh. And, you know, like all metaphors, it's really easy to over apply this 1, but I don't think we're at the point where we can over apply it yet. The thinking is still relatively early. I'd encourage people to lean in, and what that means is, if data management is analogous to product development, what does that mean for how we work and what kind of functions we need to support the work that we do? And 1 place to start is considering product marketing. So when mature product orgs build product, they support those launches with product marketing.
And we really need the same when we launch data products. This is something that I think data catalogs, especially, are really uniquely well positioned to do. How can data product owners do the same for the products they launch? And when that feedback loop produces iterations that introduce breaking changes, how can we proactively notify those downstream users and get their feedback qualitatively, their active feedback, not their implicit feedback? Right? Now this is definitely something we're approaching as a process at Flexport, but I think it's something tooling can help with. And at scale, and I think Flexport will reach that point soon, it becomes necessary. So I think that's number 1.
Number 2, I think, is really automated modeling. So AutoML was very buzzy 3 to 5 years ago, and that buzz has died down somewhat, in part because the benefits of ML, auto or otherwise, are sort of narrow. But that isn't the case for data modeling. And so the question is how do we help data modeling scale, and how do we help those data modelers, those analytics engineers? How do we help them be higher leverage? And I think what that looks like is some kind of machine in the loop modeling augmentation that becomes more automated.
And here, I think there's 3 things in the space that are interesting approaches. So the first is this notion of the entity layer. So Benn Stancil talked about this last week in his great Substack. And my take here on this notion of an entity layer is that, you know, we're talking about modeling. We're talking about the importance of designing these contracts well. But company entities are remarkably standard across types of companies. And to me, the biggest source of inefficiency in thinking about defining these contracts or thinking about defining data models is arbitrary uniqueness. So 1 of the strongest held beliefs I have about data, but also about just running tech companies, is there's only so many operating tactics. There's only so many operating models. It's a very bounded set. B2B is B2B is B2B. A marketplace is a marketplace. And to the extent that's true, what we can all really benefit from is a cross company, right,
A cross-company entity layer, a contract layer, that drives that standardization. So, relevant to everything that we've talked about here, you can think of this as kind of a new type of contract layer where, when your source system, whatever that source system is, whether that's a third-party source system or a proprietary source system, when your source system wants to represent a customer entity and you're a B2B SaaS company, you can do that however you want locally. You can do it however you want from a persistence-model perspective, whatever makes sense. But the minute that data leaves your boundary, is exposed from your boundary, it must couple to a contract standard that is reasonably universal. Right? That isn't arbitrarily unique.
And, you know, the data ecosystem has tried to do this in the past. I think there have been attempts, in a lackluster way, with things like dbt packages, with Fivetran blocks, with LookML blocks, you know, etcetera. But they're not really thinking about this contract layer. They're still thinking about this downstream. Right? I think they have the distribution wrong. I think the right way to think about this is moving upstream and defining standardized contracts that services can actually couple to. That's probably kind of an open source and community-driven contract layer. So that's 1 way I think we can make, you know, modeling more effective, more efficient, in a way that is really in keeping with everything we've talked about on contracts.
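As an illustration of what coupling to such a standard could look like, here is a minimal sketch of a shared customer entity contract expressed as a Python dataclass. The field names and the mapping from a domain-local row are assumptions for the example, not an existing community standard.

```python
# Hypothetical sketch of a shared, community-defined "customer" entity contract.
# Domains can persist customers however they like internally; at the boundary,
# anything they expose must map onto fields like these. Field names are
# assumptions for illustration, not an established standard.

from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class CustomerContractV1:
    customer_id: str               # stable, globally unique within the company
    display_name: str
    created_at: datetime
    lifecycle_stage: str           # e.g. "lead", "active", "churned"
    annual_contract_value: Optional[float] = None  # None for non-contracted customers


def to_contract(row: dict) -> CustomerContractV1:
    """Map a domain-local persistence model onto the shared contract at the boundary."""
    return CustomerContractV1(
        customer_id=str(row["account_uuid"]),   # local column names are domain-specific
        display_name=row["company_name"],
        created_at=row["created_ts"],
        lifecycle_stage=row.get("stage", "active"),
        annual_contract_value=row.get("acv"),
    )
```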
And the last thing on the AutoML side is really, I'd love to see more by way of kind of introspection-based automated modeling, and we've explored some of this at Flexport. There's ways to introspect on the data that you just have within a domain or, you know, within the enterprise. You know, you can look at what the metadata of a table is, what that means for primary and foreign keys, and how these tables join together. You can actually introspect on the data itself to try and infer join relationships. You can also take introspection further, and you can introspect on query usage. Right? This is something we're starting to do, where we're looking at, well, how are people actually querying that data, and what does that mean for what new kinds of, you know, unions and joins and whatnot, what new kinds of aggregations we need to develop as part of our data models?
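For a sense of what the simplest form of this within-company introspection might look like, here is a hedged Python sketch that scans warehouse metadata for column names shared across tables and proposes them as candidate join keys. It assumes a generic DB-API cursor against a warehouse that exposes the standard information_schema.columns view; real tooling would also compare value overlap and observed query patterns.

```python
# A hedged sketch of metadata introspection: columns that share a name across
# several tables are proposed as candidate join keys. Assumes any DB-API cursor
# against a warehouse exposing the standard information_schema.columns view
# (Snowflake, BigQuery, Postgres, etc.); this is a heuristic starting point,
# not Flexport's actual implementation.

from collections import defaultdict


def candidate_join_keys(cursor, schema: str) -> dict:
    """Return {column_name: [tables containing it]} for columns seen in 2+ tables."""
    cursor.execute(
        "select table_schema, table_name, column_name from information_schema.columns"
    )
    tables_by_column = defaultdict(set)
    for table_schema, table_name, column_name in cursor.fetchall():
        if table_schema.lower() == schema.lower():
            tables_by_column[column_name.lower()].add(table_name)
    return {
        column: sorted(tables)
        for column, tables in tables_by_column.items()
        if len(tables) > 1  # a shared column name is a weak hint at a join relationship
    }
```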
And, you know, if you took this a step further, if you have this kind of introspection logic across companies, right, a tool that runs across multiple companies, then, in keeping with our notion of cutting down on arbitrary uniqueness, you could actually use introspection across multiple companies' query usage to inform, you know, standard ways of representing MAU, or monthly active user, models or customer models or what have you. So I think there's a lot of opportunity there as well, and, you know, we're starting to look at both of these. But I think for either of these to succeed, it would need to be, you know, open source and community driven, and it needs partnership from many more orgs beyond Flexport.
[00:51:37] Unknown:
To that point as well, it's interesting to consider the role of things like the emerging metrics layer as a contractual boundary where you're moving from the data warehouse into these downstream systems, and so you're creating a contract around the semantics of things like, for instance, as you said, monthly active users. And to the point about third-party systems, another interesting potential direction, which I'm sure would take a long time and an act of Congress to enforce, is the idea that maybe these 3rd-party systems generate some sort of, like, open metadata contract saying, these are the data models that we expose through our APIs where you're able to extract information to propagate to your downstream systems, so that we have a much more visible and clearly defined set of data models that we can work with from these upstream systems.
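As a small illustration of treating a metric as a contract at that boundary, here is a sketch of a declarative definition of monthly active users that downstream tools could consume instead of each re-deriving the logic themselves. The table, columns, and field names are assumptions for the example, not a reference to any particular metrics-layer product.

```python
# Hypothetical metric contract: a single, owned definition of "monthly active
# users" that BI tools and downstream consumers read, rather than rewriting the
# SQL independently. All identifiers here are illustrative assumptions.

MONTHLY_ACTIVE_USERS = {
    "name": "monthly_active_users",
    "description": "Distinct users with at least one qualifying event in the calendar month.",
    "grain": "month",
    "owner": "growth-analytics",
    "version": 1,
    "sql": """
        select date_trunc('month', event_at) as month,
               count(distinct user_id)       as monthly_active_users
        from analytics.events
        where event_type in ('login', 'api_call')
        group by 1
    """,
}
```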
[00:52:36] Unknown:
Yeah. What a beautiful world that would be. If there's any way to get there beyond an act of congress, it's not going to be led by, you know, isolated third party tools. It's not gonna be led by Salesforce saying, hey. We are going to, you know, create this industry wide or this rather cross platform wide entity layer. It's going to be led by the community. It's gonna be led by demand. It's going to be led by data professionals that are standardizing on these kinda entity primitives that then these upstream systems can couple to. Right? So I think that's what's really gonna drive it, and I think we're early days. But we're very actively thinking about that. I'm very actively thinking about that, and, you know, would love to connect with anyone else that's interested.
[00:53:15] Unknown:
Absolutely. Well, for anybody who does want to get in touch and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd be interested in getting your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. I think it is a lot of that data product marketing notion that we just talked about. I think a lot of it is that kinda introspection-based or entity-layer-based auto-modeling tooling. I think the third is, you know, there's another way to approach automated modeling, which is another talk in itself
[00:53:47] Unknown:
that has to do with knowledge graphs. So at Flexport, we're experimenting with knowledge graphs as a way to represent the realities of global trade. My take is that well-structured graphs, the graph as a format, natively lend themselves to more expressive questions and answers, you know, less modeling, and also, you know, interesting ML-driven auto-modeling approaches. So, you know, 1 of the things I'm very interested in is also tooling that democratizes the creation of those knowledge graphs, helping more companies create those knowledge graphs, and to do so in part by helping maybe their non-graph databases also speak graph. So I think that's the other thing I'm really interested in, and another hour-long topic in and of itself.
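For readers unfamiliar with the idea, here is a minimal sketch, using the networkx library in Python, of how trade entities and relationships might be represented as a graph and traversed without hand-written joins. The node and edge types are illustrative only and not Flexport's actual model.

```python
# Minimal sketch: entities as typed nodes, facts as typed edges, so that
# relationship questions become graph traversals rather than pre-modeled joins.
# Entity and relation names are illustrative assumptions.

import networkx as nx

graph = nx.MultiDiGraph()

# Entities become nodes with a type attribute...
graph.add_node("shipment:S1", kind="shipment")
graph.add_node("carrier:OceanCo", kind="carrier")
graph.add_node("port:SHA", kind="port")
graph.add_node("port:LAX", kind="port")

# ...and facts become typed edges between them.
graph.add_edge("shipment:S1", "carrier:OceanCo", relation="carried_by")
graph.add_edge("shipment:S1", "port:SHA", relation="origin")
graph.add_edge("shipment:S1", "port:LAX", relation="destination")

# "Which ports does this carrier touch?" falls out of traversal.
shipments = [
    u for u, _, d in graph.in_edges("carrier:OceanCo", data=True)
    if d["relation"] == "carried_by"
]
ports = {
    v for s in shipments
    for _, v, d in graph.out_edges(s, data=True)
    if d["relation"] in ("origin", "destination")
}
print(ports)  # {'port:SHA', 'port:LAX'}
```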
[00:54:33] Unknown:
Absolutely. Well, thank you very much for taking the time today to join me and share the work that you're doing at Flexport, to dig deep into this area of data mesh and being able to model your data usage in these bounded contexts of the different business domains, and creating and enforcing these contractual elements between them. It's definitely a very interesting and constantly evolving space. I appreciate all the time and energy you're putting into exploring that within your organization and helping to share it externally. So thank you again for your time, and I hope you enjoy the rest of your day. Cool. Great. Have a good 1. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Host Welcome
Interview with Abhi Sivasailam: Defining Data Contracts
Understanding Data Contracts: Correctness and Accuracy
Purpose and Failure Cases of Data Contracts
Defining Data Contracts: Conceptual and Technical Details
Handling Third-Party Systems and Social Aspects
Data Mesh and Platform Layer Considerations
Ergonomics and Productivity in Data Contracts
Security and Access Control in Data Contracts
Challenges and Lessons Learned
When Data Contracts Might Not Be the Right Choice
Future Plans and Improvements
Open Questions and Opportunities in Tooling
Closing Remarks and Thank You