Summary
Data contracts are both an enforcement mechanism for data quality, and a promise to downstream consumers. In this episode Tom Baeyens returns to discuss the purpose and scope of data contracts, emphasizing their importance in achieving reliable analytical data and preventing issues before they arise. He explains how data contracts can be used to enforce guarantees and requirements, and how they fit into the broader context of data observability and quality monitoring. The discussion also covers the challenges and benefits of implementing data contracts, the organizational impact, and the potential for standardization in the field.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
- At Outshift, the incubation engine from Cisco, they are driving innovation in AI, cloud, and quantum technologies with the powerful combination of enterprise strength and startup agility. Their latest innovation for the AI ecosystem is Motific, addressing a critical gap in going from prototype to production with generative AI. Motific is your vendor and model-agnostic platform for building safe, trustworthy, and cost-effective generative AI solutions in days instead of months. Motific provides easy integration with your organizational data, combined with advanced, customizable policy controls and observability to help ensure compliance throughout the entire process. Move beyond the constraints of traditional AI implementation and ensure your projects are launched quickly and with a firm foundation of trust and efficiency. Go to motific.ai today to learn more!
- Your host is Tobias Macey and today I'm interviewing Tom Baeyens about using data contracts to build a clearer API for your data
- Introduction
- How did you get involved in the area of data management?
- Can you describe the scope and purpose of data contracts in the context of this conversation?
- In what way(s) do they differ from data quality/data observability?
- Data contracts are also known as the API for data, can you elaborate on this?
- What are the types of guarantees and requirements that you can enforce with these data contracts?
- What are some examples of constraints or guarantees that cannot be represented in these contracts?
- Are data contracts related to the shift-left movement?
- The obvious application of data contracts are in the context of pipeline execution flows to prevent failing checks from propagating further in the data flow. What are some of the other ways that these contracts can be integrated into an organization's data ecosystem?
- How did you approach the design of the syntax and implementation for Soda's data contracts?
- Guarantees and constraints around data in different contexts have been implemented in numerous tools and systems. What are the areas of overlap with tools like dbt and Great Expectations?
- Are there any emerging standards or design patterns around data contracts/guarantees that will help encourage portability and integration across tooling/platform contexts?
- What are the most interesting, innovative, or unexpected ways that you have seen data contracts used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on data contracts at Soda?
- When are data contracts the wrong choice?
- What do you have planned for the future of data contracts?
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- Soda
- Podcast Episode
- JBoss
- Data Contract
- Airflow
- Unit Testing
- Integration Testing
- OpenAPI
- GraphQL
- Circuit Breaker Pattern
- SodaCL
- Soda Data Contracts
- Data Mesh
- Great Expectations
- dbt Unit Tests
- Open Data Contracts
- ODCS == Open Data Contract Standard
- ODPS == Open Data Product Specification
[00:00:11] Tobias Macey:
Hello, and welcome to the Data Engineering podcast, the show about modern data management. Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for. Starburst has complete support for all table formats, including Apache Iceberg, Hive, and Delta Lake. And Starburst is trusted by teams of all sizes, including Comcast and DoorDash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst today and get $500 in credits to try Starburst Galaxy, the easiest and fastest way to get started using Trino.
Your host is Tobias Macey, and today I'd like to welcome back Tom Baeyens to talk about using data contracts to build a clearer API for your data. So, Tom, can you start by introducing yourself for anybody who hasn't heard your past appearances?
[00:01:04] Tom Baeyens:
Sure. Yeah. I'm Tom, CTO and cofounder of Soda. I started off in the software engineering space, building workflow engines in open source at JBoss and Red Hat, creating open source brand names like jBPM and Activiti. Then I moved into data and cofounded Soda together with Maarten, because we saw that data quality was becoming a massive problem. And what we are doing now in data has some similarity with what I did in the past, in the sense that it's open source declarative languages and engines that we build.
[00:01:41] Tobias Macey:
And do you remember how you got started working in the data space and why it is that you've decided to spend so much of your time and energy focused on it?
[00:01:50] Tom Baeyens:
Yeah. Sure. It was just the excitement of what was going on back then in data. I was working in process management workflows for a long time, and that was a promising environment for a very long time. But I saw that in data, things were moving fast and happening. I had some ideas about it, and there were some similarities, but it was really an exciting time, so I joined. And indeed, the background in software really helps me to figure out what's going on in data and how we can improve that landscape.
[00:02:28] Tobias Macey:
Now in terms of the topic today for data contracts, I'm wondering if you can give some sense about the scope and purpose that data contracts have in the context of this conversation and in the data space in general.
[00:02:44] Tom Baeyens:
Sure. The purpose of data contracts is actually to achieve more reliable analytical data. Analytical data has a notoriously bad reputation because it breaks regularly. And on the other hand, the potential usage of data in applications has recently been growing fast. There was reporting, of course, but now there are recommendation engines, even pricing algorithms. So imagine a hotel website which puts pricing out there, and there's faulty data feeding into the pricing algorithm. The software will keep working, but the revenue will go down if your prices are bad. These data algorithms can only work properly if the data is reliable.
And that's the problem we're tackling with contracts, where they play an important part. Analytical data pipelines are in themselves a huge integration problem, and the pipelines themselves are very brittle. Data contracts are in fact a new approach to data testing that goes broader than the technology itself; we'll touch on that. It applies the same principles as unit testing in software engineering. In software, when code changes, you need to rerun your full test suite to regain trust in your changed software. Similarly with data, each time new data is produced, you need to test it to keep the trust.
[00:04:15] Tobias Macey:
From that perspective of trust and ensuring the correctness and quality of data, there has been several years worth of momentum building up around the idea of data observability and data quality monitoring. And I'm curious if you can give some sense about the ways that those concepts overlap with or maybe even contend with the idea of data contracts.
[00:04:39] Tom Baeyens:
Yeah, sure. And that's a great question, because observability has been getting a lot of attention in the last few years. It's similar to Datadog and New Relic monitoring your applications. Right? This is all about creating visibility into your data warehouse. It simplifies the diagnostic process: when there's a potential data issue that needs to be investigated, observability helps you diagnose it. And this is the reactive part. After the fact, you go look, and the visibility helps you find the problems. But while this actually helps and is an important ingredient, it's only putting out the fire while the pyromaniac is still out there. Data testing has the goal of stopping the pyromaniac.
And this is where I feel there's way too much focus on observability alone, and data testing has not gotten the proper attention yet. I think that will surely come soon, or is coming, actually. Data testing is all about making sure that the data is as expected across the various handover points in your data pipeline. This is really the preventing part. So just like in software, it's not that observability is better than testing; you need both observability and testing. That, I think, is the key to comparing the two. And data contracts now become an important approach in the data testing part, one that goes further than just the technology.
[00:06:22] Tobias Macey:
On that note of testing in the software application space, there has been a long history, and there are still points of contention, but it's generally agreed upon that unit tests are a good thing. And there are general patterns that have built up around how to do unit testing, how to do integration testing, what it means to do end-to-end testing, and the ratios of those different types. In data, there's been a lot of conversation recently about that idea of bringing unit tests to data, but obviously there's another dimension to it that makes it more complicated. And I'm curious if you can talk to the ways that these concepts of unit testing in the data space compare to the purpose of data contracts, and maybe some of the ways that teams should be thinking about the appropriate ratios of data unit tests, data contracts, and the role of observability as a perspective on top of those.
[00:07:18] Tom Baeyens:
Yeah. In terms of the link between the various forms of testing that you have in software: I'm not sure to what extent there's absolute consensus in the engineering world on when you need integration testing versus unit testing and how much of each. But the principle itself is generally accepted. If you don't do automated testing on a broad scale, both at the integration level and at the unit level, you will end up in a situation where you don't trust your new release. And I think that definitely transfers to data.
If you don't test anything, you will lose faith in your data. And you will have a problem in the boardroom: you see the graph going down, and you wonder, is this bad data, or is this actually our business going down? If you don't trust that data, then it's actually useless, and your whole investment goes down. Now, one level deeper, I'm not sure if all the analogies hold, but I don't think the principle of testing is as adopted in data yet as it should be. So I think for the majority of situations, we have to start by creating awareness around what a data component in a pipeline is. Currently, pipelines run start to end. What is a component, and what is the handover point? Where do you apply the tests? I think that's where we should start before digging any deeper into what to call those types of tests.
[00:09:07] Tobias Macey:
The other interesting wrinkle that comes into play when you're thinking about testing for your data is that when you're dealing with an application, you have the idea of: I need to run it through the test suite in the CI/CD process before it goes into production. Once it passes all of the tests, I have pretty good confidence that everything is going to work when I deploy. With data, you have some measure of confidence in that you can test the business logic around your transformations, your extract and load, etcetera. But the problem that always comes up with data, and when you're talking to people about testing for data, is that data changes, and you don't necessarily have full control or even visibility into when or how that data is going to change. And so you can't just say, okay, I'm gonna run it through my set of tests, I'm gonna put it into production, and everything's great. And I'm wondering how that also factors into the way that you think about testing and validation, and what an integration test and an end-to-end test mean in the context of a data pipeline or a data flow.
[00:10:09] Tom Baeyens:
Yeah, that's actually a great point, because I see many people struggling with that notion. I think there are two key events that you need to separate. One is when code changes: there is a potential for the software to break. You have pipeline code, you change it, that software might break, and it ends up in bad data applications or in bad data. This is the CI/CD pipeline of your data pipeline, basically. So when you change your transformation logic or ingestion logic, it makes perfect sense to test that, to have sample data that you run it on. But the tricky part, which took me a while to figure out, is the data as it is in production. Imagine a daily batch job: your Airflow runs on a daily schedule.
The data passes through. There's no code change every day, and yet the data might break at some point without the code being changed. So every batch of data, you could consider as a new release of data, which has to be tested just the same. And I think that's the analogy there. In CI/CD, you test your code changes. In the production pipeline, you test your data changes, because that's also a new release of data.
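To make that concrete, here is a minimal sketch of the "every batch is a new release" idea: the same checks run on every scheduled load, even when no code has changed. The table, column, and checks here are hypothetical, and SQLite stands in for a real warehouse.

```python
import sqlite3

def validate_batch(conn: sqlite3.Connection, table: str) -> list[str]:
    """Run the same contract checks against every freshly landed batch."""
    failures = []
    # Check 1: the batch must not be empty.
    rows = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    if rows == 0:
        failures.append(f"{table}: batch is empty")
    # Check 2: a required column must have no missing values.
    nulls = conn.execute(
        f"SELECT COUNT(*) FROM {table} WHERE customer_id IS NULL"
    ).fetchone()[0]
    if nulls:
        failures.append(f"{table}: {nulls} rows with NULL customer_id")
    return failures

# In a daily Airflow-style job this would run after ingestion on every
# schedule, because the data changed even though the code did not.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders_batch (customer_id INTEGER, amount REAL)")
conn.execute("INSERT INTO orders_batch VALUES (1, 9.99), (NULL, 5.00)")
print(validate_batch(conn, "orders_batch"))  # one NULL failure reported
```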
[00:11:36] Tobias Macey:
Circling back around to data contracts, what are some of the types of guarantees and requirements that you can enforce using that mechanism? And what are some of the examples of things that you can't logically represent in the construct of a data contract?
[00:11:53] Tom Baeyens:
Yep. So in terms of what you can enforce: the contract is the API for data. Let's look at that first before we dive into what you can check. API for data means, as I touched on a bit earlier, that a long data pipeline might consist of several components, which are currently a bit blurry. Right? I don't think a lot of teams actually have great awareness of where one component stops and the next one starts. And that, to me, is one of the biggest advances that data contracts bring into the data space, because it demarcates the componentization of your pipelines.
It's all those datasets that are the handover from one component to another, or from one team to another. Those are justified to represent as an API, because the previous component or team delivers a table with new data in it, and the next team or component is gonna use it. The dbt transformation might have a certain dataset as input, and they have certain assumptions about it: what does the schema look like, what are the uniqueness properties if joins are being done, and so on. The principle of encapsulation in software, I think, is crucial here and is missing in the data space. Encapsulation means that from that previous component, you don't need to know all its internals. The only thing you really need to know is that table: what can I rely on in terms of schema and all the other properties that I use? And this is what you describe in a contract. So the dataset is a tabular data structure, typically a table or a view.
And then you have the schema that goes with it, and then all of the other data quality properties. That's what you document in a contract, which is very similar to an OpenAPI or GraphQL description of services. But in this case, the context is that it's a table that you're gonna consume over a SQL connection, firing SQL at it. So that encapsulation in itself is key. And the second part of that question is: what are the guarantees that you want to enforce as you're processing a new batch of data? Typically things like the schema, which you for sure wanna test, because as you use that data, you're most likely gonna use the columns, and the naming and the data types need to match. But also missing values, validity, uniqueness, referential constraints: everything that you can test on that new batch of data is what you want to check as part of your contract enforcement.
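As an illustration of that idea, here is a hedged sketch of a contract as "the API for a table": a producer-owned description of the schema that a physical dataset can be verified against. The dict stands in for what would normally live in a YAML contract file, and the field names are illustrative rather than any particular standard's syntax.

```python
import sqlite3

# The contract describes the handover table: schema plus quality guarantees.
contract = {
    "dataset": "dim_customer",
    "columns": [
        {"name": "customer_id", "type": "INTEGER", "unique": True, "not_null": True},
        {"name": "email", "type": "TEXT", "not_null": True},
    ],
}

def verify_schema(conn, contract) -> list[str]:
    """Compare the physical table's schema to the contracted one."""
    actual = {
        row[1]: row[2]  # column name -> declared type
        for row in conn.execute(f"PRAGMA table_info({contract['dataset']})")
    }
    problems = []
    for col in contract["columns"]:
        if col["name"] not in actual:
            problems.append(f"missing column {col['name']}")
        elif actual[col["name"]].upper() != col["type"]:
            problems.append(
                f"{col['name']}: expected {col['type']}, found {actual[col['name']]}"
            )
    return problems

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_customer (customer_id INTEGER, email TEXT)")
print(verify_schema(conn, contract))  # -> [] when the schema matches
```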
[00:15:07] Tobias Macey:
At Outshift, the incubation engine from Cisco, they are driving innovation in AI, cloud, and quantum technologies with the powerful combination of enterprise strength and startup agility. Their latest innovation for the AI ecosystem is Motific, addressing a critical gap in going from prototype to production with generative AI. Motific is your vendor and model-agnostic platform for building safe, trustworthy, and cost-effective generative AI solutions in days instead of months. Motific provides easy integration with your organizational data, combined with advanced, customizable policy controls and observability to help ensure compliance throughout the entire process. Move beyond the constraints of traditional AI implementation and ensure your projects are launched quickly and with a firm foundation of trust and efficiency.
Go to motific.ai today to learn more. In that flow of: I have a new batch of data, and I'm applying all of these tests to it, one of the challenges that comes up is whether you apply those tests before you actually process all of the data or after. You want to make sure that your transformations are correct, but if you've already landed the data in the destination and it fails the tests, you want to prevent it from actually being used, either in a business intelligence report or by downstream use cases. I'm curious if you can talk to some of the ways that you think about pre-tests and post-tests, and how to control the propagation of data once it fails a certain batch of tests.
[00:16:43] Tom Baeyens:
Yeah, that makes a lot of sense too. In terms of when to test what, there's a trade-off here. Usually, if your pipeline is not built with this in mind from the start, you just append new data to a certain incremental dataset. And this is where the notion of releasing new data comes into play, because once you're adding it to the incremental table, it actually means you've published it. You cannot retract it: the consumer might have just run a query and already consumed this new information. So if you're gonna test it after adding it to the incremental table, then there is a potential that you've released it without testing.
So that's the risk. But this usually is easy to do, because you can easily apply a filter in a contract, for instance, or in your data testing, and then you don't need to change your pipeline code. So this is how we say you can start with contracts by layering them on top of your existing architecture, but the notifications alone won't give you proper circuit breaking. If you really want proper circuit breaking, then you need a CI/CD-like approach, which is: you land your new data in a separate table, for instance, and run your contract checks on there. And only when that succeeds do you append it to the incremental dataset. It requires a bit of work. If you do this from the start, it's actually quite easy; if you retrofit it, then it's gonna take some work. And then usually it's okay to just test the data and signal the problem when it lands on the incremental dataset.
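A minimal sketch of that circuit-breaking pattern might look like the following: land the batch in a staging table, run the contract checks there, and only append to the incremental table when everything passes. The table names and the single check are illustrative.

```python
import sqlite3

def publish_batch(conn: sqlite3.Connection) -> bool:
    """Append staging data to the incremental table only if the checks pass."""
    bad = conn.execute(
        "SELECT COUNT(*) FROM orders_staging WHERE amount < 0"
    ).fetchone()[0]
    if bad:
        # Circuit breaks: the batch stays in staging, downstream is untouched.
        print(f"blocked: {bad} rows failed the amount >= 0 check")
        return False
    conn.execute("INSERT INTO orders SELECT * FROM orders_staging")
    conn.execute("DELETE FROM orders_staging")
    conn.commit()
    return True

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")
conn.execute("CREATE TABLE orders_staging (order_id INTEGER, amount REAL)")
conn.execute("INSERT INTO orders_staging VALUES (1, 10.0), (2, -3.0)")
publish_batch(conn)  # blocked; fix upstream, rerun, then the data publishes
```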
[00:18:31] Tobias Macey:
And, fortunately, the capabilities of the underlying storage and query engines are evolving to a point where it improves the ability to make these changes and test them before you publish. I'm thinking of things like the Nessie project for Iceberg tables, lakeFS for general data lake approaches, and zero-copy cloning for Snowflake, where you can make a copy of the table, make those changes, test them, and then publish them back. So it's becoming possible and easier, but it also depends on what you're actually using as your underlying substrate.
[00:19:07] Tom Baeyens:
Definitely. Yeah. I agree there.
[00:19:09] Tobias Macey:
As far as the implementation of these data contracts and the ways to think about how the contracts are defined and who defines them in terms of unit testing, that was associated with the overall DevOps trend of shift left where you wanna move everything as early in the process as possible. And in data, a lot of that shift left means that you have to bring in application teams so that they can notify you when they're making changes to the underlying sources that you're pulling data from. I'm curious how you think about the responsibility and application of data contracts and how that fits into the technical and organizational structure of a business.
[00:19:50] Tom Baeyens:
Okay. Yep. Let me try and start all the way from the consumers and then work our way backwards towards the source data, because this story, I think, shows the power of contracts. Imagine you're building a simple report using some data, say three tables. If you want that data to be tested, you want to create your contract on it. As a consumer, you can start by saying: these are my tables, this is the contract that I want to see verified, and I want to be notified if this doesn't hold. But as a consumer, it's pretty impractical to manage those contracts, because you don't have the ability to change the pipeline that produces the data. That's the engineering teams, the data producers.
So a contract, just like an OpenAPI description, is not something that the clients should maintain. No, it's the team producing the data. You always want to hand the contract over to the team producing the data. It's an integral part of the software producing it, and it is the description of the interface. In the end, the consumers want these contracts. They want to know about the contracts because they describe how to use the data; it's all the metadata describing the data that you can use in your data products. But then you want those data producers to take ownership. Right? Now, in the transformations that have led up to this refined data that goes into the reports, they have probably used input data, either from the extraction or from a previous transformation or whatever. And the producers are often reluctant to take on this ownership, to provide the guarantees in the contract, because they rely on input data which they might not fully trust. And so that puts pressure on their input data to have contracts as well.
If they know, oh, on the last transformation I have contracts on all of my inputs, then I can actually guarantee the outputs of the refined data. And that mechanism goes all the way up to the source data, which brings a new, coarser-grained level onto your data infrastructure. You shouldn't be looking at all the tables in your data warehouse. No, you should be looking at the handover tables between those components. If you have a component that produces something, you want to see its inputs also protected by a contract. And this goes all the way to the source systems, the production data. This is where something tricky happens a lot, of course, because initially we need the data from the production systems. Right? And there is a REST API around that production system which the team is fully aware of, and they provide consistency, and it's managed as a product.
But then, in order to export the data to the analytics team, they just break in through the back door and take the database table data, which was never intended as an API, and the team didn't know. This is a hard conversation that needs to be had, because the analytics team actually needs that data. And this is where the first user of that data could say: I'm making this initial contract, but let's have that conversation with the production team. Can you take ownership? We actually use this as a product. And we are better off knowing it if you cannot give any guarantees; at least put some integration tests in place so that we know when it's breaking, rather than just ignoring the problem. I think that's where contracts come into play in that ecosystem.
So it's the handovers, and it's pushing all the way up to the source, where some hard conversations sometimes need to be had.
[00:24:00] Tobias Macey:
In that overall flow of information starting from that consumer of, I wanna make sure that these constraints are always true. I wanna know if they're not true, pushing that down into the consumers and producers and pipelines and applications that have relationships with that data. Obviously, those contracts are integrated as part of that pipeline that consumes and transforms and produces that data for those consumers. But I'm wondering if you can talk to some of the ways that these contracts have a ripple effect across the overall organization and their approach to data and some of the ways that you surface the information about those contracts, particularly when they are failing so that somebody who is relying on that set of data can know, oh, hey. I'm looking at this dashboard, but I can't actually trust it because I see that this check failed. How do they even get to that information, and how do you try to surface that in a way that helps to build trust rather than detracting from trust of saying, oh, well, the data's always broken. I can't ever trust it.
[00:25:09] Tom Baeyens:
Yeah, that makes a lot of sense. The key thing here is that we need to get out of the situation where data breaks so often; the pyromaniac is still out there. If you start doing contracts, I think you'll see that you get fewer issues overall and less of this problem. But of course, we're very early in the journey. That's one thing. As for how this surfaces: I want to reiterate what I just said, which is that nowadays people look at the whole warehouse and see a gazillion datasets, tables or views, where data is stored. A value that contracts bring is that you identify the datasets of which someone takes ownership.
The datasets with a contract are the ones where someone says, I stand by this dataset, which pushes down all the other datasets that are not as relevant. And this is already a key property when pushing information to a catalog or a data discovery tool. The data discovery tool is where the datasets are found by the consumers. So as you're gradually adopting contracts, they're gonna see: oh, this is a dataset governed by a contract, this is a more interesting one for me than some dataset that might or might not be good. That's one thing. And then, of course, that same data discovery tool is where you may want more granular information, such as how often a dataset breaks and which checks did break. But I think the key is not necessarily having the consumers learn about the actual failures in depth, because that's the debugging process, where we have more in-depth tools to find the root cause.
For consumers, it should just be: is it covered by a contract? Okay, then I can already rely on it a lot more. And who's the owner? Who can I talk to? I think those are the key questions that you want answered in your data discovery tool.
[00:27:29] Tobias Macey:
And so now, bringing this from the abstract, what data contracts are and how you use them, to the concrete example of what you're building at Soda with this data contracts tool: I'm wondering if you can talk through some of the ways that you thought through the design process, the syntax, and the implementation for how to actually bring these data contracts into the pipeline and into the data ecosystem, and how to ensure that they can be written, understood, and maintained without it just becoming another pile of spaghetti code that nobody can understand and everybody has to debug.
[00:28:08] Tom Baeyens:
Exactly. Yep. I can definitely talk a bit about how we got there. Early on, even before data contracts came into play, we created SodaCL as a declarative YAML language for expressing data quality checks. And that gave us validation of the declarative approach: specifying these quality checks in YAML really resonated. But it was created from the perspective of the consumer. The consumer is where the problems show up, and we started from there, expressing checks over multiple datasets. You can actually build a contract strategy on top of this declarative language if you really have this background of how the organization should work in terms of producers, what the ownership means, the consumers, and how to put these checks in between. But we realized there was an opportunity to align much better with the data producer world. The data producers produce a set of output ports, in data mesh terminology, or output datasets of your software component, and we saw an opportunity to align the language much better with that. That's what we did. So we could leverage the query engine that runs the evaluation of all the checks, a solution that was already built and stabilized a long time ago, and the only thing we had to do was tune the language towards this new use case of running all the checks for a single dataset.
[00:29:55] Tobias Macey:
Another interesting aspect of this space is that the approach of testing for data and building guarantees around data has been around for a while, and different tools have implemented it in different ways. There's also the space of metrics definitions to say: okay, I have worked through this data, and these are the types of things that you can expect from it; these are the semantics around it. So dbt has its metrics and unit tests. There's Great Expectations, which is built around making sure that data matches your expectations of what you want it to be. And there's the tool that you're building for data contracts in the Soda open source ecosystem.
I'm wondering if you can talk to some of the ways that you think about the areas of overlap of what you're building with some of those other tools, and in particular, some of the either emerging or nascent standards as to how to think about the definition and maintenance of these guarantees for data?
[00:30:58] Tom Baeyens:
Right. Yeah, two very different questions, but I'll tackle them one by one. First of all, what's the overlap with other tools? From a Soda perspective, we definitely see our product as a component in a central data stack. We don't assume a single technology that works in one environment; we want to work across a variety of environments. And that is usually, as we touched on earlier, an existing architecture with different orchestration tools, possibly dbt for the transformations as well. And there might be data floating around outside of dbt.
And so we have the combination: we apply the principles of data contracts, and, as we saw before, it's mostly an organizational thing, making sure that you support the right workflows. We do this as a combination of observability and data testing, and we deliver that as a package, so that it makes sense to install it on your central data infrastructure. As a central data team, you can say: this is the tool we're working with for data testing. And then all the different teams have guidance in their particular environment; they can actually get the guarantees, there is guidance on how to apply it, and it works across those different environments. So that's how our perspective is a bit different. And then the second part: yes, this is a very new space.
Standards are popping up left and right, and we think that's really important. It also feels like a place where a standard could really help, because there are lots of tools integrating with contracts. We didn't touch on that yet, but unit testing is one aspect of contracts; pushing metadata to data discovery tools is another use case, for another tool. There's retention; there's access control. There are all these different aspects that you can model easily in a contract. So it makes perfect sense to consider a standardization effort. There are multiple competing ones at the moment. There's ODCS, the Open Data Contract Standard, known through the Bitol project.
There's ODPS, the Open Data Product Specification, and there are probably a few others as well. So we keep a very close eye on those and help wherever we can to push them forward, because we know that for our customers this is crucial. This is the value that standards can bring. The whole data landscape is still super fragmented. We'll probably see some consolidation going forward, which would be really good to have. But as long as we don't have that, all these tools need to interoperate.
And I think contracts are gonna play a major role in that. So wherever we can help, we're there.
[00:34:03] Tobias Macey:
To that point of integration, and access control in particular, I think that's a very interesting application of these contracts: to say, I guarantee that this set of data is only accessible by people who have these roles. But there isn't really any cohesive standard around how to actually apply access controls across different data tools; that's one of the problems that I run into constantly at my work. And I'm wondering if you can talk to some of the types of integrations that you're thinking about building, or have already built, for these data contract specifications, some of the areas in the data ecosystem that are in good shape for being able to push these types of guarantees down into other layers, and some of the areas where you're seeing gaps as far as how to actually approach that integration and enforcement of the guarantees that you want to specify.
[00:34:57] Tom Baeyens:
Okay, let me run through them one by one. If you start with a contract and what you can do with it, there are multiple use cases, and that's really important to distinguish here. One use case is being the central source of metadata for your data, the system of record, so to speak, for your metadata. You have your YAML file, you describe your schema and your types and all that, and all the information for consuming the data is in that file. It's managed by the producers, because they control the actual data production, so they should also control that description. And then pushing that to data discovery, as we said, that's one use case.
That's one tool consuming that YAML file, extracting a portion of that information, and displaying it to the consumers. The second use case, the one that we focus on, is the unit testing. As you're specifying the schema and your data quality properties, we can extract those and run checks to see if the data really matches, and that part is covered with checks. So unit testing is a second use case. And there might be others. The next one is where we see companies, the customers themselves, build their custom workflows. They use a couple of properties that they add here and there, which could be around ownership or PII information, and what it means for them, where they want to enforce and use it. They build the tooling and the software logic in their workflows that leverage this information.
And there's more, like retention and access management, that you could specify. In that sense, contracts become kind of the configuration files for the tools in your data stack; that's another way of looking at it. And I think the challenge that we're gonna face going forward is that more tools will adopt the contract, which is already a huge benefit versus having all this logic spread around over all the tools. Having it centrally managed in a single file is a huge benefit. But as an engineer, if you're gonna change this file after a couple of years, how are you gonna know which tools a given property impacts? If you change uniqueness from false to true, or you go from specifying nothing to saying this is a unique column, does that imply that you're gonna run a unit test on it and check uniqueness on the full dataset? You might do that without being aware. You just wanted the data catalog or data discovery use case, to tell your users it's unique, without realizing that you're now gonna test for it. So we take that into the design of the language, to make sure there's a clear distinction between what you're gonna test and what you're gonna publish. That's what I see as the challenge: as more tools get configured in this contract, which is valuable, how do you make sure that the engineers still stay in charge?
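Here is a small sketch of that "one contract, several tools" idea, with one possible convention (not an existing standard) for separating the properties a catalog merely displays from the ones a test runner enforces:

```python
# An illustrative contract doubling as configuration for several tools.
# The "checks" key marks what gets enforced; everything else is descriptive.
contract = {
    "dataset": "dim_customer",
    "owner": "team-crm",          # ownership: catalog / discovery use case
    "retention_days": 365,         # retention-tooling use case
    "columns": [
        {"name": "email", "pii": True,        # descriptive: catalog only
         "checks": ["not_null"]},              # enforced: test runner only
        {"name": "customer_id",
         "checks": ["not_null", "unique"]},
    ],
}

def catalog_view(contract):
    """Extract only the descriptive metadata a discovery tool would show."""
    return {
        "dataset": contract["dataset"],
        "owner": contract["owner"],
        "pii_columns": [c["name"] for c in contract["columns"] if c.get("pii")],
    }

def checks_to_run(contract):
    """Extract only the properties the test runner will enforce."""
    return [(c["name"], chk) for c in contract["columns"]
            for chk in c.get("checks", [])]

print(catalog_view(contract))
print(checks_to_run(contract))  # uniqueness here implies a real test run
```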
[00:38:02] Tobias Macey:
In terms of the workflow of bringing these data contracts into your ecosystem, I'm wondering if you can talk to some of the types of questions that you're seeing engineering teams come up with, some of the ways that they're thinking about how and where to apply these data contracts, some of the ways that they've been able to benefit from paying down complexity, but also some of the ways that they're maybe running into issues where they want the data contract to do something and it's not flexible enough, or they just don't understand the limitations of what types of things they can guarantee.
[00:38:41] Tom Baeyens:
Yeah, more on the other side: I saw some very interesting use cases of data contracts which I didn't expect initially, and which broadened my mind a little bit. We started off with the API for data. Right? We have this metadata, the data contract describing the schema and all of that. Then someone asked for a little tool that I didn't expect, and that was to generate the DDL statement, the create table statement, from the contract. And while this was just a normal feature request, it started to trickle down: what does this mean? It actually could mean that the data contract becomes the control plane for your warehouse, where you're not starting from the DDL and working your way to the contract, but the other way around.
The engineer starts by building the contract first, building that metadata, and then just does an apply, where the tool calculates the difference and runs it. So I think that was a powerful metaphor, an insight that says maybe we're going in that direction, because we're actually fixing a limitation of the warehouses, of the storage layer. We're fixing the fact that storage layers have very limited metadata. They only have column name and data type, and that's it, whereas many of the workflows in a data environment require a lot more metadata.
Just like document systems in the past: it's not only about the documents; you can only do proper workflows if you associate metadata with your documents. Here it's just the same. You can only do interesting workflows and automate things in your organization if you have your metadata together with your datasets. So that was an interesting way that we saw this being applied.
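A toy sketch of that generation direction, going from a contract to a CREATE TABLE statement, might look like this. A real tool would diff against the warehouse and emit ALTER statements; the contract fields here are illustrative.

```python
# Illustrative contract: the metadata comes first, the DDL is derived.
contract = {
    "dataset": "dim_customer",
    "columns": [
        {"name": "customer_id", "type": "INTEGER", "not_null": True},
        {"name": "email", "type": "TEXT", "not_null": True},
        {"name": "signup_date", "type": "DATE"},
    ],
}

def contract_to_ddl(contract: dict) -> str:
    """Generate a CREATE TABLE statement from the contract's schema."""
    cols = ",\n  ".join(
        f"{c['name']} {c['type']}" + (" NOT NULL" if c.get("not_null") else "")
        for c in contract["columns"]
    )
    return f"CREATE TABLE {contract['dataset']} (\n  {cols}\n);"

print(contract_to_ddl(contract))
# CREATE TABLE dim_customer (
#   customer_id INTEGER NOT NULL,
#   email TEXT NOT NULL,
#   signup_date DATE
# );
```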
[00:40:42] Tobias Macey:
And in your work of building these data contract interfaces, thinking through how they fit into an organization's data ecosystem, and building the tooling around it, what are some of the most interesting or unexpected or challenging lessons that you learned in the process?
[00:40:59] Tom Baeyens:
Yeah. One of the more interesting things we came across was applying generative AI to this. We totally didn't expect it, or at least I didn't; I was initially a little bit skeptical, and then someone said, just give it a try. We opened up a prompt and taught it a little bit about what Soda was. And then I thought, can I use it as a contract generator? So in the prompt I added the create table statement, because we have it; you can extract it from the warehouse. And I added sample data by just pasting in a couple of insert statements, without really explaining them.
And I asked, can you generate a contract for me? And it was pretty impressive. It came back with a full working contract. There was a student dataset with a GPA column in it, nothing more. It figured out that it could create a min and max check on the GPA: whereas all the values were between 3.2 and 3.7, it created the check with minimum value 0 and max 4, which it deduced from the acronym GPA. From its network, it said: okay, this must be a GPA score, and then applied the data quality check. That was really impressive. Then I thought, oh, this looks good; let me try to run it.
There was actually a problem, because the create table was in Postgres and used a VARCHAR type description, and the assistant took that as the data type in the contract. But if you ask the Postgres metadata what the type is, you get "character varying". This was the only error. So I tried to run it, a whole bunch of logs came out, and somewhere it said: your contract check failed because the data type doesn't match. And I thought, why not try it? I just copied and pasted the whole log into the prompt, saying, this doesn't seem to run, can you fix it? And the assistant came back with a fixed contract that actually worked, and that had a good balance of which checks to apply and which not. I was totally impressed with that, and it made me look ahead: wow, this is really how generative AI can change the interface of a system. You used to work in an IDE, YAML editing, or XML editing back in the day. Now you're gonna have a conversation: can you update my contract?
This is what I want; can you fix it? So that was, yeah, impressive to see. We just added that quickly to the product.
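For illustration, a contract along the lines of the one in this story might look like the following sketch, with an inferred min/max validity check on the GPA column. The field names are hypothetical rather than Soda syntax, and the comment notes the data type pitfall Tom describes.

```python
# A hypothetical, simplified version of the generated contract: a schema
# plus a validity check deduced from the GPA acronym.
contract = {
    "dataset": "students",
    "columns": [
        # Postgres metadata reports VARCHAR columns as "character varying",
        # which is what tripped up the first generated contract.
        {"name": "name", "type": "character varying"},
        {"name": "gpa", "type": "numeric", "valid_min": 0, "valid_max": 4},
    ],
}

def invalid_gpas(values, column):
    """Return the values that fall outside the contracted valid range."""
    return [v for v in values
            if not (column["valid_min"] <= v <= column["valid_max"])]

gpa_column = contract["columns"][1]
print(invalid_gpas([3.2, 3.5, 3.7, 4.6], gpa_column))  # -> [4.6]
```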
[00:43:50] Tobias Macey:
For people who are interested in the capabilities that we've been discussing and are looking to improve the overall reliability of their data platform, what are the contexts in which a data contract is the wrong choice?
[00:44:07] Tom Baeyens:
Yep. So that's a hard one, actually. If you anticipate that your analytical data does not change, remains stable, nothing changes, then you might not need it. And maybe it's the same as in software engineering: when do you not need unit testing? I guess that's also pretty hard. It's kind of like starting a new software project: it might be that you say, I have a script here that is for my personal use, I run it three times a year, maybe not even anymore after today. Then I don't think it's justified to run unit tests on it. Same with your data: if you do a one-off, don't bother. But if it's part of your data infrastructure and you're building enterprise workflows on top of it, I don't think you should be avoiding this at all. It's gonna be hard, and I agree it's early; this is not a common practice yet. But I definitely feel that it's coming, and I expect that within 5 years, no one is gonna start another data pipeline project without thinking about the contract first.
[00:45:17] Tobias Macey:
As you continue to work on your tooling and keep an eye on the overall ecosystem of data and how people are thinking about building guarantees around their pipelines and their analytical capacity, what are some of the things you have planned for the near to medium term, or any predictions that you have going forward about how the space might evolve?
[00:45:38] Tom Baeyens:
Yeah. So at Soda, we're in early access right now, and we plan to bring this to GA later this year. We can do that quite fast because of the solid foundations of the engine that we have. But in general, I think the more interesting part is to look ahead at the uptake of contracts, and mostly at the organizational aspects: the assumptions that we have in software, how do they apply in data? Can we get the same kind of vibe there, where we start to think in data pipeline components with interfaces between them, where the producer teams take ownership, because they're currently often missing in action?
That principle, for me, is what's gonna be very interesting to watch play out. Is it gonna be, like in software, a very lengthy process? There, it took maybe 10 years before unit testing was widely adopted, but now we have that example. So it's much easier to explain it now in terms of software principles, and that you need it is also quite easy to see. So I think it's gonna be a lot faster. But is it 2 years? Is it gonna be 7, or something in between? That's gonna be interesting to watch, from my perspective.
[00:46:55] Tobias Macey:
Are there any other aspects of this space of data contracts, either conceptually or in terms of your implementation at Soda that we didn't discuss yet that you'd like to cover before we close out the show?
[00:47:09] Tom Baeyens:
No. I guess the biggest thing that I would like to see happening is a more integrated data platform. There's currently a gazillion tools that each do a small part, and you feel somehow it would be much easier if they were better integrated. I think that's gonna happen at some point; the bigger players are thinking about this, of course. So that, to me, is also what makes for very interesting times ahead: how is this all gonna be consolidated, so that you get a complete data infrastructure platform as a single product, rather than having to stitch together all the different tools, as is, unfortunately, the state we are in right now?
[00:47:59] Tobias Macey:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And I just want to thank you again for taking the time today to join me and share the work that you and your team are doing on data contracts, as well as your perspective on the role that they play in an organization's data ecosystem and the applications that they have to help build greater confidence in and reusability of data. So thank you again for that, and I hope you enjoy the rest of your day.
[00:48:29] Tom Baeyens:
Thank you, Tobias. It was a super pleasure to be here, and thank you for helping the data ecosystem by sharing all this knowledge. That's very much appreciated.
[00:48:46] Tobias Macey:
Thank you for listening, and don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering Podcast is your guide to the fast-moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it: email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Hello, and welcome to the Data Engineering podcast, the show about modern data management. Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end to end data lake has platform built on Trino, the query engine Apache Iceberg was designed for. Starburst has complete support for all table formats, including Apache Iceberg, Hive, and Delta Lake. And Starburst is trusted by teams of all sizes, including Comcast and DoorDash. Want to see Starburst in action? Go to data engineering podcast.com/starburst today and get $500 in credits to try Starburst Galaxy, the easiest and fastest way to get started using Trino.
Your host is Tobias Macy, and today I'd like to welcome back Tom Byans about using data contracts to build a clearer API for your data. So, Tom, can you start by introducing yourself for anybody who hasn't heard your past appearances?
[00:01:04] Tom Baeyens:
Sure. Yeah. I'm, Tom, CTO, cofounder of Soda. Started off in the software engineering space, building workflow engines in open source at JBoss and Red Hat, creating open source brand names like, JBPM and Activity, then moved into Data, cofounded, Soda together with Martin, because we saw that, like, data quality was becoming a massive problem. And, what we are doing now in terms of data is for, like, has some similarity with what I did in the past in the sense that it's, open source declarative languages and engines that we built.
[00:01:41] Tobias Macey:
And do you remember how you got started working in the data space and why it is that you've decided to spend so much of your time and energy focused on it?
[00:01:50] Tom Baeyens:
Yeah. Sure. Sure. It was, like, just the excitement of what was going on back then in data. So I was working in in process management workflows for a long time, and that was a promising, environment, for a very long time. But I saw that in data, things were moving fast and happening. And I have some, ideas about it and there were some similar similarities, but that was really an exciting time. I joined it. And indeed now, the background in software really helps me to to find out, like, what's going on in data and how we can improve that, that landscape.
[00:02:28] Tobias Macey:
Now in terms of the topic today for data contracts, I'm wondering if you can give some sense about the scope and purpose that data contracts have in the context of this conversation and in the data space in general.
[00:02:44] Tom Baeyens:
Sure. The purpose of data contracts is actually to achieve more reliable analytical data. And analytical data on itself has been notoriously have a, it has a bad reputation because it breaks regularly. And on the other hand, recently, the potential usage for data and application is growing fast. So there was reporting, of course. There's now recommendation engines, pricing algorithms even. So imagine, a hotel website which puts pricing on there and there's, like, faulty data feeding into the pricing algorithm. The software will keep work working, but the revenue will go down if your prices are bad. So, so these data algorithms, they only can work properly if the data is reliable.
And that's, that's that's the problem we're tackling with contracts as well where it plays, an important part. Yeah. And analytical data pipelines in itself is a huge integration problem. The data pipelines themselves are very brittle. And data contracts is in fact like a new approach for data testing that goes broader than the technology itself. Sure. We'll touch on that. Like, it applies the same principles as unit testing and software engineering. So in software, when code changes, you'll need to rerun your full test suite to gain trust in your changed software. Similar with data, each time a new data is produced, you'll need to test it to keep the trust.
[00:04:15] Tobias Macey:
From that perspective of trust and ensuring the correctness and quality of data, there has been several years worth of momentum building up around the idea of data observability and data quality monitoring. And I'm curious if you can give some sense about the ways that those concepts overlap with or maybe even contend with the idea of data contracts.
[00:04:39] Tom Baeyens:
Yeah, sure. That's a great question, because observability has been getting a lot of attention in the last few years. It's similar to Datadog and New Relic monitoring your applications, right? It's all about creating visibility into your data warehouse, and it simplifies the diagnostic process. When there's a potential data issue that needs to be investigated, observability helps you diagnose it. That's the reactive part: after the fact, you look and create visibility to help find the problems. While that helps and is an important ingredient, it's only putting out the fire while the pyromaniac is still out there. Data testing has the goal of stopping the pyromaniac.
And this is where I feel there's way too much focus on observability alone, while data testing hasn't gotten the attention it deserves yet. I think that will come soon; it's coming, actually. Data testing is all about making sure that the data is as expected across the various handover points in your data pipeline. That's the preventive part. Just like in software, it's not that observability is better than testing; you need both observability and testing. That, I think, is the key to comparing the two. And data contracts now become an important approach in the data testing part, one that goes further than just the technology.
[00:06:22] Tobias Macey:
On that note of testing in the software application space, there has been a long history, and there are still points of contention, but it's generally agreed that unit tests are a good thing. There are general patterns that have built up around how to do unit testing, how to do integration testing, what it means to do end-to-end testing, and the ratios of those different types. In data, there's been a lot of conversation recently about bringing unit tests to data, but obviously there's another dimension that makes it more complicated. I'm curious if you can talk to the ways that unit testing in the data space compares to the purpose of data contracts, and how teams should think about the appropriate ratios of data unit tests, data contracts, and the role of observability as a perspective on top of those.
[00:07:18] Tom Baeyens:
Yeah. In terms of the actual link, there are the various forms of testing that you have in software, right? And I'm not sure to what extent there's absolute consensus in the engineering world on when you need integration testing versus unit testing and how much of each. But the principle itself is generally accepted: if you don't do automated testing on a broad scale, at both the integration level and the unit level, you end up in a situation where you don't trust your new release. And I think that definitely translates to data.
If you don't test anything, you will lose faith in your data, and then you have a problem in the boardroom: you see the graph going down and you wonder, is this bad data, or is our business actually going down? If you don't trust the data, it's useless, and your whole investment goes down with it. Now, one level deeper, I'm not sure all the analogies hold, but that first one definitely does at large scale. I don't think the principle of testing is as widely adopted yet as it should be. So for the majority of situations, the first job is creating awareness: what is a component in a pipeline? Currently, pipelines run start to end. What is a component, and what is the handover point? Where do you apply the test? I think that's where we should start, before digging any deeper into what to call those types of tests.
[00:09:07] Tobias Macey:
The other interesting wrinkle that comes into play when you're thinking about testing for your data is that when you're dealing with an application, you run it through the test suite in the CI/CD process before it goes into production. Once it passes all of the tests, you have pretty good confidence that everything is going to work when you deploy. With data, you have some measure of confidence that you can test the business logic around your transformations, your extract and load, etcetera. But the problem that always comes up with data, when you're talking to people about testing, is that data changes, and you don't necessarily have full control or even visibility into when or how that data is going to change. So you can't just say, okay, I'll run it through my set of tests, put it into production, and everything's great. I'm wondering how that also factors into the way you think about testing and validation, and what an integration test and an end-to-end test mean in the context of a data pipeline or a data flow.
[00:10:09] Tom Baeyens:
Yeah, that's actually a great point, because I see many people struggling with that notion. I think there are two key events you need to separate. One is when code changes: there's the potential of the software breaking. You have pipeline code, you change it, the software might break, and that ends up in bad data or bad data applications. This is the CI/CD pipeline of your data pipeline, basically. When you change your transformation logic or ingestion logic, it makes perfect sense to test that with sample data. But the tricky part, which took me a while to figure out, is the data as it is in production. Imagine a daily batch job: your Airflow runs on a daily schedule.
The data passes through, and there's no code change every day. That means the data might break at some point without the code being changed. So you can consider every batch of data as a new release of data, which has to be tested just the same. I think that's the analogy: in CI/CD, you test your code changes; in the production pipeline, you test your data changes, because each batch is also a new release of data.
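A minimal sketch of the two events Tom separates: code changes tested in CI against fixture data, and every production batch tested at runtime as a new release of data. All function names here are invented stand-ins, not from Airflow or any specific framework.

```python
# The same contract checks serve both events: a code change (CI) and
# a fresh production batch (runtime). Everything below is illustrative.

def transform(rows):            # stand-in for real transformation logic
    return rows

def extract_todays_batch():     # stand-in for the real extraction step
    return [{"customer_id": 2, "signup_date": "2024-06-01"}]

def load(rows):                 # stand-in for appending to the warehouse
    print(f"loaded {len(rows)} rows")

def validate_batch(rows):
    assert all("customer_id" in r for r in rows), "schema drift: customer_id missing"
    assert all(r.get("signup_date") for r in rows), "missing signup_date values"

def ci_test_pipeline():
    # Event 1: the code changed, so rerun the checks on a known fixture.
    fixture = [{"customer_id": 1, "signup_date": "2024-01-01"}]
    validate_batch(transform(fixture))

def daily_run():
    # Event 2: no code changed, but today's batch is still a new
    # release of data and gets tested before it is trusted.
    rows = extract_todays_batch()
    validate_batch(rows)
    load(rows)

ci_test_pipeline()
daily_run()
```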
[00:11:36] Tobias Macey:
Circling back around to data contracts, what are some of the types of guarantees and requirements that you can enforce using that mechanism? And what are some of the examples of things that you can't logically represent in the construct of a data contract?
[00:11:53] Tom Baeyens:
Yep. In terms of what you can enforce: the contract is the API for data, so let's look at that first, before we dive into what you can check. API for data means, as I touched on earlier, that a long data pipeline might consist of several components, which are currently a bit blurry. I don't think a lot of teams have great awareness of where one component stops and the next one starts. To me, that is one of the biggest advances that data contracts bring into the data space, because they demarcate the componentization of your pipelines.
It's all those datasets at the handover between one component and another, or between one team and another. Those are justified to represent as an API, because the previous component or team delivers a table with new data in it, and the next team or component is going to use it. A dbt transformation might have a certain dataset as input, and there are certain assumptions about it: what does the schema look like, what are the uniqueness properties if joins are being done, and so on. The principle of encapsulation from software is crucial here, and it's missing in the data space. Encapsulation means you don't need to know all the internals of the previous component. The only thing you really need to know is that table: what can I rely on in terms of schema and all the other properties that I use? That's what you describe in a contract. So the dataset is a tabular data structure, typically a table or a view.
And then you have the schema that goes with it and all the other data quality properties. That's what you document in a contract, which is very similar to an OpenAPI or GraphQL description of services; in this case, the context is a table that you're going to consume over a SQL connection, by firing SQL at it. That encapsulation is key. The second part of the question is: what guarantees do you want to enforce as you process a new batch of data? Typically, as I mentioned, the schema is something you definitely want to test, because when you use the data you'll use the columns, so the naming and the data types need to match. But also missing values, validity, uniqueness, and referential constraints. Everything you can test on that new batch of data is what you want to cover as part of your contract enforcement.
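A hedged sketch of those categories of guarantees: an invented, simplified contract structure plus a tiny enforcement loop. Soda's actual contract language is YAML-based and richer than this; the dict below only illustrates schema, missing-value, validity, and uniqueness checks (referential constraints would need a second dataset and are omitted for brevity).

```python
# Invented contract for one handover dataset; not Soda's real syntax.
contract = {
    "dataset": "dim_customer",
    "columns": {
        "customer_id": {"type": int, "required": True, "unique": True},
        "email":       {"type": str, "required": True},
        "country":     {"type": str, "valid_values": {"BE", "NL", "US"}},
    },
}

def enforce(contract, rows):
    failures = []
    for name, rules in contract["columns"].items():
        values = [r.get(name) for r in rows]
        present = [v for v in values if v is not None]
        if any(not isinstance(v, rules["type"]) for v in present):
            failures.append(f"{name}: wrong data type")          # schema
        if rules.get("required") and len(present) < len(values):
            failures.append(f"{name}: missing values")           # completeness
        if rules.get("unique") and len(present) != len(set(present)):
            failures.append(f"{name}: duplicate values")         # uniqueness
        allowed = rules.get("valid_values")
        if allowed and any(v not in allowed for v in present):
            failures.append(f"{name}: value outside {allowed}")  # validity
    return failures

rows = [{"customer_id": 1, "email": "a@b.com", "country": "BE"}]
print(enforce(contract, rows))   # [] means the batch honors the contract
```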
[00:15:07] Tobias Macey:
In that flow of "I have a new batch of data and I'm applying all of these tests to it," one of the challenges is: do you apply those tests before you process the data, or after? After you've run the processing, you want to make sure your transformations are correct, but the data has already landed in the destination. If it then fails the tests, you want to prevent it from being used downstream, whether in a business intelligence report or by other consumers. I'm curious if you can talk to some of the ways you think about pre-tests versus post-tests, and how to control the propagation of data once a batch fails its tests.
[00:16:43] Tom Baeyens:
Yeah, that makes a lot of sense too. In terms of when to test what, there's a trade-off. Usually, if your pipeline wasn't built with this in mind from the start, you just append new data to an incremental dataset. And this is where the notion of releasing new data comes into play: once you've added it to the incremental table, you've published it. You cannot retract it; a consumer might have just run a query and already consumed the new information. So if you test after adding it to the incremental table, there's a potential that you've released data without testing.
That's the risk. But this approach is usually easy to adopt, because you can apply a filter in a contract, for instance, or in your data testing, and you don't need to change your pipeline code. That's how you can start with contracts: by layering them on top of your existing architecture. But you'll only get notifications; you won't get proper circuit breaking. If you really want circuit breaking, you need a CI/CD-style approach: land your new data in a separate table, run your contract checks there, and only when they succeed append the data to the incremental dataset. That requires a bit of work. If you do it from the start, it's actually quite easy; if you're retrofitting, it's going to take some work. And often it's okay to just test and signal the problem when the data lands on the incremental dataset.
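A minimal circuit-breaker sketch of the staging-then-promote pattern Tom describes, using sqlite3 from the Python standard library so it runs anywhere. Table and column names are invented for the example; a real warehouse would use its own staging mechanism.

```python
# Land the batch in a staging table, run the contract checks there,
# and only append to the incremental table if they pass.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")
conn.execute("CREATE TABLE orders_staging (order_id INTEGER, amount REAL)")

def publish_batch(batch):
    conn.execute("DELETE FROM orders_staging")
    conn.executemany("INSERT INTO orders_staging VALUES (?, ?)", batch)
    # Contract checks run against staging, before anything is released.
    dupes, = conn.execute(
        "SELECT COUNT(*) - COUNT(DISTINCT order_id) FROM orders_staging"
    ).fetchone()
    negatives, = conn.execute(
        "SELECT COUNT(*) FROM orders_staging WHERE amount < 0"
    ).fetchone()
    if dupes or negatives:
        # Circuit breaks: consumers never see the bad batch.
        raise ValueError(f"contract failed: {dupes} dupes, {negatives} bad amounts")
    conn.execute("INSERT INTO orders SELECT * FROM orders_staging")
    conn.commit()

publish_batch([(1, 19.99), (2, 5.00)])   # passes, so the data is released
```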
[00:18:31] Tobias Macey:
And, fortunately, the capabilities of the underlying storage and query engines are evolving to the point where it's easier to make these changes and test them before you publish. I'm thinking of things like the Nessie project for Iceberg tables, lakeFS for general data lake approaches, and zero-copy cloning in Snowflake, where you can make a copy of the table, make your changes, test them, and then publish them back. So it's becoming possible and easier, but it also depends on what you're actually using as your underlying substrate.
[00:19:07] Tom Baeyens:
Definitely. Yeah. I agree there.
[00:19:09] Tobias Macey:
As far as the implementation of these data contracts, and how to think about how the contracts are defined and who defines them: unit testing was associated with the overall DevOps trend of shift-left, where you want to move everything as early in the process as possible. In data, a lot of that shift-left means bringing in application teams so they can notify you when they're making changes to the underlying sources you're pulling data from. I'm curious how you think about the responsibility for and application of data contracts, and how that fits into the technical and organizational structure of a business.
[00:19:50] Tom Baeyens:
Okay, yep, cool. Let me start all the way from the consumers and work our way backwards toward the source data, because I think that story shows the power of contracts. Imagine you're building a simple report using, say, three data tables. If you want that data tested, you want a contract on it. As a consumer, you can start by saying: these are my tables, this is the contract I want verified, and I want to be notified if it fails. But as a consumer, it's pretty impractical to manage those contracts, because you don't have the power to change the pipeline that produces the data. That belongs to the engineering teams, the data producers.
So a contract, just like an OpenAPI description, is not something the clients should maintain. It's the team producing the data. You always want to hand the contract over to the producing team; it's an integral part of the software producing the data, and it's the description of the interface. In the end, the consumers want these contracts. They want to know about them because they describe how to use the data; it's all the metadata describing the data you can use in your data products. But you want the data producers to take ownership. Now, in the transformations leading up to the refined data that goes into the reports, the producers have probably used input data, either from the extraction or from a previous transformation. The producers are often reluctant to take on this ownership and provide guarantees in the contract, because they rely on input data that they might not fully trust. And that puts pressure on their input data to have contracts as well.
If they know they have contracts on all the inputs of the last transformation, then they can actually guarantee the outputs of the refined data. That mechanism propagates all the way up to the source data, which brings a new, coarser-grained level to your data infrastructure. You shouldn't be looking at all the tables in your data warehouse; you should be looking at the handover tables between the components. If you have a component that produces something, you want its inputs also protected by a contract. And this goes all the way to the source systems, the production data. That's where something tricky happens a lot, of course, because we need the data from the production systems, right? There's a REST API around the production system which the team is fully aware of; they provide consistency, and it's managed as a product.
But then, in order to export the data to the analytics team, people break in through the backdoor and take the database tables, which were never intended as an API, and the production team didn't know. That's a hard conversation that needs to be had, because the analytics team actually needs the data. This is where contracts come in: the first user of that data can say, I'm making this initial contract, but then have the conversation with the production team. Can you take ownership? We actually use this as a product. Even if you can't give any guarantees, we're better off knowing that; put some integration tests in place so we know when it breaks, rather than just ignoring the problem. I think that's where contracts come into play in that ecosystem.
So it's the handovers, and it's pushing ownership all the way up to the source, where some hard conversations sometimes need to be had.
[00:24:00] Tobias Macey:
In that overall flow of information, starting from the consumer who says "I want these constraints to always be true, and I want to know when they're not," and pushing that down into the consumers, producers, pipelines, and applications that have relationships with that data: obviously those contracts are integrated as part of the pipeline that consumes, transforms, and produces the data. But I'm wondering if you can talk to the ripple effects these contracts have across the overall organization and its approach to data, and how you surface information about the contracts, particularly when they're failing. Somebody relying on a dataset should be able to know: I'm looking at this dashboard, but I can't trust it, because this check failed. How do they even get to that information, and how do you surface it in a way that builds trust, rather than detracting from it with a sense of "the data's always broken, I can never trust it"?
[00:25:09] Tom Baeyens:
Yeah, that makes a lot of sense. The key thing here is that we need to get out of the situation where data breaks so often; the pyromaniac is still out there. If you start doing contracts, I think you'll see fewer issues overall. But of course, we're very early in that journey. So how do they surface? The first thing to realize, and I want to reiterate what I just said, is that nowadays people look at the whole warehouse and see a gazillion datasets, tables or views, where data is stored. A real value that contracts bring is that you identify the datasets that someone takes ownership of.
Once you have those datasets, the ones with a contract are the ones where someone says, "I stand by this dataset," which pushes down all the other datasets that are less relevant. That alone is a key property when pushing information to a catalog or data discovery tool, because the data discovery tool is where consumers find datasets. As contracts are gradually adopted, consumers will see: oh, this dataset is governed by a contract, so it's more interesting to me than some dataset that may or may not be good. That's one thing. Then, in that same data discovery tool, you may want more granular information, like how often the dataset breaks and which checks failed; that's where it pops up. But the key is not necessarily having the consumers dig into the actual data issues, because that's the debugging process, where we have more in-depth tools to find the root cause.
For consumers, the questions should just be: is it covered by a contract? Okay, then I can rely on it a lot more. And who's the owner? Who can I talk to? I think those are the key questions you want answered in your data discovery tool.
[00:27:29] Tobias Macey:
And so now bringing this from the abstract of data contracts, what they are, how you use them to the concrete example of what you're building at Soda in this data contracts tool, I'm wondering if you can talk through some of the ways that you thought through the design process and the syntax and implementation for how to actually bring these data contracts into the pipeline, into the data ecosystem, and how to ensure that they can be written and understood and maintained and not have it just become another pile of spaghetti code that nobody can understand and everybody has to debug.
[00:28:08] Tom Baeyens:
Exactly. Yep. I can definitely talk about how we got there. Early on, even before data contracts came into play, we created SodaCL as a declarative YAML language for expressing data quality checks. That validated the declarative approach of specifying quality checks in YAML; it really resonated. But it was created from the perspective of the consumer, because that's where the problems show up, and we started from there, expressing checks over multiple datasets. You can build a contract strategy on top of that declarative language if you have the background of how the organization should work in terms of producers, what ownership means, the consumers, and how to put the checks in between. But we realized there was an opportunity to align much better with the data producer world. Producers produce a set of output ports, in data mesh terminology, or output datasets of a software component, and we saw the opportunity to align the language much better with that. That's what we did. We could leverage the query engine that runs the evaluation of all the checks, which had already been built and stabilized for a long time; the only thing we had to do was tune the language to this new use case of running all the checks for a single dataset.
[00:29:55] Tobias Macey:
Another interesting aspect of this space is that the approach of testing data and building guarantees around it has been around for a while, and different tools have implemented it in different ways. There's also the space of metrics definitions, to say: I have worked through this data, these are the things you can expect from it, these are the semantics around it. dbt has its metrics and unit tests. There's the Great Expectations tool, built around making sure that data matches your expectations of what you want it to be. And there's the data contracts tool you're building in the Soda open source ecosystem.
I'm wondering if you can talk to some of the ways that you think about the areas of overlap of what you're building with some of those other tools, and in particular, some of the either emerging or nascent standards as to how to think about the definition and maintenance of these guarantees for data?
[00:30:58] Tom Baeyens:
Right. Yeah, two very different questions, but I'll tackle them one by one. First: what's the overlap with other tools? From a Soda perspective, we definitely see our product as a component in a central data stack. What we see is that customers don't have a single, uniform technology environment; we want to work across a variety of environments. That's usually, as we touched on earlier, an existing architecture with different orchestration tools. dbt might be there, but not all the transformations run through it, and there might be data flowing around outside of dbt.
So what we have is the combination: we apply the principles of data contracts, and, as we saw before, it's mostly an organizational thing, making sure you support the right workflows. We do this as a combination of observability and data testing, delivered as a package that makes sense to install on your central data infrastructure. A central data team can say: this is the tool we work with for data testing. Then all the different teams have guidance for their particular environment and can actually provide the guarantees; there's guidance on how to apply it, and it works across those different environments. That's how our perspective is a bit different. And then the second part: yes, this is a very new space.
Standards are popping up left and right, and we think that's really important. This feels like a place where a standard could really help, because lots of tools integrate with contracts. We didn't touch on that yet, but unit testing is only one aspect of contracts. Pushing metadata to data discovery tools is another use case for another tool. There's retention, there's access control; there are all kinds of aspects you can easily model in a contract. So it makes perfect sense to consider a standardization effort, and there are multiple competing ones at the moment. There's ODCS, the Open Data Contract Standard, from the Bitol project.
There's also ODPS, the Open Data Product Specification, and probably a few others. We keep a very close eye on those and help wherever we can to push them forward, because we know this is crucial for our customers. That's the value standards can bring. The whole data landscape is still super fragmented. We'll probably see some consolidation going forward, which would be really good to have, but as long as we don't have that, all these tools need to interoperate.
And I think contracts are going to play a major role in that. So wherever we can help, we're there.
[00:34:03] Tobias Macey:
To that point of integration, and access control in particular, I think that's a very interesting application of these contracts: to say, I guarantee that this set of data is only accessible by people who have these roles. But there isn't really any cohesive standard for how to apply access controls across different data tools; that's one of the problems I run into constantly in my own work. I'm wondering if you can talk to the types of integrations you're thinking about building, or have already built, for these data contract specifications, the areas of the data ecosystem that are in good shape for pushing these types of guarantees down into other layers, and the areas where you're seeing gaps in how to approach that integration and enforcement.
[00:34:57] Tom Baeyens:
Okay, let me run through them one by one. If you start with a contract and what you can do with it, there are multiple use cases, and it's really important to distinguish them. One use case is being the central source of metadata for your data, or better said, the system of record for your metadata: you have your YAML file, you describe your schema, your types, and so on, and all the information for consuming the dataset is in that file. It's managed by the producers, because they control the actual data production, so they should also control the description. Pushing that to data discovery, as we said, is one use case.
That's one tool consuming that YAML file, extracting a portion of the information, and displaying it to consumers. The second use case, the one we focus on, is the testing: as you specify the schema and the data quality properties, we can extract those and run checks to see whether the data really matches, so that part is covered with checks. And there may be others. The next one is where we see companies build their own custom workflows, using a couple of properties they add here and there, which could be around ownership or PII, what that means for them, and where they want to enforce and use it. They build the tooling and software logic in their workflows that leverage that information.
There's more you could specify, like retention and access management. In that sense, contracts become the configuration files for the tools in your data stack; that's another way of looking at it. The challenge going forward is that more tools will adopt the contract, which is already a huge benefit versus having all this logic spread across all the tools; having it centrally managed in a single file is a huge benefit. But as an engineer, if you change this file a couple of years later, how do you know which tools a given property will impact? If you flip uniqueness from false to true, or you declare a column unique where nothing was specified, does that imply a unit test will run and check uniqueness over the full dataset? You might do that without being aware of it. You just wanted the data catalog or data discovery use case, telling your users the column is unique, without realizing you're also going to test for it. That's something we take into the design of the language: making sure there's a clear distinction between what you're going to test and what you're going to publish. So that's the challenge I see: as more tools get configured in this contract, which is valuable, how do we make sure engineers stay in charge?
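A hedged sketch of that test-versus-publish distinction: one contract file carries properties for several tools, and each property says explicitly whether it's merely published to a catalog or also enforced as a check. The structure and key names below are invented for illustration, not Soda's actual language.

```python
# Invented contract structure: each entry declares which tools consume
# it. "publish" pushes metadata to a catalog; "enforce" runs a check.
contract = {
    "dataset": "dim_customer",
    "owner": "growth-team@example.com",        # catalog only
    "retention_days": 365,                     # read by a retention job
    "columns": {
        "customer_id": {"unique": {"publish": True, "enforce": True}},
        "email":       {"pii": {"publish": True, "enforce": False}},
    },
}

def enforced_checks(contract):
    """List only the properties the engineer opted in to testing."""
    checks = []
    for col, props in contract["columns"].items():
        for prop, modes in props.items():
            if isinstance(modes, dict) and modes.get("enforce"):
                checks.append(f"{col}: {prop}")
    return checks

print(enforced_checks(contract))   # ['customer_id: unique']
```

The explicit per-property flags are one way to keep the engineer in charge: nothing gets tested that wasn't deliberately opted in, even as more tools read the same file.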
[00:38:02] Tobias Macey:
In terms of the workflow of bringing these data contracts into your ecosystem, I'm wondering if you can talk to some of the types of questions you're seeing engineering teams come up with, the ways they're thinking about how and where to apply these data contracts, the ways they've been able to benefit by paying down complexity, and also some of the ways they're running into issues, where they want the data contract to do something and it's not flexible enough, or they don't yet understand the limitations of what they can guarantee.
[00:38:41] Tom Baeyens:
Yeah, more on the other side: I saw some very interesting use cases of data contracts that I didn't expect initially, which broadened my mind a little bit. We started off with the API for data, right? We have this metadata, the data contract describing the schema and all of that. Then someone asked for a little tool I didn't expect: generating the DDL, the CREATE TABLE statement, from the contract. While that was just a normal feature request, it started to trickle down: what does this mean? It could mean that the data contract becomes the control plane for your warehouse, where you don't start from the DDL and work your way to the contract, but the other way around.
The engineer starts by building the contract first, building that metadata, and then just does an apply, where the tool calculates the difference and runs it. I think that was a powerful metaphor, an insight that says maybe we're going in that direction, because we're actually fixing a limitation of the warehouses, of the storage layer. We're fixing the fact that storage layers have very limited metadata: they only have column name and data type, and that's it. Whereas many of the workflows in a data environment require a lot more metadata.
Just like document systems in the past: it's not about the documents; you can only do proper workflows if you associate metadata with your documents. It's the same here. You can only do interesting workflows and automate things in your organization if you have metadata together with your datasets. So that was an interesting way of seeing this applied.
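The DDL-generation idea is easy to sketch: derive a CREATE TABLE statement from the contract so the contract, not the warehouse, is the source of truth. The contract shape and the type mapping below are invented for the example; a real tool would also diff against the live schema and emit ALTER statements.

```python
# Invented example: the contract drives the warehouse, not the other
# way around. This only emits the initial DDL.
TYPE_MAP = {"int": "INTEGER", "str": "VARCHAR", "float": "DOUBLE PRECISION"}

contract = {
    "dataset": "dim_customer",
    "columns": {
        "customer_id": {"type": "int", "required": True},
        "email":       {"type": "str", "required": True},
        "country":     {"type": "str", "required": False},
    },
}

def to_ddl(contract):
    cols = [
        f"  {name} {TYPE_MAP[spec['type']]}"
        + (" NOT NULL" if spec.get("required") else "")
        for name, spec in contract["columns"].items()
    ]
    return f"CREATE TABLE {contract['dataset']} (\n" + ",\n".join(cols) + "\n)"

print(to_ddl(contract))
```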
[00:40:42] Tobias Macey:
And in your work of building these data contract interfaces, thinking through how they fit into an organization's data ecosystem, and building the tooling around them, what are some of the most interesting or unexpected or challenging lessons that you learned in the process?
[00:40:59] Tom Baeyens:
Yeah, one of the more interesting things we came across was applying generative AI to this. We totally didn't expect it, or at least I didn't; I was initially a little bit skeptical, and then someone said, just give it a try. We opened up a prompt and taught it a little bit about what Soda was. Then I thought: can I use it as a contract generator? So in the prompt I added the CREATE TABLE statement, because we have it; you can extract it from the warehouse. And I added sample data by pasting in a couple of INSERT statements, without really explaining them.
I asked it to generate a contract for me, and it was pretty impressive. It came back with a full working contract. There was a student dataset with a GPA column in it, nothing more. It figured out that it could create a min and max check on the GPA: all the values in the sample were between 3.2 and 3.7, and it created a check with minimum value 0 and maximum 4. From the acronym GPA and from its network, it deduced that this must be a GPA score and applied the appropriate data quality check. That was really impressive. Then I thought, this looks good, let me try to run it.
There was actually a problem, because the CREATE TABLE was in Postgres and declared a VARCHAR type, so the assistant took that as the data type in the contract. But if you ask Postgres's metadata for the type, you get "character varying". That was the only error. So I tried to run it, a whole bunch of logs came out, and somewhere it said: your contract check failed because the data type doesn't match. And I thought, why not try it? I copied and pasted the whole log into the prompt, saying: this doesn't seem to run, can you fix it? And the assistant came back with a fixed contract that actually worked, with a good balance of which checks to apply and which not. I was totally impressed with that, and it made me look ahead: wow, this is really how generative AI can change the interface of a system. You used to do this in an IDE, editing YAML, or XML back in the days; now you have a conversation. Can you update my contract?
This is what I want; can you fix it? That was impressive to see, so we quickly added it to the product.
[00:43:50] Tobias Macey:
For people who are interested in the capabilities we've been discussing and are looking to improve the overall reliability of their data platform, what are the contexts in which a data contract is the wrong choice?
[00:44:07] Tom Baeyens:
That's a hard one, actually. If I had to guess: if you anticipate that your analytical data does not change and remains stable, then you might not need it. It's like asking when you don't need unit testing in software engineering, which is also pretty hard to answer. If you're starting a new software project and it's a script for personal use that you run three times a year, maybe not even again after today, then it's not justified to write unit tests for it. Same with your data: if it's a one-off, don't bother. But if it's part of your data infrastructure and you're building enterprise workflows on top of it, I don't think you should avoid this at all. It's going to be hard, and I agree it's early; this is not common practice yet. But I definitely feel it's coming, and I expect that within five years, no one will start another data pipeline project without thinking about the contract first.
[00:45:17] Tobias Macey:
As you continue to work on your tooling and keep an eye on the overall ecosystem of data and how people are thinking about building guarantees around their pipelines and their analytical capacity, what are some of the things you have planned for the near to medium term, or any predictions you have about how the space might evolve?
[00:45:38] Tom Baeyens:
Yeah. At Soda, we're in early access right now, and we plan to bring this to GA later this year. We can do that quite fast because of the solid foundations of the engine we have. But in general, I think the more interesting part is to look ahead at the uptake of contracts, mostly the organizational aspects: the assumptions we have in software, how do they apply in data? Can we get the same kind of mindset, where we start to think in data pipeline components with interfaces between them, and where producer teams take ownership, because they're currently often missing in action?
That principle is what I think will be very interesting to watch play out. Is it going to be a lengthy process, like in software, where it took maybe ten years before unit testing was widely adopted? But now we have that example, so it's much easier to explain in software terms, and it's also quite easy to see that you need it. So I think it's going to be a lot faster. But is it two years? Is it going to be seven, or something in between? That's going to be interesting to watch, from my perspective.
[00:46:55] Tobias Macey:
Are there any other aspects of this space of data contracts, either conceptually or in terms of your implementation at Soda, that we didn't discuss yet that you'd like to cover before we close out the show?
[00:47:09] Tom Baeyens:
No. I guess the biggest thing I'd like to see happening is a more integrated data platform. There are currently a gazillion tools that each do a small part, and you feel it would be much easier if they were better integrated. I think that's going to happen at some point; the bigger players are thinking about this, of course. To me, that's what makes the times ahead very interesting: how is this all going to be consolidated, so that you get a complete data infrastructure platform as a single product, rather than having to stitch together all the different tools, as is unfortunately the situation we're in right now?
[00:47:59] Tobias Macey:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And I just want to thank you again for taking the time today to join me and share the work that you and your team are doing on data contracts, your perspective on the role they play in an organization's data ecosystem, and the ways they can help build greater confidence in and reusability of data. So thank you again for that, and I hope you enjoy the rest of your day.
[00:48:29] Tom Baeyens:
Thank you, Tobias. It was a super pleasure to be here, and thank you for helping the data ecosystem by sharing all this knowledge. That's very much appreciated.
[00:48:46] Tobias Macey:
Thank you for listening, and don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering Podcast is your guide to the fast-moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Episode Overview
Guest Introduction: Tom Baeyens
Tom's Journey into Data Engineering
Understanding Data Contracts
Data Observability vs Data Contracts
Testing in Data Pipelines
Challenges in Data Testing
Guarantees and Limitations of Data Contracts
Implementing Data Contracts
Impact of Data Contracts on Organizations
Soda's Approach to Data Contracts
Integration and Access Control
Workflow and Benefits of Data Contracts
Lessons Learned in Building Data Contracts
Future of Data Contracts and Predictions
Closing Remarks