Summary
Data contracts are both an enforcement mechanism for data quality, and a promise to downstream consumers. In this episode Tom Baeyens returns to discuss the purpose and scope of data contracts, emphasizing their importance in achieving reliable analytical data and preventing issues before they arise. He explains how data contracts can be used to enforce guarantees and requirements, and how they fit into the broader context of data observability and quality monitoring. The discussion also covers the challenges and benefits of implementing data contracts, the organizational impact, and the potential for standardization in the field.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
- At Outshift, the incubation engine from Cisco, they are driving innovation in AI, cloud, and quantum technologies with the powerful combination of enterprise strength and startup agility. Their latest innovation for the AI ecosystem is Motific, addressing a critical gap in going from prototype to production with generative AI. Motific is your vendor and model-agnostic platform for building safe, trustworthy, and cost-effective generative AI solutions in days instead of months. Motific provides easy integration with your organizational data, combined with advanced, customizable policy controls and observability to help ensure compliance throughout the entire process. Move beyond the constraints of traditional AI implementation and ensure your projects are launched quickly and with a firm foundation of trust and efficiency. Go to motific.ai today to learn more!
- Your host is Tobias Macey and today I'm interviewing Tom Baeyens about using data contracts to build a clearer API for your data
- Introduction
- How did you get involved in the area of data management?
- Can you describe the scope and purpose of data contracts in the context of this conversation?
- In what way(s) do they differ from data quality/data observability?
- Data contracts are also known as the API for data, can you elaborate on this?
- What are the types of guarantees and requirements that you can enforce with these data contracts?
- What are some examples of constraints or guarantees that cannot be represented in these contracts?
- Are data contracts related to the shift-left movement?
- The obvious application of data contracts are in the context of pipeline execution flows to prevent failing checks from propagating further in the data flow. What are some of the other ways that these contracts can be integrated into an organization's data ecosystem?
- How did you approach the design of the syntax and implementation for Soda's data contracts?
- Guarantees and constraints around data in different contexts have been implemented in numerous tools and systems. What are the areas of overlap with tools like dbt and Great Expectations?
- Are there any emerging standards or design patterns around data contracts/guarantees that will help encourage portability and integration across tooling/platform contexts?
- What are the most interesting, innovative, or unexpected ways that you have seen data contracts used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on data contracts at Soda?
- When are data contracts the wrong choice?
- What do you have planned for the future of data contracts?
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- Soda
- Podcast Episode
- JBoss
- Data Contract
- Airflow
- Unit Testing
- Integration Testing
- OpenAPI
- GraphQL
- Circuit Breaker Pattern
- SodaCL
- Soda Data Contracts
- Data Mesh
- Great Expectations
- dbt Unit Tests
- Open Data Contracts
- ODCS == Open Data Contract Standard
- ODPS == Open Data Product Specification
[00:00:11] Tobias Macey:
Hello, and welcome to the Data Engineering podcast, the show about modern data management. Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for. Starburst has complete support for all table formats, including Apache Iceberg, Hive, and Delta Lake. And Starburst is trusted by teams of all sizes, including Comcast and DoorDash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst today and get $500 in credits to try Starburst Galaxy, the easiest and fastest way to get started using Trino.
Your host is Tobias Macey, and today I'd like to welcome back Tom Baeyens to talk about using data contracts to build a clearer API for your data. So, Tom, can you start by introducing yourself for anybody who hasn't heard your past appearances?
[00:01:04] Tom Baeyens:
Sure. Yeah. I'm Tom, CTO and cofounder of Soda. I started off in the software engineering space, building workflow engines in open source at JBoss and Red Hat, creating open source brand names like jBPM and Activiti. Then I moved into data and cofounded Soda together with Maarten, because we saw that data quality was becoming a massive problem. And what we are doing now in data has some similarity with what I did in the past, in the sense that it's open source declarative languages and engines that we build.
[00:01:41] Tobias Macey:
And do you remember how you got started working in the data space and why it is that you've decided to spend so much of your time and energy focused on it?
[00:01:50] Tom Baeyens:
Yeah. Sure. It was just the excitement of what was going on back then in data. I was working in process management workflows for a long time, and that was a promising environment for a very long time. But I saw that in data, things were moving fast and happening. I had some ideas about it, and there were some similarities, but it was really an exciting time, so I joined. And indeed, the background in software really helps me to figure out what's going on in data and how we can improve that landscape.
[00:02:28] Tobias Macey:
Now in terms of the topic today for data contracts, I'm wondering if you can give some sense about the scope and purpose that data contracts have in the context of this conversation and in the data space in general.
[00:02:44] Tom Baeyens:
Sure. The purpose of data contracts is actually to achieve more reliable analytical data. Analytical data has a notoriously bad reputation because it breaks regularly. And on the other hand, the potential usage of data in applications has recently been growing fast. There was reporting, of course, but now there are recommendation engines, even pricing algorithms. So imagine a hotel website which puts pricing out there, and there's faulty data feeding into the pricing algorithm. The software will keep working, but the revenue will go down if your prices are bad. These data algorithms can only work properly if the data is reliable.
And that's the problem we're tackling with contracts, where they play an important part. Analytical data pipelines are in themselves a huge integration problem, and the pipelines themselves are very brittle. Data contracts are in fact a new approach to data testing that goes broader than the technology itself; we'll touch on that. It applies the same principles as unit testing in software engineering. In software, when code changes, you need to rerun your full test suite to regain trust in your changed software. Similarly with data, each time new data is produced, you need to test it to keep the trust.
[00:04:15] Tobias Macey:
From that perspective of trust and ensuring the correctness and quality of data, there has been several years worth of momentum building up around the idea of data observability and data quality monitoring. And I'm curious if you can give some sense about the ways that those concepts overlap with or maybe even contend with the idea of data contracts.
[00:04:39] Tom Baeyens:
Yeah, sure. And that's a great question, because observability has been getting a lot of attention in the last few years. It's similar to Datadog and New Relic monitoring your applications. Right? This is all about creating visibility into your data warehouse. It simplifies the diagnostic process: when there's a potential data issue that needs to be investigated, observability helps you diagnose it. And this is the reactive part. After the fact, you go look, and the visibility helps you find the problems. But while this actually helps and is an important ingredient, it's only putting out the fire while the pyromaniac is still out there. Data testing has the goal of stopping the pyromaniac.
And this is where I feel there's way too much focus on observability alone, and data testing has not gotten the proper attention yet. I think that will surely come soon, or is coming, actually. Data testing is all about making sure that the data is as expected across the various handover points in your data pipeline. This is really the preventing part. So just like in software, it's not that observability is better than testing; you need both observability and testing. That, I think, is the key to comparing the two. And data contracts now become an important approach in the data testing part, one that goes further than just the technology.
[00:06:22] Tobias Macey:
On that note of testing in the software application space, there has been a long history, and there are still points of contention, but it's generally agreed upon that unit tests are a good thing. And there are general patterns that have built up around how to do unit testing, how to do integration testing, what it means to do end-to-end testing, and the ratios of those different types. In data, there's been a lot of conversation recently about that idea of bringing unit tests to data, but obviously there's another dimension to it that makes it more complicated. And I'm curious if you can talk to the ways that these concepts of unit testing in the data space compare to the purpose of data contracts, and maybe some of the ways that teams should be thinking about the appropriate ratios of data unit tests, data contracts, and the role of observability as a perspective on top of those.
[00:07:18] Tom Baeyens:
Yeah. In terms of the link between the various forms of testing that you have in software: I'm not sure to what extent there's absolute consensus in the engineering world on when you need integration testing versus unit testing and how much of each. But the principle itself is generally accepted. If you don't do automated testing on a broad scale, both at the integration level and at the unit level, you will end up in a situation where you don't trust your new release. And I think that definitely transfers to data.
If you don't test anything, you will lose faith in your data. And you will have a problem in the boardroom: you see the graph going down, and you wonder, is this bad data, or is this actually our business going down? If you don't trust that data, then it's actually useless, and your whole investment goes down. Now, one level deeper, I'm not sure if all the analogies hold, but I don't think the principle of testing is as adopted in data yet as it should be. So I think for the majority of situations, we have to start by creating awareness around what a data component in a pipeline is. Currently, pipelines run start to end. What is a component, and what is the handover point? Where do you apply the tests? I think that's where we should start before digging any deeper into what to call those types of tests.
[00:09:07] Tobias Macey:
The other interesting wrinkle that comes into play when you're thinking about testing for your data is that when you're dealing with an application, you have the idea of: I need to run it through the test suite in the CI/CD process before it goes into production. Once it passes all of the tests, I have pretty good confidence that everything is going to work when I deploy. With data, you have some measure of confidence in that you can test the business logic around your transformations, your extract and load, etcetera. But the problem that always comes up with data, and when you're talking to people about testing for data, is that data changes, and you don't necessarily have full control or even visibility into when or how that data is going to change. And so you can't just say, okay, I'm gonna run it through my set of tests, I'm gonna put it into production, and everything's great. And I'm wondering how that also factors into the way that you think about testing and validation, and what an integration test and an end-to-end test mean in the context of a data pipeline or a data flow.
[00:10:09] Tom Baeyens:
Yeah, that's actually a great point, because I see many people struggling with that notion. I think there are two key events that you need to separate. One is when code changes: there is a potential for the software to break. You have pipeline code, you change it, that software might break, and it ends up in bad data applications or in bad data. This is the CI/CD pipeline of your data pipeline, basically. So when you change your transformation logic or ingestion logic, it makes perfect sense to test that, to have sample data that you run it on. But the tricky part, which took me a while to figure out, is the data as it is in production. Imagine a daily batch job: your Airflow runs on a daily schedule.
The data passes through. There's no code change every day, and yet the data might break at some point without the code being changed. So every batch of data, you could consider as a new release of data, which has to be tested just the same. And I think that's the analogy there. In CI/CD, you test your code changes. In the production pipeline, you test your data changes, because that's also a new release of data.
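To make that concrete, here is a minimal sketch of the "every batch is a new release" idea: the same checks run on every scheduled load, even when no code has changed. The table, column, and checks here are hypothetical, and SQLite stands in for a real warehouse.

```python
import sqlite3

def validate_batch(conn: sqlite3.Connection, table: str) -> list[str]:
    """Run the same contract checks against every freshly landed batch."""
    failures = []
    # Check 1: the batch must not be empty.
    rows = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    if rows == 0:
        failures.append(f"{table}: batch is empty")
    # Check 2: a required column must have no missing values.
    nulls = conn.execute(
        f"SELECT COUNT(*) FROM {table} WHERE customer_id IS NULL"
    ).fetchone()[0]
    if nulls:
        failures.append(f"{table}: {nulls} rows with NULL customer_id")
    return failures

# In a daily Airflow-style job this would run after ingestion on every
# schedule, because the data changed even though the code did not.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders_batch (customer_id INTEGER, amount REAL)")
conn.execute("INSERT INTO orders_batch VALUES (1, 9.99), (NULL, 5.00)")
print(validate_batch(conn, "orders_batch"))  # one NULL failure reported
```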
[00:11:36] Tobias Macey:
Circling back around to data contracts, what are some of the types of guarantees and requirements that you can enforce using that mechanism? And what are some of the examples of things that you can't logically represent in the construct of a data contract?
[00:11:53] Tom Baeyens:
Yep. So in terms of what you can enforce: the contract is the API for data. Let's look at that first before we dive into what you can check. API for data means, as I touched on a bit earlier, that a long data pipeline might consist of several components, which are currently a bit blurry. Right? I don't think a lot of teams actually have great awareness of where one component stops and the next one starts. And that, to me, is one of the biggest advances that data contracts bring into the data space, because it demarcates the componentization of your pipelines.
It's all those datasets that are the handover from one component to another, or from one team to another. Those are justified to represent as an API, because the previous component or team delivers a table with new data in it, and the next team or component is gonna use it. The dbt transformation might have a certain dataset as input, and they have certain assumptions about it: what does the schema look like, what are the uniqueness properties if joins are being done, and so on. The principle of encapsulation in software, I think, is crucial here and is missing in the data space. Encapsulation means that from that previous component, you don't need to know all its internals. The only thing you really need to know is that table: what can I rely on in terms of schema and all the other properties that I use? And this is what you describe in a contract. So the dataset is a tabular data structure, typically a table or a view.
And then you have the schema that goes with it, and then all of the other data quality properties. That's what you document in a contract, which is very similar to an OpenAPI or GraphQL description of services. But in this case, the context is that it's a table that you're gonna consume over a SQL connection, firing SQL at it. So that encapsulation in itself is key. And the second part of that question is: what are the guarantees that you want to enforce as you're processing a new batch of data? Typically things like the schema, which you for sure wanna test, because as you use that data, you're most likely gonna use the columns, and the naming and the data types need to match. But also missing values, validity, uniqueness, referential constraints: everything that you can test on that new batch of data is what you want to check as part of your contract enforcement.
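As an illustration of that idea, here is a hedged sketch of a contract as "the API for a table": a producer-owned description of the schema that a physical dataset can be verified against. The dict stands in for what would normally live in a YAML contract file, and the field names are illustrative rather than any particular standard's syntax.

```python
import sqlite3

# The contract describes the handover table: schema plus quality guarantees.
contract = {
    "dataset": "dim_customer",
    "columns": [
        {"name": "customer_id", "type": "INTEGER", "unique": True, "not_null": True},
        {"name": "email", "type": "TEXT", "not_null": True},
    ],
}

def verify_schema(conn, contract) -> list[str]:
    """Compare the physical table's schema to the contracted one."""
    actual = {
        row[1]: row[2]  # column name -> declared type
        for row in conn.execute(f"PRAGMA table_info({contract['dataset']})")
    }
    problems = []
    for col in contract["columns"]:
        if col["name"] not in actual:
            problems.append(f"missing column {col['name']}")
        elif actual[col["name"]].upper() != col["type"]:
            problems.append(
                f"{col['name']}: expected {col['type']}, found {actual[col['name']]}"
            )
    return problems

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_customer (customer_id INTEGER, email TEXT)")
print(verify_schema(conn, contract))  # -> [] when the schema matches
```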
[00:15:07] Tobias Macey:
At Outshift, the incubation engine from Cisco, they are driving innovation in AI, cloud, and quantum technologies with the powerful combination of enterprise strength and startup agility. Their latest innovation for the AI ecosystem is Motific, addressing a critical gap in going from prototype to production with generative AI. Motific is your vendor and model-agnostic platform for building safe, trustworthy, and cost-effective generative AI solutions in days instead of months. Motific provides easy integration with your organizational data, combined with advanced, customizable policy controls and observability to help ensure compliance throughout the entire process. Move beyond the constraints of traditional AI implementation and ensure your projects are launched quickly and with a firm foundation of trust and efficiency.
Go to motific.ai today to learn more. In that flow of: I have a new batch of data, and I'm applying all of these tests to it, one of the challenges that comes up is whether you apply those tests before you actually process all of the data or after. You want to make sure that your transformations are correct, but if you've already landed the data in the destination and it fails the tests, you want to prevent it from actually being used, either in a business intelligence report or by downstream use cases. I'm curious if you can talk to some of the ways that you think about pre-tests and post-tests, and how to control the propagation of data once it fails a certain batch of tests.
[00:16:43] Tom Baeyens:
Yeah, that makes a lot of sense too. In terms of when to test what, there's a trade-off here. Usually, if your pipeline is not built with this in mind from the start, you just append new data to a certain incremental dataset. And this is where the notion of releasing new data comes into play, because once you're adding it to the incremental table, it actually means you've published it. You cannot retract it: the consumer might have just run a query and already consumed this new information. So if you're gonna test it after adding it to the incremental table, then there is a potential that you've released it without testing.
So that's the risk. But this usually is easy to do, because you can easily apply a filter in a contract, for instance, or in your data testing, and then you don't need to change your pipeline code. So this is how we say you can start with contracts by layering them on top of your existing architecture, but the notifications alone won't give you proper circuit breaking. If you really want proper circuit breaking, then you need a CI/CD-like approach, which is: you land your new data in a separate table, for instance, and run your contract checks on there. And only when that succeeds do you append it to the incremental dataset. It requires a bit of work. If you do this from the start, it's actually quite easy; if you retrofit it, then it's gonna take some work. And then usually it's okay to just test the data and signal the problem when it lands on the incremental dataset.
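A minimal sketch of that circuit-breaking pattern might look like the following: land the batch in a staging table, run the contract checks there, and only append to the incremental table when everything passes. The table names and the single check are illustrative.

```python
import sqlite3

def publish_batch(conn: sqlite3.Connection) -> bool:
    """Append staging data to the incremental table only if the checks pass."""
    bad = conn.execute(
        "SELECT COUNT(*) FROM orders_staging WHERE amount < 0"
    ).fetchone()[0]
    if bad:
        # Circuit breaks: the batch stays in staging, downstream is untouched.
        print(f"blocked: {bad} rows failed the amount >= 0 check")
        return False
    conn.execute("INSERT INTO orders SELECT * FROM orders_staging")
    conn.execute("DELETE FROM orders_staging")
    conn.commit()
    return True

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")
conn.execute("CREATE TABLE orders_staging (order_id INTEGER, amount REAL)")
conn.execute("INSERT INTO orders_staging VALUES (1, 10.0), (2, -3.0)")
publish_batch(conn)  # blocked; fix upstream, rerun, then the data publishes
```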
[00:18:31] Tobias Macey:
And, fortunately, the capabilities of the underlying storage and query engines are evolving to a point where it improves the ability to make these changes and test them before you publish. I'm thinking of things like the Nessie project for Iceberg tables, lakeFS for general data lake approaches, and zero-copy cloning for Snowflake, where you can make a copy of the table, make those changes, test them, and then publish them back. So it's becoming possible and easier, but it also depends on what you're actually using as your underlying substrate.
[00:19:07] Tom Baeyens:
Definitely. Yeah. I agree there.
[00:19:09] Tobias Macey:
As far as the implementation of these data contracts and the ways to think about how the contracts are defined and who defines them in terms of unit testing, that was associated with the overall DevOps trend of shift left where you wanna move everything as early in the process as possible. And in data, a lot of that shift left means that you have to bring in application teams so that they can notify you when they're making changes to the underlying sources that you're pulling data from. I'm curious how you think about the responsibility and application of data contracts and how that fits into the technical and organizational structure of a business.
[00:19:50] Tom Baeyens:
Okay. Yep. Let me try and start all the way from the consumers and then work our way backwards towards the source data, because this story, I think, shows the power of contracts. Imagine you're building a simple report using some data, say three tables. If you want that data to be tested, you want to create your contract on it. As a consumer, you can start by saying: these are my tables, this is the contract that I want to see verified, and I want to be notified if this doesn't hold. But as a consumer, it's pretty impractical to manage those contracts, because you don't have the ability to change the pipeline that produces the data. That's the engineering teams, the data producers.
So a contract, just like an OpenAPI description, is not something that the clients should maintain. No, it's the team producing the data. You always want to hand the contract over to the team producing the data. It's an integral part of the software producing it, and it is the description of the interface. In the end, the consumers want these contracts. They want to know about the contracts because they describe how to use the data; it's all the metadata describing the data that you can use in your data products. But then you want those data producers to take ownership. Right? Now, in the transformations that have led up to this refined data that goes into the reports, they have probably used input data, either from the extraction or from a previous transformation or whatever. And the producers are often reluctant to take on this ownership, to provide the guarantees in the contract, because they rely on input data which they might not fully trust. And so that puts pressure on their input data to have contracts as well.
If they know, oh, on the last transformation I have contracts on all of my inputs, then I can actually guarantee the outputs of the refined data. And that mechanism goes all the way up to the source data, which brings a new, coarser-grained level onto your data infrastructure. You shouldn't be looking at all the tables in your data warehouse. No, you should be looking at the handover tables between those components. If you have a component that produces something, you want to see its inputs also protected by a contract. And this goes all the way to the source systems, the production data. This is where something tricky happens a lot, of course, because initially we need the data from the production systems. Right? And there is a REST API around that production system which the team is fully aware of, and they provide consistency, and it's managed as a product.
But then, in order to export the data to the analytics team, they just break in through the back door and take the database table data, which was never intended as an API, and the team didn't know. This is a hard conversation that needs to be had, because the analytics team actually needs that data. And this is where the first user of that data could say: I'm making this initial contract, but let's have that conversation with the production team. Can you take ownership? We actually use this as a product. And we are better off knowing it if you cannot give any guarantees; at least put some integration tests in place so that we know when it's breaking, rather than just ignoring the problem. I think that's where contracts come into play in that ecosystem.
So it's the handovers, and it's pushing all the way up to the source, where some hard conversations sometimes need to be had.
[00:24:00] Tobias Macey:
In that overall flow of information starting from that consumer of, I wanna make sure that these constraints are always true. I wanna know if they're not true, pushing that down into the consumers and producers and pipelines and applications that have relationships with that data. Obviously, those contracts are integrated as part of that pipeline that consumes and transforms and produces that data for those consumers. But I'm wondering if you can talk to some of the ways that these contracts have a ripple effect across the overall organization and their approach to data and some of the ways that you surface the information about those contracts, particularly when they are failing so that somebody who is relying on that set of data can know, oh, hey. I'm looking at this dashboard, but I can't actually trust it because I see that this check failed. How do they even get to that information, and how do you try to surface that in a way that helps to build trust rather than detracting from trust of saying, oh, well, the data's always broken. I can't ever trust it.
[00:25:09] Tom Baeyens:
Yeah, that makes a lot of sense. The key thing here is that we need to get out of the situation where data breaks so often; the pyromaniac is still out there. If you start doing contracts, I think you'll see that you get fewer issues overall and less of this problem. But of course, we're very early in the journey. That's one thing. As for how this surfaces: I want to reiterate what I just said, which is that nowadays people look at the whole warehouse and see a gazillion datasets, tables or views, where data is stored. A value that contracts bring is that you identify the datasets of which someone takes ownership.
The datasets with a contract are the ones where someone says, I stand by this dataset, which pushes down all the other datasets that are not as relevant. And this is already a key property when pushing information to a catalog or a data discovery tool. The data discovery tool is where the datasets are found by the consumers. So as you're gradually adopting contracts, they're gonna see: oh, this is a dataset governed by a contract, this is a more interesting one for me than some dataset that might or might not be good. That's one thing. And then, of course, that same data discovery tool is where you may want more granular information, such as how often a dataset breaks and which checks did break. But I think the key is not necessarily having the consumers learn about the actual failures in depth, because that's the debugging process, where we have more in-depth tools to find the root cause.
For consumers, it should just be: is it covered by a contract? Okay, then I can already rely on it a lot more. And who's the owner? Who can I talk to? I think those are the key questions that you want answered in your data discovery tool.
[00:27:29] Tobias Macey:
And so now, bringing this from the abstract, what data contracts are and how you use them, to the concrete example of what you're building at Soda with this data contracts tool: I'm wondering if you can talk through some of the ways that you thought through the design process, the syntax, and the implementation for how to actually bring these data contracts into the pipeline and into the data ecosystem, and how to ensure that they can be written, understood, and maintained without it just becoming another pile of spaghetti code that nobody can understand and everybody has to debug.
[00:28:08] Tom Baeyens:
Exactly. Yep. I can definitely talk a bit about how we got there. Early on, even before data contracts came into play, we created SodaCL as a declarative YAML language for expressing data quality checks. And that gave us validation of the declarative approach: specifying these quality checks in YAML really resonated. But it was created from the perspective of the consumer. The consumer is where the problems show up, and we started from there, expressing checks over multiple datasets. You can actually build a contract strategy on top of this declarative language if you really have this background of how the organization should work in terms of producers, what the ownership means, the consumers, and how to put these checks in between. But we realized there was an opportunity to align much better with the data producer world. The data producers produce a set of output ports, in data mesh terminology, or output datasets of your software component, and we saw an opportunity to align the language much better with that. That's what we did. So we could leverage the query engine that runs the evaluation of all the checks, a solution that was already built and stabilized a long time ago, and the only thing we had to do was tune the language towards this new use case of running all the checks for a single dataset.
[00:29:55] Tobias Macey:
Another interesting aspect of this space is that the approach of testing for data and building guarantees around data has been around for a while, and different tools have implemented it in different ways. There's also the space of metrics definitions to say: okay, I have worked through this data, and these are the types of things that you can expect from it; these are the semantics around it. So dbt has its metrics and unit tests. There's Great Expectations, which is built around making sure that data matches your expectations of what you want it to be. And there's the tool that you're building for data contracts in the Soda open source ecosystem.
I'm wondering if you can talk to some of the ways that you think about the areas of overlap of what you're building with some of those other tools, and in particular, some of the either emerging or nascent standards as to how to think about the definition and maintenance of these guarantees for data?
[00:30:58] Tom Baeyens:
Right. Yeah, two very different questions, but I'll tackle them one by one. First of all, what's the overlap with other tools? From a Soda perspective, we definitely see our product as a component in a central data stack. We don't assume a single technology that works in one environment; we want to work across a variety of environments. And that is usually, as we touched on earlier, an existing architecture with different orchestration tools, possibly dbt for the transformations as well. And there might be data floating around outside of dbt.
And so we have the combination: we apply the principles of data contracts, and, as we saw before, it's mostly an organizational thing, making sure that you support the right workflows. We do this as a combination of observability and data testing, and we deliver that as a package, so that it makes sense to install it on your central data infrastructure. As a central data team, you can say: this is the tool we're working with for data testing. And then all the different teams have guidance in their particular environment; they can actually get the guarantees, there is guidance on how to apply it, and it works across those different environments. So that's how our perspective is a bit different. And then the second part: yes, this is a very new space.
Standards are popping up left and right, and we think that's really important. It also feels like a place where a standard could really help, because there are lots of tools integrating with contracts. We didn't touch on that yet, but unit testing is one aspect of contracts; pushing metadata to data discovery tools is another use case, for another tool. There's retention; there's access control. There are all these different aspects that you can model easily in a contract. So it makes perfect sense to consider a standardization effort. There are multiple competing ones at the moment. There's ODCS, the Open Data Contract Standard, known through the Bitol project.
There's ODPS, the Open Data Product Specification, and there are probably a few others as well. So we keep a very close eye on those and help wherever we can to push them forward, because we know that for our customers this is crucial. This is the value that standards can bring. The whole data landscape is still super fragmented. We'll probably see some consolidation going forward, which would be really good to have. But as long as we don't have that, all these tools need to interoperate.
And I think contracts are gonna play a major role in that. So wherever we can help, we're there.
[00:34:03] Tobias Macey:
To that point of integration, and access control in particular, I think that's a very interesting application of these contracts: to say, I guarantee that this set of data is only accessible by people who have these roles. But there isn't really any cohesive standard around how to actually apply access controls across different data tools; that's one of the problems that I run into constantly at my work. And I'm wondering if you can talk to some of the types of integrations that you're thinking about building, or have already built, for these data contract specifications, some of the areas in the data ecosystem that are in good shape for being able to push these types of guarantees down into other layers, and some of the areas where you're seeing gaps as far as how to actually approach that integration and enforcement of the guarantees that you want to specify.
[00:34:57] Tom Baeyens:
Okay, let me run through them one by one. If you start with a contract and what you can do with it, there are multiple use cases, and that's really important to distinguish here. One use case is being the central source of metadata for your data, the system of record, so to speak, for your metadata. You have your YAML file, you describe your schema and your types and all that, and all the information for consuming the data is in that file. It's managed by the producers, because they control the actual data production, so they should also control that description. And then pushing that to data discovery, as we said, that's one use case.
That's one tool consuming that YAML file, extracting a portion of that information, and displaying it to the consumers. The second use case, the one that we focus on, is the unit testing. As you're specifying the schema and your data quality properties, we can extract those and run checks to see if the data really matches, and that part is covered with checks. So unit testing is a second use case. And there might be others. The next one is where we see companies, the customers themselves, build their custom workflows. They use a couple of properties that they add here and there, which could be around ownership or PII information, and what it means for them, where they want to enforce and use it. They build the tooling and the software logic in their workflows that leverage this information.
And there's more, like retention and access management, that you could specify. In that sense, contracts become kind of the configuration files for the tools in your data stack; that's another way of looking at it. And I think the challenge that we're gonna face going forward is that more tools will adopt the contract, which is already a huge benefit versus having all this logic spread around over all the tools. Having it centrally managed in a single file is a huge benefit. But as an engineer, if you're gonna change this file after a couple of years, how are you gonna know which tools a given property impacts? If you change uniqueness from false to true, or you go from specifying nothing to saying this is a unique column, does that imply that you're gonna run a unit test on it and check uniqueness on the full dataset? You might do that without being aware. You just wanted the data catalog or data discovery use case, to tell your users it's unique, without realizing that you're now gonna test for it. So we take that into the design of the language, to make sure there's a clear distinction between what you're gonna test and what you're gonna publish. That's what I see as the challenge: as more tools get configured in this contract, which is valuable, how do you make sure that the engineers still stay in charge?
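Here is a small sketch of that "one contract, several tools" idea, with one possible convention (not an existing standard) for separating the properties a catalog merely displays from the ones a test runner enforces:

```python
# An illustrative contract doubling as configuration for several tools.
# The "checks" key marks what gets enforced; everything else is descriptive.
contract = {
    "dataset": "dim_customer",
    "owner": "team-crm",          # ownership: catalog / discovery use case
    "retention_days": 365,         # retention-tooling use case
    "columns": [
        {"name": "email", "pii": True,        # descriptive: catalog only
         "checks": ["not_null"]},              # enforced: test runner only
        {"name": "customer_id",
         "checks": ["not_null", "unique"]},
    ],
}

def catalog_view(contract):
    """Extract only the descriptive metadata a discovery tool would show."""
    return {
        "dataset": contract["dataset"],
        "owner": contract["owner"],
        "pii_columns": [c["name"] for c in contract["columns"] if c.get("pii")],
    }

def checks_to_run(contract):
    """Extract only the properties the test runner will enforce."""
    return [(c["name"], chk) for c in contract["columns"]
            for chk in c.get("checks", [])]

print(catalog_view(contract))
print(checks_to_run(contract))  # uniqueness here implies a real test run
```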
[00:38:02] Tobias Macey:
In terms of the workflow of bringing these data contracts into your ecosystem, I'm wondering if you can talk to some of the types of questions that you're seeing engineering teams come up with, some of the ways that they're thinking about how and where to apply these data contracts, some of the ways that they've been able to benefit from paying down complexity, but also some of the ways that they're maybe running into issues where they want the data contract to do something and it's not flexible enough, or they just don't understand the limitations of what types of things they can guarantee.
[00:38:41] Tom Baeyens:
Yeah, more on the other side: I saw some very interesting use cases of data contracts which I didn't expect initially, and which broadened my mind a little bit. We started off with the API for data. Right? We have this metadata, the data contract describing the schema and all of that. Then someone asked for a little tool that I didn't expect, and that was to generate the DDL statement, the create table statement, from the contract. And while this was just a normal feature request, it started to trickle down: what does this mean? It actually could mean that the data contract becomes the control plane for your warehouse, where you're not starting from the DDL and working your way to the contract, but the other way around.
The engineer starts by building the contract first, building that metadata, and then just does an apply, where the tool calculates the difference and runs it. So I think that was a powerful metaphor, an insight that says maybe we're going in that direction, because we're actually fixing a limitation of the warehouses, of the storage layer. We're fixing the fact that storage layers have very limited metadata. They only have column name and data type, and that's it, whereas many of the workflows in a data environment require a lot more metadata.
Just like document systems in the past: it's not only about the documents; you can only do proper workflows if you associate metadata with your documents. Here it's just the same. You can only do interesting workflows and automate things in your organization if you have your metadata together with your datasets. So that was an interesting way that we saw this being applied.
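A toy sketch of that generation direction, going from a contract to a CREATE TABLE statement, might look like this. A real tool would diff against the warehouse and emit ALTER statements; the contract fields here are illustrative.

```python
# Illustrative contract: the metadata comes first, the DDL is derived.
contract = {
    "dataset": "dim_customer",
    "columns": [
        {"name": "customer_id", "type": "INTEGER", "not_null": True},
        {"name": "email", "type": "TEXT", "not_null": True},
        {"name": "signup_date", "type": "DATE"},
    ],
}

def contract_to_ddl(contract: dict) -> str:
    """Generate a CREATE TABLE statement from the contract's schema."""
    cols = ",\n  ".join(
        f"{c['name']} {c['type']}" + (" NOT NULL" if c.get("not_null") else "")
        for c in contract["columns"]
    )
    return f"CREATE TABLE {contract['dataset']} (\n  {cols}\n);"

print(contract_to_ddl(contract))
# CREATE TABLE dim_customer (
#   customer_id INTEGER NOT NULL,
#   email TEXT NOT NULL,
#   signup_date DATE
# );
```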
[00:40:42] Tobias Macey:
And in your work of building these data contract interfaces, thinking through how they fit into an organization's data ecosystem, and building the tooling around it, what are some of the most interesting or unexpected or challenging lessons that you learned in the process?
[00:40:59] Tom Baeyens:
Yeah. One of the more interesting things we came across was applying generative AI to this. We totally didn't expect it, or at least I didn't; I was initially a little bit skeptical, and then someone said, just give it a try. We opened up a prompt and taught it a little bit about what Soda was. And then I thought, can I use it as a contract generator? So in the prompt I added the create table statement, because we have it; you can extract it from the warehouse. And I added sample data by just pasting in a couple of insert statements, without really explaining them.
And I asked, can you generate a contract for me? And it was pretty impressive. It came back with a full working contract. There was a student dataset with a GPA column in it, nothing more. It figured out that it could create a min and max check on the GPA: whereas all the values were between 3.2 and 3.7, it created the check with minimum value 0 and max 4, which it deduced from the acronym GPA. From its network, it said: okay, this must be a GPA score, and then applied the data quality check. That was really impressive. Then I thought, oh, this looks good; let me try to run it.
There was actually a problem, because the create table was in Postgres and used a VARCHAR type description, and the assistant took that as the data type in the contract. But if you ask the Postgres metadata what the type is, you get "character varying". This was the only error. So I tried to run it, a whole bunch of logs came out, and somewhere it said: your contract check failed because the data type doesn't match. And I thought, why not try it? I just copied and pasted the whole log into the prompt, saying, this doesn't seem to run, can you fix it? And the assistant came back with a fixed contract that actually worked, and that had a good balance of which checks to apply and which not. I was totally impressed with that, and it made me look ahead: wow, this is really how generative AI can change the interface of a system. You used to work in an IDE, YAML editing, or XML editing back in the day. Now you're gonna have a conversation: can you update my contract?
This is what I want; can you fix it? So that was, yeah, impressive to see. We just added that quickly to the product.
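For illustration, a contract along the lines of the one in this story might look like the following sketch, with an inferred min/max validity check on the GPA column. The field names are hypothetical rather than Soda syntax, and the comment notes the data type pitfall Tom describes.

```python
# A hypothetical, simplified version of the generated contract: a schema
# plus a validity check deduced from the GPA acronym.
contract = {
    "dataset": "students",
    "columns": [
        # Postgres metadata reports VARCHAR columns as "character varying",
        # which is what tripped up the first generated contract.
        {"name": "name", "type": "character varying"},
        {"name": "gpa", "type": "numeric", "valid_min": 0, "valid_max": 4},
    ],
}

def invalid_gpas(values, column):
    """Return the values that fall outside the contracted valid range."""
    return [v for v in values
            if not (column["valid_min"] <= v <= column["valid_max"])]

gpa_column = contract["columns"][1]
print(invalid_gpas([3.2, 3.5, 3.7, 4.6], gpa_column))  # -> [4.6]
```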
[00:43:50] Tobias Macey:
For people who are interested in the capabilities that we've been discussing and are looking to improve the overall reliability of their data platform, what are the contexts in which a data contract is the wrong choice?
[00:44:07] Tom Baeyens:
Yep. So that's a hard one, actually. If you anticipate that your analytical data does not change, remains stable, nothing changes, then you might not need it. And maybe it's the same as in software engineering: when do you not need unit testing? I guess that's also pretty hard. It's kind of like starting a new software project: it might be that you say, I have a script here that is for my personal use, I run it three times a year, maybe not even anymore after today. Then I don't think it's justified to run unit tests on it. Same with your data: if you do a one-off, don't bother. But if it's part of your data infrastructure and you're building enterprise workflows on top of it, I don't think you should be avoiding this at all. It's gonna be hard, and I agree it's early; this is not a common practice yet. But I definitely feel that it's coming, and I expect that within 5 years, no one is gonna start another data pipeline project without thinking about the contract first.
[00:45:17] Tobias Macey:
As you continue to work on your tooling and keep an eye on the overall ecosystem of data and how people are thinking about building guarantees around their pipelines and their analytical capacity, what are some of the things you have planned for the near to medium term, or any predictions that you have going forward about how the space might evolve?
[00:45:38] Tom Baeyens:
Yeah. So at Soda, we're in early access right now, and we plan to bring this to GA later this year. We can do that quite fast because of the solid foundations of the engine that we have. But in general, I think the more interesting part is to look ahead at the uptake of contracts, and mostly at the organizational aspects: the assumptions that we have in software, how do they apply in data? Can we get the same kind of vibe there, where we start to think in data pipeline components with interfaces between them, where the producer teams take ownership, because they're currently often missing in action?
That principle, for me, is what's gonna be very interesting to watch play out. Is it gonna be, like in software, a very lengthy process? There, it took maybe 10 years before unit testing was widely adopted, but now we have that example. So it's much easier to explain it now in terms of software principles, and that you need it is also quite easy to see. So I think it's gonna be a lot faster. But is it 2 years? Is it gonna be 7, or something in between? That's gonna be interesting to watch, from my perspective.
[00:46:55] Tobias Macey:
Are there any other aspects of this space of data contracts, either conceptually or in terms of your implementation at Soda that we didn't discuss yet that you'd like to cover before we close out the show?
[00:47:09] Tom Baeyens:
No. I guess the biggest thing that I would like to see happening is a more integrated data platform. There's currently a gazillion tools that each do a small part, and you feel somehow it would be much easier if they were better integrated. I think that's gonna happen at some point; the bigger players are thinking about this, of course. So that, to me, is also what makes for very interesting times ahead: how is this all gonna be consolidated, so that you get a complete data infrastructure platform as a single product, rather than having to stitch together all the different tools, as is, unfortunately, the state we are in right now?
[00:47:59] Tobias Macey:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And I just want to thank you again for taking the time today to join me and share the work that you and your team are doing on data contracts, as well as your perspective on the role that they play in an organization's data ecosystem and the applications that they have to help build greater confidence in and reusability of data. So thank you again for that, and I hope you enjoy the rest of your day.
[00:48:29] Tom Baeyens:
Thank you, Tobias. It was a super pleasure to be here, and thank you for helping the data ecosystem by sharing all this knowledge. That's very much appreciated.
[00:48:46] Tobias Macey:
Thank you for listening, and don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering Podcast is your guide to the fast-moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it: email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Hello, and welcome to the Data Engineering podcast, the show about modern data management. Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end to end data lake has platform built on Trino, the query engine Apache Iceberg was designed for. Starburst has complete support for all table formats, including Apache Iceberg, Hive, and Delta Lake. And Starburst is trusted by teams of all sizes, including Comcast and DoorDash. Want to see Starburst in action? Go to data engineering podcast.com/starburst today and get $500 in credits to try Starburst Galaxy, the easiest and fastest way to get started using Trino.
Your host is Tobias Macy, and today I'd like to welcome back Tom Byans about using data contracts to build a clearer API for your data. So, Tom, can you start by introducing yourself for anybody who hasn't heard your past appearances?
[00:01:04] Tom Baeyens:
Sure. Yeah. I'm, Tom, CTO, cofounder of Soda. Started off in the software engineering space, building workflow engines in open source at JBoss and Red Hat, creating open source brand names like, JBPM and Activity, then moved into Data, cofounded, Soda together with Martin, because we saw that, like, data quality was becoming a massive problem. And, what we are doing now in terms of data is for, like, has some similarity with what I did in the past in the sense that it's, open source declarative languages and engines that we built.
[00:01:41] Tobias Macey:
And do you remember how you got started working in the data space and why it is that you've decided to spend so much of your time and energy focused on it?
[00:01:50] Tom Baeyens:
Yeah. Sure. Sure. It was, like, just the excitement of what was going on back then in data. So I was working in in process management workflows for a long time, and that was a promising, environment, for a very long time. But I saw that in data, things were moving fast and happening. And I have some, ideas about it and there were some similar similarities, but that was really an exciting time. I joined it. And indeed now, the background in software really helps me to to find out, like, what's going on in data and how we can improve that, that landscape.
[00:02:28] Tobias Macey:
Now in terms of the topic today for data contracts, I'm wondering if you can give some sense about the scope and purpose that data contracts have in the context of this conversation and in the data space in general.
[00:02:44] Tom Baeyens:
Sure. The purpose of data contracts is actually to achieve more reliable analytical data. And analytical data on itself has been notoriously have a, it has a bad reputation because it breaks regularly. And on the other hand, recently, the potential usage for data and application is growing fast. So there was reporting, of course. There's now recommendation engines, pricing algorithms even. So imagine, a hotel website which puts pricing on there and there's, like, faulty data feeding into the pricing algorithm. The software will keep work working, but the revenue will go down if your prices are bad. So, so these data algorithms, they only can work properly if the data is reliable.
And that's, that's that's the problem we're tackling with contracts as well where it plays, an important part. Yeah. And analytical data pipelines in itself is a huge integration problem. The data pipelines themselves are very brittle. And data contracts is in fact like a new approach for data testing that goes broader than the technology itself. Sure. We'll touch on that. Like, it applies the same principles as unit testing and software engineering. So in software, when code changes, you'll need to rerun your full test suite to gain trust in your changed software. Similar with data, each time a new data is produced, you'll need to test it to keep the trust.
[00:04:15] Tobias Macey:
From that perspective of trust and ensuring the correctness and quality of data, there has been several years worth of momentum building up around the idea of data observability and data quality monitoring. And I'm curious if you can give some sense about the ways that those concepts overlap with or maybe even contend with the idea of data contracts.
[00:04:39] Tom Baeyens:
Yeah, sure. That's a great question, because observability has been getting a lot of attention in the last few years. It's similar to Datadog and New Relic monitoring your applications, right? It's all about creating visibility into your data warehouse, and it simplifies the diagnostic process. When there's a potential data issue that needs to be investigated, observability helps you diagnose it. That's the reactive part: after the fact, you look and create visibility to help find the problems. While that helps and is an important ingredient, it's only putting out the fire while the pyromaniac is still out there. Data testing has the goal of stopping the pyromaniac.
And this is where I feel there's way too much focus on observability alone, while data testing hasn't gotten the attention it deserves yet. I think that will come soon; it's coming, actually. Data testing is all about making sure that the data is as expected across the various handover points in your data pipeline. That's the preventive part. Just like in software, it's not that observability is better than testing; you need both observability and testing. That, I think, is the key to comparing the two. And data contracts now become an important approach in the data testing part, one that goes further than just the technology.
[00:06:22] Tobias Macey:
On that note of testing in the software application space, there has been a long history, and there are still points of contention, but it's generally agreed that unit tests are a good thing. There are general patterns that have built up around how to do unit testing, how to do integration testing, what it means to do end-to-end testing, and the ratios of those different types. In data, there's been a lot of conversation recently about bringing unit tests to data, but obviously there's another dimension that makes it more complicated. I'm curious if you can talk to the ways that unit testing in the data space compares to the purpose of data contracts, and how teams should think about the appropriate ratios of data unit tests, data contracts, and the role of observability as a perspective on top of those.
[00:07:18] Tom Baeyens:
Yeah. In terms of the actual link, there are the various forms of testing that you have in software, right? And I'm not sure to what extent there's absolute consensus in the engineering world on when you need integration testing versus unit testing and how much of each. But the principle itself is generally accepted: if you don't do automated testing on a broad scale, at both the integration level and the unit level, you end up in a situation where you don't trust your new release. And I think that definitely translates to data.
If you don't test anything, you will lose faith in your data, and then you have a problem in the boardroom: you see the graph going down and you wonder, is this bad data, or is our business actually going down? If you don't trust the data, it's useless, and your whole investment goes down with it. Now, one level deeper, I'm not sure all the analogies hold, but that first one definitely does at large scale. I don't think the principle of testing is as widely adopted yet as it should be. So for the majority of situations, the first job is creating awareness: what is a component in a pipeline? Currently, pipelines run start to end. What is a component, and what is the handover point? Where do you apply the test? I think that's where we should start, before digging any deeper into what to call those types of tests.
[00:09:07] Tobias Macey:
The other interesting wrinkle that comes into play when you're thinking about testing for your data is that when you're dealing with an application, you run it through the test suite in the CI/CD process before it goes into production. Once it passes all of the tests, you have pretty good confidence that everything is going to work when you deploy. With data, you have some measure of confidence that you can test the business logic around your transformations, your extract and load, etcetera. But the problem that always comes up with data, when you're talking to people about testing, is that data changes, and you don't necessarily have full control or even visibility into when or how that data is going to change. So you can't just say, okay, I'll run it through my set of tests, put it into production, and everything's great. I'm wondering how that also factors into the way you think about testing and validation, and what an integration test and an end-to-end test mean in the context of a data pipeline or a data flow.
[00:10:09] Tom Baeyens:
Yeah, that's actually a great point, because I see many people struggling with that notion. I think there are two key events you need to separate. One is when code changes: there's the potential of the software breaking. You have pipeline code, you change it, the software might break, and that ends up in bad data or bad data applications. This is the CI/CD pipeline of your data pipeline, basically. When you change your transformation logic or ingestion logic, it makes perfect sense to test that with sample data. But the tricky part, which took me a while to figure out, is the data as it is in production. Imagine a daily batch job: your Airflow runs on a daily schedule.
The data passes through, and there's no code change every day. That means the data might break at some point without the code being changed. So you can consider every batch of data as a new release of data, which has to be tested just the same. I think that's the analogy: in CI/CD, you test your code changes; in the production pipeline, you test your data changes, because each batch is also a new release of data.
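A minimal sketch of the two events Tom separates: code changes tested in CI against fixture data, and every production batch tested at runtime as a new release of data. All function names here are invented stand-ins, not from Airflow or any specific framework.

```python
# The same contract checks serve both events: a code change (CI) and
# a fresh production batch (runtime). Everything below is illustrative.

def transform(rows):            # stand-in for real transformation logic
    return rows

def extract_todays_batch():     # stand-in for the real extraction step
    return [{"customer_id": 2, "signup_date": "2024-06-01"}]

def load(rows):                 # stand-in for appending to the warehouse
    print(f"loaded {len(rows)} rows")

def validate_batch(rows):
    assert all("customer_id" in r for r in rows), "schema drift: customer_id missing"
    assert all(r.get("signup_date") for r in rows), "missing signup_date values"

def ci_test_pipeline():
    # Event 1: the code changed, so rerun the checks on a known fixture.
    fixture = [{"customer_id": 1, "signup_date": "2024-01-01"}]
    validate_batch(transform(fixture))

def daily_run():
    # Event 2: no code changed, but today's batch is still a new
    # release of data and gets tested before it is trusted.
    rows = extract_todays_batch()
    validate_batch(rows)
    load(rows)

ci_test_pipeline()
daily_run()
```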
[00:11:36] Tobias Macey:
Circling back around to data contracts, what are some of the types of guarantees and requirements that you can enforce using that mechanism? And what are some of the examples of things that you can't logically represent in the construct of a data contract?
[00:11:53] Tom Baeyens:
Yep. In terms of what you can enforce: the contract is the API for data, so let's look at that first, before we dive into what you can check. API for data means, as I touched on earlier, that a long data pipeline might consist of several components, which are currently a bit blurry. I don't think a lot of teams have great awareness of where one component stops and the next one starts. To me, that is one of the biggest advances that data contracts bring into the data space, because they demarcate the componentization of your pipelines.
It's all those datasets at the handover between one component and another, or between one team and another. Those are justified to represent as an API, because the previous component or team delivers a table with new data in it, and the next team or component is going to use it. A dbt transformation might have a certain dataset as input, and there are certain assumptions about it: what does the schema look like, what are the uniqueness properties if joins are being done, and so on. The principle of encapsulation from software is crucial here, and it's missing in the data space. Encapsulation means you don't need to know all the internals of the previous component. The only thing you really need to know is that table: what can I rely on in terms of schema and all the other properties that I use? That's what you describe in a contract. So the dataset is a tabular data structure, typically a table or a view.
And then you have the schema that goes with it and all the other data quality properties. That's what you document in a contract, which is very similar to an OpenAPI or GraphQL description of services; in this case, the context is a table that you're going to consume over a SQL connection, by firing SQL at it. That encapsulation is key. The second part of the question is: what guarantees do you want to enforce as you process a new batch of data? Typically, as I mentioned, the schema is something you definitely want to test, because when you use the data you'll use the columns, so the naming and the data types need to match. But also missing values, validity, uniqueness, and referential constraints. Everything you can test on that new batch of data is what you want to cover as part of your contract enforcement.
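A hedged sketch of those categories of guarantees: an invented, simplified contract structure plus a tiny enforcement loop. Soda's actual contract language is YAML-based and richer than this; the dict below only illustrates schema, missing-value, validity, and uniqueness checks (referential constraints would need a second dataset and are omitted for brevity).

```python
# Invented contract for one handover dataset; not Soda's real syntax.
contract = {
    "dataset": "dim_customer",
    "columns": {
        "customer_id": {"type": int, "required": True, "unique": True},
        "email":       {"type": str, "required": True},
        "country":     {"type": str, "valid_values": {"BE", "NL", "US"}},
    },
}

def enforce(contract, rows):
    failures = []
    for name, rules in contract["columns"].items():
        values = [r.get(name) for r in rows]
        present = [v for v in values if v is not None]
        if any(not isinstance(v, rules["type"]) for v in present):
            failures.append(f"{name}: wrong data type")          # schema
        if rules.get("required") and len(present) < len(values):
            failures.append(f"{name}: missing values")           # completeness
        if rules.get("unique") and len(present) != len(set(present)):
            failures.append(f"{name}: duplicate values")         # uniqueness
        allowed = rules.get("valid_values")
        if allowed and any(v not in allowed for v in present):
            failures.append(f"{name}: value outside {allowed}")  # validity
    return failures

rows = [{"customer_id": 1, "email": "a@b.com", "country": "BE"}]
print(enforce(contract, rows))   # [] means the batch honors the contract
```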
[00:15:07] Tobias Macey:
In that flow of "I have a new batch of data and I'm applying all of these tests to it," one of the challenges is: do you apply those tests before you process the data, or after? After you've run the processing, you want to make sure your transformations are correct, but the data has already landed in the destination. If it then fails the tests, you want to prevent it from being used downstream, whether in a business intelligence report or by other consumers. I'm curious if you can talk to some of the ways you think about pre-tests versus post-tests, and how to control the propagation of data once a batch fails its tests.
[00:16:43] Tom Baeyens:
Yeah, that makes a lot of sense too. In terms of when to test what, there's a trade-off. Usually, if your pipeline wasn't built with this in mind from the start, you just append new data to an incremental dataset. And this is where the notion of releasing new data comes into play: once you've added it to the incremental table, you've published it. You cannot retract it; a consumer might have just run a query and already consumed the new information. So if you test after adding it to the incremental table, there's a potential that you've released data without testing.
That's the risk. But this approach is usually easy to adopt, because you can apply a filter in a contract, for instance, or in your data testing, and you don't need to change your pipeline code. That's how you can start with contracts: by layering them on top of your existing architecture. But you'll only get notifications; you won't get proper circuit breaking. If you really want circuit breaking, you need a CI/CD-style approach: land your new data in a separate table, run your contract checks there, and only when they succeed append the data to the incremental dataset. That requires a bit of work. If you do it from the start, it's actually quite easy; if you're retrofitting, it's going to take some work. And often it's okay to just test and signal the problem when the data lands on the incremental dataset.
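A minimal circuit-breaker sketch of the staging-then-promote pattern Tom describes, using sqlite3 from the Python standard library so it runs anywhere. Table and column names are invented for the example; a real warehouse would use its own staging mechanism.

```python
# Land the batch in a staging table, run the contract checks there,
# and only append to the incremental table if they pass.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")
conn.execute("CREATE TABLE orders_staging (order_id INTEGER, amount REAL)")

def publish_batch(batch):
    conn.execute("DELETE FROM orders_staging")
    conn.executemany("INSERT INTO orders_staging VALUES (?, ?)", batch)
    # Contract checks run against staging, before anything is released.
    dupes, = conn.execute(
        "SELECT COUNT(*) - COUNT(DISTINCT order_id) FROM orders_staging"
    ).fetchone()
    negatives, = conn.execute(
        "SELECT COUNT(*) FROM orders_staging WHERE amount < 0"
    ).fetchone()
    if dupes or negatives:
        # Circuit breaks: consumers never see the bad batch.
        raise ValueError(f"contract failed: {dupes} dupes, {negatives} bad amounts")
    conn.execute("INSERT INTO orders SELECT * FROM orders_staging")
    conn.commit()

publish_batch([(1, 19.99), (2, 5.00)])   # passes, so the data is released
```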
[00:18:31] Tobias Macey:
And, fortunately, the capabilities of the underlying storage and query engines are evolving to the point where it's easier to make these changes and test them before you publish. I'm thinking of things like the Nessie project for Iceberg tables, lakeFS for general data lake approaches, and zero-copy cloning in Snowflake, where you can make a copy of the table, make your changes, test them, and then publish them back. So it's becoming possible and easier, but it also depends on what you're actually using as your underlying substrate.
[00:19:07] Tom Baeyens:
Definitely. Yeah. I agree there.
[00:19:09] Tobias Macey:
As far as the implementation of these data contracts, and how to think about how the contracts are defined and who defines them: unit testing was associated with the overall DevOps trend of shift-left, where you want to move everything as early in the process as possible. In data, a lot of that shift-left means bringing in application teams so they can notify you when they're making changes to the underlying sources you're pulling data from. I'm curious how you think about the responsibility for and application of data contracts, and how that fits into the technical and organizational structure of a business.
[00:19:50] Tom Baeyens:
Okay, yep, cool. Let me start all the way from the consumers and work our way backwards toward the source data, because I think that story shows the power of contracts. Imagine you're building a simple report using, say, three data tables. If you want that data tested, you want a contract on it. As a consumer, you can start by saying: these are my tables, this is the contract I want verified, and I want to be notified if it fails. But as a consumer, it's pretty impractical to manage those contracts, because you don't have the power to change the pipeline that produces the data. That belongs to the engineering teams, the data producers.
So a contract, just like an OpenAPI description, is not something the clients should maintain. It's the team producing the data. You always want to hand the contract over to the producing team; it's an integral part of the software producing the data, and it's the description of the interface. In the end, the consumers want these contracts. They want to know about them because they describe how to use the data; it's all the metadata describing the data you can use in your data products. But you want the data producers to take ownership. Now, in the transformations leading up to the refined data that goes into the reports, the producers have probably used input data, either from the extraction or from a previous transformation. The producers are often reluctant to take on this ownership and provide guarantees in the contract, because they rely on input data that they might not fully trust. And that puts pressure on their input data to have contracts as well.
If they know they have contracts on all the inputs of the last transformation, then they can actually guarantee the outputs of the refined data. That mechanism propagates all the way up to the source data, which brings a new, coarser-grained level to your data infrastructure. You shouldn't be looking at all the tables in your data warehouse; you should be looking at the handover tables between the components. If you have a component that produces something, you want its inputs also protected by a contract. And this goes all the way to the source systems, the production data. That's where something tricky happens a lot, of course, because we need the data from the production systems, right? There's a REST API around the production system which the team is fully aware of; they provide consistency, and it's managed as a product.
But then, in order to export the data to the analytics team, people break in through the backdoor and take the database tables, which were never intended as an API, and the production team didn't know. That's a hard conversation that needs to be had, because the analytics team actually needs the data. This is where contracts come in: the first user of that data can say, I'm making this initial contract, but then have the conversation with the production team. Can you take ownership? We actually use this as a product. Even if you can't give any guarantees, we're better off knowing that; put some integration tests in place so we know when it breaks, rather than just ignoring the problem. I think that's where contracts come into play in that ecosystem.
So it's the handovers, and it's pushing ownership all the way up to the source, where some hard conversations sometimes need to be had.
[00:24:00] Tobias Macey:
In that overall flow of information, starting from the consumer who says "I want these constraints to always be true, and I want to know when they're not," and pushing that down into the consumers, producers, pipelines, and applications that have relationships with that data: obviously those contracts are integrated as part of the pipeline that consumes, transforms, and produces the data. But I'm wondering if you can talk to the ripple effects these contracts have across the overall organization and its approach to data, and how you surface information about the contracts, particularly when they're failing. Somebody relying on a dataset should be able to know: I'm looking at this dashboard, but I can't trust it, because this check failed. How do they even get to that information, and how do you surface it in a way that builds trust, rather than detracting from it with a sense of "the data's always broken, I can never trust it"?
[00:25:09] Tom Baeyens:
Yeah, that makes a lot of sense. The key thing here is that we need to get out of the situation where data breaks so often; the pyromaniac is still out there. If you start doing contracts, I think you'll see fewer issues overall. But of course, we're very early in that journey. So how do they surface? The first thing to realize, and I want to reiterate what I just said, is that nowadays people look at the whole warehouse and see a gazillion datasets, tables or views, where data is stored. A real value that contracts bring is that you identify the datasets that someone takes ownership of.
Once you have those datasets, the ones with a contract are the ones where someone says, "I stand by this dataset," which pushes down all the other datasets that are less relevant. That alone is a key property when pushing information to a catalog or data discovery tool, because the data discovery tool is where consumers find datasets. As contracts are gradually adopted, consumers will see: oh, this dataset is governed by a contract, so it's more interesting to me than some dataset that may or may not be good. That's one thing. Then, in that same data discovery tool, you may want more granular information, like how often the dataset breaks and which checks failed; that's where it pops up. But the key is not necessarily having the consumers dig into the actual data issues, because that's the debugging process, where we have more in-depth tools to find the root cause.
For consumers, the questions should just be: is it covered by a contract? Okay, then I can rely on it a lot more. And who's the owner? Who can I talk to? I think those are the key questions you want answered in your data discovery tool.
[00:27:29] Tobias Macey:
And so now bringing this from the abstract of data contracts, what they are, how you use them to the concrete example of what you're building at Soda in this data contracts tool, I'm wondering if you can talk through some of the ways that you thought through the design process and the syntax and implementation for how to actually bring these data contracts into the pipeline, into the data ecosystem, and how to ensure that they can be written and understood and maintained and not have it just become another pile of spaghetti code that nobody can understand and everybody has to debug.
[00:28:08] Tom Baeyens:
Exactly. Yep. I can definitely talk about how we got there. Early on, even before data contracts came into play, we created SodaCL as a declarative YAML language for expressing data quality checks. That validated the declarative approach of specifying quality checks in YAML; it really resonated. But it was created from the perspective of the consumer, because that's where the problems show up, and we started from there, expressing checks over multiple datasets. You can build a contract strategy on top of that declarative language if you have the background of how the organization should work in terms of producers, what ownership means, the consumers, and how to put the checks in between. But we realized there was an opportunity to align much better with the data producer world. Producers produce a set of output ports, in data mesh terminology, or output datasets of a software component, and we saw the opportunity to align the language much better with that. That's what we did. We could leverage the query engine that runs the evaluation of all the checks, which had already been built and stabilized for a long time; the only thing we had to do was tune the language to this new use case of running all the checks for a single dataset.
[00:29:55] Tobias Macey:
Another interesting aspect of this space is that the approach of testing data and building guarantees around it has been around for a while, and different tools have implemented it in different ways. There's also the space of metrics definitions, to say: I have worked through this data, these are the things you can expect from it, these are the semantics around it. dbt has its metrics and unit tests. There's the Great Expectations tool, built around making sure that data matches your expectations of what you want it to be. And there's the data contracts tool you're building in the Soda open source ecosystem.
I'm wondering if you can talk to some of the ways that you think about the areas of overlap of what you're building with some of those other tools, and in particular, some of the either emerging or nascent standards as to how to think about the definition and maintenance of these guarantees for data?
[00:30:58] Tom Baeyens:
Right. Yeah, two very different questions, but I'll tackle them one by one. First: what's the overlap with other tools? From a Soda perspective, we definitely see our product as a component in a central data stack. What we see is that customers don't have a single, uniform technology environment; we want to work across a variety of environments. That's usually, as we touched on earlier, an existing architecture with different orchestration tools. dbt might be there, but not all the transformations run through it, and there might be data flowing around outside of dbt.
So what we have is the combination: we apply the principles of data contracts, and, as we saw before, it's mostly an organizational thing, making sure you support the right workflows. We do this as a combination of observability and data testing, delivered as a package that makes sense to install on your central data infrastructure. A central data team can say: this is the tool we work with for data testing. Then all the different teams have guidance for their particular environment and can actually provide the guarantees; there's guidance on how to apply it, and it works across those different environments. That's how our perspective is a bit different. And then the second part: yes, this is a very new space.
Standards are popping up left and right, and we think that's really important. This feels like a place where a standard could really help, because lots of tools integrate with contracts. We didn't touch on that yet, but unit testing is only one aspect of contracts. Pushing metadata to data discovery tools is another use case for another tool. There's retention, there's access control; there are all kinds of aspects you can easily model in a contract. So it makes perfect sense to consider a standardization effort, and there are multiple competing ones at the moment. There's ODCS, the Open Data Contract Standard, from the Bitol project.
There's also ODPS, the Open Data Product Specification, and probably a few others. We keep a very close eye on those and help wherever we can to push them forward, because we know this is crucial for our customers. That's the value standards can bring. The whole data landscape is still super fragmented. We'll probably see some consolidation going forward, which would be really good to have, but as long as we don't have that, all these tools need to interoperate.
And I think contracts are going to play a major role in that. So wherever we can help, we're there.
[00:34:03] Tobias Macey:
To that point of integration, and access control in particular, I think that's a very interesting application of these contracts: to say, I guarantee that this set of data is only accessible by people who have these roles. But there isn't really any cohesive standard for how to apply access controls across different data tools; that's one of the problems I run into constantly in my own work. I'm wondering if you can talk to the types of integrations you're thinking about building, or have already built, for these data contract specifications, the areas of the data ecosystem that are in good shape for pushing these types of guarantees down into other layers, and the areas where you're seeing gaps in how to approach that integration and enforcement.
[00:34:57] Tom Baeyens:
Okay, let me run through them one by one. If you start with a contract and what you can do with it, there are multiple use cases, and it's really important to distinguish them. One use case is being the central source of metadata for your data, or better said, the system of record for your metadata: you have your YAML file, you describe your schema, your types, and so on, and all the information for consuming the dataset is in that file. It's managed by the producers, because they control the actual data production, so they should also control the description. Pushing that to data discovery, as we said, is one use case.
That's one tool consuming that YAML file, extracting a portion of the information, and displaying it to consumers. The second use case, the one we focus on, is the testing: as you specify the schema and the data quality properties, we can extract those and run checks to see whether the data really matches, so that part is covered with checks. And there may be others. The next one is where we see companies build their own custom workflows, using a couple of properties they add here and there, which could be around ownership or PII, what that means for them, and where they want to enforce and use it. They build the tooling and software logic in their workflows that leverage that information.
There's more you could specify, like retention and access management. In that sense, contracts become the configuration files for the tools in your data stack; that's another way of looking at it. The challenge going forward is that more tools will adopt the contract, which is already a huge benefit versus having all this logic spread across all the tools; having it centrally managed in a single file is a huge benefit. But as an engineer, if you change this file a couple of years later, how do you know which tools a given property will impact? If you flip uniqueness from false to true, or you declare a column unique where nothing was specified, does that imply a unit test will run and check uniqueness over the full dataset? You might do that without being aware of it. You just wanted the data catalog or data discovery use case, telling your users the column is unique, without realizing you're also going to test for it. That's something we take into the design of the language: making sure there's a clear distinction between what you're going to test and what you're going to publish. So that's the challenge I see: as more tools get configured in this contract, which is valuable, how do we make sure engineers stay in charge?
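A hedged sketch of that test-versus-publish distinction: one contract file carries properties for several tools, and each property says explicitly whether it's merely published to a catalog or also enforced as a check. The structure and key names below are invented for illustration, not Soda's actual language.

```python
# Invented contract structure: each entry declares which tools consume
# it. "publish" pushes metadata to a catalog; "enforce" runs a check.
contract = {
    "dataset": "dim_customer",
    "owner": "growth-team@example.com",        # catalog only
    "retention_days": 365,                     # read by a retention job
    "columns": {
        "customer_id": {"unique": {"publish": True, "enforce": True}},
        "email":       {"pii": {"publish": True, "enforce": False}},
    },
}

def enforced_checks(contract):
    """List only the properties the engineer opted in to testing."""
    checks = []
    for col, props in contract["columns"].items():
        for prop, modes in props.items():
            if isinstance(modes, dict) and modes.get("enforce"):
                checks.append(f"{col}: {prop}")
    return checks

print(enforced_checks(contract))   # ['customer_id: unique']
```

The explicit per-property flags are one way to keep the engineer in charge: nothing gets tested that wasn't deliberately opted in, even as more tools read the same file.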
[00:38:02] Tobias Macey:
In terms of the workflow of bringing these data contracts into your ecosystem, I'm wondering if you can talk to some of the types of questions you're seeing engineering teams come up with, the ways they're thinking about how and where to apply these data contracts, the ways they've been able to benefit by paying down complexity, and also some of the ways they're running into issues, where they want the data contract to do something and it's not flexible enough, or they don't yet understand the limitations of what they can guarantee.
[00:38:41] Tom Baeyens:
Yeah, more on the other side: I saw some very interesting use cases of data contracts that I didn't expect initially, which broadened my mind a little bit. We started off with the API for data, right? We have this metadata, the data contract describing the schema and all of that. Then someone asked for a little tool I didn't expect: generating the DDL, the CREATE TABLE statement, from the contract. While that was just a normal feature request, it started to trickle down: what does this mean? It could mean that the data contract becomes the control plane for your warehouse, where you don't start from the DDL and work your way to the contract, but the other way around.
The engineer starts by building the contract first, building that metadata, and then just does an apply, where the tool calculates the difference and runs it. I think that was a powerful metaphor, an insight that says maybe we're going in that direction, because we're actually fixing a limitation of the warehouses, of the storage layer. We're fixing the fact that storage layers have very limited metadata: they only have column name and data type, and that's it. Whereas many of the workflows in a data environment require a lot more metadata.
Just like document systems in the past: it's not about the documents; you can only do proper workflows if you associate metadata with your documents. It's the same here. You can only do interesting workflows and automate things in your organization if you have metadata together with your datasets. So that was an interesting way of seeing this applied.
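The DDL-generation idea is easy to sketch: derive a CREATE TABLE statement from the contract so the contract, not the warehouse, is the source of truth. The contract shape and the type mapping below are invented for the example; a real tool would also diff against the live schema and emit ALTER statements.

```python
# Invented example: the contract drives the warehouse, not the other
# way around. This only emits the initial DDL.
TYPE_MAP = {"int": "INTEGER", "str": "VARCHAR", "float": "DOUBLE PRECISION"}

contract = {
    "dataset": "dim_customer",
    "columns": {
        "customer_id": {"type": "int", "required": True},
        "email":       {"type": "str", "required": True},
        "country":     {"type": "str", "required": False},
    },
}

def to_ddl(contract):
    cols = [
        f"  {name} {TYPE_MAP[spec['type']]}"
        + (" NOT NULL" if spec.get("required") else "")
        for name, spec in contract["columns"].items()
    ]
    return f"CREATE TABLE {contract['dataset']} (\n" + ",\n".join(cols) + "\n)"

print(to_ddl(contract))
```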
[00:40:42] Tobias Macey:
And in your work of building these data contract interfaces, thinking through how they fit into an organization's data ecosystem, and building the tooling around them, what are some of the most interesting or unexpected or challenging lessons that you learned in the process?
[00:40:59] Tom Baeyens:
Yeah, one of the more interesting things we came across was applying generative AI to this. We totally didn't expect it, or at least I didn't; I was initially a little bit skeptical, and then someone said, just give it a try. We opened up a prompt and taught it a little bit about what Soda was. Then I thought: can I use it as a contract generator? So in the prompt I added the CREATE TABLE statement, because we have it; you can extract it from the warehouse. And I added sample data by pasting in a couple of INSERT statements, without really explaining them.
I asked it to generate a contract for me, and it was pretty impressive. It came back with a full working contract. There was a student dataset with a GPA column in it, nothing more. It figured out that it could create a min and max check on the GPA: all the values in the sample were between 3.2 and 3.7, and it created a check with minimum value 0 and maximum 4. From the acronym GPA and from its network, it deduced that this must be a GPA score and applied the appropriate data quality check. That was really impressive. Then I thought, this looks good, let me try to run it.
There was actually a problem, because the CREATE TABLE was in Postgres and declared a VARCHAR type, so the assistant took that as the data type in the contract. But if you ask Postgres's metadata for the type, you get "character varying". That was the only error. So I tried to run it, a whole bunch of logs came out, and somewhere it said: your contract check failed because the data type doesn't match. And I thought, why not try it? I copied and pasted the whole log into the prompt, saying: this doesn't seem to run, can you fix it? And the assistant came back with a fixed contract that actually worked, with a good balance of which checks to apply and which not. I was totally impressed with that, and it made me look ahead: wow, this is really how generative AI can change the interface of a system. You used to do this in an IDE, editing YAML, or XML back in the days; now you have a conversation. Can you update my contract?
This is what I want; can you fix it? That was impressive to see, so we quickly added it to the product.
[00:43:50] Tobias Macey:
For people who are interested in the capabilities we've been discussing and are looking to improve the overall reliability of their data platform, what are the contexts in which a data contract is the wrong choice?
[00:44:07] Tom Baeyens:
That's a hard one, actually. If I had to guess: if you anticipate that your analytical data does not change and remains stable, then you might not need it. It's like asking when you don't need unit testing in software engineering, which is also pretty hard to answer. If you're starting a new software project and it's a script for personal use that you run three times a year, maybe not even again after today, then it's not justified to write unit tests for it. Same with your data: if it's a one-off, don't bother. But if it's part of your data infrastructure and you're building enterprise workflows on top of it, I don't think you should avoid this at all. It's going to be hard, and I agree it's early; this is not common practice yet. But I definitely feel it's coming, and I expect that within five years, no one will start another data pipeline project without thinking about the contract first.
[00:45:17] Tobias Macey:
As you continue to work on your tooling and keep an eye on the overall ecosystem of data and how people are thinking about building guarantees around their pipelines and their analytical capacity, what are some of the things you have planned for the near to medium term, or any predictions you have about how the space might evolve?
[00:45:38] Tom Baeyens:
Yeah. At Soda, we're in early access right now, and we plan to bring this to GA later this year. We can do that quite fast because of the solid foundations of the engine we have. But in general, I think the more interesting part is to look ahead at the uptake of contracts, mostly the organizational aspects: the assumptions we have in software, how do they apply in data? Can we get the same kind of mindset, where we start to think in data pipeline components with interfaces between them, and where producer teams take ownership, because they're currently often missing in action?
That principle is what I think will be very interesting to watch play out. Is it going to be a lengthy process, like in software, where it took maybe ten years before unit testing was widely adopted? But now we have that example, so it's much easier to explain in software terms, and it's also quite easy to see that you need it. So I think it's going to be a lot faster. But is it two years? Is it going to be seven, or something in between? That's going to be interesting to watch, from my perspective.
[00:46:55] Tobias Macey:
Are there any other aspects of this space of data contracts, either conceptually or in terms of your implementation at Soda, that we didn't discuss yet that you'd like to cover before we close out the show?
[00:47:09] Tom Baeyens:
No. I guess the biggest thing I'd like to see happening is a more integrated data platform. There are currently a gazillion tools that each do a small part, and you feel it would be much easier if they were better integrated. I think that's going to happen at some point; the bigger players are thinking about this, of course. To me, that's what makes the times ahead very interesting: how is this all going to be consolidated, so that you get a complete data infrastructure platform as a single product, rather than having to stitch together all the different tools, as is unfortunately the situation we're in right now?
[00:47:59] Tobias Macey:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And I just want to thank you again for taking the time today to join me and share the work that you and your team are doing on data contracts, your perspective on the role they play in an organization's data ecosystem, and the ways they can help build greater confidence in and reusability of data. So thank you again for that, and I hope you enjoy the rest of your day.
[00:48:29] Tom Baeyens:
Thank you, Tobias. It was a super pleasure to be here, and thank you for helping the data ecosystem by sharing all this knowledge. That's very much appreciated.
[00:48:46] Tobias Macey:
Thank you for listening, and don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering Podcast is your guide to the fast-moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Episode Overview
Guest Introduction: Tom Baeyens
Tom's Journey into Data Engineering
Understanding Data Contracts
Data Observability vs Data Contracts
Testing in Data Pipelines
Challenges in Data Testing
Guarantees and Limitations of Data Contracts
Implementing Data Contracts
Impact of Data Contracts on Organizations
Soda's Approach to Data Contracts
Integration and Access Control
Workflow and Benefits of Data Contracts
Lessons Learned in Building Data Contracts
Future of Data Contracts and Predictions
Closing Remarks