Summary
One of the reasons that data work is so challenging is that no single person or team owns the entire process. This introduces friction into collecting, processing, and using data. To reduce the potential for broken pipelines, some teams have started to adopt the idea of data contracts. In this episode Abe Gong brings his experiences with the Great Expectations project and community to discuss the technical and organizational considerations involved in implementing these constraints in your data workflows.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show!
- Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan's active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.
- Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end-to-end Data Observability Platform! Trusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes. Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo to learn more.
- Your host is Tobias Macey and today I'm interviewing Abe Gong about the technical and organizational implementation of data contracts
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what your conception of a data contract is?
- What are some of the ways that you have seen them implemented?
- How has your work on Great Expectations influenced your thinking on the strategic and tactical aspects of adopting/implementing data contracts in a given team/organization?
- What does the negotiation process look like for identifying what needs to be included in a contract?
- What are the interfaces/integration points where data contracts are most useful/necessary?
- What are the discussions that need to happen when deciding when/whether a contract "violation" is a blocking action vs. issuing a notification?
- At what level of detail/granularity are contracts most helpful?
- At the technical level, what does the implementation/integration/deployment of a contract look like?
- What are the most interesting, innovative, or unexpected ways that you have seen data contracts used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on data contracts/great expectations?
- When are data contracts the wrong choice?
- What do you have planned for the future of data contracts in great expectations?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
- Great Expectations
- Progressive Typing
- Pioneers, Settlers, Town Planners
- Pydantic
- Typescript
- Duck Typing
- Flyte
- Dagster
- Trino
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- MonteCarlo: ![Monte Carlo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/Qy25USZ9.png) Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end-to-end Data Observability Platform! Trusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes. Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today! Visit [dataengineeringpodcast.com/montecarlo](https://www.dataengineeringpodcast.com/montecarlo) to learn more.
- Linode: ![Linode](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/W29GS9Zw.jpg) Your data platform needs to be scalable, fault tolerant, and performant, which means that you need the same from your cloud provider. Linode has been powering production systems for over 17 years, and now they’ve launched a fully managed Kubernetes platform. With the combined power of the Kubernetes engine for flexible and scalable deployments, and features like dedicated CPU instances, GPU instances, and object storage you’ve got everything you need to build a bulletproof data pipeline. If you go to: [dataengineeringpodcast.com/linode](https://www.dataengineeringpodcast.com/linode) today you’ll even get a $100 credit to use on building your own cluster, or object storage, or reliable backups, or… And while you’re there don’t forget to thank them for being a long-time supporter of the Data Engineering Podcast!
- Atlan: ![Atlan](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/ys762EJx.png) Have you ever woken up to a crisis because a number on a dashboard is broken and no one knows why? Or sent out frustrating slack messages trying to find the right data set? Or tried to understand what a column name means? Our friends at Atlan started out as a data team themselves and faced all this collaboration chaos themselves, and started building Atlan as an internal tool for themselves. Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to [dataengineeringpodcast.com/atlan](https://www.dataengineeringpodcast.com/atlan) and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription.
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan's active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today, that's a-t-l-a-n, to learn more about how Atlan's active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork, and Unilever achieve extraordinary things with metadata.
When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends at Linode. With their new managed database service, you can launch a production-ready MySQL, Postgres, or MongoDB cluster in minutes with automated backups, 40 gigabit connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of the show. Your host is Tobias Macey. And today, I'm interviewing Abe Gong about the technical and organizational implementation of data contracts. So, Abe, can you start by introducing yourself?
[00:01:36] Unknown:
Thanks, Tobias. I'm Abe Gong. I am a data scientist, data engineer, and entrepreneur. I'm one of the co-creators of Great Expectations, the open source data quality project, and now CEO and cofounder of a company that's taking it to market. Do you remember how you first got started working in data? I first got started in data, I mean, it depends how far back you go, but grad school. I went to the University of Michigan because it was an extremely quantitative social science program. And I was there when Pandas was first born, and I'll say, you know, I dropped R and immediately moved to Pandas because I love Python. My dissertation was actually building more than a hundred different web scrapers to pull down blog content from all over the Internet for the world's first representative sample of political blogs. And there was data all through that, messy data and kind of real data, which I think is something that you don't always get exposed to in grad school.
From there, I went into tech. I worked at Jawbone, worked at a company called Aspire Health, always leading data teams. And I'm really driven by curiosity. So a lot of those were companies where a big part of the thesis was doing something new and interesting with data that hadn't been done before.
[00:02:46] Unknown:
That brings us forward to today where we're talking about data contracts, which is a very popular topic of conversation at the moment. A lot of people have their own opinions on what that means, what that looks like, what their purpose is. And I'm wondering if you can start by setting a baseline of what you view as the kind of conception of what a data contract is and how it's defined and the purpose that it serves.
[00:03:11] Unknown:
It's super buzzy right now. Right? So, sure, one more definition from me. I'm sure that will help. I think there are two big concerns in data contracts, and we're sometimes conflating them. One of them is around alignment and making sure that different people understand the same data well enough that they can do their work together. There's also a component around enforcement, although, look, it's actually not just enforcement. I would think of it as: in the life cycle of creating and maintaining those data contracts, data quality checks become a really important and useful part that can be automated. I think those conversations are sometimes conflated.
[00:03:50] Unknown:
And so in terms of the kind of technical and organizational and social aspects of the implementation of data contracts, I'm wondering if you can talk through some of the ways that you've seen teams approaching that.
[00:04:02] Unknown:
Just looking at it through a Great Expectations lens, the abstraction in our system that maps most closely to a data contract is what we call a checkpoint. And a checkpoint, basically, the way people use it is as a small code snippet that you can drop in wherever you want. So you could put it in an Airflow DAG. You could run it inline with Spark code. You could run it as a cron job to monitor a data warehouse. And what a checkpoint does is it pulls together expectations, so a set of tests and rules about your data, with the actual data itself, verifies that those tests are passing, or if they're not passing, it flags that as an error and then notifies the appropriate parties. And what I mean by that is you log validation results, and then most teams will end up setting up a Slack notification, or sometimes more than one type of notification, to let people know: hey, this data contract is being violated right now.
So, practically speaking, the expectations themselves become a fairly rich definition of what people want the data to do, and then the checkpoint becomes an enforcement mechanism. I said earlier that there was an alignment and an enforcement component. The execution of the checkpoint does enforcement; it verifies that things are correct. And then the point at which you build the expectation suite, whatever your process is for that, is how alignment gets taken care of.
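As a concrete illustration of the checkpoint pattern described here, the sketch below builds a small expectation suite and runs it through a checkpoint. It assumes a recent pre-1.0 Great Expectations API (the Fluent-style quickstart); method names have shifted across versions, and the file name, columns, and thresholds are hypothetical.

```python
import great_expectations as gx

# Get (or create) a data context for the project.
context = gx.get_context()

# Load a batch and express the contract as expectations.
# "orders.csv" and its columns are hypothetical examples.
validator = context.sources.pandas_default.read_csv("orders.csv")
validator.expect_column_values_to_not_be_null("order_id")
validator.expect_column_values_to_be_between("amount", min_value=0, max_value=10_000)
validator.save_expectation_suite(discard_failed_expectations=False)

# The checkpoint bundles the suite with the data and runs validation.
# In practice, this is the snippet that gets dropped into an Airflow DAG,
# run inline with Spark code, or scheduled as a cron job.
checkpoint = context.add_or_update_checkpoint(
    name="orders_contract",
    validator=validator,
)
result = checkpoint.run()
if not result.success:
    # Most teams wire this up to a Slack action rather than a print.
    print("Data contract violated; notify the appropriate parties")
```

In a real deployment the notification step would be configured as a checkpoint action, so the alerting travels with the contract rather than living in ad hoc scripts.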
[00:05:24] Unknown:
As far as the ways that people are thinking about contracts, it's also interesting because it's such an overloaded term. Contracts are legal documents that are used to define the relative responsibilities between two different parties. It can be used as a technical definition of an interface contract from an API design perspective. It can be just a loose contract of unwritten rules of social engagement. And I'm wondering how you see teams thinking about those gradations of meaning as well. Is it something that needs to be defined as a bulletproof technical implementation, where if the API spec isn't matched exactly by the data as it comes in, it's going to fail? Or is it more a social and organizational conversation, a kind of gentleman's agreement of, hey, I really want this data to look this way in order to be able to use it for this purpose, so it'd be great if you could make sure that happens? I'm wondering what are some of the ways that people are thinking about that aspect of what the contractual obligations are and who is involved in those negotiations.
[00:06:33] Unknown:
It is all over the place right now, and I think it's just a really interesting conversation. To me, it feels kind of like an organizational Rorschach test. Do you know the framework of pioneer and settler roles for growing companies? I do. But for folks who might not be familiar, can you just kind of briefly sketch it out? Yeah. Of course. So pioneer roles, which you usually see much earlier in a company's history, are people who are excited about breaking new ground, kind of doing new things, setting things up for the first time. And people who tend to be best in those roles and most excited about those roles don't like constraints. Right? Like, they like to dive in, tackle a problem, work it out as fast as they can in their own head or with code or, you know, a marketing tool or whatever the tools are.
Settler-type roles tend to come after. You've got a thing that's up and running. It's worked a few times, probably not super reliable yet, but something good is happening, and you wanna make it so that it can scale up and just become a steady, reliable part of the organization. I think a lot of the differences around data contracts come down to which type of role people are in. So as I watch debates between different participants, it's often people coming from larger organizations, where the data team has to answer to more stakeholder groups and more other people around the organization, that are leaning harder on the let's-negotiate-this, let's make sure that things are steady, let's not change too frequently side.
And then it's smaller or maybe newer organizations, where there aren't as many other people depending on the data yet and you wanna be able to explore and move fast, that I hear pushing harder for the side of: hey, let's not make this a big negotiation. Let me just send you a notification if something changes.
[00:08:08] Unknown:
As far as your lens on this space, given your engagement with the Great Expectations project and its community, I'm curious how that maybe contrasts with the perspective of people who are coming from a different starting point, maybe from the data governance arena rather than the data quality arena, and some of the ways that that colors the considerations about what needs to go into a contract and what the constraints are for these contractual definitions.
[00:08:38] Unknown:
That's an interesting one too. Right? Like, are data governance and data quality really different? You know, is one a subset of the other? They're very overlapping. And there are a lot of people who are using Great Expectations for things that I think you would think of not as technical data quality questions, but more as data governance questions. So I actually think there might be more overlap there than your question implies. But just overall, when I look at the surge of new conversation around data contracts, with Chad Sanderson and others kind of pushing this thesis out, the thing that I think they're doing that's newer and provocative is they're pushing harder for stronger schemas and stronger typing upfront than has been done in the past.
I think that's a really interesting evolution. I think it is probably the right way for things to go, just to say it. But I also think that, as you're seeing from the pushback from people around it, it's too much for people to start off with. So this is a thing we've talked about elsewhere, but I think of it like progressive typing. If you force everybody to come in and set up strong schemas and strong contracts for all of their data ahead of time, okay, let's freeze the whole organization for six months, let's have everybody dive into a million meetings and just do that. It's a non-starter, right? Like, you just can't do that. But there are good ways where you can start off by, and again, I'm thinking of this through the expectation lens, putting in some expectations that are checked. Initially, they're checked kind of at the downstream, right-hand side of your data pipelines. And then over time, you discover that those are breaking, so you work a stage upstream and add some checks there to make sure that things are right. And then after a while, you get to the point where you have a lot of people who are depending on those. So, okay, let's switch over and not have this just be a notification, but have it be more part of the approval process before we make a change. There are ways that you can layer it in where you don't have to freeze everything to do this big, you know, constitutional convention to figure out your contracts.
That still will eventually get you to the same place. And like I said, I think there's a really interesting analogy back to the way Pydantic and TypeScript are rolling out everywhere in the Python and JavaScript worlds, because we've sort of rediscovered that typing is a really, really good idea for complicated systems.
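A small sketch of what that progressive-typing posture can look like in Pydantic, one of the libraries mentioned above. The model and field names are hypothetical; the point is that the early contract only pins down what you rely on today and gets tightened as consumers accumulate.

```python
from typing import Any, Optional
from pydantic import BaseModel

# Early on: a permissive "proto contract" that only constrains
# the one field anyone actually depends on yet.
class EventV1(BaseModel):
    event_id: str
    payload: Any = None  # deliberately unconstrained for now

# Later, once downstream consumers depend on more fields,
# the same contract is tightened field by field.
class EventV2(BaseModel):
    event_id: str
    user_id: int
    event_type: str
    payload: Optional[dict] = None

# Validation now fails loudly if upstream data drifts.
EventV2(event_id="e1", user_id=42, event_type="click")
```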
[00:10:53] Unknown:
To that point of the integration layers where you're able to employ these technical checks and validations, I'm curious what you see as the ones that are most valuable and most effective, and some of the ways that teams are thinking about how to model their validations and how to model the data that they're working with to be able to run these checks on it, as well as the question of, you know, is it a compile time or a runtime check?
[00:11:26] Unknown:
Mhmm. I guess the answer is yes. So I think people often think about Great Expectations as a testing framework. And if you really kind of bite down on that language, you could end up going down the road of saying, oh, so it's gonna tie into your CI/CD. And in that world, you'd say developers should write their own tests, and so on. We definitely see Great Expectations being used that way. It becomes a way to test business logic. In that sense, it ends up overlapping a little bit more with things like dbt tests, for example. So besides the CI/CD testing use case, which I would think of as a dev team, you know, a data dev team, testing their own work and making sure that the business logic is under test.
There's also this notion, I think of it from the economic perspective, of externalities. You have data flowing from one team to other teams. So you have people who are consuming the data who are not the producers of the data, and that creates the need for a different type of test that I think is a lot closer to what most people are thinking of when they say data contracts. So, in that case, what you look for is team boundaries where important data is flowing and where there's, you know, risk of change or complexity from one team thrashing another. At least within Great Expectations, you see somewhat different granular tests being applied to test the different types of interfaces, but it's the same DSL. It's the same basic concept.
Sometimes the cadence of testing is a little bit different, but, yeah, I mean, we've deliberately built the thing so that it can address both use cases. You asked another question around compile time versus runtime. I think this is actually one of the places where the progressive typing metaphor falls apart a little bit, because in progressive typing systems, you know, TypeScript or Python, you assume a single compiler, or interpreter rather. And in data systems, the data may be getting dropped into S3 via FTP and then moved along by Airflow into something. You don't have a single runtime for it, unless you're a team that's completely gone all in on Snowflake or completely gone all in on Databricks. And I just don't see most teams being 100% there, which puts you in this interesting place where your testing system, or your typing system, which I see as kind of the same thing, needs to live outside a single compiler.
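Because there's no single compiler to lean on, the practical version of this is a runtime check you can drop into whatever environment a batch of data lands in. Here's a minimal sketch of the duck-typing-style schema check described next, with hypothetical column names and dtypes:

```python
import pandas as pd

# Hypothetical contract for a feed we consume: column names and rough dtypes.
EXPECTED_SCHEMA = {
    "order_id": "int64",
    "amount": "float64",
    "created_at": "datetime64[ns]",
}

def check_schema(df: pd.DataFrame) -> list[str]:
    """Duck-typing check: does this batch look enough like what we expect?"""
    problems = []
    for col, dtype in EXPECTED_SCHEMA.items():
        if col not in df.columns:
            problems.append(f"missing column {col!r}")
        elif str(df[col].dtype) != dtype:
            problems.append(f"{col!r} is {df[col].dtype}, expected {dtype}")
    return problems

# The same function can run in an Airflow task, a cron job, or a Spark driver.
batch = pd.DataFrame({
    "order_id": [1, 2],
    "amount": [9.99, 25.50],
    "created_at": pd.to_datetime(["2023-01-01", "2023-01-02"]),
})
assert check_schema(batch) == []
```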
So practically speaking, what most people do is they test the data that's coming in immediately prior to them. That allows you to do sort of duck-typing schema checks to make sure it's not broken. And then eventually, teams start to see, oh man, it'd be nice to catch errors earlier, and they push it upstream, and you start to get checks that go earlier and earlier in the process. I don't see a lot of teams doing this yet, but there's some interesting stuff going on with projects like Flyte and Dagster, where they're actually building the typing system into the framework itself. And you can start to see the thread through to being able to do things like static analysis, so that if somebody changes a type, before you run data at all, you've made sure that that's gonna flow all the way through. I think most teams are light years away from actually implementing something like that, but I think that's probably where it's gonna net out eventually. Right? Like, the most efficient version of these systems, I think, has to go there. Absolutely. I agree that the orchestration layer is getting to the point where that is the
[00:14:47] Unknown:
most sensible place to actually execute these checks and validations and embed that understanding of what agreements need to hold true between each of the different stages of a data's progression. And another interesting aspect of understanding what validations need to be applied, and where, and by whom, ties into the question of data discovery and data consumption. If you're a core data engineering team, you say: okay, I'm going to make sure that all of the data that comes into my system matches these schemas so that I know what I'm doing with it. I write these transformations, and then I have this output, maybe the output of your dbt runs, so I know that these are the models that are being generated. Then it gets handed off to the data analyst or the business intelligence team or a business user who starts pointing and clicking through Metabase or Superset and building their own dashboards and reports, and your understanding of the scope of information that you're working with becomes out of date. So then you have this constant upkeep of: okay, what are all the ways that the downstream uses are changing? What are all the new upstream sources that I have to be considering, or pending changes that are going to be coming to me from that upstream source? And I'm wondering how that influences the ways that teams think about the level of strictness and granularity to put into these validations and these contractual agreements, versus just having a looser structure of, okay, I just need to make sure that this is roughly the right look and that I can massage it into the way that I want it to be, while also making sure that all of the downstream uses continue to be well served by the work that I'm putting into all of this upkeep and maintenance.
[00:16:34] Unknown:
I think the closest that I can get on that is the way that I break down data science and data analytics workflows especially. I think engineering starts to be a little bit of its own animal. But for those types of workflows, it's about question answering. Right? Like, you have some large scale business question that you need to turn into a list of more precise questions that are gonna be asked of the data. There's judgment and intuition involved, so it's not a mechanical exercise. But if you look at the atomic units, it's: okay, what questions am I asking? When you look at testing and data validation and data monitoring, it's fundamentally the same thing. Basically what you're doing is taking a set of questions that, through the exploratory process or the requirements gathering process, you've decided are important. And, like, very few people are this formal about it. But if you really break down the workflow and the life cycle, this is where it lands.
So there's a list of questions that you've decided are important and wanna ask on a persistent basis. The thing that I think is really interesting, and if you wanted to riff on what's going on in AI and stuff like that these days, I think it just got a lot more interesting, is that a lot of the work around that question answering is actually amenable to automation. Not so much picking exactly which questions to ask, but having kind of chunks of questions that you wanna ask together. I think that there are gonna be a set of repeatable workflows that develop out of that. And you already see it, at least we see it, in the Great Expectations community with people doing things like developing data quality scorecards, where they use a set of expectation suites that are more or less consistent across a lot of different datasets.
You can think of that as packages of questions that they've decided are important to ask together, and they reuse in lots of different places. So, like, I don't have a full answer for how you keep the dashboard in sync with upstream ELT and then all the T in the middle. That's a complicated question. I think there's a lot of job security for us as data people there. But, fundamentally, I think it's gonna end up building out of those kinds of questions as the basic units of the work that we're doing.
[00:18:38] Unknown:
Another interesting avenue is the question of whether data contracts are a safety mechanism for the downstream users of the data, acting as a set of training wheels or guardrails, or a method of maintaining the sanity of the data engineers who have to make sure that the data is healthy and functioning the way that you expect it to.
[00:19:03] Unknown:
I think those interests are aligned. I actually don't think that's an either-or proposition. If you wanna break it out, it's gonna depend on kinda who has the power to, like, disrupt other people's workflows in the organization. So in some organizations, you know, data engineering teams have their process. Other people can make requests, but, like, they're gonna keep shipping what they were gonna ship. And in that case, I'd say the team at risk from the externality of bad data is the downstream team that's consuming it. On other teams, data engineers answer to other people. And if an exec comes in and says, my dashboard is broken, you must fix it now, throw out all your other work, okay, now it's a sanity
[00:19:40] Unknown:
device for the data engineering team. So, yeah. I think it's also interesting to explore the responsibility of the data engineers to provide everything on a silver platter versus the responsibility of the organization to be sophisticated in their usage of the data, to understand, you know, what are the signals of data quality, so that if they do see an anomaly or some sort of breakage, they can resolve it on their own or contribute to that resolution. So, like, to that point: are data contracts kind of a safety mechanism early on in your data journey, so that as you become more sophisticated in terms of your technical platform capabilities and in the organizational understanding of the data, those constraints can be loosened? So there's more of the kind of Python approach of we're all adults here, and if you wanna shoot yourself in the foot, that's your problem.
[00:20:36] Unknown:
So you didn't ever quite say the words, but, like, the Trino team is talking a lot about data products, and there's the whole data mesh community. And then there's just this good conversation going on in the data sphere about: are we services teams or are we product teams? I think this is fundamentally that question. Right? So if you think of your data team as, hey, this is a pool of people, it's a pool of analysts who are gonna answer questions, and maybe there's some repeatability to that, but fundamentally it's a services team that answers questions, then I think you want to optimize more for flexibility, because, like, even if you do a good job answering them, the questions today are going to be different than the questions tomorrow.
And, yeah, you just gotta stay light on your feet. But if you think of yourself as, hey, I'm a team that is producing a set of data that is gonna be useful in a reusable way for a lot of people over time, okay, now that's more product-y, and you wanna lock it in. Right? And to some extent, with the teams that I see doing this in a more sophisticated way, it's not so much data producer and data consumer. It's data producer as a facilitator among a bunch of different data consumers. Like, hey, there are several different groups who all wanna consume this data. Okay, what's the union of the things that they need that's also technically feasible?
So I think it depends on whether you think of yourselves more as a services team or more as a product team. And, of course, there are a lot of teams that are mixing and matching those.
[00:21:56] Unknown:
Your introduction of data mesh into the conversation is an interesting filter for this question of data contracts as well, because at that point the contract is the outward-facing API to this data product, whatever format that might take, whether it is a table in your Snowflake or an API that is going to be used by some web client. I'm curious about some of the ways that that changes the thinking around what contractual validations we're trying to do and what guarantees we're trying to enforce by employing these contractual patterns.
[00:22:34] Unknown:
Yeah. I mean, there are probably some diehard true believers in those two communities who'd, like, hate on me for saying this, but I think they're very congruent views. I think they fit together really nicely. I think the data mesh community has kind of started with larger, more enterprise organizations and is thinking about workflows and role definition more. And I think the data contracts conversation is building on that to some extent, but it's been more technically minded, saying, hey, there's actually a technical fix for this. There are differences, but I see them as relatively small. I think they both point in more or less the same direction.
[00:23:12] Unknown:
Are you struggling with broken pipelines, stale dashboards, missing data? If this resonates with you, you're not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end-to-end data observability platform. Trusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes.
Monte Carlo also gives you a holistic picture of data health with automatic end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today. Go to dataengineeringpodcast.com/montecarlo to learn more. Digging a bit deeper on the tactical aspect of building these contracts and validations, there are a number of interesting challenges around that: where to employ the enforcement, what the deployment and update cycle looks like, what the integrations at the developer level are, and who is responsible for adhering to these contracts. Is it something where the application developers who are generating the events, or the upstream schemas for the tables that are being pulled into the warehouse, are participating in that process? So they have integrations with their editors to make sure that if they modify an event schema, it triggers some sort of notification or alerting to other people who need to take advantage of it, or direct integrations with, say, the Kafka Schema Registry or some hook into a CI system that is going to make sure that everything runs through. What are some of the challenges that you see teams running into as they try to build more automation around this process, and maybe some of the ways that the data ecosystem is trying to play catch up or is addressing some of these challenges of how do I actually make sure that this thing holds true from end to end, and what does end to end even look like?
[00:25:29] Unknown:
Yeah. One lens to put on that is that I think this is the problem of trying to apply types without a compiler. And so your compiler has to be social. Right? It's people working out things among themselves. Okay. And I don't wanna overgeneralize, but I do see a few trends there. So one is between data engineers, or, like, data platform teams, and then data scientists and data analysts. That relationship feels quite friendly to me, in the sense that the platform team knows that they're there largely in support of the analysts and data scientists. And, like, there's work to do, real work, to set up good contracting infrastructure, but everybody's there for it.
Downstream, when you get to the business users, I think that there's often more friction, just because there is often a lack of understanding from the business users of just how hard it is to really guarantee data quality, especially if you've got a deep DAG, right, kind of a long supply chain for your data. There's this notion of, well, I asked the analyst why the number changed, he couldn't answer, like, what does that guy even do? I think there's friction there, but I think it's friction mostly because of education. Right? It's business users often just not knowing how hard the problem is. That's a place, by the way, where I think we as a data community could just do better, you know, putting it out there and helping people get it.
The other one that I think is really interesting, that you alluded to, is between software engineers and then downstream data teams. That's also one where, again, people will probably hate on me for saying this, but what I hear a lot from the data team is that the software team just doesn't care. They think of it as: we're gonna build out this thing, we're gonna build it and make it work right, we're gonna answer to our users. And then if somebody else is consuming the data exhaust, it's kind of on them to make sure that it's always useful. And I don't remember who said it, but there's this whole notion of software eats the world and then data eats software.
I think there's this culture change that's gonna have to happen where the data team ends up having clout within engineering equal to the software team's in order to make those changes happen. And some of the teams where you see these kinds of changes happening are places where you have really strong, assertive, forward-looking data leaders who have a lot of clout within their organization. And so they're able to guarantee that the raw materials coming in from software, which they do everything else with downstream, stay fit for purpose and stay useful.
But there is this organizational negotiation that has to happen in order for that to go forward.
[00:27:53] Unknown:
Another aspect of how to think about that is the question of push versus pull. Am I, the software team, defining what this contractual interface looks like and pushing that to the data engineers to then take and use how they will, or am I, the data team, pulling on the software team to conform to the data contract that I am trying to enforce? And that also plays into the question of external data sources, where I am a user of a SaaS product and I need to consume some of the data from that platform, whether it's, you know, Google Analytics or Salesforce or HubSpot or Segment or whatever it might be.
And they are pushing, you know, an interface that I have to conform to, or am I able to kind of pull on them to adapt to my needs?
[00:28:43] Unknown:
And, again, it's a stakeholder negotiation problem. Where I've seen it work, most teams start from the notion of: I'm gonna put Great Expectations or a testing tool just upstream of myself to defend against bad data. And if something goes wrong, then I will go and talk to that team. So it's just sort of on the consumer, right? By default, it is a pull motion, but it's a pull motion where sometimes it's a little bit hard to get on that other person's calendar. The more that teams take this product mentality of, hey, I'm gonna define the requirements, or I'm gonna define what the data is, and then push that out to everybody, I think that'll actually be good for organizations.
But what I'd say is you need to have a design process for that, and you need to include the people who are actually consuming the data. So if you just say the software engineering team is going to define what the logs look like, and they're gonna do it without talking to anybody who's using the data downstream, the likelihood that that's really gonna work consistently for the data team later on is just pretty low. Like, the software team won't even know the right questions to ask. I can't tell you how many times I've talked to a team where there's some kind of eventing system, an event logging system, and you build a machine learning model that looks at those events and uses them as one of your inputs for classification or something like that. And the upstream team changes the eventing system because they wanna plug in something new.
From their perspective, the schema hasn't changed. The events are still the same. So it often doesn't even occur to the software team that they might have just completely broken the machine learning models downstream. So, just to summarize: if it's gonna be a push model, you've gotta have a data design process that includes all the people who are gonna be consuming it. And you do see some teams doing this in a disciplined way, saying, we're gonna serve teams A and B, but, sorry, teams C and D, you are just not one of our stakeholders. So if you want to subscribe to this data contract, great, but we're not gonna gather requirements from you. It's hard to do, but maybe the right choice in some organizational settings.
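The eventing example above is a case where a plain schema check passes while the semantics drift. One way a contract can catch it is by pinning the enumerated values the downstream model was built against. A minimal sketch with hypothetical event types:

```python
import pandas as pd

# Hypothetical contract: the event types the downstream model was trained on.
KNOWN_EVENT_TYPES = {"page_view", "click", "purchase"}

def unknown_event_types(events: pd.DataFrame) -> set[str]:
    """Return event types the contract doesn't know about.

    The column names and dtypes can be unchanged while a new event type
    silently shifts the semantics, which is exactly what breaks models downstream.
    """
    return set(events["event_type"].unique()) - KNOWN_EVENT_TYPES

events = pd.DataFrame({"event_type": ["click", "purchase", "hover"]})
unexpected = unknown_event_types(events)
if unexpected:
    print(f"Contract violation: unrecognized event types {unexpected}")
```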
[00:30:44] Unknown:
Yeah. And that's an interesting way to view it, bringing it back to the idea of contracts as a legal document. There are people who are participating in that negotiation and in that agreement, and if you aren't a signatory on the contract, then that contract has no applicability to you. So there's that question of, okay, I as the software team or I as the data team am establishing this contract, and it only applies to the people who have explicitly agreed to it. Everybody else can either try to negotiate a new contract or they're just kind of left out in the cold. I'm interested in your thoughts on that.
[00:31:23] Unknown:
Yeah. I mean, it starts to get back towards how we use data in society, and, like, you start to get close to some data ethics questions there, right, about representation and inclusion. What I would say is I think there's a baby-steps conversation first, which is that in most organizations that I've seen, the software engineering team really doesn't think of the data engineering team as a team that can help them define requirements. They think of them as a team that sort of comes after in the process. And that can work. Right? Like, most companies are doing that. But if you really wanna be an organization that's making the most fluent use of data, data people have to be stakeholders on the software team. I would just start with that: they're at least one stakeholder group that you have to consider, or else everything else is gonna be way, way, way harder.
[00:32:08] Unknown:
And then that also comes into the question of what team topologies look like. Is it a data team, where all the data people are in one spot? Or is it the embedded model, where software teams have a data-oriented contributor as well, and that helps to inform the conversation and inform the direction of how they implement their work?
[00:32:29] Unknown:
I've never actually done this, but I've got the suspicion that if you went and did kind of a sentiment analysis over the last, probably, forty years, the preferred topology would, like, flip every five or ten years. Right? Like, oh, we wanna be mostly centralized with some feelers out to other groups, or, oh, we wanna be mostly embedded with a small working group that can share ideas. That organizational tension, I just think it's fundamental, because data teams have to go back and forth between this notion of services and product. The one place where you see it most strongly not happening is around data platform teams, where it's clear that what they do is build a data platform or data product within the team. It's, like, the one place where you don't hear people asking if the topology should shift this time around. From that data platform perspective, that's an interesting lens on data contracts as well, where
[00:33:18] Unknown:
as the platform team, is it your responsibility to help with the definition and enforcement of these contracts, or are you agnostic to the actual data that is being manipulated through your platform, meaning you're just providing the underlying infrastructure and services for other people to build on top of and to build and maintain their own contracts?
[00:33:37] Unknown:
Depends on the size of the team and kind of the stage that they're at. But for most companies that are big enough to need a data platform team, when you look at the number of data assets, so, like, tables or log files or things like that, it's, you know, at least dozens, probably hundreds, maybe a thousand. And at that point, it's just not feasible for the platform team to really get what's going on in all of those. So often you'll see teams that are sort of specialists in some subset of that. And that puts them in a place where they're dogfooding their own tools and they're in the workflow, so they can have empathy with everybody else using it. But the whole reason you use a platform is because you have a lot of stakeholders. If those stakeholders have worthwhile stuff in their heads, which presumably is why you hired them, you can't assume that there's gonna be a small group that can somehow know everything that everybody else knows. It's too much complexity.
[00:34:27] Unknown:
In your work of building and maintaining the Great Expectations project and its community and having conversations with people about their views and applications of data contracts, what are some of the most interesting or innovative or unexpected ways that you've seen these ideas applied?
[00:34:43] Unknown:
So all of these I'm going to caveat with "your mileage may vary." As a kind of core contributor on the project, every time I say these things, some of the really interesting stuff is also stuff that we don't completely support internally. Just saying that. But I talked to an energy company that is looking at using Great Expectations to do what they call self-healing data pipelines. And if this were a marketing team, I would groan and roll my eyes, but it's the engineering team. They've got data feeds coming in from a bunch of places. They change and break with some regularity, because they don't control the upstream data sources. And the ways that they break, there's always new random weird stuff out there, but a lot of the ways they break are kind of predictable, predictable in the sense that there's a small set of fixes that you can apply. So they're using Great Expectations to check data very frequently, many times a day, and detect those problems.
When possible, they immediately apply sort of an automated hotfix and then roll back and replace the corrupted data. It's not fully in production yet, but, like, they've thought it through in a pretty detailed way. It's a cool use case. I also talked to a team the other day that's publishing data on Amazon Data Exchange. And, basically, what they said is: hey, there's a lot of stuff floating around the Amazon Data Exchange, and the quality of that data varies wildly, and often it changes over time or breaks, so it's an amazing place to have something like a data contract. So they're looking at using Great Expectations Data Docs as a way of providing almost a warranty or quality guarantee on the data that they're publishing through this data exchange.
I think that's a really interesting one, because we've said for a long time that team interfaces are a place where these are particularly valuable. The notion of that happening not just across teams, but across companies or organizations where there's a purchase of data going on, that's interesting. Right? It pushes data contracts to the place where they actually start to get closer to the real contracting process of selling data.
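The self-healing pattern described above amounts to a check-fix-revalidate control loop. Here's a runnable sketch of that loop under stated assumptions: the violation detector stands in for a real contract check (such as a Great Expectations checkpoint), and the fix registry, column names, and failure modes are all hypothetical.

```python
import pandas as pd

# A predictable failure mode and its known hotfix (both hypothetical).
def clip_negative_amounts(df: pd.DataFrame) -> pd.DataFrame:
    return df.assign(amount=df["amount"].clip(lower=0))

KNOWN_FIXES = {"negative_amount": clip_negative_amounts}

def find_violations(df: pd.DataFrame) -> list[str]:
    """Stand-in for a real contract check, e.g. a checkpoint run."""
    return ["negative_amount"] if (df["amount"] < 0).any() else []

def run_with_self_healing(df: pd.DataFrame) -> pd.DataFrame:
    for violation in find_violations(df):
        fix = KNOWN_FIXES.get(violation)
        if fix is None:
            # New random weird stuff: no automated fix, so escalate to a human.
            raise RuntimeError(f"no automated fix for {violation!r}")
        df = fix(df)  # apply the predictable hotfix
    if find_violations(df):  # re-validate before replacing the corrupted data
        raise RuntimeError("hotfix did not resolve the contract violation")
    return df  # caller rolls back and replaces the corrupted partition

healed = run_with_self_healing(pd.DataFrame({"amount": [10.0, -3.0]}))
```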
[00:36:41] Unknown:
In your work of trying to understand the ways that Great Expectations is applicable to this area of data contracts and building these organizational interfaces for data, I'm wondering what are some of the most interesting or unexpected or challenging lessons that you've learned in the process. I mean, that's
[00:37:00] Unknown:
five years of my life at this point. One thing that's gone through my head just as we've been talking, that's kind of been growing on me, is that if you go back a couple years, I wasn't quite sure where orchestrators were gonna net out in all of this. Right? Like, there was a possibility that they were just gonna dissolve into the data warehouse. And I would say that I am a lot more bullish on orchestrators than I was a year ago, because of this notion of cross-tool compilation and having continuity. I think there's a really important role to play there. And, I mean, I know Snowflake is doing more Python. I think they're reaching in this direction, but it's hard for me to see a world where they get there and do it well enough and fast enough to really own that layer and do a good job. So, yeah, the aggressive version of this would be: I think there may be a play for the orchestrator types to almost disintermediate some of the workflow tooling that Snowflake and Databricks are building.
We'll see. I think it would still turn out well for Snowflake and Databricks as, like, the underlying compute systems, because they're still gonna manage the compute. But there's this interesting workflow and orchestration layer growing on top of that where the value prop is different and, yeah, just a lot of the usage is gonna be different. So I'm more bullish on orchestrators, and then everybody who ties into orchestrators. Right? And for Great Expectations, we think of ourselves as being this important data quality layer that travels really well and is super, super developer friendly. So I think to the extent orchestrators do well, that probably helps us too.
[00:38:28] Unknown:
For people who are exploring the space of data contracts in the manner that we've been discussing today, what are the cases where you see that as the wrong choice?
[00:38:38] Unknown:
I think it's the same thread that has run through the whole conversation, in the sense that I would like to say that there's a version of a data contract that everybody should be using, and the choice you should be making is how big it is and how much you should invest in it. So if you're a small data team just barely getting started, I think it's healthy to have some tests on your data, right, and think about it almost as an internal contract that doesn't have anybody else involved. You don't need to invest a lot of time in it or, you know, do a ton there. But you want the framework in place just from a hygiene perspective, so that your future self won't forget the details that you know today.
And then later on, when other stakeholders come calling and say, hey, we need to use this data, or we've started using this data, you're in a place where you can easily level up that proto contract into something bigger. So, because I think of contracts very much in this progressive way, I think there's a version that's small enough for everybody.
[00:39:31] Unknown:
As you continue to build and evolve the Great Expectations project and build your commercial product around it, what are some of the ways that you're thinking about evolving the support for data contract definition and validation in the near to medium term?
[00:39:48] Unknown:
I think we're already very powerful and very expressive on the validation side. I mean, there's a huge library of expectations, and the fact that you can always build something custom. So the place that I'm most interested in pushing on and working on as a team is tooling to get there faster. That includes things like data assistants, it includes just being really thoughtful about the workflows that people are already using, being in notebooks but not exclusively in notebooks, and just getting to a place where there's a really good toolset to define and diff and tweak and, you know, work with the expectations that make up data contracts in ways that travel well across very large data systems.
[00:40:26] Unknown:
That's an interesting thing to think about too: for the definition of these data contracts, there are different layers of meaning and understanding that go into it. If I say I want to make sure that this column has integers that are between 0 and 1000, that's the very literal "this is what I want to happen," but then there's also the contextual aspect of why I want this to happen. And one area that we didn't really explore yet is that question of translating between the contextual and business rules that I'm actually trying to enforce and the very tactical elements of what specific operations are being performed to ensure that that rule is being applied and validated, and being able to manage that collaboration between the very technical aspects of implementing this and the organizational and contextual aspects of understanding what the purpose of this is.
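One way to keep the "why" attached to the "what" at this level is to annotate the literal check with its business context. Great Expectations expectations accept a `meta` dictionary that is carried into Data Docs; the column, bounds, and meta fields below are hypothetical, and the API shape assumes a recent pre-1.0 release.

```python
import great_expectations as gx

context = gx.get_context()
validator = context.sources.pandas_default.read_csv("scores.csv")  # hypothetical file

# The literal technical rule, annotated with the business rule it enforces,
# so the contextual "why" travels with the tactical "what".
validator.expect_column_values_to_be_between(
    "score",
    min_value=0,
    max_value=1000,
    meta={
        "business_rule": "Scores are bounded by the vendor's 0-1000 scale",
        "owner": "risk-analytics",
    },
)
```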
[00:41:27] Unknown:
Yep. I really like the way you've said that, kind of those two different viewpoints. Because one viewpoint, the engineering viewpoint, is gonna be extremely detail oriented. You wanna be very, very much right in the data, you know, getting it exactly right. And then there's also gonna be this organizational view of, hey, we just wanna make sure that we have sanity checked everything and it's good enough, because, you know, a compliance officer or somebody like that isn't gonna be able to engage at the level of every single check for every single row. I think the notion of data quality scorecards and building tooling that helps people define and reuse those, I think that's a really interesting direction to go.
And one of the reasons I like it is because, from the perspective of, like, governance or compliance or consistency, you can give people what they need there. But from the engineering perspective, you can actually give tools that end up basically being bulk editors for the things you can already do. And so it actually helps resolve what was a tension in a lot of the previous generation of compliance and governance products, where lawyers would, you know, want one thing and that would create a ton of work for the engineers. I think with the way we've designed the abstractions for Great Expectations, you can make that workflow a lot easier for everybody.
And so we haven't built this all the way out, but it's one of the things we're pushing on that I'm excited about.
[00:42:44] Unknown:
Are there any other aspects of this space of data contracts and the technical and social aspects of building and implementing and maintaining and evolving them that we didn't discuss yet that you'd like to cover before we close out the show?
[00:42:57] Unknown:
I'll do one that's specific to us and then one that I think is just kind of generally interesting. So the one that's specific to us, and this is more around the Great Expectations commercial product, is I'm excited about what happens when you have specific user logins and roles, so you know who is performing an operation related to a data contract, and all the things that you can start to do there. One of the things that open source Great Expectations just doesn't do is there's no login, so there's no concept of a user, so you can't keep track of who's saying what or why they're saying it or where they're coming from. Once you layer that in, a lot of really interesting workflows open up. So in our commercial product, that's the thing that we're pushing on, and we are working with design partners to develop it. So if what I'm saying is interesting and feels like a good use case, then, yeah, we'd love to talk. We're being pretty selective right now, but, yeah, we'd love to engage with people.
I think that there is an interesting thread around data contracts and socially responsible use of data that I just haven't seen people really bringing up. Most of the conversations that I've heard have been about how this changes workflows for data people within organizations, which is very real, and, like, I'm totally there for that. But there's another version of this conversation where you say: if you wanted to provide more social accountability and visibility into important datasets or important information processing systems, how could you use data contracts to do that? And I just haven't really heard anybody ask that question in a serious way yet.
There's Elon Musk's thing about, well, we'll open source the algorithms for Twitter and then make it more free speech and transparent. And I kind of think that particular take is hogwash, but there's a version of that that I think could actually be very real, where you're taking these, like, deep information processing systems that right now are extremely opaque to anybody outside the system and using data contracts to bring other people in from a visibility perspective. And at least in kind of techy data Twitter, I haven't heard that talked about much. I think we should. I think there's an interesting conversation to be had there.
[00:45:09] Unknown:
Absolutely. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing or participate in your early design partnering, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:45:28] Unknown:
I'm gonna double down on what I said before. I think orchestrators are going to interesting places. I don't know if 2023 is the year that that gets big, but if it's not this year, it's 2024. I just think there's a lot of interesting stuff that's ready to happen there.
[00:45:42] Unknown:
Well, thank you for taking the time today to join me and share your thoughts and experiences on this space of data contracts, the ways that they can be considered and applied, and the technical and social aspects that are embedded in that. I appreciate your time and energy on that, and the work that you're doing with Great Expectations to help support that conversation and the growth of that layer of the data ecosystem. So thank you again for that, and I hope you enjoy the rest of your day. Thanks, Tobias. Thanks for asking great questions. It's a pleasure to be here.
[00:46:20] Unknown:
Thank you for listening. Don't forget to check out our other shows: Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction to Abe Gong and Data Contracts
Defining Data Contracts
Organizational Approaches to Data Contracts
Technical Implementations and Validations
Data Contracts as Safety Mechanisms
Data Mesh and Data Contracts
Challenges in Automation and Integration
Push vs. Pull Models in Data Contracts
Data Platform Teams and Contracts
Innovative Uses of Data Contracts
Lessons from Great Expectations
Contextual and Business Rules in Data Contracts
Future Directions and Social Responsibility