Summary
Data pipelines are complicated, business-critical pieces of technical infrastructure. Unfortunately, they are also difficult to test, leading to a significant amount of technical debt that contributes to slower iteration cycles. In this episode James Campbell describes how he helped create the Great Expectations framework to help you gain control and confidence in your data delivery workflows, the challenges of validating and monitoring the quality and accuracy of your data, and how you can use it in your own environments to improve your ability to move fast.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
- Your host is Tobias Macey and today I’m interviewing James Campbell about Great Expectations, the open source test framework for your data pipelines which helps you continually monitor and validate the integrity and quality of your data
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by explaining what Great Expectations is and the origin of the project?
- What has changed in the implementation and focus of Great Expectations since we last spoke on Podcast.__init__ 2 years ago?
- Prior to your introduction of Great Expectations what was the state of the industry with regards to testing, monitoring, or validation of the health and quality of data and the platforms operating on them?
- What are some of the types of checks and assertions that can be made about a pipeline using Great Expectations?
- What are some of the non-obvious use cases for Great Expectations?
- What aspects of a data pipeline or the context that it operates in are unable to be tested or validated in a programmatic fashion?
- Can you describe how Great Expectations is implemented?
- For anyone interested in using Great Expectations, what is the workflow for incorporating it into their environments?
- What are some of the test cases that are often overlooked which data engineers and pipeline operators should be considering?
- Can you talk through some of the ways that Great Expectations can be extended?
- What are some notable extensions or integrations of Great Expectations?
- Beyond the testing and validation of data as it is being processed you have also included features that support documentation and collaboration of the data lifecycles. What are some of the ways that those features can benefit a team working with Great Expectations?
- What are some of the most interesting/innovative/unexpected ways that you have seen Great Expectations used?
- What are the limitations of Great Expectations?
- What are some cases where Great Expectations would be the wrong choice?
- What do you have planned for the future of Great Expectations?
Contact Info
- @jpcampbell42 on Twitter
- jcampbell on GitHub
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- Great Expectations
- Podcast.__init__ Interview on Great Expectations
- Superconductive Health
- Abe Gong
- Pandas
- SQLAlchemy
- PostgreSQL
- RedShift
- BigQuery
- Spark
- Cloudera
- Databricks
- Great Expectations Data Docs
- Great Expectations Data Profiling
- Apache NiFi
- Amazon Deequ
- Tensorflow Data Validation
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform. If you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances, and they've got GPU instances as well.
Go to dataengineeringpodcast.com/linode, that's L I N O D E, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. You listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Corinium Global Intelligence, ODSC, and Data Council.
Upcoming events include the Software Architecture Conference, the Strata Data Conference, and PyCon US. Go to dataengineeringpodcast.com/conferences to learn more about these and other events and take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey, and today, I'm interviewing James Campbell about Great Expectations,
[00:01:42] Unknown:
the open source test framework for your data pipelines, which helps you continually monitor and validate the integrity and quality of your data. So, James, can you start by introducing yourself? Absolutely. It's great to be here. Like you said, my name is James Campbell, and I am currently working at Superconductive. We are really pivoting at this point to focus pretty much full time on Great Expectations and helping to build out data pipeline tests. My background, academics-wise, was math and philosophy. And I spent a little over a decade working in the federal government space on national security issues, focusing on analysis first of cyber threat related issues and then political issues.
[00:02:21] Unknown:
That's an interesting combination of focus areas between math and philosophy. And I'm also sure from your background of working in government and national security issues that there was a strong focus on some of the data quality issues and some of the problems there. So it seems logical that you would end up working in the space with Great Expectations as something that would be of primary concern given your background.
[00:02:44] Unknown:
Absolutely. I mean, for me, I've always been very interested in understanding how we know what it is that we know and how we can really convey the confidence that we have in raw data and assessments on that data to other people. It's been a really continuous thread for me, especially like you mentioned in the intelligence community. There's a tremendous focus on what is often termed analytic integrity, or ensuring that the entire process around understanding a complex situation is really clearly articulated to ultimate decision makers. And I think that's an important paradigm for any analytics or data related endeavor because we need to be able to ensure that we're picking the right data for the job, that we've got all the data that we need to be able to understand what we can. And, of course, that's often not the case, and so we end up needing to find proxies or, in other ways, compromise. And so then we need to figure out how we can convey that effectively to the people who will ultimately be using the data for some sort of a decision. And do you remember how you first got involved in the area of data management?
Absolutely. I think for me, there was a pretty pivotal time when I was leading a team of data scientists who were training models and sharing them out with their colleagues. And in addition to the actual model building process, we had a whole focus on education and training for people who would be able to use these kinds of more sophisticated analytic models in their day to day work. And there were a lot of discussions on the team about how we could effectively help them understand what kinds of data requirements were in place for the kinds of questions that they wanted to ask. And what I found was that that really meant that we needed to be able to not only understand the data that we had available, but also understand the kinds of data that other people had available. And that meant that there was a huge amount of essentially metadata management and knowledge management and sharing that needed to happen.
And when you do that across teams and across organizations, it really amplifies the challenge. But, of course, I think that also makes it a much more interesting and exciting area to get to work in. And so
[00:05:03] Unknown:
in terms of great expectations, it definitely helps to solve some of these issues of the integrity and quality of the data. And I know that particularly with some of the recent releases, it helps to address some of the issues of communications around what the dataset contains and some of the metadata aspects of where it came from, what the context is. But before we get too deep into that, I'm wondering if you can just give a broad explanation about what the Great Expectations project is and some of the origin story of how it came to be. Absolutely. Great Expectations,
[00:05:35] Unknown:
we use the tagline always know what to expect of your data. And it's a project that helps people really clearly state the expectations that they have of data as well as communicate those to other people. I mentioned that experience I had of leading a team where we were doing model building. And one of the key parts of the origin story was wanting to basically be able to say, you know, this data should have a particular distribution or particular shape around certain variables if you wanna be able to ask a question. And I found that, again and again, what I wanted to do was not just ship people, you know, an API where they put data in and they get an answer back, but rather ship them an API that they could use responsibly and confidently.
That was a key part of the origin of Great Expectations for me. At the same time, my colleague now, Abe Gong, was working on health care data, and he, again and again, was observing breakages in data pipelines that they were building where another team, usually because they were making an improvement or catching an error, would make a change to an upstream data system, and that would trickle down and cause breakages throughout the rest of the data pipeline. And he and I were talking about these related problems. And, of course, both of us had experienced the other one. But we realized, well, you know, both of these would be really well served by this sort of a declarative approach to describing expectations. And, you know, that's the origin of the project. Just realizing that this problem was so general that it was coming up not just in different places, in different industries, but with different manifestations. Right? The same underlying problem has these kinds of different symptoms.
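To make the declarative idea concrete, here is a minimal sketch using the classic Pandas-backed Great Expectations API, roughly as it existed around the time of this episode; the file name, column names, and allowed values are hypothetical, purely for illustration:

```python
# Minimal sketch: declarative, named expectations with the classic
# Pandas-backed API. File, columns, and values below are hypothetical.
import great_expectations as ge

# Wrap an ordinary CSV load so the resulting dataframe can carry expectations.
df = ge.read_csv("events.csv")

# Expectations are statements about the data, not about the pipeline code.
df.expect_column_values_to_not_be_null("user_id")
df.expect_column_values_to_be_in_set("status", ["active", "churned", "trial"])
df.expect_column_mean_to_be_between("session_length_minutes", 1, 120)

# Validate the batch against everything declared above.
results = df.validate()
print(results)
```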
[00:07:29] Unknown:
And about 2 years ago now, I actually had both you and Abe on an episode of my other show where we talked about some of the work that you were doing with great expectations. And I know that at the time, it was primarily focused on being used in the context of the pandas library and Python and some of the ways that people were doing exploratory analysis within pandas and then being able to integrate great expectations into that workflow and have a set of tests that they could assert at runtime later on once they got to production with it. So I'm wondering if you can just talk a bit about some of the changes in focus or the evolution of the project since we last spoke.
[00:08:08] Unknown:
Absolutely. It's amazing how much things have evolved in those 2 years and especially in the last 6 months or so. Just like you said, originally, our focus was very much on validation and supporting exploratory data analysis and that sort of a workflow. And also very much we were Pandas-centric. There's been evolution in a lot of different dimensions. The first one is just in terms of the kinds of data that Great Expectations can interact with. From the very beginning, the library wasn't specific to Pandas. It wasn't about any particular form of the data. It was really about the semantics of the data, how people understand it, what it means for the context of a particular analysis. So we've been able to realize that goal a little bit by expanding to now support SQLAlchemy and, by extension, all of the popular big SQL databases. So we have users running Great Expectations on Postgres, on Redshift, on BigQuery.
Also, we've expanded into Spark. So, whether that's Cloudera clusters or Spark clusters that are managed by teams or Databricks, we've got users being able to evaluate expectations on all of those. And I think, again, one of the things that's really neat is actually it's the same expectations. Right? So the same expectation suite now can be validated against different manifestations of data. So if you have a small sample of data that you're working with on your development box in Pandas, and then you wanna see whether those same expectations are met by a very large dataset out on your Spark cluster, you can just seamlessly do that. The next big area that we've pushed is in terms of integrations.
It was a big pain point for users, I think, to figure out how they can actually, you know, stitch Great Expectations into their pipelines. And so we've done a lot of work in creating what we call a data context, which manages expectation suites, manages data sources, and can bring batches of data together with expectation suites to validate them and then store the validation results, put them up on cloud storage. You could, you know, have all of your validations, for example, immediately uploaded to S3 and available to your team. So that's been another big area of development. And, man, again, I know I'm just going on and on here, but there are a couple other big areas. So one of them has been the thing that I think really people have been able to use to resonate with Great Expectations and see things move forward, which is data docs, we call it, or the ability to generate human readable HTML documentation about the artifacts of Great Expectations. So about expectation suites, about validation results. And so we basically generate a static site that you can look at, and it really helps you get a quick picture of what you have in your data, as well as something that you can share with your team. And then the last area is profiling.
We've done a lot of work to make it so that you can use Great Expectations, the library, before you've really zeroed in on what the expectations are. So it becomes this iterative process of refinement where in an initial profiling, we basically say, you know, I expect these hugely broad ranges. I expect the mean of a column's values, for example, to be between negative infinity and positive infinity. Well, obviously, that's true. But as a result of computing that, we give you the metric, what the actual observed mean is. And you can use that, especially when you're combining that with documentation, to profile and get a really robust understanding
[00:11:49] Unknown:
of your data right away. So there's a lot there. There's a lot of innovation and work that we've been able to do, and it's been a really fun thing to get to focus more on the project. Yeah. The profiling in particular, I imagine, is incredibly valuable for people who are just starting to think about how do I actually get a handle on the data that I'm using and get some sense of what I'm working with, particularly if they're either new to the project or if they've just been running blind for a long time and wanna know how to actually do that.
[00:12:18] Unknown:
One of the things that I see in our Slack channel a lot is when somebody will say, you know, I ran this expectation and, you know, it's failing, and I don't know why. And then they look into their data and it's, well, because it's not true. And I just never cease to love that sense of surprise and excitement that people have when they really encounter their data in a richer way or in a way that they hadn't seen it before. What profiling does is it just makes that happen across a whole bunch of dimensions all at the same time. Exactly right. I think more and more what we're finding is when a user is first getting started with Great Expectations, it was intimidating to sit down at a blank notebook and figure out, where do I go? How do I get started? And so now what they can do with profiling is start off with a picture of their dataset.
You know, they get to see some of the common values and, you know, which columns are of which types and distributions, and it really gives them a way to dive right in. And then we can actually generate a notebook from that profiling result that becomes the basis of a declarative exploratory process. So we actually can sort of guide you through some of the initial exploration that makes sense based on the columns and types of data that you have.
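As a rough illustration of the profiling workflow James describes, the sketch below emits deliberately broad candidate expectations and, as a byproduct, the observed metrics you would use to tighten them; it is illustrative Python, not the actual Great Expectations profiler API, and the column names are made up:

```python
# Illustrative profiling sketch (not the actual Great Expectations profiler):
# emit deliberately broad candidate expectations and, as a byproduct, the
# observed metrics that make it easy to tighten them later.
import math
import pandas as pd


def naive_profile(df: pd.DataFrame) -> dict:
    """Return broad candidate expectations plus observed metrics per column."""
    profile = {}
    for column in df.columns:
        series = df[column]
        observed = {
            "null_fraction": float(series.isna().mean()),
            "distinct_count": int(series.nunique()),
        }
        candidates = []
        if pd.api.types.is_numeric_dtype(series):
            observed["mean"] = float(series.mean())
            # Trivially true, in the spirit of "mean between -inf and +inf";
            # the observed mean is what you actually use to tighten it.
            candidates.append({
                "expectation": "expect_column_mean_to_be_between",
                "kwargs": {"column": column, "min_value": -math.inf, "max_value": math.inf},
            })
        profile[column] = {"observed": observed, "candidate_expectations": candidates}
    return profile


if __name__ == "__main__":
    from pprint import pprint
    df = pd.DataFrame({"churn_rate": [0.02, 0.03, 0.05], "region": ["us", "eu", "us"]})
    pprint(naive_profile(df))
```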
[00:13:39] Unknown:
And going back to the beginning of the project, at the time that you were working with Abe to define what it is that you were trying to achieve with Great Expectations, I'm wondering what your experience had been as far as the available state of the art for being able to do profiling or validation and testing of the data that you were working with and maybe any other tools or libraries that are operating in the space either at the time or that have arrived since then?
[00:14:07] Unknown:
Great question. I think there's a lot of really good practice that I've seen, you know, in this broader space. I think one of the things that's important to mention is there's a huge amount of this kind of work that just happens out of band. So it's not so much that there is a tool for it even. It's that what we see are people having meetings, coordination meetings, data integration meetings, big Word documents. So, you know, even though it's not an exotic thing to say, I think it's really important to mention that and remember that as a kind of core part of the original state of the art for how people work on this. The other thing is there's a lot of roll your own. A lot of people who are writing tests for their data just put it into their code, and it's kind of indistinguishable from their pipeline code itself, which was one of the key problems that we wanted to solve with Great Expectations, making sure that tests could focus on data instead of being part of the code and have a strong differentiation there. But it's still an important part. So a lot of times when I talk to people about their current strategy for pipeline testing, they say, oh, yeah. We absolutely do tests. It's just that we've written a lot of our own.
There definitely are some commercial players in this space. A lot of the big ETL pipeline tools, whether that's open source like NiFi or some of the big commercial players, have data quality components or plugins that you can use. And I think that's really, you know, obviously, I think, a best practice to have some sort of test. So I think that's also really important. You know, I think there's a lot of things that people do with just metadata management strategies. So, you know, capturing, whether that's in a structured way or a little bit less structured, you know, a metadata store for datasets. So I see a lot of work there, which is really exciting. And then the other thing I'll say, specifically to your question of how things have evolved, is there are two other open source projects that I see having come out of some really big companies, which I think is exciting. You know, the investment that they're making and their willingness to open source that really demonstrates the scale of the problem and interest here. And the ones I'd mention specifically are Deequ from Amazon and TensorFlow Data Validation, which, of course, is very TensorFlow specific, but also gives you a lot of ability to have that kind of insight into the data that you're observing and how it's changed.
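To make the contrast with the roll-your-own approach concrete, the short sketch below shows an ad-hoc assertion buried inside transform code next to the same requirement stated as a declarative, named expectation; it assumes the classic from_pandas helper, and the dataframe and column names are hypothetical:

```python
# Contrast: an ad-hoc check buried in transform code vs. the same requirement
# as a declarative, named expectation about the data. Names are hypothetical.
import pandas as pd
import great_expectations as ge  # classic API, assuming from_pandas is available

raw = pd.DataFrame({"amount": [10.0, 12.5, None]})


def transform(df: pd.DataFrame) -> pd.DataFrame:
    # Roll-your-own style: the data check is indistinguishable from pipeline logic.
    assert df["amount"].notna().all(), "amount should never be null"
    return df.assign(amount_cents=df["amount"] * 100)


# Declarative style: the requirement lives with the data, not the transform.
batch = ge.from_pandas(raw)
result = batch.expect_column_values_to_not_be_null("amount")
print(result)  # reports success/failure plus which values were unexpected
```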
[00:16:32] Unknown:
And one of the values specifically that Great Expectations has, which I think is hard to overstate, is what you were saying earlier about being able to build the expectations against a small dataset and then being able to use those same assertions across different contexts because of the work that you've done to allow Great Expectations to be used on different infrastructure, whether it's just with Pandas on your local machine or on a Spark cluster or integrated into some of these workflow management pipelines. And so more broadly, I'm wondering if you can just talk about some of the types of checks and assertions that are valuable to make that people should be thinking about, and some of the ones that are notable inclusions in Great Expectations out of the box? Yeah. Great, great question. And
[00:17:21] Unknown:
and I think, you know, to your point about the ability to work across different sizes of data, I would also add really quickly that I think one of the neat ways that we see that is just making sure that teams are working with the same data. So sometimes it's not even about there being different sizes, but, you know, when data got copied from one warehouse to another, did all of it move, for example. So to that end, I think one of the most important tests that is easy to overlook is missing data. You know, do we have everything that we thought we would have, both in terms of columns and fill, as well as in terms of data deliveries.
One of the things that is really powerful, of course, about distributed systems is they're very failure tolerant. But one of the things that that can mean is they're silent to, especially, small batch delivery failures. You know, I think set membership is another area of really basic, but really important kinds of testing. And then I think the more exotic things around distributional expectations are also really important. With respect to some of the notable expectations that are in Great Expectations, I think I would actually highlight things that people have added or extended using the tool as probably the most exciting and innovative parts. So one of the things that I've seen that I really thought was neat was a team that basically just took a whole bunch of regular expressions that they were already using to validate data, but that were basically inscrutable, and used those to create custom expectations that were then very meaningful to the team. So, for example, you know, we could say that I expect values in this column to be part of our normalized log structure, and that corresponds to being able to be, you know, matched against a bunch of regexes or have some other kind of parsing logic applied. But what it's doing is translating something that a machine is really, really good at checking, but that is very hard for a person to understand, into something that a human really understands immediately, intuitively. It has all the business connection for them, but it would have been very hard for a machine to verify without that translation.
So it's really, I think, exciting to be able to help link that up.
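A sketch of that regex-wrapping idea: a pile of inscrutable regular expressions gets a single business-meaningful name. In Great Expectations this would typically be registered as a custom expectation (the exact mechanism varies by version), so it is shown here as a plain helper; the regexes and the "normalized log structure" name are hypothetical:

```python
# A business-meaningful name wrapping machine-checkable regexes.
# The patterns and the naming convention are purely illustrative.
import re
import pandas as pd

# The machine-checkable part: regexes nobody wants to read in a pipeline review.
NORMALIZED_LOG_PATTERNS = [
    re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z "),  # ISO-8601 timestamp prefix
    re.compile(r" (INFO|WARN|ERROR) "),                      # severity token
]


def expect_values_to_match_normalized_log_structure(values: pd.Series) -> dict:
    """Human-meaningful name over the regex checks; returns a GE-style result dict."""
    matches = values.astype(str).apply(
        lambda v: all(p.search(v) for p in NORMALIZED_LOG_PATTERNS)
    )
    unexpected = values[~matches]
    return {
        "success": bool(matches.all()),
        "result": {
            "unexpected_count": int(len(unexpected)),
            "partial_unexpected_list": unexpected.head(5).tolist(),
        },
    }


logs = pd.Series(["2020-02-01T10:00:00Z  INFO  user login", "garbage line"])
print(expect_values_to_match_normalized_log_structure(logs))
```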
[00:19:43] Unknown:
And I think that goes as well into some of the non-obvious ways that Great Expectations can be beneficial to either an individual or a team who are working with some set of data. And I'm wondering if there are any other useful examples of ways that you've seen Great Expectations used that are not necessarily evident when you're just thinking specifically in terms of data testing and data validation.
[00:20:05] Unknown:
Absolutely. I think one team that I saw do something I thought was really clever was they effectively built an expectation suite that didn't have most of the actual expectation values filled in. And then they took that kind of template, and effectively, it became a questionnaire. What is the minimum value for the churn rate that would be unusual? And sent that to their analysts, the people who were consuming reports built from the data that they managed, in order for them to fill it in. And so it became a way for there to be a really structured conversation around what these two different teams understood the data to mean and how it should appear, which helped the engineering team understand the business users better and the business users understand the kinds of problems that the engineering teams were facing. So I thought that was a really, you know, a fun use case. Another one that I've seen that I think is really neat is what I call the pattern of fit for purpose. And the basic idea is, you know, we talk about pipeline tests as something you can run. Obviously, you can run them in a pipeline, but usually, you know, before your code or after your code. And, you know, just like with a pipeline, there's a lot of branching. With expectations, it's the same.
There are teams that are gonna use a dataset in one way, and so they have some expectations. And another team may use the same data in a very different way and end up having very different expectations around the same data. So that ability to elucidate that realization
[00:21:42] Unknown:
or that, reality of data is really fun. Yeah. I definitely see the fact that Great Expectations has this sort of dual purpose of, 1, the operational characteristics of working with your data of ensuring that I'm able to know at runtime that there is some error because either the source data has changed or there's a bug in my manipulation of that data. And so I wanna know that there are these invariants that are being violated, but at the same time, it acts as a communications tool, both in terms of the data docs that you mentioned, but also just in terms of what tests am I actually writing and why do I care about them, and then using that as a means of communicating across team boundaries so that everybody in the organization has a better understanding of what data they're using and how and for what purpose.
[00:22:23] Unknown:
Exactly.
[00:22:24] Unknown:
Along those lines too, it's interesting and useful to talk about some of the aspects of data pipelines in terms of what are some of the cases that can't be validated in a programmatic fashion and some of the edge cases of where great expectations hits its limitations and you actually have to reach to either just a manual check or some other system for being able to ensure that you have a fully healthy pipeline end to end and just some of the ways that great expectations,
[00:22:53] Unknown:
there can be bent to fit those purposes or needs to be worked around or augmented? Yeah. That's a tough question, obviously, for somebody who's building something, because I love to think that it's great for everything. But, you know, of course, there are a lot of really challenging areas. One of them that I think is really interesting and that has been a lot of discussion inside our team is the process of development of a pipeline. So while you're actually still doing coding, it's very tempting, I think, to blur the line between what is effectively a unit test or an integration test around a pipeline and a data pipeline test. So I think Great Expectations is probably not the right choice when you're doing that active development because you may not yet know what the data should look like. And while you're doing that very rapid iteration, that's probably even too much. But another area that I think is interesting for this reason is anomaly detection. Anomaly detection is really tempting with Great Expectations. And I want to love how strong of a tool we are there. At the end of the day, though, I think anomaly detection always has a huge amount of precision recall trade off that needs to be dealt with. And so when you have a situation where you wanna use Great Expectations, and what you're observing is that your underlying question is, well, you know, what is the precision recall trade off that I have for an expectation, I think that's another area where what I would just say is we still have a lot of work to do. Because insofar as what we're trying to do is communicate things well, there are solutions. And one of the solutions that we're working on right now, basically, is the ability to have multiple, like I was describing earlier, sets of expectations on the same dataset that reflect different cases for different points on that precision recall trade off space. And then I guess there's kind of another area, which is the overall coverage, or when you want to use Great Expectations in the context of sort of acceptance testing for a final product. I think, you know, it's always good to make sure you have some point in the process where a human is in the loop for a high stakes decision. And so I would be wary of somebody saying, oh, well, we've got all of our expectations encoded. It's, you know, good to have some meta process around the use of expectations in a pipeline that allows you to check that and really assess the coverage
[00:25:38] Unknown:
that you have. Yeah. That's definitely an interesting thing to think about, because for people who are used to using unit tests in the context of application development, there's the concept of code coverage to give some sense that the application has been tested in some fashion. I'm wondering how that manifests in the context of data pipelines and how you identify areas of missing coverage or think about the types of tests that you need to be aware of and that need to be added, particularly for an existing pipeline that you're trying to retrofit this onto?
[00:26:11] Unknown:
I think the issue of coverage in general is just fascinating in the context of data pipelines because it really opens such a fraught question of what it even means to have asked everything that you can about a dataset or a pipeline and a set of code. So I think what Great Expectations is sort of helping you do is complement the process of one kind of testing, where you're asking, have I been able to anticipate everything that I see, with another kind of testing, where you're asking if what you see is like the kinds of things that you anticipated. Then what that means is I think it really comes back to that question of fit for purpose.
So, for example, if you are going to eventually be using a dataset that you're creating for a machine learning model, there are a variety of features with different levels of importance to that model, and how well each of those features over the space of your function reflects the kinds of things that you saw in the training dataset, or how many elements you see in clusters that you observed in the context of an unsupervised learning problem, are good ways for you to diagnose that there may be a problem, but they're really just part of the overall analytic process. And maybe I think the right way to spin this really positively is that, just like what we were saying at the very beginning about some of the more basic tests, you know, are these values the ones that the data provider said they were going to be providing, what we're really allowing you to do is have a robust conversation and process of exploring and understanding
[00:27:46] Unknown:
and getting new insights out of data. And now I'm wondering if we can dig a bit deeper into how Great Expectations itself is actually implemented and particularly some of the evolution of the code base moving from your initial implementation of focusing on integrating with Pandas to where it is now and how you're able to maintain the declarative aspects of the tests and proxy that across the different execution,
[00:28:12] Unknown:
contexts that you're able to run against? Yeah. This is a really fun question. This is what I've been spending a lot of time thinking about right now. And, actually, we're in the process of launching another major refactor to the way that this works. The basic idea for us is that expectations are named, and that's what allows us to have the declarative syntax. So there's this concept of the human understandable meaning, what it is that you expect. And as you recall these expectations, we always use these very long, very explicit names, expect column values to be in set. And we then have a layer of translation that is available per expectation and per back end. One of the key changes that's happened is we've introduced an intermediate layer called metrics that allows expectations to be defined in terms of the metrics that they rely on. And then the implementations can translate the process of generating that metric into the language of the particular back end that they're working on. So, for example, if you have an expectation around a column minimum, then rather than actually translating that directly into a set of comparisons, what we're doing is asking the underlying data source for the minimum value and then comparing that to what you expected. There's, of course, a little bit of magic around ensuring that that works and scales appropriately on different back ends and that we can bring back the appropriate information for different kinds of expectations, you know, some of which work, like I was just describing with min, in an aggregate way, and some of which look across rows of data one at a time. But that's been a really important part of the evolution. And then where we're going next is to have expectations be able to even encompass a little bit more logic so that they will also, in the same part of the code base, contain the translation into the full verbose language, you know, the locale specific documentation version of an expectation.
So if the name, expect column values to be less than, is fixed, and that gets translated in 1 way to the back end implementation.
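An illustrative sketch of the metric layer described here: the expectation is defined against a named metric (the column minimum), and each backend produces that metric in its own dialect. The classes and names below are invented for illustration and are not Great Expectations' internal API:

```python
# Invented classes illustrating metric dispatch per backend; not GE internals.
from dataclasses import dataclass

import pandas as pd
import sqlalchemy as sa


class PandasMetricProvider:
    def __init__(self, df: pd.DataFrame):
        self.df = df

    def column_min(self, column: str):
        return self.df[column].min()


class SqlAlchemyMetricProvider:
    def __init__(self, engine: sa.engine.Engine, table: str):
        self.engine, self.table = engine, table

    def column_min(self, column: str):
        # Push the aggregation down to the database instead of pulling rows back.
        query = sa.text(f"SELECT MIN({column}) FROM {self.table}")
        with self.engine.connect() as conn:
            return conn.execute(query).scalar()


@dataclass
class ExpectColumnMinToBeBetween:
    column: str
    min_value: float
    max_value: float

    def validate(self, provider) -> dict:
        observed = provider.column_min(self.column)  # the metric, backend-agnostic
        return {
            "success": self.min_value <= observed <= self.max_value,
            "result": {"observed_value": observed},
        }


# The same declarative expectation runs against either backend.
expectation = ExpectColumnMinToBeBetween("session_length_minutes", 0, 10)
pandas_result = expectation.validate(
    PandasMetricProvider(pd.DataFrame({"session_length_minutes": [3, 7, 9]}))
)
print(pandas_result)
```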
[00:30:32] Unknown:
Parameters, like that it usually should hold more than 80% of the time, or whatever additional parameters are stated. All of that is coming together in the code base to make it much easier for people to extend. And on the concept of extensions, I know that one of the things you released recently is a blog post about adding a plugin for building automatic data dictionaries, but then there's also the extensibility of it in terms of the integrations that are built for it, particularly for projects such as Dagster or Airflow, or the different contexts that Great Expectations is being used within, or how it can be implemented as a library. So I'm wondering if you can talk about some of the interfaces that you have available for both extending it via plugins as well as integrating it with other frameworks.
[00:31:17] Unknown:
Yeah. Great. That's a really fun area too. So a lot of that resides in the concept of the data context that I mentioned at the beginning of our conversation. The data context makes it really easy to have a configuration. It's a YAML based configuration where you can essentially plug different components together. So, for example, you can plug in a new data source that knows how to register with Airflow or that knows how to read from a particular database that you have or an S3 bucket that you maintain. And so there's this composition element that the data context provides. And then in addition to that, each of the core components of GE are designed to be really friendly for subclassing.
And the data context allows you to dynamically import your extensions of anything. So the data dictionary approach, for example, was effectively a custom renderer. So it took the way that we were building the documentation pages and just modifies it a little bit. So it's still using the same underlying Great Expectations artifacts, the validation results and expectation suites, but it's combining the data in a different way and building some new elements and putting them on the top. What we're really hoping is that, as the community gets more and more engaged with Great Expectations, we can start to develop effectively a gallery, a location where people can post the extensions that they're building and begin to share, whether it's a new page or artifact or a new data connector or something else, like an actual expectation suite. So a domain specific set of knowledge that they have around, for example, a public dataset that they've been working with. What are some of the expectations that were useful for their project? Even those could be shared and hosted in the gallery. Yeah. I definitely think that the extensibility,
[00:33:08] Unknown:
particularly for integrating with some of these workflow management systems, is highly valuable for how Great Expectations is positioned. I remember talking to Nick Schrock about the Dagster project, and I was seeing that there's an integration with that, and then also things like Airflow or Kedro. And so the fact that Great Expectations as a standalone application works with SQLAlchemy and Pandas and Spark is definitely valuable, but I imagine that by being able to be plugged into these workflow engines, it actually expands the overall reach beyond just those data sources because of the fact that it's operating within the context of what that framework already knows about as far as the datasets and the information that it has in the context of processing that data. So I'm curious, what are some of the ways, beyond the surface level of Great Expectations as its own application,
[00:34:01] Unknown:
how those integrations expand its overall reach and some of the potential that it has? Well, I think the key thing that it does is it really opens up the idea that Great Expectations can become a node itself in an overall data processing pipeline. So rather than, you know, I run Great Expectations, then I run my pipeline, then maybe I run Great Expectations again, you have an orchestrated pipeline that can have dependencies, that can have conditional flow, and Great Expectations can really play happily in that ecosystem. It can signal back to Airflow that a job has successfully produced an artifact. It can automatically build and put documentation, for example, on a public site and fire off a Slack notification that says that this validation happened and it was successful, and then immediately work together with Airflow to continue processing or delivering data to teams. So it's that ability to have Great Expectations be a part of this broader toolkit available to data engineers that I think is where we're going with these plugins. So you get that advantage of the declarative and the human understandable expectations.
Also, of course, the, you know, critical, machine verifiable expectations. And then it becomes a part of your overall deployed pipeline.
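A hedged sketch of that "validation as a node" pattern, using a plain Airflow PythonOperator rather than any official Great Expectations operator; the DAG id, the validate_batch callable, and the notification step are placeholders. The point is that a failed validation raises and stops downstream tasks, while a success lets documentation publishing and further processing continue:

```python
# Hypothetical Airflow DAG: a validation task gates downstream processing.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def validate_batch(**context):
    # Placeholder for running a Great Expectations validation (for example via
    # a data context / checkpoint) and returning its result dictionary.
    result = {"success": True, "statistics": {"evaluated_expectations": 12}}
    if not result["success"]:
        # Failing loudly is what signals Airflow not to run downstream tasks.
        raise ValueError(f"Data validation failed: {result}")
    return result["statistics"]


def publish_docs_and_notify(**context):
    # Placeholder: rebuild data docs, push them to shared storage, and send a
    # Slack notification that the validation succeeded.
    print("Docs published, Slack notified.")


with DAG(
    dag_id="pipeline_with_validation",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    validate = PythonOperator(task_id="validate_batch", python_callable=validate_batch)
    publish = PythonOperator(task_id="publish_docs", python_callable=publish_docs_and_notify)
    process = PythonOperator(task_id="process_data", python_callable=lambda: print("processing"))

    # Downstream tasks only run if the validation task succeeds.
    validate >> [publish, process]
```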
[00:35:25] Unknown:
And I know that we've spoken a bit about some of the interesting or innovative or unexpected ways that Great Expectations is being used within these different contexts of communication and execution, but I'm wondering if there are any other areas or any other interesting examples that we didn't touch on that you think are worth calling out. I think the most interesting examples
[00:35:44] Unknown:
for me are about the different domains where Great Expectations is being used. You know, frankly, I think we've really covered a lot of the ways that it gets used, and it's more that people are using Great Expectations from all kinds of teams. Actually, let me revise that. I'll give you another specific one that I thought was fun. We had a team that came into the Great Expectations Slack recently and talked about how they were using GE in a way that I thought was really, you know, another fun one, which is they have a centralized data science team, because, you know, it's a scarce resource to be able to ask for a specific deep dive analysis.
And a lot of distributed business units, which all have similar but not the same data. And so the data science team was spending a lot of time translating very similar analyses that all these different business units had, you know, geographically different. But from the business perspective, they were doing very similar things. And so they needed to find a way to reduce the time that they were spending doing translation for each of these different business units. And Great Expectations is really a very neat fit there because it allows this one centralized team to be able to effectively talk to a lot of different teams. So it's not about change. It's about a lot of different teams that are doing similar things with slightly different data, and being able to really quickly articulate what it is that they need to see and how that compares to what they're getting from each of those teams. And we've already talked a bit about some of the limitations of Great Expectations
[00:37:18] Unknown:
of where there are areas in a pipeline that just need to have some sort of manual intervention or human input. But I'm wondering what are some of the other areas where Great Expectations might not necessarily be the correct choice
[00:37:31] Unknown:
and some of the ways to think about how best to leverage great expectations or where it fits into your workflow. So I think that we talked a bit about anomaly detection and, you you know, the case where you have very dynamic data that you know is kind of changing or evolving over time, and that's okay. But as as long as it, you know, effectively, you're doing some sort of change point detection. That's an area where I think great expectations doesn't have a lot of support yet. Another area that is 1 where I think we will have a lot of support, but it's just not built out yet, but it but it comes up a lot is in multi batch expectations. So expectations that or or basically using metrics or statistics that are noble only when you look at a whole bunch of different batches of data. I'll bleed that question really into maybe a more forward looking positive thing of saying there's a lot that we're doing in that space right now. It's really fun to get to think about those kinds of multi batch problems or even even change point problems. And so we've got some design work that we've been doing recently that can help with with both of those. For now, though, I think those are areas where I would certainly keep a fairly high touch analytic process involved in processes or data flows that have those characteristics.
[00:38:45] Unknown:
And as you look forward to the next steps that you have on the road map for great expectations and some of the overall potential that exists for the project, I'm wondering if you can just talk about what you have in store for the future and some of the ways that people can get involved and help out on that mission. Absolutely. Well, we talked a little bit about this,
[00:39:04] Unknown:
gallery idea, the idea that there will be progressively more information about how to solve particular problems and how to integrate with different, you know, additional back ends or pipeline running systems. So I see a lot of work that we're moving towards focused on the space of really unlocking, kind of like a flywheel or accelerating a flywheel of community involvement, so that more and more people are able to help contribute to driving the project. To get there, one of the things that we're gonna be, I think, focusing on, and where I think people would be really welcome to continue to engage a lot, is basically a quality and integration testing program, where, you know, as more and more different kinds of data and data systems are plugged in with Great Expectations, you know, it's really useful. I mean, I think about this all the time with SQL, where, you know, there's this kind of expected homogeneity of translation, but then that runs into the reality of a variety of different back ends and syntax and so forth. So we're doing a lot in order to kind of ensure that when you state an expectation, you're gonna get exactly what it is that you stated. On the other hand, we're doing a lot of work just in future iterations, improving the expressivity of expectations. So making it possible to have expectations about time that are, you know, first class, and adding some features around, you know, support for what I call teams with skin in the game, you know, things where you know you're gonna get a pager call. Well, that was dating me. You know you're gonna get a Slack message if something goes wrong. And so, you know, that's alerts and notifications of different kinds, ensuring that there are additional actions available to validation operators and things like that. So that's the other kind of big area. It's just adding new features and support. Now, eventually, what I'd love to see is that there are
[00:41:02] Unknown:
opportunities to make it just as easy as possible to get Great Expectations in your environment. You know, some sort of just one click, spin up my expectation store, validation store, and all that. And I think we'll get there too. And one of the other things that I think is interesting to briefly touch on is the types of data that are usable with Great Expectations, where a lot of times people are going to be defaulting to things that are either in a SQL database or textual or numeric data. But then there is also potential for things like binary data or images or videos. And I'm wondering what are some of the ways that Great Expectations works well with those or some of the limitations to think about what types of datasets are viable for this overall approach to testing.
[00:41:47] Unknown:
That's a great point. And, frankly, I probably should have touched on it more on the limitation side. I think there are a couple of areas where it's challenging to use Great Expectations for those reasons. One is if you have non tabular data; at this point, expectations really are oriented around columns and rows. Obviously, there are a lot of datasets that are, you know, denormalized in some way or another and have structured records or so forth. So that's an area where there's, you know, kind of some transformation usually that would need to happen before you run Great Expectations. To the point of binary data and images, there aren't actually any of our core expectations that natively speak about those kinds of datasets.
Now what you can do, and what we've seen done a little bit, is where effectively Great Expectations will become a wrapper layer around a more complicated model. So you effectively have one model that is checking data characteristics that are going to be used in another one. And so in that context, it becomes possible to use Great Expectations with those kinds of data. But that's another area that's pretty challenging at this point. And are there any other aspects of the Great Expectations project
[00:43:02] Unknown:
or ways that people should be thinking about testing their overall data pipelines and data quality or effectiveness
[00:43:08] Unknown:
of the communication around how data is being used that we didn't discuss yet that you'd like to cover before we close out the show? I think the only other thing I would touch on is just to reflect a little bit on how much time we spend communicating about data and all the ways that that happens and thinking about how that could be made more efficient. You know, I think there are 2 big sources for inefficiency in that process of understanding and communicating data. And 1 of them, I'll call round trips to domain experts or time where you're going back and forth or sitting down in meetings and having to bring a whole bunch of people together collaboratively. And the other 1 is what I call the cutting room floor or, you know, time where there is some understanding of a dataset that you've built up, but it dies because it doesn't get written down or it doesn't get acted on.
[00:43:58] Unknown:
And so I think Great Expectations is really good for helping to address those things. Alright. Well, for anybody who wants to follow along with the work that you're doing or get involved in the Great Expectations project or just get in touch with you, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. Well, despite all the work that we're doing and a lot of people are doing in data quality, I still think the biggest gap is
[00:44:26] Unknown:
in bridging the world of the data collection and processing systems, of the computing, with the world of the people. I mean, at the end of the day, I think it's all about what we as humans understand, as decision makers, and how we see the world, that we're trying to get at in this whole field. And so I still think having people understand how models work, you know, explainability, and having machines be able to understand intent, are the things that are, you know, just gonna take a huge amount of work in a variety of different
[00:45:04] Unknown:
fields, and there's no silver bullet in that. But I think that's gonna be the big project area that I'm excited to get to continue working in and contributing toward. Alright. Well, thank you very much for taking the time today to join me and discuss the work that you've been doing with Great Expectations. It's definitely great to see that it has continued to progress, and it is growing in terms of the mind share of what people are using for testing their data and the fact that that's something that needs to happen. So I appreciate all of your efforts on that front, and I hope you enjoy the rest of your day. Thank you so much, Tobias. It's a pleasure to get to talk to you and to get to be back on your show. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction to James Campbell and Great Expectations
Evolution and Features of Great Expectations
State of the Art in Data Validation and Profiling
Non-Obvious Benefits and Use Cases
Implementation and Technical Details
Integrations and Extensibility
Future Roadmap and Community Involvement