Summary
Data pipelines are complicated, business-critical pieces of technical infrastructure. Unfortunately, they are also difficult to test, leading to a significant amount of technical debt that contributes to slower iteration cycles. In this episode James Campbell describes how he helped create the Great Expectations framework to help you gain control and confidence in your data delivery workflows, the challenges of validating and monitoring the quality and accuracy of your data, and how you can use it in your own environments to improve your ability to move fast.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
- Your host is Tobias Macey and today I’m interviewing James Campbell about Great Expectations, the open source test framework for your data pipelines which helps you continually monitor and validate the integrity and quality of your data
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by explaining what Great Expectations is and the origin of the project?
- What has changed in the implementation and focus of Great Expectations since we last spoke on Podcast.__init__ 2 years ago?
- Prior to your introduction of Great Expectations what was the state of the industry with regards to testing, monitoring, or validation of the health and quality of data and the platforms operating on them?
- What are some of the types of checks and assertions that can be made about a pipeline using Great Expectations?
- What are some of the non-obvious use cases for Great Expectations?
- What aspects of a data pipeline or the context that it operates in are unable to be tested or validated in a programmatic fashion?
- Can you describe how Great Expectations is implemented?
- For anyone interested in using Great Expectations, what is the workflow for incorporating it into their environments?
- What are some of the test cases that are often overlooked which data engineers and pipeline operators should be considering?
- Can you talk through some of the ways that Great Expectations can be extended?
- What are some notable extensions or integrations of Great Expectations?
- Beyond the testing and validation of data as it is being processed you have also included features that support documentation and collaboration of the data lifecycles. What are some of the ways that those features can benefit a team working with Great Expectations?
- What are some of the most interesting/innovative/unexpected ways that you have seen Great Expectations used?
- What are the limitations of Great Expectations?
- What are some cases where Great Expectations would be the wrong choice?
- What do you have planned for the future of Great Expectations?
Contact Info
- @jpcampbell42 on Twitter
- jcampbell on GitHub
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- Great Expectations
- Podcast.__init__ Interview on Great Expectations
- Superconductive Health
- Abe Gong
- Pandas
- SQLAlchemy
- PostgreSQL
- RedShift
- BigQuery
- Spark
- Cloudera
- Databricks
- Great Expectations Data Docs
- Great Expectations Data Profiling
- Apache NiFi
- Amazon Deequ
- Tensorflow Data Validation
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform. If you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances, and they've got GPU instances as well.
Go to dataengineeringpodcast.com/linode, that's L I N O D E, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. You listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Corinium Global Intelligence, ODSC, and Data Council.
Upcoming events include the Software Architecture Conference, the Strata Data Conference, and PyCon US. Go to dataengineeringpodcast.com/conferences to learn more about these and other events and take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey, and today, I'm interviewing James Campbell about Great Expectations,
[00:01:42] Unknown:
the open source test framework for your data pipelines, which helps you continually monitor and validate the integrity and quality of your data. So, James, can you start by introducing yourself? Absolutely. It's great to be here. Like you said, my name is James Campbell, and I am currently working at Superconductive. We are really pivoting at this point to focus pretty much full time on Great Expectations and helping to build out data pipeline tests. My background, academics-wise, was math and philosophy. And I spent a little over a decade working in the federal government space on national security issues, focusing on analysis first of cyber threat related issues and then political issues.
[00:02:21] Unknown:
That's an interesting combination of focus areas between math and philosophy. And I'm also sure from your background of working in government and national security issues that there was a strong focus on some of the data quality issues and some of the problems there. So it seems logical that you would end up working in the space with Great Expectations as something that would be of primary concern given your background.
[00:02:44] Unknown:
Absolutely. I mean, for me, I've always been very interested in understanding how we know what it is that we know and how we can really convey the confidence that we have in raw data and assessments on that data to other people. It's been a really continuous thread for me, especially like you mentioned in the intelligence community. There's a tremendous focus on what is often termed analytic integrity, or ensuring that the entire process around understanding a complex situation is really clearly articulated to ultimate decision makers. And I think that's an important paradigm for any analytics or data related endeavor because we need to be able to ensure that we're picking the right data for the job, that we've got all the data that we need to be able to understand what we can. And, of course, that's often not the case, and so we end up needing to find proxies or, in other ways, compromise. And so then we need to figure out how we can convey that effectively to the people who will ultimately be using the data for some sort of a decision. And do you remember how you first got involved in the area of data management?
Absolutely. I think for me, there was a pretty pivotal time when I was leading a team of data scientists who were training models and sharing them out with their colleagues. And in addition to the actual model building process, we had a whole focus on education and training for people who would be able to use these kinds of more sophisticated analytic models in their day to day work. And there were a lot of discussions on the team about how we could effectively help them understand what kinds of data requirements were in place for the kinds of questions that they wanted to ask. And what I found was that that really meant that we needed to be able to not only understand the data that we had available, but also understand the kinds of data that other people had available. And that meant that there was a huge amount of essentially metadata management and knowledge management and sharing that needed to happen.
And when you do that across teams and across organizations, it really amplifies the challenge. But, of course, I think that also makes it a much more interesting and exciting area to get to work in. And so
[00:05:03] Unknown:
in terms of great expectations, it definitely helps to solve some of these issues of the integrity and quality of the data. And I know that particularly with some of the recent releases, it helps to address some of the issues of communications around what the dataset contains and some of the metadata aspects of where it came from, what the context is. But before we get too deep into that, I'm wondering if you can just give a broad explanation about what the Great Expectations project is and some of the origin story of how it came to be. Absolutely. Great Expectations,
[00:05:35] Unknown:
we use the tagline always know what to expect of your data. And it's a project that helps people really clearly state the expectations that they have of data as well as communicate those to other people. I mentioned that experience I had of leading a team where we were doing model building. And one of the key parts of the origin story was wanting to basically be able to say, you know, this data should have a particular distribution or particular shape around certain variables if you wanna be able to ask a question. And I found that, again and again, what I wanted to do was not just ship people, you know, an API where they put data in and they get an answer back, but rather ship them an API that they could use responsibly and confidently.
That was a key part of the origin of Great Expectations for me. At the same time, my colleague now, Abe Gong, was working on health care data, and he, again and again, was observing breakages in data pipelines that they were building where another team, usually because they were making an improvement or catching an error, would make a change to an upstream data system, and that would trickle down and cause breakages throughout the rest of the data pipeline. And he and I were talking about these related problems. And, of course, both of us had experienced the other one. But we realized, well, you know, both of these would be really well served by this sort of a declarative approach to describing expectations. And, you know, that's the origin of the project. Just realizing that this problem was so general that it was coming up not just in different places, in different industries, but with different manifestations. Right? The same underlying problem has these kinds of different symptoms.
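To make the declarative idea concrete, here is a minimal sketch using the classic Pandas-backed Great Expectations API, roughly as it existed around the time of this episode; the file name, column names, and allowed values are hypothetical, purely for illustration:

```python
# Minimal sketch: declarative, named expectations with the classic
# Pandas-backed API. File, columns, and values below are hypothetical.
import great_expectations as ge

# Wrap an ordinary CSV load so the resulting dataframe can carry expectations.
df = ge.read_csv("events.csv")

# Expectations are statements about the data, not about the pipeline code.
df.expect_column_values_to_not_be_null("user_id")
df.expect_column_values_to_be_in_set("status", ["active", "churned", "trial"])
df.expect_column_mean_to_be_between("session_length_minutes", 1, 120)

# Validate the batch against everything declared above.
results = df.validate()
print(results)
```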
[00:07:29] Unknown:
And about 2 years ago now, I actually had both you and Abe on an episode of my other show where we talked about some of the work that you were doing with great expectations. And I know that at the time, it was primarily focused on being used in the context of the pandas library and Python and some of the ways that people were doing exploratory analysis within pandas and then being able to integrate great expectations into that workflow and have a set of tests that they could assert at runtime later on once they got to production with it. So I'm wondering if you can just talk a bit about some of the changes in focus or the evolution of the project since we last spoke.
[00:08:08] Unknown:
Absolutely. It's amazing how much things have evolved in those 2 years and especially in the last 6 months or so. Just like you said, originally, our focus was very much on validation and supporting exploratory data analysis and that sort of a workflow. And also very much we were Pandas-centric. There's been evolution in a lot of different dimensions. The first one is just in terms of the kinds of data that Great Expectations can interact with. From the very beginning, the library wasn't specific to Pandas. It wasn't about any particular form of the data. It was really about the semantics of the data, how people understand it, what it means for the context of a particular analysis. So we've been able to realize that goal a little bit by expanding to now support SQLAlchemy and, by extension, all of the popular big SQL databases. So we have users running Great Expectations on Postgres, on Redshift, on BigQuery.
Also, we've expanded into Spark. So, whether that's Cloudera clusters or Spark clusters that are managed by teams or Databricks, we've got users being able to evaluate expectations on all of those. And I think, again, one of the things that's really neat is actually it's the same expectations. Right? So the same expectation suite now can be validated against different manifestations of data. So if you have a small sample of data that you're working with on your development box in Pandas, and then you wanna see whether those same expectations are met by a very large dataset out on your Spark cluster, you can just seamlessly do that. The next big area that we've pushed is in terms of integrations.
It was a big pain point for users, I think, to figure out how they can actually, you know, stitch Great Expectations into their pipelines. And so we've done a lot of work in creating what we call a data context, which manages expectation suites, manages data sources, and can bring batches of data together with expectation suites to validate them and then store the validation results, put them up on cloud storage. You could, you know, have all of your validations, for example, immediately uploaded to S3 and available to your team. So that's been another big area of development. And, man, again, I know I'm just going on and on here, but there are a couple other big areas. So one of them has been the thing that I think really people have been able to use to resonate with Great Expectations and see things move forward, which is data docs, we call it, or the ability to generate human readable HTML documentation about the artifacts of Great Expectations. So about expectation suites, about validation results. And so we basically generate a static site that you can look at, and it really helps you get a quick picture of what you have in your data, as well as something that you can share with your team. And then the last area is profiling.
We've done a lot of work to make it so that you can use Great Expectations, the library, before you've really zeroed in on what the expectations are. So it becomes this iterative process of refinement where in an initial profiling, we basically say, you know, I expect these hugely broad ranges. I expect the mean of a column's values, for example, to be between negative infinity and positive infinity. Well, obviously, that's true. But as a result of computing that, we give you the metric, what the actual observed mean is. And you can use that, especially when you're combining that with documentation, to profile and get a really robust understanding
[00:11:49] Unknown:
of your data right away. So there's a lot there. There's a lot of innovation and work that we've been able to do, and it's been a really fun thing to get to focus more on the project. Yeah. The profiling in particular, I imagine, is incredibly valuable for people who are just starting to think about how do I actually get a handle on the data that I'm using and get some sense of what I'm working with, particularly if they're either new to the project or if they've just been running blind for a long time and wanna know how to actually do that.
[00:12:18] Unknown:
One of the things that I see in our Slack channel a lot is when somebody will say, you know, I ran this expectation and, you know, it's failing, and I don't know why. And then they look into their data and it's, well, because it's not true. And I just never cease to love that sense of surprise and excitement that people have when they really encounter their data in a richer way or in a way that they hadn't seen it before. What profiling does is it just makes that happen across a whole bunch of dimensions all at the same time. Exactly right. I think more and more what we're finding is when a user is first getting started with Great Expectations, it was intimidating to sit down at a blank notebook and figure out, where do I go? How do I get started? And so now what they can do with profiling is start off with a picture of their dataset.
You know, they get to see some of the common values and, you know, which columns are of which types and distributions, and it really gives them a way to dive right in. And then we can actually generate a notebook from that profiling result that becomes the basis of a declarative exploratory process. So we actually can sort of guide you through some of the initial exploration that makes sense based on the columns and types of data that you have.
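As a rough illustration of the profiling workflow James describes, the sketch below emits deliberately broad candidate expectations and, as a byproduct, the observed metrics you would use to tighten them; it is illustrative Python, not the actual Great Expectations profiler API, and the column names are made up:

```python
# Illustrative profiling sketch (not the actual Great Expectations profiler):
# emit deliberately broad candidate expectations and, as a byproduct, the
# observed metrics that make it easy to tighten them later.
import math
import pandas as pd


def naive_profile(df: pd.DataFrame) -> dict:
    """Return broad candidate expectations plus observed metrics per column."""
    profile = {}
    for column in df.columns:
        series = df[column]
        observed = {
            "null_fraction": float(series.isna().mean()),
            "distinct_count": int(series.nunique()),
        }
        candidates = []
        if pd.api.types.is_numeric_dtype(series):
            observed["mean"] = float(series.mean())
            # Trivially true, in the spirit of "mean between -inf and +inf";
            # the observed mean is what you actually use to tighten it.
            candidates.append({
                "expectation": "expect_column_mean_to_be_between",
                "kwargs": {"column": column, "min_value": -math.inf, "max_value": math.inf},
            })
        profile[column] = {"observed": observed, "candidate_expectations": candidates}
    return profile


if __name__ == "__main__":
    from pprint import pprint
    df = pd.DataFrame({"churn_rate": [0.02, 0.03, 0.05], "region": ["us", "eu", "us"]})
    pprint(naive_profile(df))
```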
[00:13:39] Unknown:
And going back to the beginning of the project, at the time that you were working with Abe to define what it is that you were trying to achieve with Great Expectations, I'm wondering what your experience had been as far as the available state of the art for being able to do profiling or validation and testing of the data that you were working with and maybe any other tools or libraries that are operating in the space either at the time or that have arrived since then?
[00:14:07] Unknown:
Great question. I think there's a lot of really good practice that I've seen, you know, in this broader space. I think one of the things that's important to mention is there's a huge amount of this kind of work that just happens out of band. So it's not so much that there is a tool for it even. It's that what we see are people having meetings, coordination meetings, data integration meetings, big Word documents. So, you know, even though it's not an exotic thing to say, I think it's really important to mention that and remember that as a kind of core part of the original state of the art for how people work on this. The other thing is there's a lot of roll your own. A lot of people who are writing tests for their data just put it into their code, and it's kind of indistinguishable from their pipeline code itself, which was one of the key problems that we wanted to solve with Great Expectations, making sure that tests could focus on data instead of being part of the code and have a strong differentiation there. But it's still an important part. So a lot of times when I talk to people about their current strategy for pipeline testing, they say, oh, yeah. We absolutely do tests. It's just that we've written a lot of our own.
There definitely are some commercial players in this space. A lot of the big ETL pipeline tools, whether that's open source like NiFi or some of the big commercial players, have data quality components or plugins that you can use. And I think that's really, you know, obviously, I think, a best practice to have some sort of test. So I think that's also really important. You know, I think there's a lot of things that people do with just metadata management strategies. So, you know, capturing, whether that's in a structured way or a little bit less structured, you know, a metadata store for datasets. So I see a lot of work there, which is really exciting. And then the other thing I'll say, specifically to your question of how things have evolved, is there are two other open source projects that I see having come out of some really big companies, which I think is exciting. You know, the investment that they're making and their willingness to open source that really demonstrates the scale of the problem and interest here. And the ones I'd mention specifically are Deequ from Amazon and TensorFlow Data Validation, which, of course, is very TensorFlow specific, but also gives you a lot of ability to have that kind of insight into the data that you're observing and how it's changed.
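To make the contrast with the roll-your-own approach concrete, the short sketch below shows an ad-hoc assertion buried inside transform code next to the same requirement stated as a declarative, named expectation; it assumes the classic from_pandas helper, and the dataframe and column names are hypothetical:

```python
# Contrast: an ad-hoc check buried in transform code vs. the same requirement
# as a declarative, named expectation about the data. Names are hypothetical.
import pandas as pd
import great_expectations as ge  # classic API, assuming from_pandas is available

raw = pd.DataFrame({"amount": [10.0, 12.5, None]})


def transform(df: pd.DataFrame) -> pd.DataFrame:
    # Roll-your-own style: the data check is indistinguishable from pipeline logic.
    assert df["amount"].notna().all(), "amount should never be null"
    return df.assign(amount_cents=df["amount"] * 100)


# Declarative style: the requirement lives with the data, not the transform.
batch = ge.from_pandas(raw)
result = batch.expect_column_values_to_not_be_null("amount")
print(result)  # reports success/failure plus which values were unexpected
```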
[00:16:32] Unknown:
And one of the values specifically that Great Expectations has, which I think is hard to overstate, is what you were saying earlier about being able to build the expectations against a small dataset and then being able to use those same assertions across different contexts because of the work that you've done to allow Great Expectations to be used on different infrastructure, whether it's just with Pandas on your local machine or on a Spark cluster or integrated into some of these workflow management pipelines. And so more broadly, I'm wondering if you can just talk about some of the types of checks and assertions that are valuable to make that people should be thinking about, and some of the ones that are notable inclusions in Great Expectations out of the box? Yeah. Great, great question. And
[00:17:21] Unknown:
and I think, you know, to your point about the ability to work across different sizes of data, I would also add really quickly that I think one of the neat ways that we see that is just making sure that teams are working with the same data. So sometimes it's not even about there being different sizes, but, you know, when data got copied from one warehouse to another, did all of it move, for example. So to that end, I think one of the most important tests that is easy to overlook is missing data. You know, do we have everything that we thought we would have, both in terms of columns and fill, as well as in terms of data deliveries.
One of the things that is really powerful, of course, about distributed systems is they're very failure tolerant. But one of the things that that can mean is they're silent to, especially, small batch delivery failures. You know, I think set membership is another area of really basic, but really important kinds of testing. And then I think the more exotic things around distributional expectations are also really important. With respect to some of the notable expectations that are in Great Expectations, I think I would actually highlight things that people have added or extended using the tool as probably the most exciting and innovative parts. So one of the things that I've seen that I really thought was neat was a team that basically just took a whole bunch of regular expressions that they were already using to validate data, but that were basically inscrutable, and used those to create custom expectations that were then very meaningful to the team. So, for example, you know, we could say that I expect values in this column to be part of our normalized log structure, and that corresponds to being able to be, you know, matched against a bunch of regexes or have some other kind of parsing logic applied. But what it's doing is translating something that a machine is really, really good at checking, but that is very hard for a person to understand, into something that a human really understands immediately, intuitively. It has all the business connection for them, but it would have been very hard for a machine to verify without that translation.
So it's really, I think, exciting to be able to help link that up.
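A sketch of that regex-wrapping idea: a pile of inscrutable regular expressions gets a single business-meaningful name. In Great Expectations this would typically be registered as a custom expectation (the exact mechanism varies by version), so it is shown here as a plain helper; the regexes and the "normalized log structure" name are hypothetical:

```python
# A business-meaningful name wrapping machine-checkable regexes.
# The patterns and the naming convention are purely illustrative.
import re
import pandas as pd

# The machine-checkable part: regexes nobody wants to read in a pipeline review.
NORMALIZED_LOG_PATTERNS = [
    re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z "),  # ISO-8601 timestamp prefix
    re.compile(r" (INFO|WARN|ERROR) "),                      # severity token
]


def expect_values_to_match_normalized_log_structure(values: pd.Series) -> dict:
    """Human-meaningful name over the regex checks; returns a GE-style result dict."""
    matches = values.astype(str).apply(
        lambda v: all(p.search(v) for p in NORMALIZED_LOG_PATTERNS)
    )
    unexpected = values[~matches]
    return {
        "success": bool(matches.all()),
        "result": {
            "unexpected_count": int(len(unexpected)),
            "partial_unexpected_list": unexpected.head(5).tolist(),
        },
    }


logs = pd.Series(["2020-02-01T10:00:00Z  INFO  user login", "garbage line"])
print(expect_values_to_match_normalized_log_structure(logs))
```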
[00:19:43] Unknown:
And I think that goes as well into some of the non-obvious ways that Great Expectations can be beneficial to either an individual or a team who are working with some set of data. And I'm wondering if there are any other useful examples of ways that you've seen Great Expectations used that are not necessarily evident when you're just thinking specifically in terms of data testing and data validation.
[00:20:05] Unknown:
Absolutely. I think one team that I saw do something I thought was really clever was they effectively built an expectation suite that didn't have most of the actual expectation values filled in. And then they took that kind of template, and effectively, it became a questionnaire. What is the minimum value for the churn rate that would be unusual? And sent that to their analysts, the people who were consuming reports built from the data that they managed, in order for them to fill it in. And so it became a way for there to be a really structured conversation around what these two different teams understood the data to mean and how it should appear, which helped the engineering team understand the business users better and the business users understand the kinds of problems that the engineering teams were facing. So I thought that was a really, you know, a fun use case. Another one that I've seen that I think is really neat is what I call the pattern of fit for purpose. And the basic idea is, you know, we talk about pipeline tests as something you can run. Obviously, you can run them in a pipeline, but usually, you know, before your code or after your code. And, you know, just like with a pipeline, there's a lot of branching. With expectations, it's the same.
There are teams that are gonna use a dataset in one way, and so they have some expectations. And another team may use the same data in a very different way and end up having very different expectations around the same data. So that ability to elucidate that realization
[00:21:42] Unknown:
or that, reality of data is really fun. Yeah. I definitely see the fact that Great Expectations has this sort of dual purpose of, 1, the operational characteristics of working with your data of ensuring that I'm able to know at runtime that there is some error because either the source data has changed or there's a bug in my manipulation of that data. And so I wanna know that there are these invariants that are being violated, but at the same time, it acts as a communications tool, both in terms of the data docs that you mentioned, but also just in terms of what tests am I actually writing and why do I care about them, and then using that as a means of communicating across team boundaries so that everybody in the organization has a better understanding of what data they're using and how and for what purpose.
[00:22:23] Unknown:
Exactly.
[00:22:24] Unknown:
Along those lines too, it's interesting and useful to talk about some of the aspects of data pipelines in terms of what are some of the cases that can't be validated in a programmatic fashion and some of the edge cases of where great expectations hits its limitations and you actually have to reach to either just a manual check or some other system for being able to ensure that you have a fully healthy pipeline end to end and just some of the ways that great expectations,
[00:22:53] Unknown:
there can be bent to fit those purposes or needs to be worked around or augmented? Yeah. That's a tough question, obviously, for somebody who's building something, because I love to think that it's great for everything. But, you know, of course, there are a lot of really challenging areas. One of them that I think is really interesting and that has been a lot of discussion inside our team is the process of development of a pipeline. So while you're actually still doing coding, it's very tempting, I think, to blur the line between what is effectively a unit test or an integration test around a pipeline and a data pipeline test. So I think Great Expectations is probably not the right choice when you're doing that active development because you may not yet know what the data should look like. And while you're doing that very rapid iteration, that's probably even too much. But another area that I think is interesting for this reason is anomaly detection. Anomaly detection is really tempting with Great Expectations. And I want to love how strong of a tool we are there. At the end of the day, though, I think anomaly detection always has a huge amount of precision recall trade off that needs to be dealt with. And so when you have a situation where you wanna use Great Expectations, and what you're observing is that your underlying question is, well, you know, what is the precision recall trade off that I have for an expectation, I think that's another area where what I would just say is we still have a lot of work to do. Because insofar as what we're trying to do is communicate things well, there are solutions. And one of the solutions that we're working on right now, basically, is the ability to have multiple, like I was describing earlier, sets of expectations on the same dataset that reflect different cases for different points on that precision recall trade off space. And then I guess there's kind of another area, which is the overall coverage, or when you want to use Great Expectations in the context of sort of acceptance testing for a final product. I think, you know, it's always good to make sure you have some point in the process where a human is in the loop for a high stakes decision. And so I would be wary of somebody saying, oh, well, we've got all of our expectations encoded. It's, you know, good to have some meta process around the use of expectations in a pipeline that allows you to check that and really assess the coverage
[00:25:38] Unknown:
that you have. Yeah. That's definitely an interesting thing to think about, because for people who are used to using unit tests in the context of application development, there's the concept of code coverage to give some sense that the application has been tested in some fashion. I'm wondering how that manifests in the context of data pipelines and how you identify areas of missing coverage or think about the types of tests that you need to be aware of and that need to be added, particularly for an existing pipeline that you're trying to retrofit this onto?
[00:26:11] Unknown:
I think the issue of coverage in general is just fascinating in the context of data pipelines because it really opens such a fraught question of what it even means to have asked everything that you can about a dataset or a pipeline and a set of code. So I think what Great Expectations is sort of helping you do is complement the process of one kind of testing, where you're asking, have I been able to anticipate everything that I see, with another kind of testing, where you're asking if what you see is like the kinds of things that you anticipated. Then what that means is I think it really comes back to that question of fit for purpose.
So, for example, if you are going to eventually be using a dataset that you're creating for a machine learning model, there are a variety of features with different levels of importance to that model, and how well each of those features over the space of your function reflects the kinds of things that you saw in the training dataset, or how many elements you see in clusters that you observed in the context of an unsupervised learning problem, are good ways for you to diagnose that there may be a problem, but they're really just part of the overall analytic process. And maybe I think the right way to spin this really positively is that, just like what we were saying at the very beginning about some of the more basic tests, you know, are these values the ones that the data provider said they were going to be providing, what we're really allowing you to do is have a robust conversation and process of exploring and understanding
[00:27:46] Unknown:
and getting new insights out of data. And now I'm wondering if we can dig a bit deeper into how Great Expectations itself is actually implemented and particularly some of the evolution of the code base moving from your initial implementation of focusing on integrating with Pandas to where it is now and how you're able to maintain the declarative aspects of the tests and proxy that across the different execution,
[00:28:12] Unknown:
contexts that you're able to run against? Yeah. This is a really fun question. This is what I've been spending a lot of time thinking about right now. And, actually, we're in the process of launching another major refactor to the way that this works. The basic idea for us is that expectations are named, and that's what allows us to have the declarative syntax. So there's this concept of the human understandable meaning, what it is that you expect. And as you recall these expectations, we always use these very long, very explicit names, expect column values to be in set. And we then have a layer of translation that is available per expectation and per back end. One of the key changes that's happened is we've introduced an intermediate layer called metrics that allows expectations to be defined in terms of the metrics that they rely on. And then the implementations can translate the process of generating that metric into the language of the particular back end that they're working on. So, for example, if you have an expectation around a column minimum, then rather than actually translating that directly into a set of comparisons, what we're doing is asking the underlying data source for the minimum value and then comparing that to what you expected. There's, of course, a little bit of magic around ensuring that that works and scales appropriately on different back ends and that we can bring back the appropriate information for different kinds of expectations, you know, some of which work, like I was just describing with min, in an aggregate way, and some of which look across rows of data one at a time. But that's been a really important part of the evolution. And then where we're going next is to have expectations be able to even encompass a little bit more logic so that they will also, in the same part of the code base, contain the translation into the full verbose language, you know, the locale specific documentation version of an expectation.
So if the name, expect column values to be less than, is fixed, and that gets translated in 1 way to the back end implementation.
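An illustrative sketch of the metric layer described here: the expectation is defined against a named metric (the column minimum), and each backend produces that metric in its own dialect. The classes and names below are invented for illustration and are not Great Expectations' internal API:

```python
# Invented classes illustrating metric dispatch per backend; not GE internals.
from dataclasses import dataclass

import pandas as pd
import sqlalchemy as sa


class PandasMetricProvider:
    def __init__(self, df: pd.DataFrame):
        self.df = df

    def column_min(self, column: str):
        return self.df[column].min()


class SqlAlchemyMetricProvider:
    def __init__(self, engine: sa.engine.Engine, table: str):
        self.engine, self.table = engine, table

    def column_min(self, column: str):
        # Push the aggregation down to the database instead of pulling rows back.
        query = sa.text(f"SELECT MIN({column}) FROM {self.table}")
        with self.engine.connect() as conn:
            return conn.execute(query).scalar()


@dataclass
class ExpectColumnMinToBeBetween:
    column: str
    min_value: float
    max_value: float

    def validate(self, provider) -> dict:
        observed = provider.column_min(self.column)  # the metric, backend-agnostic
        return {
            "success": self.min_value <= observed <= self.max_value,
            "result": {"observed_value": observed},
        }


# The same declarative expectation runs against either backend.
expectation = ExpectColumnMinToBeBetween("session_length_minutes", 0, 10)
pandas_result = expectation.validate(
    PandasMetricProvider(pd.DataFrame({"session_length_minutes": [3, 7, 9]}))
)
print(pandas_result)
```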
[00:30:32] Unknown:
Parameters, like that it usually should hold more than 80% of the time, or whatever additional parameters are stated. All of that is coming together in the code base to make it much easier for people to extend. And on the concept of extensions, I know that one of the things you released recently is a blog post about adding a plugin for building automatic data dictionaries, but then there's also the extensibility of it in terms of the integrations that are built for it, particularly for projects such as Dagster or Airflow, or the different contexts that Great Expectations is being used within, or how it can be implemented as a library. So I'm wondering if you can talk about some of the interfaces that you have available for both extending it via plugins as well as integrating it with other frameworks.
[00:31:17] Unknown:
Yeah. Great. That's a really fun area too. So a lot of that resides in the concept of the data context that I mentioned at the beginning of our conversation. The data context makes it really easy to have a configuration. It's a YAML based configuration where you can essentially plug different components together. So, for example, you can plug in a new data source that knows how to register with Airflow or that knows how to read from a particular database that you have or an S3 bucket that you maintain. And so there's this composition element that the data context provides. And then in addition to that, each of the core components of GE are designed to be really friendly for subclassing.
And the data context allows you to dynamically import your extensions of anything. So the data dictionary approach, for example, was effectively a custom renderer. So it took the way that we were building the documentation pages and just modifies it a little bit. So it's still using the same underlying Great Expectations artifacts, the validation results and expectation suites, but it's combining the data in a different way and building some new elements and putting them on the top. What we're really hoping is that, as the community gets more and more engaged with Great Expectations, we can start to develop effectively a gallery, a location where people can post the extensions that they're building and begin to share, whether it's a new page or artifact or a new data connector or something else, like an actual expectation suite. So a domain specific set of knowledge that they have around, for example, a public dataset that they've been working with. What are some of the expectations that were useful for their project? Even those could be shared and hosted in the gallery. Yeah. I definitely think that the extensibility,
[00:33:08] Unknown:
particularly for integrating with some of these workflow management systems, is highly valuable for how Great Expectations is positioned. I remember talking to Nick Schrock about the Dagster project, and I was seeing that there's an integration with that, and then also things like Airflow or Kedro. And so the fact that Great Expectations as a standalone application works with SQLAlchemy and Pandas and Spark is definitely valuable, but I imagine that by being able to be plugged into these workflow engines, it actually expands the overall reach beyond just those data sources because of the fact that it's operating within the context of what that framework already knows about as far as the datasets and the information that it has in the context of processing that data. So I'm curious, what are some of the ways, beyond the surface level of Great Expectations as its own application,
[00:34:01] Unknown:
how those integrations expand its overall reach and some of the potential that it has? Well, I think the key thing that it does is it really opens up the idea that Great Expectations can become a node itself in an overall data processing pipeline. So rather than, you know, I run Great Expectations, then I run my pipeline, then maybe I run Great Expectations again, you have an orchestrated pipeline that can have dependencies, that can have conditional flow, and Great Expectations can really play happily in that ecosystem. It can signal back to Airflow that a job has successfully produced an artifact. It can automatically build and put documentation, for example, on a public site and fire off a Slack notification that says that this validation happened and it was successful, and then immediately work together with Airflow to continue processing or delivering data to teams. So it's that ability to have Great Expectations be a part of this broader toolkit available to data engineers that I think is where we're going with these plugins. So you get that advantage of the declarative and the human understandable expectations.
Also, of course, the, you know, critical, machine verifiable expectations. And then it becomes a part of your overall deployed pipeline.
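A hedged sketch of that "validation as a node" pattern, using a plain Airflow PythonOperator rather than any official Great Expectations operator; the DAG id, the validate_batch callable, and the notification step are placeholders. The point is that a failed validation raises and stops downstream tasks, while a success lets documentation publishing and further processing continue:

```python
# Hypothetical Airflow DAG: a validation task gates downstream processing.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def validate_batch(**context):
    # Placeholder for running a Great Expectations validation (for example via
    # a data context / checkpoint) and returning its result dictionary.
    result = {"success": True, "statistics": {"evaluated_expectations": 12}}
    if not result["success"]:
        # Failing loudly is what signals Airflow not to run downstream tasks.
        raise ValueError(f"Data validation failed: {result}")
    return result["statistics"]


def publish_docs_and_notify(**context):
    # Placeholder: rebuild data docs, push them to shared storage, and send a
    # Slack notification that the validation succeeded.
    print("Docs published, Slack notified.")


with DAG(
    dag_id="pipeline_with_validation",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    validate = PythonOperator(task_id="validate_batch", python_callable=validate_batch)
    publish = PythonOperator(task_id="publish_docs", python_callable=publish_docs_and_notify)
    process = PythonOperator(task_id="process_data", python_callable=lambda: print("processing"))

    # Downstream tasks only run if the validation task succeeds.
    validate >> [publish, process]
```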
[00:35:25] Unknown:
And I know that we've spoken a bit about some of the interesting or innovative or unexpected ways that Great Expectations is being used within these different contexts of communication and execution, but I'm wondering if there are any other areas or any other interesting examples that we didn't touch on that you think are worth calling out. I think the most interesting examples
[00:35:44] Unknown:
for me are about the different domains where Great Expectations is being used. You know, frankly, I think we've really covered a lot of the ways that it gets used, and it's more that people are using Great Expectations from all kinds of teams. Actually, let me revise that. I'll give you another specific one that I thought was fun. We had a team that came into the Great Expectations Slack recently and talked about how they were using GE in a way that I thought was really, you know, another fun one, which is they have a centralized data science team, because, you know, it's a scarce resource to be able to ask for a specific deep dive analysis.
And a lot of distributed business units, which all have similar but not the same data. And so the data science team was spending a lot of time translating very similar analyses that all these different business units had, you know, geographically different. But from the business perspective, they were doing very similar things. And so they needed to find a way to reduce the time that they were spending doing translation for each of these different business units. And Great Expectations is really a very neat fit there because it allows this one centralized team to be able to effectively talk to a lot of different teams. So it's not about change. It's about a lot of different teams that are doing similar things with slightly different data, and being able to really quickly articulate what it is that they need to see and how that compares to what they're getting from each of those teams. And we've already talked a bit about some of the limitations of Great Expectations
[00:37:18] Unknown:
of where there are areas in a pipeline that just need to have some sort of manual intervention or human input. But I'm wondering what are some of the other areas where Great Expectations might not necessarily be the correct choice
[00:37:31] Unknown:
and some of the ways to think about how best to leverage great expectations or where it fits into your workflow. So I think that we talked a bit about anomaly detection and, you you know, the case where you have very dynamic data that you know is kind of changing or evolving over time, and that's okay. But as as long as it, you know, effectively, you're doing some sort of change point detection. That's an area where I think great expectations doesn't have a lot of support yet. Another area that is 1 where I think we will have a lot of support, but it's just not built out yet, but it but it comes up a lot is in multi batch expectations. So expectations that or or basically using metrics or statistics that are noble only when you look at a whole bunch of different batches of data. I'll bleed that question really into maybe a more forward looking positive thing of saying there's a lot that we're doing in that space right now. It's really fun to get to think about those kinds of multi batch problems or even even change point problems. And so we've got some design work that we've been doing recently that can help with with both of those. For now, though, I think those are areas where I would certainly keep a fairly high touch analytic process involved in processes or data flows that have those characteristics.
[00:38:45] Unknown:
And as you look forward to the next steps that you have on the road map for great expectations and some of the overall potential that exists for the project, I'm wondering if you can just talk about what you have in store for the future and some of the ways that people can get involved and help out on that mission. Absolutely. Well, we talked a little bit about this,
[00:39:04] Unknown:
gallery idea, the idea that there will be progressively more information about how to solve particular problems and how to integrate with different, you know, additional back ends or pipeline running systems. So I see a lot of work that we're moving towards focused on the space of really unlocking, kind of like a flywheel or accelerating a flywheel of community involvement, so that more and more people are able to help contribute to driving the project. To get there, one of the things that we're gonna be, I think, focusing on, and where I think people would be really welcome to continue to engage a lot, is basically a quality and integration testing program, where, you know, as more and more different kinds of data and data systems are plugged in with Great Expectations, you know, it's really useful. I mean, I think about this all the time with SQL, where, you know, there's this kind of expected homogeneity of translation, but then that runs into the reality of a variety of different back ends and syntax and so forth. So we're doing a lot in order to kind of ensure that when you state an expectation, you're gonna get exactly what it is that you stated. On the other hand, we're doing a lot of work just in future iterations, improving the expressivity of expectations. So making it possible to have expectations about time that are, you know, first class, and adding some features around, you know, support for what I call teams with skin in the game, you know, things where you know you're gonna get a pager call. Well, that was dating me. You know you're gonna get a Slack message if something goes wrong. And so, you know, that's alerts and notifications of different kinds, ensuring that there are additional actions available to validation operators and things like that. So that's the other kind of big area. It's just adding new features and support. Now, eventually, what I'd love to see is that there are
[00:41:02] Unknown:
opportunities to make it just as easy as possible to get Great Expectations in your environment. You know, some sort of just one click, spin up my expectation store, validation store, and all that. And I think we'll get there too. And one of the other things that I think is interesting to briefly touch on is the types of data that are usable with Great Expectations, where a lot of times people are going to be defaulting to things that are either in a SQL database or textual or numeric data. But then there is also potential for things like binary data or images or videos. And I'm wondering what are some of the ways that Great Expectations works well with those or some of the limitations to think about what types of datasets are viable for this overall approach to testing.
[00:41:47] Unknown:
That's a great point. And, frankly, I probably should have touched on it more on the limitation side. I think there are a couple of areas where it's challenging to use Great Expectations for those reasons. One is if you have non tabular data; at this point, expectations really are oriented around columns and rows. Obviously, there are a lot of datasets that are, you know, denormalized in some way or another and have structured records or so forth. So that's an area where there's, you know, kind of some transformation usually that would need to happen before you run Great Expectations. To the point of binary data and images, there aren't actually any of our core expectations that natively speak about those kinds of datasets.
Now what you can do, and what we've seen done a little bit, is where effectively Great Expectations will become a wrapper layer around a more complicated model. So you effectively have one model that is checking data characteristics that are going to be used in another one. And so in that context, it becomes possible to use Great Expectations with those kinds of data. But that's another area that's pretty challenging at this point. And are there any other aspects of the Great Expectations project
[00:43:02] Unknown:
or ways that people should be thinking about testing their overall data pipelines and data quality or effectiveness
[00:43:08] Unknown:
of the communication around how data is being used that we didn't discuss yet that you'd like to cover before we close out the show? I think the only other thing I would touch on is just to reflect a little bit on how much time we spend communicating about data and all the ways that that happens and thinking about how that could be made more efficient. You know, I think there are 2 big sources for inefficiency in that process of understanding and communicating data. And 1 of them, I'll call round trips to domain experts or time where you're going back and forth or sitting down in meetings and having to bring a whole bunch of people together collaboratively. And the other 1 is what I call the cutting room floor or, you know, time where there is some understanding of a dataset that you've built up, but it dies because it doesn't get written down or it doesn't get acted on.
[00:43:58] Unknown:
And so I think Great Expectations is really good for helping to address those things. Alright. Well, for anybody who wants to follow along with the work that you're doing or get involved in the Great Expectations project or just get in touch with you, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. Well, despite all the work that we're doing and a lot of people are doing in data quality, I still think the biggest gap is
[00:44:26] Unknown:
in bridging the world of the data collection and processing systems, of the computing, with the world of the people. I mean, at the end of the day, I think it's all about what we as humans understand, as decision makers, and how we see the world, that we're trying to get at in this whole field. And so I still think having people understand how models work, you know, explainability, and having machines be able to understand intent, are the things that are, you know, just gonna take a huge amount of work in a variety of different
[00:45:04] Unknown:
fields, and there's no silver bullet in that. But I think that's gonna be the big project area that I'm excited to get to continue working in and contributing toward. Alright. Well, thank you very much for taking the time today to join me and discuss the work that you've been doing with Great Expectations. It's definitely great to see that it has continued to progress, and it is growing in terms of the mind share of what people are using for testing their data and the fact that that's something that needs to happen. So I appreciate all of your efforts on that front, and I hope you enjoy the rest of your day. Thank you so much, Tobias. It's a pleasure to get to talk to you and to get to be back on your show. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction to James Campbell and Great Expectations
Evolution and Features of Great Expectations
State of the Art in Data Validation and Profiling
Non-Obvious Benefits and Use Cases
Implementation and Technical Details
Integrations and Extensibility
Future Roadmap and Community Involvement