Summary
Communication and shared context are the hardest part of any data system. In recent years the focus has been on data catalogs as the means for documenting data assets, but those introduce a secondary system of record that has to be consulted to find the necessary information. In this episode Emily Riederer shares her work to create a controlled vocabulary for managing the semantic elements of the data managed by her team and encoding it in the schema definitions in her data warehouse. She also explains how she created the dbtplyr package to simplify the work of creating and enforcing your own controlled vocabularies.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold.
- Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3,000 on an annual subscription.
- Your host is Tobias Macey and today I’m interviewing Emily Riederer about defining and enforcing column contracts and controlled vocabularies for your data warehouse
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by discussing some of the anti-patterns that you have encountered in data warehouse naming conventions and how it relates to the modeling approach? (e.g. star/snowflake schema, data vault, etc.)
- What are some of the types of contracts that can, and should, be defined and enforced in data workflows?
- What are the boundaries where we should think about establishing those contracts?
- What is the utility of column and table names for defining and enforcing contracts in analytical work?
- What is the process for establishing contractual elements in a naming schema?
- Who should be involved in that design process?
- Who are the participants in the communication paths for column naming contracts?
- What are some examples of context and details that can’t be captured in column names?
- What are some options for managing that additional information and linking it to the naming contracts?
- Can you describe the work that you have done with dbtplyr to make name contracts a supported construct in dbt projects?
- How does dbtplyr help in the creation and enforcement of contracts in the development of dbt workflows?
- How are you using dbtplyr in your own work?
- How do you handle the work of building transformations to make data comply with contracts?
- What are the supplemental systems/techniques/documentation to work with name contracts and how they are leveraged by downstream consumers?
- What are the most interesting, innovative, or unexpected ways that you have seen naming contracts and/or dbtplyr used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on dbtplyr?
- When is dbtplyr the wrong choice?
- What do you have planned for the future of dbtplyr?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Have you ever woken up to a crisis because a number on a dashboard is broken and no one knows why? Or sent out frustrating Slack messages trying to find the right dataset? Or tried to understand what a column name means? Our friends at Atlan started out as a data team themselves and faced all this collaboration chaos. They started building Atlan as an internal tool for themselves. Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets and code, Atlan enables teams to create a single source of truth for all of their data assets and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker, and more.
Go to dataengineeringpodcast.com/atlan today. That's A-T-L-A-N, and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $3,000 on an annual subscription. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.
Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey. And today, I'm interviewing Emily Riederer about defining and enforcing column contracts and controlled vocabularies for your data warehouse. So, Emily, can you start by introducing yourself?
[00:02:07] Unknown:
Hi. Yes. My name is Emily Riederer, and I currently work at Capital One as a senior analytics manager.
[00:02:15] Unknown:
And do you remember how you first got involved in data management?
[00:02:18] Unknown:
Yeah. Definitely. My journey to data management, I'd say, was actually very accidental in a way. Thinking back to school, I was really, really focused on kind of the last mile of statistical modeling, did a lot of mathematics and a lot of statistics. But then I think when I got to industry, something very interesting happened. My program in school, if anything, focused very, very much on the theory instead of the application. I had spent a lot of time proving theorems and thinking about convergence, that sort of thing. But then when I got into the real world, I realized the hard part was almost somewhere else entirely.
It was about managing data, accessing it, understanding what it means, and just seeing those same stumbling blocks come up over and over and over again of just imprecise definitions. Data kind of defying expectations really kept pulling me further and further upstream to understand both why that was happening and what we could do about it.
[00:03:11] Unknown:
That brings us to the topic we're discussing today of the idea of data contracts and controlled vocabularies. And before we get too far into that, I'm wondering if you can start by discussing some of the anti-patterns that you've seen in data warehouses and data lakes and some of the naming conventions and sort of lack of agreement in terms of how the data is laid out and some of the contracts, and how that all relates to the data modeling approach?
[00:03:38] Unknown:
So overall, I'd say I tend to think about column naming anti-patterns in four kind of general categories. First, I think there's sort of typos and internal inconsistency, of just, say, you have multiple variables in the same dataset where the parts of the name maybe aren't even arranged in the same way, like logins_12 for the past 12 days, logins_14 for the past 14 days, and suddenly you have, like, logins_16_days. That's the bucket of just typos and inconsistencies. Second, I think we have misleading naming, like name components where you think something means one thing, but it really means another.
Then the last two kind of buckets I tend to think about are when the same name can be used to represent multiple different concepts. If you think about an ecommerce company, something as simple as login might mean many, many different things to many, many different people and represent a different type of event within a broader login process. And finally, we have the same concept being represented by multiple names, something like account maybe getting abbreviated sometimes as ACCT, sometimes as ACNT, sometimes spelled out in full, etcetera.
So I think sometimes those can be more just annoying, sometimes they can be misleading, or sometimes they can just be confusing when the definitions aren't precise. And I think there's actually a really interesting interplay between that and different modeling approaches. As you mentioned, I tend to think of the different modeling approaches, as we think about a star schema or a snowflake schema or a data vault, as kind of this spectrum of what's represented as a table versus what's represented as a column in a table. And, of course, I'm kinda doing a lot of hand waving and speaking in very broad terms about those modeling approaches. But I think for the actual data producer, any of those are fine as long as we have a very good mental model or an ER diagram that sort of thinks about those linkages.
But once those tables start to get denormalized and pulled into, like, higher level views that users can access and are maybe intended for more specific purposes, a lot of the metadata and lineage that previously existed in those links between tables and understanding how they interact kind of goes away as those tables are flattened out, and only really the metadata in the column names themselves remains. So unless we kind of proactively fight against it, we have this kind of inherent loss of information throughout any sort of data modeling process.
And finally, beyond what I'd call anti-patterns, I think there's also an element of opportunity cost. Again, going back to my background and my initial point of view coming from a more math and theory view of the world where concepts have these hyper specific names and definitions and meanings, every time that our variables and our columns don't really have that same type of clarity, we're kind of just missing out on providing more metadata and information to our users, which is ironic when our job as data producers is literally to provide as much information to users as possible.
[00:07:03] Unknown:
And in terms of the avenues for being able to identify some of these contracts and methods for conveying that information, as you mentioned, there are things like data catalogs and metadata management. Obviously, there are the table and column names. And I'm wondering what you see as some of the types of contracts that can and should be defined and some of the ways that they can be enforced in establishing and maintaining these different data workflows and ways that you can convey that necessary and useful information in the way that the data is stored and represented?
[00:07:37] Unknown:
Let me start out with what sort of contracts should be enforced. And I think there are, in my mind, three main parts to that one. First, there's basic things that are probably what people think about now in terms of standard data quality checks. So we have things like nulls, how they're encoded, whether or not they're allowed to exist in certain columns, and correct data types. We definitely wanna know if a date's really a date or timestamp or a string representing a timestamp because that can have pretty foundational implications on how we can further process that variable, what sort of logic we can use to filter it, etcetera.
The second type of contracts that we'd wanna enforce is thinking about what are our foundational business concepts. So things like, is there a single source of truth definition for what is a login or what is a user? I think this is something that, until you've experienced it, sounds just kind of absurd that a large organization wouldn't have. But so often, a lot of those definitions can end up being very team specific or even person specific, where someone internalized something and made an assumption in their own mind. And enforcing the contract for what those concepts actually mean can be really important and really powerful.
And thirdly, I think there is also space for contracts around the implications of how we chose to define some of those concepts. So, for example, if I think about what constitutes a successful login, depending on if I'm thinking about the first step in the process or the last step in the process, that also has implications on when will that data load. When is it fully available? When is it only partially available? How soon after a whole record is created should I be able to trust an individual field in that table? So I think that's kind of the third category of how do we enforce and communicate the interaction between some of our more definitional aspects.
[00:09:38] Unknown:
It's interesting that you're bringing in the idea of some of these business definitions, because that's also something that comes up a lot in the conversations around the semantic layer or what a lot of people are calling the metrics layer in some of the recent businesses and tools that have been coming out. And I'm wondering what your thoughts are in terms of the utility of that dedicated metrics layer versus just encoding some of those business rules in these column level contracts that we're discussing here?
[00:10:08] Unknown:
Yeah. Definitely. And I think probably one theme throughout all of this kind of philosophy I've come up with is I'd like to really front load as many of these things to happen as soon as possible. So having some of those kind of foundational concepts, like if I'm trying to model a customer, I'd better know what a customer is. I'd like the earliest-stage engineer in my data pipeline to know what that definition of a customer is to drive as much consistency throughout the platform as possible. Of course, if I contrast that with the metrics layer, I think to me, probably, the rub really comes down to some types of transformations, especially those involving aggregations or that sort of thing.
Obviously, some things necessarily can't happen at the beginning of the pipeline and have to really happen in that last mile, when you know what sort of grain of data you wanna be working with and that sort of thing. But I think by and large, since I'm someone that really, really values consistency, not necessarily in the database sense, but in the single-source-of-truth, semantic sense, I think when possible, I like to really pull those things up early into the process.
[00:11:25] Unknown:
It also brings up the question of what are the boundary points where we need to be thinking about the definition and enforcement of these contracts and how as you mentioned, you'd like to be able to pull that in as early as possible, but you also want to make sure that these contracts exist at the points where you're handing it off between teams or between data producers and consumers. And I'm wondering, what are some of the useful strategies you've seen for identifying these boundary conditions, and what types of information need to be present at maybe some of these different handoffs where maybe it's just a boundary between data engineering teams versus between the data engineers and the data analysts or between the data analysts and the business users?
[00:12:07] Unknown:
I think necessarily, and almost increasingly so, there definitely are a lot of those boundaries in building up any sort of a data infrastructure. And in my mind, I think probably one of the number one threats to data quality is not having that alignment really throughout and across all those boundaries. I know even in my own career, I've been on, I think, both sides of this, both as a data user and a data producer. A lot of times users will end up looking at a piece of data and thinking that data is low quality, thinking it's wrong, just because it doesn't adhere to their specific mental model of what the data should be. I suppose another way to say that is, like, data seems wrong to users so long as it is not fit for their purpose, never minding what the actual requirements were of that data when it was built.
So I think, especially as kind of data pipelines grow more complex and there are more handoffs, kind of making sure everyone's really aligned with the same baseline of last mile goals of the user is really, really important. And something I really like about the column name contract framework that I know we'll be getting more into in a little bit is just defining in one place, with kind of a more cross functional group of stakeholders, what all those requirements are, what all these definitions are of these concepts, and being sure everyone, from the first person who touches the data all the way on down, has that shared understanding.
[00:13:39] Unknown:
To the point of using the column names as one of those contractual elements, I'm curious what you have found to be some of the benefits of that and some of the potential trade offs versus maybe establishing these contracts in some sort of documentation process or procedural elements or having some metadata layer enforcements or enforcing them using a tool such as Great Expectations or whatever your pipeline orchestrator might be?
[00:14:07] Unknown:
Yes. Definitely. And I think everything I really advocate for is definitely more to be used in tandem with a lot of these great tools you mentioned, like Great Expectations, instead of on its own. I think my main push is on better encoding the requirements for our data in the data itself, to then be used kind of throughout all the best quality tools we have today of what's now called the modern data stack. But I think for the utility of really embedding a lot of meaning in these column name contracts, I'd say there's both a philosophical and a pragmatic aspect to it. On the philosophical side, and I guess first, let me take a step back and maybe explain, before I answer that question, what the column name contracts actually are and how they work. I think it may make the second part make a little more sense. The general approach I'd like to advocate for and have currently been experimenting with a lot is breaking column names into stubs that communicate different types of information at different levels.
So, for example, we could define one layer of stubs in a controlled vocabulary that communicates how variables are encoded. Like, an ind variable could represent something that's a non-null binary indicator, and an n variable could represent always-positive integer quantities that represent counts. Then we could define another layer of this controlled vocabulary that represents more like entities. Again, if we're working in a very denormalized table where maybe we have fields that are representing multiple different entities in the same table, we can attach those really rich and precise definitions of these business concepts. Again, what's a user? What's a login?
All embedded in a single word in the column name. And you can imagine subsequent layers that have more modifiers, which at some point maybe aren't predefined, to really capture the nuance of the variable. And ultimately then we can define our column names and these really complex concepts just by permuting these sort of more atomic blocks. So going back to what's kind of the utility of this approach versus other approaches, on the philosophical side, I think that what I really like about this approach is that column names are something that can persist throughout the data pipeline, so they can pretty seamlessly serve as a latent form of communication that shares this context between stakeholders who don't often have the chance to interact with each other, either at the same point in time just because they're in different departments, or across time in that those column names persist even when the team that built the table is long gone. They can also be used by all parties in the data production process, and this is where it really connects back to something like Great Expectations, like you mentioned, where not only can these column names help end users understand what's in the data.
Something that I think is even more exciting is they can help define intent for developers and help them act on intent by programmatically transforming variables based on their names. So I think sometimes with frameworks that kind of just provide that last mile value to end users, there's definitely a big effort versus value trade off. It's hard to kind of get alignment and get enforcement. But I think column name contracts are something that we can really embed with how data is actually produced. So going back to Great Expectations, something like, if we have a variable that starts with ind, automatically all ind variables are checked with a not-null check, and so forth and so on. Now on the more pragmatic side and sort of the less grandiose philosophical view of the world, a lot of these same concepts and benefits could definitely be achieved through the standard tools you mentioned, like having good metadata, having good data discoverability tools and data dictionaries, looking at data lineage.
However, I think all of those solutions are predicated on users really reading through all those resources, really embracing the documentation, and trying to understand what's going on. But I think often we do have to be honest that users don't always do that, and probably I'd have to admit when I'm on the user side, I'm probably guilty of that as well. You come to any dataset with your preconceived notions of how it works, and those can be sort of hard to shift. But attaching all this meaning to column names tries to make the data's meaning more intuitive, just in the way that if you get a new iPhone, you're hopefully not expected to read a long user guide.
It just should be like so intuitive that it kind of pulls you along in the process.
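As a concrete, purely illustrative sketch of what such stub-built names might look like in a dbt model (the stub meanings, entity stubs, and model name below are hypothetical examples in the spirit of what Emily describes, not her exact vocabulary):

```sql
-- Hypothetical controlled-vocabulary stubs (illustrative only):
--   ind_  non-null binary indicator     n_   non-negative integer count
--   amt_  summable currency amount      dt_  calendar date
-- Column names are then just permutations of these atomic measure stubs
-- plus entity/event stubs like "login" or "account".
select
    ind_login_success,     -- did this login attempt succeed? (never null, 0/1)
    n_login_attempt_30d,   -- count of login attempts in the past 30 days
    amt_purchase_30d,      -- total purchase amount in the past 30 days
    dt_account_open        -- date the account was opened
from {{ ref('stg_customer_activity') }}  -- hypothetical staging model
```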
[00:18:57] Unknown:
And in terms of the actual contracts that you're establishing and the vocabulary that you're using, you mentioned, for instance, having IND to represent a binary indicator. And I'm wondering what the process is in being able to establish these vocabulary elements and their meanings and who should be involved in that process to be able to say, these are the terms that we're going to use. This is how they're going to be represented. This is where those definitions are going to be stored so that everybody knows when they go to a column name, these are the different components of the name. This is what each of those components means. Maybe there is some semantic meaning in terms of the ordering of those components and just how that's established or if you think that it's something that we should maybe, as an industry, work to establish a universal baseline of what these terms mean and what their sequencing is supposed to represent.
[00:19:55] Unknown:
I think to answer the last question first, while I absolutely love the idea of widespread standards, I do think, because of just the nuances of different domains, there's probably a lot of value in having, or at least trying to have, companies or organizations define their own data dictionaries that make the most sense for their organizations. I know, for example, I always tend to put kind of that more type-oriented layer first just because it makes more sense personally to me. Those are typically the types of indicators that are hinting to me things like transformations that I wanna do with the data, but that doesn't always make sense to everyone. So I think I tend to shy away from having industry wide definitions just because there is certainly a risk to this approach that if it doesn't feel right and feel good and feel like a perfect fit for your use case, people might eschew it altogether.
But for the process of actually defining a controlled vocabulary and these stubs to name different concepts, I definitely think that should be a cross functional effort between data producers and the different data consumers within an organization. I tend to think of it much in the same way that one would think about typical requirements gathering for any sort of a large data project build, data warehouse build, etcetera. But I think ideally, it can actually have some positive externalities that typical requirements gathering doesn't have, because it can bring those cross functional stakeholders into the same room and push for more precise and universal definitions of different business concepts and metrics as people debate how those concepts should be represented in a table.
So I do think data can actually be very influential. It's almost the common language of how a business tends to think about its processes. So really taking the time, as hard or painful as it might be, to bring all those people together and come to a common understanding of how the data should look can actually ultimately help influence how an organization thinks. And the second benefit I'd say of this kind of alternative process to typical requirements gathering is it is a little bit more immutable. Like, if I sit down, maybe as a product manager with my stakeholders today, we decide the specific fields that need to be in a table, and my engineers go build it.
If someone comes along later and wants more fields in the table, wants different information in that table, there's a lot more reinventing the wheel, thinking about making all of those decisions again. But I think by defining and documenting a controlled vocabulary, we can get a little bit closer to immutable and evergreen documentation, where that definition and that almost-specification can exist even as the underlying specifics of exactly where that data is coming from, which fields are included, some of the more transient aspects of data, can change, while we still kind of have that controlled vocabulary to cling on to.
[00:23:04] Unknown:
There's definitely a lot of useful information that you can encode about the data and its format and the data type for a given column and maybe some of the boundary conditions for, you know, this value needs to be within this set of ranges. But what are some of the types of information that you found are either difficult or impractical to be able to actually embed in these vocabularies that does still need to live in some sort of documentation system or external metadata catalog?
[00:23:40] Unknown:
I think one of the most important aspects really of this whole philosophy is to avoid lying to yourself. As I started to build all this out for myself, something that I continued to challenge myself on was that the moment you start violating any of your contracts and you deviate at all, in some ways, it's a house of cards where the whole thing falls down, because no one wants names that are like, oh, these uphold valid contracts 75% of the time, but you don't know which 75% of the time. So as I think about what's impractical, I really think about, just as if I were making a promise to someone in real life, what are promises that I might make in the data that I truly cannot uphold. And I think a good example of that is probably something like a uniqueness check.
So, for example, if we're building an account level table and we had a field like id_account, maybe the account field is unique, and it could be tempting to say, like, oh, all ID fields should be unique. However, you know, joining different data sources is probably one of the most common operations in all of data management or data wrangling. And we don't want a situation where, as soon as somebody joins that field to another table, they've sort of broken that uniqueness contract. There are ID accounts that have two, three, or four records genuinely associated to them.
And then you get into a weird situation where it's like, well, should someone rename that variable? You don't always wanna have to rename a variable every time you join a table just because its function changed, etcetera. So I think properties that can but don't always have to be true of a variable in different circumstances are definitely something that can't be captured at the name level and kind of do have to be handled in more of the traditional data dictionary type approach.
[00:25:39] Unknown:
And so in terms of the actual work of ensuring that your data conforms to the contracts that you're specifying in the column and table names, I know that you built the dbtplyr package for the dbt ecosystem. But before we dig too much into that, I'm wondering if you can talk to some of the work that's involved in actually taking your source raw data that doesn't necessarily fit the specifications that you want in your more managed and transformed data and actually taking it from that source, manipulating it to ensure that it conforms to the contracts that you're specifying, and loading it into the destination tables, and then maintaining that overall process and maintaining the integrity of those contracts as you continue to evolve the schema, incorporate new data, or as there are anomalies in the source systems that you're pulling from?
[00:26:34] Unknown:
Definitely. And I think that question really highlights a few key challenges that certainly inspired the dbtplyr tool that you alluded to, which we'll touch on in a little bit. But itemizing a few of those key challenges: we are pulling data from many different source systems that won't follow these conventions. So something I like to do somewhat early on in the process is get the names of the different fields more like what their target names will look like. Because as I mentioned, some of these stubs can sort of serve like a config file, almost, where they can tell the data producer how certain fields need to be wrangled to get them to their destination state. So for example, if I have a field that should be a count value, it now starts with an n, so I know eventually that field needs to get transformed to an integer.
The second challenge that you mentioned is, you know, making sure that if data comes in corrupted, how does that get cleaned up throughout the process if there are anomalies in the data? And I think that's where in-pipeline testing can be very powerful, because this approach gives us another layer of abstraction where we're no longer saying, well, I wanna check this column for this property. Instead, we can expand some of those data checks to say, well, for anything that starts with an n, which I know is a count variable, I can go ahead and run a data check in my pipeline to make sure that it has no fractional component. Whether it's an integer today, whether it's a float today, no matter what form it is today, I can go ahead and run that check for any variable that starts with an n.
And then once I have that comfort that all these variables seem like integers, I can go ahead and have a transformation step in my data pipeline where I actually cast all of those variables to integers. And the other sort of power of this type of approach is that, since I'm no longer itemizing those column names, we can make ourselves a little more resilient to changes in schemas like you mentioned, where now if I go in and, for some reason, remove a couple of columns that start with n, or I add multiple columns that start with n, it doesn't actually really matter or change anything, because ideally the code in my pipeline is just saying, I'm going to do this for any variables that have this certain structure, which makes it a little bit more robust to changes.
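A minimal sketch of that prefix-driven, in-pipeline check as a dbt singular test, written with only core dbt Jinja (adapter.get_columns_in_relation is a standard dbt adapter method); the model and file names are hypothetical, and dbtplyr's helpers would make the column selection more concise:

```sql
-- tests/assert_n_columns_are_whole_numbers.sql (hypothetical singular test)
-- The test fails if any column whose name starts with n_ carries a
-- fractional part, however many n_ columns the upstream model has today.
{% set cols = adapter.get_columns_in_relation(ref('stg_customer_activity')) %}

select *
from {{ ref('stg_customer_activity') }}
where 1 = 0
{% for col in cols if col.name.startswith('n_') %}
   or {{ col.name }} <> floor({{ col.name }})
{% endfor %}
```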
[00:29:09] Unknown:
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it's often too late and the damage is done. DataFold's proactive approach to data quality helps data teams gain visibility and confidence in the quality of their analytical data through data profiling, column level lineage, and intelligent anomaly detection. DataFold also helps automate regression testing of ETL code with its data diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values.
DataFold integrates with all major data warehouses as well as frameworks such as Airflow and dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with DataFold. In terms of the actual enforcement of these contracts once they're defined, I'm curious what the process is for actually going from the conceptual element of these are the components that we want to ensure, this is the vocabulary that we've established to be able to signify these semantic meanings, and then taking that prose and conceptual idea of what the vocabulary represents and actually enforcing that in a programmatic fashion, and then maybe even being able to share some of those programmatic elements between different tools, where maybe you have a definition in your Great Expectations assertions that says this element of this table name means this certain semantic idea. So taking, for example, the counts to say this can only represent integer values, and then also being able to identify and enforce that in dbtplyr or in the downstream data analysis or data science tooling to say, I'm selecting from this column, so I know that I need to expect this type of data within these ranges, or maybe this is a string value, but it's an enum type, so I only expect these three values to be present in this column.
[00:31:22] Unknown:
The question of interoperability across tools is very interesting, and honestly, it's not something that I've personally delved into a lot yet. I've implemented kind of helper tools to work like this in both dbt and a small experimental R package that's doing similar stuff. I think to really take this approach to the next level, there would have to be a little bit more of a universal spec or configuration file or something, a slightly more standardized way to define these expectations, that then could automatically generate the dbt code that you need, the Great Expectations code you need.
Right now, in my experience, honestly, I've mostly done that kind of step manually of automating within each tool, how these transformations get applied, but have not honestly done a lot of work yet with interoperability.
[00:32:22] Unknown:
Yeah. I can definitely see an opportunity for the metadata layer being the common glue to be able to maintain those specifications and tie it together, maybe using something like the open metadata specification and reference implementation where you say, this is the JSON schema document that says these are the valid enum types for this vocabulary element. So if you see a column with this attribute in the name, then you can validate that these are the only pieces of information that you can consume or inject from that location. And so digging more into the dbtplyr project itself, I'm wondering if you can talk about some of the work that you've done there and some of the story behind how you decided to actually implement it and some of the ways that it's being applied in your own work.
[00:33:12] Unknown:
Yeah. Definitely. So far, we've talked mostly about how column name contracts can be that interface for end users, how they can help them, you know, understand the data better, use the data kind of more correctly for their purposes. And we just started to dip our toe into how these same contracts can guide producers into knowing what sorts of transformations need to get done. dbtplyr makes it possible to not only think about these contracts as configs, but to incorporate them in the code itself within a dbt project. And I think the real motivation for this was just my thinking that nominally having these contracts is almost absolutely the worst case scenario if you cannot enforce them and you risk breaking them for users, because the one risk of this approach is leading to even more complacency among data users, of, like, oh, if I know this variable is supposed to not have nulls, maybe I stop doing the not-null check, which is both a great time savings hack if it's true and a kind of scary result if it's wrong.
So with dbtplyr, my goal was to think about how we can make as many of the kind of transformations and wranglings that you need for column name contracts as automated as possible. And both the basic idea and the API for this package were really, really heavily inspired by the R package dplyr. They don't talk explicitly in that tool about column name contracts, but there are a lot of really nice semantics in the language, in both dplyr and a lot of its sort of companion packages that have borrowed its semantics, for operating on column names.
And there were a lot of things that I found, when I kind of switch between languages and move back and forth, that I really, really missed when I was working in SQL. So some of the features that it offers: it provides a number of macros to help select a list of column names based on properties of the name itself. So in a dbt macro, you can extract all of the column names from the relation you're working with and subset them with macros called things like starts_with, ends_with, contains. So I can easily make a short list of all of the columns in the table I'm currently working on that, say, start with ind. Next, it provides a number of iterators that help us apply transformations, in either a select clause or different filters in a where clause, to a list of variables at once instead of individual variables.
So in that dbt model, we can truly write code that's just like, select the sum of each of the variables whose name starts with ind, instead of having to manually list those out one by one by one. So, overall, I think the concept behind dbtplyr is pretty simple. A lot of the code on the back end is just Jinja code calling regex functions in Python to subset these lists of variable names. But the way it becomes very powerful is when we think about the interaction with dbt's built-in macro and testing functionality, because you can programmatically apply transformations and validate your results.
So, for example, going back to the example that I've used a couple of times before about wanting our n variables to be integers, but wanting to make sure we aren't casting non-integers to integers and just covering up our mistakes. With dbt's pipeline, we can write a test that runs before the transformation is executed, and we can double check that all variables that start with n in a certain table have no fractional component. And we can write that in a robust way that doesn't care if we have one variable that starts with n or ten variables that start with n, or if the number of variables that start with n changes. So that can be done with just two or three simple lines of code. And then in our next model, after we've assured that's true, we can write a dbt model that casts all the n variables to that integer data type.
And doing this sort of thing programmatically can help us avoid typos and ensure that if our model changes or variables are added to our pipeline, it still works. And, of course, I take that as an example just because it's relatively simple, but you can think about doing far more complicated transformations in that same sort of programmatic way, while still achieving kind of the standardization and the consistency that we're striving for. And furthermore, you can go as far as to have tests, not just for the data values, but for the data structure and the data schema themselves.
You can write tests using dbt and dbtplyr to make sure that only valid stubs are included in any names in your dataset, just by querying the table's information schema, splitting out the stubs into different components, and checking those against your controlled vocabulary. Or, again with the information schema, you could write tests that certain stubs always correspond to certain data types. So, again, it's the sort of funny thing, because I think the tool has really rather simplistic functionality at the surface level, but definitely some really interesting interactions with dbt that build a lot of safeguards and a lot of automation into this process.
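A rough sketch of the companion casting model described above, again in plain dbt Jinja so nothing here claims dbtplyr's exact macro signatures; dbtplyr's starts_with and across-style helpers express the same pattern more tersely, and the model names are hypothetical:

```sql
-- models/int_customer_activity_typed.sql (hypothetical model)
-- Once the whole-number test has passed, cast every n_ column to an integer
-- and pass all other columns through unchanged.
{% set cols = adapter.get_columns_in_relation(ref('stg_customer_activity')) %}
{% set exprs = [] %}
{% for col in cols %}
    {% if col.name.startswith('n_') %}
        {% do exprs.append('cast(' ~ col.name ~ ' as integer) as ' ~ col.name) %}
    {% else %}
        {% do exprs.append(col.name) %}
    {% endif %}
{% endfor %}

select
    {{ exprs | join(', ') }}
from {{ ref('stg_customer_activity') }}
```

The same pattern extends to the schema-level tests mentioned above: list a model's columns, split each name into its stubs, and fail if any stub is missing from the controlled vocabulary or is paired with the wrong data type.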
[00:39:01] Unknown:
And so going back to what we were discussing before about the limitations of the vocabularies and being able to encapsulate some of the contextual and semantic elements and when you need to actually fall back to documentation or additional information in the metadata layer. What have you found to be some of the ways to be able to link the column names and the vocabulary elements to the necessary information for when you do need to be able to dig into what the more nuanced meaning is for a given column or table or piece of information?
[00:39:41] Unknown:
Honestly, right now, I think I do have a bit of a simplistic system for that. Another nice piece of functionality I really like about dbt is its ability to automatically generate really nice documentation websites. You can embed any sort of custom information you want in those websites with just basic markdown syntax. So very easily, by putting something at one of the higher levels of your dbt project, you can have almost one of the main pages of your documentation site feature more of the controlled vocabulary, the universal source-of-truth definitions of these foundational concepts, but still have a page that you can kind of click into more specifically for every single dataset and read those more traditional definitions of each variable.
And I think then there's stylistic choices to be made there too, of, like, when you're defining kind of the metadata for each and every individual variable, do you define it in full? Do you only define things that go above and beyond what is already implied by the controlled vocabulary? Just depending on the level of verbosity you wanna include. That's obviously, I think, not the perfect approach because it still sort of leaves questions in people's minds. Like, you don't know which columns have that extra bit of kind of specialized information. You don't know exactly where there's more to the story, but that's at least a first pass at getting all the information kind of colocated in one single place.
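For reference, dbt's docs blocks are one concrete way to host that vocabulary page alongside per-model documentation; the block name and wording below are hypothetical, while the markdown-with-Jinja docs block format itself is standard dbt functionality:

```jinja
{# docs/controlled_vocabulary.md (hypothetical file) #}
{% docs controlled_vocabulary %}
## Controlled vocabulary

| Stub | Meaning          | Contract               |
|------|------------------|------------------------|
| ind_ | Binary indicator | Never null; values 0/1 |
| n_   | Count            | Non-negative integer   |

Entity stubs such as `login` and `account` carry the single-source-of-truth
business definitions agreed on by producers and consumers.
{% enddocs %}
```

Referencing the block with {{ doc('controlled_vocabulary') }} in a description, or overriding the generated site's landing page with a {% docs __overview__ %} block, keeps the vocabulary one click away from every model page.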
[00:41:15] Unknown:
In terms of the challenges that you've seen in being able to agree on the necessary semantic elements to encapsulate the majority of concerns that you need to encode in your schemas, what are some of the sort of social and organizational complexities that arise from that?
[00:41:39] Unknown:
And I love how you asked that question too, because I think, like so many challenges in data and technology, the hardest challenges really are at the intersection of technology and organizational culture. As with many things, I think the challenge is there's a lot more work probably upfront getting that initial alignment and making the initial investment: it's kind of worth getting a lot of stakeholders in the room, hashing out these definitions, and incentivizing, slash really pushing towards, all-or-nothing adherence to those definitions.
I think definitely just painting as vivid a picture as possible of the value that comes out the other end, how this can really help all stakeholders in the project. I think it's really important to help people understand that it's worth different business units getting involved in the discussion because this can actually help their users work more efficiently, understand the data better, etcetera. But at the same time, helping developers understand that it's not just syntactic or semantic sugar for the end user, but it's also a tool that can really help them, can start to get a little bit more of that alignment.
However, I think something I've learned about in the many different ways that I've tried to push standardization in my own work and on the teams I've been on is that no one will ever want to play the role of that enforcer. And at the scale you're trying to get unification, there's just no way, you know. I don't think this is even a solution that you can just purely code review your way out of. So I think as much as you can overcome those challenges, it's mostly through automation: having kind of this automated testing of the fields, of the schemas themselves, is probably the only way you can truly achieve the consistency and adherence, to have, I guess, both the carrot and the stick approach.
[00:43:49] Unknown:
And then one of the other interesting things worth digging into is some of the stylistic aspects of defining these vocabulary terms, where in some of the posts that I've seen, you write about it using very succinct abbreviations. So for instance, you mentioned having an n prefix to indicate a numeric or count variable or ind for a boolean indicator. I'm curious what your thoughts are on the utility of keeping those vocabulary elements concise and abbreviated versus spelling them out and having them be more verbose, at the risk of having the overall column names be on the longer side?
[00:44:32] Unknown:
That's definitely a tricky question, and, again, there's a part of me that always craves the level of precision that requires a certain amount of verbosity. I'm afraid that's probably eminently clear from the way I've been answering some of these questions. But I think at the same time, thinking of the perspective of the data user, especially when you stop iterating over columns and saying just, oh yes, sum all the n columns, you remember that at the end of the day, someone's probably going to have to type the names out in full.
I think there is definitely a fine line to walk. I think already this approach tends to produce longer column names than probably exist in the wild per se. So knowing that we're already including more components in the name, I think I tend to at least keep each of those components a little bit on the shorter side, just recognizing that probably no matter how long I made them, the stubs themselves might not be completely self evident without someone reading through the controlled vocabulary to understand both the semantics and the contracts that come along with them.
[00:45:47] Unknown:
And then one other aspect of this is the management of the evolution of the vocabulary, where you'll go through an initial phase of establishing these are the core elements of the vocabulary because they contain the concepts that we need to encode for the majority of our data. But now as our data usage expands or as our sophistication grows, we need to be able to have more concepts encoded, and just making sure that you are able to maintain the evolution, not just of the schema that the data resides in, but also the vocabulary elements that are required to be able to convey the necessary context about that evolved schema.
[00:46:28] Unknown:
I think that that is probably one of the biggest challenges of this approach, in the sense that it can be a little bit all or nothing. And it's one of those things where if you start to implement when you're too immature in your data life cycle, you probably won't pick out the best concept definitions. If you start to implement when you're too mature, there's a lot of baggage and backtracking required. But I think probably one of the ways to keep things as flexible as possible, there are some really good analogies to how we think about something like a data vault, which can also keep your data schemas as flexible as possible.
And the analogy I draw there is, I think in data modeling, when we wanna keep things flexible, we tend to break things up into smaller, more unitary, modular components that can then be, you know, pieced back together when they need to be pieced back together. And I think similarly with column names, it's easier to keep that consistency over time while keeping things flexible if you don't overload any one stub with too many different meanings. Because, again, if you think about pretty insular, modular concepts, going back to that ind variable, a binary, non-null variable, maybe if we need a binary variable with nulls, we can add a bin variable.
But by not adding too many contracts or too many conditions to any one stub, I think we can achieve a lot of that flexibility by how we permute and combine those concepts, without too much creep or explosion in the number of concepts we define, or, worst case, having to change them later.
[00:48:16] Unknown:
Yeah. And going back to the point about maintaining some level of terseness in the vocabulary elements, I think that if you start to get verbose in the definitions, then you run the risk of having concepts that span multiple words, which makes it very difficult to understand what are the actual tokens that I need to split this column name on to understand all of the elements. Where if I have a two-word vocabulary element, I either need to smush it all together or I need to have an underscore and say, I don't split this underscore, I just split all the other underscores. And so there's also just being able to think through some of the kind of tokenization and programmatic language design elements that go into it as well.
[00:48:56] Unknown:
Yeah. No. Definitely. A lot of choices to be made, and I think it also really hearkens back to your point that not every bit of context can probably be captured in the names. Like, when you have to optimize, I think it's probably always better to optimize for communicating things that are true and are helpful, even if they aren't necessarily the full, full, full story.
[00:49:20] Unknown:
Yeah. Maybe you need to have a vocabulary element that says, this name doesn't tell you everything, and you need to actually look up the documentation.
[00:49:30] Unknown:
Possibly.
[00:49:31] Unknown:
And so in terms of your experience of working through the evolution of this philosophy and codifying it in dbtplyr and using it in your own work and seeing other people start to adopt it, what are some of the most interesting or innovative or unexpected ways that you've seen either these controlled vocabularies and/or the dbtplyr package used?
[00:49:56] Unknown:
I think one of the most interesting use cases in my mind, just because when I initially was working on this idea, I was definitely working on it very specifically and very narrowly, more in the data warehouse domain, is the way that I've seen people use it, and people have shared with me the benefits, in more complex pipelines where they're moving data in and out of more or less structured formats, numerous times or for different use cases. I think, you know, obviously, people that work primarily in databases are fairly lucky, because even just the basic kind of type information about different variables is in fact stored in the database.
But the more that you're working with unstructured data, or even putting your data in and out of flat files where you don't have a concept of that schema, I think column names can be an even more potent way to still capture and save some of the correct expectations of how should this variable work, what kind of variable is it. Those can kind of be resurrected once your data is moved to a system that has and expects those features again.
[00:51:04] Unknown:
In your work of using dbtplyr and controlled vocabularies, what have been some of the interesting or unexpected or challenging lessons that you've learned in the process?
[00:51:12] Unknown:
I think the biggest challenges really are around how you absolutely make sure you're driving consistency with this approach, and also that you're finding the right level of abstraction. And I think those two things are pretty intimately tied, because if you define that really, really good vocabulary that works perfectly for your organization, fits like a glove, it'll probably be fairly easy to get it adopted. But as we know from all of the gnarly reality and messiness of real world data, if your vocabulary isn't a good fit for the org, you probably won't get a lot of adoption.
Things will start to drift apart. So I think both kind of approaching it humbly, not being too overly grand in the vision, and just figuring out really no-regrets abstractions you can use. And like we were saying before, picking things that are generally in the right direction, communicating useful information even if they don't tell us everything, can definitely be helpful, so as to not get too terribly ambitious. And then I do think, secondly, in terms of driving that consistency, it has really been a challenge just to make sure that even data producers really understand all the ins and outs of the vocabulary and implement it correctly.
But I think the more you can work with a shared code base, shared dbt macros, and have really robust testing of your outputs, that is definitely a major advantage. Because this system professes having a correct answer, it's at least an obvious and somewhat easy thing to check whether or not it is going right. So I think just leveraging those advantages can be very useful. And if a lot of those tests are failing, if there is a lot of data that isn't becoming compliant, then honestly just talking with different teams and data producers to understand why that process may not be working for them is definitely also a critical aspect for success.
[00:53:19] Unknown:
And for people who are looking for a way to maintain some of the consistency in their data approach and be able to share some of the semantic information, what are the cases where controlled vocabularies or the dbtplyr project are the wrong choice, and they might be better served with maybe some library code in their data pipeline or orchestration layers, or relying on their metadata system and data catalog to be able to share that information, or leaning on these newer metrics layers as the source of truth for these business concerns?
[00:53:57] Unknown:
I think it probably depends a lot on what stage in your data life cycle you're at. If you're too immature, say if you're a super fast growing startup and you barely even know what your product is going to look like a month from now, it's possibly too soon to optimize, too soon to implement an approach like this, just because, as we talked about, with fast changing schemas and definitions, the requirements that may make sense could change fairly dramatically, especially when we think about things like what is an entity.
On the flip side, I think if you're in a really, really mature org with a ton of data produced under different processes, there's probably still a place for this approach if you're starting a new project, building a new warehouse or schema, but there you definitely have to be a lot more careful and explicit with your users about which tables this approach applies to and which it doesn't apply to. So that's all to say, there are probably many points on the maturity curve where, if you're changing things too quickly or you're unable to change things even very slowly, it's probably the wrong choice, because it's not the world's most flexible solution to a problem.
But if you're kind of reaching a point where maybe you've had some tables, maybe you have some data infrastructure, but you're at a point as a company where you want to take your data infrastructure to the next level, I think that can probably be more of the sweet spot where it's a relevant tool, where you have a good grasp on your domain, but you aren't really too locked into anything or too resistant to change yet.
[00:55:41] Unknown:
As you continue to work with this data modeling approach and evolve the dbtplyr package and start to popularize the utility of controlled vocabularies, both within your own organization and in the community, what are some of the things you have planned for the near to medium term, or that you would like to see other people in the ecosystem contribute?
[00:56:03] Unknown:
I think the biggest thing in my mind there is that the kind of philosophy of this approach far outstrips what dbtplyr is currently capable of doing. Right now, I think it fairly effectively handles the middle step of transforming and testing data based on a controlled vocabulary, but not really the first through the last step. So I think the things that really could improve this package, that I'd love to work towards in the future, are providing a clear specification or definition of that controlled vocabulary. Probably in dbt, that would be in a YAML file, since that would be very consistent with its overall style.
And I think I would love to see that sort of specification or definition both find ways to more proactively guide these transformation and testing steps, instead of users using dbtplyr to write those out manually, and also, to our conversation about documentation, figure out how to bring that a little bit more holistically and seamlessly, in an automated fashion, into the dbt docs.
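Purely as an illustration of what such a specification might eventually look like (nothing like this ships with dbtplyr today, and the stub names, keys, and check labels below are hypothetical), the vocabulary could be declared once as data, for example in a macro returning a mapping or, as Emily suggests, a YAML file, and then used to generate the tests and casts automatically:

```sql
-- macros/controlled_vocabulary.sql (hypothetical sketch of a vocabulary spec)
{% macro controlled_vocabulary() %}
    {{ return({
        'ind': {'data_type': 'boolean', 'checks': ['not_null', 'values_0_or_1']},
        'n':   {'data_type': 'integer', 'checks': ['not_null', 'non_negative']},
        'amt': {'data_type': 'numeric', 'checks': []},
        'dt':  {'data_type': 'date',    'checks': []}
    }) }}
{% endmacro %}
```

A generic test could then loop over a model's columns, look up each name's first stub in this mapping, and apply the listed checks, so adding a new column would never require writing a new test by hand.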
[00:57:12] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:57:27] Unknown:
Definitely. I think another area that I'm very passionate about and that I'm definitely watching closely right now is the many different data quality tools that exist today. I think the past couple of years have seen an explosion of really powerful tools for monitoring data quality. The thing that I feel like there's a substantial gap in is what those tools are actually testing for. I think there's a lot of talk right now about thinking about data quality like unit tests, but to me that feels like a slight mismatch, from the standpoint that data isn't always necessarily a single unit. I think a lot of the data quality tools that exist today have a lot of really powerful row-wise or record-wise checks.
But what I tend to find when I think about the data quality issues that I've encountered in my work is that a lot of it is more driven by internal consistency of data and how that data relates to itself and relates to the domain that it's in. So, for example, I think a lot of businesses work with panel data where you're tracking a similar set of users over time, or you at least have some interrelationship between rows. I find a lot of data quality tools are somewhat immature in terms of being able to specify more domain driven and custom data quality checks that help you test the relationship between rows instead of just singular rows at a time. So I think that's definitely one space where I'm really interested to see where it goes and would really be excited to see more tools that allow you to embed kind of that domain specific context in the workflow.
[00:59:11] Unknown:
Alright. Well, thank you very much for taking the time today to join me and share the work that you've been doing on dbtplyr and the controlled vocabulary specifications that you have been experimenting with and helping to popularize. It's definitely a very interesting and useful approach to establishing and maintaining these data contracts, which I think is something that is not as widespread as it can and should be. So I appreciate the time and effort you've put into that, and I hope you enjoy the rest of your day. Thank you. You too. Thank you for listening! Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Sponsor Messages
Interview with Emily Riederer: Defining Column Contracts
Column Naming Anti-patterns
Types of Data Contracts
Business Definitions and Metrics Layer
Boundary Conditions and Handoffs
Column Name Contracts and Their Utility
Challenges in Enforcing Data Contracts
dbtplyr Project Overview
Linking Column Names to Documentation
Innovative Uses of Controlled Vocabularies
Future Plans for dbtplyr and Controlled Vocabularies
Biggest Gap in Data Management Tooling
Closing Remarks