Summary
Communication and shared context are the hardest part of any data system. In recent years the focus has been on data catalogs as the means for documenting data assets, but those introduce a secondary system of record that has to be consulted to find the necessary information. In this episode Emily Riederer shares her work to create a controlled vocabulary for managing the semantic elements of the data managed by her team and encoding it in the schema definitions in her data warehouse. She also explains how she created the dbtplyr package to simplify the work of creating and enforcing your own controlled vocabularies.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold.
- Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3,000 on an annual subscription.
- Your host is Tobias Macey and today I’m interviewing Emily Riederer about defining and enforcing column contracts and controlled vocabularies for your data warehouse
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by discussing some of the anti-patterns that you have encountered in data warehouse naming conventions and how it relates to the modeling approach? (e.g. star/snowflake schema, data vault, etc.)
- What are some of the types of contracts that can, and should, be defined and enforced in data workflows?
- What are the boundaries where we should think about establishing those contracts?
- What is the utility of column and table names for defining and enforcing contracts in analytical work?
- What is the process for establishing contractual elements in a naming schema?
- Who should be involved in that design process?
- Who are the participants in the communication paths for column naming contracts?
- What are some examples of context and details that can’t be captured in column names?
- What are some options for managing that additional information and linking it to the naming contracts?
- Can you describe the work that you have done with dbtplyr to make name contracts a supported construct in dbt projects?
- How does dbtplyr help in the creation and enforcement of contracts in the development of dbt workflows?
- How are you using dbtplyr in your own work?
- How do you handle the work of building transformations to make data comply with contracts?
- What are the supplemental systems/techniques/documentation to work with name contracts and how they are leveraged by downstream consumers?
- What are the most interesting, innovative, or unexpected ways that you have seen naming contracts and/or dbtplyr used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on dbtplyr?
- When is dbtplyr the wrong choice?
- What do you have planned for the future of dbtplyr?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Have you ever woken up to a crisis because a number on a dashboard is broken and no one knows why? Or sent out frustrating Slack messages trying to find the right dataset? Or tried to understand what a column name means? Our friends at Atlan started out as a data team themselves and faced all this collaboration chaos. They started building Atlan as an internal tool for themselves. Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets and code, Atlan enables teams to create a single source of truth for all of their data assets and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker, and more.
Go to dataengineeringpodcast.com/atlan today. That's A-T-L-A-N, and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $3,000 on an annual subscription. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.
Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey. And today, I'm interviewing Emily Riederer about defining and enforcing column contracts and controlled vocabularies for your data warehouse. So, Emily, can you start by introducing yourself?
[00:02:07] Unknown:
Hi. Yes. My name is Emily Riederer, and I currently work at Capital One as a senior analytics manager.
[00:02:15] Unknown:
And do you remember how you first got involved in data management?
[00:02:18] Unknown:
Yeah. Definitely. My journey to data management, I'd say, was actually very accidental in a way. Thinking back to school, I was really, really focused on kind of the last mile of statistical modeling, did a lot of mathematics and a lot of statistics. But then I think when I got to industry, something very interesting happened. My program in school, if anything, focused very, very much on the theory instead of the application. I had spent a lot of time proving theorems and thinking about convergence, that sort of thing. But then when I got into the real world, I realized the hard part was almost somewhere else entirely.
It was about managing data, accessing it, understanding what it means, and just seeing those same stumbling blocks come up over and over and over again of just imprecise definitions. Data kind of defying expectations really kept pulling me further and further upstream to understand both why that was happening and what we could do about it.
[00:03:11] Unknown:
That brings us to the topic we're discussing today of the idea of data contracts and controlled vocabularies. And before we get too far into that, I'm wondering if you can start by discussing some of the anti-patterns that you've seen in data warehouses and data lakes and some of the naming conventions and sort of lack of agreement in terms of how the data is laid out and some of the contracts, and how that all relates to the data modeling approach?
[00:03:38] Unknown:
So overall, I'd say I tend to think about column naming anti-patterns in four kind of general categories. First, I think there's sort of typos and internal inconsistency, of just, say, you have multiple variables in the same dataset where the parts of the name maybe aren't even arranged in the same way, like logins_12 for the past 12 days, logins_14 for the past 14 days, and suddenly you have, like, logins_16_days. That's the bucket of just typos and inconsistencies. Second, I think we have misleading naming, like name components where you think something means one thing, but it really means another.
Then the last two kind of buckets I tend to think about are when the same name can be used to represent multiple different concepts. If you think about an ecommerce company, something as simple as login might mean many, many different things to many, many different people and represent a different type of event within a broader login process. And finally, we have the same concept being represented by multiple names, something like account maybe getting abbreviated sometimes as ACCT, sometimes as ACNT, sometimes spelled out in full, etcetera.
So I think sometimes those can be more just annoying, sometimes they can be misleading, or sometimes they can just be confusing when the definitions aren't precise. And I think there's actually a really interesting interplay between that and different modeling approaches. As you mentioned, I tend to think of the different modeling approaches, as we think about a star schema or a snowflake schema or a data vault, as kind of this spectrum of what's represented as a table versus what's represented as a column in a table. And, of course, I'm kinda doing a lot of hand waving and speaking in very broad terms about those modeling approaches. But I think for the actual data producer, any of those are fine as long as we have a very good mental model or an ER diagram that sort of thinks about those linkages.
But once those tables start to get denormalized and pulled into, like, higher level views that users can access and are maybe intended for more specific purposes, a lot of the metadata and lineage that previously existed in those links between tables and understanding how they interact kind of goes away as those tables are flattened out, and only really the metadata in the column names themselves remains. So unless we kind of proactively fight against it, we have this kind of inherent loss of information throughout any sort of data modeling process.
And finally, beyond what I'd call anti-patterns, I think there's also an element of opportunity cost. Again, going back to my background and my initial point of view coming from a more math and theory view of the world where concepts have these hyper specific names and definitions and meanings, every time that our variables and our columns don't really have that same type of clarity, we're kind of just missing out on providing more metadata and information to our users, which is ironic when our job as data producers is literally to provide as much information to users as possible.
[00:07:03] Unknown:
And in terms of the avenues for being able to identify some of these contracts and methods for conveying that information, as you mentioned, there are things like data catalogs and metadata management. Obviously, there are the table and column names. And I'm wondering what you see as some of the types of contracts that can and should be defined and some of the ways that they can be enforced in establishing and maintaining these different data workflows and ways that you can convey that necessary and useful information in the way that the data is stored and represented?
[00:07:37] Unknown:
Let me start out with what sort of contracts should be enforced. And I think there are, in my mind, three main parts to that one. First, there's basic things that are probably what people think about now in terms of standard data quality checks. So we have things like nulls, how they're encoded, whether or not they're allowed to exist in certain columns, and correct data types. We definitely wanna know if a date's really a date or timestamp or a string representing a timestamp because that can have pretty foundational implications on how we can further process that variable, what sort of logic we can use to filter it, etcetera.
The second type of contracts that we'd wanna enforce is thinking about what are our foundational business concepts. So things like, is there a single source of truth definition for what is a login or what is a user? I think this is something that, until you've experienced it, sounds just kind of absurd that a large organization wouldn't have. But so often, a lot of those definitions can end up being very team specific or even person specific, where someone internalized something and made an assumption in their own mind. And enforcing the contract for what those concepts actually mean can be really important and really powerful.
And thirdly, I think there is also space for contracts around the implications of how we chose to define some of those concepts. So, for example, if I think about what constitutes a successful login, depending on if I'm thinking about the first step in the process or the last step in the process, that also has implications on when will that data load. When is it fully available? When is it only partially available? How soon after a whole record is created should I be able to trust an individual field in that table? So I think that's kind of the third category of how do we enforce and communicate the interaction between some of our more definitional aspects.
[00:09:38] Unknown:
It's interesting that you're bringing in the idea of some of these business definitions, because that's also something that comes up a lot in the conversations around the semantic layer or what a lot of people are calling the metrics layer in some of the recent businesses and tools that have been coming out. And I'm wondering what your thoughts are in terms of the utility of that dedicated metrics layer versus just encoding some of those business rules in these column level contracts that we're discussing here?
[00:10:08] Unknown:
Yeah. Definitely. And I think probably one theme throughout all of this kind of philosophy I've come up with is I'd like to really front load as many of these things to happen as soon as possible. So having some of those kind of foundational concepts, like if I'm trying to model a customer, I'd better know what a customer is. I'd like the earliest-stage engineer in my data pipeline to know what that definition of a customer is to drive as much consistency throughout the platform as possible. Of course, if I contrast that with the metrics layer, I think to me, probably, the rub really comes down to some types of transformations, especially those involving aggregations or that sort of thing.
Obviously, some things necessarily can't happen at the beginning of the pipeline and have to really happen in that last mile, when you know what sort of grain of data you wanna be working with and that sort of thing. But I think by and large, since I'm someone that really, really values consistency, not necessarily in the database sense, but in the single-source-of-truth, semantic sense, I think when possible, I like to really pull those things up early into the process.
[00:11:25] Unknown:
It also brings up the question of what are the boundary points where we need to be thinking about the definition and enforcement of these contracts and how as you mentioned, you'd like to be able to pull that in as early as possible, but you also want to make sure that these contracts exist at the points where you're handing it off between teams or between data producers and consumers. And I'm wondering, what are some of the useful strategies you've seen for identifying these boundary conditions, and what types of information need to be present at maybe some of these different handoffs where maybe it's just a boundary between data engineering teams versus between the data engineers and the data analysts or between the data analysts and the business users?
[00:12:07] Unknown:
I think necessarily, and almost increasingly so, there definitely are a lot of those boundaries in building up any sort of a data infrastructure. And in my mind, I think probably one of the number one threats to data quality is not having that alignment really throughout and across all those boundaries. I know even in my own career, I've been on, I think, both sides of this, both as a data user and a data producer. A lot of times users will end up looking at a piece of data and thinking that data is low quality, thinking it's wrong, just because it doesn't adhere to their specific mental model of what the data should be. I suppose another way to say that is, like, data seems wrong to users so long as it is not fit for their purpose, never minding what the actual requirements were of that data when it was built.
So I think, especially as kind of data pipelines grow more complex and there are more handoffs, kind of making sure everyone's really aligned with the same baseline of last mile goals of the user is really, really important. And something I really like about the column name contract framework that I know we'll be getting more into in a little bit is just defining in one place, with kind of a more cross functional group of stakeholders, what all those requirements are, what all these definitions are of these concepts, and being sure everyone, from the first person who touches the data all the way on down, has that shared understanding.
[00:13:39] Unknown:
To the point of using the column names as one of those contractual elements, I'm curious what you have found to be some of the benefits of that and some of the potential trade offs versus maybe establishing these contracts in some sort of documentation process or procedural elements or having some metadata layer enforcements or enforcing them using a tool such as Great Expectations or whatever your pipeline orchestrator might be?
[00:14:07] Unknown:
Yes. Definitely. And I think everything I really advocate for is definitely more to be used in tandem with a lot of these great tools you mentioned, like Great Expectations, instead of on its own. I think my main push is on better encoding the requirements for our data in the data itself, to then be used kind of throughout all the best quality tools we have today of what's now called the modern data stack. But I think for the utility of really embedding a lot of meaning in these column name contracts, I'd say there's both a philosophical and a pragmatic aspect to it. On the philosophical side, and I guess first, let me take a step back and maybe explain, before I answer that question, what the column name contracts actually are and how they work. I think it may make the second part make a little more sense. The general approach I'd like to advocate for and have currently been experimenting with a lot is breaking column names into stubs that communicate different types of information at different levels.
So, for example, we could define one layer of stubs in a controlled vocabulary that communicates how variables are encoded. Like, an ind variable could represent something that's a non-null binary indicator, and an n variable could represent always-positive integer quantities that represent counts. Then we could define another layer of this controlled vocabulary that represents more like entities. Again, if we're working in a very denormalized table where maybe we have fields that are representing multiple different entities in the same table, we can attach those really rich and precise definitions of these business concepts. Again, what's a user? What's a login?
All embedded in a single word in the column name. And you can imagine subsequent layers that have more modifiers, which at some point maybe aren't predefined, to really capture the nuance of the variable. And ultimately then we can define our column names and these really complex concepts just by permuting these sort of more atomic blocks. So going back to what's kind of the utility of this approach versus other approaches, on the philosophical side, I think that what I really like about this approach is that column names are something that can persist throughout the data pipeline, so they can pretty seamlessly serve as a latent form of communication that shares this context between stakeholders who don't often have the chance to interact with each other, either at the same point in time just because they're in different departments, or across time in that those column names persist even when the team that built the table is long gone. They can also be used by all parties in the data production process, and this is where it really connects back to something like Great Expectations, like you mentioned, where not only can these column names help end users understand what's in the data.
Something that I think is even more exciting is they can help define intent for developers and help them act on intent by programmatically transforming variables based on their names. So I think sometimes with frameworks that kind of just provide that last mile value to end users, there's definitely a big effort versus value trade off. It's hard to kind of get alignment and get enforcement. But I think column name contracts are something that we can really embed with how data is actually produced. So going back to Great Expectations, something like, if we have a variable that starts with ind, automatically all ind variables are checked with a not-null check, and so forth and so on. Now on the more pragmatic side and sort of the less grandiose philosophical view of the world, a lot of these same concepts and benefits could definitely be achieved through the standard tools you mentioned, like having good metadata, having good data discoverability tools and data dictionaries, looking at data lineage.
However, I think all of those solutions are predicated on users really reading through all those resources, really embracing the documentation, and trying to understand what's going on. But I think often we do have to be honest that users don't always do that, and probably I'd have to admit when I'm on the user side, I'm probably guilty of that as well. You come to any dataset with your preconceived notions of how it works, and those can be sort of hard to shift. But attaching all this meaning to column names tries to make the data's meaning more intuitive, just in the way that if you get a new iPhone, you're hopefully not expected to read a long user guide.
It just should be like so intuitive that it kind of pulls you along in the process.
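As a concrete, purely illustrative sketch of what such stub-built names might look like in a dbt model (the stub meanings, entity stubs, and model name below are hypothetical examples in the spirit of what Emily describes, not her exact vocabulary):

```sql
-- Hypothetical controlled-vocabulary stubs (illustrative only):
--   ind_  non-null binary indicator     n_   non-negative integer count
--   amt_  summable currency amount      dt_  calendar date
-- Column names are then just permutations of these atomic measure stubs
-- plus entity/event stubs like "login" or "account".
select
    ind_login_success,     -- did this login attempt succeed? (never null, 0/1)
    n_login_attempt_30d,   -- count of login attempts in the past 30 days
    amt_purchase_30d,      -- total purchase amount in the past 30 days
    dt_account_open        -- date the account was opened
from {{ ref('stg_customer_activity') }}  -- hypothetical staging model
```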
[00:18:57] Unknown:
And in terms of the actual contracts that you're establishing and the vocabulary that you're using, you mentioned, for instance, having IND to represent a binary indicator. And I'm wondering what the process is in being able to establish these vocabulary elements and their meanings and who should be involved in that process to be able to say, these are the terms that we're going to use. This is how they're going to be represented. This is where those definitions are going to be stored so that everybody knows when they go to a column name, these are the different components of the name. This is what each of those components means. Maybe there is some semantic meaning in terms of the ordering of those components and just how that's established or if you think that it's something that we should maybe, as an industry, work to establish a universal baseline of what these terms mean and what their sequencing is supposed to represent.
[00:19:55] Unknown:
I think to answer the last question first, while I absolutely love the idea of widespread standards, I do think, because of just the nuances of different domains, there's probably a lot of value in having, or at least trying to have, companies or organizations define their own data dictionaries that make the most sense for their organizations. I know, for example, I always tend to put kind of that more type-oriented layer first just because it makes more sense personally to me. Those are typically the types of indicators that are hinting to me things like transformations that I wanna do with the data, but that doesn't always make sense to everyone. So I think I tend to shy away from having industry wide definitions just because there is certainly a risk to this approach that if it doesn't feel right and feel good and feel like a perfect fit for your use case, people might eschew it altogether.
But for the process of actually defining a controlled vocabulary and these stubs to name different concepts, I definitely think that should be a cross functional effort between data producers and the different data consumers within an organization. I tend to think of it much in the same way that one would think about typical requirements gathering for any sort of a large data project build, data warehouse build, etcetera. But I think ideally, it can actually have some positive externalities that typical requirements gathering doesn't have, because it can bring those cross functional stakeholders into the same room and push for more precise and universal definitions of different business concepts and metrics as people debate how those concepts should be represented in a table.
So I do think data can actually be very influential. It's almost the common language of how a business tends to think about its processes. So really taking the time, as hard or painful as it might be, to bring all those people together and come to a common understanding of how the data should look can actually ultimately help influence how an organization thinks. And the second benefit I'd say of this kind of alternative process to typical requirements gathering is it is a little bit more immutable. Like, if I sit down, maybe as a product manager with my stakeholders today, we decide the specific fields that need to be in a table, and my engineers go build it.
If someone comes along later and wants more fields in the table, wants different information in that table, there's a lot more reinventing the wheel, thinking about making all of those decisions again. But I think by defining and documenting a controlled vocabulary, we can get a little bit closer to immutable and evergreen documentation, where that definition and that almost-specification can exist even as the underlying specifics of exactly where that data is coming from, which fields are included, some of the more transient aspects of data, can change, while we still kind of have that controlled vocabulary to cling on to.
[00:23:04] Unknown:
There's definitely a lot of useful information that you can encode about the data and its format and the data type for a given column and maybe some of the boundary conditions for, you know, this value needs to be within this set of ranges. But what are some of the types of information that you found are either difficult or impractical to be able to actually embed in these vocabularies that does still need to live in some sort of documentation system or external metadata catalog?
[00:23:40] Unknown:
I think one of the most important aspects really of this whole philosophy is to avoid lying to yourself. As I started to build all this out for myself, something that I continued to challenge myself on was that the moment you start violating any of your contracts and you deviate at all, in some ways, it's a house of cards where the whole thing falls down, because no one wants names that are like, oh, these uphold valid contracts 75% of the time, but you don't know which 75% of the time. So as I think about what's impractical, I really think about, just as if I were making a promise to someone in real life, what are promises that I might make in the data that I truly cannot uphold. And I think a good example of that is probably something like a uniqueness check.
So, for example, if we're building an account level table and we had a field like id_account, maybe the account field is unique, and it could be tempting to say, like, oh, all ID fields should be unique. However, you know, joining different data sources is probably one of the most common operations in all of data management or data wrangling. And we don't want a situation where, as soon as somebody joins that field to another table, they've sort of broken that uniqueness contract. There are ID accounts that have two, three, or four records genuinely associated to them.
And then you get into a weird situation where it's like, well, should someone rename that variable? You don't always wanna have to rename a variable every time you join a table just because its function changed, etcetera. So I think properties that can but don't always have to be true of a variable in different circumstances are definitely something that can't be captured at the name level and kind of do have to be handled in more of the traditional data dictionary type approach.
[00:25:39] Unknown:
And so in terms of the actual work of ensuring that your data conforms to the contracts that you're specifying in the column and table names, I know that you built the dbtplyr package for the dbt ecosystem. But before we dig too much into that, I'm wondering if you can talk to some of the work that's involved in actually taking your source raw data that doesn't necessarily fit the specifications that you want in your more managed and transformed data and actually taking it from that source, manipulating it to ensure that it conforms to the contracts that you're specifying, and loading it into the destination tables, and then maintaining that overall process and maintaining the integrity of those contracts as you continue to evolve the schema, incorporate new data, or as there are anomalies in the source systems that you're pulling from?
[00:26:34] Unknown:
Definitely. And I think that question really highlights a few key challenges that certainly inspired the dbtplyr tool that you alluded to, which we'll touch on in a little bit. But itemizing a few of those key challenges: we are pulling data from many different source systems that won't follow these conventions. So something I like to do somewhat early on in the process is get the names of the different fields more like what their target names will look like. Because as I mentioned, some of these stubs can sort of serve like a config file, almost, where they can tell the data producer how certain fields need to be wrangled to get them to their destination state. So for example, if I have a field that should be a count value, it now starts with an n, so I know eventually that field needs to get transformed to an integer.
The second challenge that you mentioned is, you know, making sure that if data comes in corrupted, how does that get cleaned up throughout the process if there are anomalies in the data? And I think that's where in-pipeline testing can be very powerful, because this approach gives us another layer of abstraction where we're no longer saying, well, I wanna check this column for this property. Instead, we can expand some of those data checks to say, well, for anything that starts with an n, which I know is a count variable, I can go ahead and run a data check in my pipeline to make sure that it has no fractional component. Whether it's an integer today, whether it's a float today, no matter what form it is today, I can go ahead and run that check for any variable that starts with an n.
And then once I have that comfort that all these variables seem like integers, I can go ahead and have a transformation step in my data pipeline where I actually cast all of those variables to integers. And the other sort of power of this type of approach is that, since I'm no longer itemizing those column names, we can make ourselves a little more resilient to changes in schemas like you mentioned, where now if I go in and, for some reason, remove a couple of columns that start with n, or I add multiple columns that start with n, it doesn't actually really matter or change anything, because ideally the code in my pipeline is just saying, I'm going to do this for any variables that have this certain structure, which makes it a little bit more robust to changes.
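A minimal sketch of that prefix-driven, in-pipeline check as a dbt singular test, written with only core dbt Jinja (adapter.get_columns_in_relation is a standard dbt adapter method); the model and file names are hypothetical, and dbtplyr's helpers would make the column selection more concise:

```sql
-- tests/assert_n_columns_are_whole_numbers.sql (hypothetical singular test)
-- The test fails if any column whose name starts with n_ carries a
-- fractional part, however many n_ columns the upstream model has today.
{% set cols = adapter.get_columns_in_relation(ref('stg_customer_activity')) %}

select *
from {{ ref('stg_customer_activity') }}
where 1 = 0
{% for col in cols if col.name.startswith('n_') %}
   or {{ col.name }} <> floor({{ col.name }})
{% endfor %}
```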
[00:29:09] Unknown:
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it's often too late and the damage is done. DataFold's proactive approach to data quality helps data teams gain visibility and confidence in the quality of their analytical data through data profiling, column level lineage, and intelligent anomaly detection. DataFold also helps automate regression testing of ETL code with its data diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values.
DataFold integrates with all major data warehouses as well as frameworks such as Airflow and dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with DataFold. In terms of the actual enforcement of these contracts once they're defined, I'm curious what the process is for actually going from the conceptual element of these are the components that we want to ensure, this is the vocabulary that we've established to be able to signify these semantic meanings, and then taking that prose and conceptual idea of what the vocabulary represents and actually enforcing that in a programmatic fashion, and then maybe even being able to share some of those programmatic elements between different tools, where maybe you have a definition in your Great Expectations assertions that says this element of this table name means this certain semantic idea. So taking, for example, the counts to say this can only represent integer values, and then also being able to identify and enforce that in dbtplyr or in the downstream data analysis or data science tooling to say, I'm selecting from this column, so I know that I need to expect this type of data within these ranges, or maybe this is a string value, but it's an enum type, so I only expect these three values to be present in this column.
[00:31:22] Unknown:
The question of interoperability across tools is very interesting, and honestly, it's not something that I've personally delved into a lot yet. I've implemented kind of helper tools to work like this in both dbt and a small experimental R package that's doing similar stuff. I think to really take this approach to the next level, there would have to be a little bit more of a universal spec or configuration file or something, a slightly more standardized way to define these expectations, that then could automatically generate the dbt code that you need, the Great Expectations code you need.
Right now, in my experience, honestly, I've mostly done that kind of step manually of automating within each tool, how these transformations get applied, but have not honestly done a lot of work yet with interoperability.
[00:32:22] Unknown:
Yeah. I can definitely see an opportunity for the metadata layer being the common glue to be able to maintain those specifications and tie it together, maybe using something like the open metadata specification and reference implementation where you say, this is the JSON schema document that says these are the valid enum types for this vocabulary element. So if you see a column with this attribute in the name, then you can validate that these are the only pieces of information that you can consume or inject from that location. And so digging more into the dbtplyr project itself, I'm wondering if you can talk about some of the work that you've done there and some of the story behind how you decided to actually implement it and some of the ways that it's being applied in your own work.
[00:33:12] Unknown:
Yeah. Definitely. So far, we've talked mostly about how column name contracts can be that interface for end users, how they can help them, you know, understand the data better, use the data kind of more correctly for their purposes. And we just started to dip our toe into how these same contracts can guide producers into knowing what sorts of transformations need to get done. dbtplyr makes it possible to not only think about these contracts as configs, but to incorporate them in the code itself within a dbt project. And I think the real motivation for this was just my thinking that nominally having these contracts is almost absolutely the worst case scenario if you cannot enforce them and you risk breaking them for users, because the one risk of this approach is leading to even more complacency among data users, of, like, oh, if I know this variable is supposed to not have nulls, maybe I stop doing the not-null check, which is both a great time savings hack if it's true and a kind of scary result if it's wrong.
So with dbtplyr, my goal was to think about how we can make as many of the kind of transformations and wranglings that you need for column name contracts as automated as possible. And both the basic idea and the API for this package were really, really heavily inspired by the R package dplyr. They don't talk explicitly in that tool about column name contracts, but there are a lot of really nice semantics in the language, in both dplyr and a lot of its sort of companion packages that have borrowed its semantics, for operating on column names.
And there were a lot of things that I found, when I kind of switch between languages and move back and forth, that I really, really missed when I was working in SQL. So some of the features that it offers: it provides a number of macros to help select a list of column names based on properties of the name itself. So in a dbt macro, you can extract all of the column names from the relation you're working with and subset them with macros called things like starts_with, ends_with, contains. So I can easily make a short list of all of the columns in the table I'm currently working on that, say, start with ind. Next, it provides a number of iterators that help us apply transformations, in either a select clause or different filters in a where clause, to a list of variables at once instead of individual variables.
So in that dbt model, we can truly write code that's just like, select the sum of each of the variables whose name starts with ind, instead of having to manually list those out one by one by one. So, overall, I think the concept behind dbtplyr is pretty simple. A lot of the code on the back end is just Jinja code calling regex functions in Python to subset these lists of variable names. But the way it becomes very powerful is when we think about the interaction with dbt's built-in macro and testing functionality, because you can programmatically apply transformations and validate your results.
So, for example, going back to the example that I've used a couple of times before about wanting our n variables to be integers, but wanting to make sure we aren't casting non-integers to integers and just covering up our mistakes. With dbt's pipeline, we can write a test that runs before the transformation is executed, and we can double check that all variables that start with n in a certain table have no fractional component. And we can write that in a robust way that doesn't care if we have one variable that starts with n or ten variables that start with n, or if the number of variables that start with n changes. So that can be done with just two or three simple lines of code. And then in our next model, after we've assured that's true, we can write a dbt model that casts all the n variables to that integer data type.
And doing this sort of thing programmatically can help us avoid typos and ensure that if our model changes or variables are added to our pipeline, it still works. And, of course, I take that as an example just because it's relatively simple, but you can think about doing far more complicated transformations in that same sort of programmatic way, while still achieving kind of the standardization and the consistency that we're striving for. And furthermore, you can go as far as to have tests, not just for the data values, but for the data structure and the data schema themselves.
You can write tests using dbt and dbtplyr to make sure that only valid stubs are included in any names in your dataset, just by querying the table's information schema, splitting out the stubs into different components, and checking those against your controlled vocabulary. Or, again with the information schema, you could write tests that certain stubs always correspond to certain data types. So, again, it's the sort of funny thing, because I think the tool has really rather simplistic functionality at the surface level, but definitely some really interesting interactions with dbt that build a lot of safeguards and a lot of automation into this process.
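A rough sketch of the companion casting model described above, again in plain dbt Jinja so nothing here claims dbtplyr's exact macro signatures; dbtplyr's starts_with and across-style helpers express the same pattern more tersely, and the model names are hypothetical:

```sql
-- models/int_customer_activity_typed.sql (hypothetical model)
-- Once the whole-number test has passed, cast every n_ column to an integer
-- and pass all other columns through unchanged.
{% set cols = adapter.get_columns_in_relation(ref('stg_customer_activity')) %}
{% set exprs = [] %}
{% for col in cols %}
    {% if col.name.startswith('n_') %}
        {% do exprs.append('cast(' ~ col.name ~ ' as integer) as ' ~ col.name) %}
    {% else %}
        {% do exprs.append(col.name) %}
    {% endif %}
{% endfor %}

select
    {{ exprs | join(', ') }}
from {{ ref('stg_customer_activity') }}
```

The same pattern extends to the schema-level tests mentioned above: list a model's columns, split each name into its stubs, and fail if any stub is missing from the controlled vocabulary or is paired with the wrong data type.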
[00:39:01] Unknown:
And so going back to what we were discussing before about the limitations of the vocabularies and being able to encapsulate some of the contextual and semantic elements and when you need to actually fall back to documentation or additional information in the metadata layer. What have you found to be some of the ways to be able to link the column names and the vocabulary elements to the necessary information for when you do need to be able to dig into what the more nuanced meaning is for a given column or table or piece of information?
[00:39:41] Unknown:
Honestly, right now, I think I do have a bit of a simplistic system for that. Another nice piece of functionality I really like about dbt is its ability to automatically generate really nice documentation websites. You can embed any sort of custom information you want in those websites with just basic markdown syntax. So very easily, by putting something at one of the higher levels of your dbt project, you can have almost one of the main pages of your documentation site feature more of the controlled vocabulary, the universal source-of-truth definitions of these foundational concepts, but still have a page that you can kind of click into more specifically for every single dataset and read those more traditional definitions of each variable.
And I think then there's stylistic choices to be made there too, of, like, when you're defining kind of the metadata for each and every individual variable, do you define it in full? Do you only define things that go above and beyond what is already implied by the controlled vocabulary? Just depending on the level of verbosity you wanna include. That's obviously, I think, not the perfect approach because it still sort of leaves questions in people's minds. Like, you don't know which columns have that extra bit of kind of specialized information. You don't know exactly where there's more to the story, but that's at least a first pass at getting all the information kind of colocated in one single place.
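For reference, dbt's docs blocks are one concrete way to host that vocabulary page alongside per-model documentation; the block name and wording below are hypothetical, while the markdown-with-Jinja docs block format itself is standard dbt functionality:

```jinja
{# docs/controlled_vocabulary.md (hypothetical file) #}
{% docs controlled_vocabulary %}
## Controlled vocabulary

| Stub | Meaning          | Contract               |
|------|------------------|------------------------|
| ind_ | Binary indicator | Never null; values 0/1 |
| n_   | Count            | Non-negative integer   |

Entity stubs such as `login` and `account` carry the single-source-of-truth
business definitions agreed on by producers and consumers.
{% enddocs %}
```

Referencing the block with {{ doc('controlled_vocabulary') }} in a description, or overriding the generated site's landing page with a {% docs __overview__ %} block, keeps the vocabulary one click away from every model page.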
[00:41:15] Unknown:
In terms of the challenges that you've seen in being able to agree on the necessary semantic elements to encapsulate the majority of concerns that you need to encode in your schemas, what are some of the sort of social and organizational complexities that arise from that?
[00:41:39] Unknown:
And I love how you asked that question too, because I think, like so many challenges in data and technology, the hardest challenges really are at the intersection of technology and organizational culture. As with many things, I think the challenge is there's a lot more work probably upfront getting that initial alignment and making the initial investment: it's kind of worth getting a lot of stakeholders in the room, hashing out these definitions, and incentivizing, slash really pushing towards, all-or-nothing adherence to those definitions.
I think definitely just painting as vivid a picture as possible of the value that comes out the other end, how this can really help all stakeholders in the project. I think it's really important to help people understand that it's worth different business units getting involved in the discussion because this can actually help their users work more efficiently, understand the data better, etcetera. But at the same time, helping developers understand that it's not just syntactic or semantic sugar for the end user, but it's also a tool that can really help them, can start to get a little bit more of that alignment.
However, I think something I've learned about in the many different ways that I've tried to push standardization in my own work and on the teams I've been on is that no one will ever want to play the role of that enforcer. And at the scale you're trying to get unification, there's just no way, you know. I don't think this is even a solution that you can just purely code review your way out of. So I think as much as you can overcome those challenges, it's mostly through automation: having kind of this automated testing of the fields, of the schemas themselves, is probably the only way you can truly achieve the consistency and adherence, to have, I guess, both the carrot and the stick approach.
[00:43:49] Unknown:
And then one of the other interesting things worth digging into is some of the stylistic aspects of defining these vocabulary terms, where in some of the posts that I've seen, you write about it using very succinct abbreviations. So for instance, you mentioned having an n prefix to indicate a numeric or count variable or ind for a boolean indicator. I'm curious what your thoughts are on the utility of keeping those vocabulary elements concise and abbreviated versus spelling them out and having them be more verbose, at the risk of having the overall column names be on the longer side?
[00:44:32] Unknown:
That's definitely a tricky question, and, again, there's a part of me that always craves the level of precision that requires a certain amount of verbosity. I'm afraid that's probably eminently clear from the way I've been answering some of these questions. But I think at the same time, thinking of the perspective of the data user, especially when you stop iterating over columns and saying just, oh yes, sum all the n columns, you remember that at the end of the day, someone's probably going to have to type the names out in full.
I think there is definitely a fine line to walk. I think already this approach tends to produce longer column names than probably exist in the wild per se. So knowing that we're already including more components in the name, I think I tend to at least keep each of those components a little bit on the shorter side, just recognizing that probably no matter how long I made them, the stubs themselves might not be completely self evident without someone reading through the controlled vocabulary to understand both the semantics and the contracts that come along with them.
[00:45:47] Unknown:
And then one other aspect of this is the management of the evolution of the vocabulary, where you'll go through an initial phase of establishing these are the core elements of the vocabulary because they contain the concepts that we need to encode for the majority of our data. But now as our data usage expands or as our sophistication grows, we need to be able to have more concepts encoded, and just making sure that you are able to maintain the evolution, not just of the schema that the data resides in, but also the vocabulary elements that are required to be able to convey the necessary context about that evolved schema.
[00:46:28] Unknown:
I think that that is probably one of the biggest challenges of this approach, in the sense that it can be a little bit all or nothing. And it's one of those things where if you start to implement when you're too immature in your data life cycle, you probably won't pick out the best concept definitions. If you start to implement when you're too mature, there's a lot of baggage and backtracking required. But I think probably one of the ways to keep things as flexible as possible, there are some really good analogies to how we think about something like a data vault, which can also keep your data schemas as flexible as possible.
And the analogy I draw there is, I think in data modeling, when we wanna keep things flexible, we tend to break things up into smaller, more unitary, modular components that can then be, you know, pieced back together when they need to be pieced back together. And I think similarly with column names, it's easier to keep that consistency over time while keeping things flexible if you don't overload any one stub with too many different meanings. Because, again, if you think about pretty insular, modular concepts, going back to that ind variable, a binary, non-null variable, maybe if we need a binary variable with nulls, we can add a bin variable.
But by not adding too many contracts or too many conditions to any one stub, I think we can achieve a lot of that flexibility by how we permute and combine those concepts, without too much creep or explosion in the number of concepts we define, or, worst case, having to change them later.
[00:48:16] Unknown:
Yeah. And going back to the point about maintaining some level of terseness in the vocabulary elements, I think that if you start to get verbose in the definitions, then you run the risk of having concepts that span multiple words, which makes it very difficult to understand what are the actual tokens that I need to split this column name on to understand all of the elements. Where if I have a two-word vocabulary element, I either need to smush it all together or I need to have an underscore and say, I don't split this underscore, I just split all the other underscores. And so there's also just being able to think through some of the kind of tokenization and programmatic language design elements that go into it as well.
[00:48:56] Unknown:
Yeah. No. Definitely. A lot of choices to be made, and I think it also really hearkens back to your point that not every bit of context can probably be captured in the names. Like, when you have to optimize, I think it's probably always better to optimize for communicating things that are true and are helpful, even if they aren't necessarily the full, full, full story.
[00:49:20] Unknown:
Yeah. Maybe you need to have a vocabulary element that says, this name doesn't tell you everything, and you need to actually look up the documentation.
[00:49:30] Unknown:
Possibly.
[00:49:31] Unknown:
And so in terms of your experience of working through the evolution of this philosophy and codifying it in dbtplyr and using it in your own work and seeing other people start to adopt it, what are some of the most interesting or innovative or unexpected ways that you've seen either these controlled vocabularies and/or the dbtplyr package used?
[00:49:56] Unknown:
I think one of the most interesting use cases in my mind, just because when I initially was working on this idea, I was definitely working on it very specifically and very narrowly, more in the data warehouse domain, is the way that I've seen people use it, and people have shared with me the benefits, in more complex pipelines where they're moving data in and out of more or less structured formats, numerous times or for different use cases. I think, you know, obviously, people that work primarily in databases are fairly lucky, because even just the basic kind of type information about different variables is in fact stored in the database.
But the more that you're working with unstructured data, or even putting your data in and out of flat files where you don't have a concept of that schema, I think column names can be an even more potent way to still capture and save some of the correct expectations of how should this variable work, what kind of variable is it. Those can kind of be resurrected once your data is moved to a system that has and expects those features again.
[00:51:04] Unknown:
In your work of using dbtplyr and controlled vocabularies, what have been some of the interesting or unexpected or challenging lessons that you've learned in the process?
[00:51:12] Unknown:
I think the biggest challenges really are around how you absolutely make sure you're driving consistency with this approach, and also that you're finding the right level of abstraction. And I think those two things are pretty intimately tied, because if you define that really, really good vocabulary that works perfectly for your organization, fits like a glove, it'll probably be fairly easy to get it adopted. But as we know from all of the gnarly reality and messiness of real world data, if your vocabulary isn't a good fit for the org, you probably won't get a lot of adoption.
Things will start to drift apart. So I think both kind of approaching it humbly, not being too overly grand in the vision, and just figuring out really no-regrets abstractions you can use. And like we were saying before, picking things that are generally in the right direction, communicating useful information even if they don't tell us everything, can definitely be helpful, so as to not get too terribly ambitious. And then I do think, secondly, in terms of driving that consistency, it has really been a challenge just to make sure that even data producers really understand all the ins and outs of the vocabulary and implement it correctly.
But I think the more you can work with a shared code base, shared dbt macros, and have really robust testing of your outputs, that is definitely a major advantage. Because this system professes having a correct answer, it's at least an obvious and somewhat easy thing to check whether or not it is going right. So I think just leveraging those advantages can be very useful. And if a lot of those tests are failing, if there is a lot of data that isn't becoming compliant, then honestly just talking with different teams and data producers to understand why that process may not be working for them is definitely also a critical aspect for success.
[00:53:19] Unknown:
And for people who are looking for a way to maintain some of the consistency in their data approach and be able to share some of the semantic information, what are the cases where controlled vocabularies or the dbtplyr project are the wrong choice, and they might be better served with maybe some library code in their data pipeline or orchestration layers, or relying on their metadata system and data catalog to be able to share that information, or leaning on these newer metrics layers as the source of truth for these business concerns?
[00:53:57] Unknown:
I think it probably depends a lot on what stage in your data life cycle you're at. If you're too immature, say if you're a super fast growing startup and you barely even know what your product is going to look like a month from now, it's possibly too soon to optimize, too soon to implement an approach like this, just because, as we talked about, with fast changing schemas and definitions, the requirements that may make sense could change fairly dramatically, especially when we think about things like what is an entity.
On the flip side, I think if you're in a really, really mature org with a ton of data produced under different processes, there's probably still a place for this approach if you're starting a new project, building a new warehouse or schema, but there you definitely have to be a lot more careful and explicit with your users about which tables this approach applies to and which it doesn't apply to. So that's all to say, there are probably many points on the maturity curve where, if you're changing things too quickly or you're unable to change things even very slowly, it's probably the wrong choice, because it's not the world's most flexible solution to a problem.
But if you're kind of reaching a point where maybe you've had some tables, maybe you have some data infrastructure, but you're at a point as a company where you want to take your data infrastructure to the next level, I think that can probably be more of the sweet spot where it's a relevant tool, where you have a good grasp on your domain, but you aren't really too locked into anything or too resistant to change yet.
[00:55:41] Unknown:
As you continue to work with this data modeling approach and evolve the dbtplyr package and start to popularize the utility of controlled vocabularies, both within your own organization and in the community, what are some of the things you have planned for the near to medium term, or that you would like to see other people in the ecosystem contribute?
[00:56:03] Unknown:
I think the biggest thing in my mind there is that the kind of philosophy of this approach far outstrips what dbtplyr is currently capable of doing. Right now, I think it fairly effectively handles the middle step of transforming and testing data based on a controlled vocabulary, but not really the first through the last step. So I think the things that really could improve this package, that I'd love to work towards in the future, are providing a clear specification or definition of that controlled vocabulary. Probably in dbt, that would be in a YAML file, since that would be very consistent with its overall style.
And I think I would love to see that sort of specification or definition both find ways to more proactively guide these transformation and testing steps, instead of users using dbtplyr to write those out manually, and also, to our conversation about documentation, figure out how to bring that a little bit more holistically and seamlessly, in an automated fashion, into the dbt docs.
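Purely as an illustration of what such a specification might eventually look like (nothing like this ships with dbtplyr today, and the stub names, keys, and check labels below are hypothetical), the vocabulary could be declared once as data, for example in a macro returning a mapping or, as Emily suggests, a YAML file, and then used to generate the tests and casts automatically:

```sql
-- macros/controlled_vocabulary.sql (hypothetical sketch of a vocabulary spec)
{% macro controlled_vocabulary() %}
    {{ return({
        'ind': {'data_type': 'boolean', 'checks': ['not_null', 'values_0_or_1']},
        'n':   {'data_type': 'integer', 'checks': ['not_null', 'non_negative']},
        'amt': {'data_type': 'numeric', 'checks': []},
        'dt':  {'data_type': 'date',    'checks': []}
    }) }}
{% endmacro %}
```

A generic test could then loop over a model's columns, look up each name's first stub in this mapping, and apply the listed checks, so adding a new column would never require writing a new test by hand.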
[00:57:12] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:57:27] Unknown:
Definitely. I think another area that I'm very passionate about and that I'm definitely watching closely right now is the many different data quality tools that exist today. I think the past couple of years have seen an explosion of really powerful tools for monitoring data quality. The thing that I feel like there's a substantial gap in is what those tools are actually testing for. I think there's a lot of talk right now about thinking about data quality like unit tests, but to me that feels like a slight mismatch, from the standpoint that data isn't always necessarily a single unit. I think a lot of the data quality tools that exist today have a lot of really powerful row-wise or record-wise checks.
But what I tend to find when I think about the data quality issues that I've encountered in my work is that a lot of it is more driven by internal consistency of data and how that data relates to itself and relates to the domain that it's in. So, for example, I think a lot of businesses work with panel data where you're tracking a similar set of users over time, or you at least have some interrelationship between rows. I find a lot of data quality tools are somewhat immature in terms of being able to specify more domain driven and custom data quality checks that help you test the relationship between rows instead of just singular rows at a time. So I think that's definitely one space where I'm really interested to see where it goes and would really be excited to see more tools that allow you to embed kind of that domain specific context in the workflow.
[00:59:11] Unknown:
Alright. Well, thank you very much for taking the time today to join me and share the work that you've been doing on dbtplyr and the controlled vocabulary specifications that you have been experimenting with and helping to popularize. It's definitely a very interesting and useful approach to establishing and maintaining these data contracts, which I think is something that is not as widespread as it can and should be. So I appreciate the time and effort you've put into that, and I hope you enjoy the rest of your day. Thank you. You too. Thank you for listening! Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Sponsor Messages
Interview with Emily Riederer: Defining Column Contracts
Column Naming Anti-patterns
Types of Data Contracts
Business Definitions and Metrics Layer
Boundary Conditions and Handoffs
Column Name Contracts and Their Utility
Challenges in Enforcing Data Contracts
dbtplyr Project Overview
Linking Column Names to Documentation
Innovative Uses of Controlled Vocabularies
Future Plans for dbtplyr and Controlled Vocabularies
Biggest Gap in Data Management Tooling
Closing Remarks