Summary
In recent years the traditional approach to building data warehouses has shifted from transforming records before loading, to transforming them afterwards. As a result, the tooling for those transformations needs to be reimagined. The data build tool (dbt) is designed to bring battle-tested engineering practices to your analytics pipelines. By providing an opinionated set of best practices, it simplifies collaboration and boosts confidence in your data teams. In this episode Drew Banin, creator of dbt, explains how it got started, how it is designed, and how you can start using it today to create reliable and well-tested reports in your favorite data warehouse.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- Understanding how your customers are using your product is critical for businesses of any size. To make it easier for startups to focus on delivering useful features Segment offers a flexible and reliable data infrastructure for your customer analytics and custom events. You only need to maintain one integration to instrument your code and get a future-proof way to send data to over 250 services with the flip of a switch. Not only does it free up your engineers’ time, it lets your business users decide what data they want where. Go to dataengineeringpodcast.com/segmentio today to sign up for their startup plan and get $25,000 in Segment credits and $1 million in free software from marketing and analytics companies like AWS, Google, and Intercom. On top of that you’ll get access to Analytics Academy for the educational resources you need to become an expert in data analytics for measuring product-market fit.
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
- Your host is Tobias Macey and today I’m interviewing Drew Banin about DBT, the Data Build Tool, a toolkit for building analytics the way that developers build applications
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by explaining what DBT is and your motivation for creating it?
- Where does it fit in the overall landscape of data tools and the lifecycle of data in an analytics pipeline?
- Can you talk through the workflow for someone using DBT?
- One of the useful features of DBT for stability of analytics is the ability to write and execute tests. Can you explain how those are implemented?
- The packaging capabilities are beneficial for enabling collaboration. Can you talk through how the packaging system is implemented?
- Are these packages driven by Fishtown Analytics or the dbt community?
- What are the limitations of modeling everything as a SELECT statement?
- Making SQL code reusable is notoriously difficult. How does the Jinja templating of DBT address this issue and what are the shortcomings?
- What are your thoughts on higher level approaches to SQL that compile down to the specific statements?
- Can you explain how DBT is implemented and how the design has evolved since you first began working on it?
- What are some of the features of DBT that are often overlooked which you find particularly useful?
- What are some of the most interesting/unexpected/innovative ways that you have seen DBT used?
- What are the additional features that the commercial version of DBT provides?
- What are some of the most useful or challenging lessons that you have learned in the process of building and maintaining DBT?
- When is it the wrong choice?
- What do you have planned for the future of DBT?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- DBT
- Fishtown Analytics
- 8Tracks Internet Radio
- Redshift
- Magento
- Stitch Data
- Fivetran
- Airflow
- Business Intelligence
- Jinja template language
- BigQuery
- Snowflake
- Version Control
- Git
- Continuous Integration
- Test Driven Development
- Snowplow Analytics
- dbt-utils
- We Can Do Better Than SQL blog post from EdgeDB
- EdgeDB
- Looker LookML
- Presto DB
- Spark SQL
- Hive
- Azure SQL Data Warehouse
- Data Warehouse
- Data Lake
- Data Council Conference
- Slowly Changing Dimensions
- dbt Archival
- Mode Analytics
- Periscope BI
- dbt docs
- dbt repository
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the project you hear about on the show, you'll need somewhere to deploy it. So check out our friends at Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform. And if you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to data engineering podcast.com/linode, that's l I n o d e, today to get a $20 credit and launch a new server in under a minute.
And understanding how your customers are using your product is critical for businesses of any size. To make it easier for startups to focus on delivering useful features, Segment offers a flexible and reliable data infrastructure for your customer analytics and custom events. You only need to maintain one integration to instrument your code and get a future-proof way to send data to over 250 services with the flip of a switch. Not only does it free up your engineers' time, it lets your business users decide what data they want where. Go to dataengineeringpodcast.com/segmentio today to sign up for their startup plan and get $25,000 in Segment credits and $1,000,000 in free software from marketing and analytics companies like AWS, Google, and Intercom.
On top of that, you'll get access to the Analytics Academy for the educational resources you need to become an expert in data analytics for measuring product market fit. And you listen to the show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Dataversity, and the Open Data Science Conference. Go to dataengineeringpodcast.com/conferences to learn more and to take advantage of our partner discounts when you register.
And go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. And please help other people find the show by leaving a review on iTunes and telling your friends and coworkers.
[00:02:27] Unknown:
Your host is Tobias Macey. And today, I'm interviewing Drew Banin about dbt, the data build tool, a toolkit for building analytics the way that developers build applications. So, Drew, could you start by introducing yourself? Sure. Thanks, Tobias.
[00:02:39] Unknown:
I'm Drew Banin. I'm one of the cofounders at Fishtown Analytics and a maintainer of dbt, the data build tool. And do you remember how you first got involved in the area of data management? Yeah. So in college, I did an internship with an Internet radio startup based in San Francisco called 8tracks. And I started there as a software engineer, but I worked on some internal analytics functions at the company. And so these were things like building an in-house event tracking system and building out some internal reporting. And so as I kind of built these things out, I understood that I needed to transform these events in Redshift, the data warehouse we were using. And I thought a lot about how to expose these tables for analysis in our internal analytics tool. And so that's kind of where I got my start. And I eventually came back to Philadelphia to finish up school, and I started working at RJ Metrics, which was a business intelligence company in Philadelphia.
So that's where I met my two cofounders, Tristan and Connor. While we were at RJ Metrics, the company went through this transition in which the core product, its cloud BI tool, was sold to Magento, and out of that spun another company called Stitch Data. So while we saw this Stitch Data thing happening (it was called RJ Metrics Pipeline at the time), we understood that the advent of databases like Redshift really changed the game for analytics. And so we went off to figure out how to put Redshift to work with all these big datasets that you could pipe in pretty easily. And
[00:04:07] Unknown:
so out of all of this, you ended up starting the dbt project. So can you give a bit of an explanation about what it is and your motivation for creating it? Sure. So in the earliest days, I wrote some of the first lines of code for dbt.
[00:04:21] Unknown:
The big problem was that we wanted to build these logical abstractions in Redshift. And so we wanted to create views that calculated, say, orders. The problem with Redshift was that you couldn't drop a view and recreate it atomically without cascading the drop to all the downstream views. And so we started building this tool that would help you create views and understand the dependencies between different views such that you could rebuild this whole, like, pipeline of views on Redshift from the command line. And so, eventually, that evolved, and our thinking about exactly what this thing was evolved. And we realized that what we were doing was creating a software engineering workflow for analytics.
And so we knew we wanted to version control all of this code. We wanted to test it. We thought a lot about dev and prod environments, for example. And so all of these things together form dbt. And so
[00:05:13] Unknown:
the primary use case seems to be oriented around data warehouses and doing these in-database transformations and building these views. So I'm wondering if you can just talk a bit about where it fits in the overall landscape of data tools and the overall life cycle of the data in an analytics pipeline?
[00:05:38] Unknown:
And so, there are off-the-shelf tools like Stitch or Fivetran, or you might roll your own ETL with Airflow. Once that data is there, you want to transform it. And so I mentioned views. That's how dbt worked in the early days. But since then, we've built out dbt's capabilities to build tables and what we call incremental models, which are, like, incrementally refreshed tables, for instance. And so all of these things happen inside of your warehouse. Dbt will connect to Redshift or Snowflake or BigQuery, etcetera, and run sort of create table as or create view as or insert statements to transform this data based on the select statement that you've written to codify a data model. Dbt fundamentally works by materializing these data models in your warehouse as tables and views. So once you've created these things, there's a whole suite of things you can do with them around testing and documenting, etcetera. But probably the thing you wanna do is connect your BI tool or your Jupyter notebooks to them, so you can query a clean dataset that you've prepared with dbt.
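As a rough sketch of what that looks like in a project (the model, table, and column names here are hypothetical), a dbt model is just a SELECT plus a config block, and dbt generates the surrounding CREATE TABLE AS / CREATE VIEW AS / INSERT for the chosen materialization:

```sql
-- models/orders_rollup.sql (hypothetical): dbt wraps this SELECT in the DDL
-- appropriate to the configured materialization on the target warehouse.
{{ config(materialized='incremental', unique_key='order_id') }}

select
    order_id,
    customer_id,
    order_total,
    ordered_at
from raw_shop.orders

{% if is_incremental() %}
  -- on subsequent runs, only process rows that arrived since the last build
  where ordered_at > (select max(ordered_at) from {{ this }})
{% endif %}
```

On the first run this would be built with something like a create table as statement; on later runs dbt only inserts (or merges) the new rows.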
[00:06:36] Unknown:
And one of the questions, particularly out of the idea of cascading the delete and recreate of views in Redshift, but also for being able to create these different tables and views: one is just curious about some of the performance implications of having to do these cascading operations and then rebuild the entire view structure. And then also, in terms of creating tables and potentially manipulating and transforming the data in the warehouse, just being able to have confidence that you're not accidentally removing data or mutating it irreversibly
[00:07:16] Unknown:
and just the overall safety of the operations that you're building, and any way to gain enough confidence to not have a panic attack every time you run it by accident. Sure. Yeah. It's our belief that all things analytics should be, you know, uneventful. So we think hard about how to make this not a scary proposition. So let me tell you a little bit about how dbt fundamentally works. Dbt is a compiler and a runner. When we talk about compilation, there is a Jinja context (Jinja, the templating language) present when you want to execute a model. And so these models are mostly select statements that do some sort of transformation.
But you can also use a Jinja function that we've added to the context called ref, for reference. And so, by referencing other models, you can build up this dbt DAG of all of your transformations. And so on the topic of performance, or rebuilding specific models, you can do really powerful model selection to say I wanna rebuild my orders table because the logic changed or because the data was updated, as well as any model downstream of it. And there's some pretty clever model selectors we provide around tagging and source datasets and parents, children, things like that that help here. On the topic of performance in general, dbt helps you leverage database-specific performance tuning functionality. On Redshift, this is sort keys and dist keys. On BigQuery, it looks like partitioning and clustering. On Snowflake, it mostly just works the way you would hope. So there's support for clustering coming, but we find that you don't need to use it except for really very large tables. So you can make use of these performance tuning levers in order to make your tables well tuned for queries.
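Concretely, that ref() call and the tuning config live right in the model file. A minimal sketch, with made-up model and column names:

```sql
-- models/sessions.sql (hypothetical): ref() wires this model into the DAG,
-- and the config block carries Redshift-specific sort/dist hints.
{{ config(
    materialized='table',
    sort='session_start',
    dist='user_id'
) }}

select
    user_id,
    session_id,
    min(event_at) as session_start,
    count(*)      as event_count
from {{ ref('events') }}
group by 1, 2
```

Model selection then looks roughly like `dbt run --models orders+` to rebuild orders plus everything downstream of it (the selector syntax has shifted a bit across dbt versions).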
So if you have a really big, say, sessions table that you've rolled up from events with dbt, and you want to make sure that your BI tool is snappy when you query sessions, you can throw a sort key on the session start timestamp, for instance, with a config change in dbt. The other question that you asked was about how you make this uneventful and how you get confidence that you haven't broken anything. Right? So in dbt, there's sort of a notion of different environments. And so every analyst that uses dbt has their own scratch schema. By convention, we do dbt_ and then your first or last name. So for me, I'm dbt_drew. The production schema is usually called analytics or prod or something like that. And so what you can do is run dbt in your local development environment, and it will transform your source data, which is never mutated in this process.
But it will transform it and place it into your desired destination schema. So for me, I'm transforming from, say, you know, public.orders, or wherever your source data lives (hopefully, you know, shopify.orders or something like that), and placing it into dbt_drew.orders. And in that process of transformation, I'm only affecting data that I specifically want to mutate, which is my sandbox dataset. I'm not touching any production analytics schema data that would, say, be queried by your BI tool or external queries. And so what's great about dbt is it works really very well with version control tools like Git. And so you can version control this whole repository. You can make your changes locally
[00:10:34] Unknown:
and run them against your sandbox schema. And then only when you're ready to deploy this thing to production do you make a pull request, have your colleague review your code, merge to master, and then run against the production schema. And so one option there is to use a common paradigm in software teams of only letting the continuous integration system actually execute against production, so that you never have a human potentially introducing their own error or using the wrong command line flag as they're executing the run, and you can be sure that everybody has reviewed the code before it actually gets run in production. Yeah. That's totally right.
[00:11:15] Unknown:
Dbt in particular makes this continuous integration question really tractable. So what you can do is, because dbt has such a first-class knowledge of environments, you can actually make a CI environment in which your whole dbt project is built from scratch. So every dbt model, because it's a select statement, is idempotent. You can run them all and they create your desired destination tables and views. But what that means is, when you're running it locally in development, it's, say, dbt_drew. But in your CI environment, maybe you open a pull request. You can automatically run your full dbt DAG into a scratch schema called pr135 or whatever. Then you can run your tests against that schema. And then if you really want, you could plug your BI tool in for some key reports, swap out analytics for pr135, make sure the reports all build, and get user acceptance that the PR, you know, solves the problem that you were trying to solve or does a good job of representing the data that you wanna analyze. And so that's a totally natural and great way to use dbt that we certainly practice at Fishtown Analytics.
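That environment separation typically lives in dbt's profiles.yml, where each target points the same project at a different schema. A minimal sketch; the connection details are placeholders and field names vary slightly by adapter and dbt version:

```yaml
# ~/.dbt/profiles.yml (illustrative only)
my_project:
  target: dev                # default target for local development
  outputs:
    dev:
      type: redshift
      host: example-cluster.host
      user: drew
      pass: "{{ env_var('DBT_PASSWORD') }}"
      port: 5439
      dbname: warehouse
      schema: dbt_drew       # personal scratch schema
      threads: 4
    prod:
      type: redshift
      host: example-cluster.host
      user: dbt_prod
      pass: "{{ env_var('DBT_PASSWORD') }}"
      port: 5439
      dbname: warehouse
      schema: analytics      # production schema, written to by CI only
      threads: 4
```

A CI job can then run `dbt run --target prod` (or a dedicated ci target with a per-PR schema) so that only reviewed, merged code ever writes to the production schema.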
[00:12:22] Unknown:
And when I was looking through the documentation, I noticed that there's first class support for adding tests around the different models and operations that you're creating with DBT. So I'm curious if you can talk through how those are implemented and just the overall workflow
[00:12:38] Unknown:
of building a new analytics report from scratch using DBT. Sure. So, DBT has basically 2 types of tests built in. 1 of them we call schema tests, and these are sort of templated assertions about your data. And so on databases like, say, Postgres, you can set column constraints, and you can say, this column should always be unique or this is my primary key or it's not null. But most of the analytical warehouses out there don't support this full suite of database level checks on your columns. And if they do support them, actually, it's problematic insofar as, like, on Redshift, if you say that something is unique but it's not unique, it won't be enforced and your query results could be incorrect.
So we generally don't recommend that people use database-level constraints, and instead we recommend something that you might call batch data testing. The big idea being that you actually transform the data and then you assert that the things you think about the data are true. Dbt ships with four built-in tests: uniqueness, not-nullness, referential integrity, and enumeration of accepted values. Out of the box, you can test that these things hold against any column in any model that you create. You can additionally create your own schema tests with macros, which is a super interesting topic. And what you can basically do is write the template for how all the schema tests work. You write a select statement that returns 0 rows if the constraint holds and more than 0 rows if the constraint does not hold. So basically, the not null test counts the number of rows that are null, and if it returns any number other than 0, you know that your test didn't hold. And so you can write a similar type of test for whatever constraint you could imagine, like that the sum of two values equals a third value, for instance. So those are schema tests, and once you define them, they're very easy to apply as metadata to your model specifications.
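In project terms, the built-in tests are declared as metadata in a YAML file next to the models (model and column names here are hypothetical):

```yaml
# models/schema.yml (illustrative)
version: 2
models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: user_id
        tests:
          - relationships:          # referential integrity
              to: ref('users')
              field: id
      - name: status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'returned']
```

And a custom schema test is just a macro that renders a query selecting the violating rows, something like the following sketch (the exact macro signature and pass/fail convention have varied across dbt versions):

```sql
-- macros/test_non_negative.sql (hypothetical custom schema test)
{% macro test_non_negative(model, column_name) %}

    -- rows returned here are violations; the test passes when none come back
    select *
    from {{ model }}
    where {{ column_name }} < 0

{% endmacro %}
```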
They're a really handy way to quickly assert that your assumptions hold, both when you're actively developing models (so test-driven development, which you can do with dbt) and also for, you know, the case of regressions, when you go and modify other seemingly unrelated code down the line. We've all been in the position where we write a bad join that fans out some orders table and inflates revenue. And these schema tests for uniqueness and foreign key constraints, for instance, really help find these things before they make their way to production. I think the other question you asked about was the process of actually building a report on top of a dbt model. Is that right? Yeah. Just talking through the overall workflow
[00:15:07] Unknown:
of getting started with dbt and going from having a problem and then, all the way through to solving with dbt and getting it into production? So people have certainly
[00:15:17] Unknown:
built reporting out before dbt existed, or they do a good job without dbt. But the thing that dbt really helps you solve is building this corpus of data, this, like, information architecture in your database. And so having a really good definition of exactly what an order is or what a user is, for instance: that's a really challenging problem when you have multiple different apps with varying levels of authentication, for instance, visitors, users, etcetera. And so you can build out all of these data models with dbt. And so usually the starting point is, like, you have some problem and you just don't want to go and encode the transformation in your actual analysis. So here's one we see a whole lot. You are an ecommerce company and you sell your product in a dozen different countries. And so every time you want to do any analysis, you have to sort of build a union that unions together the orders from a dozen different countries. And so one day you get tired of doing that and you check out dbt. And, basically, what you do is you open up a folder, you pip install dbt or brew install dbt, you set up your connection information so that dbt knows how to talk to your database, and then you open a model file. These are just .sql files in some Git version-controlled repository on your hard drive. And so what you do is you start typing the SQL select statement that produces the transformation that you want to materialize in the database.
Maybe you start by, you know, select star from us.orders union all select star from ca.orders. And you enumerate your dozen different datasets, and then you press save at the end and you pop open your terminal. Now it's the actual process of invoking dbt to materialize the SQL select statement as an object in your database. And so the way that you do that is simply by executing dbt run, which is the secret of why dbt is named what it is: it's very short and easy to type compared to something like "the world's greatest data build tool". That would be too long, I think. So, okay.
So once you have run your dbt project, dbt will find all of the models that exist in your identified dbt project. So Git repositories are identified by a .git directory; dbt projects are identified by a dbt_project.yml file. And so if you have one of these files, you're in a dbt project, and anything in a folder called models is, by default, considered a model. Maybe you build that thing, you write your tests to assert that your order IDs are unique and not null, and that every order has a user associated with it, say. Then all you have to do is go into your BI tool, and instead of selecting from a dozen different orders tables in a big union, you can replace that with select star from, in my case, dbt_drew.orders.
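For reference, that project file is small. A minimal sketch (the exact keys have shifted across dbt releases):

```yaml
# dbt_project.yml (illustrative)
name: my_analytics
version: '1.0.0'
profile: my_project          # which profiles.yml entry supplies credentials

source-paths: ["models"]     # anything under models/ is treated as a model

models:
  my_analytics:
    materialized: view       # project-wide default; override per model with config()
```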
So if I run that code and it checks out, everything looks great, what I can do is, you know, go do the Git flow and push my code, merge to master, deploy in production. In my production environment, the target schema is called analytics instead of dbt_drew. And this is all configuration, you know? And so suddenly, you can swap out dbt_drew for analytics in your reports and you're pointing to the production data model for your unioned orders dataset. The wonderful thing here is, as time passes, if you find that you actually wanted to exclude some orders that were test orders (they didn't really count, you want to filter on the IP address for your office or something), you now have a logical abstraction for what an order is. You're not pointing to 12 different source tables in your BI tool. You're instead pointing to the abstraction of an order. And so you go into your model and you, at the very bottom, add, you know, where is_internal equals false or whatever.
You do the same Git flow to push to production, and all of a sudden, all of your reports that point to orders now have an updated, more correct definition of what an order is. And this is something that you certainly could do by clicking into the 50 different reports that touch orders in your BI tool, but certainly it's probably nicer to just do it in one place instead.
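A hedged sketch of what that unioned model might look like once it lives in dbt (the country schemas and the is_internal flag are made up); the filter added later lives in exactly one place:

```sql
-- models/orders.sql (hypothetical schemas and columns)
-- One logical definition of an order, unioned across per-country schemas.
{% set countries = ['us', 'ca', 'uk', 'de'] %}

with unioned as (

    {% for country in countries %}
    select *, '{{ country }}' as country_code
    from {{ country }}.orders
    {% if not loop.last %}union all{% endif %}
    {% endfor %}

)

select *
from unioned
-- added later, in one place, instead of editing dozens of reports:
where is_internal = false
```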
[00:19:12] Unknown:
And, also, if you don't have this versioned repository of queries and reports and the ability to collaborate on it fairly easily, then you end up leaving everyone to write their own SQL, usually ad hoc, and they might have their own assumptions as to what an order is or what a customer is or how to structure the query to join across different tables. And so everybody's going to have slightly different views of the data or slightly different outputs. And so definitely having the ability to have one location that everybody can look to and one interface for everybody to collaborate on makes it much easier and more scalable to
[00:19:51] Unknown:
have more people working on building and interacting with these analytics reports and these analytics pipelines. Absolutely. I think that's a great point. What we find is that in the process of building up these data models, what you're actually doing is generating knowledge about your organization. And so you're saying, here's exactly what an order is, or here's exactly how we calculate MRR. And to that end, dbt ships with auto-generated documentation about your project. You can run dbt docs generate to generate this single-page app of all of your models, with all the tests on the columns and descriptions that you can populate for these different models. And so if you do have some consumer of the data that isn't using dbt, they have a great place they can go and see all the models that exist and all of the columns and your prose about all of it. And so in that way, it's sort of a catalog of all the data that your organization commands, and sort of instructions for use too. Yeah. And I think that that is definitely very powerful because
[00:20:47] Unknown:
particularly having the documentation be generated as part of the code, as opposed to something that someone does after the fact or alongside the work that they're doing, means that it's much more likely to stay fresh and actually be updated periodically, rather than somebody putting in the time and effort to write some prose once when they first build the overall reporting pipeline, and then it quickly grows stale and useless over time as new modifications are made. Yeah. That's absolutely right. And another interesting capability that dbt has is the idea of packaging and being able to package up these different subsets of reports or transformations so that they're reusable across different teams and across different code bases. So can you talk a bit about how those packages are set up and implemented? And also maybe talk a bit about who the sort of primary drivers are for the different packages that are currently available. Sure. So when dbt runs, it will look in a couple different places for
[00:21:49] Unknown:
what we would call resources. An example of a resource is a model or a test of a model, or things like documentation snippets, etcetera. One of the places it looks is your models directory, which holds the models that you've created, but the other place it looks is a folder called dbt_modules, which is sort of Node-inspired. And so what you can do is just drop whole dbt projects into that dbt_modules folder and they get picked up as though they're a natural part of your project. All of these resources become available in the compilation context that dbt provides. And so there are basically two types of packages that are produced. One is dataset-specific packages and the other is sort of macro or utility packages. An example of a dataset package is something like Snowplow.
And so we're huge fans of the Snowplow event tracking system at Fishtown Analytics. The big idea is that you can track events from all your different clients and they flow directly into a big events table in your warehouse. And so this events table is like an immutable event log. It has a full history of every interaction that you cared enough about to track, in a single table, which is phenomenal. It's a great resource, but the problem is it's difficult to plug a BI tool right into that, either because it's too much data or because the things you really care about are hard to analyze, like how many people had two different events in a single session. And so what we frequently find ourselves doing is rolling up these events into sessions using some code that was actually originally produced by the Snowplow team, called their web data model. And so what we can do is we can make a package of these transformations that go from raw events to page views to sessions, all the way up to users.
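Pulling a package like that into a project is typically just a couple of lines of configuration plus a fetch step. A rough sketch; the package names and versions here are illustrative, and the exact file name and command have varied across dbt releases:

```yaml
# packages.yml (illustrative)
packages:
  - package: fishtown-analytics/snowplow
    version: 0.6.0
  - package: fishtown-analytics/dbt_utils
    version: 0.2.0
```

Running `dbt deps` then pulls those projects into the local modules folder so their models and macros resolve just like your own.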
Then we encode these things as dbt models, and if you include this package in your dbt project, when you type dbt run, these models will automatically run. You can also reference them from your own models. So if you want to do marketing attribution on top of the sessionization that was provided by the Snowplow package, you can absolutely do that. The other broad type of package that we make is maybe more focused on macros. The Jinja templating engine supports something called macros, which are functions that return text, basically. In most cases, text. We actually hack them so they return other things, which is pretty wild how we do it. And so what you can do is, if you find yourself writing the same type of code over and over again, you can make a macro that accepts arguments and spits back out, usually, the SQL that you need to encode that particular piece of logic.
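A tiny, hypothetical example of that pattern: a macro that returns a SQL expression, so it can be dropped into any SELECT list instead of retyping the same snippet:

```sql
-- macros/cents_to_dollars.sql (hypothetical utility macro)
{% macro cents_to_dollars(column_name, precision=2) %}
    round({{ column_name }} / 100.0, {{ precision }})
{% endmacro %}
```

A model would then use it as `select {{ cents_to_dollars('amount_cents') }} as amount from {{ ref('payments') }}`, and the compiled SQL contains the rounded expression.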
So here's a really good example that shows the full, like, force of the dbt compilation engine. We wrote, actually, let me correct that, somebody contributed a pivot macro that you could point at a table and a specific column, and you can say, pivot out the values in this column using this aggregate function. So you say, look at the orders table. Or, you know, how about this better example: look at the products table, look at the product's color, and then pivot that out to, like, color red, color blue, color green, with a 1 if that's true or a 0 if it's not. And so this is probably something that a lot of people have written manually many times over. But with macros and the ability to sort of encapsulate logic, plus packages, which is a distribution mechanism, we can write that thing once and many, many people can benefit from it. So this is one example of a macro that was contributed by a member of the dbt community, but really this dbt-utils package that contains the pivot macro has dozens of macros, many of which were contributed by dbt users. And the really cool thing is a lot of these people aren't engineers by trade. They're analysts. And so for a lot of them, it's their first time contributing to an open source repository. And that's a pretty cool experience, to be the beneficiary of the code that they wrote. And as far as the packages themselves,
[00:25:38] Unknown:
they are primarily the SQL or Jinja statements. But I'm wondering if you add any support for being able to create additional filters in the Jinja context for people who are building these packages. And if so, are there ever any additional dependencies that get pulled in as a result? Sure. So currently, dbt does not support custom filters at all. It is, though, something on our road map, and I think that you're right. That's a great opportunity for
[00:26:07] Unknown:
sharing the same type of code that many people can benefit from. But you're right to identify that the Python environment is one of the most difficult aspects of this whole question. In practice, we find that a lot of the analysts that use dbt have non-pristine Python environments. And so the idea of installing things into them, or getting them to install something into them based on third-party packages, is a little bit difficult. And so I think I'd wanna have a better handle on how dbt works with its Python environment in general before tackling that kind of question.
[00:26:42] Unknown:
And so the way that all of these models are defined is via these SQL select statements. And so I'm wondering what you have found to be any limitations of the abilities of creating models based on these SQL queries.
[00:26:58] Unknown:
Sure. So I think this is one example where constraints kind of breed creativity. So there are a ton of benefits of encoding all of your data transformations purely as SQL select statements. You get idempotence right out of the gate. And so if you run the same model against the same dataset, you will 100% of the time get the same output, maybe 99% of the time get the same output. That is definitely not the case when you write transformations that use delete and update and insert kind of haphazardly. And so behind the scenes, dbt takes your select statement and it wraps it in create table as, or insert into ... from select, for instance,
in order to either create or mutate the objects in your database without you actually having to write that specific, potentially non-idempotent logic. And so this can be a difficult thing to do, to reframe an insert statement that you have, or an update statement, as a series of select statements. But in practice, I don't think I've found an update statement that couldn't be rewritten as a select. And, really, the benefit here is that with select statements, you're building data models. Update statements are untyped mutations, whereas with a select, you produce a thing at the end. And so every time you produce a thing with dbt, you can test it and you can document it. And so maybe it's strictly internal and you never want the world to see it; you can configure it to live in a custom scratch schema that you delete at the end of your dbt run. But just in the middle of the run, you get an entity that, if there's a bug, you can inspect, you can select from, which you can't do if you just run a dozen, you know, updates and inserts and deletes in order. So, truly, I think, you know, in the introduction to the dbt documentation, it says in big letters: everything is a select. And I think that's one of the core benefits of dbt, that it forces you to think about how to define your transformations as pure transformations via selecting data. And
[00:28:58] Unknown:
just overall being able to make SQL code reusable and composable is difficult and sometimes impossible. And I know that you've added in these capabilities of Jinja templating, which helps to alleviate some of those issues. But I'm wondering what you have found to be some of the remaining issues and potential shortcomings of using Jinja for composing these various different SQL fragments and snippets, and how it compares to other approaches that might use a higher-level language that actually compiles down to the SQL and is closer to a sort of Turing-complete environment?
[00:29:40] Unknown:
Sure. So there was a good post that I read the other day. It was called We Can Do Better Than SQL, by, I think, folks from EdgeDB. And, basically, they enumerate all of the challenges with SQL, and one of them is non-orthogonality: you can't actually use the same query in places that you think you should be able to, or composition is difficult. They talk a lot about the ills of null values. And so the direction that they get to at the end is, like, we need a language other than SQL to query databases. And so I, like maybe some of the other people I saw on Twitter, thought this was a really good critique of SQL, but I also kind of disagree with the conclusion. And so I think SQL is, you know, the lingua franca for querying databases, and it has been since the seventies. It will be for another 40 years, I think. And so the question then is, how do we define, in my mind, good practices around using SQL to make it modular and reusable?
And so for dbt, that sort of happens in two different ways. One of them is we think less about reusing SQL snippets. That certainly happens a lot, but if you have maybe a fragment of a query, that's less useful than actually building a named object in the database. And so if the thing you really want is to, like, have a persistent definition of orders in your database, and not a query that can generate orders, dbt is certainly something that makes that very doable. There are legitimate cases where you do want to reuse a SQL snippet, and one of the places that we see come up a lot is around window functions. What you can do is write a macro that returns a window function definition for, say, the sum over some partition with some configurable time range. If you have a dozen different time ranges you care about, you're gonna encode that in a macro. And then you can reuse the pointer to that macro many times over in many different models.
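A sketch of that window-function case, with hypothetical names; the macro returns just the expression, and the caller decides which column and window to apply it to:

```sql
-- macros/rolling_sum.sql (hypothetical)
{% macro rolling_sum(column_name, partition_by, order_by, preceding=6) %}
    sum({{ column_name }}) over (
        partition by {{ partition_by }}
        order by {{ order_by }}
        rows between {{ preceding }} preceding and current row
    )
{% endmacro %}
```

Used as, say, `select order_date, {{ rolling_sum('revenue', 'user_id', 'order_date', 29) }} as revenue_trailing_30d from {{ ref('orders') }}`, it keeps the windowing logic in one place across every model that needs it.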
And so the thing that becomes important is thinking about the actual interfaces of, say, these macros. So maybe you have macros that return columns, or macros that return the names of tables or subqueries. If you're not cognizant of exactly the contracts between these macros, it can be very hard to reuse these macros and to compose them. And so the thing I've been thinking a lot about is: if SQL were to have types for SQL fragments, what would they be? And so I'm not so far down this path. If anyone has more evolved thoughts on this, I would love to hear them. But broadly, I think that there are interfaces like a queryable interface. And so something that's queryable is the name of a table or the name of a view. Equivalently, it could be a select statement, but only if you wrap it in parentheses.
And so regardless of what you have, maybe it's the name of a table, maybe it's actually a subquery wrapped in parentheses, you could always say select star from, and then that thing. Right? And so if you're cognizant of the interfaces of your macros and you're consistent, and you do the same things with, you know, scalar values (say it's either a literal or the name of a column or a subquery that returns a single row), you can do all these things and you can be consistent in your interfaces. And then it becomes a lot easier to compose these things on top of
[00:32:49] Unknown:
each other. Yeah. No, I definitely like the way that you approach the idea of having these interfaces to these different concrete objects within the database, rather than just trying to use the typical approach of munging together a bunch of different text snippets into something that's supposed to make sense as a whole, which often doesn't and sometimes just doesn't even function at all. So I like having that perspective, particularly from someone who is working closely with the space and trying to make these SQL objects more composable and more reusable and able to be collaborated on, versus just giving up on SQL in general and saying, no, we're just going to either move to a higher-level language that compiles to the SQL or just, you know, throw your hands up and have, you know, v1, v2, v3, or, like, 15 different slight variations of the same query to do slightly different things.
[00:33:42] Unknown:
Yeah. Exactly. You know, I mentioned EdgeDB, but the other prior art here is, I think, LookML, which also compiles to SQL, but they tackle this problem of dimensions and measures. They deal, I think, a lot less with the composition of broader SQL constructs. And so I don't think it's a solved problem in this world today, exactly. And my instinct is that it's not gonna be a technological solution. It's gonna be sort of a convention solution to this problem, in which people kind of add constraints to their own workflows in order to make certain other types of things possible.
[00:34:14] Unknown:
And so can you explain now how dbt itself is implemented and how the design and capabilities of the tool have evolved since you first began working on it? Sure. So when we first built dbt,
[00:34:28] Unknown:
the only warehouse that was in our purview was Redshift. And so it was, in fact, very Redshift-specific in all of our code; like, it kinda looked like a tool that existed to run queries against Redshift in topological sort order, you know? And so over time, our horizons broadened. We started making dbt work with BigQuery and Snowflake, and it started to look a lot less like a tool that ran queries against a database and a lot more like a compiler. And so there's a parsing step, and after parsing, we pluck out all these ref statements that I mentioned earlier to understand the shape of the graph.
And then we analyze the graph and make sure there are no cycles anywhere in your dependencies. This is a whole topic, but dbt has a notion of an ephemeral model, which is sort of like a subquery that can be compiled into other models but doesn't actually get materialized in the database. So we recursively link all the ephemeral models that are relevant in the given context. And so really, it looks a lot more like a compiler if you look at the code these days than it does, you know, a tool that connects to databases and runs queries. And what we were able to do was take all the database-specific code out of Python, out of the dbt core package even, and package these things up in their own packages. And these are, like, Python PyPI modules. And so there's a dbt Redshift package that has all the logic required to do the things dbt needs to do on Redshift. So we actually do these things with macros.
There's a create table macro in the dbt Redshift package that knows how to create tables on Redshift, that's aware of sort keys and dist keys, and similarly for BigQuery and Snowflake. And so what this lets us do is what we call sort of an adapter plug-in architecture. We can now make dbt work with all types of different databases. And so we just produced, maybe at the beginning of 2019, a Presto plug-in and, shortly thereafter, a Spark SQL plug-in. And I know some folks are thinking about Hive and Athena as well, and Azure SQL Data Warehouse. And so the big idea is, like, we can have our core adapters that we use and test and verify. And if folks wanna make their own adapter plug-ins for their own databases, they can, like, feel very free to do that, and it's totally possible with dbt.
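As an illustration of the idea only (these are not the real implementations shipped in the adapter plugins, and the internal macro names and signatures have changed over time), an adapter can specialize DDL generation by overriding a macro with an adapter-prefixed variant:

```sql
-- Illustrative sketch of adapter-specific DDL macros.
{% macro default__create_table_as(temporary, relation, sql) %}
    create table {{ relation }} as (
        {{ sql }}
    )
{% endmacro %}

{% macro redshift__create_table_as(temporary, relation, sql) %}
    create table {{ relation }}
    {% if config.get('dist') %} distkey ({{ config.get('dist') }}) {% endif %}
    {% if config.get('sort') %} sortkey ({{ config.get('sort') }}) {% endif %}
    as (
        {{ sql }}
    )
{% endmacro %}
```

The compiler side only knows it needs a statement that creates a given relation from a given SELECT; the plugin supplies the warehouse-specific spelling.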
And that's the biggest architectural difference between dbt of 2016 and dbt of today: this core divergence of, like, dbt, the compiler, that knows about an adapter contract, and the dbt plugins that know all the specifics (and there are many) of how to build things on Redshift and Snowflake and BigQuery, etcetera. Yeah. And I think the recent addition of
[00:37:07] Unknown:
the Presto and Spark capabilities will certainly broaden the reach of dbt as far as the number of people who are able to use it, because, while data warehouses have traditionally been the default place for analytics to take place, data lakes are becoming more prevalent, particularly as we start to have more volume and variety of data. And so people would want to be able to have composable and reusable analytics in their data lake before they get it to the point where it's ready for a data warehouse. So I definitely think that was a smart move to add those capabilities. Yeah. Thanks. You know, for us,
[00:37:47] Unknown:
it was, and remains, a learning experience, in that, you know, we've put a lot of time in on Redshift and stuff like BigQuery, but really, professionally, this is our first foray into Spark SQL, and we haven't used Presto before in a production environment. And so there's a lot to learn about, you know, obviously, the SQL specifics, like what does a create table as look like on Presto. A level above that, it's things like: are transactions supported, or should we try to use transactions? How do we atomically replace objects in the database? And then maybe a step even beyond that, which is really the level that we're operating at right now and where we could probably use the most help from the experts in the community, is: how do you actually implement these paradigms well on a data lake? And so on Snowflake, we can incrementally build a table with a merge statement, and the same for BigQuery. But on Spark SQL, it probably looks more like insert overwrite for a partition, and you do this thing on a partition basis.
Like, none of the databases that we use, with the exception of BigQuery, have a notion of partitions. There's Spectrum on Redshift, but it's sort of the same issue. And so we're trying to rethink: how do we take the dbt paradigm, which has so many known benefits, and how do we adapt it to these other databases that provide different mechanisms for actually creating or modifying relations in the database? And certainly, that's an area where, like, it comes with experience. We've tried a couple different iterations here, and they work in some ways and they don't work in others. And we're very much still learning. And what have you found to be some of the features of dbt that are often overlooked, but that you think provide significant benefits and that people should pay more attention to? Sure. So the secret sauce of dbt is the DAG at the core of it. And in a release we made maybe a month and a half ago, maybe six weeks ago, we introduced the notion of data sources into dbt.
So previously, when you wanted to build your models, if you wanted to select from another model, you'd select from ref and then the name of the model in your Jinja curly brackets. But if you wanted to select from a source data table, you would just say select star from schema dot table. And so the problem was dbt's DAG ended right before the source data, which is obviously such a crucial part of your transformation, like the actual data that's being processed. And so what we did was we added another function analogous to ref, but it's called source. And it lets you reference a source data table. And then separately, using pretty much the same syntax you use to specify models, you can specify sources.
This is like a YAML configuration, and you can say: this is my Snowplow source schema, and it has a table called event in it. Then in your model you can say select star from, you know, source snowplow event. This does two things. Dbt will make it possible to say run all the models that depend on Snowplow data, which is really cool, and you can visualize this in the dbt documentation; it actually builds the edge. The other killer feature that I think is currently underutilized is that you can specify the loaded-at timestamp for these source data tables, and then you can use the dbt source snapshot-freshness command to understand if your data is recent within acceptable SLAs.
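The source definition plus freshness SLA is a small YAML block; the schema, table, and column names below are examples:

```yaml
# models/src_snowplow.yml (illustrative)
version: 2
sources:
  - name: snowplow
    schema: snowplow                  # wherever the loader lands the raw data
    loaded_at_field: collector_tstamp # timestamp used for freshness checks
    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: event
```

Models then select from `{{ source('snowplow', 'event') }}`, and the freshness command compares the loaded-at field against those thresholds.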
In your config you say: I always expect my Snowplow data to be less than 24 hours out of date. Well, you can run the dbt source snapshot-freshness command, and if that's not true, dbt will exit with a non-zero exit code, and then you can fire off alerts in whatever way you see fit. And so this is really wonderful. Like, after deploying this thing, we set it up on our own internal analytics project, and we realized that three of the tables that we were using for reporting were in fact out of date. And we fixed them real quick. But, you know, there's no way of knowing, really, unless you're in there. Yeah. Definitely being able to have that visibility
[00:41:36] Unknown:
of the recency of the data is valuable particularly if you're trying to create reports with any sort of timeliness.
[00:41:43] Unknown:
Yeah. That's exactly right. In general, it's this whole conversation about, you know, making analytics look like engineering. If you're working on an app, you instrument monitoring and alerting because you know that some things are going to fail and you wanna understand what happened before your users find out. And so this is just another example where we take a tried and true best practice from the engineering world, and we say, like, okay, what's the analog in analytics?
[00:42:09] Unknown:
So DBT itself is an open source project, and I know that you've also built the DBT cloud capability. So can you talk a bit about the additional features that that brings in? Sure. So,
[00:42:21] Unknown:
we built dbt Cloud. I guess the thing to say is, Fishtown Analytics, in addition to maintaining dbt, is an analytics consultancy. And so we built the first versions of dbt Cloud about 3 years ago in order to deploy dbt for our clients. And so we were working with people that didn't have any sort of data competency. We needed to set up a warehouse for them, get the ETL set up, do the data modeling with dbt, plug in a BI tool, etcetera. And we didn't wanna spin up Airflow for this, you know, just to run dbt jobs nightly. That'd be kinda crazy. So what we did was we made this thing called, I think, Hosted dbt, that eventually morphed into dbt Cloud. The big idea is that it is a sort of user interface and job scheduler on top of dbt.
From this interface, you can go in there and set up your connections, connect to your GitHub repository or whatever Git provider you use, and say: I wanna run, you know, this job, dbt run, nightly at midnight UTC, or every 4 hours. And you can also, in the same interface, set up your tests to run, maybe after your jobs have run, to alert you with email notifications if anything didn't run the way it was supposed to run. Right? The other really cool thing that you can do is it builds your dbt documentation for you and hosts it. And so if you wanna invite a bunch of your colleagues to look at your dbt documentation, you can do that without, you know, setting up some server behind your VPN, for instance.
And long term, it is and will continue to be a user interface on top of dbt. And so we have a ton of metadata about your runs, like how long every run takes, or even within a run, how long each of the models took, for instance. And so exposing this data, either as visualization or, like, via the API, is super valuable, in addition to even simpler stuff like viewing the logs for a job that ran yesterday and failed. You know, if you're running dbt on cron or something in production, and you have some analyst that's like, oh, my tables are out of date, they have to SSH into the cron server to tail the logs, or they have to ask you what happened, you know? And so dbt Cloud is an interface into this, you know, productionized deployment of dbt, such that everyone that interacts with dbt can understand what's happening, view the logs, view the documentation, and, soon, like, build their dbt models within the interface.
[00:44:42] Unknown:
And so what have you found to be some of the most interesting or unexpected or innovative ways that you've seen dbt used? That's a really good question. So
[00:44:51] Unknown:
we have a dbt Slack group that's open to the public, and I cannot say enough good things about it. The people in there are, like, incredibly helpful and kind and knowledgeable. But also, every now and then, we see people that are trying to do some pretty interesting things with dbt, so, novel applications. You know, one thing that I think is really compelling is when people use dbt to power their in-app analytics. And so they're exposing numbers to their clients or charts to their clients, and they use dbt to actually transform that data and test that everything's valid. I feel like, generally, internal analysts are more forgiving than external clients. And so that's a lot of trust that people put in dbt. I really like to see people doing that. On the topic of maybe particular things, it's probably macros.
One of the really cool things I saw someone do was, they wanted to use a recursive CTE to walk, like, a tree or a graph, like a self-referential table with, I don't know, say customer IDs or something that referred back to itself. And so if you're on Redshift, you can't use recursive CTEs. And what this person did was they built a macro that they could provide with, like, a depth, and it would self-join the table to itself some specified number of times such that it could flatten out the whole tree. That was pretty cool. It's really neat to see people, like, implement almost database-level functionality using Jinja macros. Like, I once saw a tweet that said, if you ever feel bad about programming, just remember that computers are rocks that we tricked into thinking. And so I always think, like, yeah, it's not crazy if it works, you know.
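A hypothetical reconstruction of that trick (not the original contributor's code): unroll the recursion into a fixed number of self-joins, which works on warehouses without recursive CTEs as long as the tree is no deeper than the chosen depth:

```sql
-- macros/flatten_tree.sql (hypothetical)
-- Finds the root ancestor for each row of a parent/child table by
-- self-joining max_depth times instead of using a recursive CTE.
{% macro flatten_tree(relation, id_col, parent_col, max_depth=5) %}
    select
        t0.{{ id_col }} as leaf_id,
        coalesce(
            {% for i in range(max_depth, 0, -1) %}t{{ i }}.{{ id_col }}, {% endfor %}
            t0.{{ id_col }}
        ) as root_id
    from {{ relation }} as t0
    {% for i in range(1, max_depth + 1) %}
    left join {{ relation }} as t{{ i }}
        on t{{ i }}.{{ id_col }} = t{{ i - 1 }}.{{ parent_col }}
    {% endfor %}
{% endmacro %}
```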
Yeah. People do crazy things with computers, and sometimes you have to ask yourself, one, how do they have that much free time? And two, you know, why are they even bothering? But most of the time, the answer is because you can, and sometimes that's good enough. Yeah. That's exactly it. I mean, the meta point here is that a lot of these analysts that use dbt, like, they don't have a computer science background necessarily. They came from some other domain, but they're every bit as, like, capable and creative and, you know, good at their jobs as, say, a software engineer. And so giving them a substrate upon which they can be creative and, like, build their own tooling, that's, like, such a compelling part of the work that we do, to me. I feel like, historically, analysts have been serviced by products that you pay for. And so, certainly, you know, I want a lot of people to pay for dbt Cloud, but fundamentally, in the open source version of this thing, you can do everything you need. And I think that's really great. It's really like, the best part of my job is when I get pull requests from people that have never made a pull request before. They're like, hey, I have this problem. I solved it. I want other people to benefit from it. That's a really cool part of, you know, the world that is, like, materialized, if you will, by dbt.
[00:47:33] Unknown:
And in terms of your experience of building and maintaining the dbt project, what have you found to be some of the most interesting or useful or challenging lessons that you've learned? Sure. So I think that I know more about the DDL,
[00:47:46] Unknown:
and the undocumented quirks of the modern warehouses out there than, like, most people in the world. And so something that you think should be easy, like, I want to atomically replace this table, is, like, a really challenging problem across this sort of disparate set of databases. And so it's like, oh, you wanna do it this way on Redshift, but, well, wait, BigQuery doesn't have transactions. Or, like, Snowflake has create or replace table, but if you try to do that on Redshift, you can get an error if the column definitions change, for instance. And so a big challenge for me was kind of understanding that, like, sometimes it seems like a good idea to abstract over a problem, but if you don't have full knowledge of the actual problem space, you're better suited by, like, say, duplicating code than trying to build some faulty abstraction, you know, and, like, shoehorning it in there. That's a lesson that I've learned many times over in many different domains in doing this type of work. Like, no abstraction is better than a bad abstraction. I guess, simultaneously, I've learned, like, the right way to do all this stuff, and that's pretty fascinating too. I'm always happy to talk about DDL specifics. Here's an example: you can't rename a view on Snowflake if it selects from a table that doesn't exist anymore. That's a thing that I know that, like, I wish I didn't, but there you go. You know?
Yeah. I don't know. The other really big thing for me: I started as a software engineer writing the code for dbt, and increasingly, much to the benefit of the people that use dbt, I don't write much of the code anymore. Instead my colleague Jake, who's phenomenal, writes the majority of the PRs for dbt core. So I find myself moving into, like, a product management role, and the way that manifests on an open source project is there's a lot of people on Slack that have questions or suggestions, or they show up in GitHub with issues and feature requests. So it's sort of a trial by fire for me: how do you do a good job of managing an open source community? And also, to my benefit, my colleague Claire, who started just over six months ago, has been the community manager behind dbt, and she's doing such a phenomenal job handling the actual community aspect that I can focus on the development aspect of it. So you get issues from people in GitHub, and it's like, how do I prioritize this thing? How do I help the user maybe see a different way that we can accomplish this that's more in line with the vision we have behind dbt? And a lot of it, ultimately, at the end of the day, is just that there are no rules for how to be a good maintainer. I think I was talking to another, well, I shouldn't say another. I was talking to a prolific maintainer of open source software at the Data Council conference.
And I asked him, I was like, how do you know what's acceptable and what's not? And he said, Drew, you just gotta be kind to people. You know, don't be rude to anyone and it'll be fine. And so I internalized that. I try to understand where people are coming from all the time, and also kinda be firm if there's something that they want really badly but that I don't wanna incorporate in dbt. I'm happy to point someone to the fork button on GitHub, you know? But overall, this shared experience of, you know, we're all trying to build this thing together, and I'm in the position of, like, product manager and maintainer, has been so fascinating, and I love it so much. And, certainly, it's one of the areas where I've grown the most since starting to work on dbt three years ago. And what would you say are the cases where dbt is the wrong choice and somebody should reach for a different tool? Sure. So this is a great question, and I think maybe less so these days, but certainly a little bit earlier, people would show up in our Slack group and they would say, hey, I wanna run these, like, update and insert statements. How do I do that with dbt?
We'd be like, hey, wait, listen. Here's the best practices guide. You'll find that there's no mention of updates or inserts in it. Here's how to do it the dbt way. And so we've been pretty concerted about exactly how to do things well with dbt. And we have a document online called, is dbt the right tool for my, you know, data transformations? So I would say that in general, if you find yourself fighting against dbt rather than cruising along with it, then dbt might not actually be the best way to generate and run the SQL that you need. That's things like custom DDL, or if you want very close to real time processing, since dbt is a batch-based process, or, fundamentally, if you don't have the data that you need to transform in your warehouse, dbt is probably not the right tool. But if you don't have any of those challenges, if the data is already there, you can work in a batch-based way, and you're happy with select statements, then dbt is a great way to execute
[00:52:06] Unknown:
and test those queries for sure. And are there any other aspects of dbt or data warehouse transformation that we didn't discuss yet that you'd like to cover before we close out the show? Yeah. There's a whole suite of
[00:52:19] Unknown:
tools that have come up so many times for us in our consulting work, or that folks in the community have mentioned, that we've built out. So one of them is what we call archival. It's the ability to generate type 2 slowly changing dimensions from a source table. This is something you would run sort of out of band from your dbt process. Maybe every hour you want to snapshot a specific table, and you can understand when the name of a product changes, or whatever value might be updated in place. You can capture that change and then look back and see, at given time intervals, what the value was for specific columns. Archival is pretty cool. We're overhauling it in our next release. I'm really excited about that.
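As a rough illustration of the idea, here is what a type 2 capture looks like in the snapshot-block syntax that later dbt releases moved archival toward; the schema, table, and column names are made up.

```sql
-- Illustrative type-2 slowly changing dimension capture; every run
-- compares current rows against what was previously recorded.
{% snapshot products_snapshot %}

{{
    config(
        target_schema='snapshots',
        unique_key='id',
        strategy='timestamp',
        updated_at='updated_at'
    )
}}

select * from {{ source('app', 'products') }}

{% endsnapshot %}
```

Rows that change get closed out and re-inserted with validity intervals, which is what makes the "what was this value during that week" style of question answerable.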
So much of dbt is deeply customizable from within your own project, and that's the thing that's hard to convey in a conversation like this, but you can write your own macros in your project and override internal functionality of dbt. So if you don't like the way that dbt creates tables and you would prefer to do it a different way, you can make your own materialization, for instance. You just put that in your project. You don't have to fork dbt, and dbt will pick up your materialization definition. That's super cool. And, you know, one example is a community-contributed materialization called insert by period. It's sort of like our incremental materialization, but it does it on a daily basis, which sort of approximates these insert overwrite partitions we were talking about earlier. That's pretty neat. I think beyond that, dbt is a game changer for how you actually do the analytics work that you do. It's, I think, welcomed by a lot of teams out there that might have a mess in their BI tools and inconsistencies across different reporting interfaces.
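To give a flavor of that extensibility, here is a deliberately minimal custom materialization skeleton that would live in a project's macros directory. The internal calls are a sketch: the exact materialization API has shifted between dbt versions, and a real implementation would also handle hooks and pre-existing relations.

```sql
-- Sketch of a project-local materialization that just runs CREATE TABLE AS.
-- Models opt in by setting materialized='plain_table' in their config.
{% materialization plain_table, default %}

    {%- set target_relation = this.incorporate(type='table') -%}

    {% call statement('main') -%}
        create table {{ target_relation }} as (
            {{ sql }}
        )
    {%- endcall %}

    {{ return({'relations': [target_relation]}) }}

{% endmaterialization %}
```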
[00:53:51] Unknown:
And so for anybody who wants to get in touch with you and follow along with the work that you and your team are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. Yeah. That's a really good question. I think that the last few years
[00:54:17] Unknown:
have brought a sea change for how data is ingested into databases. So these off the shelf ETL tools are great. The tooling for actually building ETL, if you need to build custom stuff yourself, is great compared to what it was a few years ago. I think that, you know, the actual warehouse technology has made quantum leaps in the past five years, well, maybe a little bit longer than that at this point, going from a pre-Redshift world to where we are now, where it's not just Redshift: BigQuery is amazing and Snowflake's phenomenal as a warehouse. Those are all great. I think the last part of it is the BI part of the stack. These individual tools all work really well, the Lookers and the Mode Analytics and the Periscopes of the world. They're all great tools for what they do, but the actual interoperability with the rest of the stack is, I think, not all the way there. And the place where I think the most value can be added is in understanding the connection between the queries that are running in your BI tool and, you know, maybe this is a dbt-centric point of view, but which dbt models they're querying, for instance. And actually uniting the whole stack from source data through to, you know, ETL into your warehouse, transformation, and analysis.
[00:55:39] Unknown:
If you can have that unified picture of lineage for an actual chart, and understand the provenance all the way back to source data, that's a really powerful thing that to date has been elusive, but I think we're getting closer and closer to it. Yeah. I agree that being able to have that complete end to end visibility is highly valuable, particularly when you're just looking at a visualization at the end of the day, and you say, you know, that's great, but I have some additional questions that you can't really answer unless you know where that data is coming from and what's been done to it. So I agree that that's definitely something to continue to work towards. And so, yeah, I just want to thank you very much for taking the time today to join me and discuss the work that you've been doing on dbt.
It's definitely an interesting tool, one that I look forward to seeing continue to be developed and maintained, and one that I'll probably introduce to some of my teammates. So thank you for all of that, and I hope you enjoy the rest of your day. Great. Thanks so much, Tobias. It was really a pleasure. Thanks for having me
[00:56:41] Unknown:
on.
Introduction to Drew Banin and DBT
Origins and Evolution of DBT
DBT's Core Functionality and Workflow
Testing and Developing with DBT
Benefits of Centralized Data Models
DBT Packages and Macros
SQL Reusability and Composition
DBT's Adapter Plugin Architecture
Underutilized Features of DBT
DBT Cloud and Its Benefits
Innovative Uses of DBT
Challenges and Lessons Learned
When DBT is Not the Right Tool
Future of Data Management Tools