Summary
dbt has become overwhelmingly popular across analytics and data engineering teams. While it is easy to adopt, there are many potential pitfalls. Dustin Dorsey and Cameron Cyr co-authored a practical guide to building your dbt project, and in this episode they share their hard-won wisdom about how to build and scale dbt projects.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Data projects are notoriously complex. With multiple stakeholders to manage across varying backgrounds and toolchains even simple reports can become unwieldy to maintain. Miro is your single pane of glass where everyone can discover, track, and collaborate on your organization's data. I especially like the ability to combine your technical diagrams with data documentation and dependency mapping, allowing your data engineers and data consumers to communicate seamlessly about your projects. Find simplicity in your most complex projects with Miro. Your first three Miro boards are free when you sign up today at dataengineeringpodcast.com/miro. That’s three free boards at dataengineeringpodcast.com/miro.
- Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack
- You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free!
- Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
- Your host is Tobias Macey and today I'm interviewing Dustin Dorsey and Cameron Cyr about how to design your dbt projects
Interview
- Introduction
- How did you get involved in the area of data management?
- What was your path to adoption of dbt?
- What did you use prior to its existence?
- When/why/how did you start using it?
- What are some of the common challenges that teams experience when getting started with dbt?
- How does prior experience in analytics and/or software engineering impact those outcomes?
- You recently wrote a book to give a crash course in best practices for dbt. What motivated you to invest that time and effort?
- What new lessons did you learn about dbt in the process of writing the book?
- The introduction of dbt is largely responsible for catalyzing the growth of "analytics engineering". As practitioners in the space, what do you see as the net result of that trend?
- What are the lessons that we all need to invest in independent of the tool?
- For someone starting a new dbt project today, can you talk through the decisions that will be most critical for ensuring future success?
- As dbt projects scale, what are the elements of technical debt that are most likely to slow down engineers?
- What are the capabilities in the dbt framework that can be used to mitigate the effects of that debt?
- What tools or processes outside of dbt can help alleviate the incidental complexity of a large dbt project?
- What are the most interesting, innovative, or unexpected ways that you have seen dbt used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working with dbt? (as engineers and/or as authors)
- What is on your personal wish-list for the future of dbt (or its competition?)?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
- Biobot Analytics
- Breezeway
- dbt
- Synapse Analytics
- Snowflake
- Fivetran
- Analytics Power Hour
- DDL == Data Definition Language
- DML == Data Manipulation Language
- dbt codegen
- Unlocking dbt book (affiliate link)
- dbt Mesh
- dbt Semantic Layer
- GitHub Actions
- Metaplane
- DataTune Conference
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- Miro: ![Miro Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/1JZC5l2D.png) Data projects are notoriously complex. With multiple stakeholders to manage across varying backgrounds and toolchains even simple reports can become unwieldy to maintain. Miro is your single pane of glass where everyone can discover, track, and collaborate on your organization's data. I especially like the ability to combine your technical diagrams with data documentation and dependency mapping, allowing your data engineers and data consumers to communicate seamlessly about your projects. Find simplicity in your most complex projects with Miro. Your first three Miro boards are free when you sign up today at [dataengineeringpodcast.com/miro](https://www.dataengineeringpodcast.com/miro).
- Starburst: ![Starburst Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/UpvN7wDT.png) This episode is brought to you by Starburst - a data lake analytics platform for data engineers who are battling to build and scale high quality data pipelines on the data lake. Powered by Trino, Starburst runs petabyte-scale SQL analytics fast at a fraction of the cost of traditional methods, helping you meet all your data needs ranging from AI/ML workloads to data applications to complete analytics. Trusted by the teams at Comcast and Doordash, Starburst delivers the adaptability and flexibility a lakehouse ecosystem promises, while providing a single point of access for your data and all your data governance allowing you to discover, transform, govern, and secure all in one place. Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Try Starburst Galaxy today, the easiest and fastest way to get started using Trino, and get $500 of credits free. [dataengineeringpodcast.com/starburst](https://www.dataengineeringpodcast.com/starburst)
- Rudderstack: ![Rudderstack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png) Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at [dataengineeringpodcast.com/rudderstack](https://www.dataengineeringpodcast.com/rudderstack)
- Materialize: ![Materialize](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/NuMEahiy.png) You shouldn't have to throw away the database to build with fast-changing data. Keep the familiar SQL, keep the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. That is Materialize, the only true SQL streaming database built from the ground up to meet the needs of modern data products: Fresh, Correct, Scalable — all in a familiar SQL UI. Built on Timely Dataflow and Differential Dataflow, open source frameworks created by cofounder Frank McSherry at Microsoft Research, Materialize is trusted by data and engineering teams at Ramp, Pluralsight, Onward and more to build real-time data products without the cost, complexity, and development time of stream processing. Go to [materialize.com](https://materialize.com/register/?utm_source=depodcast&utm_medium=paid&utm_campaign=early-access) today and get 2 weeks free!
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack. You shouldn't have to throw away the database to build with fast changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades old batch computation model for an efficient incremental engine to get complex queries that are always up to date. With Materialize, you can. It's the only true SQL streaming database built from the ground up to meet the needs of modern data products.
Whether it's real time dashboarding and analytics, personalization and segmentation, or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results, all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free. Your host is Tobias Macey. And today, I'm interviewing Dustin Dorsey and Cameron Cyr about how to design your DBT projects. So, Dustin, can you start by introducing yourself? Yeah. Thanks, Tobias.
[00:01:34] Unknown:
So my name is Dustin Dorsey. I am a systems and data architect at Biobot Analytics, which is a wastewater epidemiology company based out of Cambridge, Massachusetts. Have been working with data for over 15 years at this point. Have designed and built tons of data warehouses using a lot of different technologies. And dbt is 1 of the tools that I've been using the past few years and something that I absolutely love, and
[00:01:56] Unknown:
I'm excited to be able to get on here and talk about it. And, Cameron? Thanks, Tobias. Cameron Cyr. I'm a staff data engineer at a company called Breezeway. So we build property management software for short term rental property managers, like Airbnb, Vrbo, that kind of thing. I've been in the data space since the beginning of my career. Well, my career hasn't been extremely long and robust at this point. I started off in the financial sector and have changed around quite a bit since then. But, yeah, DBT is 1 of my favorite tools in the modern data stack, if you will, and excited to,
[00:02:31] Unknown:
have our chat today about it. And going back to you, Dustin, do you remember how you first got started working in data?
[00:02:37] Unknown:
Yeah. So it's a it's actually a pretty interesting story that it's got some funny points in it. So I didn't go to school for tech. I actually went to a Bible college and studied theology, and here I am. At the time that I got into tech, I was working for a vending company and absolutely had the worst luck ever with this job. So my job was driving around stocking vending machines. During the course of working there, of no fault of my own, I blew multiple truck engines. I caught a truck on fire. I lost my drive shaft on the interstate and hit a semi truck. I took the awning out at a local hospital.
And every day, it felt like there was something crazy bad that was happening. And I was like, I'm becoming a liability to these guys. I just need to I need to find another job. And thankfully, at the time, I had a family member who was working for a company that was looking for entry level tech person. And so he reached out to him, and he was able to get me in the door basically just doing monitoring. So it's like, hey. We just want you to look at these alerts and tell us when there's a problem so someone that knows tech can go in and fix it. And while in that role, we didn't have a lot of UI tools to troubleshoot issues. We had to do it through the database layer.
So I began learning from them, and that was kind of my first exposure to data. And from there, my career has kind of taken off, and I have gone through a lot of different iterations. I kind of got my original kind of data start as a DBA, and then that's kind of gone into data architecture, into data leadership, into enterprise architecture, which is where I am now.
[00:04:10] Unknown:
And, Cameron, do you remember how you got started working in data? Yeah.
[00:04:14] Unknown:
So coming out of high school, actually, I was working for just an insurance company, and I was selling insurance. And quite frequently, I found myself trying to optimize my sales pipeline. Like, oh, you know, how could I do this more efficiently? Right? Constantly trying to understand, okay, these are the types of businesses that work better for me. I became, and I'll say it shamelessly today, like, an Excel guru at the time. And it's like, oh, there's gotta be a better way to do this stuff. And so I started researching and ended up going to college for business analytics. So I got a business analytics degree.
Coming out of college, I start working for a bank in Texas, like, 1 of the biggest regional banks in Texas, and, working there as a data analyst. And so I start getting to do the the analytics stuff that I was really enjoying in the past. But as I start working for this enterprise level organization, like, there's I'm starting to experience, maybe some data issues. Right? There's a lot of red tape on things in the banking industry as you can imagine. And so sometimes technology doesn't grow as quickly as you might you might want outside or you experience, maybe issues accessing data or issues trying to build out a new dataset. Things aren't always documented great. And so it's like, what are the roles that I could get into to to do better? Right? I mean, I felt somewhat limited as a data analyst. And so I started researching and learned about the data engineering position.
And I was like, oh, this is exactly what I'm looking for and dove head first into everything data engineering and everything under the sun. And that's how I ended up where I am today.
[00:05:58] Unknown:
And now bringing us around to the topic at hand where we're gonna talk through dbt, how to use it, some of the ways that people can scale their usage of it, some of the tech debt issues that might come up. Before we get into all of that, what was each of your path to actually getting involved with and adopting DBT as a utility for your toolkit?
[00:06:19] Unknown:
Yeah. I'll go ahead and start with this. So for me, a lot of the early part of my career, probably the 1st decade of my career, was spent pretty heavy in the Microsoft realm. So working with Microsoft services, Azure, and on prem Microsoft services. And a lot of my background is in health care, which traditionally leans a little heavier into the Microsoft world. I had an opportunity to leave a role to go work for a start up company a few years ago, and the start up company had already landed on what their infrastructure was gonna be, and everything was built within AWS. And all of my experience coming from the Microsoft world, I was just coming out of a role where I built a warehouse for a large company using Synapse Analytics.
I was now kind of moving over into working for an AWS shop where, obviously, Synapse Analytics, which is a Microsoft product, isn't supported on AWS. So I wasn't able to use that. And we had a big need of needing to build a data warehouse. We needed to centralize data. We needed to build analytics on top of it, etcetera. So the cloud tools that I was comfortable using weren't an option. So when we started evaluating other tools, I knew I wanted something that would allow me to use my SQL skill set because I love writing SQL. I've written SQL for a lot of years, and it was something that I wanted to be able to use. So when we started looking at and evaluating tools, DBT continually rose to the top of something that we were interested in, and packaged with the other vendors that we ultimately ended up selecting, which were Snowflake and Fivetran, which work really, really well with DBT, it became a pretty natural fit for us on what to use.
And since doing that, like, being able to kind of go back and look at what I've built in the past, where I've used SQL to build transformations, but using things more like stored procedures and then using other tools to orchestrate those. Like, DBT took that to a whole other level, where so much of those added components we were needing to build around the stored procedures were now all kind of integrated into a single tool. So it was very natural for me to land in it. And since I've kind of landed there, I'm like, why wasn't I using this the whole time that I was building things? But, yeah, that is ultimately kinda how I ended up using DBT. It's just a shift of role and having to step out of my comfort zone. And, Cameron, what was your path to adoption for DBT?
[00:08:49] Unknown:
Yeah. So when I started learning DBT or learning of DBT, it actually came from, just a podcast I was listening to. So this was back in 2019 is when I first started learning about dbt, and they were still a very early company at that time. I don't think dbt cloud was even a product yet. So it was all the all the open source solution at the moment. But I kept hearing it on on the same podcast. Like, they would bring it up almost every episode for a while, it felt like. And they would talk about things like data lineage, data quality, introducing software engineering best practices to your to your workflows. And at the time, I was still a data analyst. Right? And so I was a guy that had, just a file on my on my laptop that had, you know, a 150 different SQL scripts in it that I would run whenever somebody needed an analysis done. Right? So I was hearing these things, and I was like, wow. That would be really nice for my day to day work. And so I was still in the the banking industry at the time. Right? So as a low level data analyst, my odds of getting that implemented at the organization were slim to none, but I as a I as an individual wanted to go learn about it. So I started researching it online, reading through all their documentation.
Yeah. And, like, just really went heads down on it and experienced amazing things. Right? Just like Dustin, I was also using stored procedures at the time to manage objects in my database and my data warehouse and, the ability to be able to promote things through environments and whatnot in your data warehouse using dbt is 1 of the biggest reasons that I was influenced into it. Right? And being able to test data quality, like, another thing that is absolutely amazing to me. Since then, I've built 3 data warehouses from the ground up using DBT to manage all the transformation logic and would never go back to using stored procedures if if I could and have yet to have a reason to do so.
[00:10:43] Unknown:
That that podcast you mentioned wouldn't happen to be this 1, would it?
[00:10:47] Unknown:
It wasn't this 1. I can say the name of it if that's fine. That's fine.
[00:10:51] Unknown:
Yeah. It was the Analytics Power Hour. And I was just curious because I was wondering when did I actually do my first interview on DBT, and it was in 2019. So Yeah. And it does come up quite frequently in this show as well. But, no, I'm always happy to help promote other people who are working in the community to spread the knowledge.
[00:11:12] Unknown:
Yeah. Absolutely. Yeah. It's hard to not bring the topic up in the in the data management space. Right? It just it has become such a quintessential tool in the tech stack, in my opinion, that
[00:11:22] Unknown:
if you have had the opportunity to use it, once you get it, you just get it, and it's hard to look back the other way after you've used it. Absolutely. Particularly coming from a data engineering perspective where you want everything to be reliable and repeatable, and there is a source of truth that doesn't happen to live on somebody's laptop, DBT just is a natural fit and a natural extension to the rest of the ecosystem.
[00:11:43] Unknown:
Yeah. Absolutely.
[00:11:45] Unknown:
And so for people who are newcomers to DBT or for people who are maybe bootstrapping a brand new DBT project for a new company or a new initiative, what are some of the common challenges that they experience when they're first getting that initial DBT project up and running and off the ground?
[00:12:04] Unknown:
Yeah. I mean, for me, it was just figuring out how to get started. So and it's part of the reason why we actually wrote the book that we wrote that we ended up writing was when I was new kinda coming up to speed on DBT and trying to figure it out, the documentation was great on teaching you how to do it. Where it lacked, in my opinion, was how to practically apply it. And so it's how do I take this information that they're giving me on how to use this, and how do I build something that's scalable and maintainable and reliable that I can do? So I think just figuring out where to get started was probably 1 of the biggest challenges that I had just because there wasn't a lot of resources to help guide you through that.
So my experience was a lot of trial and error. So me and Cameron worked together in a previous role in which we were standing up a warehouse. We stood up DBT, at least some of the bones of it before before Cameron had started. And there was a lot of trial and error and kind of figuring out, and, eventually, it's like, well, look, we're just gonna have to start creating some structure here, and we'll adapt as we learn and make some mistakes. And, eventually, we kind of landed on how we are gonna structure our project to best work for us. But, yeah, I think that was a big 1 for me.
Cameron, I'll kick it over to you. I'm I'm sure you have additional thoughts there as well.
[00:13:22] Unknown:
Yeah. For sure. I think that, you know, getting into a DBT project, it's kind of twofold for me for teams that have never used it before. The first challenge being the technical perspective. Right? And so if we think about dbt core as the open source solution first, you have to worry about managing the infrastructure to actually run the project. And there's obviously alternatives to that. You can use dbt Cloud to host everything for you, which I would recommend most of the time unless you have a really good team of data engineers or DevOps folks that can help you manage the infrastructure. So that's 1 piece of it. Right? Getting that set up. Now I don't wanna scare anybody away from using dbt core if that's the route you go. Like, the infrastructure to run a dbt project is pretty straightforward. Right? I mean, it's a very lightweight Python package at the end of the day, and so you could just easily run it in a Docker container and then put that Docker container out wherever you choose to. Right? Whether that be ECS or just running it on an EC2 machine, however you decide to do that. But the other piece is the strategic perspective. So if you think about larger organizations or someone outside of the startup world, they're typically already gonna have a data warehouse built. Right? And maybe they're moving from, say, Teradata to Snowflake. And as part of that, they wanna move all of their Teradata stored procedures into Snowflake, but while doing that, migrating all the stored procedure transformations into DBT. And so just having a plan on how to do that is of the utmost importance, in my opinion. Right? Because going from a stored procedure to dbt is not just a copy paste solution, because of the fact that in dbt, you no longer have to worry about managing things like DDL or DML. Right? Everything, at least from the SQL transformation perspective, is just a select statement. And so you might have to rethink the way you do things a lot. Right? In the stored procedure world, you use temp tables a lot. Right? And so you've gotta think about how to break those out into different DBT models. So it's just a mindset shift. And so you have to have some sort of strategy going into that so that you don't make mistakes or just get overwhelmed, and then the migration just ends up failing as a whole.
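As a rough illustration of that mindset shift (the table and column names here are hypothetical, not from the episode), a stored procedure step that loaded a temp table and then inserted into a final table might become two dbt models, each just a select statement:

```sql
-- models/staging/stg_orders.sql
-- Plays the role of the old temp table: light cleanup and renaming of the raw source.
select
    order_id,
    customer_id,
    cast(order_date as date) as order_date,
    amount_cents
from {{ source('raw', 'orders') }}

-- models/marts/fct_orders.sql
-- Plays the role of the old "insert into the final table" step; dbt handles the DDL/DML.
select
    order_id,
    customer_id,
    order_date,
    amount_cents
from {{ ref('stg_orders') }}
```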
[00:15:37] Unknown:
Yeah. And 1 thing to add to that too, Cameron, like, it's a shift in your thinking too of how you're building models when you're only building models using the select command. So this took me a little bit of getting used to, coming from largely kind of using, like, stored procedures in place before, because you're running upserts, you're running merges, you're running a lot of different commands with those. And when you go to DBT, you don't have the option of running those, or you do have the option of running those, but not in a straightforward way in terms of how you're building your models. So everything is running as a select, and you're like, what in the world? Like, it takes a little bit of getting used to. And you have to kind of rely on some of the configurations in DBT to actually be able to basically kinda maintain that upsert merge logic, because what you need to do with the data doesn't change. Like, you still need to be able to incrementally load data. You still need to be able to create slowly changing dimensions.
You still need to be able to do that stuff, but the mindset of, okay, I have to write this in a select command takes a little bit of getting used to, especially when you come from some of these other, ways of doing things.
[00:16:42] Unknown:
Yeah. Absolutely. But once you do get that mindset shift, it is immensely easier. Right? I mean, I can't recall the last time I actually had to write a merge statement and had to deal with any of that. Right? And now it's just, oh, I need to incrementally load a table into my final data warehouse model. Like, okay. Cool. I just configure it in the DBT project as an incremental model, tell it what the unique key is, and we're set. Right? But, yeah, there is definitely that shift of thinking from going from stored procedures and actually writing all of that logic yourself to just, okay, I'm gonna trust that DBT can handle this, and it does. It's great.
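For example, here is a minimal sketch of the incremental pattern described above (the model and column names are made up): rather than hand-writing a merge, you mark the model as incremental, tell dbt the unique key, and filter to new rows, and dbt generates the insert or merge statements for you.

```sql
-- models/marts/fct_orders.sql
{{ config(
    materialized='incremental',
    unique_key='order_id'
) }}

select
    order_id,
    customer_id,
    order_date,
    amount_cents,
    updated_at
from {{ ref('stg_orders') }}

{% if is_incremental() %}
  -- On incremental runs, only pull rows newer than what is already in the target table.
  where updated_at > (select max(updated_at) from {{ this }})
{% endif %}
```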
[00:17:23] Unknown:
Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte scale SQL analytics fast at a fraction of the cost of traditional methods so that you can meet all of your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and DoorDash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first class support for Apache Iceberg, Delta Lake, and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
In terms of prior experience working either as an analyst or as a software engineer, how does that maybe influence the ways that people will approach that initial setup of a DBT project or any areas where they maybe focus their attention and possibly overlook some of the features that DBT has built in that will help them get their work done that is maybe not as obvious as it could or should be?
[00:18:44] Unknown:
Yeah. I think that 1 is also kind of twofold depending on what your experience is. Right? So if you're coming from a analytics background or a software engineering background, the adoption of DBT tends to be extremely easy. Right? Because, until last year, all of your transformations were written in SQL, And now you can also write Python transformations. But for either of those cohorts of folks, whether you be an analyst or a software engineer, you know, data focused software engineer at least, those are the 2 languages that you're most likely gonna have, very robust skills in. Right? And so being able to come in and quickly start providing value by by building DBT models, the time to value there is very low.
Where I see maybe these 2 folks using features of DBT that are either overutilized or underutilized, I would say, are macros. So for people coming in strictly from the analytics background, like, say, a data analyst or something of that nature, I feel like macros get underutilized. Whereas I've worked with some folks that came from software engineering backgrounds, and they wanna put everything into a macro. And it's, like, to a fault. Right? And, you know, it's like the classic over-abstracting of concepts, and it ends up just adding additional confusion. Right? And so you have to try to find the right balance in there. And if you can, it makes things all the more smooth.
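As a small illustration of that balance (the macro and column names below are hypothetical), a macro earns its keep when it captures a genuinely repeated snippet, but wrapping every piece of business logic this way quickly hides the SQL from the reader:

```sql
-- macros/cents_to_dollars.sql
{% macro cents_to_dollars(column_name, decimals=2) %}
    round({{ column_name }} / 100.0, {{ decimals }})
{% endmacro %}

-- Used inside a model:
select
    order_id,
    {{ cents_to_dollars('amount_cents') }} as amount_usd
from {{ ref('stg_orders') }}
```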
[00:20:04] Unknown:
Yeah. And when you're getting started with it, you kinda hit on this, Cameron. I mean, if you know SQL, you're gonna be able to utilize DBT. Like, it's a pretty low entry, low learning curve. SQL is really the only skill that I would say is a must have, at least being able to understand the basic structure of a select statement to get started. But other skills that are absolutely gonna help you along the way, and you hit on these, Cameron: Jinja, Python, understanding YAML, since all of your configuration is done in YAML, and having a good understanding of source control. Some of that level of understanding you need will depend on if you're using dbt core or dbt cloud.
If you're using cloud, you just need to really understand just the concepts, associated with it. And then the last 1, which is the the biggest 1, and it's actually 1 of our I think 1 of the biggest issues that people have with using dbt is thinking about their data model and data modeling. And this is a very, very underrated skill, in my opinion, when it comes to using DBT because DBT makes it so easy to get started and to be able to use it that, you know, you can go in and you can start creating a model in a matter of seconds without actually thinking through, like, okay. What does my end state look like? And it's very easy for your project just to go complete chaos and bonkers when you have thousands of models running because everyone's just turning it into the wild wild west and not working against the plan. So having some experience, at least someone who's working in your project, have experience doing some data modeling, I think, is a really crucial skill that's gonna save you a lot of lot of headache down the road. There's a lot of consulting companies who are making a lot of money off of companies today in their DBT environments because
[00:21:44] Unknown:
they didn't actually model their data before they started it, and they just kinda let their analysts go crazy. Yeah. And, Dustin, you mentioned the skill of YAML. YAML is so easy to understand, but anybody that has worked in a DBT project of any scale can probably relate to how much time you actually spend in YAML files. Right? Because you're in there. You write your documentation in YAML files. You do your testing in YAML files. You do all of your project configuration in YAML files. It's a great way to do it. Right? And it makes things very simple to move between environments again, to do testing, documentation, all of this stuff. But, yeah, the amount of time you spend in there is something that I underestimated as a newcomer to dbt, and I've learned, you know, better ways to do it, right, over time. There's dbt packages that you can use to help you, like, automatically generate some of your YAML files so you're not sitting there writing the things that become redundant, and then you can actually write the specifics about documentation and whatnot and specific tests. But, yeah, that was something where I certainly underestimated the amount of time I'd be spending: YAML.
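For context, this is the kind of YAML being described: a schema file carrying documentation and tests for a model (a hypothetical example). Packages like dbt-codegen can stub files like this out so you only fill in the descriptions and tests; if memory serves, something like `dbt run-operation generate_model_yaml --args '{"model_names": ["stg_orders"]}'` generates the skeleton.

```yaml
# models/staging/stg_orders.yml
version: 2

models:
  - name: stg_orders
    description: "One row per order, lightly cleaned from the raw source."
    columns:
      - name: order_id
        description: "Primary key for the order."
        tests:
          - unique
          - not_null
      - name: customer_id
        tests:
          - not_null
```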
[00:22:52] Unknown:
Yeah. And it's all great until you forget which level of indentation you're supposed to be working at.
[00:22:59] Unknown:
Yes. Oh my goodness. That's so true. Yeah. So I guess, are we doing 2 spaces? Are we doing a tab? Are we doing 4 spaces? It all gets pretty tricky. Fortunately, there are, you know, YAML validation tools out there, but, yeah, it can become a headache very quickly. Well, even that can be a bit of a foot gun because it might be valid YAML, but it's not actually valid semantically because of what you're trying to convey, where it will parse as valid YAML, but that documentation string that you thought you were adding to 1 attribute is actually on a different 1. Yeah. Very true. Yeah. And that gets even more true when you talk about the documentation string. If you're actually, you know, like, abstracting your documentation up into markdown files and then trying to use it throughout your YAML files, that can become a headache as well. Well, very powerful, but a headache if you're not careful.
[00:23:49] Unknown:
And as you mentioned, Dustin, briefly, you both worked together on writing a book to encapsulate some of these hard learned lessons about DBT and using it for real world production use cases. What motivated you to actually invest that time and effort, and how much do you regret it now?
[00:24:12] Unknown:
Yeah. No. Great question. So, as I mentioned before, I mean, we wrote the book as a resource for a practical application of DBT. We felt that a lot of the documentation and the resources that existed did not really give you a blueprint for how to practically apply it. And we felt there was an even bigger gap when it came to data engineers specifically. DBT markets pretty heavily to the data analyst. And while I don't know this for a fact, I imagine a lot of their business is coming through folks that are data analysts who are getting DBT into their organizations. And that's where they push it to. But, yeah, we didn't feel like there was a lot, like, directed toward data engineers. And I think some of this has to do with where the money is made. A lot of data engineers are capable and comfortable using DBT core, which obviously DBT Labs does not make any money off of.
Whereas a lot of analysts don't necessarily have those skill sets. They wanna be able to use the IDE, and they wanna be able to use the cloud functionality. So they market a lot of their materials toward that. So a lot of their documentation kinda gives you structure, and it gives you some tips along the way of building a warehouse, but it doesn't really give you a blueprint of how to use their tool to go in and build a warehouse. What we wanted to do with this book, we wanted to pretty much walk through every folder and every part of your DBT project and show how to practically apply it and how you can actually build an efficient and effective data warehouse using this tool, because we think it's an incredible tool for it. In fact, it's 1 of my favorite tools that I've ever used out of my 15 years of using tools. So we really wanted to be able to demonstrate that.
And we, at the end of the day, both me and Cameron, were engineers first, and so we wrote the book from an engineering perspective. So it's got trials and errors that we've gone through. It's got lessons learned that things that we've learned throughout it. And so we thought it was gonna be a, yeah, super helpful book out to the community and for people just as another resource.
[00:26:18] Unknown:
Yeah. And I'll expand on that a little bit too, that it is not the only resource to learn DBT. Right? Like, you can learn from the documentation. I mean, that's primarily how I learned DBT. But, yeah, to Dustin's point, I had to piece things together and kind of learn my own way. There's also videos out there, though, that you can watch. Right? You know, I know dbt has some courses published themselves. There's some great Coursera and Udemy courses out there as well where you can learn how to use dbt from beginning to end. But, selfishly, like, I'm a bibliophile. Right? I love reading books, and there wasn't a book on DBT. And so when Dustin came up with this opportunity that we could write the book, I mean, I jumped at it immediately, right, because of the fact that there was nothing in the market for it. There was no book on DBT that shows you how to go from beginning to end.
And while reading documentation and videos are great, sometimes I just wish I could pick the book up and go go and reference something very quickly. Right? And maybe that's a little old fashioned of me or whatever, but that's just that's just the way I like to do things sometimes.
[00:27:23] Unknown:
Just, you know, an extra tool in the tool belt to help you learn. Right? Right. Yeah. And as far as the misery that you mentioned when it comes to writing books, so when we started writing this book, I had literally just come off of finishing a project of writing my first book, which was Pro Database Migration to Azure. So totally different topic. And the publisher had reached out and saw that I was presenting at a conference on DBT. And it's like, hey, would you like to write a book on DBT? And I'm like, no, I don't. It sucked the first time. Why would I wanna do it again? And I thought about it some more. As Cameron mentioned, there wasn't a resource on the market for it, and there was the idea of being 1 of the first to hit the market on it. We weren't actually the first. There was 1 other that came out right before us, but at the time, we didn't know that. I reached out to Cameron. I knew Cameron loved DBT as well, and I'm like, hey, man, there's no way in the world I'm writing another full book. Like, I don't have time for it. I just spent a year writing this other 1 to go right back into this. And Cameron was super excited about it and convinced me.
And so we ended up writing it and going through it. And if you've ever written a book before, or if you ever decide you think you wanna write a book, it feels great getting through that first chapter. And you're like, oh, man. This is great. Great content. Feels good. And then you get past that first chapter, and you're like, oh god. We have 9 more of these to write. And then the misery starts setting in when it's a beautiful weekend and your wife and kids are outside playing or going somewhere, and you're like, man, I'd really love to be doing that, but I'm sitting at my computer typing. And not only that, you're writing about the same stuff you're working on throughout the week. So your work week just feels like it never ends. So definitely challenging.
Feels good when it's completed, and you're super excited and super proud of it, but it is a very hard process.
[00:29:10] Unknown:
But you do it so you can be rich. Right?
[00:29:13] Unknown:
Yeah. Right. Yeah. The royalties are are amazing on books. That's for sure.
[00:29:21] Unknown:
Yeah. So maybe we get a nice dinner or what, Cameron? Nice dinner at the Red Lobster?
[00:29:29] Unknown:
Something like that. No. I mean, it is a huge time commitment, though, to to write a book on any on any subject. Right? I mean, whether it be a technical topic or, you know, fiction, nonfiction, what have you. Something that I never thought I would experience. Right? I mean, if I go back to my high school and college days, I was the I was the guy that wanted to put the bare minimum effort into any essay I ever had to write. And so, the the idea of a book, you know, 5 years ago, the idea that I would be a published author would I would not believe it. There's no way. But, yeah, if you're passionate about something, if you have enough passion about it, you can make it happen.
And DBT is that, like, is that for me. Right? And at least in my professional career, it it is the thing that I have enough passion about to be able to write an entire book about. Right? I mean, I mean, I can't take the whole credit. Right? Obviously, Dustin's here too. But between the 2 of us, we're able to to put out an entire book on the on the subject.
[00:30:24] Unknown:
And in the process of going through the exercise of writing a book, figuring out what is the flow, what are the details, what are the core elements that we're going to include. I'm curious what are some of the aspects of the DBT project, ways to use DBT, some of its features that you learned in the process that you previously had not had occasion to experiment with?
[00:30:44] Unknown:
The product itself is extremely extensive. And I knew that going into writing the book because we've we lay out you know, each chapter in the book is very more or less dedicated to a a component or a feature of the product. But once you actually get in there and you start writing about say you're writing about macros. You start thinking of every way you could use a macro. And and you don't normally do that in your day to day work. Right? Because you just sit down. You have a problem to solve, and you solve that problem using some feature of the product. Whereas when you're writing a book, you just sit there and you contemplate every single possibility where you might use a macro, for example.
And so then how do you actually choose what you're gonna write about? And so I don't know that I necessarily learned a ton about the features of the product. I was fairly familiar with the features of the product at the time, but starting to think through all of the different use cases is what really put it into perspective for me. And so the lesson to take away there is that's true with any tool in any facet. Right? But, like, you don't have to try to learn the entire thing to start gaining value from it. And, hopefully, with any resource out there for learning DBT, whether it be our book or some videos or just reading the documentation, you can understand that and you can go, okay, great, I need to learn this feature because I need to implement it in what I'm working on in my day to day right now. And that is the thing that really put it into perspective for me. Again, there's just the breadth of the product.
[00:32:16] Unknown:
Yeah. I'll echo that, Cameron. I'm kind of in the same boat. There were definitely features as we were going through writing it that I didn't know existed. I learned a lot going through the process, and definitely I'm much better at DBT than I was before I started. But I think when we got to the end of the book and we were starting to wrap up, and we looked at how much content we had on paper, and even going back and looking at all the additional content that we could add, which may come in a v 2, like, we were at 400 pages worth of content on this product. And at the beginning, you have no idea, like, that we could write 400 pages of content on this product, because so many aspects of it just felt so simple. But to Cameron's point, the breadth of the product and how vast it is, there's so many considerations, and there's so many things to take into account that, yeah, it was eye opening when we got to the end, just how much content was there and just how much more that we could have probably added to it.
[00:33:16] Unknown:
Yeah. Dustin, wasn't our original estimate on the book something like 200 pages? We're like, oh, we told the publisher, I think we'll be able to write about 200 pages about this. And so it kinda goes back to that breadth thing; when we sat down and started actually writing, things got very verbose very quickly.
[00:33:32] Unknown:
Yep. Absolutely.
[00:33:35] Unknown:
And through that exercise of having to think through every detail of the product and possible ways that it could get could get used, has that shifted the ways that you actually apply it in your day to day work?
[00:33:46] Unknown:
I don't think so, for me. I think I still look at it the same way, and I still use it the same way now that I used it when I started. I don't think it changed my perspective a whole lot. Cameron, I'll defer to you. Yeah. It didn't change my perspective about how I use the product,
[00:34:03] Unknown:
but what did change my perspective is the frequency at which I stay up to date on the new features of the product. So, you know, before writing the book, I didn't know what version of DBT was the most recent. You know, I didn't have to worry about that kind of stuff or what was involved in every release of DBT. But, you know, going into the book, I was attempting to stay aware of all of these new features, because if it made sense to try to write about it, we wanted to. Right? But there were even things that we opted not to write about because they were such new features, and that's where the perspective changed for me. It's just staying more up to date on all of the new features of the product and thinking through ways that we might actually work that into, you know, v 2 of the book like Dustin mentioned, because there are things that we didn't even talk about because they were so green. And just a couple of examples are like the semantic layer. DBT was going through an entire revamping of their semantic layer at the time, right, as we were writing. And so we opted not to write about it at all because we didn't wanna publish the book and then have it be irrelevant immediately. Right? Another 1 is the DBT mesh. Like, that was a feature that was very green as we were writing the book, and so we opted not to write about it too extensively. I believe we mentioned it, but just more as a, hey, it's there in case you're interested kind of thing.
[00:35:21] Unknown:
Yeah. And just to pull back the curtain a little bit on the process, and this, I think, is true with most books that get written, and we were very cognizant of this as we were writing the book. From the time that you finished writing the book to the time that it goes into print is about 4 or 5 months. There's a pretty big gap there between that. So as Cameron mentioned, even though the book came out while some of these things were announced, we had actually finished writing it months prior before it actually ended up coming out. So we tried to take that into account when we were deciding on what to include and whatnot.
[00:35:51] Unknown:
Now that the book has been published and as you continue to use DBT in your day to day and you're more, I guess, meta aware of the tool that you're using and maybe how other people are using it. I'm curious how that has caused you to maybe reflect on the broader ecosystem that DBT has helped to generate in the form of analytics engineering as a job description and a profession. And I'm wondering what you see as being the net outcome of that trend towards analytics engineering being a discrete role. Is it net positive, net negative, net 0? I'm just wondering kind of what are some of the larger ramifications of the potential that DBT has unlocked in the ecosystem.
[00:36:36] Unknown:
Yeah. I think that DBT has unlocked time to value for analytics by, I mean, I don't have an actual measurement here, but I'm just gonna say an exponential amount. Because prior to DBT, say I'm a data analyst, if I needed a new dataset that I couldn't just throw together based on maybe the fact and dimension tables that I have access to, I would have to go to a data engineering team or a data management team to build that dataset out for me. And then they build it, and then it comes back to me. But that might take weeks, months sometimes. Right? And the time to value to deliver this dashboard, and, like, in the world of analytics, we're always time sensitive, or we should be at least. Right? As an analyst, that was really difficult. And so having DBT now, you have this advent of analytics engineering, and it kinda sits between the data analyst role and the data engineering role, right, where you have this person that knows how to build dashboards. Right? They're really good at the analytics side of things, the business facing side of the house.
But if you can get them access to the raw data, they can use DBT to build out all those transformations that they need for their end use case, right, whether that be a dashboard or some other data product out there. I think that, net net, like, the advent of analytics engineering has been very positive. Right? And it's not just DBT that fits into there. There's just been this natural progression, and DBT definitely helps in there, of moving from a more traditional ETL ecosystem to ELT, right, where we just get all the raw data into a landing zone and we then transform it, you know, after that point.
And dbt, of course, like, makes that super easy. Right? Because at the end of the day, dbt isn't doing any of the other mess. They're just they're just there to help you transform the data.
[00:38:34] Unknown:
Yeah. Maybe a bit of a controversial take from my end. But to me, the term analytics engineering, which has been around even before DBT, although not quite as mainstream as what it is now, to me, it's more of a marketing term. It's combining skill sets from a data analyst and a data engineer, but in my viewpoint, you're largely doing data engineering. Like, you're still having someone build models. You're still having someone apply tests. You're still doing DevOps. But it has been a great term for the DBT community, who I think has really rallied behind it. And usually, if you see it now anywhere, regardless of whether it's a marketing term or not, you pretty much know, like, hey, DBT is gonna have something to do with this role. So if you're like, I wanna work with DBT, and you're searching for jobs and you're looking at analytics engineer roles, it's probably going to encompass some functionality from that. But I don't think it's necessarily a new role. I think it's kind of just a term that's been coined to represent data analysts and data engineers, or someone that's kinda combining different parts of those skill sets together.
But I do think it is something that's around to stay, and I think it's a job that we're gonna continue to see pop up with organizations.
[00:39:48] Unknown:
Yeah. And I think, you know, to expand on that a little bit, I think for a long time people have been doing this role. Right? Maybe in the past, it was called data warehouse developer or data warehouse engineer. Right? You weren't full blown in the data engineering perspective where you were building data integrations with, say, SaaS products or managing change data capture pipelines, things like that. Your entire job was to transform data in your warehouse and get it ready for whatever use cases you might have. Right? And so it's just a natural progression of how things change over time. It's an existing role with a new title, and dbt has obviously helped shape that a lot. Right? DBT as an organization, not as a tool, has helped shape the term analytics engineering a lot.
[00:40:33] Unknown:
Data projects are notoriously complex. With multiple stakeholders to manage across varying backgrounds and tool chains, even simple reports can become unwieldy to maintain. Miro is your single pane of glass where everyone can discover, track, and collaborate on your organization's data. I especially like the ability to combine your technical diagrams with data documentation and dependency mapping, allowing your data engineers and data consumers to communicate seamlessly about your projects. Find simplicity in your most complex projects with Miro. Your first 3 Miro boards are free when you sign up today at dataengineeringpodcast.com/miro.
That's 3 free boards at dataengineeringpodcast.com/miro. And cycling back now to somebody who is using dbt, they're building their data warehouse. As you mentioned, Dustin, 1 of the challenges of having these capabilities so easy to tap into is that maybe you move too fast for your own good. And I'm curious what you see as the challenges that people encounter as their DBT project begins to scale, where they've gone past the, oh, hey, I can build this 1 or maybe a small handful of tables with DBT, to I'm building an entire warehouse system using DBT. I've now got hundreds or thousands of tables, and I cannot figure out how to make this 1 small change that I need to make without bringing the whole house of cards crashing down. And so I'm curious, what are some of those elements of technical debt that people are likely to experience as they get further along in their journey of DBT?
[00:42:10] Unknown:
Yeah. I think it's just getting started, and as you mentioned, it's a really easy tool to go and get started in. You can go in and start building models and start producing and creating objects really, really quickly. And the fact that it's so easy, you can just move so fast on it. And when you have even in the realm of 5 to 10 people, and they're building models for different purposes, like, you can rack up hundreds of models really, really quickly. You end up with multiple versions of the same truth because 1 developer is building something and not paying attention to what another developer is building. You end up needing to find something and not knowing exactly where to look. It can very quickly just become the wild, wild west. I think for anyone who's getting started with DBT, before you even write your first piece of code in it, you should stop and think about your plan. Like, how do I plan on using this? How do I plan on structuring this? How do I have governance in place to make sure that it doesn't become the wild, wild west? 1 of the things we've done in the role that I'm in currently: there's always gonna be a need for reporting outside of the data warehouse. Not every company needs a data warehouse. Sometimes you need a data warehouse, you need to build things in it, but you also need to be able to build operational reports. You need to give the business the data that they need. But they're 2 different needs. 1 of the things we've done is we have 2 DBT projects that we run. 1 of them has our actual data warehouse in it; it's our data model. It's what we're working on. It's our single source of truth. And then we have another 1 that has our reporting items in it, some more of our operational reports, or things that haven't yet been built into a common model. So things where we just need to query against raw data and we need to produce a report, whether it's to sales, marketing, finance, product, whatever.
And they build those. Both of these projects are very structured. Everyone on the team knows exactly what we're building here, and they know where to put the things they're building. They know where the dimensions and the facts live versus what lives over in the reporting project. We have the project structured really well in terms of our folder structure, so you know what goes where. So it's very easy for anyone, even if you've never worked with our dbt project at all, to come in and know, oh, this is what this is used for, this is what that's used for. Yes, you can look at the lineage and see how all of these things connect together, but it's very well structured, so everyone knows where things are.
We didn't just take the product, hand it off to a bunch of data analysts and say, hey, here, go build, be merry, be happy. We started with a plan. And that's probably the biggest piece of advice I could give to anyone with dbt: have a plan for how you wanna build things and how you wanna structure things so you know what's happening. And then you can implement other best practices like code reviews and so on, so that you're making sure you adhere to those standards. But you gotta have a good plan getting started.
[00:44:56] Unknown:
Yeah. And I really like this idea of having 2 separate projects. Right? You have your core model in 1 project, the data warehouse, with very tight controls on it. Your data engineering team is gonna own that. But I also like the idea of having that reporting project separate as well. Because at the end of the day, that was the original use case that dbt was trying to solve: how can we bring these best practices to data analyst reporting and all of that stuff. Right? And so having that project separate for them to still be able to take advantage of dbt, I think, is super powerful. Because if you say, no, we're not gonna do that, we're gonna be too strict with our project,
you can't check in your SQL queries for reporting, and when I say check in, I mean check into source control, they're gonna build that stuff anyway. Right? They're gonna build all their ad hoc queries, and they're gonna be like me when I was a junior data analyst: just saving them on their laptop. So empower them. Right? Empower them with this reporting project like you're talking about, Dustin. I think there's some other things, though, that I could talk about as far as decisions that make a successful project, if that's a way we wanna go. Yeah. 1 of the other things that I think helps set your project up for success, and this is very much the engineering me talking, but set up CICD pipelines early on if you can, like, early on in your establishment of dbt, if you don't already have those set up. 1 mistake that I made when I first started using dbt was not doing that. And so I would push changes out via git, they would hit production, and models would start breaking because I missed something. Right? And so having CI checks in place to actually build those changed models before a pull request is ever allowed to be merged into your main branch is of the utmost importance. Right? Because at the end of the day, for any stakeholders that use your models downstream, if you're putting bad things out, you're gonna lose trust from them. Right? And it's hard to rebuild that trust. And so having those CICD checks in place is of the utmost importance. The other 1 is a testing strategy, which also kinda plays in with that, but I love having a built-in testing strategy, and this is twofold as well.
The first 1 is data quality, and dbt makes that super easy. Right? Adding, you know, a uniqueness test to a column, or a not null test, or a relationship test, and so on and so forth. That stuff is very easy to do. It's also very easy to overdo. Right? You could start putting a hundred tests on every single column in your tables, and it's gonna slow your jobs down a lot, but that's a whole separate conversation, I think. The other 1 is unit testing, and this, in my opinion, is so often overlooked in dbt projects and just in data engineering as a whole, I think. But, yeah, unit test your code whenever you can. And SQL is inherently difficult to unit test. Right? So don't try to, like, unit test just for the sake of unit testing.
I guess the thing that I would say is, like, say you have a big hairy case statement. Try to unit test that. Right? Make sure that your logic is doing exactly what you think it should be doing. Macros are another 1. If you're utilizing macros pretty frequently in your dbt project and you're doing some interesting business logic there as well, you might want to consider unit testing those. Fortunately, there's now packages in the dbt ecosystem that you can use to help you out with unit testing. It's a little bit of work to get those set up, but I think it's immensely worth it once you actually have that rolling.
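To make that concrete, here is a minimal sketch of a hand-rolled unit test you can write even without a package: the "hairy" logic is pulled into a macro, and a dbt singular test feeds it fixed inputs and fails if any output disagrees. The macro name, bucket labels, and thresholds are hypothetical, not taken from the conversation.

```sql
-- macros/order_size_bucket.sql: a hypothetical macro holding the "hairy" case statement
{% macro order_size_bucket(amount_col) %}
    case
        when {{ amount_col }} >= 1000 then 'large'
        when {{ amount_col }} >= 100  then 'medium'
        when {{ amount_col }} >= 0    then 'small'
        else 'invalid'
    end
{% endmacro %}
```

```sql
-- tests/unit_order_size_bucket.sql: a dbt singular test used as a unit test.
-- Inputs and expected outputs are hardcoded; any row returned means the test fails.
with cases as (
    select 1500 as amount, 'large'   as expected union all
    select  250 as amount, 'medium'  as expected union all
    select    5 as amount, 'small'   as expected union all
    select  -10 as amount, 'invalid' as expected
)

select *
from cases
where {{ order_size_bucket('amount') }} != expected
```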
[00:48:16] Unknown:
And a couple of things that you mentioned earlier: there's the temptation to not use proper formal modeling techniques, because it's so easy to just create a new table. Why do I need to do facts and dimensions? Until you've already gone down the road of, I've got a dozen tables that all do mostly the same thing, whereas I could have just created those fact and dimension tables to begin with and saved myself a lot of headache. And then also that aspect of having reporting be something that is incorporated into the dbt flow instead of something that's bolted on after the fact. I'm wondering, what are some of the ways that you can help your teams train up on some of those formal modeling strategies and try to incorporate some of those unit tests or linting into the project flow, to be able to help people fall into that pit of success of, I built my dimensional models because it was the easy thing to do, it made everything else easier for me, and then I can just build those 6 different views on those facts and dimensions as another layer of my dbt project.
[00:49:19] Unknown:
Yeah. So I'll share how I've approached it. Granted, I'm in an architect role, so I've designed a lot of these models before. The design starts before you actually get in and start writing code. So that means really sitting down thinking about your data, building some ERDs, building some source-to-target mapping documents. Or, actually, let me go back even prior to that. Understanding your business need is the first thing, before you even start putting technical requirements down on paper: meeting with your stakeholders, understanding what their needs are, documenting that out, and then taking that and converting it into a data model that can service those needs. Most of the warehouses that I have built, and I know different people have differing opinions on this, have been dimensional models.
I'm not die hard dimensional model everything. There's different models that meet different use cases. But in the stuff that I've built, dimensional models have predominantly worked extremely well. So take those business requirements and lay down a model, or have someone on your team create that model of, okay, here's what my dimensions look like, here's what my facts look like. Vet that, have the team review it: does this accomplish our needs? Because effectively, what you're trying to do in a dimensional model is, with your dimensions, you're trying to create master data tables. You're trying to create single places to look at stuff. And then with your fact tables, you're creating what you're trying to measure. But you need to think through those things beforehand, because when I've gone through model design before, I'll start with something and then 3 weeks later, it looks completely different. Not that it always takes that long, but you start thinking through things. And I'm not saying that everyone has to go through a full exercise where you spend weeks trying to figure out your model, but you do need to at least put thought into it before you get started. And you need to have a plan of what you plan to do. Does that plan change? Absolutely. As you get in and start writing code, it can absolutely change. But you need a plan to start with: this is our north star, this is what we are working toward, this is what we're trying to build, and these are the business needs we're trying to accomplish, before you ever even get into your project and start running code. And where I think a lot of people fail is, it's like, oh, we have dbt now. We can write SQL, and we can do this stuff really easily. Let's just go in and start building stuff, and then it becomes chaos really, really quickly.
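As a rough illustration of that dimension-as-master-data and fact-as-measurement split, here is a minimal sketch of a pair of dbt mart models; the model, staging, and column names are hypothetical.

```sql
-- models/marts/dim_customers.sql: the dimension, one row per customer,
-- the single place to look customers up
select
    customer_id,
    customer_name,
    customer_segment,
    first_order_date
from {{ ref('stg_customers') }}
```

```sql
-- models/marts/fct_orders.sql: the fact, one row per order,
-- carrying the measures the business asked for
select
    o.order_id,
    o.order_date,
    o.customer_id,     -- foreign key back to dim_customers
    o.order_total      -- the measure
from {{ ref('stg_orders') }} as o
```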
[00:51:35] Unknown:
Yeah. And I think all of that actually helps improve the developer experience as well. Right? As you continue to scale your project and your organization, you just naturally are gonna have more engineers working on your team. And so if you have some level of standard modeling practice in place, it's gonna be easier for them to come in and start building out new models in dbt. This gets a little confusing when we start talking about it like this, talking about the data model itself and then the actual activity of building the model when we're referring to dbt. But, yeah, from the developer experience, it makes it a lot easier and a lot smoother as well.
[00:52:14] Unknown:
And so as you are building out a DBT project, you're building out a team to work on it, what are some of the tools or processes, whether inside or outside of DBT, that can help alleviate the incidental complexity of working on a large data project, whether it's a data warehouse or reporting or some other application of these technologies?
[00:52:35] Unknown:
Yeah. 1 is just being intentional about writing the documentation for all your models. Right? Go in and write the documentation, because as everything continues to grow, it's gonna be much easier for a new developer to come in and get up to speed quickly on what's actually happening in there. I've come into dbt projects where there were 15 engineers working on it, and there was no documentation written. And so the best thing that I could do is look at the visual DAG and try to understand how everything connected, which is great, but it doesn't give me any context about why things are doing what they're doing. Right? The DAG tells you what is happening, but it doesn't at all tell you why it is happening. So, yeah, be intentional about writing that documentation. This goes back to what I was talking about earlier: if you're good at this, you're gonna live in YAML a lot, but it pays dividends in the end. The documentation is a good 1. Just kind of in general, some things that I think will help your project be successful: we've hit on this 1 a lot, working from a plan. Another 1 is making sure you have good,
[00:53:32] Unknown:
like, push processes in place. So if you have any sort of standards that you're working against, make sure there's a way to check those. For example, when you push to git, make sure there's a code review process so people aren't able to just push anything that they want, and that there's some sort of review that makes sure it adheres to your standards. And then just generally work collaboratively with each other. dbt is pretty easy to work on collaboratively, but it's also easy to just kind of go rogue and start building things. So just make sure you're having good communication with other folks on your team as you work through things. And the last 1 that I'll mention here is just monitoring your builds. You can build models currently, and they may only take x amount of time to run. But as you continue to add more and more and more to it, just keep an eye on those things. dbt, if you're using Cloud, has some decent monitoring capabilities to check how long things are running, and then you can track those over time just to see, like, hey, am I going the wrong direction? Are our models taking longer? Did somebody change something that had a big impact on it, etcetera?
[00:54:39] Unknown:
Yeah. There's definitely tools outside of dbt that help make all of these things possible. Right? So we've hit on CICD quite a bit. You know, at the end of the day, you have to have some tool to help you do that. And I personally like GitHub Actions. It just works, and I work in GitHub already anyway. But, you know, I've seen teams use Jenkins, CircleCI. You can use dbt Cloud to help you with your CI processes as well. Right? I mean, they have a neat little add-in for GitHub. But with all of that, definitely, you wanna have those additional products to help make sure your CICD pipelines are running very smoothly. And, Dustin, you also just hit on the data observability piece. I think that's what you were getting at. Say you have a fact orders model that gets built every day. If there becomes some discrepancy in the row counts, or some column suddenly becomes skewed in its average, you wanna be aware of that. dbt will help you, or at least dbt Cloud will help you know when things are failing, but there's other tools out there that you can use. Like, Metaplane is 1 that comes to mind immediately for me, which you can use to monitor the actual values in your tables and get alerted when things go wrong. Right? So it's very similar to, like, Datadog for an application, but it helps you catch things that might otherwise slide under the radar, and you get alerted on those, which I think is great. Right? I'm a big Slack advocate, and so if I can just get pinged in Slack that something's going on, it makes my life a whole lot easier. Yeah.
[00:56:05] Unknown:
1 of the other challenges that I've experienced working with dbt, and it touches on that question of CICD, is how do you actually have a proper QA environment? Is it just a different schema in your production data warehouse? Do you actually have a QA data warehouse where data from QA systems flows in, and that's where you test things? Like, how do you do it? That's kind of the overarching question that the entire data engineering community is trying to grapple with right now, I think. That is a great question. And I think a lot of it boils down to which platform you're using,
[00:56:39] Unknown:
meaning, like, which data platform you're using. Right? So are you using Snowflake, Databricks, BigQuery, what have you. I can speak very well from the Snowflake perspective, and it's how I've done it. So when we talk about CICD, right, and we wanna run our models when pull requests get opened up, we obviously don't want that to run in production, because the entire idea here is that if something's wrong, it doesn't break the production datasets. What I've done currently is use Snowflake's database copy commands. So copy the entire production database as a new database.
I also tag that database... not tag, I name that database. I include the PR number in that database name somewhere so that if you have, like, 2 CI jobs running at once, they're not stepping on each other's toes. And so it scales out pretty well. Right? You can have a bunch of developers opening pull requests at the same time, but then everything runs in there. And if something fails, it's just in this isolated environment. There's no worry that we broke dev or we broke prod. Like, you broke the CI database? That's totally fine. That's why it's there. And we're gonna drop that database once the job is completed anyway, so it doesn't really matter. There's other ways that you can do it on platforms that don't have that copy command. Right? I've seen some other folks run it in an isolated schema. You can do that in production: you run it in an isolated schema and use custom schemas in dbt to build everything in that 1 schema. That's a nifty way to do it. But then again, you're still running it in your production database. And while you can probably manage permissions to not expose it, it's
[00:58:13] Unknown:
less ideal to me, at least, than having it in a completely isolated environment. That's something that I favor a lot. Yeah. I've got a very similar setup to what Cameron is doing. The feature in Snowflake is zero-copy clone, where it'll take a clone of the database. It doesn't copy the entire database, so if you have a really large database and you're thinking, oh, this could take forever to copy, it only copies the metadata over, so it's really, really fast. And then, yeah, you run your CI checks against it. It runs a full build against it and tells you if there's any errors. And we actually have our promotion process set up so that it has to pass that CI job before you can actually complete a merge. So if it doesn't pass, you've gotta go in and fix it, and you never taint data between your upstream environments. We are running a dev, a QA, and a prod. I've seen people run more environments than this. On Snowflake, it's really easy to use the same instance of Snowflake to set up each 1 of your environments, because you can separate those based on how you create your warehouses, so you're not sharing compute between environments, as well as using RBAC to set up roles that can access different environments. So from a user perspective, it looks like I'm only looking at dev, or only looking at QA, or only looking at prod. And that's how I've done it. But as Cameron mentioned, it's gonna vary depending on what data store you're using. A lot of these are gonna have some similar sorts of functionality to be able to make it work, but I've had great success with doing exactly what he described, and it's worked phenomenally.
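For reference, the per-pull-request pattern they describe boils down to a few Snowflake statements wrapped around the dbt build; the database names and PR number here are hypothetical.

```sql
-- Before the CI job runs dbt: zero-copy clone production (metadata only, so it is fast)
create or replace database ANALYTICS_CI_PR_123 clone ANALYTICS;

-- The CI job then points its dbt target at ANALYTICS_CI_PR_123 and runs the build there,
-- so a failing model never touches dev or prod.

-- After the job finishes, pass or fail, tear the environment down
drop database if exists ANALYTICS_CI_PR_123;
```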
[00:59:39] Unknown:
Yeah. I think your setup of environments also depends on how you structure your release cycles. So if you have some sort of release cadence where everything gets merged into, like, a release branch in GitHub, and then you actually release it to production once a week, that might change the way you think about this stuff. You might really wanna have a dedicated QA environment at that point, in addition to the CI environment. But for me, I don't do that right now. Like, I don't have a dedicated release cycle. The way we have it set up, once it runs in CI, you can release things to production immediately. Right? And the developers have full control over production releases. And so, yeah, I think that plays a big factor into whether or not you would want a QA environment isolated or not. And then you have to make the decision of, do I run my CI stuff in the QA environment or not? That's another question to ask as well. And as you have been building with dbt
[01:00:33] Unknown:
in your own work, the work that you've done to write this book and publish it, and the work that you're doing in your local communities, what are some of the most interesting or innovative or unexpected ways that you've seen dbt used? Yeah. I've got a pretty cool 1. I don't know that it's extremely complex, but it's a nifty little trick that I have.
[01:00:53] Unknown:
And so I quite frequently use incremental models so that we're not, you know, loading all the data a hundred times a week or something. Right? Like, just load the data we need to load. But things go wrong when you're loading incremental models. It's bound to happen. And so anybody that's worked in the data space, especially in data engineering, is very familiar with backfills, I'm sure. And so a unique use case, I suppose, that I've had is using a macro to override whatever date you're using for your incremental loads. Right? So in an incremental model in dbt, a very common pattern is to have a where clause that filters your select statement by some date and/or timestamp. And so what I've done is wrap that date field in a macro that allows you to pass in a static variable whenever you run the dbt project. And so when you run your dbt build command, you can pass in a variable. So say you wanna backfill for the past 2 weeks, you can actually pass in a static date that way instead of trying to do, like, a full refresh. And I think that improves performance significantly when you're talking about backfills, because without that, dbt of course has the full refresh flag. But if you've been building this incremental table for 3 years, just because 1 thing went wrong this week, does it really make sense to rebuild 3 years' worth of data? Probably not. So having that macro that allows you to override whatever your incremental date is, I think, is extremely valuable in terms of performance and just getting backfills out faster.
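A minimal sketch of that pattern might look like the following; the model, column, macro, and variable names are hypothetical rather than the ones from Cameron's project.

```sql
-- models/fct_orders.sql: an incremental model whose cutoff date can be overridden
{{ config(materialized='incremental', unique_key='order_id') }}

select
    order_id,
    customer_id,
    order_total,
    updated_at
from {{ ref('stg_orders') }}

{% if is_incremental() %}
where updated_at >= {{ incremental_cutoff() }}
{% endif %}
```

```sql
-- macros/incremental_cutoff.sql: a static date passed on the command line wins;
-- otherwise fall back to the usual "newer than what's already loaded" filter
{% macro incremental_cutoff() %}
    {%- if var('backfill_start_date', none) is not none -%}
        '{{ var("backfill_start_date") }}'
    {%- else -%}
        (select max(updated_at) from {{ this }})
    {%- endif -%}
{% endmacro %}

-- Backfilling the past 2 weeks then becomes, e.g.:
--   dbt build --select fct_orders --vars '{"backfill_start_date": "2023-11-01"}'
-- with no full refresh required.
```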
[01:02:26] Unknown:
Yeah. From my perspective, I was trying to think if there was any sort of interesting... there's lots of interesting things being done in dbt, but it's still part of its sort of core competency. 1 thing I'll add, though, is that with hooks, pretty much anything is possible within dbt, whether you run a pre-hook or a post-hook. And I've seen some pretty creative ways that those have been used. You can literally do anything before and/or after using hooks. It's not necessarily always a good idea to do. Usually, hooks are, like, a last option: I just don't have another way to do this, and I want it to be part of my dbt build. But that's the only other 1 that I can think of where people can get pretty creative, but you can also do things you probably shouldn't do. Yeah. I mean, going on the topic of hooks, I've seen people use
[01:03:15] Unknown:
hooks to manage access controls on their objects in their database. Right? I've also seen them use it to, like, run the vacuum command on their tables. And I think dbt now has some built-in capabilities for this stuff. Right? And so that's another thing with regards to hooks. And we do talk about this in the book; there's a whole chapter dedicated to hooks. But we do caution you, time and time again, to make sure that when you use hooks, you don't get into anti-patterns. Right? Make sure that dbt doesn't already have a feature for whatever you're trying to do before you start using hooks. So if you're using hooks to, like, create a temporary table before you build a model, that's certainly an anti-pattern. You should just break that out into a separate model and then reference it downstream.
[01:04:03] Unknown:
But, yeah, I've seen some very creative uses of hooks as well. And more often than not, it seems like they're anti-patterns, and there's probably a better way to do them. Yeah. We have an example in my current role where we use a post-hook to do access controls. We have a model that's getting built that we're then sharing with an external customer, so we're using the data sharing capability in Snowflake. And when you use table and view materializations within dbt, it does a drop and a recreate every time you run those. And so as a result, any permissions that you have on that object get dropped. So we had a post-hook that would run and add those permissions back, but it actually wasn't very good design, because we just needed to switch that to an incremental model so the table didn't get dropped and the permissions persisted. So, kind of back to what Cameron was saying in terms of anti-patterns: a lot of times, there's a better way to do it without having to do that, but hooks are there if you need them.
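As a rough sketch of the two approaches being contrasted here, a hypothetical model config might combine an incremental materialization (so existing grants survive each run) with dbt's grants config for ordinary roles, keeping a post-hook only for the Snowflake share grant. The model, role, share, and column names below are made up for illustration.

```sql
-- models/shared/customer_metrics.sql: illustrative only; names are hypothetical
{{ config(
    materialized='incremental',                -- avoids the drop-and-recreate that wipes grants
    unique_key='customer_id',
    grants={'select': ['reporting_role']},     -- dbt-managed grant for a normal role
    post_hook="grant select on table {{ this }} to share partner_share"  -- shares still need a hook
) }}

select
    customer_id,
    lifetime_value,
    updated_at
from {{ ref('stg_customers') }}

{% if is_incremental() %}
where updated_at > (select max(updated_at) from {{ this }})
{% endif %}
```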
[01:04:52] Unknown:
Yeah. For sure. And, I mean, this is more of a personal preference, or a rant if you will, but I've made the conscious decision to pull access controls out of my dbt projects. There was a time where I was managing them as post-hooks on all of my models, and I realized this is a bad idea. I was working with Snowflake at the time, and Snowflake has a Terraform provider that you can use, and it's so much easier. Like, that's what it's designed for, right, versus trying to put the square peg in the round hole kind of thing. And dbt, I don't know if it's a recent feature, but they actually have the grants capability as a config you can put on the model as well. Yep. Yeah. But as we found out, grants don't work with data sharing. They only work with regular access grants. So
[01:05:35] Unknown:
Fair enough. As you continue to build with dbt, and you keep track of its continued evolution, and you continue to think towards what that next version of the book looks like, what are the things that are on your personal wish list for the future of dbt?
[01:05:50] Unknown:
So for me, I had a pretty good personal wish list a couple months ago, and they actually, like, have those things now. So my wish list has gone down a lot, but I'll plug the things anyway. The first 1 would have been support, at least from the cloud perspective, for the Microsoft platform. Like, for a long time now, I had just a really hard time understanding why dbt wasn't supporting the Microsoft platform. I mean, so many companies use Microsoft. But with the announcement of Microsoft Fabric, they now support Fabric and Synapse Analytics. So I think that's a great win for them and something that I'm interested in trying out as well. The other thing would have been materialized view support for Snowflake. That's been on my wish list for quite literally years, and it just came out, I believe, in the last release of dbt. Now I can have materialized views in my dbt project, which is a huge win. Right? I mean, there are so many times where I don't wanna build a table, but a view just doesn't cut it because it's not quite performant enough for my end user. So being able to have those in there is great. So I guess the holidays came a little bit early for me this year on my dbt wish list. Yeah. A few things from my end, and these are more so focused toward Cloud.
[01:07:04] Unknown:
I think having a better ability to profile data from within dbt Cloud is something that I'm interested in seeing. Right now, when I do a lot of development, I find myself developing... like, if I'm using Snowflake, I'll develop more in Snowsight, and then I'll convert what I've built over into dbt, so I'm not just completely working through the dbt Cloud IDE. So having more ability to go in and profile data, understanding, like, where I'm missing data, or the number of distinct values, etcetera. Because a lot of times I'm working with data, and maybe as I'm building transformations, it's really the first time that I've really dove into it. Having more capabilities within Cloud, I think, is gonna be great. I've talked to a lot of people who, granted, they're data engineers who love the command line, but they're not big fans of the IDE.
So I think continuing to see that get built out. Unit tests are 1 that I've heard. There are ways to do unit tests in dbt, but I think continuing to build that out and provide more functionality. And then the last 1, which I'm kind of interested to see how this evolves, is to see how dbt incorporates more AI into their product, if they're going to do this, particularly around, like, evaluating code and doing code quality and having checks against that. We talked about a lot of the issues when you give this tool over to analysts and they're just going wild writing stuff. Like, some sort of mechanism where we could utilize AI to basically evaluate code and catch stuff that we would maybe otherwise have people catching during code reviews
[01:08:34] Unknown:
would be something that I'd love to see as well. So, Dustin, you're kinda thinking like a GitHub Copilot, but in dbt Cloud? Absolutely. Yeah. Because, I mean, it would have all the context of exactly what the dbt syntax needs to be. So that'd be pretty interesting. I guess 1 other thing for me too, now that we talked about it earlier on the call, Tobias: you mentioned column level lineage, and for a long time now, I feel like that's something that dbt has lacked in its presentation. And it feels like they have all the metadata available to them. Right? Like, everything they would need to build a feature out around column level lineage. I don't spend all that time in those YAML files for nothing. Right? So they could use it to help me understand column level lineage a little bit better. Absolutely.
[01:09:17] Unknown:
Oh, well, are there any other aspects of your work with DBT in your jobs, your work on the book, your investment in the community through your sharing of knowledge that we didn't discuss yet that you'd like to cover before we close out the show?
[01:09:31] Unknown:
So, yeah, 1 thing. Obviously, we have the book that's out now, and we would love to get that into as many folks' hands as we can. We spent a tremendous amount of time writing this. It took us a year to write and hundreds of hours, so we'd love to get that out to folks. Both myself and Cameron are also very active within the data community. We've presented at conferences around the world, and if you ever get a chance to come see us or talk to us, feel free to stop us. Within the Nashville community, we also run the Nashville Data Engineering Group, which is just an opportunity for us to get together with other local data engineers, hear about new products and new topics, and learn from others. We think we're pretty smart, but we know we still have a whole lot to learn.
So we love having the opportunity to learn from others. And then we've actually launched a new conference that we have coming up in March of 2024. It's gonna be hosted in downtown Nashville, where we have multiple tracks around data technology. We have a data engineering track, data analytics, data science, and data management. It's gonna be a 2 day conference. We have workshops 1 day and more conference-style sessions the next, around all of these topics. We're gonna be talking dbt, we're gonna be talking Snowflake, we're gonna be talking Databricks, we're gonna be talking a whole bunch of different stuff during that conference.
We'd love to have folks come and join us. If you're interested in speaking, you can go to Datatuneconf.com, and you can check out what it takes to submit to speak and all the different benefits that come with that. Our speaker submission is open until the end of this month, so it will close on November 30th, and then we will make selections shortly after that. So you have plenty of time to plan if you're from out of town and wanna come in. Nashville is a beautiful city, and it's a lot of fun. We would love to have folks come and join us.
[01:11:20] Unknown:
And 1 thing I wanna hit on with the conference as well, and that I like to publicize everywhere I can, is: how often do you go to a conference and it's just simply out of reach because it's so expensive? We have intentionally made it very, very affordable for practitioners to come and hang out. Right? Like, we are not marketing this conference necessarily from the attendee perspective of, like, sales or business folks entirely. We want practitioners there, and we want everybody there. But we want it to be affordable. And just to come to the general sessions, it's $25, which, I mean, you both know the price of conference tickets. Like, $25 is absolutely insane.
The preconference workshops range anywhere from $150 to $200 depending on which 1 you choose and, you know, when you buy the tickets. But even that, I mean, $200 for a full day workshop is still cheaper than just admittance to most conferences. So I just wanted to plug that as well.
[01:12:19] Unknown:
Definitely worth it. If you're gonna be in the Nashville area, buy a ticket and come hang out. It's gonna be a great time. Yeah. We also aren't trying to push any initiative, so you're not gonna hear 50 sessions by our partners trying to sell you something. We are technologists at the end of the day. We just wanna learn. So everything that we are putting on the docket is about learning technology. There's no sales or marketing initiatives, and no vendor is running this event. This is literally a group of local community leaders who got together and said, hey, we wanna put together a tech event. We wanna learn.
[01:12:51] Unknown:
Alright. For anybody who wants to get in touch with either of you and follow along with the work that you're doing, I will add your preferred contact information to the show notes. I'll also add links to the conference and the speaker submissions for anybody who wants to get involved there. And so, as the final question, I'd like to get your perspectives on what you see as being the biggest gap in the tooling or technology that's available for data management today. Yeah. I'll start here. So for me, it boils down to security.
[01:13:18] Unknown:
So I think especially as you continue to see things evolve, data privacy, security, compliance, these things all continue to be big concerns. Data is increasing in volume, and it's increasing in sensitivity. And so us continuing to ensure it is safe and protected is gonna be huge. With some of the rapid adoption, especially of, like, AI and ML, it raises a lot of questions around ethical uses of data. So having more robust governance frameworks is something that I think is gonna be really, really critical. I know we didn't talk about security too much during the talk, but it is something that I think is going to be extremely,
[01:13:56] Unknown:
extremely important as we move forward and we have all of these new technologies popping up. Yeah. And then for me, the gap that I see as the biggest 1 today is, you know, this concept of data contracts. It's becoming more popular, and as a data engineer, I wanna make sure that my pipelines are trustworthy and that any upstream changes that might happen in, say, my company's application don't break my data pipelines. And while there's ways that you can go out and implement them today yourselves, it's interesting to see that there's a few new companies coming out. I can't think of them exactly by name right now, but there's a few companies coming out that are gonna be offering a SaaS product for you to implement data contracts that you can build into your CI workflows for your product engineers and your software engineers upstream, so that, hey, we've agreed that this is our contract, this is what we expect the data to be downstream. And if you're gonna change it, it needs to get added to the contract so that it doesn't break things downstream. Because as a data engineer, it's horrible when you get a notification at 2 o'clock in the morning because something broke, and it was completely out of your control. Alright. Well, thank you both very much for taking the time today to join me and share your experiences working with dbt as practitioners
[01:15:06] Unknown:
and on writing the book to help other people level up in their practice of DBT. So appreciate all the time and energy that you've both put into that, and I hope you enjoy the rest of your day. Yeah. Appreciate it. Thank you, Tobias. Thanks, Tobias. Have a good 1.
[01:15:23] Unknown:
Thank you for listening. Don't forget to check out our other shows, Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com. Subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a product from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Sponsor Messages
Guest Introductions: Dustin Dorsey and Cameron Cyr
Career Beginnings in Data
Adopting DBT: Initial Experiences
Challenges in Starting with DBT
Influence of Prior Experience on DBT Projects
Writing the Book on DBT
Impact of Writing the Book on DBT Usage
Analytics Engineering: Role and Impact
Scaling DBT Projects and Managing Technical Debt
Tools and Processes for Large DBT Projects
QA Environments and CICD in DBT
Innovative Uses of DBT
Future Wishlist for DBT
Community Engagement and Upcoming Conference
Biggest Gaps in Data Management Tooling
Closing Remarks