Summary
With all of the tools and services available for building a data platform it can be difficult to separate the signal from the noise. One of the best ways to get a true understanding of how a technology works in practice is to hear from people who are running it in production. In this episode Zeeshan Qureshi and Michelle Ark share their experiences using DBT to manage the data warehouse for Shopify. They explain how they structured the project to allow multiple teams to collaborate in a scalable manner, the additional tooling that they added to address the edge cases that they ran into, and the optimizations that they baked into their continuous integration process to provide fast feedback and reduce costs. This is a great conversation about the lessons learned from real world use of a specific technology and how well it lives up to its promises.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.
- Today’s episode of Data Engineering Podcast is sponsored by Datadog, the monitoring and analytics platform for cloud-scale infrastructure and applications. Datadog’s machine-learning based alerts, customizable dashboards, and 400+ vendor-backed integrations make it easy to unify disparate data sources and pivot between correlated metrics and events for faster troubleshooting. By combining metrics, traces, and logs in one place, you can easily improve your application performance. Try Datadog free by starting your 14-day trial and receive a free t-shirt once you install the agent. Go to dataengineeringpodcast.com/datadog today to see how you can unify your monitoring.
- Your host is Tobias Macey and today I’m interviewing Zeeshan Qureshi and Michelle Ark about how Shopify is building their production data warehouse platform with DBT
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by giving an overview of what the Shopify platform is?
- What kinds of data sources are you working with?
- Can you share some examples of the types of analysis, decisions, and products that you are building with the data that you manage?
- How have you structured your data teams to be able to deliver those projects?
- What are the systems that you have in place, technological or otherwise, to allow you to support the needs of the various data professionals and business users?
- What was the tipping point that led you to reconsider your system design and start down the road of architecting a data warehouse?
- What were your criteria when selecting a platform for your data warehouse?
- What decision did that criteria lead you to make?
- Once you decided to orient a large portion of your reporting around a data warehouse, what were the biggest unknowns that you were faced with while deciding how to structure the workflows and access policies?
- What were your criteria for determining what toolchain to use for managing the data warehouse?
- You ultimately decided to standardize on DBT. What were the other options that you explored and what were the requirements that you had for determining the candidates?
- What was your process for onboarding users into the DBT toolchain and determining how to structure the project layout?
- What are some of the shortcomings or edge cases that you ran into?
- Rather than rely on the vanilla DBT workflow you created a wrapper project to add additional functionality. What were some of the features that you needed to add to suit your particular needs?
- What has been your experience with extending and integrating with DBT to customize it for your environment?
- Can you talk through how you manage testing of your DBT pipelines and the tables that it is responsible for?
- How much of the testing are you able to do with out-of-the-box functionality from DBT?
- What are the additional capabilities that you have bolted on to provide a more robust and scalable means of verifying your pipeline changes?
- Can you share how you manage the CI/CD process for changes in your data warehouse?
- What kinds of monitoring or metrics collection do you perform on the execution of your DBT pipelines?
- How do you integrate the management of your data warehouse and DBT workflows with your broader data platform?
- Now that you have been using DBT in production for a while, what are the challenges that you have encountered when using it at scale?
- Are there any patterns that you and your team have found useful that are worth digging into for other teams who are considering DBT or are actively using it?
- What are the opportunities and available mechanisms that you have found for introducing abstraction layers to reduce the maintenance burden for your data warehouse?
- What is the data modeling approach that you are using? (e.g. Data Vault, Star/Snowflake Schema, wide tables, etc.)
- As you continue to work with DBT and rely on the data warehouse for production use cases, what are some of the additional features/improvements that you have planned?
- What are some of the unexpected/innovative/surprising use cases that you and your team have found for the Seamster tool or the data models that it generates?
- What are the cases where you think that DBT or data warehousing is the wrong answer and teams should be looking to other solutions?
- What are the most interesting, unexpected, or challenging lessons that you learned while working through the process of migrating a portion of your data workloads into the data warehouse and managing them with DBT?
Contact Info
- Zeeshan
- Michelle
- @michellearky on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- How to Build a Production Grade Workflow with SQL Modelling
- Shopify
- JRuby
- PySpark
- Druid
- Amplitude
- Mode
- Snowflake Schema
- Data Vault
- BigQuery
- Amazon Redshift
- CI/CD
- Great Expectations
- Master Data Management
- Flink SQL
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's l I n o d e, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
Today's episode of the Data Engineering Podcast is sponsored by Datadog, the monitoring and analytics platform for cloud scale infrastructure and applications. Datadog's machine learning based alerts, customizable dashboards, and 400 plus vendor backed integrations make it easy to unify disparate data sources and pivot between correlated metrics and events for faster troubleshooting. By combining metrics, traces, and logs in 1 place, you can easily improve your application performance. Try Datadog free by starting your 14 day free trial and receive a free t shirt once you install the agent. Go to dataengineeringpodcast.com/datadog today to see how you can unify your monitoring. Your host is Tobias Macey. And today, I'm interviewing Zeeshan Qureshi and Michelle Ark about how Shopify is building their production data warehouse platform with DBT. So, Zeeshan, can you start by introducing yourself?
[00:01:33] Unknown:
My name is Zeeshan Qureshi. I'm an engineering manager at Shopify. I've been involved in the data org almost since its inception for the last 5 years. And, Michelle, how about yourself?
[00:01:44] Unknown:
So, hey, I'm Michelle. Nice to meet you, Tobias. I'm a data developer at Shopify, and I've been there for a couple years.
[00:01:52] Unknown:
And going back to you, Zeeshan, do you remember how you first got involved in the area of data management?
[00:01:56] Unknown:
Yeah. So when I joined Shopify, I had worked on the Chrome team before, so I got handed the task of building our online storefront data collection pipelines. So I did a bunch of JavaScript, and effectively, like, we were building Google Analytics for Shopify, both for internal analytics and also serving merchant data, which kinda then snowballed into building a bunch of internal tooling for data scientists for, like, funnel analysis, like, stream analysis. And then a point came where, like, I was the person who was being pinged every time someone wanted to measure something new. So we started a team around, like, specifically all the tooling to collect our data and how we manage schematization of our data, versioning, and especially, like, privacy annotations.
[00:02:41] Unknown:
Yeah. It's definitely a very broad and deep area that you can kind of go a lot of different directions with. So I can imagine the level of complexity whenever somebody says, I wanna measure this new thing, and then saying, well
[00:02:54] Unknown:
Yeah. Exactly. Like, a lot of times, it's like, you know, I wanna measure everyone across every site. We're like, probably can't do that.
[00:03:03] Unknown:
And, Michelle, do you remember how you first got involved in data management?
[00:03:06] Unknown:
Yeah. So through University of Toronto's applied computer science master's program, I was paired with Shopify on an applied research project where I landed fortunately on Zeeshan's team, and we were building and prototyping a specialized query engine specifically for funnel queries, which we found were the kind of least performant class of queries for internal analytics at Shopify. After that, I joined Zeeshan's team where we were working on Shopify's event collection framework. Eventually, we started getting to the other end of this collection and looking into storage optimizations and query engines, and that's really how we got into exploring dbt as an orchestrator of SQL statements to build a data warehouse.
[00:03:47] Unknown:
And before we get too far along on the story of how you got to where you are with DBT in your data warehouse, can we just give a bit of context for anybody who's not familiar with what Shopify is and just describe the platform and maybe talk through some of the types of data sources and formats that you're working with?
[00:04:05] Unknown:
Yeah. So Shopify's goal is to enable commerce for everyone, whether it be e commerce, whether it be, like, physical point of sale, and to enable a platform so that you can plug and play every part of the platform. So, like, if you were to use a different shipping engine, you can use, like, Shopify provided shipping rates, or you could use a third party shipping service. We have, like, an app store platform where, like, for features that are not in Shopify, or if you want customizations outside of Shopify, you can opt out. So the idea basically is, you know, like, Shopify is there for all your commerce needs.
[00:04:39] Unknown:
So on the warehouse side, or at the lake side, we are pulling from lots of internal databases that are sharded MySQL tables, hundreds of third party APIs, and a lot of streaming event data over Kafka. We also have some occasional dumps of data from customer surveys, things like that.
[00:04:56] Unknown:
In terms of the types of analysis and products and decision making that you are making based on the data that you're collecting and the processing that you're doing with it, what are some of the
[00:05:10] Unknown:
impacts that that information has on the business and the direction that you take the product? 1 of the key use cases for data is, you know, your standard business analytics, so, you know, funnel analysis, conversion, segmentation, cohort analysis. Over the last year or so, Shopify has also put a lot of focus on performance optimizations. So, like, a lot of our event streaming data is going into the service called Fast Shopify, where every team can see their performance metrics, which is a key driving factor in, like, engineering decisions and technology choices. Shopify is also powering a lot of, like, customer facing products with this data, so, like, Shopify fraud protection, where for a small fee, Shopify will provide insurance on fraudulent activities, or Shopify Capital, where Shopify will fund merchants to help them scale.
[00:06:00] Unknown:
There's definitely a decent amount of scale and organizational complexity within Shopify. I know that it's been around for a number of years, and I've been hearing about it for quite a while, both in terms of the platform and people who are building customizations on top of it. And I'm curious how you have structured the organizational aspects of your data teams to be able to manage the different reporting requirements and the analysis, to make sure that you have the appropriate domain expertise, and some of the technological systems and processes that you have in place to be able to manage all of that organizational and technological complexity?
[00:06:40] Unknown:
Shopify has a top level data org as, like, a separate pillar in the company, and we've split data science and engineering in 2 different ways. Data science uses a hub and spoke model where every data scientist reports into the data org, but they're embedded in their engineering teams. For data engineering, there is a dedicated data platform team in the data org where everyone effectively reports functionally according to the layer of the platform, whether it be, like, ingestion, processing, serving.
[00:07:12] Unknown:
In terms of some of the systems that we have in place to allow and support the needs of those use cases, it's a whole swath of them. So we have some custom data extraction tooling that's built on top of JRuby, and a PySpark framework called Starscream. We leverage Presto and BigQuery for our query engines and analysis, as well as other custom data stores such as Druid for serving. We also use Amplitude, Mode, and notebooks for reporting as well as ad hoc queries.
[00:07:41] Unknown:
And I know that in the blog post that you shared with me to prepare for this conversation, you mentioned that Starscream was, for a while, the core element of your reporting capabilities and that that ended up slowing down some of the delivery of ad hoc or relatively simple queries. And I'm curious, what was the tipping point that pushed you into deciding that you needed to build a new component to your data platform to be able to handle these types of use cases?
[00:08:12] Unknown:
Through, I guess, data surveys and just chatting with data scientists, the tipping point really was how long the iteration cycles were from prototyping to production data modeling. People would generally prototype in SQL and then convert into PySpark code and consider things like performance optimizations and storage optimizations, and it would just take weeks of work. So we saw an anti pattern where data scientists were very commonly materializing scratch views or tables instead of converting their SQL based prototypes to PySpark. So this led to a lack of reviewability, sharing, testing, freshness. It was very manual.
So it was these antipatterns, as well as just understanding that the long iteration cycles were caused by going between different languages and different systems.
[00:09:01] Unknown:
The crux was that the user experience with Starscream was very engineer centric. And back when Starscream was built, we didn't have, like, a data scientist and data engineer separation. Like, everyone was effectively an engineer, and, like, you know, everyone was expected to write Python. And the way Starscream just output the datasets, the metadata, and, like, just the storage format was also just not the most optimized. We did not have partitioned table output for the longest time. So we just figured that, like, we want to go where the users are, and the users are with SQL.
[00:09:33] Unknown:
Yeah. It's definitely remarkable how durable the SQL format has been, where there have been a number of cases where people have tried to move away from it and say, oh, well, you know, it's not a proper programming language. I can do so much more if I use Python or use R or whatever it might be, and then eventually, the pendulum always swings back to SQL.
[00:09:53] Unknown:
Yeah. Yeah. And, like, even basic, like, join optimizations, if you're trying to do it in, like, raw Python, you have to understand what's happening. But if it's, like, the declarative model, we can push it all to the query engine and let it do the magic. Definitely a lot to be said for
[00:10:08] Unknown:
separation of concerns and letting the optimizations happen at a layer further down from where you actually have to deal with it on your day to day. In terms of your selection process, once you decided that you needed to go a different direction with managing this aspect of your data platform, you wanted to push it into a data warehouse, what were the criteria that you were using to measure the various candidates against, and what was the ultimate decision that you landed on?
[00:10:35] Unknown:
1 of the criteria was that we realized we had been building a lot of custom tooling over the years, and we wanted to adopt something that was being developed out in the open and that we could stay closely aligned with. So we didn't really wanna build a very custom workflow.
[00:10:49] Unknown:
As we said before, we wanted the tooling to be declarative. So whether it be SQL or some other form that was fully declarative, we did not want people to be concerned with the internal details of the platform. And to be honest, it was mostly a question between, like, using any form of, like, SQL based engine or orchestrator, and, like, the only viable 1 out there was dbt. So the real question was, do we use DBT or do we build our own orchestration system, which, I think, would have been yet another turn at Shopify building a SQL based modeling tool.
[00:11:24] Unknown:
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours or days. Datafold helps data teams gain visibility and confidence in the quality of their analytical data through data profiling, column level lineage, and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow and DBT and seamlessly plugs into CI workflows.
Go to dataengineeringpodcast.com/datafold today to start a 30 day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they'll send you a cool water flask. And in terms of the actual storage layer, what were your criteria for determining which warehouse to use?
[00:12:28] Unknown:
Most of Shopify is on Google Cloud, and we wanted to run this team as a dedicated team, without blocking too much on other teams. So we basically went for BigQuery because it's, you know, a fully hosted service, and our experience with it before using DBT had also been pretty positive. That being said, we wanted to be query engine agnostic, and we are looking at other storage layers in the future.
[00:12:54] Unknown:
You decided on BigQuery. You landed on DBT as the actual tool chain to use. I'm curious what your approach was for vetting it in terms of understanding this looks like a good tool. I'm going to actually run it through its paces and make sure that it does all the things that I needed to do. And then once you decide, okay. This is going to do what I want, then onboarding new users into it and figuring out what the repository structure is going to look like, what some of the organizational standards are going to be in terms of the formatting and the overall workflow.
[00:13:29] Unknown:
So the project structure, the directory structure, was 1 of the biggest unknowns and really something that DBT leaves very flexible. So we spent, I think, a few iteration cycles going between alphas and betas, but landed on a project structure where we have backroom models, and we have some access policies around those. For example, by definition, a backroom model is something that has some rough edges and requires some domain expertise, like, or team level expertise to understand its caveats. So we restrict building on top of such models unless the downstream model is owned by the same team. So the whole backroom pipeline is owned by the same data team. And gradually, those are promoted to other layers, to the exchange and frontroom models, and that's the project structure that we landed on that worked well for our organization.
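As a rough illustration of that layering, here is a minimal sketch of what a frontroom model might look like, assuming hypothetical model, path, and column names rather than Shopify's actual project:

```sql
-- frontroom/finance/orders_daily.sql (hypothetical path and names)
-- Frontroom models build only on exchange-level models via ref();
-- reaching into another team's backroom models would be blocked by the
-- project's access-policy checks described above.
select
    order_date,
    shop_id,
    count(*)         as orders,
    sum(total_price) as gross_sales
from {{ ref('exchange_orders') }}
group by 1, 2
```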
[00:14:19] Unknown:
It's definitely 1 of the challenges when you get a tool like this where there are some opinions, but then there are other areas where it's flexible just understanding what opinions to form if you're just coming into it for the first time. Yeah. Definitely. As far as the actual data modeling approach, you mentioned having these backroom tables for being able to prove out some of the raw source datasets. And I know that there's been a lot of debate, particularly in recent years with some of the newer query engines as to whether to use things like the star or snowflake schema from the early days of data warehousing or using things like Data Vault for being able to manage a lot of different data sources with changing needs within the business or just going with wide sparse tables because the query engines can handle them now. And I'm wondering what you have landed on and what your process was for determining how best to handle some of those modeling questions.
[00:15:15] Unknown:
We've mostly stuck with Kimball and Ross dimensional modeling and materializing into a snowflake schema, although we have individual teams which, based on performance optimizations, will use denormalized and/or wide tables for critical datasets. Our thought process around this is that we want data to be modeled using Kimball and Ross, but then it can be published in various layout formats based on the use case.
[00:15:41] Unknown:
I think part of that thought process has also been, you know, consistency and just choosing a modeling paradigm is super critical and important, maybe even more important than choosing which 1.
[00:15:52] Unknown:
Yeah. It's definitely important to make a choice because then you at least have something to build from. Whereas if you leave it up to everybody, then, you know, if there are 5 choices, then, you know, they'll make 6 different decisions with it. And then in terms of the actual workflow of being able to manage the DBT pipelines and structure the orchestrations, you mentioned that you're using the star schema, or, like, that approach for the actual dimensional modeling aspects. I'm wondering what you have seen as opportunities or requirements for adding in some layers of abstraction to buffer you from changes in the source datasets so that you don't have to rework all of your models downstream every time some API version changes?
[00:16:39] Unknown:
So to that end, we adopted a pattern that DBT recommends called staging. We called it base models. It's akin to a layer as well, so the first layer right after raw sources is this base models layer. And this is really their best practice for limiting references to raw data. So our base models serve as more or less a 1 to 1 interface to raw sources, but they're almost like a snapshot of the contract of the raw source. They don't change the grain or do any renaming. They're just this 1 to 1 interface, and it serves to protect us and users from breaking changes in raw sources, which by definition are going to introduce breaking changes that are external to our system.
So once these breaking changes do happen, do inevitably happen, the idea is that you only need to apply the fix to a single place in the base model as opposed to every individual downstream model that depends on that raw source. So there's a single point of failure and a single owner that can propagate those changes.
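A minimal sketch of what one of these base models might look like in dbt, assuming made-up source and column names for illustration: it selects 1 to 1 from the raw source, with no renames and no change of grain, so nothing downstream ever references the raw table directly.

```sql
-- base/base_orders.sql (hypothetical)
-- 1:1 interface over the raw source; if the upstream contract breaks,
-- this is the single place that needs to change.
with source as (
    select * from {{ source('shop_raw', 'orders') }}
)
select
    id,
    shop_id,
    total_price,
    currency,
    created_at
from source
```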
[00:17:39] Unknown:
And you mentioned that you had some organizational experience with using BigQuery before you made this transition to building the DBT workflow and fully investing in that. And I'm curious what you learned before using DBT that informed some of the decisions that you made as far as how you wanted to structure some of the transformations, how you wanted to manage things like partitioning schemes, just some of the organizational knowledge that was useful from working with BigQuery that you then translated into DBT and codified?
[00:18:15] Unknown:
So before BigQuery, before DBT, we were using Redshift a lot for our warehouse, and I think Redshift was our original warehousing solution. But then shoveling data from our GCS lake into Redshift was becoming too much trouble, and we wanted to consolidate everything with 1 provider. So we decided to switch over to BigQuery. But then what we realized was that for a lot of those things that we would have to do, for example, in Presto or Redshift, which included, like, partitioning or clustering, we would have to preprocess a lot of our data before we loaded it into the warehousing solution, whereas with BigQuery, we could mostly give it Parquet files and ask it to partition or cluster. And so the whole data loading workflow went from the order of a couple of hours to a couple of minutes for each dataset, which was super useful, and it also meant that, like, our Spark cluster was not getting hammered with all these materialization queries just to load into some other warehouse.
Once we switched over to BigQuery, I think it was about a year before we started working on DBT, we also spent a bunch of time just, like, studying more query engines, and then we realized, like, BigQuery's flexible scaling capacity was really useful for our use case, because Shopify is, like, growing exponentially, and, like, just forecasting how big our cluster needed to be and how much provisioning we needed to do, BigQuery just, like, took all of that away from us, and it has been super nice so far. Although Michelle has been working more with BigQuery, so she can tell us about some headaches if she has any.
[00:19:47] Unknown:
There have been a couple headaches, but actually nothing too major. Like,
[00:19:51] Unknown:
the numeric support, the lack of high precision numeric support, was the bane of my existence for a couple months. Oh, yeah. We tested some queries that we had been running in Redshift, and we found, like, large window operations worked much better on
[00:20:03] Unknown:
Redshift, although I'm not sure if they've been fixed. This was, like, late 2018, I think. Yeah. Another nice aspect of BigQuery was the server side version upgrades. So, like, any updates are just sort of automatically propagated to us as customers, and we don't have to kind of upgrade as we do with Presto. So if it's been fixed, it just kind of magically gets applied.
[00:20:25] Unknown:
Yeah. That's 1 of the benefits of serverless: you don't have to look behind the curtain to understand how it's working, and it is just sort of magical in that regard. But then when you do need to look behind the curtain, you can't. So it's a double edged sword. You mentioned that moving some of the workload into BigQuery has reduced some of the pressure on your Spark clusters, and I'm curious if you can talk through some of the ways that your data warehouse and DBT workflow integrates with the broader data platform and some of the impacts that it has had in terms of the other infrastructure investments that you've made or areas where you've been able to scale back certain work or consume from the data warehouse in downstream systems?
[00:21:07] Unknown:
In reality, I think that integration of BigQuery and DBT with the rest of our data platform has been the main pain point that we are dealing with as the result of our choice to use, like, a fully integrated engine like BigQuery, where our ingestion systems don't load all raw data to BigQuery for cost reasons. And we also don't export every BigQuery table back out to the data lake by default, although we're looking into that in the future. So it has been a pain point. As we rolled out Seamster, we sort of just dealt with this by declaring at the start that any models built with DBT on BigQuery would be terminal nodes and not actually accessible to other systems. And similarly, source data can be loaded through Starscream and used as an input to a Seamster model, but this would be on an opt in basis.
It is on our road map to sync Seamster models to the data lake and, in this way, make them accessible for reading in other systems and also publishing to other downstream systems. Yeah. That's kind of the current state.
[00:22:04] Unknown:
The cost aspects of things like BigQuery are definitely interesting in how they inform some of the other architectural decisions that you make as far as how often to access it, you know, how often to run through a full rebuild of the data warehouse versus just doing incremental builds. And, again, it's 1 of the trade offs: it's useful, but then you do have to think about what the cost aspect is. I mean, everything has a cost, but it makes it a little bit more explicit when it's a line item in the bill.
[00:22:32] Unknown:
Yeah. Totally. Although, like, our data scientists have been telling us that the estimated scan cost shown while they're editing in the editor is, like, a very good proxy for them to say how fast a query is gonna run. So they've been loving it. But, yeah, I think BigQuery kinda forces you to really, you know, acknowledge whether materializing this table is worth it or not, and, like, whether you should even be, like, doing it lazily or less frequently than whatever you're used to.
[00:23:02] Unknown:
I think it's also interesting to dig into some of the continuous deployment aspects of your DBT workflow and some of the decisions that you've made as far as which tables to rebuild versus which ones to build incrementally, and maybe any edge cases that you've run into as far as how to manage those incremental updates and what to use as the high watermarks for the different source and destination tables?
[00:23:27] Unknown:
I can start by tackling that question and talk about our CI pipelines and some of the decisions we've made around that. So the goal for our CI was to ensure that data scientists couldn't check in or merge a model, or a change to a model, that would fail in production the next day. The only reason things should really break are for infrastructure reasons that our team could look after. So to that end, we extended DBT a little bit to add this concept of DAG validation, where we build the entire warehouse, except we materialize everything as a view to a temporary schema in BigQuery.
And at compile time, BigQuery can determine whether or not the SQL is valid, whether it has any typos or incorrect type references or anything wrong with that SQL statement. So to that end, we can validate the actual SQL and ensure that it will run successfully the next time in production. In addition to that, we have also introduced the ability to validate schema files. So dbt comes with schema.yml files that you can attach to a given model. However, the schemas specified there are really just documentation. They're not enforced anywhere, and you can't really treat them as metadata, and, of course, anything like that will go out of sync eventually.
So we've leveraged the same DAG validation idea where we build everything as a view. But before we clean up the validation dataset, we query the information schema and determine the actual schema for the dataset. We also do this on CI. So we ensure that both the SQL model is correct and the schema associated with it is correct.
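A rough sketch of the schema validation half of that idea, assuming the CI job has just built every model as a view into a temporary BigQuery dataset (the project and dataset names here are placeholders): querying that dataset's INFORMATION_SCHEMA returns the actual column names and types, which can then be compared against what the schema files declare.

```sql
-- Run against the temporary validation dataset created by the CI build;
-- the results are diffed against each model's declared schema file.
select
    table_name,
    column_name,
    data_type
from `my-project.dbt_ci_validation.INFORMATION_SCHEMA.COLUMNS`
order by table_name, ordinal_position
```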
[00:25:03] Unknown:
Yeah. It's definitely an interesting problem case as far as the documentation getting out of sync. And it's 1 of the reasons that it's interesting to hear from people who are actually using dbt in production, because it's not something that you would necessarily think about when you're first exploring it until you really get deep into it. And I'm curious if there are any other edge cases that you've run into or shortcomings in terms of the out of the box capabilities of dbt that you've needed to augment or work around in your own systems.
[00:25:33] Unknown:
So 1 of the shortcomings, I would say, was the lack of unit testing, really unit testing on small, statically supplied data. DBT does come with this notion of schema tests, which are really, really great and flexible for testing invariants on your data. But ultimately, they run on production data, which makes them both costly and also blind to the cases that are likely to happen but just don't exist in the data yet. So because of the cost and the time, it's not likely something that you would run kind of in a semi automated way in CI, and we were really relying on users to run them before they check things in. So to that end, we went through many iterations, but we implemented a Python based unit testing framework in the end, whereby users could specify mock data in any format that they are comfortable with, whether it's a Python dictionary or a data frame or even kind of an inline CSV style string.
And we basically run their model, but pipe in those mock datasets as the inputs and compare the result of that, which is run against actual BigQuery, to the expected mock data. So those are a lot more like traditional unit tests, and they can be run in CI, on every commit, and can defend against regressions in a more scalable way. Just to add to that, I think, Michelle, we've also been looking at Great Expectations and integrating Great Expectations as, like, an after materialization,
[00:27:00] Unknown:
like, data quality check, which, like, running it after every run might be too costly, but that's something we're considering, like, running data quality checks at a regular cadence out of band of the modeling process.
[00:27:13] Unknown:
Yeah. We've actually added support for that recently. So we are running the schema tests out of band, kind of after the materialization. And I believe Great Expectations is also integrated with DBT. We haven't started using those yet, but you can specify Great Expectations as schema tests in DBT. So, yeah, we're still leveraging the schema tests for sure.
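For context, dbt's schema tests are assertions attached to a model's schema file, and dbt also supports standalone data tests written as SQL files that fail if they return any rows. A hypothetical example of the latter style of invariant, using made-up model and column names:

```sql
-- tests/assert_no_negative_order_totals.sql (hypothetical)
-- dbt treats any rows returned by this query as a test failure, so this
-- asserts that an order's total is never negative.
select
    id,
    total_price
from {{ ref('exchange_orders') }}
where total_price < 0
```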
[00:27:31] Unknown:
Yeah. It's definitely interesting how the overall aspect of data quality has been gaining a lot of ground. I'm curious what your experience has been as far as managing some of the data discoverability aspects of the datasets that you're managing with DBT and how that might cross over to some of the rest of your data lake, for being able to understand, you know, I have this information in the data lake, I wanna be able to perform these transformations on it or get this information out, and then discovering that that has actually already been managed through this dbt workflow, to maybe save people some steps?
[00:28:13] Unknown:
I don't think we have the perfect solution for this, but Shopify has an internal tool called Artifact, which will document every dataset that exists in our lake. And so when you look up a dataset there, you can see whether it's coming from DBT or whether it's coming from Starscream, but, really, there's a lot of, like, lineage tracking and history tracking that we need to bake in for it to be really useful. I feel like that's 1 of the main things that is a missing piece right now in a lot of the platforms.
[00:28:43] Unknown:
And I'm wondering if there are any other either technical or structural or team patterns that you've settled on as you get more familiar with DBT that were not obvious when you first started experimenting with it that you think would be useful for other people who are getting started with DBT or may already have been using it for a while?
[00:29:03] Unknown:
There's 2 or 3 things that are super important. So what we found out with Starscream was that giving the user the ability to schedule individual runs and treat each dataset as independent gives them a lot of power, but also lets them kinda go out of sync with other datasets. So for Seamster, what we're trying to enforce is that we materialize the whole warehouse in 1 single materialization flow. So that basically gives you referential integrity between each dataset. So for example, like, if you have a shops table, every single dataset materialized through Seamster will use the same instance of the shops table, which makes it super easy for data scientists to reason about.
And we're not focusing on, you know, whether I should schedule this thing every 6 hours or every 8 hours. Like, our data freshness is between 15 minutes and a couple hours for our datasets, but we have made the conscious decision of only materializing this nightly so as to, like, you know, separate the operational use cases from the pure analysis use cases.
[00:30:08] Unknown:
In terms of the scalability of the DBT workflows, I'm wondering if you've run into any complexities there, any challenges, and maybe any types of metrics or monitoring that you perform on the pipeline execution to be able to feed back into ways that you might want to either break apart or consolidate various aspects of the DAG?
[00:30:30] Unknown:
So right now, we monitor every single individual node, and we treat the success of all nodes as the success of the DAG. But maybe Michelle can talk a bit about, like, performance optimizations that we're looking at in the future.
[00:30:44] Unknown:
In terms of breaking up the DAG, it's sort of not necessary because, like, we're just materializing the entire thing, so there aren't really performance optimizations to be had by, you know, getting, like, this connected component and this 1 and running them separately and, you know, not reprocessing datasets needlessly, for example. So we just haven't come across that yet with our scheduling approach. 1 scaling limitation that we did run into fairly quickly was with our CI approach and how we were both running unit tests, which actually make BigQuery requests and queries, and how we build the entire validation DAG in CI.
This was fairly easy to mitigate where we have more of a slim CI running now, and we just determine the set of changed models and can use DBT to very easily figure out what are all the affected upstreams and downstreams of that change and only build the DAG validation for that portion and only run unit tests for affected models. So on the other end, though, just in terms of data scaling, when we first rolled out, our goal was to make the simple thing simple. So simple batch modeling, full drop datasets were our main starting point. But now as we mature, we do want to enable incremental modeling, but we do have this desire to abstract some of the kind of recurring complexity that data scientists would have to deal with in terms of managing their watermarks or checkpoints and to apply a correct filter to obtain only new data on a given run of a model.
This becomes pretty tricky to do and potentially dangerous, especially as the number of inputs grows and data scientists have to make considerations on how they will handle late arriving data. So something that we're currently building is support for a set of standard macros that make use of dbt's static insert overwrite materialization strategy for BigQuery, which is the most performant way we've found to do kind of this append only style incremental modeling, such that it's safe and reliable for data scientists to leverage. So really trying to take on the complexity of managing those watermarks as opposed to having everyone think about those on their own. Another scaling aspect is scaling in the number of contributors
[00:32:54] Unknown:
to the projects. So in the past, we've had several projects at Shopify that have started out as mono repos, but they have been, like, growing so large that we have started thinking about splitting, like, Starscream projects into individual DAGs or sub repos. With Seamster, I think we're gonna continue sticking with the monorepo model. We would mostly be aiming at, you know, having some sort of consistency check where, like, unused datasets get descheduled after 90 days of inactivity, or other more intelligent checks like that, instead of, like, adding the burden of coordinating across multiple repos for people.
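To make the incremental piece Michelle described above more concrete, here is a minimal sketch of the kind of model those standard macros would wrap, using dbt's insert_overwrite incremental strategy on BigQuery with a static list of partitions to replace. The model names, columns, and the 2 day replacement window are all assumptions for illustration, not Shopify's actual configuration.

```sql
-- exchange/exchange_orders_daily.sql (hypothetical)
{% set partitions_to_replace = [
    'current_date',
    'date_sub(current_date, interval 1 day)'
] %}

{{
    config(
        materialized = 'incremental',
        incremental_strategy = 'insert_overwrite',
        partition_by = {'field': 'order_date', 'data_type': 'date'},
        partitions = partitions_to_replace
    )
}}

select
    date(created_at) as order_date,
    shop_id,
    sum(total_price) as gross_sales
from {{ ref('base_orders') }}
{% if is_incremental() %}
-- only scan and rebuild the partitions listed above; late arriving data
-- outside this window would be handled by the wrapping macros, not here
where date(created_at) in ({{ partitions_to_replace | join(', ') }})
{% endif %}
group by 1, 2
```

The idea described in the conversation is to hide this partition window and watermark bookkeeping behind shared macros so individual model authors never write the boilerplate themselves.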
[00:33:32] Unknown:
Another thing that I'm interested in understanding, how you're managing things like master data management for being able to establish canonical references for a particular metric or the specific meaning of a given scalar value within the context of the data warehouse and just some of the challenges that you're facing on that front?
[00:33:53] Unknown:
In terms of standardizing the definitions of columns, 1 thing we've done is add a whole swath of checks called data warehouse rules in our CI, and 1 of them is that we enforce a globally unique column description for a given column name. So if a column name is introduced once in our data warehouse, it has a consistent description on every schema. And we can leverage something like that to build, let's say, like, a glossary of team specific words, for example, or team specific fields.
[00:34:18] Unknown:
And so in terms of the actual schemas that you're generating, I'm wondering if you have explored anything along the lines of creating some specific contracts and semantic meaning into the ways that the column names are formatted to help reduce any sort of ambiguity as to what a given column might mean in context?
[00:34:42] Unknown:
So this is something that we looked at in the past where we had this team that worked on a data naming convention. And I'm assuming you're, like, talking about, you know, prefixes to column names to denote if it's a certain value type. We found that, like, adoption just by documentation was not really successful. So with Seamster, what we're trying to do is, we haven't enforced this yet, but we will adopt some naming rules, and they will be enforced through the system programmatically. And our experience has been that anything that can be enforced programmatically will actually
[00:35:15] Unknown:
be enforced. But other than that, like, in a large organization, it's very hard to keep the sort of consistency with so many new people being added regularly. As you continue to build out your data warehouse and continue to work with DBT, I'm wondering what are some of the other features or improvements that you have planned or problem spaces that you're excited to dig into?
[00:35:38] Unknown:
Yeah. There's a whole bunch of these. 1 somewhat straightforward 1 is we want to support a more complex directory structure for leveraging dbt's ephemeral materialization, just to support logically breaking down really long and complex hairy queries, such that these intermediate pieces are more or less private implementations for a given model and can't be accessed by others. So that's 1 thing we're looking to add. We were exploring multiple
[00:36:04] Unknown:
engines for DBT. So instead of just BigQuery, we're looking at accelerating and/or reducing costs by continuing to use our Presto cluster for lower frequency or lower tier datasets.
[00:36:15] Unknown:
Yeah. So kind of exploring what, like, a multi engine setup of DBT looks like. Because right now, each DBT project is tied to 1 engine. And if we had multiple, the DAGs wouldn't really have awareness of 1 another, so building 1 more level of abstraction on top of the engines. And as our DAG grows larger, we've, like, definitely
[00:36:33] Unknown:
been thinking about, like, lazily materializing datasets. And we've applied this technique to some of our other, like, schema management projects in the past. If a dataset is not queried in Mode or somewhere else for 90 days, we'll just stop materializing it and just start materializing it again when someone accesses
[00:36:50] Unknown:
it. 1 more frontier that we'd also be interested in exploring is using SQL to write our streaming jobs or models. A good candidate for this seems like Flink SQL, so we'll be looking into
[00:37:01] Unknown:
prototyping an adapter for that. The pendulum swing of SQL has definitely been going pretty full bore in the direction of being in favor of it. And the fact that there are a number of different projects that are pushing on SQL for streaming engines is a pretty strong indicator for that. So I'll definitely be interested to see some of the ways that dbt can be used within those streaming contexts for being able to manage the pipeline of transformations that you're generating across these unbounded datasets. And so now that you've been using dbt and managing this data warehouse for a while now, I'm curious what you have seen as some of the unexpected or innovative or surprising ways that the DBT toolchain is being used or that the datasets that you're building are being managed or used for interesting use cases within the business?
[00:37:52] Unknown:
We launched for general usage about a quarter ago, and we've also launched with a pretty constrained use case of serving internal analytics with a freshness of 1 day for datasets. So we haven't really seen a wide variety of unexpected use cases quite yet. We have seen some pretty interesting kind of dynamically generated SQL queries with Jinja, and we've seen some interesting side effects. 1 of those has been that data scientists are just leveraging storage optimizations such as partitioning and clustering a lot more readily, since dbt's interface for specifying such optimizations is really lightweight.
Another kind of forward looking statement that we haven't observed yet but will likely be true is that we do expect more full stack engineers to adopt a SQL based modeling framework as opposed to learning something like Spark and managing a Spark based pipeline.
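That lightweight interface is dbt's model config block; a hypothetical example of opting a BigQuery model into partitioning and clustering (the field and model names are illustrative):

```sql
-- Partitioning and clustering are a few lines of config in dbt on BigQuery,
-- rather than a separate preprocessing step in a Spark pipeline.
{{
    config(
        materialized = 'table',
        partition_by = {'field': 'order_date', 'data_type': 'date'},
        cluster_by = ['shop_id']
    )
}}

select
    date(created_at) as order_date,
    shop_id,
    total_price
from {{ ref('base_orders') }}
```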
[00:38:39] Unknown:
In your own experience of managing this project from the beginning and sharing it throughout the organization, onboarding new people to it, I'm wondering what are some of the interesting or unexpected or challenging lessons that you've learned in that process.
[00:38:56] Unknown:
As we started this project, we rolled it out first as a private alpha for a small set of really keen data scientists and found some of our early feedback to be really valuable. 1 of our initial shortcomings was that we tried to implement a CSV based style of unit testing where data scientists would encode their static data in, you know, not inline, just actual CSV files. And we found that this led to some pretty bad unit testing practices. Specifically, unit tests should really fail for 1 reason, but we found that because of the overhead of creating a CSV file for every input, data scientists were really kind of clumping the unit tests into 1 really large mega sequence of CSV files, which would make it really, really difficult to debug or review. And their only real alternative was to make a ton of unit tests and consequently a ton of CSV files, which would similarly be very difficult to review and maintain. And we were kind of hiding all this complexity inside, like, a folder of CSV files. That was 1 of the biggest pieces of feedback we had from the alpha, and the other 1 was that we were really just very loose with our definitions of the frontroom and backroom, and we really had to set up a dedicated data modeling team that could establish some of these definitions more robustly.
So that would be the backroom, exchange layer, frontroom that we described before. Yeah. There were also,
[00:40:19] Unknown:
like, some organizational challenges for teams that are already using Starscream and are already pretty proficient with the tooling. So we have to figure out the right incentives for people to move or even rewrite their jobs from Starscream to Seamster. So for now, we're focusing on teams that are building new data models so that we can accelerate them. But in the end, we will have to offer either performance gains or more productivity gains long term for people to consider moving. And we're also thinking that instead of just terminal nodes, if people can slide dbt into their DAG, they can use Starscream for a bit, then dbt for certain models, and then go back to Starscream if they're doing, like, machine learning. That could be a good use case. Yeah. On the note of coexisting systems,
[00:41:04] Unknown:
1 challenge that we ran into for sure was the need to reconcile data models between 2 systems, because an essential part of migrating would ultimately be asserting with confidence that 2 data models are in fact, like, interchangeable and can be swapped out. So to do this, we need to create a diff of the 2 datasets across different systems, and this is just a really manual process and quite involved, to actually ensure that you fix the input datasets such that both systems build on the same set of inputs.
[00:41:34] Unknown:
As you continue to push further with where you're using the Seamster tool and bringing more workloads into the data warehouse, what are some of the cases that you have either already experienced or that you anticipate where you think that using dbt and the data warehouse is the wrong answer and teams should be looking to either other existing solutions or new solutions?
[00:41:58] Unknown:
I think, like, our principle is that SQL is good for 70 to 80% of the use cases, but not for everything. So if you're doing any sort of, like, self join shenanigans or anything, like, that is hard to express declaratively, then maybe you need to go on to a more imperative model and go to a full fledged programming environment. We don't want to end up in a place where you're doing, you know, ninja macro magic to get your SQL queries to work. We would much rather prefer that you just use PySpark directly at that point. Also, if you're doing any sort of, like, customer facing, like, real time reporting or any sort of streaming where you are modeling with Seamster and then loading into Druid or ClickHouse or Pinot, you might just be better off by taking a full engineering solution and using Flink or Dataflow directly and then loading into those data stores. Well, for anybody who wants to get in touch with either of you and follow along with the work that you're doing, I'll have you add your preferred
[00:42:57] Unknown:
contact information to the show notes. And as a final question, I would like to get each of your perspectives on what you see as being the biggest gap in the tooling or technology that's available for data management today. Zeeshan, why don't you go first?
[00:43:09] Unknown:
So my observation being in this field so far has been that there's a lot of tools to process the data and crunch it, but there's not a lot of tooling around data ownership, versioning, and lineage, and especially, like, around PII and handling sensitive information and almost, like, semi PII information, like IPs, user agents, and tracking cookies, because in certain jurisdictions, a combination of those could be considered PII. So it is incredibly hard for data scientists to figure out, you know, how do they handle these things? How do they, like, rotate them out? If they're using some sort of, like, redaction system, if they need to make a breaking change, how do they coordinate all these breaking changes with other teams and other data science teams? So we're doing a lot of this by educating people, but I think, realistically, the modeling tooling should be integrating very tightly with the metadata tooling. So I'm almost imagining, like, you know, some sort of, like, VS Code type feel, where as you're making data models, you're getting annotations and notifications that you need to be thinking about, like, these PII and versioning and backfilling things.
[00:44:17] Unknown:
Yeah. I've definitely seen some cases of using the metadata management and being able to label things like PII. But having that feed back into your development environment as you're iterating on the models and the code that you're using to process them is definitely an interesting and useful next step. I'll have to see if I can find anything that's even trending in that direction.
[00:44:37] Unknown:
Michelle, do you have anything to add? I definitely agree that there's sort of a lack of a more centralized, almost like metadata schema for datasets from a platform tooling perspective, and we're finding more and more as we're building new tools in the data platform that every tool needs to reinvent its ownership model, or how lineage is done, or how PII annotation is done, or monitoring latency. So something that kind of syncs together all of this in a standardized, more holistic way that can be leveraged for a consistent user experience.
[00:45:05] Unknown:
Well, thank you both very much for taking the time today to join me and share your experience of building out a data warehouse platform and managing the toolchain around DBT, and your experience of working with that and managing that as you continue to scale with it. It's definitely a very interesting and useful problem space, and it's valuable to hear from people who are going down that path so that it helps other people who are coming along after them. So I appreciate all the time and energy you've put into that and the time you've taken to share it, and I hope you enjoy the rest of your day. Happy to be here. Yeah. Thanks for having us. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction to Guests and Their Roles at Shopify
Overview of Shopify's Data Sources and Warehouse
Organizational Structure of Shopify's Data Teams
Challenges and Tipping Points in Data Management
Choosing BigQuery and DBT for Data Warehousing
Project Structure and Onboarding with DBT
Data Modeling Approaches and Best Practices
Lessons from Using BigQuery and Redshift
Integration with Broader Data Platform
Continuous Deployment and Incremental Builds
Data Discoverability and Lineage Tracking
Team Patterns and Best Practices
Scalability and Performance Optimization
Master Data Management and Canonical References
Future Improvements and Features
Innovative Uses and Lessons Learned
Coexisting Systems and Migration Challenges
When Not to Use DBT and Data Warehouse
Biggest Gaps in Data Management Tooling