Summary
With all of the tools and services available for building a data platform it can be difficult to separate the signal from the noise. One of the best ways to get a true understanding of how a technology works in practice is to hear from people who are running it in production. In this episode Zeeshan Qureshi and Michelle Ark share their experiences using DBT to manage the data warehouse for Shopify. They explain how they structured the project to allow multiple teams to collaborate in a scalable manner, the additional tooling that they added to address the edge cases that they ran into, and the optimizations that they baked into their continuous integration process to provide fast feedback and reduce costs. This is a great conversation about the lessons learned from real world use of a specific technology and how well it lives up to its promises.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.
- Today’s episode of Data Engineering Podcast is sponsored by Datadog, the monitoring and analytics platform for cloud-scale infrastructure and applications. Datadog’s machine-learning based alerts, customizable dashboards, and 400+ vendor-backed integrations make it easy to unify disparate data sources and pivot between correlated metrics and events for faster troubleshooting. By combining metrics, traces, and logs in one place, you can easily improve your application performance. Try Datadog free by starting your 14-day trial and receive a free t-shirt once you install the agent. Go to dataengineeringpodcast.com/datadog today to see how you can unify your monitoring.
- Your host is Tobias Macey and today I’m interviewing Zeeshan Qureshi and Michelle Ark about how Shopify is building their production data warehouse platform with DBT
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by giving an overview of what the Shopify platform is?
- What kinds of data sources are you working with?
- Can you share some examples of the types of analysis, decisions, and products that you are building with the data that you manage?
- How have you structured your data teams to be able to deliver those projects?
- What are the systems that you have in place, technological or otherwise, to allow you to support the needs of the various data professionals and business users?
- What was the tipping point that led you to reconsider your system design and start down the road of architecting a data warehouse?
- What were your criteria when selecting a platform for your data warehouse?
- What decision did that criteria lead you to make?
- Once you decided to orient a large portion of your reporting around a data warehouse, what were the biggest unknowns that you were faced with while deciding how to structure the workflows and access policies?
- What were your criteria for determining what toolchain to use for managing the data warehouse?
- You ultimately decided to standardize on DBT. What were the other options that you explored and what were the requirements that you had for determining the candidates?
- What was your process for onboarding users into the DBT toolchain and determining how to structure the project layout?
- What are some of the shortcomings or edge cases that you ran into?
- Rather than rely on the vanilla DBT workflow you created a wrapper project to add additional functionality. What were some of the features that you needed to add to suit your particular needs?
- What has been your experience with extending and integrating with DBT to customize it for your environment?
- Can you talk through how you manage testing of your DBT pipelines and the tables that it is responsible for?
- How much of the testing are you able to do with out-of-the-box functionality from DBT?
- What are the additional capabilities that you have bolted on to provide a more robust and scalable means of verifying your pipeline changes?
- Can you share how you manage the CI/CD process for changes in your data warehouse?
- What kinds of monitoring or metrics collection do you perform on the execution of your DBT pipelines?
- How do you integrate the management of your data warehouse and DBT workflows with your broader data platform?
- Now that you have been using DBT in production for a while, what are the challenges that you have encountered when using it at scale?
- Are there any patterns that you and your team have found useful that are worth digging into for other teams who are considering DBT or are actively using it?
- What are the opportunities and available mechanisms that you have found for introducing abstraction layers to reduce the maintenance burden for your data warehouse?
- What is the data modeling approach that you are using? (e.g. Data Vault, Star/Snowflake Schema, wide tables, etc.)
- As you continue to work with DBT and rely on the data warehouse for production use cases, what are some of the additional features/improvements that you have planned?
- What are some of the unexpected/innovative/surprising use cases that you and your team have found for the Seamster tool or the data models that it generates?
- What are the cases where you think that DBT or data warehousing is the wrong answer and teams should be looking to other solutions?
- What are the most interesting, unexpected, or challenging lessons that you learned while working through the process of migrating a portion of your data workloads into the data warehouse and managing them with DBT?
Contact Info
- Zeeshan
- Michelle
- @michellearky on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- How to Build a Production Grade Workflow with SQL Modelling
- Shopify
- JRuby
- PySpark
- Druid
- Amplitude
- Mode
- Snowflake Schema
- Data Vault
- BigQuery
- Amazon Redshift
- CI/CD
- Great Expectations
- Master Data Management
- Flink SQL
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's l I n o d e, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
Today's episode of the Data Engineering Podcast is sponsored by Datadog, the monitoring and analytics platform for cloud scale infrastructure and applications. Datadog's machine learning based alerts, customizable dashboards, and 400 plus vendor backed integrations make it easy to unify disparate data sources and pivot between correlated metrics and events for faster troubleshooting. By combining metrics, traces, and logs in 1 place, you can easily improve your application performance. Try Datadog free by starting your 14 day free trial and receive a free t shirt once you install the agent. Go to dataengineeringpodcast.com/datadog today to see how you can unify your monitoring. Your host is Tobias Macey. And today, I'm interviewing Zeeshan Qureshi and Michelle Ark about how Shopify is building their production data warehouse platform with DBT. So, Zeeshan, can you start by introducing yourself?
[00:01:33] Unknown:
My name is Zeeshan Qureshi. I'm an engineering manager at Shopify. I've been involved in the data org almost since its inception for the last 5 years. And, Michelle, how about yourself?
[00:01:44] Unknown:
So, hey, I'm Michelle. Nice to meet you, Tobias. I'm a data developer at Shopify, and I've been there for a couple years.
[00:01:52] Unknown:
And going back to you, Zeeshan, do you remember how you first got involved in the area of data management?
[00:01:56] Unknown:
Yeah. So when I joined Shopify, I had worked on the Chrome team before, so I got handed the task of building our online storefront data collection pipelines. So I did a bunch of JavaScript, and effectively, like, we were building Google Analytics for Shopify, both for internal analytics and also serving merchant data, which kinda then snowballed into building a bunch of internal tooling for data scientists for, like, funnel analysis, like, stream analysis. And then a point came where, like, I was the person who was being pinged every time someone wanted to measure something new. So we started a team around, like, specifically all the tooling to collect our data and how we manage schematization of our data, versioning, and especially, like, privacy annotations.
[00:02:41] Unknown:
Yeah. It's definitely a very broad and deep area that you can kind of go a lot of different directions with. So I can imagine the level of complexity whenever somebody says, I wanna measure this new thing, and then saying, well
[00:02:54] Unknown:
Yeah. Exactly. Like, a lot of times, it's like, you know, I wanna measure everyone across every site. We're like, probably can't do that.
[00:03:03] Unknown:
And, Michelle, do you remember how you first got involved in data management?
[00:03:06] Unknown:
Yeah. So through University of Toronto's applied computer science master's program, I was paired with Shopify on an applied research project where I landed fortunately on Zeeshan's team, and we were building and prototyping a specialized query engine specifically for funnel queries, which we found were the kind of least performant class of queries for internal analytics at Shopify. After that, I joined Zeeshan's team where we were working on Shopify's event collection framework. Eventually, we started getting to the other end of this collection and looking into storage optimizations and query engines, and that's really how we got into exploring dbt as an orchestrator of SQL statements to build a data warehouse.
[00:03:47] Unknown:
And before we get too far along on the story of how you got to where you are with DBT in your data warehouse, can we just give a bit of context for anybody who's not familiar with what Shopify is and just describe the platform and maybe talk through some of the types of data sources and formats that you're working with?
[00:04:05] Unknown:
Yeah. So Shopify's goal is to enable commerce for everyone, whether it be e commerce, whether it be, like, physical point of sale, and to enable a platform so that you can plug and play every part of the platform. So, like, if you were to use a different shipping engine, you can use, like, Shopify provided shipping rates, or you could use a third party shipping service. We have, like, an app store platform where, like, for features that are not in Shopify, or if you want customizations outside of Shopify, you can opt out. So the idea basically is, you know, like, Shopify is there for all your commerce needs.
[00:04:39] Unknown:
So on the warehouse side, or at the lake side, we are pulling from lots of internal databases that are sharded MySQL tables, hundreds of third party APIs, and a lot of streaming event data over Kafka. We also have some occasional dumps of data from customer surveys, things like that.
[00:04:56] Unknown:
In terms of the types of analysis and products and decision making that you are making based on the data that you're collecting and the processing that you're doing with it, what are some of the
[00:05:10] Unknown:
impacts that that information has on the business and the direction that you take the product? 1 of the key use cases for data is, you know, your standard business analytics, so, you know, funnel analysis, conversion, segmentation, cohort analysis. Over the last year or so, Shopify has also put a lot of focus on performance optimizations. So, like, a lot of our event streaming data is going into the service called Fast Shopify, where every team can see their performance metrics, which is a key driving factor in, like, engineering decisions and technology choices. Shopify is also powering a lot of, like, customer facing products with this data, so, like, Shopify fraud protection, where for a small fee, Shopify will provide insurance on fraudulent activities, or Shopify Capital, where Shopify will fund merchants to help them scale.
[00:06:00] Unknown:
There's definitely a decent amount of scale and organizational complexity within Shopify. I know that it's been around for a number of years, and I've been hearing about it for quite a while, both in terms of the platform and people who are building customizations on top of it. And I'm curious how you have structured the organizational aspects of your data teams to be able to manage the different reporting requirements and the analysis, to make sure that you have the appropriate domain expertise, and some of the technological systems and processes that you have in place to be able to manage all of that organizational and technological complexity?
[00:06:40] Unknown:
Shopify has a top level data org as, like, a separate pillar in the company, and we've split data science and engineering in 2 different ways. Data science uses a hub and spoke model where every data scientist reports into the data org, but they're embedded in their engineering teams. For data engineering, there is a dedicated data platform team in the data org where everyone effectively reports functionally according to the layer of the platform, whether it be, like, ingestion, processing, serving.
[00:07:12] Unknown:
In terms of some of the systems that we have in place to allow and support the needs of those use cases, it's a whole swath of them. So we have some custom data extraction tooling that's built on top of JRuby, and a PySpark framework called Starscream. We leverage Presto and BigQuery for our query engines and analysis, as well as other custom data stores such as Druid for serving. We also use Amplitude, Mode, and notebooks for reporting as well as ad hoc queries.
[00:07:41] Unknown:
And I know that in the blog post that you shared with me to prepare for this conversation, you mentioned that Starscream was, for a while, the core element of your reporting capabilities and that that ended up slowing down some of the delivery of ad hoc or relatively simple queries. And I'm curious, what was the tipping point that pushed you into deciding that you needed to build a new component to your data platform to be able to handle these types of use cases?
[00:08:12] Unknown:
Through, I guess, data surveys and just chatting with data scientists, the tipping point really was how long the iteration cycles were from prototyping to production data modeling. People would generally prototype in SQL and then convert into PySpark code and consider things like performance optimizations and storage optimizations, and it would just take weeks of work. So we saw an anti pattern where data scientists were very commonly materializing scratch views or tables instead of converting their SQL based prototypes to PySpark. So this led to a lack of reviewability, sharing, testing, freshness. It was very manual.
So it was these antipatterns, as well as just understanding that the long iteration cycles were caused by going between different languages and different systems.
[00:09:01] Unknown:
The crux was that the user experience with Starscream was very engineer centric. And back when Starscream was built, we didn't have, like, a data scientist and data engineer separation. Like, everyone was effectively an engineer, and, like, you know, everyone was expected to write Python. And the way Starscream just output the datasets, the metadata, and, like, just the storage format was also just not the most optimized. We did not have partitioned table output for the longest time. So we just figured that, like, we want to go where the users are, and the users are with SQL.
[00:09:33] Unknown:
Yeah. It's definitely remarkable how durable the SQL format has been, where there have been a number of cases where people have tried to move away from it and say, oh, well, you know, it's not a proper programming language. I can do so much more if I use Python or use R or whatever it might be, and then eventually, the pendulum always swings back to SQL.
[00:09:53] Unknown:
Yeah. Yeah. And, like, even basic, like, join optimizations, if you're trying to do it in, like, raw Python, you have to understand what's happening. But if it's, like, the declarative model, we can push it all to the query engine and let it do the magic. Definitely a lot to be said for
[00:10:08] Unknown:
separation of concerns and letting the optimizations happen at a layer further down from where you actually have to deal with it on your day to day. In terms of your selection process, once you decided that you needed to go a different direction with managing this aspect of your data platform, you wanted to push it into a data warehouse, what were the criteria that you were using to measure the various candidates against, and what was the ultimate decision that you landed on?
[00:10:35] Unknown:
1 of the criteria was that we realized we had been building a lot of custom tooling over the years, and we wanted to adopt something that was being developed out in the open and that we could stay closely aligned with. So we didn't really wanna build a very custom workflow.
[00:10:49] Unknown:
As we said before, we wanted the tooling to be declarative. So whether it be SQL or some other form that was fully declarative, we did not want people to be concerned with the internal details of the platform. And to be honest, it was mostly a question between, like, using any form of, like, SQL based engine or orchestrator, and, like, the only viable 1 out there was dbt. So the real question was, do we use DBT or do we build our own orchestration system, which, I think, would have been yet another turn at Shopify building a SQL based modeling tool.
[00:11:24] Unknown:
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours or days. Datafold helps data teams gain visibility and confidence in the quality of their analytical data through data profiling, column level lineage, and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow and DBT and seamlessly plugs into CI workflows.
Go to dataengineeringpodcast.com/datafold today to start a 30 day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they'll send you a cool water flask. And in terms of the actual storage layer, what were your criteria for determining which warehouse to use?
[00:12:28] Unknown:
Most of Shopify is on Google Cloud, and we wanted to run this team as a dedicated team, without blocking too much on other teams. So we basically went for BigQuery because it's, you know, a fully hosted service, and our experience with it before using DBT had also been pretty positive. That being said, we wanted to be query engine agnostic, and we are looking at other storage layers in the future.
[00:12:54] Unknown:
You decided on BigQuery. You landed on DBT as the actual tool chain to use. I'm curious what your approach was for vetting it in terms of understanding this looks like a good tool. I'm going to actually run it through its paces and make sure that it does all the things that I needed to do. And then once you decide, okay. This is going to do what I want, then onboarding new users into it and figuring out what the repository structure is going to look like, what some of the organizational standards are going to be in terms of the formatting and the overall workflow.
[00:13:29] Unknown:
So the project structure, the directory structure, was 1 of the biggest unknowns and really something that DBT leaves very flexible. So we spent, I think, a few iteration cycles going between alphas and betas, but landed on a project structure where we have backroom models, and we have some access policies around those. For example, by definition, a backroom model is something that has some rough edges and requires some domain expertise, like, or team level expertise to understand its caveats. So we restrict building on top of such models unless the downstream model is owned by the same team. So the whole backroom pipeline is owned by the same data team. And gradually, those are promoted to other layers, to the exchange and frontroom models, and that's the project structure that we landed on that worked well for our organization.
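As a rough illustration of that layering, here is a minimal sketch of what a frontroom model might look like, assuming hypothetical model, path, and column names rather than Shopify's actual project:

```sql
-- frontroom/finance/orders_daily.sql (hypothetical path and names)
-- Frontroom models build only on exchange-level models via ref();
-- reaching into another team's backroom models would be blocked by the
-- project's access-policy checks described above.
select
    order_date,
    shop_id,
    count(*)         as orders,
    sum(total_price) as gross_sales
from {{ ref('exchange_orders') }}
group by 1, 2
```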
[00:14:19] Unknown:
It's definitely 1 of the challenges when you get a tool like this where there are some opinions, but then there are other areas where it's flexible just understanding what opinions to form if you're just coming into it for the first time. Yeah. Definitely. As far as the actual data modeling approach, you mentioned having these backroom tables for being able to prove out some of the raw source datasets. And I know that there's been a lot of debate, particularly in recent years with some of the newer query engines as to whether to use things like the star or snowflake schema from the early days of data warehousing or using things like Data Vault for being able to manage a lot of different data sources with changing needs within the business or just going with wide sparse tables because the query engines can handle them now. And I'm wondering what you have landed on and what your process was for determining how best to handle some of those modeling questions.
[00:15:15] Unknown:
We've mostly stuck with Kimball and Ross dimensional modeling and materializing into a snowflake schema, although we have individual teams which, based on performance optimizations, will use denormalized and/or wide tables for critical datasets. Our thought process around this is that we want data to be modeled using Kimball and Ross, but then it can be published in various layout formats based on the use case.
[00:15:41] Unknown:
I think part of that thought process has also been, you know, consistency and just choosing a modeling paradigm is super critical and important, maybe even more important than choosing which 1.
[00:15:52] Unknown:
Yeah. It's definitely important to make a choice because then you at least have something to build from. Whereas if you leave it up to everybody, then, you know, if there are 5 choices, then, you know, they'll make 6 different decisions with it. And then in terms of the actual workflow of being able to manage the DBT pipelines and structure the orchestrations, you mentioned that you're using the star schema, or, like, that approach for the actual dimensional modeling aspects. I'm wondering what you have seen as opportunities or requirements for adding in some layers of abstraction to buffer you from changes in the source datasets so that you don't have to rework all of your models downstream every time some API version changes?
[00:16:39] Unknown:
So to that end, we adopted a pattern that DBT recommends called staging. We called it base models. It's akin to a layer as well, so the first layer right after raw sources is this base models layer. And this is really their best practice for limiting references to raw data. So our base models serve as more or less a 1 to 1 interface to raw sources, but they're almost like a snapshot of the contract of the raw source. They don't change the grain or do any renaming. They're just this 1 to 1 interface, and it serves to protect us and users from breaking changes in raw sources, which by definition are going to introduce breaking changes that are external to our system.
So once these breaking changes do happen, do inevitably happen, the idea is that you only need to apply the fix to a single place in the base model as opposed to every individual downstream model that depends on that raw source. So there's a single point of failure and a single owner that can propagate those changes.
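A minimal sketch of what one of these base models might look like in dbt, assuming made-up source and column names for illustration: it selects 1 to 1 from the raw source, with no renames and no change of grain, so nothing downstream ever references the raw table directly.

```sql
-- base/base_orders.sql (hypothetical)
-- 1:1 interface over the raw source; if the upstream contract breaks,
-- this is the single place that needs to change.
with source as (
    select * from {{ source('shop_raw', 'orders') }}
)
select
    id,
    shop_id,
    total_price,
    currency,
    created_at
from source
```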
[00:17:39] Unknown:
And you mentioned that you had some organizational experience with using BigQuery before you made this transition to building the DBT workflow and fully investing in that. And I'm curious what you learned before using DBT that informed some of the decisions that you made as far as how you wanted to structure some of the transformations, how you wanted to manage things like partitioning schemes, just some of the organizational knowledge that was useful from working with BigQuery that you then translated into DBT and codified?
[00:18:15] Unknown:
So before BigQuery, before DBT, we were using Redshift a lot for our warehouse, and I think Redshift was our original warehousing solution. But then shoveling data from our GCS lake into Redshift was becoming too much trouble, and we wanted to consolidate everything with 1 provider. So we decided to switch over to BigQuery. But then what we realized was that for a lot of those things that we would have to do, for example, in Presto or Redshift, which included, like, partitioning or clustering, we would have to preprocess a lot of our data before we loaded it into the warehousing solution, whereas with BigQuery, we could mostly give it Parquet files and ask it to partition or cluster. And so the whole data loading workflow went from the order of a couple of hours to a couple of minutes for each dataset, which was super useful, and it also meant that, like, our Spark cluster was not getting hammered with all these materialization queries just to load into some other warehouse.
Once we switched over to BigQuery, I think it was about a year before we started working on DBT, we also spent a bunch of time just, like, studying more query engines, and then we realized, like, BigQuery's flexible scaling capacity was really useful for our use case, because Shopify is, like, growing exponentially, and, like, just forecasting how big our cluster needed to be and how much provisioning we needed to do, BigQuery just, like, took all of that away from us, and it has been super nice so far. Although Michelle has been working more with BigQuery, so she can tell us about some headaches if she has any.
[00:19:47] Unknown:
There have been a couple headaches, but actually nothing too major. Like,
[00:19:51] Unknown:
the numeric support, the lack of high precision numeric support, was the bane of my existence for a couple months. Oh, yeah. We tested some queries that we had been running in Redshift, and we found, like, large window operations worked much better on
[00:20:03] Unknown:
Redshift, although I'm not sure if they've been fixed. This was, like, late 2018, I think. Yeah. Another nice aspect of BigQuery was the server side version upgrades. So, like, any updates are just sort of automatically propagated to us as customers, and we don't have to kind of upgrade as we do with Presto. So if it's been fixed, it just kind of magically gets applied.
[00:20:25] Unknown:
Yeah. That's 1 of the benefits of serverless: you don't have to look behind the curtain to understand how it's working, and it is just sort of magical in that regard. But then when you do need to look behind the curtain, you can't. So it's a double edged sword. You mentioned that moving some of the workload into BigQuery has reduced some of the pressure on your Spark clusters, and I'm curious if you can talk through some of the ways that your data warehouse and DBT workflow integrates with the broader data platform and some of the impacts that it has had in terms of the other infrastructure investments that you've made or areas where you've been able to scale back certain work or consume from the data warehouse in downstream systems?
[00:21:07] Unknown:
In reality, I think that integration of BigQuery and DBT with the rest of our data platform has been the main pain point that we are dealing with as the result of our choice to use, like, a fully integrated engine like BigQuery, where our ingestion systems don't load all raw data to BigQuery for cost reasons. And we also don't export every BigQuery table back out to the data lake by default, although we're looking into that in the future. So it has been a pain point. As we rolled out Seamster, we sort of just dealt with this by declaring at the start that any models built with DBT on BigQuery would be terminal nodes and not actually accessible to other systems. And similarly, source data can be loaded through Starscream and used as an input to a Seamster model, but this would be on an opt in basis.
It is on our road map to sync Seamster models to the data lake and, in this way, make them accessible for reading in other systems and also publishing to other downstream systems. Yeah. That's kind of the current state.
[00:22:04] Unknown:
The cost aspects of things like BigQuery are definitely interesting in how they inform some of the other architectural decisions that you make as far as how often to access it, you know, how often to run through a full rebuild of the data warehouse versus just doing incremental builds. And, again, it's 1 of the trade offs: it's useful, but then you do have to think about what the cost aspect is. I mean, everything has a cost, but it makes it a little bit more explicit when it's a line item in the bill.
[00:22:32] Unknown:
Yeah. Totally. Although, like, our data scientists have been telling us that the estimated scan cost shown while they're editing in the editor is, like, a very good proxy for them to say how fast a query is gonna run. So they've been loving it. But, yeah, I think BigQuery kinda forces you to really, you know, acknowledge whether materializing this table is worth it or not, and, like, whether you should even be, like, doing it lazily or less frequently than whatever you're used to.
[00:23:02] Unknown:
I think it's also interesting to dig into some of the continuous deployment aspects of your DBT workflow and some of the decisions that you've made as far as which tables to rebuild versus which ones to build incrementally, and maybe any edge cases that you've run into as far as how to manage those incremental updates and what to use as the high watermarks for the different source and destination tables?
[00:23:27] Unknown:
I can start by tackling that question and talk about our CI pipelines and some of the decisions we've made around that. So the goal for our CI was to ensure that data scientists couldn't check in or merge a model, or a change to a model, that would fail in production the next day. The only reason things should really break are for infrastructure reasons that our team could look after. So to that end, we extended DBT a little bit to add this concept of DAG validation, where we build the entire warehouse, except we materialize everything as a view to a temporary schema in BigQuery.
And at compile time, BigQuery can determine whether or not the SQL is valid, whether it has any typos or incorrect type references or anything wrong with that SQL statement. So to that end, we can validate the actual SQL and ensure that it will run successfully the next time in production. In addition to that, we have also introduced the ability to validate schema files. So dbt comes with schema.yml files that you can attach to a given model. However, the schemas specified there are really just documentation. They're not enforced anywhere, and you can't really treat them as metadata, and, of course, anything like that will go out of sync eventually.
So we've leveraged the same DAG validation idea where we build everything as a view. But before we clean up the validation dataset, we query the information schema and determine the actual schema for the dataset. We also do this on CI. So we ensure that both the SQL model is correct and the schema associated with it is correct.
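A rough sketch of the schema validation half of that idea, assuming the CI job has just built every model as a view into a temporary BigQuery dataset (the project and dataset names here are placeholders): querying that dataset's INFORMATION_SCHEMA returns the actual column names and types, which can then be compared against what the schema files declare.

```sql
-- Run against the temporary validation dataset created by the CI build;
-- the results are diffed against each model's declared schema file.
select
    table_name,
    column_name,
    data_type
from `my-project.dbt_ci_validation.INFORMATION_SCHEMA.COLUMNS`
order by table_name, ordinal_position
```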
[00:25:03] Unknown:
Yeah. It's definitely an interesting problem case as far as the documentation getting out of sync. And it's 1 of the reasons that it's interesting to hear from people who are actually using dbt in production, because it's not something that you would necessarily think about when you're first exploring it until you really get deep into it. And I'm curious if there are any other edge cases that you've run into or shortcomings in terms of the out of the box capabilities of dbt that you've needed to augment or work around in your own systems.
[00:25:33] Unknown:
So 1 of the shortcomings, I would say, was the lack of unit testing, really unit testing on small, statically supplied data. DBT does come with this notion of schema tests, which are really, really great and flexible for testing invariants on your data. But ultimately, they run on production data, which makes them both costly and also blind to the cases that are likely to happen but just don't exist in the data yet. So because of the cost and the time, it's not likely something that you would run kind of in a semi automated way in CI, and we were really relying on users to run them before they check things in. So to that end, we went through many iterations, but we implemented a Python based unit testing framework in the end, whereby users could specify mock data in any format that they are comfortable with, whether it's a Python dictionary or a data frame or even kind of an inline CSV style string.
And we basically run their model, but pipe in those mock datasets as the inputs and compare the result of that, which is run against actual BigQuery, to the expected mock data. So those are a lot more like traditional unit tests, and they can be run in CI, on every commit, and can defend against regressions in a more scalable way. Just to add to that, I think, Michelle, we've also been looking at Great Expectations and integrating Great Expectations as, like, an after materialization,
[00:27:00] Unknown:
like, data quality check, which, like, running it after every run might be too costly, but that's something we're considering, like, running data quality checks at a regular cadence out of band of the modeling process.
[00:27:13] Unknown:
Yeah. We've actually added support for that recently. So we are running the schema tests out of band, kind of after the materialization. And I believe Great Expectations is also integrated with DBT. We haven't started using those yet, but you can specify Great Expectations as schema tests in DBT. So, yeah, we're still leveraging the schema tests for sure.
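For context, dbt's schema tests are assertions attached to a model's schema file, and dbt also supports standalone data tests written as SQL files that fail if they return any rows. A hypothetical example of the latter style of invariant, using made-up model and column names:

```sql
-- tests/assert_no_negative_order_totals.sql (hypothetical)
-- dbt treats any rows returned by this query as a test failure, so this
-- asserts that an order's total is never negative.
select
    id,
    total_price
from {{ ref('exchange_orders') }}
where total_price < 0
```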
[00:27:31] Unknown:
Yeah. It's definitely interesting how the overall aspect of data quality has been gaining a lot of ground. I'm curious what your experience has been as far as managing some of the data discoverability aspects of the datasets that you're managing with DBT and how that might cross over to some of the rest of your data lake, for being able to understand, you know, I have this information in the data lake, I wanna be able to perform these transformations on it or get this information out, and then discovering that that has actually already been managed through this dbt workflow, to maybe save people some steps?
[00:28:13] Unknown:
I don't think we have the perfect solution for this, but Shopify has an internal tool called Artifact, which will document every dataset that exists in our lake. And so when you look up a dataset there, you can see whether it's coming from DBT or whether it's coming from Starscream, but, really, there's a lot of, like, lineage tracking and history tracking that we need to bake in for it to be really useful. I feel like that's 1 of the main things that is a missing piece right now in a lot of the platforms.
[00:28:43] Unknown:
And I'm wondering if there are any other either technical or structural or team patterns that you've settled on as you get more familiar with DBT that were not obvious when you first started experimenting with it that you think would be useful for other people who are getting started with DBT or may already have been using it for a while?
[00:29:03] Unknown:
There's 2 or 3 things that are super important. So what we found out with Starscream was that giving the user the ability to schedule individual runs and treat each dataset as independent gives them a lot of power, but also lets them kinda go out of sync with other datasets. So for Seamster, what we're trying to enforce is that we materialize the whole warehouse in 1 single materialization flow. So that basically gives you referential integrity between each dataset. So for example, like, if you have a shops table, every single dataset materialized through Seamster will use the same instance of the shops table, which makes it super easy for data scientists to reason about.
And we're not focusing on, you know, whether I should schedule this thing every 6 hours or every 8 hours. Like, our data freshness is between 15 minutes and a couple hours for our datasets, but we have made the conscious decision of only materializing this nightly so as to, like, you know, separate the operational use cases from the pure analysis use cases.
[00:30:08] Unknown:
In terms of the scalability of the DBT workflows, I'm wondering if you've run into any complexities there, any challenges, and maybe any types of metrics or monitoring that you perform on the pipeline execution to be able to feed back into ways that you might want to either break apart or consolidate various aspects of the DAG?
[00:30:30] Unknown:
So right now, we monitor every single individual node, and we treat the success of all nodes as the success of the DAG. But maybe Michelle can talk a bit about, like, performance optimizations that we're looking at in the future.
[00:30:44] Unknown:
In terms of breaking up the DAG, it's sort of not necessary because, like, we're just materializing the entire thing, so there aren't really performance optimizations to be had by, you know, getting, like, this connected component and this 1 and running them separately and, you know, not reprocessing datasets needlessly, for example. So we just haven't come across that yet with our scheduling approach. 1 scaling limitation that we did run into fairly quickly was with our CI approach and how we were both running unit tests, which actually make BigQuery requests and queries, and how we build the entire validation DAG in CI.
This was fairly easy to mitigate where we have more of a slim CI running now, and we just determine the set of changed models and can use DBT to very easily figure out what are all the affected upstreams and downstreams of that change and only build the DAG validation for that portion and only run unit tests for affected models. So on the other end, though, just in terms of data scaling, when we first rolled out, our goal was to make the simple thing simple. So simple batch modeling, full drop datasets were our main starting point. But now as we mature, we do want to enable incremental modeling, but we do have this desire to abstract some of the kind of recurring complexity that data scientists would have to deal with in terms of managing their watermarks or checkpoints and to apply a correct filter to obtain only new data on a given run of a model.
This becomes pretty tricky to do and potentially dangerous, especially as the number of inputs grows and data scientists have to make considerations on how they will handle late arriving data. So something that we're currently building is support for a set of standard macros that make use of dbt's static insert overwrite materialization strategy for BigQuery, which is the most performant way we've found to do kind of this append only style incremental modeling, such that it's safe and reliable for data scientists to leverage. So really trying to take on the complexity of managing those watermarks as opposed to having everyone think about those on their own. Another scaling aspect is scaling in the number of contributors
[00:32:54] Unknown:
to the projects. So in the past, we've had several projects at Shopify that have started out as mono repos, but they have been, like, growing so large that we have started thinking about splitting, like, Starscream projects into individual DAGs or sub repos. With Seamster, I think we're gonna continue sticking with the monorepo model. We would mostly be aiming at, you know, having some sort of consistency check where, like, unused datasets get descheduled after 90 days of inactivity, or other more intelligent checks like that, instead of, like, adding the burden of coordinating across multiple repos for people.
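To make the incremental piece Michelle described above more concrete, here is a minimal sketch of the kind of model those standard macros would wrap, using dbt's insert_overwrite incremental strategy on BigQuery with a static list of partitions to replace. The model names, columns, and the 2 day replacement window are all assumptions for illustration, not Shopify's actual configuration.

```sql
-- exchange/exchange_orders_daily.sql (hypothetical)
{% set partitions_to_replace = [
    'current_date',
    'date_sub(current_date, interval 1 day)'
] %}

{{
    config(
        materialized = 'incremental',
        incremental_strategy = 'insert_overwrite',
        partition_by = {'field': 'order_date', 'data_type': 'date'},
        partitions = partitions_to_replace
    )
}}

select
    date(created_at) as order_date,
    shop_id,
    sum(total_price) as gross_sales
from {{ ref('base_orders') }}
{% if is_incremental() %}
-- only scan and rebuild the partitions listed above; late arriving data
-- outside this window would be handled by the wrapping macros, not here
where date(created_at) in ({{ partitions_to_replace | join(', ') }})
{% endif %}
group by 1, 2
```

The idea described in the conversation is to hide this partition window and watermark bookkeeping behind shared macros so individual model authors never write the boilerplate themselves.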
[00:33:32] Unknown:
Another thing that I'm interested in understanding, how you're managing things like master data management for being able to establish canonical references for a particular metric or the specific meaning of a given scalar value within the context of the data warehouse and just some of the challenges that you're facing on that front?
[00:33:53] Unknown:
In terms of standardizing the definitions of columns, 1 thing we've done is add a whole swath of checks called data warehouse rules in our CI, and 1 of them is that we enforce a globally unique column description for a given column name. So if a column name is introduced once in our data warehouse, it has a consistent description on every schema. And we can leverage something like that to build, let's say, like, a glossary of team specific words, for example, or team specific fields.
[00:34:18] Unknown:
And so in terms of the actual schemas that you're generating, I'm wondering if you have explored anything along the lines of creating some specific contracts and semantic meaning into the ways that the column names are formatted to help reduce any sort of ambiguity as to what a given column might mean in context?
[00:34:42] Unknown:
So this is something that we looked at in the past where we had this team that worked on a data naming convention. And I'm assuming you're, like, talking about, you know, prefixes to column names to denote if it's a certain value type. We found that, like, adoption just by documentation was not really successful. So with Seamster, what we're trying to do is, we haven't enforced this yet, but we will adopt some naming rules, and they will be enforced through the system programmatically. And our experience has been that anything that can be enforced programmatically will actually
[00:35:15] Unknown:
be enforced. But other than that, like, in a large organization, it's very hard to keep the sort of consistency with so many new people being added regularly. As you continue to build out your data warehouse and continue to work with DBT, I'm wondering what are some of the other features or improvements that you have planned or problem spaces that you're excited to dig into?
[00:35:38] Unknown:
Yeah. There's a whole bunch of these. 1 somewhat straightforward 1 is we want to support a more complex directory structure for leveraging dbt's ephemeral materialization, just to support logically breaking down really long and complex hairy queries, such that these intermediate pieces are more or less private implementations for a given model and can't be accessed by others. So that's 1 thing we're looking to add. We were exploring multiple
[00:36:04] Unknown:
engines for DBT. So instead of just BigQuery, we're looking at accelerating and/or reducing costs by continuing to use our Presto cluster for lower frequency or lower tier datasets.
[00:36:15] Unknown:
Yeah. So kind of exploring what, like, a multi engine setup of DBT looks like. Because right now, each DBT project is tied to 1 engine. And if we had multiple, the DAGs wouldn't really have awareness of 1 another, so building 1 more level of abstraction on top of the engines. And as our DAG grows larger, we've, like, definitely
[00:36:33] Unknown:
been thinking about, like, lazily materializing datasets. And we've applied this technique to some of our other, like, schema management projects in the past. If a dataset is not queried in Mode or somewhere else for 90 days, we'll just stop materializing it and just start materializing it again when someone accesses
[00:36:50] Unknown:
it. 1 more frontier that we'd also be interested in exploring is using SQL to write our streaming jobs or models. A good candidate for this seems like Flink SQL, so we'll be looking into
[00:37:01] Unknown:
prototyping an adapter for that. The pendulum swing of SQL has definitely been going pretty full bore in the direction of being in favor of it. And the fact that there are a number of different projects that are pushing on SQL for streaming engines is a pretty strong indicator for that. So I'll definitely be interested to see some of the ways that dbt can be used within those streaming contexts for being able to manage the pipeline of transformations that you're generating across these unbounded datasets. And so now that you've been using dbt and managing this data warehouse for a while now, I'm curious what you have seen as some of the unexpected or innovative or surprising ways that the DBT toolchain is being used or that the datasets that you're building are being managed or used for interesting use cases within the business?
[00:37:52] Unknown:
We launched for general usage about a quarter ago, and we've also launched with a pretty constrained use case of serving internal analytics with a freshness of 1 day for datasets. So we haven't really seen a wide variety of unexpected use cases quite yet. We have seen some pretty interesting kind of dynamically generated SQL queries with Jinja, and we've seen some interesting side effects. 1 of those has been that data scientists are just leveraging storage optimizations such as partitioning and clustering a lot more readily, since dbt's interface for specifying such optimizations is really lightweight.
Another kind of forward looking statement that we haven't observed yet but will likely be true is that we do expect more full stack engineers to adopt a SQL based modeling framework as opposed to learning something like Spark and managing a Spark based pipeline.
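That lightweight interface is dbt's model config block; a hypothetical example of opting a BigQuery model into partitioning and clustering (the field and model names are illustrative):

```sql
-- Partitioning and clustering are a few lines of config in dbt on BigQuery,
-- rather than a separate preprocessing step in a Spark pipeline.
{{
    config(
        materialized = 'table',
        partition_by = {'field': 'order_date', 'data_type': 'date'},
        cluster_by = ['shop_id']
    )
}}

select
    date(created_at) as order_date,
    shop_id,
    total_price
from {{ ref('base_orders') }}
```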
[00:38:39] Unknown:
In your own experience of managing this project from the beginning and sharing it throughout the organization, onboarding new people to it, I'm wondering what are some of the interesting or unexpected or challenging lessons that you've learned in that process.
[00:38:56] Unknown:
As we started this project, we rolled it out first as a private alpha for a small set of really keen data scientists and found some of our early feedback to be really valuable. 1 of our initial shortcomings was that we tried to implement a CSV based style of unit testing where data scientists would encode their static data in, you know, not inline, just actual CSV files. And we found that this led to some pretty bad unit testing practices. Specifically, unit tests should really fail for 1 reason, but we found that because of the overhead of creating a CSV file for every input, data scientists were really kind of clumping the unit tests into 1 really large mega sequence of CSV files, which would make it really, really difficult to debug or review. And their only real alternative was to make a ton of unit tests and consequently a ton of CSV files, which would similarly be very difficult to review and maintain. And we were kind of hiding all this complexity inside, like, a folder of CSV files. That was 1 of the biggest pieces of feedback we had from the alpha, and the other 1 was that we were really just very loose with our definitions of the frontroom and backroom, and we really had to set up a dedicated data modeling team that could establish some of these definitions more robustly.
So that would be the backroom, exchange layer, frontroom that we described before. Yeah. There were also,
[00:40:19] Unknown:
like, some organizational challenges for teams that are already using Starscream and are already pretty proficient with the tooling. So we have to figure out the right incentives for people to move or even rewrite their jobs from Starscream to Seamster. So for now, we're focusing on teams that are building new data models so that we can accelerate them. But in the end, we will have to offer either performance gains or more productivity gains long term for people to consider moving. And we're also thinking that instead of just terminal nodes, if people can slide dbt into their DAG, they can use Starscream for a bit, then dbt for certain models, and then go back to Starscream if they're doing, like, machine learning. That could be a good use case. Yeah. On the note of coexisting systems,
[00:41:04] Unknown:
1 challenge that we ran into for sure was the need to reconcile data models between 2 systems, because an essential part of migrating would ultimately be asserting with confidence that 2 data models are in fact, like, interchangeable and can be swapped out. So to do this, we need to create a diff of the 2 datasets across different systems, and this is just a really manual process and quite involved, to actually ensure that you fix the input datasets such that both systems build on the same set of inputs.
[00:41:34] Unknown:
As you continue to push further with where you're using the Seamster tool and bringing more workloads into the data warehouse, what are some of the cases that you have either already experienced or that you anticipate where you think that using dbt and the data warehouse is the wrong answer and teams should be looking to either other existing solutions or new solutions?
[00:41:58] Unknown:
I think, like, our principle is that SQL is good for 70 to 80% of the use cases, but not for everything. So if you're doing any sort of, like, self join shenanigans or anything, like, that is hard to express declaratively, then maybe you need to go on to a more imperative model and go to a full fledged programming environment. We don't want to end up in a place where you're doing, you know, ninja macro magic to get your SQL queries to work. We would much rather prefer that you just use PySpark directly at that point. Also, if you're doing any sort of, like, customer facing, like, real time reporting or any sort of streaming where you are modeling with Seamster and then loading into Druid or ClickHouse or Pinot, you might just be better off by taking a full engineering solution and using Flink or Dataflow directly and then loading into those data stores. Well, for anybody who wants to get in touch with either of you and follow along with the work that you're doing, I'll have you add your preferred
[00:42:57] Unknown:
contact information to the show notes. And as a final question, I would like to get each of your perspectives on what you see as being the biggest gap in the tooling or technology that's available for data management today. Zeeshan, why don't you go first?
[00:43:09] Unknown:
So my observation being in this field so far has been that there's a lot of tools to process the data and crunch it, but there's not a lot of tooling around data ownership, versioning, and lineage, and especially, like, around PII and handling sensitive information and almost, like, semi PII information, like IPs, user agents, and tracking cookies, because in certain jurisdictions, a combination of those could be considered PII. So it is incredibly hard for data scientists to figure out, you know, how do they handle these things? How do they, like, rotate them out? If they're using some sort of, like, redaction system, if they need to make a breaking change, how do they coordinate all these breaking changes with other teams and other data science teams? So we're doing a lot of this by educating people, but I think, realistically, the modeling tooling should be integrating very tightly with the metadata tooling. So I'm almost imagining, like, you know, some sort of, like, VS Code type feel, where as you're making data models, you're getting annotations and notifications that you need to be thinking about, like, these PII and versioning and backfilling things.
[00:44:17] Unknown:
Yeah. I've definitely seen some cases of using the metadata management and being able to label things like PII. But having that feed back into your development environment as you're iterating on the models and the code that you're using to process them is definitely an interesting and useful next step. I'll have to see if I can find anything that's even trending in that direction.
[00:44:37] Unknown:
Michelle, do you have anything to add? I definitely agree that there's sort of a lack of a more centralized, almost like metadata schema for datasets from a platform tooling perspective, and we're finding more and more as we're building new tools in the data platform that every tool needs to reinvent its ownership model, or how lineage is done, or how PII annotation is done, or monitoring latency. So something that kind of syncs together all of this in a standardized, more holistic way that can be leveraged for a consistent user experience.
[00:45:05] Unknown:
Well, thank you both very much for taking the time today to join me and share your experience of building out a data warehouse platform and managing the toolchain around DBT, and your experience of working with that and managing that as you continue to scale with it. It's definitely a very interesting and useful problem space, and it's valuable to hear from people who are going down that path so that it helps other people who are coming along after them. So I appreciate all the time and energy you've put into that and the time you've taken to share it, and I hope you enjoy the rest of your day. Happy to be here. Yeah. Thanks for having us. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction to Guests and Their Roles at Shopify
Overview of Shopify's Data Sources and Warehouse
Organizational Structure of Shopify's Data Teams
Challenges and Tipping Points in Data Management
Choosing BigQuery and DBT for Data Warehousing
Project Structure and Onboarding with DBT
Data Modeling Approaches and Best Practices
Lessons from Using BigQuery and Redshift
Integration with Broader Data Platform
Continuous Deployment and Incremental Builds
Data Discoverability and Lineage Tracking
Team Patterns and Best Practices
Scalability and Performance Optimization
Master Data Management and Canonical References
Future Improvements and Features
Innovative Uses and Lessons Learned
Coexisting Systems and Migration Challenges
When Not to Use DBT and Data Warehouse
Biggest Gaps in Data Management Tooling