Summary
Data engineering is all about building workflows, pipelines, systems, and interfaces to provide stable and reliable data. Your data can be stable and wrong, but then it isn't reliable. Confidence in your data is achieved through constant validation and testing. Datafold has invested a lot of time into integrating with the workflow of dbt projects to add early verification that the changes you are making are correct. In this episode Gleb Mezhanskiy shares some valuable advice and insights into how you can build reliable and well-tested data assets with dbt and data-diff.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudderstack
- Your host is Tobias Macey and today I'm interviewing Gleb Mezhanskiy about how to test your dbt projects with Datafold
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Datafold is and what's new since we last spoke? (July 2021 and July 2022 about data-diff)
- What are the roadblocks to data testing/validation that you see teams run into most often?
- How does the tooling used contribute to/help address those roadblocks?
- What are some of the error conditions/failure modes that data-diff can help identify in a dbt project?
- What are some examples of tests that need to be implemented by the engineer?
- In your experience working with data teams, what typically constitutes the "staging area" for a dbt project? (e.g. separate warehouse, namespaced tables, snowflake data copies, lakefs, etc.)
- Given a dbt project that is well tested and has data-diff as part of the validation suite, what are the challenges that teams face in managing the feedback cycle of running those tests?
- In application development there is the idea of the "testing pyramid", consisting of unit tests, integration tests, system tests, etc. What are the parallels to that in data projects?
- What are the limitations of the data ecosystem that make testing a bigger challenge than it might otherwise be?
- Beyond test execution, what are the other aspects of data health that need to be included in the development and deployment workflow of dbt projects? (e.g. freshness, time to delivery, etc.)
- What are the most interesting, innovative, or unexpected ways that you have seen Datafold and/or data-diff used for testing dbt projects?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on dbt testing internally or with your customers?
- When is Datafold/data-diff the wrong choice for dbt projects?
- What do you have planned for the future of Datafold?
Contact Info
Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Datafold
- data-diff
- dbt
- Dagster
- dbt-cloud slim CI
- GitHub Actions
- Jenkins
- Circle CI
- Dolt
- Malloy
- LakeFS
- Planetscale
- Snowflake Zero Copy Cloning
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Special Guest: Gleb Mezhanskiy.
Sponsored By:
- Rudderstack: ![Rudderstack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png) RudderStack provides all your customer data pipelines in one platform. You can collect, transform, and route data across your entire stack with its event streaming, ETL, and reverse ETL pipelines. RudderStack’s warehouse-first approach means it does not store sensitive information, and it allows you to leverage your existing data warehouse/data lake infrastructure to build a single source of truth for every team. RudderStack also supports real-time use cases. You can Implement RudderStack SDKs once, then automatically send events to your warehouse and 150+ business tools, and you’ll never have to worry about API changes again. Visit [dataengineeringpodcast.com/rudderstack](https://www.dataengineeringpodcast.com/rudderstack) to sign up for free today, and snag a free T-Shirt just for being a Data Engineering Podcast listener.
Hello, and welcome to the Data Engineering podcast, the show about modern data management. RudderStack makes it easy for data teams to build a customer data platform on their own warehouse. Use their state of the art pipelines to collect all of your data, build a complete view of your customer, and sync it to every downstream tool. Sign up for free at dataengineeringpodcast.com/rudderstack today. Your host is Tobias Macey, and today I'm interviewing Gleb Mezhanskiy about how to test your dbt projects with Datafold. So, Gleb, can you start by introducing yourself?
[00:00:43] Unknown:
Yes. Hi, Tobias. Great to be here on your podcast again. So my name is Gleb. I am CEO and cofounder of Datafold. We automate testing for data and analytics engineers. Before Datafold, I've been a data engineer, data product manager, and data analyst for pretty much my entire career and got to work at companies of pretty much all stages and build multiple data stacks. So a lot of what we're doing here at Datafold is informed by my previous experiences and my passion for building great tools for data engineers like myself.
[00:01:24] Unknown:
In terms of Datafold, for people who haven't listened to your past appearances, if you wanna just give a brief overview about what it is that you're building there, and then maybe we can talk through what's happened since the last time we spoke. The last time you were on was July of last year, so it looks like we have about a yearly cadence. Last year in July, we talked about the open sourcing of the data-diff tool, and the previous July was, I think, your first appearance on the show, where we talked about Datafold when it was a little bit earlier in its journey. So if you wanna just kind of give the overview and share what's new.
[00:02:00] Unknown:
Yeah. Absolutely. So we focus on automating testing for data developers. Specifically, at the moment, we're focused on enabling teams building on dbt as their tool for data orchestration. The way we define data testing is essentially making sure that the code that data developers write, which is typically SQL plus the dbt-specific domain-specific language, is actually correct before it gets deployed. So, ultimately, the goal for us is to make sure that data teams can ship their data products, which are, you know, the datasets and downstream applications they're building on top of the dbt models, with high quality, and that they ship them faster, and those things go hand in hand.
So we focused on making sure that the code is verified and validated during development, so when developers are actually writing it, and then during the deployment process, so during pull request review and CI stages. And the way it works is it's fundamentally based on two technologies. One is called data-diff. It's actually an open source tool whose name is self-descriptive. data-diff helps you compare tables of any size within or across relational databases. So you can use the tool to, for example, compare your development and production tables, or you can use it to compare the output of two different SQL queries, and so on. And the way that data-diff is applied by Datafold is we use it to help data developers essentially preview the changes they're making to the dbt models.
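As a rough illustration of what that looks like with the open source CLI (the connection strings, schema and table names, and key column below are placeholders, and flag names can vary between data-diff versions):

```shell
# Diff a development build of a model against its production counterpart
# within a single database (one URI, two tables).
data-diff \
  "postgresql://analyst:secret@localhost:5432/analytics" \
  analytics_prod.dim_users \
  analytics_dev.dim_users \
  --key-columns user_id

# The same tool can also diff across databases, e.g. to validate a replication job.
data-diff \
  "postgresql://analyst:secret@localhost:5432/analytics" dim_users \
  "snowflake://analyst:secret@my_account/analytics/public?warehouse=compute_wh&role=analyst" dim_users \
  --key-columns user_id
```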
In other words, making sure that whenever the code is changed, the developer is fully aware of the impact that this code change has on the actual data produced. And that's a really important concept for testing, and I'm kind of excited to explore it more. And then the second component of the platform is data lineage. So, essentially, Datafold mines the metadata, analyzes the logs from the data warehouse, and integrates with BI tools and different data applications in order to build a global graph of dependencies for the data platform, to essentially understand how every column and every table is produced and consumed, and how the end nodes of data consumption, for example, BI dashboards or reverse ETL syncs into business apps, are dependent on certain tables and columns in the warehouse and in the dbt project.
And the way that those two technologies come together is we can help data developers understand exactly what's gonna be the impact of the code change, not just in terms of data, for example, if I change a given dbt model, what's gonna be the output, how my columns are going to change, but also see that impact through the entire data platform. So see how the downstream tables and metrics and columns will be affected, see how the dashboards and, you know, reverse ETL syncs will also be impacted, and all that information is provided to the developer, to the code reviewer, and to the data team manager during the code deployment process.
And that helps enhance the visibility and ultimately make sure that the team is comfortable making any kind of change and can deploy things quickly and with high confidence. So that is essentially the essence of automated testing.
[00:06:01] Unknown:
And testing in the broadest sense in the software ecosystem has a number of different aspects to it. Software application teams will typically have something called the testing pyramid, which focuses on unit tests and integration tests and system tests and then maybe some manual tests. Testing in data is still... there's not a lot of rigor or structure around how best to do it. There are people who are doing a great job of it. There are people who aren't doing any of it. There are people who are saying testing is actually just anomaly detection in your data warehouse. I'm just wondering if you can talk to what you see as being the main roadblocks that teams run into in the space of data testing and validation and some of the reasons that they maybe just give up after they first start to explore what is possible and what they can actually get done?
[00:06:51] Unknown:
Yeah. Absolutely. I think testing is almost a philosophical question because it can indeed be defined differently depending on how you view the problem. To me, it comes down to making sure that the data products we deliver, and by data products, I mean tables, columns, dashboards, machine learning models, meet the expectations of their users. And those expectations typically come down to two fundamental questions. Do we believe that the data accurately represents the reality, the business reality? Right? So, for example, if a model is describing our metrics, our KPIs or OKRs for the business, do we believe that those actually are correct representations of those metrics?
If our data describes our users, for example, it's a dimension table that should contain every user of our application or our business, does it actually contain accurate information about our users? The second dimension of that expectation is, does the data stay consistent over time? And what's interesting is that I think by now, we realize that we can't really treat accuracy and completeness of data as something absolute just because no one actually knows what the ground truth is, especially at large companies with lots of different complex datasets and complex businesses.
There are rarely cases where we can say that our data is completely accurate and that it counts every single transaction, every single order, every single event, except for those use cases where that's absolutely warranted. And so what matters is not necessarily that the data is absolutely accurate, but that it stays consistent over time. Because if it's inconsistent and we are changing the definition of how we count active users or how we account for revenue, then it's really hard to make any business decisions and make sure that the business users are comfortable using that data day to day in their work. And so testing comes down to essentially two things.
Right? Making sure that we believe the data is accurately representing the reality and that it stays consistent, that the definitions stay consistent over time. And when definitions are changing, that's also okay, but we have to make sure that both the developer and the stakeholders are actually well aware of those changes and aligned on them. Right? So changing the definition of an active session or an active user sometimes is needed, and many companies, especially in the growth stage, are doing that regularly, but it cannot happen unexpectedly, because that undermines the fundamental value of whatever data products we are building. And so if we start with that assumption of what testing is and we go into the how, you mentioned kind of a parallel to software engineering.
I think that testing in data is somewhat more complex than in software engineering because, one, as a software engineer, typically, I can fairly easily tell whether what I'm doing is working. As a back end engineer, if I wrote a function, I can run a unit test. Right? If I write an API, I can probably run a few sets of tests, integration tests and unit tests, to make sure it's correct. If I'm a front end engineer, I can do the same, plus I can also probably preview the results of my work. And there's a very concrete definition usually of what it means for the system to work correctly. Whereas in the data world, one, the definition of correctness is often very loose, and two, oftentimes it's really hard to just decouple things from one another. And a good example of that is decoupling data from code. So a lot of times in unit testing, you are passing standardized inputs into your unit test, and you're expecting a very predefined set of outputs.
And that pattern is notoriously hard to replicate in the data world because you have to curate datasets that would actually meaningfully represent the whole complexity of your data, which is, in my experience, extremely hard to condense into a set of mock, unit-testing-style data inputs. And so you kind of end up testing the data and the code at the same time, which makes this noisy, and that is one problem. The other challenge is, in general, I would say that the explosion of all the tools in the modern data stack has allowed rapid creation of data assets. So it is very easy to run queries thanks to modern warehouses, it's very easy to package them into pipelines thanks to frameworks like dbt and orchestrators like Dagster, and it's very easy to pump different kinds of datasets into your warehouse.
And so we can build really, really fast, and we can bring in a lot of different datasets, but the approaches to testing haven't evolved as fast as the actual tools to build things. And in terms of what we're seeing data teams struggle with, and certainly something that I've struggled with in my career as a data engineer, is that the complexity of the data I'm dealing with, the number of tables and columns and changes I have to make, is just orders of magnitude larger than the testing I can do, either manually by running queries myself and validating my work, or by writing any kind of assertion test that would essentially say, you know, whether a given column has a certain number of values or is unique or is not null.
And the challenge is that currently, at the disposal of data teams, by default, there are only a handful of approaches that can be leveraged for testing. Right? So one is assertion tests, where you predefine expectations of data and then run them against your models and tables, either in development or in production. Those typically are run against live data, so data in your warehouse. There are some frameworks that allow you to run those as unit tests, but those are currently, I would say, fairly hard to use, although we've seen some companies build unit testing frameworks on top of dbt internally, for example, Shopify.
And then beyond that, your best testing technique used to be just writing a lot of SQL queries and validating that your work was correct. So when we started Datafold, we tried to find a better way to test, to give people visibility into what changes they're making.
[00:13:51] Unknown:
And another aspect of it as well is that in software engineering, testing has, for the most part, become a very widely adopted practice. If you're going to get a job as a software engineer, testing is one of the things that you're expected to know and know well. I'm curious if you're starting to see any similar expectation on the kind of hiring and skill building side for data practitioners to have familiarity with testing and validation, or if it's still very early in that cycle of testing as a skill that you're just going to have to learn as you go through it and experience the pain of not having anything tested?
[00:14:29] Unknown:
Yeah. It's a great question, Tobias. I think that right now, we're still in the phase of really defining the best ways to test data products and data pipelines. I think with the advent of dbt and dbt becoming sort of a standard platform for running SQL-first data pipelines, the prevalent pattern for testing has been writing what are called dbt tests, which are essentially assertion tests that can validate specific assumptions about columns or tables. And this framework actually has matured quite a bit and includes different kinds of tests, starting from validating uniqueness of columns or the not-null condition for columns, to testing referential integrity, so making sure that tables have the same sets of keys and can be joined without duplicates or missing records, or, you know, very complex logic that would pre-aggregate metrics across multiple tables and compare them. So that framework is fairly mature. The challenge is that, because the complexity of data environments is exploding massively, we routinely see teams with hundreds and thousands of dbt models in their project.
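For readers less familiar with dbt, this is roughly what those assertion tests look like in a schema file; the model and column names here are made up for illustration.

```yaml
# models/schema.yml (model and column names are illustrative)
version: 2

models:
  - name: dim_users
    columns:
      - name: user_id
        tests:
          - unique      # uniqueness of the key column
          - not_null    # not-null condition

  - name: fct_orders
    columns:
      - name: user_id
        tests:
          # referential integrity: every order should point at a known user
          - relationships:
              to: ref('dim_users')
              field: user_id
```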
And I think a recent post by Tristan, the CEO of dbt Labs, is very illustrative of the almost exponential explosion of complexity of dbt projects that we're seeing. It is very unlikely that we're gonna get any meaningful coverage of our project in terms of assertion tests. And so the question becomes, what do you do for the 99.99% of cases that can go wrong with your models and the code that you wrote? And here is where I think we, by default, go from kind of a skill into more of, like, a craft and intuition and uncharted territory.
And so I think that's where data developers have to rely on their gut instincts and apply a lot of manual testing and write a lot of queries and really figure out what they should test and what can go wrong. And I think this is, in my experience, incredibly error prone and also, very importantly, tedious and labor intensive work that slows you down, that's not fun, and, most importantly, doesn't necessarily guarantee you good results. And so the approach that we have taken here is to essentially enable impact analysis with the open source data-diff tool, which allows any developer to almost preview their changes in data when they change the code and thereby get full visibility into the impact of the change. So regardless of whether you wrote the tests for all these different conditions, data-diff can help you understand exactly how your data is going to change. And that information alone is incredibly helpful in understanding regressions and any kind of unwanted impacts.
But it's not to say that data-diff replaces assertion tests or dbt tests. I firmly believe that those types of tests need to be written and should be written for the most important assumptions about the data, but we know that they're never going to be exhaustive. We know we're never gonna have the kind of test coverage that we would like to get.
[00:17:51] Unknown:
And that's another interesting aspect as well, the question of coverage, where in a software project, you can measure your coverage based on the number of lines of code that were processed in the step of actually executing your tests. Whereas with data, you can get the coverage metrics for your code, but it's not really straightforward how you would understand what that coverage metric looks like for your overall data, particularly given the fact that there are upstream considerations and downstream considerations. Maybe you can get coverage across one slice of the data layers, but you're not gonna get a comprehensive view of test coverage and what are all the ramifications of those tests for those downstream use cases.
[00:18:32] Unknown:
Yeah. Absolutely. And there are a few inherent problems here that we can see once we start digging on the parallel with software engineering. One of the challenges is that SQL isn't inherently composable, in the sense that when we build a table, that typically takes the form of a pretty long SQL query, and that's almost like a monolithic piece of code that can easily span hundreds of lines with a lot of complex logic in between. What's also hard is that within that query, when we are defining columns, all of these columns are defined at once. Right? We can't really take out the definition of a column and run or test it separately from the definition of the entire table. And that is quite contrary to the design of modern software engineering languages, where we would be able to pick out almost any kind of sophisticated logic into its own function, its own component, its own class, and then be able to test it separately.
So it's very monolithic, and that makes it hard to decompose really complex business logic. And because of that, it's really hard to say, well, what is your actual coverage? Are you trying to cover every column? Are you trying to cover every table? Because every single column can have very different expectations and also failure modes in terms of what can go wrong. What I've seen being really effective in terms of setting standards for testing within teams is starting with really basic assumptions. For example, typically, every table in your data warehouse should have a primary key. Right? So a key that identifies a row uniquely, and it's really important that that key is actually unique. Right? By definition, it should be not null, and the user of a table, so someone who discovers the table and tries to query it, should actually know what the primary key is. And unlike software engineering, where primary key constraints are typically defined, you know, in the data definition language and enforced by the database, we don't really have that luxury in our analytical OLAP world. And so a really good practice that I've observed teams do is essentially require that every table, every dbt model, should have at least one uniqueness test on a column or combination of columns, thereby defining a primary key for the table. And that is pretty easy to write, pretty easy to add, but extremely effective because you can avoid a lot of errors just by doing that. You know? And similarly, when you add any column that contains sophisticated business logic, for example, you may be classifying your customer accounts into different groups or tiers based on certain case-when logic. Right? That can be incredibly complex.
Writing a test that would just describe what kind of values that column should accept is not only helpful in making sure that the definition stays consistent, but it also helps with documentation. So your tests almost become documentation. And so requiring that the key metrics, the key fields that essentially contain the core functionality of every model, are covered with at least some tests is, I think, also a very healthy practice.
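A hedged sketch of those two practices: declaring a primary key via a uniqueness test (here on a combination of columns, which assumes the dbt_utils package is installed) and using accepted_values to pin down the outputs of a case-when classification. All names are hypothetical.

```yaml
# models/schema.yml (illustrative; assumes the dbt_utils package is available)
version: 2

models:
  - name: fct_sessions
    tests:
      # effectively declares the table's primary key as a combination of columns
      - dbt_utils.unique_combination_of_columns:
          combination_of_columns:
            - user_id
            - session_started_at
    columns:
      - name: account_tier
        tests:
          # documents and enforces the outputs of the case-when classification
          - accepted_values:
              values: ['free', 'pro', 'enterprise']
```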
[00:21:51] Unknown:
And in terms of the dbt projects that you've seen and the way that data-diff is being combined with the built-in capabilities of dbt, wondering if you can talk through the types of failure modes and error conditions that you're able to catch with that combination that you're not able to address with dbt alone, or where maybe you need to bring in a more heavyweight testing and validation workflow.
[00:22:17] Unknown:
Yeah. I think, to answer this question, it's helpful to look at the typical deployment process for a dbt workflow. So let's say I'm an analytics engineer, and I am working on refining logic in a table that calculates active users. Basically, it contains all the users, and it defines whether a user is active or not based on certain activity trends that they exhibited. And let's say I need to redefine the definition of an active user based on the duration of the sessions that they've had. Right? So if I change this logic, I would change some SQL. I would probably build that table and maybe some downstream tables in my local development environment.
Once I feel good about it, I would push the change to GitHub or GitLab and open a pull request. Right? And then I would request review. And let's say you are a reviewer. Maybe you are a staff data engineer, maybe you are a data team manager, maybe you're just an expert on this particular domain, or maybe you're just my teammate who is available to review my code. Right? And so how do we make sure that the change I'm making is actually the correct change, and that once deployed, it will do the exact things it's expected to do and not cause any bad effects? Right?
So we can look at the code, at the change to the SQL code, but it's really, really hard to understand, from just reading SQL code, especially when it comes to complex business logic around, you know, session definitions or anything that has to do with a lot of different data sources, what's gonna be the impact on data. So we can't get this from the code alone. Right? We can look at the tests, and let's suppose we have certain tests written, for example, that session duration cannot be less than 0 and that there are only certain types of statuses that a user can have, like active, churned, etcetera.
But that doesn't actually tell us what is gonna be the impact on our business metrics or whether there are gonna be any other impacts that we're not accounting for. So, in other words, we don't really know what we don't know, and this is where data-diff comes into play. So data-diff would tell us, based on the changed SQL code, what is gonna be the difference in the data if that code was merged. For example, data-diff could tell us that if this code was merged and deployed, we would have fewer rows in our users table, because I switched the source of the session events, and maybe the new source actually accounts for fewer users because it started sending later than the previous source.
And I just may not have anticipated that at all. Right? And if I were to just merge this change, I would end up with fewer users. It could also tell me whether the change in definition of an active user I made is actually a correct one, whether I end up with the proper bucketing of users, and help me understand, for a few particular users, whether they're mapped correctly. Right? And it really helps me, as a developer, understand what is gonna be the impact on data, both at a statistical level and at a row level. Now you, Tobias, as a reviewer, when you look at my code, you have the same problem. Right? You have to essentially give me a go or no go, which is a lot of the time quite a bit of responsibility, by just looking at my code and a couple of assertion tests. Right?
And you may have even less context than I do, because I, as a developer, at least wrote the SQL, and you may have no idea what this SQL is supposed to do or what this model is doing. And very importantly, both of us may not know what kind of data dependencies exist downstream. Right? Maybe the finance team builds an export into Google Sheets that they use to present to the board that we're just completely unaware of. And my, let's say, benign change of definitions that I'm doing for my marketing stakeholders is actually going to completely change their calculations for the board retroactively. So, coming back to the consistency topic, that is a very, very bad outcome if, you know, I propose the change, you review it, we approve it, we deploy it, my stakeholders are happy, but then some other important stakeholder I didn't account for is badly impacted. And so data-diff plus lineage really helps solve this problem, because you, as a reviewer looking at my change, would be able to see exactly what is gonna be the impact on data for the model I modified.
You will be able to see what the downstream effects of this change are on other models. Maybe there are certain known datasets, sometimes they're called golden datasets, sometimes they're called metrics datasets, right, that are just sources of truth for metrics, that are also gonna be affected, and we'll know that. And then we will know what kind of BI dashboards or other data apps are gonna be impacted by the change. Right? And all of this information is condensed in the pull request. So, very importantly, it exists exactly at the point in time where we're making a decision whether this code should be shipped or not. And I realize that for a lot of data practitioners, my lengthy description of this process and all of the analysis that's going on may sound like, wait, this sounds like way too much, an overkill, for data work. Right? And I think for certain teams, that maybe is overkill. Right? If you're just building out dbt and you're trying to move fast, that's okay.
But a lot of teams exist in a reality where they have mission critical business use cases, where they either are powering executive reporting or machine learning models, or they activate data through business apps, where such a change as I'm describing, if it happens without the important stakeholders, without the developer, being aware, right, if we're introducing this kind of regression, would actually have a very significant business impact. And so in that case, we actually have to treat the change management process like we would do it in software, which means really understanding what the change is doing and accounting for its impact, to make sure that we have a full kind of deploy preview before merging.
And so that essentially is the goal, to provide full visibility into the change, augmenting the assertion tests that test for very specific types of failures.
[00:28:48] Unknown:
Another interesting aspect of the challenge of testing in data teams is that there isn't really any clear cut way to define or implement a so called staging environment, where, when you're working in software, it's still not a perfect solution, but you can usually at least get closer to replicating what production looks like for a given application architecture, or be able to stub out the interfaces to be able to run a component of a microservices suite on its own to make sure that things are actually executing properly. When you're working with dbt and you're dealing with large amounts of data, you're dealing with a data warehouse, it's, I'm not gonna say impossible, but it's definitely impractical to say, I'm going to copy everything into a different environment and run all my tests there, do all my development there, so that I can be confident in the data once I get to production.
Because by the time you actually get to production, what you are dealing with in your staging environment, if it's a copy, is already out of date. And I'm wondering if you can just talk through some of the ways that you see teams tackling this challenge of creating a staging area for their dbt projects and what you see as some of the relative trade offs of those approaches.
[00:30:01] Unknown:
Yeah. That's a great question, Tobias. I personally think that the simplified management of environments that dbt brought is actually one of the most powerful and often underappreciated features of dbt, and one that is actually quite revolutionary for a typical data engineering workflow. For example, in tools like Airflow, which preceded dbt and where I spent the majority of my career working, there hasn't been a first class concept of a staging environment, and it was notoriously hard to test your changes, and sort of developing in production was the norm because of just how hard it was to create these staging environments.
And I think that, just like in software, staging is incredibly important because that is the part of the process where you can actually validate and test your changes before they are deployed. And it's a very important part of the process, because in a development environment, engineers can do whatever they want. Right? Everyone can have their different setups, use different tools. For example, in the context of dbt, they can be using the dbt Cloud IDE, or they can use dbt core locally and only use Cloud for production. Right? There can be all these different setups. But then staging is a standardized way to say, okay, I have new code. Let me run this code on the data and then see and validate whether this new environment is actually correct and whether I should replicate it in production.
And the types of tests you can run there are massive. So you can do a lot of things, including dbt tests, and you can do diffs. And so the question is, how do we get to those environments? Right? So I think in dbt, the creation of environments themselves has been massively simplified just because of the clever Jinja templating and substitutions and variables. So that's one aspect. The second aspect is it's not just enough to say, okay, we are going to run something in a different environment. We also have to make sure that the data that we use to run the new code is actually representative. And there are a few techniques here. So one technique is just to use production data and build everything off the production data. That is sometimes beneficial, but sometimes it can be quite costly, because you could run a lot of computation just to validate a single change.
And so dbt actually evolved quite a bit in terms of making those staging builds efficient. So there is a slim CI feature in dbt Cloud that allows you to build a staging environment by reusing as much data from the production run as possible, so that you're only building the things that are actually going to change, and otherwise reusing the data from the production environment, which saves quite a bit of time and compute cost. Similar things can be done on dbt core with the state and defer flags. And the other aspect of building staging and development environments that has improved is the evolution of the modern cloud warehouses, with features such as zero-copy cloning, which allows you to essentially create a completely new environment without having to copy data. So developers can actually get a complete copy of production environments, so potentially terabytes, petabytes of data, but they are virtualized in the sense that nothing actually changes unless the developer makes a change. And so this allows us to develop and build things and test things without incurring massive compute or storage costs.
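A rough sketch of that slim CI pattern with dbt core; the artifacts path, target name, and the optional zero-copy clone statement are assumptions about a hypothetical setup rather than anything prescribed by dbt itself.

```shell
# Optionally, in Snowflake, give CI its own zero-copy clone of production to build into:
#   CREATE DATABASE analytics_ci CLONE analytics;

# ./prod-artifacts is assumed to hold the manifest.json from the last production run.
# state:modified+ selects only models whose code changed, plus their descendants;
# --defer resolves refs to unchanged upstream models against the production run.
dbt build \
  --select state:modified+ \
  --defer \
  --state ./prod-artifacts \
  --target ci
```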
So the third component of that is running a continuous integration process. The staging environment doesn't build itself. Right? You have to have a process that does so for every change that you're making, and typically that is done once code is checked into the repository and we open a pull request or merge request. And then we have the CI, continuous integration, pipeline run certain validations, build the staging environment, and run the tests on that staging environment. And so for dbt Cloud users, that is very simple to do, because dbt Cloud comes with kind of all the batteries included for running those. dbt core users can use standalone CI runners, borrowing the stack from software engineers, so runners like GitHub Actions, CircleCI, or Jenkins can be used for doing that. And, basically, those three things combined allow us to make sure that every single change that we make to the code base goes through the continuous integration staging environment with real data. We run all the tests we want, so linting, assertion tests, data-diff. And then if all those tests run correctly and are good to go, the reviewer gives us a final thumbs up, and then the code is merged. So I think that CI and staging environments are something that, coming back to the best practices question, every data team running dbt these days should build, because it's very easy, and it can really elevate the speed and quality of the work of the entire team.
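To make the CI piece concrete, here is a hedged sketch of a GitHub Actions workflow that builds a staging environment for each pull request and then diffs it against production; the adapter, secret names, helper script, and artifact paths are all assumptions about a hypothetical project, not a prescribed setup.

```yaml
# .github/workflows/dbt-ci.yml (illustrative; adapter, secrets, and paths are placeholders)
name: dbt CI
on:
  pull_request:

jobs:
  build-and-diff:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Install dbt and data-diff
        run: pip install dbt-snowflake 'data-diff[snowflake]'

      # Assumes the production manifest.json is published somewhere reachable;
      # the fetch script is a hypothetical helper, not part of dbt or data-diff.
      - name: Fetch production artifacts
        run: ./scripts/fetch_prod_manifest.sh ./prod-artifacts

      - name: Build changed models into a CI schema
        run: dbt build --select state:modified+ --defer --state ./prod-artifacts --target ci
        env:
          SNOWFLAKE_PASSWORD: ${{ secrets.SNOWFLAKE_PASSWORD }}

      - name: Diff changed models against production
        run: data-diff --dbt   # assumes data-diff's dbt integration is configured
```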
[00:35:16] Unknown:
Another interesting aspect of the staging question is the ability to do that branching and merging workflow, where there are some examples of that out in the wild, but it's largely still unsupported. In particular, I see LakeFS as an opportunity for that. Snowflake with their zero-copy cloning approximates that to some extent. There are projects like PlanetScale that have that branch and merge capability built into the database layer. I'm just curious whether you see that as a necessity or a nice to have in this overall space of being able to streamline the development and validation and deployment workflow for data teams, and if you've seen any substantial movement in that direction cohesively across the ecosystem.
[00:36:07] Unknown:
Yeah. Absolutely. I think that Git for data, for actual data, has been discussed for quite a while now, and you mentioned a few fantastic technologies. I would also add Dolt, which is a very Git-like protocol for managing relational datasets. In my view and my experience as a data engineer, I think that the versioning and merging and the typical Git-like procedures are most important for dealing with the code itself. However, once you have the, you know, staging environments and dev environments built and tested, I'm not sure that the complexity of adding data branching and versioning is actually worth it, in terms of implementing that procedure to merge the data, versus just rerunning it in production, which is the prevailing pattern at the moment. So, ultimately, it seems that the two patterns that exist there concern what happens once the code is merged. Right? We either take the new code from the main branch and run it in production, and then we update all the data, and now we have our code changes applied to production, or we build a staging environment and then merge the staging environment data into the production data using some of the techniques that, let's say, LakeFS can offer. And to be honest, I'm not sure that the typical analytical workflow is where those technologies should be really applied.
I can see them being incredibly useful in two instances, personally. One is when we are curating datasets and we have to have incredibly high accuracy and collaboration on curating particular datasets. For example, if an organization has any sort of repository, or if we have to maintain, in the form of relational tables, any kind of information that has to be incredibly accurate and updated by different people or processes, then using Git-like techniques with Dolt, for example, could be a very useful and interesting way to have great collaboration on data. I haven't seen that being used as often in the typical analytical workflow because of the added complexity, and because the focus of the typical data workflow is to build at scale, fast, with as much interoperability as possible. And so, typically, those workflows would be done on kind of general purpose OLAP engines versus the specialized versioned engines.
Although we're seeing that the major warehousing players are also looking to add those features, I think we'll have to see how exactly they will impact the workflow. But at the moment, provided that we're using code versioning, we're building staging environments in CI, and we have comprehensive testing, I'm not convinced that applying versioning to the data itself is actually going to have that much of an impact on the typical data workflow.
[00:39:06] Unknown:
And in terms of the overall ecosystem, I'm curious if there are any just general design paradigms or community attitudes that contribute to the overall challenge of being able to test with data. Is it just that testing data is inherently problematic and always will be, or is it that the people who are building all of the technologies that support the data infrastructure haven't put in the time to actually simplify the path of validating our changes?
[00:39:39] Unknown:
That's a great question, Tobias, and I think it's both. I think there are certain things that are inherently hard about testing data. Some of them we mentioned, such as the absence of ground truth, the complexity of datasets and source data, and the languages we use to express our data transformations. Right? So I think that we're making improvements in a lot of those directions, and we also see new languages come up, such as, for example, Malloy, which is way more powerful, composable, and expressive than SQL, and which has the ambition to really solve some of these problems.
For the time being, though, I don't think that SQL is going away, so we will probably be dealing with these challenges for quite some time. I think it's a question of maturity of the overall stack. For the last few years, it was focused on the creation of things, and now we are in a different phase where a lot of the focus is around managing complexity, managing reliability, and enabling more and more people to contribute to the code bases. And I think that the magic of dbt has been that it enabled companies to include a vastly greater number of people in the organization to contribute to the analytical code base, which is kind of a new wave. Right? We had the wave of self-serve BI probably about 7 years ago, and now we have the wave of self-serve data development, where people can not only consume data through predefined interfaces, they can actually create their own datasets for their needs and then get them in the tools and in the forms that they want. And so that is a massive people problem, in the sense that we have to build tools that enable developers to do the right thing easily. And that is the fundamental belief that I'm trying to apply at Datafold: testing is hard. It is tedious. It is error prone, and it requires a lot of skill and craftsmanship to do manually well.
And so I wanna build tools that take away as much manual work as possible and make it easy for data practitioners to do the right thing, which is, for example, to understand the full impact of their change. Because, ultimately, the data quality problem is less of a technology problem. It's a problem of having a great workflow where your tools are supporting you in the decisions that you make. Data developers shouldn't be spending as much time writing all these manual validations and auditing and testing. They should spend their time thinking about the business problem and how to apply that to the code they're writing and how to best present the data to the stakeholders.
And then the tools should make it easy for them to do this as fast and reliably as possible. That's ultimately what Datafold's mission is.
[00:42:32] Unknown:
Are there any other aspects of the work that you're doing at Datafold, the open source data-diff tool, or the overall process of testing dbt projects that we didn't discuss yet that you'd like to cover before we close out the show? Yeah. I think one of the
[00:42:46] Unknown:
frequent questions I get is what's the difference between our open source tool, data-diff, and our cloud product? So I think it's worth clarifying. So data-diff is a utility tool that helps compare data across databases or within a database at scale. It has an integration with dbt, which allows data engineers to plug it into their development workflow. So it can be installed into the dbt project as a standalone package, but it reuses a lot of the configuration and credentials from the local dbt setup, and it allows you to essentially do a dbt run or dbt build and then follow that with a data-diff command, and that gives you the difference between data in your local environment and production. Essentially, it tells you how the code change that you've just made is going to impact the data locally, and the output of that tool is provided in the terminal, so in a command line interface.
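Concretely, that local loop looks something like the sketch below; it assumes the data-diff package is installed alongside the dbt project and that its dbt integration (production schema, primary key tags) has already been configured, and exact flags may differ by version. The model name is a placeholder.

```shell
# Build the model(s) you changed into your development schema
dbt run --select dim_users

# Then compare the development build against production; the dbt integration
# reuses the connection credentials from your local dbt profile.
data-diff --dbt
```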
Now the cloud product is focused on enabling teams. And so when it comes to automating testing for teams, it's really important to, one, make sure that the testing is not necessarily done by individual engineers, but is actually performed by an automated process as part of continuous integration. And so the cloud product offers GitHub and GitLab apps that run diffing automatically for every pull request that is opened against the dbt project. It also provides collaboration features around understanding data diffs at greater depth.
So the open source data-diff is CLI only, whereas in the cloud, you have a lot of different slices of the data. You can look at diffs on a statistical level and see how your change affected a given distribution of a metric. You can also look at the diff on a visual basis for every row in your table. So you can see, for example, for a given user ID, how a given dimension changed. And, also very importantly, the cloud product offers the column-level lineage capability that ultimately stretches the impact analysis out from just the model that we touched to all the downstream models and then the downstream BI tools.
So, typically, we see individual analytics engineers and data engineers adopt our open source tool for their individual workflows, because it just speeds up your development and helps you build faster and with higher confidence. And then teams adopt the cloud product to really roll out automated testing for their entire team and whoever wants to contribute to the dbt project as well.
[00:45:26] Unknown:
Well, for anybody who wants to get in touch with you and follow along with the work that you and your team are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today? It's hard to point to a particular
[00:45:44] Unknown:
narrow gap. I think, in general, I believe that people should be able to do more with data. More specifically, data engineers are doing way too many things manually. They're writing too much code. They are doing too much manual code reading, investigation, and testing relative to their software engineering counterparts. I just think that in software engineering, tools can do way, way more for you than they currently do for data developers. And it's not necessarily a single gap. It's probably a thousand different small gaps that exist that we really aim to solve for data practitioners and data developers. So my goal is to make sure that data engineering is a really enjoyable experience that feels productive, that is relatively easy to master because of how much the tools can actually do for you, so that you think more about building your datasets, which is incredibly important for the business in general. You spend more time thinking about the questions of what kind of data definitions you need and how your business logic should apply to the data, and less on the technical details: how this code should be written, what tests should I write, how do I validate this change, you know, how do I write this part of the documentation or this string of the YAML config?
[00:47:09] Unknown:
Absolutely. Well, thank you very much for taking the time today to join me and share the work that you've been doing at Datafold and some of the ways that teams who are building dbt projects can think about how to improve their overall testing and validation strategies. It's definitely always a good topic to explore, so I appreciate the time and energy that you and your team are putting into making that an easier problem to solve, and I hope you enjoy the rest of your day. Thank you, Tobias.
[00:47:40] Unknown:
Thank you for listening. Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction to Gleb Mezhanskiy and Datafold
Overview and Evolution of Datafold
Automating Testing for Data Developers
Challenges in Data Testing and Validation
Skill Building and Testing Expectations for Data Practitioners
Failure Modes and Error Conditions in dbt Projects
Staging Environments and Continuous Integration
Branching and Merging Workflows in Data Projects
Challenges and Future Directions in Data Testing
Differences Between Datafold's Open Source Tool and Cloud Product
Biggest Gaps in Data Management Tooling