Summary
A perennial challenge for data engineers is ensuring that information is integrated reliably. While it is straightforward to know whether a synchronization process succeeded, it is not always clear whether every record was copied correctly. In order to quickly identify if and how two data systems are out of sync, Gleb Mezhanskiy and Simon Eskildsen partnered to create the open source data-diff utility. In this episode they explain how the utility is implemented to run quickly and how you can start using it in your own data workflows to ensure that your data warehouse isn’t missing any records from your source systems.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
- Random data doesn’t do it — and production data is not safe (or legal) for developers to use. What if you could mimic your entire production database to create a realistic dataset with zero sensitive data? Tonic.ai does exactly that. With Tonic, you can generate fake data that looks, acts, and behaves like production because it’s made from production. Using universal data connectors and a flexible API, Tonic integrates seamlessly into your existing pipelines and allows you to shape and size your data to the scale, realism, and degree of privacy that you need. The platform offers advanced subsetting, secure de-identification, and ML-driven data synthesis to create targeted test data for all of your pre-production environments. Your newly mimicked datasets are safe to share with developers, QA, data scientists—heck, even distributed teams around the world. Shorten development cycles, eliminate the need for cumbersome data pipeline work, and mathematically guarantee the privacy of your data, with Tonic.ai. Data Engineering Podcast listeners can sign up for a free 2-week sandbox account, go to dataengineeringpodcast.com/tonic today to give it a try!
- Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.
- RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
- Your host is Tobias Macey and today I’m interviewing Gleb Mezhanskiy and Simon Eskildsen about their work to open source the data diff utility that they have been building at Datafold
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what the data diff tool is and the story behind it?
- What was your motivation for going through the process of releasing your data diff functionality as an open source utility?
- What are some of the ways that data-diff composes with other data quality tools? (e.g. Great Expectations, Soda SQL, etc.)
- Can you describe how data-diff is implemented?
- Given the target of having a performant and scalable utility how did you approach the question of language selection?
- What are some of the ways that you have seen data-diff incorporated in the workflow of data teams?
- What were the steps that you needed to do to get the project cleaned up and separated from your internal implementation for release as open source?
- What are the most interesting, innovative, or unexpected ways that you have seen data-diff used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on data-diff?
- When is data-diff the wrong choice?
- What do you have planned for the future of data-diff?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
- Datafold
- data-diff
- Autodesk
- Airbyte
- Debezium
- Napkin Math newsletter
- Airflow
- Dagster
- Great Expectations
- dbt
- Trino
- Preql
- Erez Shinan
- Fivetran
- md5
- CRC32
- Merkle Tree
- Locally Optimistic
- Presto
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Special Guests: Gleb Mezhanskiy and Simon Eskildsen.
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends at Linode. With their new managed database service, you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes with automated backups, 40 gigabit connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show.
Random data doesn't do it, and production data is not safe or legal for developers to use. What if you could mimic your entire production database to create a realistic dataset with zero sensitive data? Tonic.ai does exactly that. With Tonic, you can generate fake data that looks, acts, and behaves like production because it's made from production. Using universal data connectors and a flexible API, Tonic integrates seamlessly into your existing pipelines and allows you to shape and size your data to the scale, realism, and degree of privacy that you need. The platform offers advanced subsetting, secure de-identification, and ML-driven data synthesis to create targeted test data for all of your pre-production environments.
Your newly mimicked datasets are safe to share with developers, QA, data scientists, heck, even distributed teams around the world. Shorten development cycles, eliminate the need for cumbersome data pipeline work, and mathematically guarantee the privacy of your data with Tonic.ai. Data Engineering Podcast listeners can sign up for a free 2 week sandbox account. So go to dataengineeringpodcast.com/tonic today. Your host is Tobias Macey. And today, I'm interviewing Gleb Mezhanskiy and Simon Eskildsen about their work to open source the data-diff utility that they have been building at Datafold. So, Gleb, can you start by introducing yourself?
[00:02:07] Unknown:
Super excited to be here again, Tobias. Thanks for having us. I'm Gleb, CEO and cofounder of Datafold. Datafold is a data quality platform. We automate data quality assurance for data engineers and analytics engineers. We do this by automating all the daunting tasks that are related to testing things and observing data in general. And, Simon, how about yourself?
[00:02:28] Unknown:
My name is Simon. I used to be a principal engineer at Shopify, where I worked primarily on the infrastructure. These days, I run a little consultancy where I work with half a dozen companies on the most interesting engineering challenges that I can find, and this is certainly one of them. So I'm really excited to talk about it today.
[00:02:46] Unknown:
And going back to you, Gleb, for folks who haven't listened to your prior interviews, if you can just refresh our memory about how you first got involved working in the space of data. Yeah. Absolutely. I started working in the space of data back in 2013,
[00:03:00] Unknown:
and I happened to be the first data engineer at Autodesk's consumer group. So a very large company building with a lot of really cool technologies, let's say leading edge at that time, that were emerging in the data space. And then I joined Lyft and stayed there for about 3 years, was one of the founding members of Lyft's data team. My journey there was primarily related to building tools to enable data engineers and data scientists to create things and not break them. And I've certainly done a lot of breaking myself. And then most recently, for the last 2 years, I've been working on Datafold, where we've been making data engineering a more enjoyable and productive experience. So sort of continuing the mission that I developed while working in the industry.
[00:03:45] Unknown:
And, Simon, how did you first start working with data?
[00:03:47] Unknown:
I actually don't have a ton of data experience. The majority of the work that I did at Shopify for the past 8 years or so was on the production engineering and infrastructure side. Essentially just playing whack-a-mole with all the bottlenecks as we, you know, kind of had the sliding window of every Kardashian family member doing their big launches on Shopify, and making sure that we would stay up and stay ahead of it. That was the role that I worked on for a long time there, especially database adjacent work, because that kind of becomes your fundamental bottleneck. At the end of my time at Shopify, I founded a little department within the company called New Capabilities, where we took on all kinds of technical limitations within the company. And one of the ones that I was working on was a new search platform.
And for that, of course, search is one of these things that really straddles the boundary between the production engineering realm and the data realm, because you have to funnel all this data to the search cluster. You have to massage it, but you also have the search cluster serving a lot of queries, right, online. So that was one of those barriers that I was trying to break down, and it exposed me to a lot more of the data tooling, because there's a little bit of a weird chasm between the production engineering space and the data engineering space.
[00:05:01] Unknown:
Now, in terms of this data-diff tool that we're talking about today, I'm wondering if you can just start by describing what it is and some of the story behind how it came to be and why you're now releasing it as open source?
[00:05:13] Unknown:
Yeah. Absolutely. So data-diff does a very simple thing in principle. It compares 2 datasets or 2 data sources and tells you the difference between them, the same way a diff tool works for Word documents or what git diff does for source code. And we found that this tool has lots of fundamental applications in the data engineering workflow and other workflows as well. And the general story behind it is, back when we started Datafold, we were looking at the data engineering workflow, and we were seeing a lot of things that were suboptimal in the sense that people were spending a lot of time doing manual work, and that work seemed to be a good candidate for automation.
For example, one of the most painful things that I observed being a data engineer is change management, because in data pipelines, typically there is a lot of business logic which is expressed in, you know, languages like SQL or dialects of Spark, PySpark, etcetera. And that business logic is typically very complex. So there's a lot of rules. You know, for example, what goes into calculating an active user or lifetime value or other concepts. And that logic is very hard to test. At the same time, typically, companies that are data driven really apply data to make decisions, either by machines with machine learning or by people with business intelligence.
They do a lot of iterations on these code bases. And so, back to change management, the problem that I saw is, when we evolve this business logic, which is, you know, encoded in, let's say, dbt models or Airflow DAGs, how do we know this change is doing exactly what it's supposed to? And so testing that turned out to be extremely painful, and there was no really good answer to that. So what we saw people doing, myself included, is basically you make a change, you build some sort of a development or staging version of a new dataset, and then you compare it to production. And then you try to see, you know, is it actually doing the right thing? Is it the right new representation of the business?
Can anything be wrong with this? And it's a very painful manual process, so you just have to write a lot of ad hoc code to do it. And so when we thought about, okay, how do we automate this workflow? We came up with the concept of data-diff, because ultimately it comes down to understanding the impact of the change in the source code on the data. So that's how the data-diff tool was born, and we launched it 2 years ago while Datafold was in the Y Combinator summer batch. We launched it. We started getting traction with our first customers.
And, basically, the use case was, when someone opens a PR, they would go to Datafold, and they would validate their change, see how things are changing both in terms of rows and columns and distributions, and then they would be able to merge the PR confidently or share this report with their stakeholders. And so then the next thing we discovered is that the users wanted to bake this into the CI flow. So instead of actually going to a separate tool and doing a diff, they wanted to have this report, the impact report, be produced automatically in CI in tools like GitHub and GitLab. So that's what we built next. And then what we learned is that it's not enough to just know the impact on the dataset that you're actually changing, because data is very interdependent, and typically when you touch one thing, likely you'll have repercussions in other places. So someone may have built a dashboard report or an online machine learning algorithm that relies on your data, and it's not necessarily apparent that those dependencies exist.
And so the question that our users were asking is, can you help us identify the impact on the downstream things? So that has put us on a journey of building what's called data lineage. So understanding how the data flows throughout all the pipelines. And we had to do this on a columnar level, because that's pretty much the only way to actually assess the impact of the changes. And as you probably know, this is a very, very hard problem. We knew it was hard, but we didn't really know how hard it was. So it took us about 9 months to actually build it to the point where we're confidently building these lineage graphs for large customers. But ultimately, what this allowed Datafold to do is, for any change done anywhere in the data pipeline, show exactly how the current dataset is going to change, what's gonna be the impact on the data applications, who among the users of the data, such as data scientists or even execs looking at dashboards, is going to be affected, with, you know, all of that information being provided in the CI flow. So that turned out to be very, very impactful, but that's actually not quite the use case that we are gonna talk about today.
The third thing we learned when we launched data-diff is that the users wanted to diff not necessarily for regression testing in the same database, but they wanted to use this tool to compare datasets across physically different databases. So, for example, Postgres and Snowflake, or MySQL and Databricks, or 2 different servers of Postgres. And when they were coming to us with these questions, we're like, well, you know, why are you interested in this? Like, what exactly is the problem you're trying to solve? And what we learned is that they were trying to use this tool to validate data replication. So today, with, you know, the ELT pattern which is prevailing and everyone centralizing data in the data lake, the common pattern is that you're piping all the data you have into the data lake. Or even if you don't have a data lake, you're still piping it around, for example, from transactional systems into your data warehouse, and then maybe to another data warehouse.
And moving data between systems, especially systems that are, you know, not made by the same vendor, is a pretty complex process, because typically there's a lot of data. It's very complex, but also coming at a high rate. And so it turns out that many companies that we spoke to had a lot of problems with ensuring the quality of those data replication processes. And this is a very big market, right? And there are really big players there, both vendors and open source solutions like Airbyte and Debezium, doing both streaming and batch transfers.
But nevertheless, it sounded like there was a big problem of, no matter how you copy data around, it was hard to make sure that you end up with the same data, you know, in the end. And so we thought it's a really interesting use case, and that aligned nicely with the concept of data-diff comparing datasets. So we started iterating, and we built the first naive implementation that pretty much required using a third party system. We used Spark at the time. So Spark would go into both systems, download the datasets, compare them in memory, and then provide the results. And that turned out to be very brittle and very expensive, because ultimately, you know, modern data volumes are such that a lot of the time transferring large volumes of data over the network is prohibitively expensive in terms of network IO and in terms of infrastructure.
So that was one problem. The other problem was that when you're trying to download data from different systems into one system, then you have to deal with all sorts of different incompatibilities. So we pretty much scrapped this approach. And the other reason was, basically, if you're copying data from 2 systems into a third system, then you're pretty much redoing the work that, you know, Fivetran and Stitch and the other vendors are doing. So we didn't want to build a data replication company. We wanted to find a way to just assess the quality of that replication. And so as we were doing more research into this tool and different approaches, we were basically looking for a way to solve this problem without actually transferring all the data over the network, which sounded like a pretty hard problem to solve. And some time ago, one of our lead investors, Sarah Catanzaro from Amplify, sent me an article that Simon wrote about a proof of concept of an algorithm that actually made it possible to compare datasets across systems without transferring them over the network, which I thought was really clever.
[00:13:14] Unknown:
And so I was really, really impressed by Simon's work. So I reached out and said, hey, do you wanna build this tool and make it open source together? Because it sounds like a really big problem, and you care about this. And then I'll transition to Simon, and we'll hear his part of the story. Yeah. So I have a blog, or newsletter, that I call Napkin Math. And this is sort of something that grew out of just a lot of technical reviews in my previous role, where we would talk about, you know, okay, this system is gonna do X, Y, and Z, and then someone would come along, and they would say, you know, it's too slow to have this solution, so we gotta have this other solution. And what I observed of some of the best engineers around me was that they have this intuition for how fast systems perform. Right? They can very, very quickly tell you, well, this is gonna be too slow to transfer all of the data, you need to do something along this line, and they can design systems in, you know, minutes that otherwise can take a lot of iterations. So I started a blog kind of around this concept of building intuitions for how systems perform. How fast can you checksum, you know, a billion rows? You can calculate that very, very quickly if you have access to a little cheat sheet that I've made. And so one of the problems that I saw that was really recurring was this problem that Gleb's describing. Right? Like, you're replicating from A to B, and you're not confident that B is gonna be the same as A. One of my pet peeves is that even if the replication mechanism is in theory 100% reliable, right, because you're using a replication stream and, you know, it's based on some paper and so on, no programmer is immune to an early return or some exception or a truncated value or a, like, you know, type incompatibility or whatever that actually causes data loss. So I've never trusted those mechanisms at all.
It's always been a bit crazy to me how much of the data realm runs on big where-updated-at queries, where you can lose hard deletes and things like that, and does financial reporting on that.
And where I got exposed to this problem was that at Shopify, I was working with a team that was writing a tool to move shops from one shard to another. So this was super important, because you can imagine that in a big sharded setup, you essentially have some version of the bin packing problem of trying to balance all the shops across all the shards. So if you have, you know, a thousand shards and you have millions of stores, you wanna balance them across all of these shards to minimize the cost and how many shards you need, traffic, and so on. So we needed the actual mechanism of kinda surgically taking one shop from one shard and moving it to another, but we didn't trust it. You know, one of the interns wrote a TLA+ proof that this algorithm was correct. We did all kinds of things. But one of the best engineers I've ever worked with, Florian Weingarten, who worked on that team, said we also need to checksum, you know, A and B, even though we have this proof. And then he additionally said, also, we're gonna have this intern coming in to write a separate implementation of the mover that only verifies it, just in case, as well. So we had all of these mechanisms to check. And then once we solved the problem there and the shop mover moved into production, I started seeing this problem everywhere. Right?
So many engineers spend a lot of time handling webhooks. Right? That's also moving from A to B, but there is no good way of checking it. If you do a 500 and they don't resend it, you're kind of SOL. Right? On the data side, right, you have these very complicated pipelines where you're moving from, you know, a MySQL replication log into Debezium, into Kafka. Then you have these transformers putting it back into Kafka, and then you extract it out of Kafka into another system, and then it puts it into another database, and so on. And I'm like, you know, even if everything is a one-in-a-billion error, at a big company it's gonna happen a lot. And even worse is when you then do a count on one end and at the other end of the system the counts don't line up. Like, how are you gonna find out what that missing row was? Right? So I kept seeing this problem everywhere. So I decided to write up this article on my blog about how I thought that you could use this kind of checksum based approach, which we'll describe in more detail later on, to move all the computation into the database, and even across very different databases like Elasticsearch and MySQL and Snowflake. All of them have this common denominator of being able to do sums and checksums on the columns that you care about. So we came up with that approach, and while we didn't deploy it, at least not in my time at Shopify, I wrote the article. And then when Gleb reached out to write an open source version of it, I mean, that was an absolutely perfect fit. So that's sort of how I ended up in this problem.
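The checksum-and-bisect idea Simon describes can be sketched in a few lines of Python. This is an illustrative toy, not data-diff's actual implementation: it simulates two tables as in-memory dicts and recursively narrows down divergent key ranges by comparing segment checksums, falling back to row-by-row comparison only on small segments — the way the real tool pushes checksum queries down to each database instead of transferring rows.

```python
import hashlib

def segment_checksum(table, lo, hi):
    """Checksum all rows whose key falls in [lo, hi)."""
    h = hashlib.md5()
    for key in sorted(k for k in table if lo <= k < hi):
        h.update(f"{key}:{table[key]}".encode())
    return h.hexdigest()

def diff_keys(a, b, lo, hi, min_segment=4):
    """Return keys whose rows differ between tables a and b in [lo, hi)."""
    if segment_checksum(a, lo, hi) == segment_checksum(b, lo, hi):
        return []  # segment identical: skip it entirely
    if hi - lo <= min_segment:
        # Small enough: compare rows directly.
        keys = set(a) | set(b)
        return sorted(k for k in keys
                      if lo <= k < hi and a.get(k) != b.get(k))
    # Checksums differ: split the range and recurse into each half.
    mid = (lo + hi) // 2
    return (diff_keys(a, b, lo, mid, min_segment)
            + diff_keys(a, b, mid, hi, min_segment))

src = {i: f"row-{i}" for i in range(1000)}
dst = dict(src)
dst[417] = "row-417-corrupted"   # a silently mangled row
del dst[901]                     # a lost row

print(diff_keys(src, dst, 0, 1000))  # → [417, 901]
```

Because matching segments are pruned in one comparison, only O(log n) checksum rounds plus one small row fetch are needed to pinpoint the diverging rows.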
[00:17:44] Unknown:
And as far as the utility of the data-diff tool, you mentioned that it does give you this capability of understanding, okay, I have a mismatch between my source and my destination systems. This will actually help me understand what the source of that mismatch was, or what the missing data happens to be. But there are also a number of other approaches to data quality, as far as making sure that, you know, as I'm replicating my data or running my pipeline, I can be alerted on anomalies in terms of changing volume, where I expect that I'm going to replicate 10,000 rows because that's what I usually replicate within this 24 hour period, and all of a sudden I'm replicating 50,000 or 1,000, and so I wanna be alerted and be able to look into it. Or, you know, I want to make sure that every time I replicate this table, it has columns A, B, C, D, and E, and now all of a sudden I'm also shipping column F, so I wanna be alerted about that. And I'm wondering if you can just talk to some of the ways that this data-diff tool composes together with other utilities for data quality and data monitoring and sort of the general health and well-being of a data system?
[00:18:47] Unknown:
So data monitoring is a really, I think, powerful concept. But in general, it relies on an assumption, typically, that the data that we saw yesterday, or the way the data looked yesterday, right, or last hour, whatever the previous time window is, is roughly what we expect today. Right? So it's basically just looking for anomalies relative to the previous distribution. And the reason why this approach is used in most tools is because that's pretty much the only way we can generalize for kind of monitoring data across thousands of tables and tens of thousands of columns. Otherwise, we wouldn't really be able to do it at scale and accommodate all the different types and uses of data.
But the problem with that approach is signal to noise. Right? Because if you're applying this assumption that, well, you're going to alert if the data doesn't look like it used to look yesterday, we are basically prone to a pretty high ratio of false positives, alerting on things that aren't actually anomalies, and false negatives, not catching things that are anomalous. And so I think whenever we have an opportunity to know for sure if the data is actually correct or incorrect, this is a luxury we should definitely seize.
And interestingly, data replication is actually one of those cases where we can know for sure what is right and what is not, because ultimately, the whole use case is about moving the same dataset from point A to point B, and you know that if the data is exactly the same, this is success, and if it's not, this is a failure. Now the second question is, what is the magnitude of this failure? But a lot of the time, the use case of data replication is to leverage data from a production system in the analytical warehouse. And because it's coming from the production system, data teams typically like to call it the source of truth, because it's basically what, you know, the actual application, the actual business logic generates. And so if this data ends up not being reliable, this is pretty much destroying the last hope for analytics.
And that's why I think it's really important — not I think, but in general, we know it's very important — to get this data right. And so, coming back to data-diff, the power of this approach is that it doesn't rely on anomalies. It is able to verify at scale whether you actually have the same data on both ends. And, obviously, you can slice and dice it, and you can, you know, compare a given partition or only data that was copied between certain points in time. But, ultimately, the beauty of this implementation is that it is able to tell you at scale, and we're talking about hundreds of millions, billions of rows, with checks performed, you know, in single minutes, across systems physically separated over the network, whether datasets are in sync. And at the same time, it is able to pinpoint up to individual rows that are out of sync. That was not previously possible, at least to our knowledge, at all.
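To make that scale claim concrete: the trick is to push an aggregate checksum down to each database, so only a row count and a checksum per segment cross the network. The helper below builds such a query as a plain string. It is an illustrative sketch in a generic SQL flavor — function names like `conv` and the concatenation syntax vary by database, and this is not the exact query data-diff emits.

```python
def checksum_query(table, key_col, cols, lo, hi):
    """Build a segment-checksum query to push down to one database.

    Each row is serialized and hashed with md5; the digest's low 8 hex
    digits are converted to an integer and summed, so only
    (count, checksum) -- a few bytes -- travel back over the network
    instead of the rows themselves.
    """
    # Serialize the columns we care about into one string per row.
    row_expr = " || '|' || ".join(f"cast({c} as varchar)" for c in cols)
    return (
        f"SELECT count(*) AS cnt, "
        f"sum(cast(conv(substr(md5({row_expr}), 25, 8), 16, 10) as bigint)) AS cks "
        f"FROM {table} "
        f"WHERE {key_col} >= {lo} AND {key_col} < {hi}"
    )

q = checksum_query("orders", "id", ["id", "amount", "updated_at"], 0, 100_000)
print(q)
```

Running the same query shape against both the source and the destination for each key segment, and comparing the `(cnt, cks)` pairs, is what lets matching segments be skipped without ever moving their rows.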
And the way that this tool can integrate into different systems, you know, including data quality frameworks, is actually fairly elegant, because we intentionally shipped it as an open source package that's on GitHub, but also can be installed via pip, and it can be embedded into data orchestrators like, you know, Airflow and Dagster. It can be embedded into data quality frameworks such as Great Expectations. So you pretty much can just run an expectation that your data is in sync while using data-diff as the actual tool to check that expectation. You can also wrap it in dbt and ensure that, you know, if you're leveraging data from different systems, that data is in sync. In terms of how it plugs into other tools in, let's say, the Datafold platform, which is the part of the platform that is not open source, it's basically to provide a more complete end to end solution to solve a particular problem. For example, if you are actually validating data replication with data-diff and you identify a certain discrepancy, your next step typically is to assess what is the impact of that discrepancy. Because, again, let's say we're talking about replicating hundreds of tables, tens of thousands of columns, which is the case for even, you know, a medium sized business these days, you are almost guaranteed to have some anomalies, some inconsistencies floating up here and there. But not all of them probably warrant your, you know, data engineering team's time to go and investigate, let alone wake someone up at 4 AM with a pager alarm. Right? So how do you know what the impact is? And so this is where the larger data platform comes into play. So, for example, we can leverage the same column level lineage functionality that I was talking about to basically rank any anomaly against the importance of the data.
So understanding whether a particular column or table is used in any business sensitive applications or online applications, what is the total amount of usage that any column gets, and, based on that, treating those anomalies differently.
That's just 1 example. So we intentionally build this tool to be really embeddable and very particular in terms of what it does so it can be plugged into any system.
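As a rough illustration of the "expectation that your data is in sync" idea, here is a minimal, self-contained sketch. The function name and result shape are hypothetical, not Great Expectations' actual API; a real setup would feed this from a data-diff run (its output is one tuple per differing row, as described later in the conversation) rather than from the hardcoded list used here:

```python
def expect_tables_in_sync(diff_rows, max_differences=0):
    """Hypothetical expectation-style wrapper: takes the rows emitted by a
    diff run (('+', row) for rows only in the destination, ('-', row) for
    rows only in the source) and turns them into a pass/fail result."""
    diffs = list(diff_rows)
    return {
        "success": len(diffs) <= max_differences,
        "unexpected_count": len(diffs),
        "examples": diffs[:5],  # a few sample rows for the alert payload
    }

# Simulated diff output: one row present in the source but missing downstream.
result = expect_tables_in_sync([("-", (42, "orphaned row"))])
print(result["success"], result["unexpected_count"])  # False 1
```

The `max_differences` knob mirrors the idea, discussed later, that not every discrepancy warrants paging someone.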
[00:24:10] Unknown:
Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it's no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. That's where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability.
Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you're a Data Engineering Podcast listener, you get credits worth $5,000 when you become a customer. Another interesting element of how to think about using this tool is the case where you're doing change data capture, and so you're also replicating data that gets deleted. And so you might have some of those deleted records in the destination system that no longer exist in the source system. I'm wondering how you approach being able to identify and account for the fact that I expect the destination system is going to have more rows.
This is the signature that I expect them to have. So when you come across this, ignore the fact that there are more rows in the destination system. Don't alert me on that. But if there are other variances between the two systems, I still wanna know about that as well. It might help now to explain
[00:25:56] Unknown:
how data-diff actually works and then answer your question, because to answer that, I think we have to understand exactly how it works. Right? So the way that the tool works, right, is that if you imagine kind of a spectrum, on one side of the spectrum, you have a count. Right? So you do a count in the source and destination. If the numbers are the same, you're reasonably-ish sure that they're probably maybe the same, but you don't actually know, because you could have lost some updates or you could have an incompatible data type or anything like that. Right? So that's a count. And most of the time, if the count's close enough, you're gonna just kinda back off, because figuring out why the count is different is gonna take a long time. So very low fidelity, but also quite fast on the database. So that's one side of the spectrum. The other side of the spectrum is to literally compare every single row. Right? Have some kind of mediating process or some kind of cluster like Trino or Dataflow or something like that that can do these massive queries and then shuffle around and do a massive join across multiple databases and figure out whether everything's the same. Doable, but for a lot of companies, that would take quite a while to run. Right? That's a very expensive operation, not just in terms of runtime. Right? This could easily take hours, weeks, months on large things.
Network costs, cloud costs, like, it's a very expensive operation and time consuming. Right? And what's frustrating with a lot of data tooling is that it is so time consuming, and you, like, sit down and you have to, you know, go for a lot of coffee breaks while you wait for things. And we wanna try to minimize that a little bit. That's the other side of the spectrum. So what we wanted to be able to do with this tool is that it very elegantly slides across the spectrum. Okay? And the question is, well, what's in the middle of that spectrum? What's something that's pretty fast but is also able to tell you what the differences are? The algorithm that I proposed in the blog post, and that since then me and the other engineer on the project, Erez, who you've had on the podcast before about Preql and who's written other open source projects, him and I are working together on this project, the algorithm that we've come up with is essentially that if you imagine you have a million rows or a billion or whatever on the left side and a million on the right side, what we start by doing is dividing it into segments. So let's say we have 10 segments. So segment number one is IDs 1 through 100,000.
Segment number two is 100,001 through 200,000, and so on, so on, so on. Right? All the way up to the last segment having the last 100,000 IDs. We do that on both sides. Then we find the lowest common denominator for figuring out the identity of that segment on both sides of the databases. Right? So if this is like Postgres to BigQuery or Snowflake or something like that, you can imagine a query that essentially does a sum of the MD5 of the columns that you care about for that segment. Right? So it's like a sum of MD5 of id, and then maybe updated_at and maybe other columns that you care about, where id is, you know, greater than or equal to 1 and less than 100,000. You do that on both sides. So let's say on the first segment, the checksums line up. Right? The sums of the MD5s are the same. Great. We don't have to look at that again. The literal mediating Python process that sits in between, you know, it got 32-byte checksums from both, compares them, and you're done. Right? It moves on to the next segment. In fact, it does all of this in parallel. Right? But it goes to the next segment. Let's say now on the next segment, the checksums are different.
So the left side and the right side have a different checksum. Now all we know is that these two segments are different. Then we take that segment and divide that again into 10 new segments. Right? If this was the segment from 100,000 to 200,000, it's 100,000 to 110,000, and 110,000 to 120,000, and so on, into 10 segments. Then we checksum those, and then we find that, okay, it's the middle segment. Right? It's that middle segment that's different. We download all the rows, because they're below this threshold of 10,000. We download all the rows. We diff them, and then we print out that single row, whatever it is that's different. So now we can do this very fast. Right? So on a table with, let's say, a billion records, a count in something like Postgres takes maybe half a minute to a full minute. We can do this full checksumming operation between two databases in 3 to 4 minutes for a billion records, and I didn't even test how long this would take. I haven't run that test yet, because I don't have that much time, of how long it would take to download all the rows and compare them. Right? But, typically, what I find is that the checksumming tool is about 5 to 6 times slower than a count, and downloading every row is something like 20 to 30 times slower than our checksumming approach.
It's really quite fast, and single-digit minutes is great. So that's essentially how it works. So then, to go back to your question of how we detect deletes, well, it's very simple. Right? The checksums wouldn't line up for that segment. It's then gonna go down and find the ID range where it doesn't line up, gonna download all the rows, and then it will see, then it does, you know, just a simple in-memory operation in Python to see which rows are different, and then it will spit out, like, a little tuple that has a minus sign and then all the columns that you cared about. So that's how it does it, and that's how it works.
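For readers who want to see the shape of the algorithm just described, here is a minimal, self-contained sketch of the recursive segment-checksumming idea, with in-memory dicts standing in for the two databases. In the real tool the checksum is a `SUM(MD5(...))` query pushed down to each database and segments are checked in parallel; the fanout and threshold values here are just illustrative:

```python
import hashlib

def checksum(table, lo, hi):
    # Sum of per-row MD5 digests over the key range [lo, hi),
    # mimicking the SUM(MD5(...)) query each database would run itself.
    total = 0
    for key in sorted(k for k in table if lo <= k < hi):
        digest = hashlib.md5(repr((key, table[key])).encode()).hexdigest()
        total = (total + int(digest, 16)) % (1 << 128)
    return total

def diff_segments(a, b, lo, hi, fanout=10, threshold=100):
    if checksum(a, lo, hi) == checksum(b, lo, hi):
        return []  # checksums line up: prune this whole segment
    if hi - lo <= threshold:
        # Segment is small enough: "download" the rows and diff in memory.
        diffs = []
        for key in sorted(set(a) | set(b)):
            if lo <= key < hi and a.get(key) != b.get(key):
                if key not in b:
                    diffs.append(("-", key, a[key]))          # missing downstream
                elif key not in a:
                    diffs.append(("+", key, b[key]))          # extra downstream
                else:
                    diffs.append(("!", key, a[key], b[key]))  # value drifted
        return diffs
    # Otherwise split into smaller segments and recurse.
    step = max(1, (hi - lo) // fanout)
    out = []
    for s in range(lo, hi, step):
        out.extend(diff_segments(a, b, s, min(s + step, hi), fanout, threshold))
    return out

source = {i: f"row-{i}" for i in range(1, 1001)}
dest = dict(source)
dest.pop(404)          # a row lost in replication
dest[700] = "row-XXX"  # a row with a drifted value
print(diff_segments(source, dest, 1, 1001))
# → [('-', 404, 'row-404'), ('!', 700, 'row-700', 'row-XXX')]
```

This also shows the delete-detection behavior from the answer above: a row present on only one side changes that segment's checksum, so the recursion narrows down to it and emits a `-` (or `+`) tuple.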
[00:31:08] Unknown:
So in terms of then saying, okay, I see that there are values that are in the destination system, now I want to understand that these actually exist because these rows were deleted from the source database. I imagine that then factors into whatever additional logic you want to wrap around the execution of the data-diff utility.
[00:31:25] Unknown:
That's right. So, I mean, you could write a complete replication mechanism based on this. Right? If you have something that does an updated_at greater-than, if you have even a fairly large table, you could just completely replace that with data-diff, which is gonna be more reliable for you. Or you can, you know, create an SLO and be alerted. Right? If you're doing financial reporting, right, like, if the numbers don't line up, like, literally, someone could go to jail. Right? Like, this can be very, very important.
[00:31:49] Unknown:
So that's right. Yeah. And I think the other part is, as a user, you can also materialize the results of the diff process back into a database, you know, either the A side or the B side. So, for example, if you're moving data from Postgres to Snowflake, you can materialize the diff itself into either of those, and then you can use BI tools to investigate those and run queries and see exactly what doesn't line up and run analysis. You can also put anomaly detection on top of this. You can also build automatic remediation mechanisms where you would have a job that would look at whatever the diff was and only request that certain rows are actually, you know, resynced.
So, basically, the idea is to give a really solid building block or blocks for people to really make those replication processes robust.
[00:32:39] Unknown:
I think another very common use case, right, is, for example, someone who's moving from, let's say, Heroku to another Postgres instance. Right? They wanna make sure all the data is the same. Another thing that I think is actually quite important for data-diff, and something that's often neglected, is that default data types between databases are quite different. Right? So to give you an example, I was talking to someone who's using data-diff, and they are, I think, using Fivetran or something along those lines to replicate a currency value from Postgres into Snowflake. And this adapter, even though the origin amount is a, you know, exact numeric, which is what you want for currency, you don't wanna use floating points for currency, as anyone who's dealt with money knows. But by default, Fivetran, I think it is, will create it as a float in Snowflake. Right? And data-diff will then tell you, hey, like, we're gonna do our best here. Like, if all the floating points line up with your numeric value to an exact precision, the checksums are gonna line up. But data-diff will also warn you that your schema is problematic.
Right? So you inherently are going to have data loss at some point. And I think that's gonna be an incredibly important part of data-diff as well, because if you use something like Fivetran, right, it just kinda creates the schema.
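The floating-point-for-currency problem mentioned above is easy to demonstrate in a few lines of Python: an exact numeric type (what a `NUMERIC`/`DECIMAL` column gives you) behaves as money math should, while binary floats pick up representation error, which is how a column silently created as `FLOAT` downstream can drift from its exact-numeric source:

```python
from decimal import Decimal

# Exact numerics add the way money should:
exact = Decimal("0.10") + Decimal("0.20")
print(exact)           # 0.30

# The same arithmetic on binary floats carries representation error:
approx = 0.10 + 0.20
print(approx)          # 0.30000000000000004
print(approx == 0.30)  # False
```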
[00:33:49] Unknown:
From what I've seen, I wouldn't really trust it to make good decisions there. Digging into that aspect of the potential variance in data types between the source and the destination systems, I'm also interested in digging into the selection of the hashing algorithm. You mentioned MD5 as the default that it's using out of the box. And I also know that while it is one of the faster hashing algorithms, it also has, you know, the very real potential of generating hash collisions. And I'm wondering what you have done to mitigate that potential issue, or to give end users the opportunity to say, I actually want to use a different hashing algorithm.
You know, if the database supports it, I actually would prefer to use, you know, SHA-512 or something like that, and spend the extra time that is going to be required for the extra compute that is necessary to compute that hash. Yeah. I think for data verification,
[00:34:36] Unknown:
hash collision is kind of a theoretical problem that you shouldn't really have to worry too much about. Right? This is not a cryptographic hash, so we might actually want to use something like CRC32, which is a little bit faster. In reality, I would love to use something like xxHash, right, which is about a hundred times faster and can be SIMD-accelerated, compared to MD5. The reason why we're using MD5 is just because it's ubiquitous. Right? I think, in reality, what we might do at some point is just not even hash it, but literally just try to sum, if they're integers or floating points, because for data verification, that's usually sufficient. Really, what I would love to have happen over the long term, right, is that this becomes a fundamental primitive, not just of databases, but also of SaaS APIs.
I could see a future where something like Postgres, you do a CREATE INDEX of a Merkle tree type, and then you can do these kinds of operations in constant time because they're done as records are inserted. Right? Similarly, this could also be the case for SaaS APIs, which we would love to support at some point, where the drivers are often of even lower quality. So if you're, you know, taking orders out of Shopify and putting that into your data warehouse, we would love to be able to negotiate with an API and use the exact algorithm that I described before of exchanging hashes until we figure out what rows we've missed from our wacky, you know, webhook infrastructure. So I think there is a lot to do here, and there are also databases that don't even support MD5. Right? So, for example, Elasticsearch doesn't support MD5, even though it does support a lot of analytics queries. So in there, you would either have to script it in their little, like, cute scripting language, but really, you probably have to fall back to a sum. So this becomes a little bit like negotiating a TLS version. Right? Where you might wanna start, like, oh, cool, you have xxHash, one of the fastest hash algorithms out there. Great. We'll use that. And then you regress all the way down to just literally summing on the, you know, ASCII values of strings. Right?
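The TLS-style negotiation being described could be sketched roughly like this. The algorithm names and capability lists are purely illustrative (xxHash, for instance, would need a third-party binding), and no such protocol exists in the tool today:

```python
# Checksum algorithms in rough order of preference (fastest first);
# "ascii_sum" stands in for the last-resort sum over character values.
PREFERENCE = ["xxhash", "md5", "crc32", "ascii_sum"]

def negotiate_checksum(side_a_supports, side_b_supports):
    """Pick the best checksum both sides can compute, falling back toward
    a plain sum, much like agreeing on a TLS cipher suite."""
    common = set(side_a_supports) & set(side_b_supports)
    for algo in PREFERENCE:
        if algo in common:
            return algo
    raise ValueError("no common checksum algorithm between the two systems")

# A Postgres-like side meeting a store that only supports weaker checksums:
print(negotiate_checksum(["md5", "crc32"], ["crc32", "ascii_sum"]))  # crc32
```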
And then we see another thing, which is something like MSSQL, where hashing is a hundred times slower than in Postgres. It's something Erez found. So it's almost unusable. Right? So there's a lot that could be done here. And, of course, for this to be a successful project, we just have to start with the lowest common denominator. But I think the ambitious vision here is that database vendors and SaaS API vendors really get on board with this, and we describe in an RFC what it would look like to do this negotiation in a really fast manner. And I think the API companies should be incentivized to do this. Right? Because now instead of doing a full sync, you can just exchange a couple of hashes and figure out whether things are in sync or not. Another interesting element of the way that you approached this solution is how you thought about the
[00:37:13] Unknown:
selection of language and runtime environment. And you mentioned the algorithmic approach, but how did you think about the potential ways that it might be integrated into the overall data workflow in an organization, and what types of interfaces you wanted to be able to provide and support, versus saying, okay, here's a command line utility, here are all the parameters that you wanna give it, good luck.
[00:37:34] Unknown:
Yeah. Exactly. So, yeah, we've chosen Python, and I'll go into a couple of pros and cons of that. I think, in general, what's been most surprising to me is, like, we've really been looking hard for a tool that is comparable to data-diff. Right? Like, we don't really wanna write this. Like, preferably, something would just exist that does this. And I think it maybe comes down a little bit to this chasm I talked about before between production engineering and data engineering. Right? In order for you to write a tool like this, you really have to understand how both a columnar database works and a relational database works. Right? A columnar database is very slow at doing something like a min/max query with a WHERE clause, which is something that we use, but a relational database is incredibly fast at that. Right? So having the knowledge of both is incredibly important for writing a tool like this.
So in terms of the runtime, we chose Python, and the reason why we chose Python is because we need a lot of high quality adapters, and we see that the majority of the people who are gonna be using this out of the gate are gonna be data engineering teams who are responsible for that replication going into their analytics database or their data lake. Python is the ubiquitous language in that realm. Right? It's approachable. So the obvious kind of retort to using Python will be, well, you know, it's a slow, like, GIL-constrained language, whatever. But because we're offloading almost all of the work into the database, this just doesn't really matter very much at all. It only really matters when we're downloading all these rows to compare them in memory, but this is a tool that should be used when the vast majority of records are the same, not when everything is different. Right? Then, yeah, you might get into a situation where the serialization and the network traffic is gonna matter a lot more, but that's not really the realm that we're trying to occupy with this tool. Right? Your replication mechanism should be pretty close. And if it does lose data, it's gonna lose it in adjacent segments next to each other, which data-diff is gonna be good at.
The one place where this might have trouble is if you're trying to, let's say, check whether all of a table is in a Kafka partition, right, which is the kind of thing that a tool like Debezium does. It takes the replication stream from MySQL or Postgres and pops it into a Kafka topic. For something like that, Kafka can't, you know, do the checksumming for you. So you kind of have to, like, you know, spin up a bunch of threads, download all of them, and then run it yourself. That would be one case where the runtime is gonna start mattering a lot. Right? Now you're doing a lot of serialization and deserialization, which is the kind of thing that a scripting language is poor at. For now, I think that the approachability for the data engineering community with Python is more important than that. And for verifying something like Kafka or similar systems, we would then recommend that you use something like KSQL or Flink or Beam or something like that and write an adapter for that, which can do that kind of aggregation.
[00:40:17] Unknown:
As far as that actual broader integration into the data flow, you mentioned a few different ways of saying you can wrap this with Great Expectations, you can wrap it in dbt, you can run it as part of your data orchestration engine. What are some of the approaches that you've seen folks using, now that this is available, to integrate data-diff into their overall workflow, whatever their data replication scheme might be, or the ways that they think about synchronizing different systems, or maybe some of the unexpected types of synchronization that you would see, where it's not just from your, you know, application database to your data warehouse?
[00:40:51] Unknown:
Overall, I'd say that the project is still quite young, and, you know, we're speaking very early in the, I would say, adoption phase. So the users who we've seen adopt data-diff already, and also those who inquired and kind of have been our design partners in building this, typically were interested in using the tool in kind of an ad hoc, sort of, you know, surgical "we need to check this, this looks off" way. So we built a CLI interface and made it really easy to just run a quick check for anyone who wanted to. And then the second most popular use case was to wrap data-diff into a data orchestrator such as Airflow, to basically run the validation in alignment with the general data transfer process and other processes in the overall ELT process.
So, basically, let's say, although the replication process can be continuous and done via streaming, at some point, the transformation logic is mostly expressed in SQL and mostly done in batches. And so what the users wanted to do is basically embed data-diff in, let's say, an Airflow DAG and run the check on all the source tables before the actual business logic, which takes this data and transforms it into the tables that the business uses, kicks in. And so for that synchronization, they wanted to embed it into Airflow. So, basically, wrapping it into an Airflow operator or a Dagster solid is probably the second most popular use case that we've seen so far. I would say in terms of the most unexpected use cases, and I think Simon touched on this a little bit, there is general interest in validating data replication across systems broadly, not necessarily just relational or columnar databases. Because if I think about this, you know, the modern pattern is we copy data from all the places and we put it in the data lake, and then from there, we copy it back into other systems.
Right? So there's constant movement of data in the organization. The larger the data platform is and the more data-driven the business, the more pipelines exist. And, you know, how do you ensure the reliability of all of this orchestration? The moment some data becomes inconsistent somewhere, this inevitably cascades, and you have ripple effects, and it becomes prohibitively hard to fix things and to even find out where things went wrong. So putting something like data-diff in place for data replication in general seems like a very good idea, and I think a pattern that we will see data platforms increasingly adopt, not necessarily with the tool we built, but in general in terms of checking the data replication.
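The orchestrator-embedding pattern described above, gating transformations on a diff check, might be sketched like this. The helper names are hypothetical, and a real task would call the data-diff CLI or Python API with real connection strings; the `+`/`-` line prefixes match the diff output format described earlier in the conversation:

```python
import subprocess

def count_discrepancies(diff_output: str) -> int:
    # The diff emits one line per differing row, prefixed with '+' or '-'.
    return sum(1 for line in diff_output.splitlines()
               if line.startswith(("+", "-")))

def diff_gate(source_uri, source_table, dest_uri, dest_table):
    """Hypothetical task body for an Airflow operator (or Dagster solid):
    run the diff and raise so the DAG fails before transformations start."""
    proc = subprocess.run(
        ["data-diff", source_uri, source_table, dest_uri, dest_table],
        capture_output=True, text=True,
    )
    n = count_discrepancies(proc.stdout)
    if n:
        raise RuntimeError(f"{n} rows out of sync; halting downstream transforms")

print(count_discrepancies("+ (11, 'new')\n- (42, 'missing')\n"))  # 2
```

Raising inside the task is what makes the orchestrator hold back the SQL transformation steps until the sources are verified in sync.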
And so with that in mind, a lot of the systems that provide data for further analytics don't have a relational interface. Right? So we're talking about APIs for systems like Stripe and Shopify. And so, basically, what we've seen that wasn't expected, at least to me, initially, was that there was a very strong interest in having validation for those systems. How do we know that, you know, total revenue as accounted in Stripe actually matches what we have in Snowflake and then flows to execs, you know, on a kind of CFO dashboard? Right? And that's a very, very, very hard problem, which doesn't currently have a good answer. And we have some ideas of how to implement this, you know, even without necessarily relying on, you know, the third party vendors adopting our protocol. The grand vision is that probably there will be some open source protocol adopted that would enable these kinds of validations for any systems that participate in this kind of overall data exchange in the data platform. And that's probably the most exciting
[00:44:31] Unknown:
part of where the tool can go. The other interesting element of this is the fact that you did decide to release it as open source. And, actually, before I get too far into this, just a kind of meta question and meta comment. When you first sort of presented this topic as an idea, I was originally thinking that this was sort of an open sourcing of the utility that you had built as sort of the core of Datafold, of being able to see, like, as you make changes to a pipeline, what are the impacts that it's going to have in your data warehouse. Is that accurate, that this is sort of the open source version of that, or is this a completely brand new tool that was kind of designed and implemented as open source from day one?
[00:45:07] Unknown:
So the data-diff that we released as an open source tool is an extension to our proprietary data diff product. Okay. It is complementary in that it opens up the additional use case of comparing datasets across databases while also supporting the original use case of comparing datasets within the same database. And, you know, it may be not very conventional for a vendor to, like, suddenly release an open source tool that would actually be quite the core of what the vendor is doing. And, you know, our reasoning was the following. In general, I am, I would say, very ultra pragmatic when it comes to open source, in that I believe that open source is great in general. But when it comes to teams actually deciding whether to adopt a vendor solution versus an open source solution, I think that they need to consider all pros and cons and really look at this not just from a "can we tinker with this, can we extend this" standpoint, but also consider, you know, potential maintenance costs and other factors that play into the general sort of economic cost and benefit for the company of using a particular solution.
I've seen, you know, those decisions being made where, you know, teams over-indexed on a particular dimension while overlooking others. And so, with that regard, why would it, for example, make sense for someone to build an open source tool, and for someone to adopt an open source tool, in the data space? I think there are probably three broad categories. So one is probably extensibility. So, for example, it makes sense for, you know, a tool like Airbyte to be open source, because people can contribute connectors. Right? That's a very obvious way to kind of benefit the community, but also increase the value of the product. Or for dbt, because it's a platform upon which people build things, and so having open source code makes sense because people can extend it in all the different ways.
The second one probably is embeddability, if the form factor of the tool is such that the primary use case is to embed it in another product. And the third one is security. Right? So if the tool is open source, then it can be adopted by anyone, and it's harder for vendors because sometimes compliance just wouldn't allow you to come in as a vendor and, you know, for instance, host very sensitive data. So when it comes to our main product, the Datafold product and the original data diff tool that we built for the use case of automating testing of ELT pipelines, or the T part of ELT, right, the data transformations, the tool was targeted at testing changes to business logic, where you actually do expect quite a bit of changes.
It inherently needed a very rich UI with a lot of data visualizations and slicing and dicing things to see, okay, you changed this definition of revenue, how does this impact your distributions of other financial metrics, and how potentially does this impact the change of data over time, so that you can actually understand what it means for the business, what it means for machine learning. Are you going to have any skews in features? And so because of that, you know, really rich, opinionated UI, and also because, to integrate that tool into CI/CD, we needed to connect with multiple applications, such as, for example, GitHub, Slack, dbt Cloud, and Snowflake, all at the same time, and orchestrate this testing workflow.
We just didn't see a big value in open sourcing this, because we knew that it would be really, really hard for someone to really extend this. And then having a consensus-driven conversation about a very opinionated UI probably would also be hard. And, also, we couldn't imagine someone embedding, you know, that tool somewhere else. And as far as security goes, we solved it with an on-premise deployment model where we could deploy into a customer's VPC using Terraform, so we didn't really have this as a roadblock. So that's why we decided for our main platform, which was built around this automation of regression testing, to be proprietary.
For cross-database diff, on the other hand, it's a very different form factor and a very different way that we saw teams wanting to use it. Right? UI wasn't important, because, ultimately, you do expect data to mostly line up, and it's not about testing things when you are making changes to source code. And so it's not a human-driven workflow. It's actually a machine-driven workflow, where you continuously replicate data using, you know, batch with Airflow and Airbyte, or, you know, CDC with Debezium and Kafka, and you want to find out whenever things are broken. So it's sort of an automation workflow by definition, and therefore it makes sense to have a tool that can be embedded in those automation workflows. So we talked about Airflow, right, for example, and, you know, requiring someone to call a vendor API from Airflow would be kind of clunky and probably would defeat the purpose of a really robust system that would validate things.
And that's why, because of this form factor, we decided to make it open source, so that it can actually be implemented in the most effective way, the way the users want it. It also makes perfect sense for being extensible, because the form factor is quite simple. You connect two databases and then the tool does the job. There is no complex orchestration between multiple tools like in our main product. And at the same time, we knew that we needed to implement lots and lots of different connectors, because while the modern data stack is standardized on, you know, maybe four or five major platforms, you know, BigQuery, Databricks, Snowflake, Redshift.
For source systems, for transactional databases, the list is much, much longer. Right? So you have big enterprise databases, you have tons of different open source solutions, not to mention more exotic systems like, you know, Elasticsearch and API-driven stuff, and so we wanted to make sure that we enable a way for users to use our algorithm by contributing connectors and supporting whatever version of whatever software they run. And we also saw that it would be actually quite simple to add those. And so the combination of this extensibility and embeddability aspect made us basically release it as an open source tool. And like I said, you know, we released it with an MIT open source license, so very, very, you know, liberal, and we're not planning to monetize this particular tool. The larger vision for us is that the Datafold platform provides a more complete solution, and we wanna give away core functionality, such as verifying data replication consistency, for free. But ultimately, if you're a big company and you want not just to know whether two tables are in sync, but you want to do this for hundreds of tables, and you want to build a process around that where you actually build a playbook for a team of how you check those things and how you react when things go wrong, that requires much more than just a simple tool to know whether A and B datasets are in sync. Right? You need to understand how you prioritize those issues, how you react to them, how you do root cause analysis, and that's where the bigger platform comes into play. So that's pretty much the logic.
[00:52:05] Unknown:
In that process of saying, okay, we're going to release this as open source, this is part of our overall strategy, both from a kind of community play and organizationally how we think about the way that it fits with our business: what are some of the challenges or the kind of design elements that went into how you thought about the open sourcing of that project, how you engaged with sort of making it available to the community, and just the overall kind of messaging around it? How you thought about, you know, where is this going to live in terms of, like, the GitHub organization? Is this a Datafold project? Is this its own project? Like, all of the logistics that go into saying, I have a project idea, now I need to actually go and release it and make it available to people and, you know, build a community of users around it.
[00:52:51] Unknown:
Yeah. So I think in general, we have a luxury of not really betting for that tool to be, you know, how to make living as a company. Right? Unlike some open source vendors that basically have 1 single project, and then they have to have a pressure to monetize it, and they have to make all the different trade offs of what features are in the core product and what features are in the cloud or enterprise versions, and that can create a lot of friction with the community. We just wanted to create a tool that would be very valuable and would really deliver to the maximum extent possible within a well defined scope. Right? So we know exactly what the tool is supposed to do, and we wanna make sure that it does it extremely well, and we're giving this out as an open source project.
And obviously, you know, we're a business, so it's not necessarily a charity project. We are doing this because, as a company, we see our mission as enabling data engineers and making data engineering a more enjoyable, productive experience. A lot of issues that we're solving to get there are related to data quality. So that perfectly plays into our mission. And we also see that some of the fundamental aspects of the data engineering workflow are pretty much absent. Like, why isn't data diffable in general? Right? Why is it so hard to compare data? I feel like this is something as fundamental as Git, right, in terms of, like, version control.
And so we wanted to make sure that we create some really fundamental things and kind of give them momentum in the community so that both the community can build on top of this and everyone improves, and also we can build on top of this and offer more sophisticated tooling and workflows and further automation. So that's kind of the general philosophy. But in terms of, you know, how we use this and how we wanna promote this, we've announced it, I would say, semi broadly already, and we've seen, you know, very positive interest, and people came and started using it in ways we mostly would expect. And the plan is to continue promoting and investing in this project to gain adoption.
We probably are not going to try to create a, you know, standalone community for the tool just yet because I think there are already too many communities in the data space. You know, my Slack sidebar, with all the different Slack workspaces I'm a member of, is already full. And so I don't wanna add another 1 and, you know, have people jump between Slack communities. So we actually are present in Locally Optimistic, which is vendor agnostic, and we provide support there in a dedicated Slack channel called #tools-data-diff. And then we're also present in the dbt community in the #tools-datafold channel. So anyone can, you know, come in and ask questions and give ideas about that. So we kind of went, you know, very pragmatic about the community part, and we wanna meet people where they are first.
[00:55:40] Unknown:
It's time to make sense of today's data tooling ecosystem. Go to dataengineeringpodcast.com/rudder to get a guide that will help you build a practical data stack for every phase of your company's journey to data maturity. The guide includes architectures and tactical advice to help you progress through 4 stages: starter, growth, machine learning, and real time. Go to dataengineeringpodcast.com/rudder today to download the guide. And so in terms of your experience of building this project, releasing it, figuring out how it's going to be used, and how to factor it into the overall ecosystem of data capabilities that organizations have, what are some of the most interesting or unexpected or challenging lessons that you learned in the process?
[00:56:27] Unknown:
I don't think there's been anything super surprising, partially because I think between Erez and I, we'd had a lot of experience working on these types of things before. I think the biggest thing, which we knew was gonna be a grind, is that there are so many data types between the different databases, and a lot of them are incompatible. So Erez has kind of been in this, like, data type combinatorics hell. He called it purgatory, I think, for the past few days. To try to explain it, right, 1 of the things we noticed with 1 of the companies using it was that, by default, Snowflake has nanosecond precision on their datetime. But something like Postgres has microseconds, so 6 digits of precision versus 9 if you just do a raw datetime.
MySQL has 0 digits of precision by default, so just seconds. So if you try to compare the same timestamp on all of those, and we concat them to strings for a variety of other reasons that I won't go into, of course they're not gonna be the same. Right? Because 1 is gonna have 9 digits, the other 1 is gonna have 6, and the third 1 is not gonna have any at all after the decimal point. So then, okay, cool. Like, you know, then we just make sure we pad with zeros, but then you discover that the rounding is different. So for example, Snowflake doesn't round a datetime if you truncate it, but Postgres does. And what they do on insertion is also different. And I'm throwing a lot out here, not because it's important to understand, but imagine that, and then across all the data types for different databases. Right? So we have a test suite where it is running hundreds, soon thousands, of tests of, imagine, like, a Snowflake timestamp with 9 digits of precision against Postgres with 6 digits of precision.
And you can imagine things like bigint versus smallint for the same number, numeric versus float, floats with different precisions. And, again, they have to negotiate to then downcast the precision. So this combinatoric, like, explosion of data types is a pretty difficult problem, or just quite the grind, to do and also to do performantly. It's 1 that we knew that we had to do, but we thought that at least datetimes would be standardized, and we didn't have to worry about them for the release that we're doing this month. But we've had to do that out of the gate. So, yeah, if you're trying to verify, like, polygons across databases, that might not work yet, but patches are welcome. But we just kinda have to make it through this matrix of type combinations over time and build really good test infrastructure for that.
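To make the precision mismatch concrete, here is a small illustrative sketch. It is not code from the data-diff repository, and the `normalize_timestamp` helper is a made-up name: it renders a Python timestamp with a fixed number of fractional-second digits so that values coming from databases with different native precisions (MySQL: 0, Postgres: 6, Snowflake: 9) compare equal as strings. Note that it truncates rather than rounds, which is one of the two behaviors mentioned above.

```python
from datetime import datetime

def normalize_timestamp(ts: datetime, precision: int = 6) -> str:
    """Render a timestamp with a fixed number of fractional-second
    digits so values from databases with different native precisions
    compare equal as strings. Truncates (like Snowflake) rather than
    rounds (like Postgres)."""
    base = ts.strftime("%Y-%m-%d %H:%M:%S")
    frac = f"{ts.microsecond:06d}"  # Python stores microseconds (6 digits)
    if precision == 0:
        return base
    if precision <= 6:
        return f"{base}.{frac[:precision]}"
    # Nanosecond-capable databases: zero-pad beyond microseconds.
    return f"{base}.{frac}{'0' * (precision - 6)}"

ts = datetime(2022, 6, 1, 12, 30, 45, 123456)
print(normalize_timestamp(ts, 0))  # 2022-06-01 12:30:45
print(normalize_timestamp(ts, 6))  # 2022-06-01 12:30:45.123456
print(normalize_timestamp(ts, 9))  # 2022-06-01 12:30:45.123456000
```

Picking one canonical precision and normalizing both sides to it before hashing is what makes the cross-database string comparison meaningful at all.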
[00:58:58] Unknown:
So luckily, I'm the business guy, and I'm a little bit abstracted from the grind. But I'd say in general that this is 1 of those problems where it's, like, 80/20, right, like with many other problems in data. So we built the prototype really fast, but then to actually make this tool generally applicable and production ready for, I would say, even the mainstream data ecosystem takes a lot of effort and time. And so we hope that the effort we invest in making sure it has the required polish and robustness will be appreciated by the community, and we'll have others join in and help make this tool really broadly applicable.
[00:59:38] Unknown:
And so for people who are trying to think about how do I make sure that my data replication is running properly and that my overall data system is, you know, functioning the way that I assume it is and want it to be, what are the cases where Datadiff is the wrong choice and you might have to go with a different utility, or it's just not a useful
[01:00:00] Unknown:
tool to actually solve the problem that you might have. We've covered a couple of them, but I'll quickly summarize the ones that I see. So 1 of them is that if you have a lot of diverging records, right, if you're trying to use it at a massive scale as, like, a replication mechanism. I think if it's smaller scale, like, you know, tens of thousands or hundreds of thousands or even millions of records, it's fine. But if you're trying to replicate billions of records, this is not the right tool. The checksumming is gonna have so many mismatches that you're doing more work than just pulling down every record. So it is for mostly similar tables, right, like 0.001% mismatches.
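The checksum-and-bisect idea behind this tradeoff can be sketched in a few lines of Python. This is a hypothetical illustration, not the actual data-diff implementation: the tables are stand-in dicts of key to row value, MD5 stands in for the database-side hash aggregate, and the segment counts are made up. In the real tool, each segment checksum is a single aggregate query pushed down to the database.

```python
import hashlib

def segment_checksum(table, lo, hi):
    """Checksum of every row whose key falls in [lo, hi).
    In the real tool this is one aggregate query pushed down to
    the database; here 'table' is just a dict of key -> row."""
    h = hashlib.md5()
    for key in sorted(k for k in table if lo <= k < hi):
        h.update(f"{key}:{table[key]}".encode())
    return h.hexdigest()

def diff_segments(a, b, lo, hi, factor=4, threshold=8):
    """Recursively split [lo, hi) into `factor` segments and only
    descend into segments whose checksums disagree. Below
    `threshold` keys, fetch and compare rows directly."""
    if segment_checksum(a, lo, hi) == segment_checksum(b, lo, hi):
        return []  # identical segment: skipped in one comparison
    if hi - lo <= threshold:
        keys = sorted(set(a) | set(b))
        return [k for k in keys if lo <= k < hi and a.get(k) != b.get(k)]
    out, step = [], max(1, (hi - lo) // factor)
    for start in range(lo, hi, step):
        out += diff_segments(a, b, start, min(start + step, hi),
                             factor, threshold)
    return out

src = {i: f"row{i}" for i in range(1, 1001)}
dst = dict(src)
del dst[537]            # a record lost in replication
dst[900] = "corrupted"  # a record that diverged
print(diff_segments(src, dst, 1, 1001))  # [537, 900]
```

The sketch also shows why heavy divergence defeats the algorithm: when almost every segment's checksums disagree, the recursion descends everywhere, and you end up doing strictly more work than a plain row-by-row pull.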
Then there are those engines that don't have aggregation capabilities, which I mentioned before. Right? So for something like Kafka, where it can't do the computation in the database, and there are others like that, you will need something that sits on top of it to do the aggregation, like Trino, Presto, Dataflow, or Beam. There are lots of those. Then there are very, very large tables. And here, you know, we can do billions of rows in a few minutes. But if you are in the trillions of records, the algorithm will fundamentally work, and it will run, and it will probably do it in hours or so. But in those cases, you're gonna wanna do a little bit more finagling with only verifying windows of time, so, say, only the past week and things like that, which is very, very doable, but just requires a bit more tuning. The larger the table is, the harder it is for us to guess the defaults. But for the majority of cases, it should be a matter of, you know, coming up with 2 URI strings from something that has permission, and then it should work. But there's lots of tuning available. Like, your DBA might have concerns about running lots of these big checksum queries on a production read replica, so you can limit the amount of concurrency.
We can limit the size of those queries. There's lots of things like that to placate everyone to make sure that you can run this in production. Your DBA should be very happy with the queries we have if you have the right indexes. As you continue to iterate on this project and build it out and do bug fixes and feature enhancements, what are some of the things you have planned for the near to medium term? Let me just mention 1 other thing that I'm just thinking about now that I don't think we've made clear so far. Right? If you have a very wide table with, like, 128 or whatever columns, Datadiff is not a tool where you're going to verify every single 1 of those columns. We can, and it's gonna be faster than the alternative of pulling out every record. But the idea of Datadiff is more that you, you know, navigate this spectrum I talked about before, right, of how certain do you need to be. For most cases, like with 1 of the early adopters, they are verifying the ID, and they're verifying the updated at column that is either, you know, something that's trigger based or something that, in Django or Rails or whatever, is making sure that you always update that when the record is updated. And then the 3rd column that they verify on their wide table is some kind of currency amount. Right? So some monetary value.
The settings or some JSON serialized object or some boolean, all of that stuff is not as important. If those 3 are right, we can be reasonably confident that everything is okay. But if you're trying to verify everything, Datadiff will do it, but it's not really what it's designed to do, and you shouldn't expect the kind of performance that we've talked about. But it's probably still gonna be the best thing available. It is really about striking that compromise. So in terms of what's planned for the future, right now we're rapidly gaining more and more people who are trying it, and they're gonna find that some of their data types maybe won't exactly work. Coming up with the URI can be difficult. Maybe there's some, like, option or whatever. So we're just eagerly listening to users and trying to make it easier to understand. That's the number 1 priority right now.
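The column-subset idea from the wide-table discussion can be sketched like this. It is a hypothetical illustration rather than the tool's actual code, with the example column names (id, updated_at, amount) taken from the early adopter described above: hash only the few columns that give you confidence, and skip the wide payload columns.

```python
import hashlib

def row_fingerprint(row: dict, columns: list) -> str:
    """Hash only the selected columns, so a 128-column row is
    reduced to a short digest built from the fields that matter."""
    payload = "|".join(str(row[c]) for c in columns)
    return hashlib.md5(payload.encode()).hexdigest()

# Only id, updated_at, and a monetary amount are checked.
checked = ["id", "updated_at", "amount"]

source = {"id": 42, "updated_at": "2022-06-01 12:00:00",
          "amount": "19.99", "settings": '{"theme": "dark"}'}
target = {"id": 42, "updated_at": "2022-06-01 12:00:00",
          "amount": "19.99", "settings": '{"theme": "light"}'}

# The unchecked 'settings' column differs, but the fingerprints over
# the 3 checked columns still match. That is the accepted tradeoff:
# high confidence at a fraction of the comparison cost.
print(row_fingerprint(source, checked) == row_fingerprint(target, checked))  # True
```

This is exactly the spectrum of certainty described above: checking 3 well-chosen columns catches missed inserts and stale rows, while drift in a rarely important payload column is deliberately allowed to pass.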
There are things that we can do to make the performance of the tool better over time. I'm not expecting an order of magnitude, but, you know, theoretically, we should be able to get close to the cost of a count over time. But at some point, that's gonna be on the database vendors more so than us. More adapters. Right? That's why we've chosen Python, which probably has more adapters than any other ecosystem already, and their maturity is good. That's something we're gonna rely on the community for. And then it's just gonna be a grind of adding more data types, non integer IDs such as UUIDs, arrays, JSON,
[01:04:01] Unknown:
strings with different encodings, like, I don't know, MySQL throws away, in some cases, UTF-8 characters of more than 3 bytes, all these crazy shenanigans that we're gonna have to support over time. Are there any other aspects of the Datadiff tool or the problems that it's solving that we didn't discuss yet that you'd like to cover before we close out the show? I think we covered the bulk of it, and I think something worth mentioning is that, again, we are very early in the process. I think we opened up the repo about 3 or 4 weeks ago, and even the amount of interest we've got so far is so exciting and, you know, unusual to see, I think, just in my experience of dealing with open source projects. So I'd say that what I'm most excited about is what comes next.
And I think that, you know, a year from now, I'll probably have a lot of learnings. The tool probably will mature quite a bit, and we'll hear about all the different unexpected ways that people
[01:04:51] Unknown:
have used the tool. That probably would be worth another conversation then. I just wanna reiterate, I think, the beautiful vision that Gleb has for data tooling, right, which is that today, no one runs that count on the source and target because they're afraid of the answer. Because if the 2 counts don't line up, you don't know what to do next if the table is sufficiently large. Right? And it doesn't have to be that way, and it doesn't have to take that long. And it doesn't have to be that if something is missing or a number doesn't line up, you click a button to do a complete reimport, and then, you know, you go away on a 6 hour coffee break to wait. Like, these feedback loops that people in data land put up with are ridiculous to me. Right? I come from a background more of, you know, relational databases and things like that, and the feedback loops there are so much faster. And I think that there's a lot that could be done in the data tooling space, especially when it comes to these mechanisms where we move from, you know, the classic infrastructure realm into the data infrastructure realm, where there's such a lack of tooling. So I'm really excited for people to start using this. And, yeah, as Gleb said, you're probably gonna run into some kind of error, but we're so eager to help people get this running right now. So, probably, you'll get a reply on the issue within a few hours.
[01:06:03] Unknown:
Well, for anybody who wants to get in touch with each of you and follow along with the work that you're doing on Datadiff and other projects, I'll have you each add your preferred contact information to the show notes. And as the final question, I'd like to get each of your perspectives on what you see as being the biggest gap in the tooling or technology that's available for data management today. Yes. I think there's a perspective
[01:06:23] Unknown:
of, you know, my day to day work. 1, you know, that Datafold is focusing on, which is the general data and analytics engineering productivity. So kind of building on what Simon said in terms of the feedback loop, I think there are other problems as well. So now we have the infrastructure solved with great platforms like, you know, Databricks, Snowflake, dbt, but things are still falling through the cracks. People are still doing things manually that they shouldn't. And I think Pedram from Hightouch put it really well with the term quality of life. So I think what I'm most excited about is building tooling that dramatically improves the quality of life of data engineers and analytics engineers, because that will, I think, empower much more effective use of data because, you know, ultimately, these roles are bottlenecks of what data gets created and fed into all the downstream applications, right, like machine learning and business intelligence. So I'm very excited to solve more of these problems within Datafold.
But outside of DataFold and what is our scope, I'm actually very curious and and bullish about, I think, the emerging field of semantic data management. I think that 1 of the challenges I've observed in the past years is that we learn how to process data fast, really, you know, large data fast, but we haven't really learned how to make data more meaningful for the automatic processing. Right? So we still require people to really understand what every data point means, and there's always this layer of interpretation that we need to put a person to write SQL to actually be able to translate a business question into, you know, a SQL query and into the technical implementation. And this is also 1 of the very interesting things I observe in data space. There are no there's no such role as a data QA engineer. I haven't really seen that, and I don't think that's because we are really confident in data quality.
I think it's because, unlike software, where you can clearly define what is right and what is wrong, and then you can put someone who is a, you know, qualified QA engineer and tester, and they can actually run through the script and tell you, yes, this is right or wrong, without necessarily having the software engineering acumen. In the field of data, that's not possible. We actually need data engineers who are extremely trained, overqualified for this role, doing QA, because it's impossible to do QA unless you actually bring all this business context and meaning into it. So coming back to semantic data management, I think you're starting to see these projects pop up internally at companies. For example, the work that Chad Sanderson has been doing at Convoy building the Chassis platform that kind of maps the business entities to events and then helps both people and machines within the data platform consume data more meaningfully.
[01:09:02] Unknown:
So by defining and leveraging kind of the formal definition of relationships and business meaning, I think that has a lot of future. So I'm really curious to see what comes out of it. Yeah. I think, I mean, as I mentioned earlier, I'm not super deep on the data workflows. Right? And I think I mentioned before that the feedback loops are slow. I think from working a bit on search, which is 1 of these really gnarly problems where you kinda need a very strong infrastructure team, very strong back end team, very strong data engineering team, and a very strong data science team. The reverse ETL process of getting things, like, kind of back into the production database from data land, there doesn't seem to have emerged super strong patterns for me there. That was really hard, I think, on the search side, and data replication was just 1 of those problems. So that's definitely 1 that I see. Yeah.
[01:09:47] Unknown:
Alright. Well, thank you both very much for taking the time today to join me and share the work that you've been doing on Datadiff. It's definitely a very cool utility and 1 that I'm excited to start incorporating into my own platform and workflow. So I appreciate all of the time and energy that you've been putting into that, and I hope you enjoy the rest of your day. Thank you so much, Tobias. Thanks for having us. Thank you for listening. Don't forget to check out our other shows, the Data Engineering Podcast, which covers the latest on modern data management, and the Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you learned something or tried out a project from the show, then tell us about it. Email hosts@pythonpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction to DataDiff and Guests
Gleb's Journey in Data Engineering
Simon's Background and Experience
Overview of DataDiff Tool
Challenges in Data Replication
Technical Details of DataDiff
Integration with Other Tools
Hashing Algorithms and Data Types
Use Cases and Adoption
Open Sourcing DataDiff
Lessons Learned and Future Plans
When DataDiff is Not the Right Tool
Closing Thoughts and Future Vision