Summary
A significant portion of data workflows involves storing and processing information in database engines. Validating that the information is stored and processed correctly can be complex and time-consuming, especially when the source and destination speak different dialects of SQL. In this episode Gleb Mezhanskiy, founder and CEO of Datafold, discusses the different error conditions and solutions that you need to know about to ensure the accuracy of your data.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster today to get started. Your first 30 days are free!
- Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
- Join us at the top event for the global data community, Data Council Austin. From March 26-28th 2024, we'll play host to hundreds of attendees, 100 top speakers and dozens of startups that are advancing data science, engineering and AI. Data Council attendees are amazing founders, data scientists, lead engineers, CTOs, heads of data, investors and community organizers who are all working together to build the future of data and sharing their insights and learnings through deeply technical talks. As a listener to the Data Engineering Podcast you can get a special discount off regular priced and late bird tickets by using the promo code dataengpod20. Don't miss out on our only event this year! Visit dataengineeringpodcast.com/data-council and use code dataengpod20 to register today!
- Your host is Tobias Macey and today I'm welcoming back Gleb Mezhanskiy to talk about how to reconcile data in database environments
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by outlining some of the situations where reconciling data between databases is needed?
- What are examples of the error conditions that you are likely to run into when duplicating information between database engines?
- When these errors do occur, what are some of the problems that they can cause?
- When teams are replicating data between database engines, what are some of the common patterns for managing those flows?
- How does that change between continual and one-time replication?
- What are some of the steps involved in verifying the integrity of data replication between database engines?
- If the source or destination isn't a traditional database engine (e.g. data lakehouse) how does that change the work involved in verifying the success of the replication?
- What are the challenges of validating and reconciling data?
- Sheer scale and cost of pulling data out, have to do in-place
- Performance. Pushing databases to the limit, especially hard for OLTP and legacy
- Cross-database compatibility
- Data types
- What are the most interesting, innovative, or unexpected ways that you have seen Datafold/data-diff used in the context of cross-database validation?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Datafold?
- When is Datafold/data-diff the wrong choice?
- What do you have planned for the future of Datafold?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
Links
- Datafold
- data-diff
- Hive
- Presto
- Spark
- SAP HANA
- Change Data Capture
- Nessie
- LakeFS
- Iceberg Tables
- SQLGlot
- Trino
- GitHub Copilot
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- Starburst: ![Starburst Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/UpvN7wDT.png) This episode is brought to you by Starburst - a data lake analytics platform for data engineers who are battling to build and scale high quality data pipelines on the data lake. Powered by Trino, Starburst runs petabyte-scale SQL analytics fast at a fraction of the cost of traditional methods, helping you meet all your data needs ranging from AI/ML workloads to data applications to complete analytics. Trusted by the teams at Comcast and Doordash, Starburst delivers the adaptability and flexibility a lakehouse ecosystem promises, while providing a single point of access for your data and all your data governance allowing you to discover, transform, govern, and secure all in one place. Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Try Starburst Galaxy today, the easiest and fastest way to get started using Trino, and get $500 of credits free. [dataengineeringpodcast.com/starburst](https://www.dataengineeringpodcast.com/starburst)
- Dagster: ![Dagster Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/jz4xfquZ.png) Data teams are tasked with helping organizations deliver on the promise of data, and with ML and AI maturing rapidly, expectations have never been this high. However, data engineers are challenged by both technical complexity and organizational complexity, with heterogeneous technologies to adopt, multiple data disciplines converging, legacy systems to support, and costs to manage. Dagster is an open-source orchestration solution that helps data teams rein in this complexity and build data platforms that provide unparalleled observability and testability, all while fostering collaboration across the enterprise. With enterprise-grade hosting on Dagster Cloud, you gain even more capabilities, adding cost management, security, and CI support to further boost your teams' productivity. Go to [dagster.io](https://dagster.io/lp/dagster-cloud-trial?source=data-eng-podcast) today to get your first 30 days free!
- Data Council: ![Data Council Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/3WD2in1j.png) Join us at the top event for the global data community, Data Council Austin. From March 26-28th 2024, we'll play host to hundreds of attendees, 100 top speakers and dozens of startups that are advancing data science, engineering and AI. Data Council attendees are amazing founders, data scientists, lead engineers, CTOs, heads of data, investors and community organizers who are all working together to build the future of data and sharing their insights and learnings through deeply technical talks. As a listener to the Data Engineering Podcast you can get a special discount off regular priced and late bird tickets by using the promo code dataengpod20. Don't miss out on our only event this year! Visit [dataengineeringpodcast.com/data-council](https://www.dataengineeringpodcast.com/data-council) and use code **dataengpod20** to register today! Promo Code: dataengpod20
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte scale SQL analytics fast at a fraction of the cost of traditional methods so that you can meet all of your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and DoorDash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data.
Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. Dagster offers a new approach to building and running data platforms and data pipelines. It is an open source, cloud native orchestrator for the whole development lifecycle with integrated lineage and observability, a declarative programming model, and best in class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise class hosted solution that offers serverless and hybrid deployments, enhanced security, and on demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster today to get started, and your first 30 days are free.
Your host is Tobias Macey. And today, I'd like to welcome back Gleb Mezhanskiy to talk about how to reconcile data in database environments. So, Gleb, for folks who haven't heard you in the past, can you just give a brief introduction?
[00:01:50] Unknown:
Absolutely. Thanks for hosting me again, Tobias. I'm Gleb. I am CEO and cofounder of Datafold. We automate testing for data engineers, and I've been thinking about how long I've been doing data, and it's been almost 10 years. And over the course of those 10 years, I've been lucky to work on building and scaling data platforms at three very different companies, including Autodesk, Lyft, where I was one of the first data hires, and also a seed stage startup, before I started working on Datafold. And I can surely attest that a lot has changed in the world of data management over the last 10 years, but I think some of the problems that we may be talking about today are still the same, if not more acutely felt by data practitioners. So excited to chat.
[00:02:42] Unknown:
You couldn't have picked a more interesting 10 years to get started in data because it feels like the past decade of data has had more in it than maybe the past half century before it. Although, I mean, I'm sure everybody probably feels that way.
[00:02:55] Unknown:
Yeah. For sure. I think we've definitely seen a lot of great innovation happening across all fronts. But to a large extent, one of the, I think, biggest things that happened over the last 10 years is the evolution of the data warehouse in terms of scalability, in both storage and compute, and also just how basic it feels today to be able to just process and store enormous amounts of data and also integrate all possible data sources in one system. And I think that kind of gave the foundation for modern data platforms and other tools and also spurred a whole bunch of new problems that only existed at Big Tech maybe a decade ago.
[00:03:39] Unknown:
You mentioned that you got started about 10 years ago and that you came in via Autodesk as your first data role. I'm just wondering what it is about this space that has kept you engaged and interested over the past 10 years that motivated you to wanna just keep digging deeper into it.
[00:03:57] Unknown:
Yeah. I think, when I first started my career, I wasn't necessarily set on data. I was curious about it. But the more I got exposed to the data space and data engineering in particular, the more excited I was to see all the opportunities at businesses of different, you know, sizes and stages. So Autodesk being, you know, a really mature corporate technology business, and Lyft going through a hypergrowth period when I joined them. What was really interesting was to see all the possibilities for data to make an impact on the business, but at the same time, seeing enormous bottlenecks that existed to actually be able to use that data and activate it. And those bottlenecks were mostly not in the end application, so there wasn't necessarily a problem of, you know, how do we train a machine learning model or how do we build a dashboard and visualize it. All those problems were related to how do we actually get that data delivered in a clean and reliable and timely manner for those end applications.
And that's why I got so excited about data engineering, because I felt that while a lot of the end user applications or the final applications for data were evolving really fast, and a lot of innovation had been happening in machine learning at the time and even in database technology, the workflow of getting, you know, raw data collected, integrated into one place, cleaned, and processed, which is kind of the classic chunk of data engineering, was just consistently terrible throughout all of my experiences. And the problem wasn't that, you know, we couldn't buy software or couldn't afford it. It was just that there weren't really a lot of tools that could support the workflow to make it productive and to help engineers deal with the complexity and the quality issues. And I've just been fascinated by this problem, and I kind of switched from being a hands on practitioner building pipelines to first being an engineer working on tools and then eventually becoming a data PM at Lyft focusing on basically ETL tooling and anomaly detection and, you know, cataloging, the ability to basically improve the workflows of data engineers. So that's the problem I've been passionate about for years now. And what we're doing at Datafold is pretty much exactly solving that problem.
[00:06:28] Unknown:
Bringing us now to the discussion around reconciliation and how that is applied and the strategies for doing that in different database environments. So I'm wondering if we can just start by unpacking that term reconciliation and what that means in the context of data and database engines.
[00:06:49] Unknown:
Yeah. Absolutely. I think reconciliation is one of the dimensions of the broader data quality problem space. You know, just like in software quality, data quality is a very multidimensional problem where there are problems of timeliness and data completeness, and I won't recite all the terminology. But, essentially, what it means is we have to solve different problems in different workflows. Right? When engineers are writing code, when they are looking for data, when they are deploying their code, when they're maintaining data pipelines in production.
And reconciliation is just one of those workflows that I think hasn't been really served by tools in my career, despite being very important, and hasn't gotten that much spotlight, I think, in general in the last few years as the data quality tools have been evolving. So I think it just deserves to be unpacked and talked about, because I certainly felt that pain, and I've seen a lot of really high performing data teams facing a lot of these issues. So what is data reconciliation? Ultimately, it's all about understanding the differences in data across database environments or within database environments, and that can be applied in a number of workflows. So as an example, when we change the code as data practitioners, which is something we do all the time, we can introduce new metrics. We can change definitions for metrics. We can refactor our code, which these days is usually SQL that runs on OLAP databases, changing that code, for example, to be more performant and solve performance bottlenecks. We need to make sure that when we deploy this code, it doesn't introduce any regressions. And so reconciliation applies here in the form of: we wanna make sure that the data which is produced by the updated version of our SQL code is actually the same as with the previous one, at least in the areas where we expect it to be the same. And so that is an example where we would reconcile or diff the data which is in the production environment versus staging, which is built with the new version of the code. And that typically is done within the same database environment, maybe just different tables or different schemas. Another example of reconciliation, and maybe a more advanced or harder one, is when we have to reconcile data across completely different, physically separate database systems.
And that I've seen happening in two major workflows. One of them is what we call data migrations. I think a lot of data practitioners have seen those in their tenure. It's basically when you operate a certain data stack, and at some point, you as a team outgrow that data stack, either because of data volumes or because you need to reach for better capabilities, and you need to move all of your workloads to a new system. So for example, during my tenure at Lyft, we spent over 3 years migrating from Redshift to a data lake that was built on top of the Hive ecosystem and was served by three engines at the time: Hive, Presto, and Spark. And that is something I spent a lot of my time on. And one of the most painful experiences that I had during that process was that, basically, for any migration to be successful, you have to prove to yourself and then to the data users who are ultimately consuming that data that whatever you have in the new environment, once you've migrated your jobs and ported your SQL queries and data, is the same as what you had in the legacy environment. And if it's not the same, that's a big problem, because maybe those metrics were reported already and users just don't trust the data. And if they don't trust the data, we, as data engineers, fail, because no one wants to use our pipelines.
And we basically needed to reconcile a lot of our metrics, a lot of our core tables, between Redshift and the new data lake. The challenge is that you basically need to understand, in this case, the differences at both the high level and in detail. So, for example, let's say we ported the SQL code from Redshift to Hive or to Presto. We ran the queries. We did the backfill, and then the numbers just don't match at the high level. Let's say the counts don't match. What do you do next? You have to understand why exactly they're not matching, and so you start drilling in. And drilling in when you have two different environments, where you can't simply issue a SQL query that just, you know, does, like, a join of the tables and tells you about the rows that are not matching, is extremely hard. You basically start doing a lot of manual work, like trying to, you know, do samples and find examples of data not matching, or you start doing group bys. And that has been an extremely laborious process that really slowed us down. And if you multiply this by, you know, all the engineers working on the project, you get an enormous loss of productivity, and the entire project gets delayed. So that's something we struggled with a lot. And the third example of reconciliation that is also becoming ever more important, I think, as data platforms become more mature and they start to be really the centerpiece of all data being integrated, is validating data replication.
So in the migration example, we talked about, okay, we have a legacy environment, we have a new environment, and the goal is to move all the workloads to the new environment, shut down the legacy tool, get rid of it, and move on. With replication, the story is not over once you move the data. Right? Typically, replication is done from the business apps that sometimes are backed by databases like, you know, SAP HANA or Oracle, or OLTP databases that back production servers, so MySQL, Postgres, sometimes NoSQL databases like Mongo and Dynamo. And those databases, they operate at high scale and they're really optimized for high throughput of transactions, and they contain really valuable data that, in my practice, data teams really love to take advantage of. And historically, you know, obviously, data teams have been operating on a lot of event data that is explicitly instrumented in application code or in the web apps, but transactional data is really, really important because it acts as the source of truth. Right? If something is sitting in our microservices database, that is pretty much as close to the source of truth as possible. So being able to leverage that data in the warehouse alongside event sources has always been an increasing desire in my experience, in my career, and in what I've seen happening these days.
And so when you wanna replicate data across databases, that is a hard problem, because those OLTP (online transaction processing) systems provide high throughput at high scale. You can't just, you know, query them, export data, and then import it into your warehouse at scale. So a lot of the times, teams would embark on using what's called change data capture streams, where you stream out the updates that happen to the database and ingest them on the other end, and then you, through that, understand and replicate all the tables that you have in your OLTP databases into your OLAP system.
And that is a very hard problem to solve, but today there exist a lot of both proprietary software vendors and open source tools like Debezium and Kafka to do that. But that process is never perfect, and we can talk about kind of why and what the challenges are. And so being able to ensure that when this process is performed, you actually have consistent data on both sides, and that whatever you have in your production system that you deem the source of truth is actually replicated properly on the other end where you wanna use it, becomes a really high priority for the data team as far as their data quality journey goes. Because otherwise, if you're not sure about the consistency, you can't really say, I'm using source-of-truth data. And so solving that problem basically requires building a system that is able to understand the differences between datasets at scale across different databases. And that's a very hard problem, but we've been working on this for a few years now, and I'm excited to unpack it more.
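To make the change data capture discussion above more concrete, here is a simplified, hypothetical illustration of the kind of change event a log-based CDC tool such as Debezium emits (the envelope is abbreviated and all values are invented; see the Debezium documentation for the real format), and why dropping even one event matters.

```python
# Simplified, hypothetical Debezium-style change event; the real envelope
# carries more metadata and the values here are invented.
change_event = {
    "op": "d",                                  # c = create, u = update, d = delete, r = snapshot read
    "ts_ms": 1710421800123,                     # when the change was captured
    "before": {"id": 42, "status": "active"},   # row state before the change
    "after": None,                              # None for deletes
    "source": {"db": "app", "table": "users"},  # where the change originated
}

# If this single delete event is dropped or applied out of order, user 42
# lingers in the warehouse even though it no longer exists in the source,
# which is exactly the kind of drift that reconciliation is meant to catch.
```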
[00:15:35] Unknown:
Another interesting aspect of this replication use case is that it's presuming the ELT approach of I'm just going to do a bit for bit copy of the source data to the destination with no transformation in between. And I'm curious what are some of the stumbling blocks or additional challenges that are introduced when you do have some form of transformation in the middle before you get it into the destination, and then being able to validate that the data that you have in your target is actually all the same data that you have in your source, even accounting for the mutation that's happening? Yeah. It's a great question, Tobias. I would say if someone is listening to this and they are actually doing transformations in flight, they should probably stop listening and rewrite it to ELT right now. But seriously, I think that this is a pattern that was more prevalent maybe in the early days of data, and
[00:16:26] Unknown:
we've seen less and less of it happening these days, because I think teams realize that even moving the raw data around is hard enough. Applying transformations to it in flight at large scale is just error prone and also makes the whole development inflexible. And, therefore, the industry massively adopted the ELT pattern where you just dump whatever you have, and then you build downstream pipelines. And even if someone is wanting to do streaming transformations in flight for, you know, legitimate use cases, like, for example, online machine learning model inference and fraud detection where this is really warranted, we still see that there is, like, kind of an outlet for the raw stream that is always saved in a big lake or in the warehouse so that it can always be queried. But, yeah, if the transformations are done, that only makes the reconciliation process harder.
[00:17:18] Unknown:
Another interesting aspect of reconciliation as you were talking about earlier is the case of I have all of my data in the warehouse. I'm doing some transformations, but now I need to modify that transformation a little bit to address some new business use case or address some data quality issue. And that brings to mind situations like the Snowflake copy on write tables where I'm going to write my new DBT code. I'm going to deploy it to a branch, and then I'm going to create a copy of that source table, run the new transformations. And now I need to make sure that all of the same rows are being processed, but that my additional logic is also being accounted for. And I'm curious what additional stumbling blocks you run into for those types of reconciliation use cases.
[00:18:04] Unknown:
You mean, basically, when someone is working on an environment that is, like, forked from production, and production can move on from that baseline? Correct. Yep. Yes. That is an interesting problem, which is what we call data drift. Ultimately, to be able to reconcile data for the testing-code-changes use case, especially in, you know, CI/CD processes, you ultimately need to be able to compare apples to apples. And luckily today, the more modern data transformation frameworks such as dbt, they facilitate the creation of proper staging environments.
And dbt, in particular, has this feature set which is called state and defer. And the idea is that a lot of the times when we wanna do some development or we wanna build a staging environment, we can basically reuse the production data that's already computed for everything that we know hasn't been changed, and only do the computation on the changed code and changed datasets. That minimizes the data drift, because the same data is reused for both the production system and staging. I think where things are getting harder is when your production system is changing very frequently, when you have, you know, either live tables or your jobs are running just very, very frequently. And so in this case, the data drift is likely to happen because production is just updated so often. In this case, there are a number of techniques that teams can use, including watermarking.
So applying filters to only look at data up to a particular timestamp, which really works for append-only tables, as well as using time travel, where you can query a dataset at a particular point in time. So you can actually compare your staging environment to the production environment as of a particular date. So, you know, you can kind of freeze your data in time. So, luckily, there are more advanced techniques that more data warehouses support that make comparing data, reconciling it, in an efficient manner possible.
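As a rough illustration of the watermarking and time travel techniques mentioned above, here is a minimal sketch that compares a staging build of a model against production while controlling for data drift. The table names, the cutoff values, and the Snowflake-style AT (TIMESTAMP => ...) clause are assumptions; adapt the syntax to your warehouse.

```python
# Minimal sketch: diff a staging model against production while controlling
# for data drift. Table names and cutoffs are hypothetical, and
# AT (TIMESTAMP => ...) is Snowflake-style time travel syntax.

FROZEN_AT = "2024-03-01 00:00:00"   # pin production as of this point in time
WATERMARK = "2024-02-29"            # only compare rows up to this date

rows_missing_from_staging = f"""
SELECT * FROM analytics.orders
  AT (TIMESTAMP => '{FROZEN_AT}'::timestamp)       -- time travel on production
WHERE order_date <= '{WATERMARK}'                  -- watermark filter
EXCEPT
SELECT * FROM dev_staging.orders
WHERE order_date <= '{WATERMARK}'
"""

# Running the same query with the two SELECTs swapped surfaces rows that
# exist only in staging; empty results in both directions mean the new code
# reproduces production for the frozen window.
```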
[00:20:15] Unknown:
Which brings to mind also some of the versioned data lake and data warehouse systems such as LakeFS or Nessie, where you wanna be able to say, I only want to look at the data as of this commit or as of this tag in the data system, so that I can reconcile across them and ignore the fact that there is new data that's been appended, which is some of the benefit that you get through, like, the Iceberg table spec or the LakeFS sort of versioned S3 data store.
[00:20:46] Unknown:
Yeah. I'm really curious to see how version control for data proliferates in, like, mainstream data pipelining. I think those are some really powerful ideas that, you know, for large-scale data lakes, definitely make the workflow smoother. But so far, at least in my experience, this still lives in, like, the advanced category, whereas, like, the mainstream workflow operates on a more simple construct. But that's definitely an area I'm watching closely.
[00:21:16] Unknown:
And digging now into the error conditions that you run into. So maybe taking it a couple of categories at a time. So I'm replicating my data from my source transactional database into my data warehouse, and everything has been running smoothly. But now I'm running my reconciliation report, and, oh no, I'm getting an error. I'm curious what are some of the types of sources of those errors and some of the downstream impact that those are likely to have as you continue to propagate changes downstream?
[00:21:50] Unknown:
Yeah. I think that the challenges that exist in cross-database data replication, they stem from the fact that we're dealing with different systems operating at high scale, and so you inevitably have to make trade offs. And one of the trade offs is that you're not really trying to continuously, like, refresh the data. You're not, you know, dumping the entire table and then serving it, which would certainly be more robust and simpler. But a lot of the times, your scale or your, you know, load constraints on your production system just cannot allow that, and so that's why the change data capture kind of technique is utilized. And in terms of the types of issues that can occur, there's really a lot. For example, schema changes: when schemas are updated on the source side, updating them on the target side could be quite tricky. And the other issues are with just the fact that we're talking about streaming datasets. When you're dealing with streaming, it's all about trade offs. Right? And you start running into issues such as event ordering and event duplication, and you kinda have to make trade offs about whether you want the data to be delivered for sure or you want the data to be delivered exactly once, but not necessarily both things at the same time. And because the stream actually contains, basically, the transaction log for the database, if you're missing even one event, sometimes that actually can have pretty big consequences on how the data looks on the other end. The other challenges that are more specific to what systems are used are also how the transactions are handled, like where exactly the transaction boundaries are on the OLTP side, and how the deletes are handled, because the data could be deleted but not necessarily accordingly represented on the other end. And I think that the biggest problem that teams are facing is that the systems that are designed to transport data, they actually don't really have robust mechanisms for kind of QA'ing their own work. So their work is kind of to deliver the data, but then they leave the QA to the user. And there hasn't been much tooling that would actually allow engineers to say, like, oh yeah, we have a, you know, problem here, or the data is not fully consistent. So that's kind of the problem space that we're seeing. In terms of how to solve this problem, how to actually reconcile data and validate it, that also has a number of challenges. And there are probably three main groups of issues. One, just the sheer scale of data. That means that, for the same reason you do CDC versus just exporting and, you know, importing data in bulk, you can't compare data by pulling it out from the databases into some third system and running computations, like, in memory. Right? Because, like, if we were to try to build something naively, we could use something like, you know, Apache Spark, which has amazing interoperability, that could query all the other systems, import everything into one data frame, and then compare it and spit out the results. It turns out this is, a lot of the times, not an option for teams because of, one, the load that can be imposed on both databases, and two, the fact that it means you have a lot of data traveling over the network, which at scale gets expensive and becomes slow.
So what that means is that when we are doing reconciliation and we are diffing data, we have to do this without pulling the data out, or with pulling only a minimal, you know, amount of data relative to the entire size of the table out. How does this work? That's kind of where the magic of, you know, data-diff and Datafold comes in. Basically, we have an algorithm that relies on hashing data on both sides and then doing an algorithmic search on those hashes to basically understand where the discrepancies are, and then drilling in when we find discrepancies, and kind of being able to go from a 100 billion row table to the particular row and particular value that are different. So that itself is a very hard problem just from the performance and algorithmic perspective. The other challenge is, again, similar to why CDC is hard: you're dealing with data systems that have been built completely differently by different teams with, like, no real expectation that they would have to have a handshake or work together. Right? So the only thing they have in common is probably SQL.
That means that you can have different data types, different ways that those data types are handled, represented, even ordered. So collations, which is the mechanism by which the database actually decides how to order values in a given column, are also different and differently implemented across systems. And some systems have more options in terms of how you collate the data and some have fewer. But if you can't order data in the same way, that also makes comparing the data really hard, especially if you try to do this without exporting the entire table. So that's just to give you a glimpse of some of the challenges we're seeing with cross-database compatibility and reconciliation.
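Here is a rough sketch of the hashing-and-search idea Gleb describes: aggregate checksums of key ranges are compared first, and only ranges whose checksums disagree are drilled into, so full rows rarely leave either database. checksum_range and fetch_rows are hypothetical stand-ins for queries pushed down to each database, and the real data-diff implementation differs in many details.

```python
# Rough sketch of checksum-based diffing: compare aggregate hashes of key
# ranges and only recurse into ranges whose hashes disagree, so full rows
# rarely travel over the network. checksum_range() and fetch_rows() are
# hypothetical stand-ins for queries pushed down to each database.

def diff_key_range(src, dst, lo, hi, threshold=1000):
    if checksum_range(src, lo, hi) == checksum_range(dst, lo, hi):
        return []                                  # segment matches; skip it entirely
    if hi - lo <= threshold:
        # Segment is small enough: pull the actual rows and compare them.
        src_rows = set(fetch_rows(src, lo, hi))
        dst_rows = set(fetch_rows(dst, lo, hi))
        return sorted(src_rows ^ dst_rows)         # rows present on only one side
    mid = (lo + hi) // 2                           # otherwise bisect and recurse
    return (diff_key_range(src, dst, lo, mid, threshold)
            + diff_key_range(src, dst, mid, hi, threshold))
```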
[00:27:08] Unknown:
Are you sick and tired of salesy data conferences? You know, the ones run by large tech companies and cloud vendors? Well, so am I. And that's why I started Data Council, the best vendor neutral, no BS data conference around. I'm Pete Soderling, and I'd like to personally invite you to Austin this March 26th to 28th, where I'll play host to hundreds of attendees, 100 plus top speakers, and dozens of hot startups on the cutting edge of data science, engineering, and AI. The community that attends Data Council are some of the smartest founders, data scientists, lead engineers, CTOs, heads of data, investors, and community organizers who are all working together to build the future of data and AI.
And as a listener to the data engineering podcast, you can join us. Get a special discount off tickets by using the promo code depod20. That's depod20. I guarantee that you'll be inspired by the folks at the event, and I can't wait to see you there.
[00:28:08] Unknown:
And then in the situation of a warehouse migration project, what are some of the different manifestations of error conditions that are likely to occur and some of the types of resolutions that teams need to be planning for? Yeah. I think in terms of the migration,
[00:28:25] Unknown:
if we think about the entire project, the idea is pretty simple. We take all the jobs we have on the legacy side. We take the code, which in the majority of cases is SQL, let's say, running on some legacy system. We convert it to, you know, a new dialect, and these days, it's actually pretty easy to do with open source tools like SQLGlot, and that part is easy. However, the devil is in the details, because even if you do the one-to-one automatic conversion of the SQL, things are not really matching most of the time, because you still have to rewrite certain parts of the queries, certain constructs are not supported, because the data types are no longer the same, because you inevitably need to refactor some things. Otherwise, they just, like, won't run on your system, and you end up with certain discrepancies. And those don't necessarily have to be large to actually pose a risk to the project.
You know, even if you're missing, you know, 0.01% of your transactions, or your revenue calculations are, you know, less than 1% or 0.1% off, this is still, a lot of the times, a nonstarter for the business to be able to accept the migration. Right? And then the other challenge is the sheer fact that comparing data is so hard: even if you're likely to have data fully matching, unless you can efficiently understand that and prove that, it still becomes a really, really big uphill challenge.
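For the dialect translation step mentioned above, here is a minimal sketch of what transpiling a single query with SQLGlot looks like; the query and dialect names are illustrative, and the exact output depends on the SQLGlot version.

```python
# Minimal sketch: transpile a Redshift query into Trino SQL with SQLGlot.
# The query is illustrative; output formatting depends on the SQLGlot version.
import sqlglot

redshift_sql = "SELECT user_id, DATEADD(day, -7, GETDATE()) AS cutoff FROM events"

# transpile() returns one translated string per input statement.
print(sqlglot.transpile(redshift_sql, read="redshift", write="trino")[0])
```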
[00:29:51] Unknown:
Getting a little bit off topic here, but just as a point of curiosity, I'm wondering how you have seen some of the data virtualization engines used as a means of maybe simplifying that migration path of being able to say, well, I've got my source system in Redshift. I'm trying to go to Snowflake, but I can federate across them using, like, a Trino or a Presto, and if that's something that you ever see people address as a means of reducing the user experience drift during that migration phase.
[00:30:19] Unknown:
Yeah. I think we sometimes see the virtualization happening, especially when the target, the destination for the migration, is a data lake with some sort of an open source engine, like you mentioned, you know, Presto or Trino, with very high interoperability across the entire spectrum of systems. And I think what that allows you to do is also reduce the pressure on the business while the migration is underway. I think, in general, speaking about migrations, one of the best practices that I recommend when someone is undergoing a migration is to make sure that, even if you haven't ported all the jobs that produce the data, you can at least serve all the data with the new system as early as possible in the project, because that just massively reduces the user dependency on the old system, gives you more leeway, and reduces the pressure and load and kind of the overall pain that we experience. And so to that end, if you're operating with a kind of data virtualization layer, that just makes it simpler. If you're not, then you just, you know, regularly copy data from legacy into the new system, and you serve it from there.
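To illustrate the virtualization idea, here is a hedged sketch of a federated query through Trino that checks a legacy table against its migrated copy in the lake. The catalog, schema, and table names are hypothetical, and it assumes both catalogs are already configured in Trino.

```python
# Sketch: federated comparison through Trino, joining a table still in the
# legacy warehouse (a hypothetical "redshift" catalog) with its migrated
# copy in the lake (a hypothetical "iceberg" catalog). Names are illustrative.
import trino  # pip install trino

conn = trino.dbapi.connect(host="trino.internal", port=8080, user="analytics")
cur = conn.cursor()
cur.execute("""
    SELECT legacy.order_id, legacy.amount AS legacy_amount, lake.amount AS lake_amount
    FROM redshift.public.orders AS legacy
    JOIN iceberg.analytics.orders AS lake ON legacy.order_id = lake.order_id
    WHERE legacy.amount <> lake.amount
    LIMIT 100
""")
for row in cur.fetchall():
    print(row)  # rows where the two systems disagree on the amount
```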
[00:31:28] Unknown:
Beyond just the bare facts of I have this piece of information in my source system, I have the same piece of information in my destination system, I'm wondering what are some of the ways that the vagaries of different database engines and the ways that they manage the storage layer, in particular things like partitioning or updates, how that adds complication to the work of bringing data between different systems, and being able to ensure that your business logic is able to be implemented effectively across them. Or is it largely just a case of everything is SQL and I just let the database engine worry about those vagaries so I don't have to? Yeah. Well, I think there are two questions. Like, one,
[00:32:15] Unknown:
how it complicates the actual, you know, moving-data part, the migration or replication, and the other, how it complicates the reconciliation process. So in terms of the actual moving of data and, like, migrating the jobs, the good thing is that modern systems usually are far more abstracted than the legacy systems in terms of what the user is expected to worry about, to care about, and the failure modes. And, you know, to give an example, in Hive, I believe there are over a hundred different parameters that you actually put in your SQL query that dictate, you know, how much memory you allocate for your map tasks and, you know, your shuffle stage. And so that just means that there are so many different failure modes and bottlenecks that the user needs to worry about, and mastering such a system requires really deep expertise in a particular technology. Whereas if you talk about something like Snowflake, it basically just works. And although, obviously, there could be better performance or worse performance, the baseline is just so much easier.
And so, typically, when users are upgrading their data platforms, they're going from something which is slow, complex, hard to run, and requires deep expertise, to something which is much simpler and more robust. So, usually, it is easier to make it run on the new system as opposed to, like, the other way around. But, certainly, to your point about partitions, that's another example where certain legacy systems, for example, could not work unless the data was partitioned. Right? In the new systems, that is not necessarily true. They can just operate on an entire table and apply kind of clever, you know, clustering techniques and other things to make things run well. So, again, we're talking about, like, reduced complexity.
When it comes to data reconciliation, things are a little bit more interesting, because when you reconcile data, when you diff data, in the case of, let's say, Datafold comparing something like SQL Server to Snowflake, we have to deal with the legacy system in its, you know, entirety and all its complications and challenges. And so while implementing a high performance diffing algorithm that would compare something like BigQuery to Snowflake is relatively straightforward, given, you know, a good high performing algorithm, if you're dealing with something like SQL Server or Oracle, that becomes much harder because, one, the data is stored, typically, in a row-oriented fashion.
And whereas in your OLAP systems, especially modern ones, column functions that operate on an entire column run really fast and are optimized, in the OLTP databases they are not necessarily built to operate in such a fashion, which is actually required for reconciliation. Also, they have been built, you know, for different types of operations and analytics performed on them. And that has required us to apply a lot of really deep database optimizations that are specific to each engine to make sure that when we do a reconciliation of, let's say, a, you know, 1 billion row dataset with over a hundred columns between a proprietary OLTP vendor database and, you know, something like Snowflake or Databricks, it is performant and utilizes the compute on the more bottlenecked side, which is usually the legacy database, as efficiently as possible.
So that's definitely a a harder challenge.
[00:35:46] Unknown:
Talking about all of the complexities of managing reconciliation, the different failure modes, the error conditions, all the systems that I need to be aware of, and the ways that they interoperate: where's the easy button? How do I make it easier so that I don't have to worry about all the different ways things can go wrong and how to fix them? Well, you just use Datafold.
[00:36:06] Unknown:
So, basically, just to be clear, you know, we're not in the business of actually moving the data. It's, I would say, a very different problem, because moving data is a large problem space that can be done with open source technologies, can be done with vendors. Sometimes you want to do real time event processing in the meantime. So we're not really in that business. Our job is to make sure that data practitioners have really efficient data quality workflows, and this is just an example of a data quality workflow that we're automating. And so what we are doing on our end to make sure that this process is as painless as possible for the practitioner is just, again, abstracting them away from all of these complexities.
So, you know, our team deeply researched the internals of all the database systems we're operating in. We mapped all the data types and how they map across different databases, and what can be compared as is, and what needs certain trade offs to be compared. You know, floats and decimals are an example of a really tough pair of data types to be able to compare. We've also applied a lot of performance optimization techniques, and I would say that the other interesting kind of innovation, or set of problems that we had to solve in our product, is how do you implement and think about the user experience? Like, okay, we have a system that pushes all these databases to the limit and provides you with this diff report. But how do we actually make it useful to a practitioner?
And what we discovered is that a lot of the times, for example, users don't necessarily need every single row, every single discrepancy accounted for. A lot of the times, what they really wanna understand is: what is the magnitude of those differences? Which columns are affected? Is the affected area spanning a particular date range? And based on that, they can actually make a lot of the decisions on how they deal with this discrepancy and how they remediate. And so some of the interesting things that we built recently are, for example, the streaming of results in real time and also sampling. So streaming results is interesting because in the classic sort of data engineering workflow, you are used to your query being atomic. Right? So you have a query. It may be doing a lot of computation. You send it to the database. You wait for x number of minutes. If it's something really big, it could, you know, run for an hour, and then you get it back. But in the case of cross-database diffing, sometimes it actually takes longer, and you would actually benefit from seeing partial results of such a query. And so we actually implemented an algorithm that allows the user to see intermediate results. And at some point, they may say, oh, you know, I see that there's at least, like, a 0.1 percent difference in the row count, and this column that I really care about being equal, I see already there are, like, more than 10 discrepancies. I'm just gonna stop this process and work on remediation. I don't really need the entire 1 billion row dataset to be examined. So that's just an example of, like, a really interesting UX hack that we discovered that simplifies the workflow. And then the sampling is also interesting because, for the same reason, users are, a lot of the times, interested in having a representative sample of those differences.
Because if you have, you know, a 100 billion row by 150 column dataset, that's a lot of data to go through, especially, like I said, on the OLTP side. But what you may actually be interested in is understanding which columns are equal and which columns have at least x percentage of discrepancies. And so you want, as you go and you diff those results, to actually have a representative sample of that data, which is quite hard to do because you don't know how those discrepancies are distributed. So you have to be clever about how you statistically capture that sample of discrepancies in your dataset. So those problems weren't necessarily apparent to us when we embarked on this, but the more we worked with our users at scale, those were some of the interesting problems we had to solve.
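A tiny sketch of the early-stop behavior described above: consume a stream of partial diff results and abandon the run once the observed discrepancy rate crosses a threshold. stream_diff_segments is a hypothetical generator standing in for whatever engine produces partial results, and the 0.1% threshold is just an example.

```python
# Sketch of early stopping over streamed diff results. stream_diff_segments()
# is a hypothetical generator of partial results; the threshold is arbitrary.

MAX_DIFF_RATIO = 0.001        # e.g. give up once >0.1% of compared rows differ
rows_checked = 0
rows_differing = 0

for segment in stream_diff_segments(src_table, dst_table):
    rows_checked += segment.rows_compared
    rows_differing += segment.rows_differing
    if rows_checked and rows_differing / rows_checked > MAX_DIFF_RATIO:
        print(f"~{rows_differing / rows_checked:.2%} of rows differ; "
              "stopping early and moving on to remediation")
        break
```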
[00:40:07] Unknown:
The data type piece in particular is interesting and one that I've had to deal with in some of my own work. The floats and decimals in particular caught my attention, because by default, a lot of the systems that will just shuttle your data between point A and point B, if they see that it's a numeric type and it's not an integer, then they just say, oh, it's just a float then, which can be quite problematic if you actually need the precision of a decimal, particularly if you're dealing with monetary data. And I'm wondering about some of the ways that you've had to deal with some of those edge cases of mismatches in support for complex or custom types between database engines, as well as some of the implicit type conversions that might happen in the process of moving from point A to point B? I think that probably deserves,
[00:40:55] Unknown:
like, a whole section of documentation on how this works. But in general, I think that we thought about this problem hard, and we came to the conclusion that we should do as little as possible implicitly, because we cannot assume for the users how they want the data compared. And so, basically, the kind of decision framework that we settled on was that in the case that we can logically conclude that things can be compared exactly, in a certain way, even though those data types are different, we would do that. However, when we think that users may interpret the results differently, or they actually would want potentially different things depending on the use case and what the semantic meaning of the data is, we try to surface the decision to the user in terms of, for example, how we handle precision and how those, you know, types are mapped to each other.
So that's a kind of continuous work in progress, but at least we were able to settle on this set of rules and build with them in mind, which simplifies at least some of the problem space.
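A small illustration of the float-versus-decimal issue discussed here; the values and the two-decimal-place tolerance are made up, not Datafold defaults.

```python
# Why floats vs. decimals need an explicit precision decision when
# reconciling. Values and the 2-digit tolerance are illustrative only.
from decimal import Decimal

source_value = Decimal("19.99")   # DECIMAL(10,2) in the source system
replicated_value = 19.99          # landed as a FLOAT on the other side

print(source_value == replicated_value)   # False: the binary float is not exactly 19.99
print(Decimal(replicated_value))          # 19.989999999999998863...

# One common approach: agree on a precision up front and compare (or hash)
# rounded values instead of raw representations.
AGREED_PLACES = 2
print(round(Decimal(replicated_value), AGREED_PLACES) == source_value)   # True
```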
[00:42:05] Unknown:
Another interesting aspect of this problem space, particularly when you're moving data between systems, is the question of cost management, because a lot of cloud providers will bill you for data coming in and data going out. And so, ideally, you would minimize the amount of information that has to be shuttled between those systems. And in the work that you're doing of adding a means of seeing I've got all the right data in the right places, how do you try to mitigate that cost of transfer and minimize the amount of information that you have to exfiltrate from the database to be able to do your job? Yeah. I would say this
[00:42:47] Unknown:
cost management, performance management, network, and IO bottlenecks are definitely a primary constraint that we optimize around. And so when we settled on building the system for cross-database data diffing, one of the constraints was that there is no way that we would be exporting, you know, an entire very large dataset out to compare. And that's why we settled on this harder, more algorithmic approach that allows us to minimize the data that gets sent over the network. And, you know, without going too much into the weeds, I would say that the way it works is that, by default, the data that is sent over the network is the metadata. So it's hashes of the rows and columns that are compared.
And then as we drill in into the discrepancies, we are sending certain chunks of the data of the table where we find those discrepancies. And, therefore, you know, only a small fraction of the data is ever sent, and the user actually has control over how much they're sending. So for example, they can say, I want to diff this table, but I wanna make sure that I'm not sending more than, you know, x amount of rows over the network. Or they can say, I want to diff those two tables, but I want you to stop, and I would basically get exactly what I want, if you find at least, you know, x percentage of difference in this column or in the row count or in primary keys, basically, depending on what the user wants. So, basically, to summarize, this is definitely the primary design constraint: to absolutely minimize the amount of data that's sent over the network, and only a small fraction is usually sent. And we give the user control of how much is sent so they can control those costs.
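As a concrete illustration of "only metadata travels over the network", here is a hedged sketch of the kind of aggregate checksum query that can be pushed down to one side (MySQL-flavored functions; the table and columns are hypothetical, and this is not the SQL Datafold actually generates). For a real cross-database diff, both engines need hash expressions normalized so that identical values produce identical digests, which is exactly where the type and collation issues discussed earlier come into play.

```python
# Sketch: an aggregate checksum pushed down to the database so that only one
# small row of metadata per key segment crosses the network. MySQL-flavored
# functions (CRC32, CONCAT_WS); names are hypothetical and this is not the
# SQL Datafold generates.

def segment_checksum_sql(lo: int, hi: int) -> str:
    return f"""
        SELECT COUNT(*)                                           AS row_count,
               SUM(CRC32(CONCAT_WS('|', id, amount, updated_at))) AS checksum
        FROM orders
        WHERE id >= {lo} AND id < {hi}
    """

# Run an equivalent, normalized query on the other database: if row_count and
# checksum agree, the whole segment is assumed identical and none of its rows
# are ever transferred; if not, the segment is split and drilled into.
```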
[00:44:46] Unknown:
And in your work of trying to empower teams to validate the quality and correctness of their data flows, ensure that all of their database systems are healthy, and that their business logic is doing what it's supposed to do, what are some of the most interesting or innovative or unexpected ways that you have seen the Datafold product and the data-diff utility used in this context of data reconciliation?
[00:45:14] Unknown:
Yeah. I would say that we talked about the three major use cases. Right? So change management of your data processing code, where you compare staging and production. We've talked about migrations. We've talked about replication. The use case that we haven't talked about, and that I also didn't anticipate, because I personally haven't dealt with it in my career, is using reconciliation for compliance reasons and for auditing data. So a fair number of our customers are operating in the financial services space, and they get audited by, you know, professional auditors on a regular basis. And I think the interesting trend that we've seen is that previously, all the financial statements, everything that is actually important to the business, would just live in the ERP systems and only be accessed from those systems. Whereas today, when companies are reporting to their users, when they're reporting to their customers, when they're reporting to, you know, the government and authorities, they increasingly, you know, rely on metrics that are computed from their analytical sources, or they would blend their kind of ERP transactional data with those analytical sources. And so that actually imposes an entirely different degree of accuracy and control that you have to implement in your data workflows than existed previously. And so we've seen data-diff being used by data practitioners to basically demonstrate to auditors that when they make changes to, you know, key metrics and key pipelines, when they move data around, they don't lose data integrity. And that's really important. That saves them a lot of time, and, you know, it also saves companies on the auditors' bills because people don't have to work on this manually.
And I think it will also lead to increased usage of data in those, you know, high compliance environments, which I'm personally very excited about.
[00:47:18] Unknown:
Absolutely. Yeah. The the compliance question is particularly interesting to ensure that we have all of the data that we're supposed to have and none of the stuff that we're not.
[00:47:27] Unknown:
Exactly.
[00:47:28] Unknown:
And in your work of building the Datafold product and working with practitioners to address some of these challenging quality questions, what are some of the most interesting or unexpected or challenging lessons that you've learned?
[00:47:42] Unknown:
So I would say that, since we're ultimately opening up a whole new type of tooling and workflow for data practitioners, and I mean data diffing, that's not something that, as a tool, has been widely available to data practitioners before, I think, we released our open source toolkit and the cloud product. We have both the joy and kind of the responsibility to come up with all the terms and make a lot of different design decisions about how this type of work and workflow is done. And we've definitely learned a lot from our users.
And I think some of the interesting realizations, that we kind of talked about already, for example, in terms of the user experience, were that, you know, performance matters to an extent. User experience in terms of how the data is presented, what is presented first, how you think about the coverage of your identified discrepancies and differences versus maybe the depth, like how much data was actually, you know, compared. All those trade offs definitely were not obvious to us when we embarked on this, and we had to, you know, build something that on the one hand is opinionated, but on the other hand supports the workflows of a wide range of data practitioners operating in completely different industries, you know, financial services, health care, fintech. And all of them have some data in common and definitely have common workflows, but a lot of the times, the nature of why they're doing this and what exactly they care about is different. So I would say that, in general, product design and UX have been probably the primary challenges.
Not to discount, obviously, all the hard problems we had to solve on the pure computation and algorithmic side of things just to make sure that we compare data robustly and it runs as fast as possible.
[00:49:39] Unknown:
And for people who are trying to incorporate this reconciliation loop for their data in their database environments, what are the cases where Datafold or data-diff are the wrong choice?
[00:49:53] Unknown:
So we don't support NoSQL yet. This is something that we've heard a lot from our, you know, prospects and customers that they want us to add support for. We have pretty good ideas how to do it, but as of today, you know, March 14, 2024, this is not something that's possible, although we do support a wide range of SQL engines. And I would say, if you're looking for a tool that actually moves data, we're definitely not the choice. We are there to provide visibility and to test. And the same goes for remediation. We got some questions from certain prospects, you know, hey, since you know so much about our data and the discrepancies, why can't you just go in and fix these issues for us? But we thought about this, and we decided that that's just not something that we want to do, again, just because those pipelines and application systems are built in so many different ways. We want to be providing information and visibility and then let the users act on it themselves.
So you can basically get results from Datafold using the UI,
[00:51:05] Unknown:
API, or we can even materialize them back in your database as a table, but we don't want to actually change your production data. That's something that's actually in scope for the team using it to do. As you continue to build and invest in this space and expand the Datafold product, what are some of the things you have planned for the near to medium term, or any particular projects or problem areas you're excited to dig into?
[00:51:31] Unknown:
Yeah. That's a great question. NoSQL support is one, as well as just broadening the systems that we work with for data reconciliation. We're continuing to improve performance, and we do have a lot of interesting things and hacks that we discovered that we'll just be implementing in production to make it run faster. The continuous replication monitoring is a more recent thing that we shipped, where instead of, like, asking Datafold every time via API or UI to compare datasets, you can just schedule those checks and then have Datafold continuously perform them. That's an area that we will continue to evolve. And there are a lot of other things, but they're not necessarily related to reconciliation, so maybe we can talk about them in the next episode.
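To make the scheduled-check idea concrete, here is a rough sketch of what hand-rolled continuous reconciliation could look like if you built it yourself rather than using Datafold's hosted monitoring. The connection strings, table and column names, the psycopg2 driver, the fingerprint query, and the hourly cadence are all illustrative assumptions, not anything from the product.

```python
# Rough sketch of a hand-rolled scheduled reconciliation check, NOT Datafold's
# hosted monitoring feature. Connection strings, table and column names, the
# psycopg2 driver, the fingerprint query, and the hourly cadence are all
# illustrative assumptions.
import logging
import time

import psycopg2  # assumed driver; swap in the DB-API driver for your engines

logging.basicConfig(level=logging.INFO)


def table_fingerprint(conn, table, key_column):
    """Return (row_count, checksum) using a blunt aggregate over the key column.

    Identifiers are interpolated directly because they are static and trusted
    here; a production tool would hash key ranges instead of the whole table.
    """
    with conn.cursor() as cur:
        cur.execute(
            f"SELECT count(*), coalesce(md5(string_agg({key_column}::text, ',' "
            f"ORDER BY {key_column})), '') FROM {table}"
        )
        return cur.fetchone()


def check_once(source_dsn, replica_dsn, table, key_column):
    src = psycopg2.connect(source_dsn)
    dst = psycopg2.connect(replica_dsn)
    try:
        src_fp = table_fingerprint(src, table, key_column)
        dst_fp = table_fingerprint(dst, table, key_column)
    finally:
        src.close()
        dst.close()
    if src_fp != dst_fp:
        logging.warning("Drift on %s: source=%s replica=%s", table, src_fp, dst_fp)
    else:
        logging.info("%s in sync (%s rows)", table, src_fp[0])


if __name__ == "__main__":
    while True:  # naive loop; a real setup would use cron or an orchestrator
        check_once(
            "postgresql://user:pass@source-host/shop",
            "postgresql://user:pass@replica-host/shop",
            "orders",
            "id",
        )
        time.sleep(3600)
```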
[00:52:18] Unknown:
Absolutely. Are there any other aspects of this workflow of reconciliation in your database engines, and the ways that you are giving visibility to that process with Datafold and data-diff, that we didn't discuss yet that you'd like to cover before we close out the show? I would say,
[00:52:35] Unknown:
I get a lot of questions: you guys built the open source data-diff tool that's been pretty popular, and you have the cloud product. What is the difference? How do I know which one I should use? And the answer is actually quite simple. So the open source tool allows you to compare data within databases or across databases. It implements most of the basic adapters. It is a self-contained Python package that is really easy to install, and it's really built and optimized for an individual practitioner. So let's say I'm a data engineer and I just have to compare something that I just copied around, or I want to, you know, maybe test my dbt code when I'm making changes. That is the perfect tool. If I'm a team, if I'm an organization, and I have to basically do these diffs at large scale, in terms of both the volume of data but also a lot of them, and I have to, you know, share results within my team, that's where the cloud comes in. So the cloud actually operates a different engine that basically cannot run as a single, you know, Python application.
It's far more performant and more sophisticated in terms of how it handles large scale and kind of cross-database data mapping and how it operates, basically what kind of primary keys it supports in a table. So it's basically built more for the team and enterprise user. And the analogy that I like to give, it's not fully applicable, but I would say it's illustrative, is there is Git and there is GitHub and GitLab. Right? So you can imagine data-diff is Git in the sense that it gives you the fundamental ability of comparing your datasets, and it's open source and it's free. But if you are a team trying to collaborate on a software project, then you use GitHub or GitLab, where you actually have the full workflow automated and you have the application UI and, you know, full service. So that's kind of the difference that I think is important to highlight. But individual practitioners can for sure start with the open source.
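For readers who want to try the open source path described above, a minimal sketch of a cross-database comparison with the data_diff Python package might look like the following. The connection URIs, table names, and key columns are made up, and the connect_to_table/diff_tables helpers reflect the package's documented Python API at the time of writing, so exact names and signatures may differ by version.

```python
# Minimal sketch of an individual practitioner's workflow with the open
# source data_diff package. URIs, table names, and key columns below are
# illustrative; the helper names follow the package's documented Python API
# and may differ between versions.
from data_diff import connect_to_table, diff_tables

# Source table in Postgres and its copy in Snowflake, keyed on "id"
orders_src = connect_to_table(
    "postgresql://user:password@source-host:5432/shop", "orders", "id"
)
orders_dst = connect_to_table(
    "snowflake://user:password@account/shop/PUBLIC?warehouse=WH", "ORDERS", "ID"
)

# diff_tables yields (sign, row) tuples: "-" for rows present only in the
# first table, "+" for rows present only in the second.
for sign, row in diff_tables(orders_src, orders_dst):
    print(sign, row)
```

The package also ships a CLI that wraps the same comparison, which is often the quickest way to sanity-check a one-off copy before wiring anything into CI.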
[00:54:40] Unknown:
Absolutely. Well, for anybody who wants to get in touch with you and follow along with the work that you and your team are up to, I'll have you add your preferred contact information to the show notes. And for the usual final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. That's interesting.
[00:54:58] Unknown:
I think that there are a few things that I'm excited about. One, there's been a lot of talk about how, you know, the hype cycle shifted from the modern data stack or data engineering to AI and LLM applications and what it means for the practitioners, what it means for companies building in the space. I'm very excited about this. You know, we're not an AI company. We're not building models. But I think that for any business to actually be able to effectively apply those LLM applications in their domain, for their business, for their users, the only way they can do this is by, you know, feeding in their business context and data. Right? If I'm a support agent that is backed by, let's say, GPT, I need to know all the previous orders and all the history I have with this particular user and all the previous interactions. So that data better be correct. And that's where, you know, old school data engineering comes in, where now delivering this data faster, more timely, more reliably, and just more of it, more context, becomes ever more important. So I'm very excited about this as an entirely new field for how data pipelines actually get consumed.
That's one thing I'm excited about. But to answer your question about tooling, I'm excited about all the innovations that are happening on the IDE and data engineering tooling side with the enablement of, you know, LLMs. I think that, as it usually happens with data relative to the software space, we're still behind the level of support that software engineers get. Right? You know, Copilot works really well for software engineers, but there's still quite limited support that we get as data engineers. And I think it's for the same reason: in order to be able to help data practitioners, you have to feed in the context, and the context is so specific to the company. So I'm very excited for this. We have yet to see, you know, the really interesting tools, but I follow this field pretty closely.
[00:57:01] Unknown:
Absolutely. Well, thank you very much for taking the time today to join me and talk us through some of the challenges and problems and resolutions for reconciliation of your data in your database systems. I appreciate all the time and energy that you and your team are putting into bringing visibility to that and helping teams accelerate the process. So thank you again for taking the time today, and I hope you enjoy the rest of your day. Thank
[00:57:31] Unknown:
you, Tobias.
[00:57:33] Unknown:
Thank you for listening. Don't forget to check out our other shows: Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a product from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Overview of Data Lakes
Guest Introduction: Gleb Mezhanskiy
Evolution of Data Management
Understanding Data Reconciliation
Challenges in Data Replication
Warehouse Migration Issues
Simplifying Reconciliation with Datafold
Compliance and Auditing Use Cases
Future Plans and Innovations