Summary
A significant portion of data workflows involves storing and processing information in database engines. Validating that the information is stored and processed correctly can be complex and time-consuming, especially when the source and destination speak different dialects of SQL. In this episode Gleb Mezhanskiy, founder and CEO of Datafold, discusses the different error conditions and solutions that you need to know about to ensure the accuracy of your data.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster today to get started. Your first 30 days are free!
- Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
- Join us at the top event for the global data community, Data Council Austin. From March 26-28th 2024, we'll play host to hundreds of attendees, 100 top speakers and dozens of startups that are advancing data science, engineering and AI. Data Council attendees are amazing founders, data scientists, lead engineers, CTOs, heads of data, investors and community organizers who are all working together to build the future of data and sharing their insights and learnings through deeply technical talks. As a listener to the Data Engineering Podcast you can get a special discount off regular priced and late bird tickets by using the promo code dataengpod20. Don't miss out on our only event this year! Visit dataengineeringpodcast.com/data-council and use code dataengpod20 to register today!
- Your host is Tobias Macey and today I'm welcoming back Gleb Mezhanskiy to talk about how to reconcile data in database environments
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by outlining some of the situations where reconciling data between databases is needed?
- What are examples of the error conditions that you are likely to run into when duplicating information between database engines?
- When these errors do occur, what are some of the problems that they can cause?
- When teams are replicating data between database engines, what are some of the common patterns for managing those flows?
- How does that change between continual and one-time replication?
- What are some of the steps involved in verifying the integrity of data replication between database engines?
- If the source or destination isn't a traditional database engine (e.g. data lakehouse) how does that change the work involved in verifying the success of the replication?
- What are the challenges of validating and reconciling data?
- Sheer scale and cost of pulling data out, have to do in-place
- Performance. Pushing databases to the limit, especially hard for OLTP and legacy
- Cross-database compatibility
- Data types
- What are the most interesting, innovative, or unexpected ways that you have seen Datafold/data-diff used in the context of cross-database validation?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Datafold?
- When is Datafold/data-diff the wrong choice?
- What do you have planned for the future of Datafold?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
Links
- Datafold
- data-diff
- Hive
- Presto
- Spark
- SAP HANA
- Change Data Capture
- Nessie
- LakeFS
- Iceberg Tables
- SQLGlot
- Trino
- GitHub Copilot
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- Starburst: ![Starburst Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/UpvN7wDT.png) This episode is brought to you by Starburst - a data lake analytics platform for data engineers who are battling to build and scale high quality data pipelines on the data lake. Powered by Trino, Starburst runs petabyte-scale SQL analytics fast at a fraction of the cost of traditional methods, helping you meet all your data needs ranging from AI/ML workloads to data applications to complete analytics. Trusted by the teams at Comcast and Doordash, Starburst delivers the adaptability and flexibility a lakehouse ecosystem promises, while providing a single point of access for your data and all your data governance allowing you to discover, transform, govern, and secure all in one place. Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Try Starburst Galaxy today, the easiest and fastest way to get started using Trino, and get $500 of credits free. [dataengineeringpodcast.com/starburst](https://www.dataengineeringpodcast.com/starburst)
- Dagster: ![Dagster Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/jz4xfquZ.png) Data teams are tasked with helping organizations deliver on the promise of data, and with ML and AI maturing rapidly, expectations have never been this high. However, data engineers are challenged by both technical complexity and organizational complexity, with heterogeneous technologies to adopt, multiple data disciplines converging, legacy systems to support, and costs to manage. Dagster is an open-source orchestration solution that helps data teams rein in this complexity and build data platforms that provide unparalleled observability and testability, all while fostering collaboration across the enterprise. With enterprise-grade hosting on Dagster Cloud, you gain even more capabilities, adding cost management, security, and CI support to further boost your teams' productivity. Go to [dagster.io](https://dagster.io/lp/dagster-cloud-trial?source=data-eng-podcast) today to get your first 30 days free!
- Data Council: ![Data Council Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/3WD2in1j.png) Join us at the top event for the global data community, Data Council Austin. From March 26-28th 2024, we'll play host to hundreds of attendees, 100 top speakers and dozens of startups that are advancing data science, engineering and AI. Data Council attendees are amazing founders, data scientists, lead engineers, CTOs, heads of data, investors and community organizers who are all working together to build the future of data and sharing their insights and learnings through deeply technical talks. As a listener to the Data Engineering Podcast you can get a special discount off regular priced and late bird tickets by using the promo code dataengpod20. Don't miss out on our only event this year! Visit [dataengineeringpodcast.com/data-council](https://www.dataengineeringpodcast.com/data-council) and use code **dataengpod20** to register today! Promo Code: dataengpod20
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte scale SQL analytics fast at a fraction of the cost of traditional methods so that you can meet all of your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and DoorDash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data.
Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. Dagster offers a new approach to building and running data platforms and data pipelines. It is an open source, cloud native orchestrator for the whole development lifecycle with integrated lineage and observability, a declarative programming model, and best in class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise class hosted solution that offers serverless and hybrid deployments, enhanced security, and on demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster today to get started, and your first 30 days are free.
Your host is Tobias Macey. And today, I'd like to welcome back Gleb Mezhanskiy to talk about how to reconcile data in database environments. So, Gleb, for folks who haven't heard you in the past, can you just give a brief introduction?
[00:01:50] Unknown:
Absolutely. Thanks for hosting me again, Tobias. I'm Gleb. I am CEO and cofounder of Datafold. We automate testing for data engineers, and I've been thinking about how long I've been doing data, and it's been almost 10 years. And over the course of those 10 years, I've been lucky to work on building and scaling data platforms at three very different companies, including Autodesk, Lyft, where I was one of the first data hires, and also a seed stage startup, before I started working on Datafold. And I can surely attest that a lot has changed in the world of data management over the last 10 years, but I think some of the problems that we may be talking about today are still the same, if not more acutely felt by data practitioners. So excited to chat.
[00:02:42] Unknown:
You couldn't have picked a more interesting 10 years to get started in data because it feels like the past decade of data has had more in it than maybe the past half century before it. Although, I mean, I'm sure everybody probably feels that way.
[00:02:55] Unknown:
Yeah. For sure. I think we've definitely seen a lot of great innovation happening across all fronts. But to a large extent, one of the, I think, biggest things that happened over the last 10 years is the evolution of the data warehouse in terms of scalability, in both storage and compute, and also just how basic it feels today to be able to just process and store enormous amounts of data and also integrate all possible data sources in one system. And I think that kind of gave the foundation for modern data platforms and other tools and also spurred a whole bunch of new problems that only existed at Big Tech maybe a decade ago.
[00:03:39] Unknown:
You mentioned that you got started about 10 years ago and that you came in via Autodesk as your first data role. I'm just wondering what it is about this space that has kept you engaged and interested over the past 10 years that motivated you to wanna just keep digging deeper into it.
[00:03:57] Unknown:
Yeah. I think, when I first started my career, I wasn't necessarily set on data. I was curious about it. But the more I got exposed to the data space and data engineering in particular, the more excited I was to see all the opportunities at businesses of different, you know, sizes and stages. So Autodesk being, you know, a really mature corporate technology business, and Lyft going through a hypergrowth period when I joined them. What was really interesting was to see all the possibilities for data to make an impact on the business, but at the same time, seeing enormous bottlenecks that existed to actually be able to use that data and activate it. And those bottlenecks were mostly not in the end application, so there wasn't necessarily a problem of, you know, how do we train a machine learning model or how do we build a dashboard and visualize it. All those problems were related to how do we actually get that data delivered in a clean and reliable and timely manner for those end applications.
And that's why I got so excited about data engineering, because I felt that while a lot of the end user applications or the final applications for data were evolving really fast, and a lot of innovation had been happening in machine learning at the time and even in database technology, the workflow of getting, you know, raw data collected, integrated into one place, cleaned, and processed, which is kind of the classic chunk of data engineering, was just consistently terrible throughout all of my experiences. And the problem wasn't that, you know, we couldn't buy software or couldn't afford it. It was just that there weren't really a lot of tools that could support the workflow to make it productive and to help engineers deal with the complexity and the quality issues. And I've just been fascinated by this problem, and I kind of switched from being a hands on practitioner building pipelines to first being an engineer working on tools and then eventually becoming a data PM at Lyft focusing on basically ETL tooling and anomaly detection and, you know, cataloging, the ability to basically improve the workflows of data engineers. So that's the problem I've been passionate about for years now. And what we're doing at Datafold is pretty much exactly solving that problem.
[00:06:28] Unknown:
Bringing us now to the discussion around reconciliation and how that is applied and the strategies for doing that in different database environments. So I'm wondering if we can just start by unpacking that term reconciliation and what that means in the context of data and database engines.
[00:06:49] Unknown:
Yeah. Absolutely. I think reconciliation is one of the dimensions of the broader data quality problem space. You know, just like in software quality, data quality is a very multidimensional problem where there are problems of timeliness and data completeness, and I won't recite all the terminology. But, essentially, what it means is we have to solve different problems in different workflows. Right? When engineers are writing code, when they are looking for data, when they are deploying their code, when they're maintaining data pipelines in production.
And reconciliation is just one of those workflows that I think hasn't been really served by tools in my career, despite being very important, and hasn't gotten that much spotlight, I think, in general in the last few years as the data quality tools have been evolving. So I think it just deserves to be unpacked and talked about, because I certainly felt that pain, and I've seen a lot of really high performing data teams facing a lot of these issues. So what is data reconciliation? Ultimately, it's all about understanding the differences in data across database environments or within database environments, and that can be applied in a number of workflows. So as an example, when we change the code as data practitioners, which is something we do all the time, we can introduce new metrics. We can change definitions for metrics. We can refactor our code, which these days is usually SQL that runs on OLAP databases, changing that code, for example, to be more performant and solve performance bottlenecks. We need to make sure that when we deploy this code, it doesn't introduce any regressions. And so reconciliation applies here in the form of: we wanna make sure that the data which is produced by the updated version of our SQL code is actually the same as with the previous one, at least in the areas where we expect it to be the same. And so that is an example where we would reconcile or diff the data which is in the production environment versus staging, which is built with the new version of the code. And that typically is done within the same database environment, maybe just different tables or different schemas. Another example of reconciliation, and maybe a more advanced or harder one, is when we have to reconcile data across completely different, physically separate database systems.
And that I've seen happening in two major workflows. One of them is what we call data migrations. I think a lot of data practitioners have seen those in their tenure. It's basically when you operate a certain data stack, and at some point, you as a team outgrow that data stack, either because of data volumes or because you need to reach for better capabilities, and you need to move all of your workloads to a new system. So for example, during my tenure at Lyft, we spent over 3 years migrating from Redshift to a data lake that was built on top of the Hive ecosystem and was served by three engines at the time: Hive, Presto, and Spark. And that is something I spent a lot of my time on. And one of the most painful experiences that I had during that process was that, basically, for any migration to be successful, you have to prove to yourself and then to the data users who are ultimately consuming that data that whatever you have in the new environment, once you've migrated your jobs and ported your SQL queries and data, is the same as what you had in the legacy environment. And if it's not the same, that's a big problem, because maybe those metrics were reported already and users just don't trust the data. And if they don't trust the data, we, as data engineers, fail, because no one wants to use our pipelines.
And we basically needed to reconcile a lot of our metrics, a lot of our core tables, between Redshift and the new data lake. The challenge is that you basically need to understand, in this case, the differences at both the high level and in detail. So, for example, let's say we ported the SQL code from Redshift to Hive or to Presto. We ran the queries. We did the backfill, and then the numbers just don't match at the high level. Let's say the counts don't match. What do you do next? You have to understand why exactly they're not matching, and so you start drilling in. And drilling in when you have two different environments, where you can't simply issue a SQL query that just, you know, does, like, a join of the tables and tells you about the rows that are not matching, is extremely hard. You basically start doing a lot of manual work, like trying to, you know, do samples and find examples of data not matching, or you start doing group bys. And that has been an extremely laborious process that really slowed us down. And if you multiply this by, you know, all the engineers working on the project, you get an enormous loss of productivity, and the entire project gets delayed. So that's something we struggled with a lot. And the third example of reconciliation that is also becoming ever more important, I think, as data platforms become more mature and they start to be really the centerpiece of all data being integrated, is validating data replication.
So in the migration example, we talked about, okay, we have a legacy environment, we have a new environment, and the goal is to move all the workloads to the new environment, shut down the legacy tool, get rid of it, and move on. With replication, the story is not over once you move the data. Right? Typically, replication is done from the business apps that sometimes are backed by databases like, you know, SAP HANA or Oracle, or OLTP databases that back production servers, so MySQL, Postgres, sometimes NoSQL databases like Mongo and Dynamo. And those databases, they operate at high scale and they're really optimized for high throughput of transactions, and they contain really valuable data that, in my practice, data teams really love to take advantage of. And historically, you know, obviously, data teams have been operating on a lot of event data that is explicitly instrumented in application code or in the web apps, but transactional data is really, really important because it acts as the source of truth. Right? If something is sitting in our microservices database, that is pretty much as close to the source of truth as possible. So being able to leverage that data in the warehouse alongside event sources has always been an increasing desire in my experience, in my career, and in what I've seen happening these days.
And so when you wanna replicate data across databases, that is a hard problem, because those OLTP (online transaction processing) systems provide high throughput at high scale. You can't just, you know, query them, export data, and then import it into your warehouse at scale. So a lot of the times, teams would embark on using what's called change data capture streams, where you stream out the updates that happen to the database and ingest them on the other end, and then you, through that, understand and replicate all the tables that you have in your OLTP databases into your OLAP system.
And that is a very hard problem to solve, but today there exist a lot of both proprietary software vendors and open source tools like Debezium and Kafka to do that. But that process is never perfect, and we can talk about kind of why and what the challenges are. And so being able to ensure that when this process is performed, you actually have consistent data on both sides, and that whatever you have in your production system that you deem the source of truth is actually replicated properly on the other end where you wanna use it, becomes a really high priority for the data team as far as their data quality journey goes. Because otherwise, if you're not sure about the consistency, you can't really say, I'm using source-of-truth data. And so solving that problem basically requires building a system that is able to understand the differences between datasets at scale across different databases. And that's a very hard problem, but we've been working on this for a few years now, and I'm excited to unpack it more.
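To make the change data capture discussion above more concrete, here is a simplified, hypothetical illustration of the kind of change event a log-based CDC tool such as Debezium emits (the envelope is abbreviated and all values are invented; see the Debezium documentation for the real format), and why dropping even one event matters.

```python
# Simplified, hypothetical Debezium-style change event; the real envelope
# carries more metadata and the values here are invented.
change_event = {
    "op": "d",                                  # c = create, u = update, d = delete, r = snapshot read
    "ts_ms": 1710421800123,                     # when the change was captured
    "before": {"id": 42, "status": "active"},   # row state before the change
    "after": None,                              # None for deletes
    "source": {"db": "app", "table": "users"},  # where the change originated
}

# If this single delete event is dropped or applied out of order, user 42
# lingers in the warehouse even though it no longer exists in the source,
# which is exactly the kind of drift that reconciliation is meant to catch.
```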
[00:15:35] Unknown:
Another interesting aspect of this replication use case is that it's presuming the ELT approach of I'm just going to do a bit for bit copy of the source data to the destination with no transformation in between. And I'm curious what are some of the stumbling blocks or additional challenges that are introduced when you do have some form of transformation in the middle before you get it into the destination, and then being able to validate that the data that you have in your target is actually all the same data that you have in your source, even accounting for the mutation that's happening? Yeah. It's a great question, Tobias. I would say if someone is listening to this and they are actually doing transformations in flight, they should probably stop listening and rewrite it to ELT right now. But seriously, I think that this is a pattern that was more prevalent maybe in the early days of data, and
[00:16:26] Unknown:
we've seen less and less of it happening these days, because I think teams realize that even moving the raw data around is hard enough. Applying transformations to it in flight at large scale is just error prone and also makes the whole development inflexible. And, therefore, the industry massively adopted the ELT pattern where you just dump whatever you have, and then you build downstream pipelines. And even if someone is wanting to do streaming transformations in flight for, you know, legitimate use cases, like, for example, online machine learning model inference and fraud detection where this is really warranted, we still see that there is, like, kind of an outlet for the raw stream that is always saved in a big lake or in the warehouse so that it can always be queried. But, yeah, if the transformations are done, that only makes the reconciliation process harder.
[00:17:18] Unknown:
Another interesting aspect of reconciliation as you were talking about earlier is the case of I have all of my data in the warehouse. I'm doing some transformations, but now I need to modify that transformation a little bit to address some new business use case or address some data quality issue. And that brings to mind situations like the Snowflake copy on write tables where I'm going to write my new DBT code. I'm going to deploy it to a branch, and then I'm going to create a copy of that source table, run the new transformations. And now I need to make sure that all of the same rows are being processed, but that my additional logic is also being accounted for. And I'm curious what additional stumbling blocks you run into for those types of reconciliation use cases.
[00:18:04] Unknown:
You mean, basically, when someone is working on an environment that is, like, forked from production, and production can move on from that baseline? Correct. Yep. Yes. That is an interesting problem, which is what we call data drift. Ultimately, to be able to reconcile data for the testing-code-changes use case, especially in, you know, CI/CD processes, you ultimately need to be able to compare apples to apples. And luckily today, the more modern data transformation frameworks such as dbt, they facilitate the creation of proper staging environments.
And dbt, in particular, has this feature set which is called state and defer. And the idea is that a lot of the times when we wanna do some development or we wanna build a staging environment, we can basically reuse the production data that's already computed for everything that we know hasn't been changed, and only do the computation on the changed code and changed datasets. That minimizes the data drift, because the same data is reused for both the production system and staging. I think where things are getting harder is when your production system is changing very frequently, when you have, you know, either live tables or your jobs are running just very, very frequently. And so in this case, the data drift is likely to happen because production is just updated so often. In this case, there are a number of techniques that teams can use, including watermarking.
So applying filters to only look at data up to a particular timestamp, which really works for append-only tables, as well as using time travel, where you can query a dataset at a particular point in time. So you can actually compare your staging environment to the production environment as of a particular date. So, you know, you can kind of freeze your data in time. So, luckily, there are more advanced techniques that more data warehouses support that make comparing data, reconciling it, in an efficient manner possible.
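As a rough illustration of the watermarking and time travel techniques mentioned above, here is a minimal sketch that compares a staging build of a model against production while controlling for data drift. The table names, the cutoff values, and the Snowflake-style AT (TIMESTAMP => ...) clause are assumptions; adapt the syntax to your warehouse.

```python
# Minimal sketch: diff a staging model against production while controlling
# for data drift. Table names and cutoffs are hypothetical, and
# AT (TIMESTAMP => ...) is Snowflake-style time travel syntax.

FROZEN_AT = "2024-03-01 00:00:00"   # pin production as of this point in time
WATERMARK = "2024-02-29"            # only compare rows up to this date

rows_missing_from_staging = f"""
SELECT * FROM analytics.orders
  AT (TIMESTAMP => '{FROZEN_AT}'::timestamp)       -- time travel on production
WHERE order_date <= '{WATERMARK}'                  -- watermark filter
EXCEPT
SELECT * FROM dev_staging.orders
WHERE order_date <= '{WATERMARK}'
"""

# Running the same query with the two SELECTs swapped surfaces rows that
# exist only in staging; empty results in both directions mean the new code
# reproduces production for the frozen window.
```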
[00:20:15] Unknown:
Which brings to mind also some of the versioned data lake and data warehouse systems such as LakeFS or Nessie, where you wanna be able to say, I only want to look at the data as of this commit or as of this tag in the data system, so that I can reconcile across them and ignore the fact that there is new data that's been appended, which is some of the benefit that you get through, like, the Iceberg table spec or the LakeFS sort of versioned S3 data store.
[00:20:46] Unknown:
Yeah. I'm really curious to see how version control for data proliferates in, like, mainstream data pipelining. I think those are some really powerful ideas that, you know, for large-scale data lakes, definitely make the workflow smoother. But so far, at least in my experience, this still lives in, like, the advanced category, whereas, like, the mainstream workflow operates on a more simple construct. But that's definitely an area I'm watching closely.
[00:21:16] Unknown:
And digging now into the error conditions that you run into. So maybe taking it a couple of categories at a time. So I'm replicating my data from my source transactional database into my data warehouse, and everything has been running smoothly. But now I'm running my reconciliation report, and, oh no, I'm getting an error. I'm curious what are some of the types of sources of those errors and some of the downstream impact that those are likely to have as you continue to propagate changes downstream?
[00:21:50] Unknown:
Yeah. I think that the challenges that exist in cross-database data replication, they stem from the fact that we're dealing with different systems operating at high scale, and so you inevitably have to make trade offs. And one of the trade offs is that you're not really trying to continuously, like, refresh the data. You're not, you know, dumping the entire table and then serving it, which would certainly be more robust and simpler. But a lot of the times, your scale or your, you know, load constraints on your production system just cannot allow that, and so that's why the change data capture kind of technique is utilized. And in terms of the types of issues that can occur, there's really a lot. For example, schema changes: when schemas are updated on the source side, updating them on the target side could be quite tricky. And the other issues are with just the fact that we're talking about streaming datasets. When you're dealing with streaming, it's all about trade offs. Right? And you start running into issues such as event ordering and event duplication, and you kinda have to make trade offs about whether you want the data to be delivered for sure or you want the data to be delivered exactly once, but not necessarily both things at the same time. And because the stream actually contains, basically, the transaction log for the database, if you're missing even one event, sometimes that actually can have pretty big consequences on how the data looks on the other end. The other challenges that are more specific to what systems are used are also how the transactions are handled, like where exactly the transaction boundaries are on the OLTP side, and how the deletes are handled, because the data could be deleted but not necessarily accordingly represented on the other end. And I think that the biggest problem that teams are facing is that the systems that are designed to transport data, they actually don't really have robust mechanisms for kind of QA'ing their own work. So their work is kind of to deliver the data, but then they leave the QA to the user. And there hasn't been much tooling that would actually allow engineers to say, like, oh yeah, we have a, you know, problem here, or the data is not fully consistent. So that's kind of the problem space that we're seeing. In terms of how to solve this problem, how to actually reconcile data and validate it, that also has a number of challenges. And there are probably three main groups of issues. One, just the sheer scale of data. That means that, for the same reason you do CDC versus just exporting and, you know, importing data in bulk, you can't compare data by pulling it out from the databases into some third system and running computations, like, in memory. Right? Because, like, if we were to try to build something naively, we could use something like, you know, Apache Spark, which has amazing interoperability, that could query all the other systems, import everything into one data frame, and then compare it and spit out the results. It turns out this is, a lot of the times, not an option for teams because of, one, the load that can be imposed on both databases, and two, the fact that it means you have a lot of data traveling over the network, which at scale gets expensive and becomes slow.
So what that means is that when we are doing reconciliation and we are diffing data, we have to do this without pulling the data out, or with pulling only a minimal, you know, amount of data relative to the entire size of the table out. How does this work? That's kind of where the magic of, you know, data-diff and Datafold comes in. Basically, we have an algorithm that relies on hashing data on both sides and then doing an algorithmic search on those hashes to basically understand where the discrepancies are, and then drilling in when we find discrepancies, and kind of being able to go from a 100 billion row table to the particular row and particular value that are different. So that itself is a very hard problem just from the performance and algorithmic perspective. The other challenge is, again, similar to why CDC is hard: you're dealing with data systems that have been built completely differently by different teams with, like, no real expectation that they would have to have a handshake or work together. Right? So the only thing they have in common is probably SQL.
That means that you can have different data types, different ways that those data types are handled, represented, even ordered. So collations, which is the mechanism by which the database actually decides how to order values in a given column, are also different and differently implemented across systems. And some systems have more options in terms of how you collate the data and some have fewer. But if you can't order data in the same way, that also makes comparing the data really hard, especially if you try to do this without exporting the entire table. So that's just to give you a glimpse of some of the challenges we're seeing with cross-database compatibility and reconciliation.
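Here is a rough sketch of the hashing-and-search idea Gleb describes: aggregate checksums of key ranges are compared first, and only ranges whose checksums disagree are drilled into, so full rows rarely leave either database. checksum_range and fetch_rows are hypothetical stand-ins for queries pushed down to each database, and the real data-diff implementation differs in many details.

```python
# Rough sketch of checksum-based diffing: compare aggregate hashes of key
# ranges and only recurse into ranges whose hashes disagree, so full rows
# rarely travel over the network. checksum_range() and fetch_rows() are
# hypothetical stand-ins for queries pushed down to each database.

def diff_key_range(src, dst, lo, hi, threshold=1000):
    if checksum_range(src, lo, hi) == checksum_range(dst, lo, hi):
        return []                                  # segment matches; skip it entirely
    if hi - lo <= threshold:
        # Segment is small enough: pull the actual rows and compare them.
        src_rows = set(fetch_rows(src, lo, hi))
        dst_rows = set(fetch_rows(dst, lo, hi))
        return sorted(src_rows ^ dst_rows)         # rows present on only one side
    mid = (lo + hi) // 2                           # otherwise bisect and recurse
    return (diff_key_range(src, dst, lo, mid, threshold)
            + diff_key_range(src, dst, mid, hi, threshold))
```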
[00:27:08] Unknown:
Are you sick and tired of salesy data conferences? You know, the ones run by large tech companies and cloud vendors? Well, so am I. And that's why I started Data Council, the best vendor neutral, no BS data conference around. I'm Pete Soderling, and I'd like to personally invite you to Austin this March 26th to 28th, where I'll play host to hundreds of attendees, 100 plus top speakers, and dozens of hot startups on the cutting edge of data science, engineering, and AI. The community that attends Data Council are some of the smartest founders, data scientists, lead engineers, CTOs, heads of data, investors, and community organizers who are all working together to build the future of data and AI.
And as a listener to the data engineering podcast, you can join us. Get a special discount off tickets by using the promo code depod20. That's depod20. I guarantee that you'll be inspired by the folks at the event, and I can't wait to see you there.
[00:28:08] Unknown:
And then in the situation of a warehouse migration project, what are some of the different manifestations of error conditions that are likely to occur and some of the types of resolutions that teams need to be planning for? Yeah. I think in terms of the migration,
[00:28:25] Unknown:
if we think about the entire project, the idea is pretty simple. We take all the jobs we have on the legacy side. We take the code, which in the majority of cases is SQL, let's say, running on some legacy system. We convert it to, you know, a new dialect, and these days, it's actually pretty easy to do with open source tools like SQLGlot, and that part is easy. However, the devil is in the details, because even if you do the one-to-one automatic conversion of the SQL, things are not really matching most of the time, because you still have to rewrite certain parts of the queries, certain constructs are not supported, because the data types are no longer the same, because you inevitably need to refactor some things. Otherwise, they just, like, won't run on your system, and you end up with certain discrepancies. And those don't necessarily have to be large to actually pose a risk to the project.
You know, even if you're missing, you know, 0.01% of your transactions, or your revenue calculations are, you know, less than 1% or 0.1% off, this is still, a lot of the times, a nonstarter for the business to be able to accept the migration. Right? And then the other challenge is the sheer fact that comparing data is so hard: even if you're likely to have data fully matching, unless you can efficiently understand that and prove that, it still becomes a really, really big uphill challenge.
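For the dialect translation step mentioned above, here is a minimal sketch of what transpiling a single query with SQLGlot looks like; the query and dialect names are illustrative, and the exact output depends on the SQLGlot version.

```python
# Minimal sketch: transpile a Redshift query into Trino SQL with SQLGlot.
# The query is illustrative; output formatting depends on the SQLGlot version.
import sqlglot

redshift_sql = "SELECT user_id, DATEADD(day, -7, GETDATE()) AS cutoff FROM events"

# transpile() returns one translated string per input statement.
print(sqlglot.transpile(redshift_sql, read="redshift", write="trino")[0])
```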
[00:29:51] Unknown:
Getting a little bit off topic here, but just as a point of curiosity, I'm wondering how you have seen some of the data virtualization engines used as a means of maybe simplifying that migration path of being able to say, well, I've got my source system in Redshift. I'm trying to go to Snowflake, but I can federate across them using, like, a Trino or a Presto, and if that's something that you ever see people address as a means of reducing the user experience drift during that migration phase.
[00:30:19] Unknown:
Yeah. I think we sometimes see the virtualization happening, especially when the target, the destination for the migration, is a data lake with some sort of an open source engine, like you mentioned, you know, Presto or Trino, with very high interoperability across the entire spectrum of systems. And I think what that allows you to do is also reduce the pressure on the business while the migration is underway. I think, in general, speaking about migrations, one of the best practices that I recommend when someone is undergoing a migration is to make sure that, even if you haven't ported all the jobs that produce the data, you can at least serve all the data with the new system as early as possible in the project, because that just massively reduces the user dependency on the old system, gives you more leeway, and reduces the pressure and load and kind of the overall pain that we experience. And so to that end, if you're operating with a kind of data virtualization layer, that just makes it simpler. If you're not, then you just, you know, regularly copy data from legacy into the new system, and you serve it from there.
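To illustrate the virtualization idea, here is a hedged sketch of a federated query through Trino that checks a legacy table against its migrated copy in the lake. The catalog, schema, and table names are hypothetical, and it assumes both catalogs are already configured in Trino.

```python
# Sketch: federated comparison through Trino, joining a table still in the
# legacy warehouse (a hypothetical "redshift" catalog) with its migrated
# copy in the lake (a hypothetical "iceberg" catalog). Names are illustrative.
import trino  # pip install trino

conn = trino.dbapi.connect(host="trino.internal", port=8080, user="analytics")
cur = conn.cursor()
cur.execute("""
    SELECT legacy.order_id, legacy.amount AS legacy_amount, lake.amount AS lake_amount
    FROM redshift.public.orders AS legacy
    JOIN iceberg.analytics.orders AS lake ON legacy.order_id = lake.order_id
    WHERE legacy.amount <> lake.amount
    LIMIT 100
""")
for row in cur.fetchall():
    print(row)  # rows where the two systems disagree on the amount
```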
[00:31:28] Unknown:
Beyond just the bare facts of I have this piece of information in my source system, I have the same piece of information in my destination system, I'm wondering what are some of the ways that the vagaries of different database engines and the ways that they manage the storage layer, in particular things like partitioning or updates, how that adds complication to the work of bringing data between different systems, and being able to ensure that your business logic is able to be implemented effectively across them. Or is it largely just a case of everything is SQL and I just let the database engine worry about those vagaries so I don't have to? Yeah. Well, I think there are two questions. Like, one,
[00:32:15] Unknown:
how it complicates the actual, you know, moving-data part, the migration or replication, and the other, how it complicates the reconciliation process. So in terms of the actual moving of data and, like, migrating the jobs, the good thing is that modern systems usually are far more abstracted than the legacy systems in terms of what the user is expected to worry about, to care about, and the failure modes. And, you know, to give an example, in Hive, I believe there are over a hundred different parameters that you actually put in your SQL query that dictate, you know, how much memory you allocate for your map tasks and, you know, your shuffle stage. And so that just means that there are so many different failure modes and bottlenecks that the user needs to worry about, and mastering such a system requires really deep expertise in a particular technology. Whereas if you talk about something like Snowflake, it basically just works. And although, obviously, there could be better performance or worse performance, the baseline is just so much easier.
And so, typically, when users are upgrading their data platforms, they're going from something which is slow, complex, hard to run, and requires deep expertise, to something which is much simpler and more robust. So, usually, it is easier to make it run on the new system as opposed to, like, the other way around. But, certainly, to your point about partitions, that's another example where certain legacy systems, for example, could not work unless the data was partitioned. Right? In the new systems, that is not necessarily true. They can just operate on an entire table and apply kind of clever, you know, clustering techniques and other things to make things run well. So, again, we're talking about, like, reduced complexity.
When it comes to data reconciliation, things are a little bit more interesting, because when you reconcile data, when you diff data, in the case of, let's say, Datafold comparing something like SQL Server to Snowflake, we have to deal with the legacy system in its, you know, entirety and all its complications and challenges. And so while implementing a high performance diffing algorithm that would compare something like BigQuery to Snowflake is relatively straightforward, given, you know, a good high performing algorithm, if you're dealing with something like SQL Server or Oracle, that becomes much harder because, one, the data is stored, typically, in a row-oriented fashion.
And whereas in your OLAP systems, especially modern ones, column functions that operate on an entire column run really fast and are optimized, in the OLTP databases they are not necessarily built to operate in such a fashion, which is actually required for reconciliation. Also, they have been built, you know, for different types of operations and analytics performed on them. And that has required us to apply a lot of really deep database optimizations that are specific to each engine to make sure that when we do a reconciliation of, let's say, a, you know, 1 billion row dataset with over a hundred columns between a proprietary OLTP vendor database and, you know, something like Snowflake or Databricks, it is performant and utilizes the compute on the more bottlenecked side, which is usually the legacy database, as efficiently as possible.
So that's definitely a a harder challenge.
[00:35:46] Unknown:
Talking about all of the complexities of managing reconciliation, the different failure modes, the error conditions, all the systems that I need to be aware of, and the ways that they interoperate: where's the easy button? How do I make it easier so that I don't have to worry about all the different ways things can go wrong and how to fix them? Well, you just use Datafold.
[00:36:06] Unknown:
So, basically, just to be clear, you know, we're not in the business of actually moving the data. It's, I would say, a very different problem, because moving data is a large problem space that can be done with open source technologies, can be done with vendors. Sometimes you want to do real time event processing in the meantime. So we're not really in that business. Our job is to make sure that data practitioners have really efficient data quality workflows, and this is just an example of a data quality workflow that we're automating. And so what we are doing on our end to make sure that this process is as painless as possible for the practitioner is just, again, abstracting them away from all of these complexities.
So, you know, our team deeply researched the internals of all the database systems we're operating in. We mapped all the data types and how they map across different databases, and what can be compared as is, and what needs certain trade offs to be compared. You know, floats and decimals are an example of a really tough pair of data types to be able to compare. We've also applied a lot of performance optimization techniques, and I would say that the other interesting kind of innovation, or set of problems that we had to solve in our product, is how do you implement and think about the user experience? Like, okay, we have a system that pushes all these databases to the limit and provides you with this diff report. But how do we actually make it useful to a practitioner?
And what we discovered is that a lot of the times, for example, users don't necessarily need every single row, every single discrepancy accounted for. A lot of the times, what they really wanna understand is: what is the magnitude of those differences? Which columns are affected? Is the affected area spanning a particular date range? And based on that, they can actually make a lot of the decisions on how they deal with this discrepancy and how they remediate. And so some of the interesting things that we built recently are, for example, the streaming of results in real time and also sampling. So streaming results is interesting because in the classic sort of data engineering workflow, you are used to your query being atomic. Right? So you have a query. It may be doing a lot of computation. You send it to the database. You wait for x number of minutes. If it's something really big, it could, you know, run for an hour, and then you get it back. But in the case of cross-database diffing, sometimes it actually takes longer, and you would actually benefit from seeing partial results of such a query. And so we actually implemented an algorithm that allows the user to see intermediate results. And at some point, they may say, oh, you know, I see that there's at least, like, a 0.1 percent difference in the row count, and this column that I really care about being equal, I see already there are, like, more than 10 discrepancies. I'm just gonna stop this process and work on remediation. I don't really need the entire 1 billion row dataset to be examined. So that's just an example of, like, a really interesting UX hack that we discovered that simplifies the workflow. And then the sampling is also interesting because, for the same reason, users are, a lot of the times, interested in having a representative sample of those differences.
Because if you have, you know, a 100 billion row by 150 column dataset, that's a lot of data to go through, especially, like I said, on the OLTP side. But what you may actually be interested in is understanding which columns are equal and which columns have at least x percentage of discrepancies. And so you want, as you go and you diff those results, to actually have a representative sample of that data, which is quite hard to do because you don't know how those discrepancies are distributed. So you have to be clever about how you statistically capture that sample of discrepancies in your dataset. So those problems weren't necessarily apparent to us when we embarked on this, but the more we worked with our users at scale, those were some of the interesting problems we had to solve.
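A tiny sketch of the early-stop behavior described above: consume a stream of partial diff results and abandon the run once the observed discrepancy rate crosses a threshold. stream_diff_segments is a hypothetical generator standing in for whatever engine produces partial results, and the 0.1% threshold is just an example.

```python
# Sketch of early stopping over streamed diff results. stream_diff_segments()
# is a hypothetical generator of partial results; the threshold is arbitrary.

MAX_DIFF_RATIO = 0.001        # e.g. give up once >0.1% of compared rows differ
rows_checked = 0
rows_differing = 0

for segment in stream_diff_segments(src_table, dst_table):
    rows_checked += segment.rows_compared
    rows_differing += segment.rows_differing
    if rows_checked and rows_differing / rows_checked > MAX_DIFF_RATIO:
        print(f"~{rows_differing / rows_checked:.2%} of rows differ; "
              "stopping early and moving on to remediation")
        break
```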
[00:40:07] Unknown:
The data type piece in particular is interesting and one that I've had to deal with in some of my own work. The floats and decimals in particular caught my attention, because by default, a lot of the systems that will just shuttle your data between point A and point B, if they see that it's a numeric type and it's not an integer, then they just say, oh, it's just a float then, which can be quite problematic if you actually need the precision of a decimal, particularly if you're dealing with monetary data. And I'm wondering about some of the ways that you've had to deal with some of those edge cases of mismatches in support for complex or custom types between database engines, as well as some of the implicit type conversions that might happen in the process of moving from point A to point B? I think that probably deserves,
[00:40:55] Unknown:
like, a whole section of documentation on how this works. But in general, I think that we thought about this problem hard, and we came to the conclusion that we should do as little as possible implicitly, because we cannot assume for the users how they want the data compared. And so, basically, the kind of decision framework that we settled on was that in the case that we can logically conclude that things can be compared exactly, in a certain way, even though those data types are different, we would do that. However, when we think that users may interpret the results differently, or they actually would want potentially different things depending on the use case and what the semantic meaning of the data is, we try to surface the decision to the user in terms of, for example, how we handle precision and how those, you know, types are mapped to each other.
So that's a kind of continuous work in progress, but at least we were able to settle on this set of rules and build with them in mind, which simplifies at least some of the problem space.
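A small illustration of the float-versus-decimal issue discussed here; the values and the two-decimal-place tolerance are made up, not Datafold defaults.

```python
# Why floats vs. decimals need an explicit precision decision when
# reconciling. Values and the 2-digit tolerance are illustrative only.
from decimal import Decimal

source_value = Decimal("19.99")   # DECIMAL(10,2) in the source system
replicated_value = 19.99          # landed as a FLOAT on the other side

print(source_value == replicated_value)   # False: the binary float is not exactly 19.99
print(Decimal(replicated_value))          # 19.989999999999998863...

# One common approach: agree on a precision up front and compare (or hash)
# rounded values instead of raw representations.
AGREED_PLACES = 2
print(round(Decimal(replicated_value), AGREED_PLACES) == source_value)   # True
```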
[00:42:05] Unknown:
Another interesting aspect of this problem space, particularly when you're moving data between systems, is the question of cost management, because a lot of cloud providers will bill you for data coming in and data going out. And so, ideally, you would minimize the amount of information that has to be shuttled between those systems. And in the work that you're doing of adding a means of seeing I've got all the right data in the right places, how do you try to mitigate that cost of transfer and minimize the amount of information that you have to exfiltrate from the database to be able to do your job? Yeah. I would say this
[00:42:47] Unknown:
cost management, performance management, network, and IO bottlenecks are definitely a primary constraint that we optimize around. And so when we settled on building the system for cross-database data diffing, one of the constraints was that there is no way that we would be exporting, you know, an entire very large dataset out to compare. And that's why we settled on this harder, more algorithmic approach that allows us to minimize the data that gets sent over the network. And, you know, without going too much into the weeds, I would say that the way it works is that, by default, the data that is sent over the network is the metadata. So it's hashes of the rows and columns that are compared.
And then as we drill in into the discrepancies, we are sending certain chunks of the data of the table where we find those discrepancies. And, therefore, you know, only a small fraction of the data is ever sent, and the user actually has control over how much they're sending. So for example, they can say, I want to diff this table, but I wanna make sure that I'm not sending more than, you know, x amount of rows over the network. Or they can say, I want to diff those two tables, but I want you to stop, and I would basically get exactly what I want, if you find at least, you know, x percentage of difference in this column or in the row count or in primary keys, basically, depending on what the user wants. So, basically, to summarize, this is definitely the primary design constraint: to absolutely minimize the amount of data that's sent over the network, and only a small fraction is usually sent. And we give the user control of how much is sent so they can control those costs.
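As a concrete illustration of "only metadata travels over the network", here is a hedged sketch of the kind of aggregate checksum query that can be pushed down to one side (MySQL-flavored functions; the table and columns are hypothetical, and this is not the SQL Datafold actually generates). For a real cross-database diff, both engines need hash expressions normalized so that identical values produce identical digests, which is exactly where the type and collation issues discussed earlier come into play.

```python
# Sketch: an aggregate checksum pushed down to the database so that only one
# small row of metadata per key segment crosses the network. MySQL-flavored
# functions (CRC32, CONCAT_WS); names are hypothetical and this is not the
# SQL Datafold generates.

def segment_checksum_sql(lo: int, hi: int) -> str:
    return f"""
        SELECT COUNT(*)                                           AS row_count,
               SUM(CRC32(CONCAT_WS('|', id, amount, updated_at))) AS checksum
        FROM orders
        WHERE id >= {lo} AND id < {hi}
    """

# Run an equivalent, normalized query on the other database: if row_count and
# checksum agree, the whole segment is assumed identical and none of its rows
# are ever transferred; if not, the segment is split and drilled into.
```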
[00:44:46] Unknown:
And in your work of trying to empower teams to validate the quality and correctness of their data flows, ensure that all of their database systems are healthy, and that their business logic is doing what it's supposed to do, what are some of the most interesting or innovative or unexpected ways that you have seen the Datafold product and the data-diff utility used in this context of data reconciliation?
[00:45:14] Unknown:
Yeah. I would say that we talked about the three major use cases. Right? So change management of your data processing code, where you compare staging and production. We've talked about migrations. We've talked about replication. The use case that we haven't talked about, and that I also didn't anticipate, because I personally haven't dealt with it in my career, is using reconciliation for compliance reasons and for auditing data. So a fair number of our customers are operating in the financial services space, and they get audited by, you know, professional auditors on a regular basis. And I think the interesting trend that we've seen is that previously, all the financial statements, everything that is actually important to the business, would just live in the ERP systems and only be accessed from those systems. Whereas today, when companies are reporting to their users, when they're reporting to their customers, when they're reporting to, you know, the government and authorities, they increasingly, you know, rely on metrics that are computed from their analytical sources, or they would blend their kind of ERP transactional data with those analytical sources. And so that actually imposes an entirely different degree of accuracy and control that you have to implement in your data workflows than existed previously. And so we've seen data-diff being used by data practitioners to basically demonstrate to auditors that when they make changes to, you know, key metrics and key pipelines, when they move data around, they don't lose data integrity. And that's really important. That saves them a lot of time, and, you know, it also saves companies on the auditors' bills because people don't have to work on this manually.
And I think it will also lead to increased usage of data in those, you know, high compliance environments, which I'm personally very excited about.
[00:47:18] Unknown:
Absolutely. Yeah. The the compliance question is particularly interesting to ensure that we have all of the data that we're supposed to have and none of the stuff that we're not.
[00:47:27] Unknown:
Exactly.
[00:47:28] Unknown:
And in your work of building the Datafold product and working with practitioners to address some of these challenging quality questions, what are some of the most interesting or unexpected or challenging lessons that you've learned?
[00:47:42] Unknown:
So I would say that, since we're ultimately opening up a whole new type of tooling and workflow for data practitioners, and I mean data diffing, that's not something that, as a tool, has been widely available to data practitioners before, I think, we released our open source toolkit and the cloud product. We have both the joy and kind of the responsibility to come up with all the terms and make a lot of different design decisions about how this type of work and workflow is done. And we've definitely learned a lot from our users.
And I think some of the interesting realizations, that we kind of talked about already, for example, in terms of the user experience, were that, you know, performance matters to an extent. User experience in terms of how the data is presented, what is presented first, how you think about the coverage of your identified discrepancies and differences versus maybe the depth, like how much data was actually, you know, compared. All those trade offs definitely were not obvious to us when we embarked on this, and we had to, you know, build something that on the one hand is opinionated, but on the other hand supports the workflows of a wide range of data practitioners operating in completely different industries, you know, financial services, health care, fintech. And all of them have some data in common and definitely have common workflows, but a lot of the times, the nature of why they're doing this and what exactly they care about is different. So I would say that, in general, product design and UX have been probably the primary challenges.
Not to discount, obviously, all the hard problems we had to solve on the pure computation and algorithmic side of things just to make sure that we compare data robustly and it runs as fast as possible.
[00:49:39] Unknown:
And for people who are trying to incorporate this reconciliation loop for their data in their database environments, what are the cases where Datafold or data-diff are the wrong choice?
[00:49:53] Unknown:
So we don't support NoSQL yet. This is something that we've heard a lot from our, you know, prospects and customers that they want us to add support for. We have pretty good ideas how to do it, but as of today, you know, March 14, 2024, this is not something that's possible, although we do support a wide range of SQL engines. And I would say, if you're looking for a tool that actually moves data, we're definitely not the choice. We are there to provide visibility and to test. And the same goes for remediation. We got some questions from certain prospects, you know, hey, since you know so much about our data and the discrepancies, why can't you just go in and fix these issues for us? But we thought about this, and we decided that that's just not something that we want to do, again, just because those pipelines and application systems are built in so many different ways. We want to be providing information and visibility and then let the users act on it themselves.
So you can basically get results from Datafold using the UI,
[00:51:05] Unknown:
API, or we can even materialize them back in your database as a table, but we don't want to actually change your production data. That's something that's actually in scope for the team using it to do. As you continue to build and invest in this space and expand the Datafold product, what are some of the things you have planned for the near to medium term, or any particular projects or problem areas you're excited to dig into?
[00:51:31] Unknown:
Yeah. That's a great question. NoSQL support is one, as well as just broadening the systems that we work with for data reconciliation. We're continuing to improve performance, and we do have a lot of interesting things and hacks that we discovered that we'll just be implementing in production to make it run faster. The continuous replication monitoring is a more recent thing that we shipped, where instead of, like, asking Datafold every time via API or UI to compare datasets, you can just schedule those checks and then have Datafold continuously perform them. That's an area that we will continue to evolve. And there are a lot of other things, but they're not necessarily related to reconciliation, so maybe we can talk about them in the next episode.
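To make the scheduled-check idea concrete, here is a rough sketch of what hand-rolled continuous reconciliation could look like if you built it yourself rather than using Datafold's hosted monitoring. The connection strings, table and column names, the psycopg2 driver, the fingerprint query, and the hourly cadence are all illustrative assumptions, not anything from the product.

```python
# Rough sketch of a hand-rolled scheduled reconciliation check, NOT Datafold's
# hosted monitoring feature. Connection strings, table and column names, the
# psycopg2 driver, the fingerprint query, and the hourly cadence are all
# illustrative assumptions.
import logging
import time

import psycopg2  # assumed driver; swap in the DB-API driver for your engines

logging.basicConfig(level=logging.INFO)


def table_fingerprint(conn, table, key_column):
    """Return (row_count, checksum) using a blunt aggregate over the key column.

    Identifiers are interpolated directly because they are static and trusted
    here; a production tool would hash key ranges instead of the whole table.
    """
    with conn.cursor() as cur:
        cur.execute(
            f"SELECT count(*), coalesce(md5(string_agg({key_column}::text, ',' "
            f"ORDER BY {key_column})), '') FROM {table}"
        )
        return cur.fetchone()


def check_once(source_dsn, replica_dsn, table, key_column):
    src = psycopg2.connect(source_dsn)
    dst = psycopg2.connect(replica_dsn)
    try:
        src_fp = table_fingerprint(src, table, key_column)
        dst_fp = table_fingerprint(dst, table, key_column)
    finally:
        src.close()
        dst.close()
    if src_fp != dst_fp:
        logging.warning("Drift on %s: source=%s replica=%s", table, src_fp, dst_fp)
    else:
        logging.info("%s in sync (%s rows)", table, src_fp[0])


if __name__ == "__main__":
    while True:  # naive loop; a real setup would use cron or an orchestrator
        check_once(
            "postgresql://user:pass@source-host/shop",
            "postgresql://user:pass@replica-host/shop",
            "orders",
            "id",
        )
        time.sleep(3600)
```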
[00:52:18] Unknown:
Absolutely. Are there any other aspects of this workflow of reconciliation in your database engines, and the ways that you are giving visibility to that process with Datafold and data-diff, that we didn't discuss yet that you'd like to cover before we close out the show? I would say,
[00:52:35] Unknown:
I get a lot of questions: you guys built the open source data-diff tool that's been pretty popular, and you have the cloud product. What is the difference? How do I know which one I should use? And the answer is actually quite simple. So the open source tool allows you to compare data within databases or across databases. It implements most of the basic adapters. It is a self-contained Python package that is really easy to install, and it's really built and optimized for an individual practitioner. So let's say I'm a data engineer and I just have to compare something that I just copied around, or I want to, you know, maybe test my dbt code when I'm making changes. That is the perfect tool. If I'm a team, if I'm an organization, and I have to basically do these diffs at large scale, in terms of both the volume of data but also a lot of them, and I have to, you know, share results within my team, that's where the cloud comes in. So the cloud actually operates a different engine that basically cannot run as a single, you know, Python application.
It's far more performant and more sophisticated in terms of how it handles large scale and kind of cross-database data mapping and how it operates, basically what kind of primary keys it supports in a table. So it's basically built more for the team and enterprise user. And the analogy that I like to give, it's not fully applicable, but I would say it's illustrative, is there is Git and there is GitHub and GitLab. Right? So you can imagine data-diff is Git in the sense that it gives you the fundamental ability of comparing your datasets, and it's open source and it's free. But if you are a team trying to collaborate on a software project, then you use GitHub or GitLab, where you actually have the full workflow automated and you have the application UI and, you know, full service. So that's kind of the difference that I think is important to highlight. But individual practitioners can for sure start with the open source.
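For readers who want to try the open source path described above, a minimal sketch of a cross-database comparison with the data_diff Python package might look like the following. The connection URIs, table names, and key columns are made up, and the connect_to_table/diff_tables helpers reflect the package's documented Python API at the time of writing, so exact names and signatures may differ by version.

```python
# Minimal sketch of an individual practitioner's workflow with the open
# source data_diff package. URIs, table names, and key columns below are
# illustrative; the helper names follow the package's documented Python API
# and may differ between versions.
from data_diff import connect_to_table, diff_tables

# Source table in Postgres and its copy in Snowflake, keyed on "id"
orders_src = connect_to_table(
    "postgresql://user:password@source-host:5432/shop", "orders", "id"
)
orders_dst = connect_to_table(
    "snowflake://user:password@account/shop/PUBLIC?warehouse=WH", "ORDERS", "ID"
)

# diff_tables yields (sign, row) tuples: "-" for rows present only in the
# first table, "+" for rows present only in the second.
for sign, row in diff_tables(orders_src, orders_dst):
    print(sign, row)
```

The package also ships a CLI that wraps the same comparison, which is often the quickest way to sanity-check a one-off copy before wiring anything into CI.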
[00:54:40] Unknown:
Absolutely. Well, for anybody who wants to get in touch with you and follow along with the work that you and your team are up to, I'll have you add your preferred contact information to the show notes. And for the usual final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. That's interesting.
[00:54:58] Unknown:
I think that there are a few things that I'm excited about. One, there's been a lot of talk about how, you know, the hype cycle shifted from the modern data stack or data engineering to AI and LLM applications and what it means for the practitioners, what it means for companies building in the space. I'm very excited about this. You know, we're not an AI company. We're not building models. But I think that for any business to actually be able to effectively apply those LLM applications in their domain, for their business, for their users, the only way they can do this is by, you know, feeding in their business context and data. Right? If I'm a support agent that is backed by, let's say, GPT, I need to know all the previous orders and all the history I have with this particular user and all the previous interactions. So that data better be correct. And that's where, you know, old school data engineering comes in, where now delivering this data faster, more timely, more reliably, and just more of it, more context, becomes ever more important. So I'm very excited about this as an entirely new field for how data pipelines actually get consumed.
That's one thing I'm excited about. But to answer your question about tooling, I'm excited about all the innovations that are happening on the IDE and data engineering tooling side with the enablement of, you know, LLMs. I think that, as it usually happens with data relative to the software space, we're still behind the level of support that software engineers get. Right? You know, Copilot works really well for software engineers, but there's still quite limited support that we get as data engineers. And I think it's for the same reason: in order to be able to help data practitioners, you have to feed in the context, and the context is so specific to the company. So I'm very excited for this. We have yet to see, you know, the really interesting tools, but I follow this field pretty closely.
[00:57:01] Unknown:
Absolutely. Well, thank you very much for taking the time today to join me and talk us through some of the challenges and problems and resolutions for reconciliation of your data in your database systems. I appreciate all the time and energy that you and your team are putting into bringing visibility to that and helping teams accelerate the process. So thank you again for taking the time today, and I hope you enjoy the rest of your day. Thank
[00:57:31] Unknown:
you, Tobias.
[00:57:33] Unknown:
Thank you for listening. Don't forget to check out our other shows: Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a product from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Overview of Data Lakes
Guest Introduction: Gleb Mezhanskiy
Evolution of Data Management
Understanding Data Reconciliation
Challenges in Data Replication
Warehouse Migration Issues
Simplifying Reconciliation with Datafold
Compliance and Auditing Use Cases
Future Plans and Innovations