Summary
Building and maintaining reliable data assets is the prime directive for data engineers. While it is easy to say, it is endlessly complex to implement, requiring data professionals to be experts in a wide range of disparate topics while designing and implementing complex topologies of information workflows. In order to make this a tractable problem, it is essential that engineers embrace automation at every opportunity. In this episode Chris Riccomini shares his experiences building and scaling data operations at WePay and LinkedIn, as well as the lessons he has learned working with other teams as they automated their own systems.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
- RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
- Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.
- Your host is Tobias Macey and today I’m interviewing Chris Riccomini about building awareness of data usage into CI/CD pipelines for application development
Interview
- Introduction
- How did you get involved in the area of data management?
- What are the pieces of data platforms and processing that have been most difficult to scale in an organizational sense?
- What are the opportunities for automation to alleviate some of the toil that data and analytics engineers get caught up in?
- The application delivery ecosystem has been going through ongoing transformation in the form of CI/CD, infrastructure as code, etc. What are the parallels in the data ecosystem that are still nascent?
- What are the principles that still need to be translated for data practitioners? Which are subject to impedance mismatch and may never make sense to translate?
- As someone with a software engineering background and extensive experience working in data, what are the missing links to make those teams/objectives work together more seamlessly?
- How can tooling and automation help in that endeavor?
- A key factor in the adoption of automation for application delivery is automated tests. What are some of the strategies you find useful for identifying scope and targets for testing/monitoring of data products?
- As data usage and capabilities grow and evolve in an organization, what are the junction points that are in greatest need of well-defined data contracts?
- How can automation aid in enforcing and alerting on those contracts in a continuous fashion?
- What are the most interesting, innovative, or unexpected ways that you have seen automation of data operations used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on automation for data systems?
- When is automation the wrong choice?
- What does the future of data engineering look like?
Contact Info
- Website
- @criccomini on Twitter
- criccomini on GitHub
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
- WePay
- Enterprise Service Bus
- The Missing README
- Hadoop
- Confluent Schema Registry
- Avro
- CDC == Change Data Capture
- Debezium
- Data Mesh
- What the heck is a data mesh? blog post
- SRE == Site Reliability Engineer
- Terraform
- Chef configuration management tool
- Puppet configuration management tool
- Ansible configuration management tool
- BigQuery
- Airflow
- Pulumi
- Monte Carlo
- Bigeye
- Anomalo
- Great Expectations
- Schemata
- Data Engineering Weekly newsletter
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends at Linode. With their new managed database service, you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes with automated backups, 40 gigabit connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show.
Your host is Tobias Macey, and today I'm interviewing Chris Riccomini about building awareness of data usage into CI and CD pipelines for application development and just the overall approach to automation to simplify the work of data professionals. So, Chris, can you start by introducing yourself? My name is Chris.
[00:01:08] Unknown:
I have been working in the data space for about 15 years now. I started after a brief stint at PayPal where I was doing data science and data visualization. I joined LinkedIn and spent 6 and a half years there. I started in the data science world. Within about a year, switched over to the engineering side and spent a bunch of time doing stream processing at LinkedIn, specifically working on Apache Samza, which is the stream processor that came out of LinkedIn. And then I joined WePay, which is a payments company that was acquired by JPMorgan Chase a couple years ago. And, again, ran their data infrastructure team where we set up a cloud data warehouse with BigQuery, real time CDC with Debezium, bunch of Kafka connectors to move data around and sort of realized the vision of, like, a data integration, for lack of a better term, enterprise service bus on top of Kafka.
And then since last November, I've just been hanging out and sort of relaxing. So that's me in a quick nutshell. I've also written a book with a friend of mine over the past couple of years, Dmitriy, who's the VP of engineering at Zymergen. And the book is The Missing README, which has little to do with data and is more about just getting new college grad software engineers up and running. So that's kind of the projects I've been working on.
[00:02:28] Unknown:
And you mentioned a little bit about some of your earlier career work, and I'm wondering if you can just share sort of how you first got involved in the area of working with data and what it is about the space that keeps you interested and motivated?
[00:02:40] Unknown:
I wish I could answer that question. It just sort of seemed like a natural affinity. For my first internship at PayPal, I joined this team that was called the advanced concepts team, and it was sort of this, like, skunkworks lab team, a couple of people who were doing exploration and sort of new technology. And so I spent a bunch of time doing data visualization at PayPal. And what that means is they had a data warehouse, Teradata, at the time. And it was, like, pulling down a bunch of their data and just visualizing it to try and understand it. We were coming at it from a fraud context. And so, you know, you would pull down billions of transactions, and it was a really fascinating way to kind of get my hands dirty just understanding the power of data and, like, exploring it. And there was really no, you know, specific end game in mind other than let's explore and understand and look for, you know, fraud trends.
And from there, while I was at PayPal, I started getting interested in graphs, in graph traversal, because 1 of the projects I worked on was visualizing transaction graphs. So, you know, I am a user. I have a credit card. Some other account also happens to add that same credit card. Maybe that credit card is stolen. So if you can imagine the visualization, there's 3 nodes in this graph. There's me, the other account, and we're connected via this credit card, right, which is a 3rd node. And through my exploration of graph databases and stuff, I kinda stumbled over Hadoop. I got really interested in Hadoop and kind of wanted to learn that technology. And that's initially what drew me over to LinkedIn is they were just building out their Hadoop ecosystem. And so a friend of mine and a mentor moved over from PayPal to LinkedIn and kind of, you know, caught my interest.
And then I transferred to LinkedIn. And it was sort of the same story where I spent the 1st year doing data science and exploring stuff and then quickly realized, like, a lot of the valuable work to be done was in, the term wasn't data engineering back then, but in the data engineering space and, like, getting the, you know, features and the data into Hadoop, you know, to train the model, being able to scale the model. You know, at the time, 1 of the main things I was working on at LinkedIn was something called People You May Know, which, again, is a graph algorithm mostly. And they were running it on Oracle, and it was taking, like, 6 weeks to complete a single, you know, training run. And so it's, like, okay. Great. We have this wonderful model, but it takes us 6 weeks to refresh it and it's super brittle. Can we improve it? Eventually, we got it down to, like, sub 24 hours, and it was just a completely game changing thing. And so that drew me, I think, from sort of this data science world into the engineering world and, like, realizing the power, especially at that point in time, was heavily tilted towards, you know, investing in engineering.
[00:05:30] Unknown:
And in terms of your experience of working in this ecosystem and helping to build out some of the infrastructure and processes and sort of organizational capacity for being able to take advantage of data and actually power some of these data science use cases. What are some of the pieces of data platforms and processing that have been most difficult to scale, not necessarily in the technical sense, but in the organizational sense, and being able to sort of build up and maintain velocity of being able to actually use them and iterate on the data products that you're trying to
[00:06:04] Unknown:
create? My answer is probably gonna be a recurring theme in this conversation. I don't know whether it's delivered or not, but I think the biggest organizational challenge, it's been recurring over the years. It's, like, been a fairly constant thing. I won't make the claim that it's solved. It has been managing sort of the contracts of data schemas, especially at the seams between teams. So if I'm a team and I have a data model for some event and some other team is using that data, managing an agreement where I am not going to break the other team when I mutate my schema is definitely a big challenge. But like I said, I don't think it's a solved problem, but that's something that we had issues with at LinkedIn. It's something that we had issues with at WePay, and there are tools out there. So, for example, the Confluent Schema Registry has the ability to enforce backwards and forwards compatibility of schemas. So as you're evolving a schema, it will prevent incompatible changes from getting into your data pipeline, i.e., Kafka.
And that's actually something that came out of LinkedIn. We had 1 of those at LinkedIn back in the day as well. But it's complicated because I think a large part of the problem is not technical. It's like cultural and social. It's like helping the engineers understand what are the rules they must abide by and why. Like, why is it a bad thing for me to drop a column that was required? Like, I don't need that column anymore. It is required, but I no longer have the data for whatever reason. I wanna get rid of it. Why can't I do that? And then, like, having the team and the engineers understand, well, you know, that column is used by 8 other teams.
It's powering a machine learning model. It's indexed in our search index. That's a challenge, and I think something that is work that remains to be done. So I think that's my number 1 answer is schema. I think a second thing that was not as challenging, but definitely something that was in the air, is coming up with clearly defined ownership, especially around operations. So especially with data, it's a lot of, like, frameworks and platforms and job schedulers and, you know, all that kind of stuff. And so when things break, like, figuring out an operational model that works between the teams that are using the frameworks and the platforms and the teams that are running the frameworks and the platforms, it needs some thought. So, you know, if I'm responsible for Hadoop and you're running your job on Hadoop and your job doesn't work, like, triaging that, figuring out who needs help, when they need help, how to get help is a challenge. And it can definitely lead to burnout on the infrastructure teams if it's not thought through well. So I think that would be a number 2 answer that I would give is sort of figuring out operational responsibility of the systems that are being run. Yeah. And I think that
[00:09:01] Unknown:
the operational aspect and sort of figuring out who owns what, where do the responsibilities lie as the data goes across the different stages of its life cycle is definitely always an open question and 1 that I don't think ever stays settled.
[00:09:16] Unknown:
Well, the answer is easy. The answer is always, I don't own it, so it's always not me.
[00:09:22] Unknown:
Yeah. If only. And so in terms of these 2 elements that you highlighted of the schema evolution and the contracts of schema as it traverses these different systems and the stages of the platform, and then the ownership of that data and who is responsible for maintaining that schema and ensuring that it stays correct across those different stages and across those transition boundaries. What do you see as the opportunities for automation to alleviate some of the toil that's associated with this work and making sure that all of your pipelines stay, you know, healthy and running and don't break because somebody forgot to update the schema record or somebody forgot to notify somebody downstream that, oh, I'm going to be changing this on such and such date, or even actually planning the fact that they're going to change that in the first place.
[00:10:13] Unknown:
Yeah. This is the question I'm really excited about. So I think there's a lot of opportunity here. Now I mentioned earlier the backwards and forwards and, quote, unquote, full compatibility that something like the Confluent Schema Registry can give you. Now the problem with that approach, at least as it's shipped out of the box (caveat: at least the last time I looked, which is a while ago), is that it was at runtime. So what that means is you don't discover that your schema evolution is bad until you actually try to send the message and the schema registry Kafka encoder fails, you know, and you get an error in your logs, and your application, you know, essentially stops working.
So 1 of the things we did at WePay to kind of alleviate that problem. Like, we don't wanna find out in production or in testing or staging when we're sending messages that things have broken. We wanna find out, like you said, in continuous integration tests or, you know, GitHub tests, essentially the stuff that's running pre commit. 1 of the things that we did was we started doing compatibility checks pre commit. So we would take your schema beforehand and then take your schema afterwards, and we would compare them and, like, look and try and understand was it a compatible change according to the rules that we'd set forth.
And we initially did that for the events pipeline that we had, which was essentially you're sending messages to Kafka from some publisher. Right? And for that, we were using Avro as our schema, our DDL, or whatever you wanna call it. And that worked quite well, but it was limited to the event publishing. Then there were our primary OLTP databases. So these were MySQL DBs with, like, you know, transaction data, user data, all the kind of stuff that you can imagine from a payments company. And we would funnel that data into Kafka. And then from there, we would stream it into BigQuery. And we were discovering we were having the same problems in the CDC pipe that we had had in the event publishing pipe, which is some application developer would decide to mutate their DB schema, which is a totally sane thing to do. And, in fact, application developers are conditioned to think that their DB for their microservice is encapsulated, and it is, like, private, and they're able to do what they want with it. And so, you know, they would drop some columns, and that would cause the CDC pipeline to be unable to publish into Kafka because the compatibility wasn't there anymore because they dropped the required field or what have you. And so we actually extended the CICD pipe to check not only the event schemas, but also the DB schema evolution changes.
And so phase 1 of this answer is I think there's a lot of room for automation in just checking schema changes before they make their way into either a production or preproduction environment. I mean, these can be done at, you know, essentially compile time, at commit time. In the case of the DB checks, what we were doing was essentially spin up a little MySQL instance in Docker, run the migrations up to the latest change, you know, sort of snapshot the current DB schema, run the new migration, snapshot the DB schema again, and then compare: is this safe or not? There's all kinds of interesting edge cases you have to think about. Like, is an integer going from a bigint to a smallint? Is that considered compatible? So it's an interesting problem.
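As a rough illustration of the kind of pre-commit check described above, the sketch below compares a before/after snapshot of a table's schema and flags destructive changes. The column names, the required/nullable convention, and the integer-narrowing rule are illustrative assumptions, not the actual WePay tooling.

```python
# Illustrative pre-commit schema compatibility check (a sketch, not WePay's tool).
# Schemas are modeled as {column_name: (sql_type, is_nullable)} snapshots taken
# before and after applying a migration to a throwaway MySQL instance.

# MySQL integer types from narrowest to widest; narrowing is treated as unsafe.
INT_WIDTHS = ["TINYINT", "SMALLINT", "MEDIUMINT", "INT", "BIGINT"]

def check_compatibility(before: dict, after: dict) -> list:
    """Return human-readable violations; an empty list means the change looks safe."""
    violations = []
    for column, (old_type, old_nullable) in before.items():
        if column not in after:
            if not old_nullable:
                violations.append(f"required column '{column}' was dropped")
            continue
        new_type, new_nullable = after[column]
        if old_type in INT_WIDTHS and new_type in INT_WIDTHS:
            if INT_WIDTHS.index(new_type) < INT_WIDTHS.index(old_type):
                violations.append(f"'{column}' narrowed from {old_type} to {new_type}")
        elif old_type != new_type:
            violations.append(f"'{column}' changed type from {old_type} to {new_type}")
        if old_nullable and not new_nullable:
            violations.append(f"'{column}' became required; existing rows may violate it")
    return violations

if __name__ == "__main__":
    before = {"id": ("BIGINT", False), "amount": ("BIGINT", False), "memo": ("VARCHAR(255)", True)}
    after = {"id": ("BIGINT", False), "amount": ("SMALLINT", False)}
    for problem in check_compatibility(before, after):
        print("INCOMPATIBLE:", problem)  # a CI job would fail if this list is non-empty
```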
We got a lot of mileage out of running that, and I think that's sort of phase 1. I think phase 2, where we never really got to, but I think is something that the industry will be spending time on over the next few years, is providing automation and tooling around what to do when the application developer does want to make an incompatible change. Like, it is legitimate to make incompatible changes. Sometimes the data goes away. Sometimes the data needs to be changed in a certain way. Right? And so providing them the tooling to safely make the changes was something we never got to, at least not as long as I was there, but that was in my mind and on the roadmap. So what I mean by that is let's imagine that you want to drop a required column, for example, or you want to change a string to an integer for a given column.
In the case of something like streaming, 1 could imagine allowing the application developer to write a little stream processor that would take the new change and sort of munge it into the old data in cases where that is possible. So, for example, string to int is probably something where you could write a stream processor and convert the int back to a string to make it compatible with the old schema. Right? Dropping a required field, well, maybe you need to call an external microservice or get the data somewhere else. If the data is fully gone, though, your change is actually truly incompatible and irreversible. You cannot recover from it. I think the second part of this automation story and tooling story is we need to provide good ways for them to do a, quote, unquote, major version change on their database schema. I think fortunately, there's a great set of patterns and practices in the microservice world because all the stuff I'm talking about right now is essentially just the same thing as microservice API compatibility.
You know, it's semantic versioning and major, minor, you know, micro or patch and, you know, having API gateways and all that kind of stuff, but applied to the data space. So I think that's sort of my long winded pitch on automation as a helpful thing for this space.
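One way to picture the "little stream processor" idea from above is a shim that consumes events published with a new schema and republishes them in the old shape, so existing consumers keep working during a major-version migration. This is a hedged sketch: the topic names, field names, and the confluent-kafka client usage are assumptions for illustration, not anything described in the episode.

```python
# Hypothetical compatibility shim: read v2 events, downgrade them to the v1 shape,
# and republish so old consumers keep working. Topic and field names are made up.
import json
from confluent_kafka import Consumer, Producer

def downgrade(v2_event: dict) -> dict:
    """Map a v2 payment event back onto the v1 schema."""
    v1_event = dict(v2_event)
    # v2 changed amount_cents from string to int; v1 consumers expect a string.
    v1_event["amount_cents"] = str(v2_event["amount_cents"])
    # v2 dropped a formerly required field; backfill a sentinel that v1 allows.
    v1_event.setdefault("legacy_reference", "UNKNOWN")
    return v1_event

consumer = Consumer({"bootstrap.servers": "localhost:9092",
                     "group.id": "payments-v1-shim",
                     "auto.offset.reset": "earliest"})
producer = Producer({"bootstrap.servers": "localhost:9092"})
consumer.subscribe(["payments.v2"])

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    event = json.loads(msg.value())
    producer.produce("payments.v1", json.dumps(downgrade(event)).encode("utf-8"))
    producer.poll(0)  # serve delivery callbacks
```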
[00:16:01] Unknown:
To your point as well about the sort of API versioning and compatibility management across service boundaries, I'm wondering what your thoughts are on the opportunity for being able to extend some of the application paradigms into treating those downstream data consumers as 1 of those service boundaries and figuring out how do we make those contracts more of a kind of natural extension of the development life cycle. Whereas right now, it's, you know, there's a database there somewhere. It's the data engineer's job to find out the fact that it exists, how to pull the data from it, how to reverse engineer meaning from this, you know, assortment of tables that doesn't necessarily have any useful context unless you're looking at the code that's actually creating and using them, how to think about extending the ORM to populate some of that schema information into things like Debezium or the Confluent Schema Registry for the case where you are consuming directly from the database, or just how to more clearly define that service boundary for those downstream consumers?
[00:17:11] Unknown:
My intuition is that I want to borrow heavily from the microservice world. And so in my experience, the way that kind of evolves is, you know, you have a bunch of microservices and eventually, you know, a given team, you know, might have a collection of them. And eventually, you know, if you grow big enough, you start providing some kind of internal, like, API gateway to the rest of the organization. And then you have a service mesh that's, you know, tracking who's calling what, and you start getting a bunch of tooling built around it to track the calls across all of these inter team exchange points.
And so I think we should do something very similar in the data space. You know, specifically, I would like an application team to define a data product, and this is sort of borrowing from the data mesh terminology a little bit. And the data product is a schema that they are publishing for their data that is meant to be consumed by other teams. I would like that data product to be semantically versioned so that we can track compatibility. And then I would like for the data engineering team to provide the application team with a set of tools that allows them to transform their internal data into this data product and evolve and manage the data product properly. So this is like data catalogs, data quality checkers, schema compatibility checkers, you know, migration tools when you're doing a major migration bump to track, you know, which consumers are consuming your data. You know, using Kafka, for example, all that data can be sucked out of the offset topics. You can see who's reading what based off the commits that they're doing.
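The "see who's reading what" idea can be sketched by inspecting committed consumer-group offsets. The snippet below uses kafka-python's admin client and a made-up topic name; it is a rough illustration of the approach rather than a complete lineage or contract-enforcement tool.

```python
# Rough sketch: list which consumer groups have committed offsets for a data
# product's topic. Assumes kafka-python; the topic name is hypothetical.
from collections import defaultdict
from kafka import KafkaAdminClient

DATA_PRODUCT_TOPIC = "payments.events.v1"  # hypothetical data product topic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
consumers_by_topic = defaultdict(set)

for group_id, _protocol in admin.list_consumer_groups():
    for topic_partition in admin.list_consumer_group_offsets(group_id):
        consumers_by_topic[topic_partition.topic].add(group_id)

print(f"Groups consuming {DATA_PRODUCT_TOPIC}:")
for group in sorted(consumers_by_topic.get(DATA_PRODUCT_TOPIC, [])):
    print(" -", group)
```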
So I think we should borrow heavily from the microservice paradigm. And in a sense, that's what I see when I look at data mesh. You know, I think data mesh has a lot of words. It's very complicated to read those blog posts. That's not a knock against Zhamak. I think she's trying to just explain something that's complicated. But I think if we just sort of think about the data space as, like, hey, it has a lot of the same problems architecturally as microservices do. And then culturally, you know, it has a lot of the same problems that, like, operations had, and we grew the microservice and DevOps culture for those problems. Can we borrow a lot of those same philosophies and ideas for the data space? I believe we can. And in fact, I wrote a blog post trying to describe data mesh that is essentially that. Like, let's look at microservices. Let's look at DevOps and sort of apply them to the data space.
To get back to your question, I think the data engineering team should sort of shift, to your point, from, hey, their job is to, like, get data where it needs to be, to their job is to provide tooling to the application teams to get data where it needs to be. So a lot more federation and, you know, automation. If you kinda step back and just look at, like, what data teams do, it's kinda crazy. Like, the pitch is we're gonna have this centralized team that is responsible for moving all data in the organization. Like, again, the idea that you would have a microservice team that is responsible for all microservices is just bonkers. Like, it doesn't make any sense. So I think, you know, federation has to happen. It's just not scalable to have 1 team responsible for all the data, especially when the data is critical. I definitely agree with the
[00:20:28] Unknown:
kind of principle of, you know, bringing the data engineers and the application developers closer together, letting the data engineers work as more of a support team in a lot of the same ways as SREs and platform engineers have been evolving to understand what their purpose is in the overall engineering group. So I think that they're going through a kind of parallel journey to what the sort of DevOps ecosystem and, you know, the actual concrete implementations of that have been in the form of SREs, platform engineers. You know, it's evidenced by the fact that data platform engineer is an emerging title where, you know, it's not my job to actually do all the data manipulation. My job is to build the systems that let you do it kind of a thing. 100%. And so, you know, my team at WePay was not called the data engineering team. It was called the data infrastructure team.
[00:21:15] Unknown:
Like, it was platform stuff as opposed to, like, data engineering stuff. And I think 1 thing when I look at the ops world is you've got sort of the platform tooling people that are centralized. And so there's some kind of centralized ops organization. They're building the tooling and the platforms that are being used by the rest of the org. Then you have this other role, which is the embedded SRE. The embedded SRE sits with the team. They understand the products that the team are building and understand when they're shipping what, you know, ideally, and are sort of helping liaise between the individual application team and operations writ large. And the thing I'm excited about is this sort of new role, the analytics engineer, that's showing up now. Because when I look at that pattern of sort of centralized ops and then embedded SRE, I can very easily mentally map the legacy data engineer term onto that data platform centralized role and then the analytics engineer into the kind of, like, the embedded SRE role where the analytics engineer is gonna understand the data models of a given application team and understand how to, you know, build data marts or micro data warehouses or whatever you wanna call them, understand the evolution of the application team's schemas and stuff. And, again, help them liaise between their data and the rest of the organization.
And so I'm really excited about this analytics engineering role that's, you know, getting a lot of traction these days because I think the combination of, like, centralized data engineering or data platform engineers, you said, that's building tooling, automation, federation stuff. And then the analytics engineers are kind of embedded in helping define robust data product schemas, helping explain, like, why it's bad to drop a required field to the application team is actually, like, super valuable. So I think that those 2 things working together is gonna be a big deal. And it just maps so cleanly onto the success we saw with the DevOps and SRE world that I'm really bullish on that.
[00:23:10] Unknown:
It's time to make sense of today's data tooling ecosystem. Go to dataengineeringpodcast.com/rudder to get a guide that will help you build a practical data stack for every phase of your company's journey to data maturity. The guide includes architectures and tactical advice to help you progress through 4 stages: starter, growth, machine learning, and real time. Go to dataengineeringpodcast.com/rudder today to drop the modern data stack and use a practical data engineering framework. Continuing on this topic with the sort of parallels between the DevOps transformation and the kind of data transformation that we're going through now.
1 of the, I think, core components that powered the overall transformation to where we are with DevOps is the adoption and evolution of CICD principles where everything that goes into production has to make its way through these defined pipelines that are visible, that everybody has access to, that everybody can understand where things are in the delivery cycle. And I know that, you know, some of those same utilities are being used for the data ecosystem, and also there are some parallels in the case of data orchestrators that serve as that kind of central visibility. But I'm wondering, what are some of the other concepts of DevOps and some of the practices that are being adopted by app dev teams that are still nascent or yet to be kind of translated into the data ecosystem and some of the opportunities for teams to be able to start experimenting with those ideas?
[00:24:43] Unknown:
A couple of things that I see. So 1 of them is taking ops tools for managing, you know, infrastructure. So this is Terraform essentially. Right? Chef, Puppet, Ansible, Terraform, whatever you wanna call it, and applying them to the infrastructure tools in the data space. The second 1 is operations best practices when it comes to metrics and monitoring and operations and observability. So on the first 1, essentially applying ops, you know, tooling when it comes to deployment and configuration management and stuff. You know, something I saw firsthand at WePay was, you know, we had a robust ops team that was doing a bunch of stuff with Terraform, and then we had our little motley data team of 6 people. And we were tasked with running BigQuery, Airflow, Kafka, a bunch of Kafka connectors and stuff. And some of those things made it into the Terraform world and some of those didn't. And the things that made it into the Terraform world were the things that the SREs were heavily involved with. So Kafka connectors, for example.
Things that didn't make it into the Terraform world were things that SREs were not as directly involved with, and that would, you know, namely be managing the data warehouse. So datasets, access controls, all that kind of stuff. And so I see us moving to a world where we're gonna be applying, you know, Terraform or Terraform-esque stuff, Pulumi or whatever it is, to the data tools that we have. And I think a good place to start with that is the data warehouse and, like, hey, let's manage our data warehouse fully from 1 of these config management tools. So when I create a dataset, when I create an S3 bucket, when I grant access to this, that, or the other thing, let's not have a data engineer do that. Let's have the tool do that, and we can submit a PR to the repo and get it reviewed and committed. And, you know, there's an audit trail and security's happier and stuff. So that's 1 thing that I see us being able to borrow from when it comes to looking over the fence at ops.
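A minimal sketch of the "manage the warehouse from version-controlled config" idea, assuming BigQuery and the google-cloud-bigquery Python client; in practice this is exactly the niche Terraform or Pulumi fill, and the dataset names and group emails below are placeholders.

```python
# Sketch: apply a version-controlled YAML spec of datasets and reader grants to
# BigQuery. Names are placeholders; requires google-cloud-bigquery and PyYAML.
import yaml
from google.cloud import bigquery

SPEC = yaml.safe_load("""
datasets:
  - id: analytics_payments
    location: US
    readers:
      - data-science@example.com
""")

client = bigquery.Client()

for spec in SPEC["datasets"]:
    dataset = bigquery.Dataset(f"{client.project}.{spec['id']}")
    dataset.location = spec["location"]
    dataset = client.create_dataset(dataset, exists_ok=True)  # idempotent "apply"

    entries = list(dataset.access_entries)
    for reader in spec.get("readers", []):
        entry = bigquery.AccessEntry(role="READER",
                                     entity_type="groupByEmail",
                                     entity_id=reader)
        if entry not in entries:
            entries.append(entry)
    dataset.access_entries = entries
    client.update_dataset(dataset, ["access_entries"])  # push the grant changes
```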
I think the second thing, the observability and data quality stuff, I would say is a little farther along in terms of adoption. There's a bunch of tools out now that are pretty great. There's Monte Carlo and there's Bigeye and Anomalo and a bunch of other ones. Oh, and Great Expectations, of course, is the 1 I was forgetting. And these tools approach data quality checks in a bunch of different ways. Some of them are more, I would say, kinda like unit testy, where you are defining the shape of your data. So I expect the cardinality of the country code column to be 255, or, you know, however many countries there are at any given point in time. The second iteration of that is a little more automated, where they will try and derive the rules for you. So, hey, here's my data. Go figure out some good heuristics, and it will come back and say, well, the cardinality for your country code is currently 255, so let's enforce that. Right? And the 3rd iteration is more of an anomaly detection, fancy ML thing where it's not deriving anything. It's just looking at your data over time and noticing, like, hey, the cardinality of this column has been 255 for the last 30 weeks, and now it's 10,000. That's weird. Like, we should alert you, right, and let you know. And so I think that applying that stuff and really getting rigorous about the data in our data pipelines and the data in our data warehouse and making sure everything is healthy, we need to take that seriously, especially as we start, you know, using data for data products that we're exposing to our customers. That's a big problem when you start giving your customers the wrong data.
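To ground the styles of check described above, here is a toy version of the first two (a fixed expectation plus a naive history-based anomaly check) in plain Python. The column name, thresholds, and three-sigma rule are invented for illustration; real deployments would lean on tools like Great Expectations, Monte Carlo, Bigeye, or Anomalo.

```python
# Toy data quality check: a fixed "unit test" expectation on cardinality plus a
# naive anomaly check against recent history. Thresholds are made up.
import statistics
import pandas as pd

def check_country_code(df: pd.DataFrame, history: list) -> list:
    failures = []
    cardinality = df["country_code"].nunique()

    # 1) Fixed expectation: roughly "how many countries should there be".
    if not 150 <= cardinality <= 300:
        failures.append(f"country_code cardinality {cardinality} outside expected range")

    # 2) Naive anomaly detection: compare today's value against recent history.
    if len(history) >= 10:
        mean, stdev = statistics.mean(history), statistics.stdev(history)
        if stdev and abs(cardinality - mean) > 3 * stdev:
            failures.append(f"cardinality {cardinality} deviates from historical mean {mean:.0f}")
    return failures

# Example run: history says roughly 255 distinct countries, today's load has only 2.
history = [255, 254, 255, 256, 255, 255, 254, 255, 255, 256]
df = pd.DataFrame({"country_code": ["US", "DE"] * 5})
print(check_country_code(df, history))
```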
[00:28:31] Unknown:
As far as the testing element in the kind of application development environment, there have been different formulations of the testing pyramid, where there are different quantities of different layers of tests that you want to have in place to ensure that you can deliver with confidence. And I know that that is a sort of capability that's been adopted at various levels of kind of sophistication or commitment in the data ecosystem, particularly with some of the tools that you mentioned: Great Expectations, Monte Carlo, Anomalo.
And I'm wondering what you see as the useful strategies for determining what are the appropriate targets for where to position those tests and how to understand what is the scope of these tests across the sort of different, you know, components of the data ecosystem, you know, how and when to execute them, where, you know, there's the difference between, like, unit tests that are executed as part of your CICD to say, I'm making this change. Is this safe? Okay. It's in production now. And 1 of the interesting challenges of the data aspect is that there isn't usually just 1 sort of, like, make this check now and it's good for all time. It's make this check now and then keep making it for all time because it's going to change. Yeah. Yeah. Yeah. And I think another thing to consider, and this
[00:29:54] Unknown:
was a huge issue for us at WePay: cost. So, to your point, running a data quality check and having to run that over and over again, because just because it passed once doesn't mean it's gonna pass again tomorrow (like, we need to keep checking data over and over again), can get really expensive in these cloud data warehouses, and so you have to be really, really smart about it unless you wanna spend tens or hundreds of thousands of dollars, which we did for a while, and it was actually kind of pleasant to be able to just ignore it. But at some point, finance comes knocking and it's like, why is half our bill spent on data quality checks? In terms of where to place the checks, we kind of took a very bifurcated approach, and we checked stuff upfront right at the beginning.
And then we, on the other end of the spectrum, check stuff right at the end in the data warehouse. So if you imagine our pipeline is, like, OLTP database, Debezium, Kafka, KCBQ, which is the Kafka Connect BigQuery connector, and then BigQuery. So there's, like, 5 or 6 different moving pieces in that pipeline. We would check stuff sort of pre commit in a CICD pipe, and this is, like, schema compatibility checks. Is your data gonna make it in at all, or is there gonna be, you know, compatibility issues? And then in certain cases, it was also kinda like these contrived tests. I think dbt might support this as well now, where you can, you know, create a dataset, load some data into it, run your queries, and verify that the query result is as expected. So we would do some of that as well. Kinda like smoke test kinda stuff. And then on the other end of the spectrum, we would do essentially checks against the data that landed in our data warehouse. And the theory was, you know, if the data was bad anywhere upstream in the pipeline, it would manifest itself as bad in the data warehouse as well. And then we would alert. And what that doesn't give you is, like, when something goes wrong, you need to figure out where in the pipeline it went wrong. Was it, like, KCBQ that had an issue? Did it drop some messages? Was Debezium having a problem? And that was something that we spent more time on than I would like. And, you know, realistically, we wanna build better sort of binary search tooling to kinda whittle down where in the pipe things broke when they went wrong. But the approach we took, as I said, was bifurcated into CICD, smoke and unit test stuff. It was more like, is the query logic correct?
Is the schema gonna work properly? And then much more robust stuff in the data warehouse, which is like the anomaly detection stuff that I mentioned. The other thing we did was DLP. So Google Cloud has a DLP product, which stands for, I think, data leak or data loss protection, which is a terrible name. But what it really did in our case was it would detect sensitive data. So PII, emails, usernames, passwords, that kind of stuff, and it would alert you if a given table had PII in it. And then we had separately a metadata set of, like, which tables had PII. And so we would do those kind of checks, data quality checks, security checks, and stuff in the data warehouse, and it would alert us. There was something we didn't do that I think we had as a Jira that would get bumped from 1 quarter to the next, which was checksumming. And so this is something that I think the napkin problems blog has written about. It's a fantastic post where, essentially, the idea is that to do the data quality check, you construct something like a Merkle tree of all your rows in your source and your destination, and you compare those checksums.
And if they don't match, then you can kind of traverse the Merkle tree and figure out which rows are incorrect or different. You can then, you know, figure out why they're different and sort of solve the problem. This is actually the way that, like, Cassandra's anti-entropy repair works. And so that was something that we were considering because it would be much cheaper than doing, like, select star in source, select star in destination, and, like, row by row in Python, comparing the rows and columns. So that was something that we were considering but never built out. But there's a really good post on napkin problems. The author, his name is escaping me now, actually has an open source library that he's built that does that. And so for those that are interested, I would highly recommend checking out that blog post and also his open source library. And, again, data-diff is the name of the tool. I actually just did an interview with the author of the library and Gleb Mezhanskiy from Datafold, who helped with supporting that development. Yeah. He's great. I had a chat with him a few weeks back. Very knowledgeable on the subject. We were just vibing because it was like, yeah, this seems easy. And then, you know, you start digging into it. It's such a robust and complicated problem. I think the fascinating thing for me talking to him was that his experience primarily came from the online space. So he was migrating MySQL instances, I think, at Shopify in real time. So production, real time data migration for moving shards of data around from 1 MySQL to another.
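For readers curious what the checksum idea looks like in miniature, here is a toy sketch: hash ranges of primary keys on both sides and only drill into ranges whose checksums differ. The real data-diff tool pushes the hashing down into SQL on each database; the dictionaries here just stand in for a source and a destination table.

```python
# Toy range-checksum comparison in the spirit of the data-diff discussion.
import hashlib

def range_checksum(rows: dict, lo: int, hi: int) -> str:
    h = hashlib.sha256()
    for key in range(lo, hi):
        if key in rows:
            h.update(f"{key}:{rows[key]}".encode())
    return h.hexdigest()

def find_diffs(source: dict, dest: dict, lo: int, hi: int, min_range: int = 4) -> list:
    if range_checksum(source, lo, hi) == range_checksum(dest, lo, hi):
        return []  # whole range matches, so skip row-by-row comparison entirely
    if hi - lo <= min_range:
        return [k for k in range(lo, hi) if source.get(k) != dest.get(k)]
    mid = (lo + hi) // 2
    return (find_diffs(source, dest, lo, mid, min_range)
            + find_diffs(source, dest, mid, hi, min_range))

source = {i: f"row-{i}" for i in range(100)}
dest = dict(source)
dest[42] = "row-42-corrupted"   # simulate a bad copy
del dest[77]                    # simulate a dropped row
print(find_diffs(source, dest, 0, 100))  # -> [42, 77]
```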
My experience obviously is data warehousing. Like, let's make sure the data warehouse matches production, but it's the same problem. And so his tool, I think, is just fantastic. I'm really excited about it. To your point of sort of
[00:34:57] Unknown:
cost being an issue in figuring out how often and when to execute these data validation checks when you're running in a cloud data warehouse ecosystem, I'm curious if you have heard any rumblings of these data warehouse providers starting to acknowledge this as a necessity and, you know, thinking about how do we actually bake in these capabilities as part of our platform so that it doesn't turn into a, you know, tens or hundreds of thousands of dollars problem just to make sure that you're actually delivering what you think you're delivering?
[00:35:26] Unknown:
I have not heard of any rumblings from the cloud providers. You know, during the conversation about the checksumming stuff, we were both dreaming that eventually vendors would all provide checksumming APIs so that, you know, for a given table or set of rows, you could just get the checksum of the data and then compare that, you know, using some spec across other systems. I've heard 0 about that from the actual cloud vendors. Like, from a revenue perspective, they want you to query more, but then obviously from, like, a product satisfaction, you know, net promoter perspective, giving your users the ability to verify their data is good, I think, trumps that. But, no, I haven't heard much about that. I'll be honest. I haven't talked with, like, BigQuery product managers in probably a year or so. So I'm out of the loop when it comes to specifically the GCP stuff. I'm not a Snowflake user, so my experience there is, like, basically what's on Twitter
[00:36:24] Unknown:
or what I hear from people I talk to. So, nope, short answer is I haven't heard much there. Alright. Well, for anybody listening who happens to either work at 1 of these companies or know somebody who works at these companies, definitely an issue to raise to see if that's something that we can factor in as part of the kind of standard operating procedure and, you know, just part of doing business. I'm gonna blast that out into the Twittersphere and see what comes back. And as far as the kind of alignment of sort of goals and objectives between application development and data engineering or data platform teams, or whatever formulation they're taking in your organization or, you know, in this current moment in time at whatever time this happens to be, what do you see as some of the current points of friction between those 2 teams or some of the ways to more easily kind of align their objectives so that there is a sort of smoother handoff between application developers generating this data in the 1st place and providing a useful interface for data management, you know, data professionals to be able to actually consume from those applications and continue that kind of DevOps principle of bringing the entire business into alignment for these given objectives and figuring out how do we actually manage that mapping between these different problem domains?
[00:37:42] Unknown:
2 things come to mind. So the first thing that comes to mind is just a lack of awareness that's creating friction. And the second thing I think is a lack of process, especially around architecture and proper data modeling. I'll dig into what I mean by that in a second. So on the first thing, just lack of awareness. The thing that we discovered at WePay when we started instituting these CICD checks where we would basically say, you can't commit if you are breaking compatibility on your event or database schemas.
That's a very draconian statement. Like, preventing people from committing when they're dropping a required field in their MySQL table is heavy handed. I'm first to admit that. But what we found is when we rolled that out, most of the engineers, you know, first, they were like, wait. Why can't we commit this? And then we would explain to them like, hey. You know, the reason we're preventing you from committing this is because it breaks our data pipeline and, you know, this data makes it into the data warehouse. This table is being queried by data science and sales, and it's making its way into, you know, Zendesk or Salesforce or what have you. And they were like, oh, wow.
Like, that totally makes sense. And they would instantly get it because, again, like, they're coming from the microservice world. They know when another team drops, you know, or changes their API, it breaks their microservice API call. Like, they're unhappy about it. So they understand compatibility actually pretty deeply and have felt that pain sort of from the consumer side, but they weren't thinking a lot about who was consuming their internal data. And so just educating them that, like, hey, the data you think of as internal is used all over the place. You know, at least 80% of the time, they come back to me, and it was not adversarial. They're like, oh, yeah. Okay. Let's figure out how we can make this work. Let's go, you know, work with the other teams, whatever it is, to figure out what needs to get done. So part of it is just educational and awareness, and simply putting a check in place took care of a lot of that for us. The second part was, you know, once they were educated, like, oh, man, we need to help the downstream people, or we need to make this incompatible change. You know, how do we do that?
We didn't have processes and sort of advisory teams or trained people in place to help them navigate that space. And, again, this is something on the microservice side that we went through. You know, when I was at LinkedIn, we were initially a monolith, and as we started breaking it up into a bunch of different repositories and stuff, we had to kind of grow this microservice, RESTful, you know, skill set. How do you do REST modeling? How do you provide the proper APIs and data models for teams to call you? And that involved, initially, a centralized sort of, quote, unquote, council, which, you know, people kinda bristle at. And it's not something I would advise for everybody, but if you're a huge company, maybe it makes some sense. We evolved sort of local experts that could help advise on, like, how do you do the evolution of the schema, what should the schema look like, and just to have the general knowledge of, like, hey, you're creating a payments model. Like, we have a payments model. Use the 1 that we have. Here it is. That kind of stuff. There's a bunch of sort of, you know, process and team sort of expertise that has to get developed and exposed to the broader engineering organization so that once they know there is a problem, they can get help navigating it. Those are kind of the 2 things that we saw.
So, you know, how that manifested at WePay, in particular, is, you know, I talked about all the CICD stuff. We also had the data platform engineering team that you mentioned sort of work and come up with a centralized schema repository. It was protobufs, and it had sort of standardized, you know, payments and, you know, bank account and credit card data models and address data models, and we borrowed some data models from Google. And then we had, you know, these analytics engineer type people that were going to sort of liaise between that repository and the rest of the organization. And, again, it's, like, very 1 to 1 with how I saw things evolve with REST.
[00:42:02] Unknown:
As far as the kind of pain of these kind of schema contracts, data contracts between these different junction points, as the scale of usage grows, as the degree of reliance increases on these different sort of terminal data assets, what are some of the junction points that you see as being the most critical to ensure smooth handoffs? Where, you know, if the application development team changes their database schema in a way that's not compatible and it causes the, you know, Debezium replication to fail, you know, maybe there's a way to buffer that from the downstream data assets. Or, you know, maybe there are super nodes from a, you know, graph analytics perspective of the overall DAGs for where these data processing steps happen. You know, maybe there's a kind of core table in your dbt workflow that everything flows out from, so you wanna make sure that that's the spot you monitor. Like, what are some of the ways that you have approached identifying some of these kind of critical juncture points to ensure that there is as much visibility and fault tolerance as possible to buffer some of the downstream consumers and users of these assets from failure?
[00:43:20] Unknown:
Yeah, I think you hit the nail on the head there. So in terms of junction points or interaction points, I think the most critical 1 is between the application team, essentially the team that is producing the data, and everybody else. So it's usually the application development team. And so, you know, as kinda you and I have said, I think there's room here for separating between their internal schemas and whatever the external schema is that they're exposing to everybody else. So that external schema is really, really important because it breaks everybody downstream of them. Secondarily, 1 of the things that we kind of grew at WePay, and I see this in the data mesh world as well, is a second tier of data model. So oftentimes, the data model the application development team is exposing is relatively, you know, fine grained and sort of close to their internals. It's very detailed, I guess, is the way I would put it. And the organization writ large may not need that level of detail or may need that data augmented with a bunch of other stuff from other teams and so on. So we grew this second tier of data model, which looks more like a data mart in the data warehouse world. We called it the canonical data representation, the CDR. And so our payments team would expose their payments data, and then we had our analytics engineering team sort of define an actual payment data model that took data from the payments team's data model and also from, you know, the reporting team and from, you know, the banking team or whatever it was, and kinda stitched it together into a data model that was really usable for the organization writ large. It was sort of this 2 tier data modeling approach. And so that second tier, I think, is a second critical point.
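A tiny sketch of what that second-tier, canonical model might look like: stitch the application teams' exposed models into one table the rest of the company queries. DuckDB and the table/column names are stand-ins chosen purely for illustration, not WePay's actual CDR.

```python
# Toy "canonical data representation": join two source-team models into one
# organization-facing payment model. Table and column names are invented.
import duckdb

con = duckdb.connect()
con.execute("CREATE TABLE payments_team_payments (payment_id INT, amount_cents BIGINT, merchant_id INT)")
con.execute("CREATE TABLE banking_team_settlements (payment_id INT, settled_at DATE, bank_ref VARCHAR)")
con.execute("INSERT INTO payments_team_payments VALUES (1, 1000, 7), (2, 2500, 9)")
con.execute("INSERT INTO banking_team_settlements VALUES (1, DATE '2022-08-01', 'ACH-123')")

# The second-tier model the rest of the organization actually queries.
con.execute("""
    CREATE VIEW canonical_payment AS
    SELECT p.payment_id, p.merchant_id, p.amount_cents, s.settled_at, s.bank_ref
    FROM payments_team_payments p
    LEFT JOIN banking_team_settlements s USING (payment_id)
""")
print(con.execute("SELECT * FROM canonical_payment ORDER BY payment_id").fetchall())
```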
And if you look at the data mesh world, I think Zhamak calls these, like, quanta or something like that. But she's very clear about having a hierarchy of data products, and we saw that firsthand at WePay, where we had a hierarchy where we had sort of the initial data products from the engineering teams, and we had sort of the 2nd tier analytics engineer data product that the company writ large would use for the most part. I think your second point, I probably should have touched on this earlier when you were asking about where you focus the testing, and I kinda said at the front and at the back, sort of data warehouse and then CICD. But there's sort of a different way to slice it, like you said, which is which data is the most important. And, again, I think you hit the nail on the head. It's like you, the company, need to decide, and it's usually pretty obvious which data is the most important. It's like your primary data. So in our case at WePay, it was our transaction data. At LinkedIn, it was, like, our profile data, and then maybe our advertising, you know, page view advertising data or what have you. So stuff that's tied to revenue. And that's where you spend the most amount of effort doing the checks. Right? So you might do all of the above data quality checks, metrics monitoring, CICD, schema validation stuff, and you might really lock down commits in that portion of the repository around the schema so that when somebody evolves, say, the payment data model, there's a lot of eyes on that. And then before it gets committed, there's a bunch of CI and CD, you know, schema compatibility checks and, you know, dbt checks, the works. Like, everything runs. For stuff that's less important, you know, some of the tracking data, for example, that we had was, like, relatively low value. It was useful, but if we got 99.9% of it, that was plenty. You know, having 0.1% data loss wasn't gonna kill us or anything. Then, you know, you tune that way down. Maybe you don't do as much. You do the compatibility check, but you're not doing any of the cloud data warehouse checks, or maybe you're only doing them once a week or once a month. That was another knob that we tuned at WePay: how frequently we would do the checks on the data warehouse. You know, if you do them half as often, suddenly you're spending half as much money. I don't think that part is really hard. Usually, some knowledge of the business domain will tell you, like, what are the most critical pieces of data. I think there are some surprises. So especially in, like, the data science world, they'll pick up on some data that you just had no idea about, and then suddenly it's really critical to the product they're building. And so I think monitoring on usage is really important, and that's something that we did do. So just, you know, who's querying which tables and how often, and are there any imbalances between data table usage or topic usage, and, you know, monitoring health. We didn't get that sophisticated at WePay, but 1 could imagine assigning some monitoring health score to each table and topic or schema.
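Purely as an illustration of the health-score idea Chris floats (and not how Schemata actually scores models), one could imagine something like weighing how heavily a table is used against how much checking and ownership it has:

```python
# Hypothetical "table health" score: heavily used but lightly checked tables
# sort to the top of the fix-it list. Inputs and weights are invented.
from dataclasses import dataclass

@dataclass
class TableStats:
    name: str
    queries_per_day: int       # e.g. from warehouse audit logs
    downstream_consumers: int  # teams or models reading the table
    quality_checks: int        # data quality rules attached to it
    has_owner: bool

def health_score(t: TableStats) -> float:
    importance = t.queries_per_day + 10 * t.downstream_consumers
    coverage = t.quality_checks + (5 if t.has_owner else 0)
    return coverage / (1 + importance)  # lower score = more urgent to improve

tables = [
    TableStats("payments", 5000, 8, 2, True),
    TableStats("page_views", 200, 1, 0, False),
]
for t in sorted(tables, key=health_score):
    print(f"{t.name}: {health_score(t):.4f}")
```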
So Ananth, the guy that does the Data Engineering Weekly newsletter, he's got this project called Schemata, which is really interesting. 1 of the things that it has in it is it's trying to assign a score to your schema sort of based on its health. And the health that it's kinda deriving there comes from looking, sort of from a graph perspective, at how interconnected it is with other stuff, other models and entities. And I'm gonna do a poor job of describing it, so I'm not gonna attempt it, but it's definitely worth a look. I think stuff like that is really interesting. So if you can detect that a given data model, table, schema, topic, whatever, is relatively unhealthy but also heavily used, you should probably improve the health, the monitoring, the metrics, whatever it is that needs to get done.
[00:48:46] Unknown:
Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it's no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. That's where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability.
Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $5,000 when you become a customer. In your experience, both working across different organizations and communicating with people who are working in different fields and industries, what are some of the most interesting or innovative or unexpected approaches that you've seen to automation and organizational scalability for data operations and data products?
[00:50:09] Unknown:
A couple of things. One is the thing I just mentioned: the health metrics in Schemata caught me by surprise. It really took me a while to kinda, like, read through it and understand what it's doing. So I thought that was really interesting. I would recommend people take a look at it. I can't say it was what I expected; what I was expecting was more like linting. So I think that was one thing that caught me by surprise. I also think about Convoy. I've talked to Chad Sanderson, who's over at Convoy, a number of times, and they've got this really cool tool that helps manage data models and schemas at Convoy.
And the thing that it does really well, from what I've seen, is it's really more of a cultural tool. So it's trying to help the consumers and producers of data collaborate on the schemas and data models and stuff in a constructive way. And it's, in a sense, kind of doing some of the job of the analytics engineer and data engineer, so that if I'm a data scientist and I need some new data, or I need a schema or whatever, I can kinda, like, ask for it, and then I can work with the upstream teams to, you know, suss out where that data is, how I can get it, stuff like that. I can't remember the name of the tool, but that particular tool from Convoy I thought was really interesting as well. Other than that, what else catches my eye? I mean, the automation with Terraform in the data space, I think, is really interesting. And it's not really novel in the sense that it's a new tool, but I think the application of it to the data space is really long overdue.
I was talking with Sarah Krasnik, who used to be at Perpay and is working on Terraform and data automation stuff right now. I thought that was just dead on, like, right up my alley. That's what we were looking at at WePay; it was a really high-dividend kind of thing. And so managing, you know, data pipelines, data access, even Airflow DAG deployment through that world was something I was really excited about. So I guess those are the three things that jump to mind.
[00:52:19] Unknown:
In your own experience of working in this space and building and growing teams and communicating with people who are in similar situations, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:52:31] Unknown:
It's sort of this age-old engineering story where the hardest part tends to be the humans and the culture. I'm not trying to downplay the difficulty of building a scalable system, but I think at the point that we're at in the data ecosystem's life cycle, there's a lot of scalable systems, and sharding and partitioning and quorum-based and, you know, leader-follower designs. Like, it's a fairly well understood, you know, 20 to 50 year old problem. But when it comes to navigating the cultural elements of schema management and data modeling and metrics definitions and stuff like that, I think it's very hard sometimes to convince humans to do things that make it harder for them to ship their software in the immediate sense. Like, I'm an application engineer, and my product manager wants this thing out by Wednesday, and doing this extra work for the data stuff is not gonna make it out by Wednesday.
That is a real challenge. And so there's, like, an element of pragmatism that comes with it, and you need some emotional maturity to work with folks. I won't say that's necessarily surprising. Like I said, it's sort of the age-old story of engineering: the human stuff is the hard stuff, not the technical stuff. But I wouldn't say it's obvious either; it's sort of a non-obvious thing.
[00:53:54] Unknown:
And for people who are building out their own data systems and data platforms or working in this ecosystem, what are the cases where automation is too big of a foot gun and you actually want to keep doing things manually to figure out how it all works?
[00:54:08] Unknown:
Yeah. I think you just answered that, so that was gonna be what I was gonna say. You don't wanna automate too early. You know, if you don't know what the right flow or process is, don't automate. Do it manually until you figure out what the right flow is, and then automate it. I think that's answer number one. Especially engineers, you know, they wanna write their Python scripts and automate everything immediately. But I think doing things manually the first few times and sort of working through, like, who do I talk to? Oh, I need to talk to security. Okay, well, what tool are they using? Okay, they're using this tool to manage their security approvals, and yada yada. That helps mitigate some of the technical debt you accrue if you just go whole hog into automation from the get go and then discover, like, it's the wrong automation, or you're missing stuff, or there's tools you didn't know about, or what have you. That's, I think, thing number one. The second thing is just that, you know, not everything can be automated. So, specifically, some of the stuff I was talking about around data modeling and what the right way is to define, you know, a given data model. That's more of, like, an architectural, design-pattern-y kind of question. And, you know, linting can help, and there's tools that can help and sort of, like, look for other things that have similar names, some of the stuff Schemata is doing, where it's trying to figure out: you have a payment here, but should it be a payment ID? And there's this other payment data model; are these two things interlinked, and should you be referencing payment instead? Some of this tooling can do some of that. But I think there's sort of this meta level of looking at data models and figuring out, you know, how they should be factored. Like, humans just have to do that. Right? There's no tool that's gonna tell you how to factor your data models and whether one of them should be embedded in the other or be a separate data model and so on. So I think when it comes to defining schemas and data models, automation can help, but I think there still needs to be a human in the loop for a lot of that stuff. So those are kind of my two areas where I think automation is not the end-all, be-all.
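A toy version of the naming lint described here might look like the sketch below. The known entities, the `_id` convention, and the schema being linted are all made up for illustration; a real tool such as Schemata does much more than this.

```python
# Toy lint: flag fields that look like they embed another known entity
# and suggest referencing that entity's ID instead. Illustrative names only.
KNOWN_ENTITIES = {"payment", "user", "merchant"}

def lint_schema(schema_name: str, fields: list[str]) -> list[str]:
    warnings = []
    for field in fields:
        base = field.removesuffix("_id")
        if base in KNOWN_ENTITIES and field != f"{base}_id":
            warnings.append(
                f"{schema_name}.{field}: looks like the '{base}' entity; "
                f"consider referencing {base}_id instead."
            )
    return warnings

print(lint_schema("refund", ["refund_id", "payment", "amount"]))
# -> suggests referencing payment_id rather than embedding a 'payment' field
```

Whether that payment really should be flattened into refund or kept as its own model is exactly the judgment call the tooling can't make for you.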
[00:56:13] Unknown:
Yeah. As you continue to work in this space and track the evolution of how automation is being brought in and how DevOps principles are factoring into data engineering and data platforms, what does the future of data engineering look like to you?
[00:56:31] Unknown:
So I think much more federated and automated. You know? Again, it's basically been the whole conversation around this, but I think that's the direction I always wanted to move when I was running the data team at WePay. And it was a direction we were moving in, and I saw it paying a ton of dividends. So I think, you know, the ecosystem is growing more and more tooling in this area that's just gonna make that easier and easier for organizations to roll out. So empowering the upstream and downstream producers and consumers to get the data they need, manage the data they have without involving data engineers.
I think that's kind of the direction we're moving. And I think data engineers are moving away from shoveling data from 1 system into another and into a role where they're just building automation tooling for the organization to use. That's kind of the future that I see on the data engineering side of the world.
[00:57:32] Unknown:
This is probably a topic that's worth a whole other episode, but as data engineers move more into that facilitation role and more of the actual data movement and data management moves up to business users and analytics engineering roles, what are some of the potential pitfalls? What are the elements of education or understanding of the fundamentals of data management, data modeling, and schema evolution that need to either be baked into those tools and systems so they don't have to be exposed, or be actively managed? And how do we translate some of those concepts and lessons to be accessible to users who don't necessarily have that same background?
[00:58:23] Unknown:
I think it's a both-and. In some cases, the tools can handle it. You know, to use that concrete example, we can have the tools that check compatibility: is this schema forwards and backwards compatible? Is this change forwards and backwards compatible? But I think the counterbalance to that is this embedded analytics engineer role that I mentioned. Where, like, okay, the check failed. The application engineer is left sitting there like, well, I can't commit this change, and it's telling me it's an incompatible change because, you know, the field was a string and now it's an int. Like, what do I do? I think that's where the analytics engineer can play a role. And that's sort of, I think, not grunt work, but a tactical thing. I think the second level, which again I think is less automation and more human, is just that the analytics engineers can really help the application engineering teams define, like, what does their public, quote, unquote, data product look like, and help them manage it. You know, so the application engineer is probably gonna be in charge of defining and exposing that public data product, and helping transform the internal data into the external data. So I think it's a both-and. You know, some of it will be tooling, but I think a lot of it really is gonna fall into the, quote, unquote, embedded SRE role, which is really like this analytics engineering role. I don't wanna make it sound like this is, like, pie-in-the-sky, made-up stuff. Like, there are definitely teams that are doing this. So, you know, I mentioned earlier, Dmitriy, my coauthor for my book, his teams at Zymergen actually do have embedded analytics engineers. They're a very data-centric company. They're doing, like, biotech stuff. Hand-wavy.
I'm not super educated on that, but they have analytics engineers that kinda work hand in glove with the rest of the engineering team to do this. It exists in the world. This is not completely made up, and it does work. But I think it's just a very new pattern for this kind of thing. So those are sort of the two ends of the spectrum that I see.
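To make the string-to-int example from a moment ago concrete, here is a minimal sketch of the kind of backward-compatibility check a CI job could run on a schema change. It hand-rolls the comparison over simple field-to-type maps purely for illustration; real pipelines would typically lean on Avro or Protobuf tooling or a schema registry's compatibility API instead.

```python
# Illustrative backward-compatibility check over {field_name: type} maps.
def breaking_changes(old: dict[str, str], new: dict[str, str]) -> list[str]:
    """Report changes to an existing schema that would break old readers."""
    problems = []
    for field, old_type in old.items():
        if field not in new:
            problems.append(f"field '{field}' was removed")
        elif new[field] != old_type:
            problems.append(f"field '{field}' changed type {old_type} -> {new[field]}")
    return problems

old_schema = {"user_id": "string", "amount": "string"}
new_schema = {"user_id": "string", "amount": "int"}

issues = breaking_changes(old_schema, new_schema)
if issues:
    # In CI this would fail the commit and hand the problem to the embedded
    # analytics engineer and application engineer to resolve together.
    raise SystemExit("Incompatible schema change:\n" + "\n".join(issues))
```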
[01:00:20] Unknown:
Are there any other aspects of this subject of scaling of organizational capacity for data management and data evolution and the role that automation plays in building and maintaining that velocity that we didn't discuss yet that you'd like to cover before we close out the show?
[01:00:35] Unknown:
We only touched on it really briefly, but I think the area of security and compliance is, you know, important and becoming more so every day. You know, between GDPR and HIPAA and SOC 2 and SHIELD and CCPA and, you know, on and on and on, there's a million of these things now. So there's a compliance aspect to it. And then there's also just the general moral responsibility you have to your customers to be good stewards of their data. There's a lot of room for automation there too. You know, I mentioned this DLP product. I would love to see a lot more effort spent in that space, in automating security checks and access approvals and stuff like that. One of the things that we were doing at WePay was tagging the source data. So if I have a MySQL table and it has a column and that column has phone numbers in it, the developers had a way of expressing, hey, this table has a column a, and a has phone numbers.
And then downstream, what we would do is run these automated checks that would detect sensitive information, and if it found something that was not already tagged for a given column, it would alert. So, you know, maybe in column a, the one that has phone numbers, maybe it detected there were emails in there too. And so it says, hey, we found emails, and they're not tagged. Maybe that's legitimate, it's okay to have emails in there, in which case the developer needs to go and tag it. Or it was illegitimate, in which case we need to remediate and, like, scrub the data, get rid of the emails, redact access, what have you. So I think that's part of the automation. The second part of the automation is that once you have this metadata tagged and detected, you can start doing more automated access control. So, you know, if I'm a level 1 user, for example, whatever level 1 means. Let's say it means I have access to a certain tier of secure data. When I request access to a given table, you know, if it has emails, I'll be automatically granted access to that table for 24 hours, something like that. And no human needs to be in the loop for that, which, again, getting back to automation and federation, is a big thing. So I think there's a lot to be done there. You know, none of these are, like, novel ideas. If you look at, you know, SSH production access and bastion gateways, all that kind of stuff that operations does. Again, it's taking those ideas and applying them to the data space, and I think it's something that we need to do on the security side of things. So we touched on that a little bit, but that was something that I would definitely wanna highlight as, I think, an opportunity for a ton of automation.
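Here is a hedged sketch of the tag-versus-detection check described above: scan a sample of column values for sensitive patterns and alert when something turns up that was never declared. The regexes, tag names, and sample values are illustrative assumptions, not the actual WePay implementation; the same declared-tag metadata is what would then drive the automated, time-limited access grants.

```python
# Illustrative PII scan: compare what a detector finds in sampled column
# values against the tags the owning team declared for that column.
import re

DETECTORS = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def undeclared_pii(declared_tags: set[str], sample_values: list[str]) -> set[str]:
    """Return sensitive data types detected in the sample but not declared."""
    found = {
        tag for tag, pattern in DETECTORS.items()
        if any(pattern.search(value) for value in sample_values)
    }
    return found - declared_tags

# Column 'a' is declared to hold phone numbers, but emails show up too:
alerts = undeclared_pii({"phone"}, ["+1 415 555 0100", "someone@example.com"])
if alerts:
    print(f"ALERT: untagged sensitive data detected: {alerts}")
```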
[01:03:05] Unknown:
For anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[01:03:21] Unknown:
Yeah. I think it's this thing I kind of alluded to with the Convoy tool, which is that I would really love to see more tooling that helps teams work with each other. Like, engineers have GitHub, and, you know, all the different teams can, like, look at PRs and comment on each other's work and whatnot. I would love to have tooling that allowed organizations to collaborate on data models and schemas. I think that's just a huge gap. And so for me, that would be my big ask.
[01:03:52] Unknown:
Alright. Well, thank you very much for taking the time today to join me and share the work that you've been doing and the experiences that you've had and some of your insights on the opportunities for automation in the data ecosystem. It's definitely a very important and constantly evolving area. So I appreciate all of your help in continuing to push the conversation forward. So thank you again for taking the time, and I hope you enjoy the rest of your day. Yeah. Thank you.
Thank you for listening. Don't forget to check out our other shows, the Data Engineering Podcast, which covers the latest on modern data management, and the Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you have learned something or tried out a project from the show, then tell us about it. Email hosts@pythonpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Guest Introduction
Chris Riccomini's Career Journey
Early Career and Interest in Data
Challenges in Scaling Data Platforms
Opportunities for Automation in Data Pipelines
Microservice Paradigms in Data Management
Adopting DevOps Principles in Data Engineering
Cost and Frequency of Data Quality Checks
Critical Junction Points in Data Pipelines
Innovative Approaches to Data Automation
Lessons Learned in Data Engineering
Future of Data Engineering
Security and Compliance in Data Management
Closing Remarks