Summary
Building and maintaining reliable data assets is the prime directive for data engineers. While it is easy to say, it is endlessly complex to implement, requiring data professionals to be experts in a wide range of disparate topics while designing and implementing complex topologies of information workflows. In order to make this a tractable problem, it is essential that engineers embrace automation at every opportunity. In this episode Chris Riccomini shares his experiences building and scaling data operations at WePay and LinkedIn, as well as the lessons he has learned working with other teams as they automated their own systems.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
- RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
- Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.
- Your host is Tobias Macey and today I’m interviewing Chris Riccomini about building awareness of data usage into CI/CD pipelines for application development
Interview
- Introduction
- How did you get involved in the area of data management?
- What are the pieces of data platforms and processing that have been most difficult to scale in an organizational sense?
- What are the opportunities for automation to alleviate some of the toil that data and analytics engineers get caught up in?
- The application delivery ecosystem has been going through ongoing transformation in the form of CI/CD, infrastructure as code, etc. What are the parallels in the data ecosystem that are still nascent?
- What are the principles that still need to be translated for data practitioners? Which are subject to impedance mismatch and may never make sense to translate?
- As someone with a software engineering background and extensive experience working in data, what are the missing links to make those teams/objectives work together more seamlessly?
- How can tooling and automation help in that endeavor?
- A key factor in the adoption of automation for application delivery is automated tests. What are some of the strategies you find useful for identifying scope and targets for testing/monitoring of data products?
- As data usage and capabilities grow and evolve in an organization, what are the junction points that are in greatest need of well-defined data contracts?
- How can automation aid in enforcing and alerting on those contracts in a continuous fashion?
- What are the most interesting, innovative, or unexpected ways that you have seen automation of data operations used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on automation for data systems?
- When is automation the wrong choice?
- What does the future of data engineering look like?
Contact Info
- Website
- @criccomini on Twitter
- criccomini on GitHub
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
- WePay
- Enterprise Service Bus
- The Missing README
- Hadoop
- Confluent Schema Registry
- Avro
- CDC == Change Data Capture
- Debezium
- Data Mesh
- What the heck is a data mesh? blog post
- SRE == Site Reliability Engineer
- Terraform
- Chef configuration management tool
- Puppet configuration management tool
- Ansible configuration management tool
- BigQuery
- Airflow
- Pulumi
- Monte Carlo
- Bigeye
- Anomalo
- Great Expectations
- Schemata
- Data Engineering Weekly newsletter
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends at Linode. With their new managed database service, you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes with automated backups, 40 gigabit connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show.
Your host is Tobias Macey, and today I'm interviewing Chris Riccomini about building awareness of data usage into CI and CD pipelines for application development and just the overall approach to automation to simplify the work of data professionals. So, Chris, can you start by introducing yourself? My name is Chris.
[00:01:08] Unknown:
I have been working in the data space for about 15 years now. I started after a brief stint at PayPal where I was doing data science and data visualization. I joined LinkedIn and spent 6 and a half years there. I started in the data science world. Within about a year, switched over to the engineering side and spent a bunch of time doing stream processing at LinkedIn, specifically working on Apache Samza, which is the stream processor that came out of LinkedIn. And then I joined WePay, which is a payments company that was acquired by JPMorgan Chase a couple years ago. And, again, ran their data infrastructure team where we set up a cloud data warehouse with BigQuery, real time CDC with Debezium, bunch of Kafka connectors to move data around and sort of realized the vision of, like, a data integration, for lack of a better term, enterprise service bus on top of Kafka.
And then since last November, I've just been hanging out and sort of relaxing. So that's me in a quick nutshell. I've also written a book with a friend of mine over the past couple of years, Dmitriy, who's the VP of engineering at Zymergen. And the book is The Missing README, which has little to do with data and is more about just getting new college grad software engineers up and running. So that's kind of the projects I've been working on.
[00:02:28] Unknown:
And you mentioned a little bit about some of your earlier career work, and I'm wondering if you can just share sort of how you first got involved in the area of working with data and what it is about the space that keeps you interested and motivated?
[00:02:40] Unknown:
I wish I could answer that question. It just sort of seemed like a natural affinity. For my first internship at PayPal, I joined this team that was called the advanced concepts team, and it was sort of this, like, skunkworks lab team, a couple of people who were doing exploration and sort of new technology. And so I spent a bunch of time doing data visualization at PayPal. And what that means is they had a data warehouse, Teradata, at the time. And it was, like, pulling down a bunch of their data and just visualizing it to try and understand it. We were coming at it from a fraud context. And so, you know, you would pull down billions of transactions, and it was a really fascinating way to kind of get my hands dirty just understanding the power of data and, like, exploring it. And there was really no, you know, specific end game in mind other than let's explore and understand and look for, you know, fraud trends.
And from there, while I was at PayPal, I started getting interested in graphs, in graph traversal, because 1 of the projects I worked on was visualizing transaction graphs. So, you know, I am a user. I have a credit card. Some other account also happens to add that same credit card. Maybe that credit card is stolen. So if you can imagine the visualization, there's 3 nodes in this graph. There's me, the other account, and we're connected via this credit card, right, which is a 3rd node. And through my exploration of graph databases and stuff, I kinda stumbled over Hadoop. I got really interested in Hadoop and kind of wanted to learn that technology. And that's initially what drew me over to LinkedIn is they were just building out their Hadoop ecosystem. And so a friend of mine and a mentor moved over from PayPal to LinkedIn and kind of, you know, caught my interest.
And then I transferred to LinkedIn. And it was sort of the same story where I spent the 1st year doing data science and exploring stuff and then quickly realized, like, a lot of the valuable work to be done was in, the term wasn't data engineering back then, but in the data engineering space and, like, getting the, you know, features and the data into Hadoop, you know, to train the model, being able to scale the model. You know, at the time, 1 of the main things I was working on at LinkedIn was something called People You May Know, which, again, is a graph algorithm mostly. And they were running it on Oracle, and it was taking, like, 6 weeks to complete a single, you know, training run. And so it's, like, okay. Great. We have this wonderful model, but it takes us 6 weeks to refresh it and it's super brittle. Can we improve it? Eventually, we got it down to, like, sub 24 hours, and it was just a completely game changing thing. And so that drew me, I think, from sort of this data science world into the engineering world and, like, realizing the power, especially at that point in time, was heavily tilted towards, you know, investing in engineering.
[00:05:30] Unknown:
And in terms of your experience of working in this ecosystem and helping to build out some of the infrastructure and processes and sort of organizational capacity for being able to take advantage of data and actually power some of these data science use cases. What are some of the pieces of data platforms and processing that have been most difficult to scale, not necessarily in the technical sense, but in the organizational sense, and being able to sort of build up and maintain velocity of being able to actually use them and iterate on the data products that you're trying to
[00:06:04] Unknown:
create? My answer is probably gonna be a recurring theme in this conversation. I don't know whether it's delivered or not, but I think the biggest organizational challenge, it's been recurring over the years. It's, like, been a fairly constant thing. I won't make the claim that it's solved. It has been managing sort of the contracts of data schemas, especially at the seams between teams. So if I'm a team and I have a data model for some event and some other team is using that data, managing an agreement where I am not going to break the other team when I mutate my schema is definitely a big challenge. But like I said, I don't think it's a solved problem, but that's something that we had issues with at LinkedIn. It's something that we had issues with at WePay, and there are tools out there. So, for example, the Confluent Schema Registry has the ability to enforce backwards and forwards compatibility of schemas. So as you're evolving a schema, it will prevent incompatible changes from getting into your data pipeline, i.e., Kafka.
And that's actually something that came out of LinkedIn. We had 1 of those at LinkedIn back in the day as well. But it's complicated because I think a large part of the problem is not technical. It's like cultural and social. It's like helping the engineers understand what are the rules they must abide by and why. Like, why is it a bad thing for me to drop a column that was required? Like, I don't need that column anymore. It is required, but I no longer have the data for whatever reason. I wanna get rid of it. Why can't I do that? And then, like, having the team and the engineers understand, well, you know, that column is used by 8 other teams.
It's powering a machine learning model. It's indexed in our search index. That's a challenge, and I think something that is work that remains to be done. So I think that's my number 1 answer is schema. I think a second thing that was not as challenging, but definitely something that was in the air, is coming up with clearly defined ownership, especially around operations. So especially with data, it's a lot of, like, frameworks and platforms and job schedulers and, you know, all that kind of stuff. And so when things break, like, figuring out an operational model that works between the teams that are using the frameworks and the platforms and the teams that are running the frameworks and the platforms, it needs some thought. So, you know, if I'm responsible for Hadoop and you're running your job on Hadoop and your job doesn't work, like, triaging that, figuring out who needs help, when they need help, how to get help is a challenge. And it can definitely lead to burnout on the infrastructure teams if it's not thought through well. So I think that would be a number 2 answer that I would give is sort of figuring out operational responsibility of the systems that are being run. Yeah. And I think that
[00:09:01] Unknown:
the operational aspect and sort of figuring out who owns what, where do the responsibilities lie as the data goes across the different stages of its life cycle is definitely always an open question and 1 that I don't think ever stays settled.
[00:09:16] Unknown:
Well, the answer is easy. The answer is always, I don't own it, so it's always not me.
[00:09:22] Unknown:
Yeah. If only. And so in terms of these 2 elements that you highlighted of the schema evolution and the contracts of schema as it traverses these different systems and the stages of the platform, and then the ownership of that data and who is responsible for maintaining that schema and ensuring that it stays correct across those different stages and across those transition boundaries. What do you see as the opportunities for automation to alleviate some of the toil that's associated with this work and making sure that all of your pipelines stay, you know, healthy and running and don't break because somebody forgot to update the schema record or somebody forgot to notify somebody downstream that, oh, I'm going to be changing this on such and such date, or even actually planning the fact that they're going to change that in the first place.
[00:10:13] Unknown:
Yeah. This is the question I'm really excited about. So I think there's a lot of opportunity here. Now I mentioned earlier the backwards and forwards and, quote, unquote, full compatibility that something like the Confluent Schema Registry can give you. Now the problem with that approach, at least as it's shipped out of the box (caveat: at least the last time I looked, which is a while ago), is that it was at runtime. So what that means is you don't discover that your schema evolution is bad until you actually try to send the message and the schema registry Kafka encoder fails, you know, and you get an error in your logs, and your application, you know, essentially stops working.
So 1 of the things we did at WePay to kind of alleviate that problem. Like, we don't wanna find out in production or in testing or staging when we're sending messages that things have broken. We wanna find out, like you said, in continuous integration tests or, you know, GitHub tests, essentially the stuff that's running pre commit. 1 of the things that we did was we started doing compatibility checks pre commit. So we would take your schema beforehand and then take your schema afterwards, and we would compare them and, like, look and try and understand was it a compatible change according to the rules that we'd set forth.
And we initially did that for the events pipeline that we had, which was essentially you're sending messages to Kafka from some publisher. Right? And for that, we were using Avro as our schema, our DDL, or whatever you wanna call it. And that worked quite well, but it was limited to the event publishing. Then there were our primary OLTP databases. So these were MySQL DBs with, like, you know, transaction data, user data, all the kind of stuff that you can imagine from a payments company. And we would funnel that data into Kafka. And then from there, we would stream it into BigQuery. And we were discovering we were having the same problems in the CDC pipe that we had had in the event publishing pipe, which is some application developer would decide to mutate their DB schema, which is a totally sane thing to do. And, in fact, application developers are conditioned to think that their DB for their microservice is encapsulated, and it is, like, private, and they're able to do what they want with it. And so, you know, they would drop some columns, and that would cause the CDC pipeline to be unable to publish into Kafka because the compatibility wasn't there anymore because they dropped the required field or what have you. And so we actually extended the CICD pipe to check not only the event schemas, but also the DB schema evolution changes.
And so phase 1 of this answer is I think there's a lot of room for automation in just checking schema changes before they make their way into either a production or preproduction environment. I mean, these can be done at, you know, essentially compile time, at commit time. In the case of the DB checks, what we were doing was essentially spin up a little MySQL instance in Docker, run the migrations up to the latest change, you know, sort of snapshot the current DB schema, run the new migration, snapshot the DB schema again, and then compare: is this safe or not? There's all kinds of interesting edge cases you have to think about. Like, is an integer going from a bigint to a smallint? Is that considered compatible? So it's an interesting problem.
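As a rough illustration of the kind of pre-commit check described above, the sketch below compares a before/after snapshot of a table's schema and flags destructive changes. The column names, the required/nullable convention, and the integer-narrowing rule are illustrative assumptions, not the actual WePay tooling.

```python
# Illustrative pre-commit schema compatibility check (a sketch, not WePay's tool).
# Schemas are modeled as {column_name: (sql_type, is_nullable)} snapshots taken
# before and after applying a migration to a throwaway MySQL instance.

# MySQL integer types from narrowest to widest; narrowing is treated as unsafe.
INT_WIDTHS = ["TINYINT", "SMALLINT", "MEDIUMINT", "INT", "BIGINT"]

def check_compatibility(before: dict, after: dict) -> list:
    """Return human-readable violations; an empty list means the change looks safe."""
    violations = []
    for column, (old_type, old_nullable) in before.items():
        if column not in after:
            if not old_nullable:
                violations.append(f"required column '{column}' was dropped")
            continue
        new_type, new_nullable = after[column]
        if old_type in INT_WIDTHS and new_type in INT_WIDTHS:
            if INT_WIDTHS.index(new_type) < INT_WIDTHS.index(old_type):
                violations.append(f"'{column}' narrowed from {old_type} to {new_type}")
        elif old_type != new_type:
            violations.append(f"'{column}' changed type from {old_type} to {new_type}")
        if old_nullable and not new_nullable:
            violations.append(f"'{column}' became required; existing rows may violate it")
    return violations

if __name__ == "__main__":
    before = {"id": ("BIGINT", False), "amount": ("BIGINT", False), "memo": ("VARCHAR(255)", True)}
    after = {"id": ("BIGINT", False), "amount": ("SMALLINT", False)}
    for problem in check_compatibility(before, after):
        print("INCOMPATIBLE:", problem)  # a CI job would fail if this list is non-empty
```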
We got a lot of mileage out of running that, and I think that's sort of phase 1. I think phase 2, where we never really got to, but I think is something that the industry will be spending time on over the next few years, is providing automation and tooling around what to do when the application developer does want to make an incompatible change. Like, it is legitimate to make incompatible changes. Sometimes the data goes away. Sometimes the data needs to be changed in a certain way. Right? And so providing them the tooling to safely make the changes was something we never got to, at least not as long as I was there, but that was in my mind and on the roadmap. So what I mean by that is let's imagine that you want to drop a required column, for example, or you want to change a string to an integer for a given column.
In the case of something like streaming, 1 could imagine allowing the application developer to write a little stream processor that would take the new change and sort of munge it into the old data in cases where that is possible. So, for example, string to int is probably something where you could write a stream processor and convert the int back to a string to make it compatible with the old schema. Right? Dropping a required field, well, maybe you need to call an external microservice or get the data somewhere else. If the data is fully gone, though, your change is actually truly incompatible and irreversible. You cannot recover from it. I think the second part of this automation story and tooling story is we need to provide good ways for them to do a, quote, unquote, major version change on their database schema. I think fortunately, there's a great set of patterns and practices in the microservice world because all the stuff I'm talking about right now is essentially just the same thing as microservice API compatibility.
You know, it's semantic versioning and major, minor, you know, micro or patch and, you know, having API gateways and all that kind of stuff, but applied to the data space. So I think that's sort of my long winded pitch on automation as a helpful thing for this space.
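One way to picture the "little stream processor" idea from above is a shim that consumes events published with a new schema and republishes them in the old shape, so existing consumers keep working during a major-version migration. This is a hedged sketch: the topic names, field names, and the confluent-kafka client usage are assumptions for illustration, not anything described in the episode.

```python
# Hypothetical compatibility shim: read v2 events, downgrade them to the v1 shape,
# and republish so old consumers keep working. Topic and field names are made up.
import json
from confluent_kafka import Consumer, Producer

def downgrade(v2_event: dict) -> dict:
    """Map a v2 payment event back onto the v1 schema."""
    v1_event = dict(v2_event)
    # v2 changed amount_cents from string to int; v1 consumers expect a string.
    v1_event["amount_cents"] = str(v2_event["amount_cents"])
    # v2 dropped a formerly required field; backfill a sentinel that v1 allows.
    v1_event.setdefault("legacy_reference", "UNKNOWN")
    return v1_event

consumer = Consumer({"bootstrap.servers": "localhost:9092",
                     "group.id": "payments-v1-shim",
                     "auto.offset.reset": "earliest"})
producer = Producer({"bootstrap.servers": "localhost:9092"})
consumer.subscribe(["payments.v2"])

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    event = json.loads(msg.value())
    producer.produce("payments.v1", json.dumps(downgrade(event)).encode("utf-8"))
    producer.poll(0)  # serve delivery callbacks
```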
[00:16:01] Unknown:
To your point as well about the sort of API versioning and compatibility management across service boundaries, I'm wondering what your thoughts are on the opportunity for being able to extend some of the application paradigms into treating those downstream data consumers as 1 of those service boundaries and figuring out how do we make those contracts more of a kind of natural extension of the development life cycle. Whereas right now, it's, you know, there's a database there somewhere. It's the data engineer's job to find out the fact that it exists, how to pull the data from it, how to reverse engineer meaning from this, you know, assortment of tables that doesn't necessarily have any useful context unless you're looking at the code that's actually creating and using them, how to think about extending the ORM to populate some of that schema information into things like Debezium or the Confluent Schema Registry for the case where you are consuming directly from the database, or just how to more clearly define that service boundary for those downstream consumers?
[00:17:11] Unknown:
My intuition is that I want to borrow heavily from the microservice world. And so in my experience, the way that kind of evolves is, you know, you have a bunch of microservices and eventually, you know, a given team, you know, might have a collection of them. And eventually, you know, if you grow big enough, you start providing some kind of internal, like, API gateway to the rest of the organization. And then you have a service mesh that's, you know, tracking who's calling what, and you start getting a bunch of tooling built around it to track the calls across all of these inter team exchange points.
And so I think we should do something very similar in the data space. You know, specifically, I would like an application team to define a data product, and this is sort of borrowing from the data mesh terminology a little bit. And the data product is a schema that they are publishing for their data that is meant to be consumed by other teams. I would like that data product to be semantically versioned so that we can track compatibility. And then I would like for the data engineering team to provide the application team with a set of tools that allows them to transform their internal data into this data product and evolve and manage the data product properly. So this is like data catalogs, data quality checkers, schema compatibility checkers, you know, migration tools when you're doing a major migration bump to track, you know, which consumers are consuming your data. You know, using Kafka, for example, all that data can be sucked out of the offset topics. You can see who's reading what based off the commits that they're doing.
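The "see who's reading what" idea can be sketched by inspecting committed consumer-group offsets. The snippet below uses kafka-python's admin client and a made-up topic name; it is a rough illustration of the approach rather than a complete lineage or contract-enforcement tool.

```python
# Rough sketch: list which consumer groups have committed offsets for a data
# product's topic. Assumes kafka-python; the topic name is hypothetical.
from collections import defaultdict
from kafka import KafkaAdminClient

DATA_PRODUCT_TOPIC = "payments.events.v1"  # hypothetical data product topic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
consumers_by_topic = defaultdict(set)

for group_id, _protocol in admin.list_consumer_groups():
    for topic_partition in admin.list_consumer_group_offsets(group_id):
        consumers_by_topic[topic_partition.topic].add(group_id)

print(f"Groups consuming {DATA_PRODUCT_TOPIC}:")
for group in sorted(consumers_by_topic.get(DATA_PRODUCT_TOPIC, [])):
    print(" -", group)
```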
So I think we should borrow heavily from the microservice paradigm. And in a sense, that's what I see when I look at data mesh. You know, I think data mesh has a lot of words. It's very complicated to read those blog posts. That's not a knock against Zhamak. I think she's trying to just explain something that's complicated. But I think if we just sort of think about the data space as, like, hey, it has a lot of the same problems architecturally as microservices do. And then culturally, you know, it has a lot of the same problems that, like, operations had, and we grew the microservice and DevOps culture for those problems. Can we borrow a lot of those same philosophies and ideas for the data space? I believe we can. And in fact, I wrote a blog post trying to describe data mesh that is essentially that. Like, let's look at microservices. Let's look at DevOps and sort of apply them to the data space.
To get back to your question, I think the data engineering team should sort of shift, to your point, from, hey, their job is to, like, get data where it needs to be, to their job is to provide tooling to the application teams to get data where it needs to be. So a lot more federation and, you know, automation. If you kinda step back and just look at, like, what data teams do, it's kinda crazy. Like, the pitch is we're gonna have this centralized team that is responsible for moving all data in the organization. Like, again, the idea that you would have a microservice team that is responsible for all microservices is just bonkers. Like, it doesn't make any sense. So I think, you know, federation has to happen. It's just not scalable to have 1 team responsible for all the data, especially when the data is critical. I definitely agree with the
[00:20:28] Unknown:
kind of principle of, you know, bringing the data engineers and the application developers closer together, letting the data engineers work as more of a support team in a lot of the same ways as SREs and platform engineers have been evolving to understand what their purpose is in the overall engineering group. So I think that they're going through a kind of parallel journey to what the sort of DevOps ecosystem and, you know, the actual concrete implementations of that have been in the form of SREs, platform engineers. You know, it's evidenced by the fact that data platform engineer is an emerging title where, you know, it's not my job to actually do all the data manipulation. My job is to build the systems that let you do it kind of a thing. 100%. And so, you know, my team at WePay was not called the data engineering team. It was called the data infrastructure team.
[00:21:15] Unknown:
Like, it was platform stuff as opposed to, like, data engineering stuff. And I think 1 thing when I look at the ops world is you've got sort of the platform tooling people that are centralized. And so there's some kind of centralized ops organization. They're building the tooling and the platforms that are being used by the rest of the org. Then you have this other role, which is the embedded SRE. The embedded SRE sits with the team. They understand the products that the team are building and understand when they're shipping what, you know, ideally, and are sort of helping liaise between the individual application team and operations writ large. And the thing I'm excited about is this sort of new role, the analytics engineer, that's showing up now. Because when I look at that pattern of sort of centralized ops and then embedded SRE, I can very easily mentally map the legacy data engineer term onto that data platform centralized role and then the analytics engineer into the kind of, like, the embedded SRE role where the analytics engineer is gonna understand the data models of a given application team and understand how to, you know, build data marts or micro data warehouses or whatever you wanna call them, understand the evolution of the application team's schemas and stuff. And, again, help them liaise between their data and the rest of the organization.
And so I'm really excited about this analytics engineering role that's, you know, getting a lot of traction these days because I think the combination of, like, centralized data engineering or data platform engineers, you said, that's building tooling, automation, federation stuff. And then the analytics engineers are kind of embedded in helping define robust data product schemas, helping explain, like, why it's bad to drop a required field to the application team is actually, like, super valuable. So I think that those 2 things working together is gonna be a big deal. And it just maps so cleanly onto the success we saw with the DevOps and SRE world that I'm really bullish on that.
[00:23:10] Unknown:
It's time to make sense of today's data tooling ecosystem. Go to dataengineeringpodcast.com/rudder to get a guide that will help you build a practical data stack for every phase of your company's journey to data maturity. The guide includes architectures and tactical advice to help you progress through 4 stages: starter, growth, machine learning, and real time. Go to dataengineeringpodcast.com/rudder today to drop the modern data stack and use a practical data engineering framework. Continuing on this topic with the sort of parallels between the DevOps transformation and the kind of data transformation that we're going through now.
1 of the, I think, core components that powered the overall transformation to where we are with DevOps is the adoption and evolution of CICD principles where everything that goes into production has to make its way through these defined pipelines that are visible, that everybody has access to, that everybody can understand where things are in the delivery cycle. And I know that, you know, some of those same utilities are being used for the data ecosystem, and also there are some parallels in the case of data orchestrators that serve as that kind of central visibility. But I'm wondering, what are some of the other concepts of DevOps and some of the practices that are being adopted by app dev teams that are still nascent or yet to be kind of translated into the data ecosystem and some of the opportunities for teams to be able to start experimenting with those ideas?
[00:24:43] Unknown:
A couple of things that I see. So 1 of them is taking ops tools for managing, you know, infrastructure. So this is Terraform essentially. Right? Chef, Puppet, Ansible, Terraform, whatever you wanna call it, and applying them to the infrastructure tools in the data space. The second 1 is operations best practices when it comes to metrics and monitoring and operations and observability. So on the first 1, essentially applying ops, you know, tooling when it comes to deployment and configuration management and stuff. You know, something I saw firsthand at WePay was, you know, we had a robust ops team that was doing a bunch of stuff with Terraform, and then we had our little motley data team of 6 people. And we were tasked with running BigQuery, Airflow, Kafka, a bunch of Kafka connectors and stuff. And some of those things made it into the Terraform world and some of those didn't. And the things that made it into the Terraform world were the things that the SREs were heavily involved with. So Kafka connectors, for example.
Things that didn't make it into the Terraform world were things that SREs were not as directly involved with, and that would, you know, namely be managing the data warehouse. So datasets, access controls, all that kind of stuff. And so I see us moving to a world where we're gonna be applying, you know, Terraform or Terraform-esque stuff, Pulumi or whatever it is, to the data tools that we have. And I think a good place to start with that is the data warehouse and, like, hey, let's manage our data warehouse fully from 1 of these config management tools. So when I create a dataset, when I create an S3 bucket, when I grant access to this, that, or the other thing, let's not have a data engineer do that. Let's have the tool do that, and we can submit a PR to the repo and get it reviewed and committed. And, you know, there's an audit trail and security's happier and stuff. So that's 1 thing that I see us being able to borrow from when it comes to looking over the fence at ops.
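A minimal sketch of the "manage the warehouse from version-controlled config" idea, assuming BigQuery and the google-cloud-bigquery Python client; in practice this is exactly the niche Terraform or Pulumi fill, and the dataset names and group emails below are placeholders.

```python
# Sketch: apply a version-controlled YAML spec of datasets and reader grants to
# BigQuery. Names are placeholders; requires google-cloud-bigquery and PyYAML.
import yaml
from google.cloud import bigquery

SPEC = yaml.safe_load("""
datasets:
  - id: analytics_payments
    location: US
    readers:
      - data-science@example.com
""")

client = bigquery.Client()

for spec in SPEC["datasets"]:
    dataset = bigquery.Dataset(f"{client.project}.{spec['id']}")
    dataset.location = spec["location"]
    dataset = client.create_dataset(dataset, exists_ok=True)  # idempotent "apply"

    entries = list(dataset.access_entries)
    for reader in spec.get("readers", []):
        entry = bigquery.AccessEntry(role="READER",
                                     entity_type="groupByEmail",
                                     entity_id=reader)
        if entry not in entries:
            entries.append(entry)
    dataset.access_entries = entries
    client.update_dataset(dataset, ["access_entries"])  # push the grant changes
```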
I think the second thing, the observability and data quality stuff, I would say is a little farther along in terms of adoption. There's a bunch of tools out now that are pretty great. There's Monte Carlo and there's Bigeye and Anomalo and a bunch of other ones. Oh, and Great Expectations, of course, is the 1 I was forgetting. And these tools approach data quality checks in a bunch of different ways. Some of them are more, I would say, kinda like unit testy, where you are defining the shape of your data. So I expect the cardinality of the country code column to be 255, or, you know, however many countries there are at any given point in time. The second iteration of that is a little more automated, where they will try and derive the rules for you. So, hey, here's my data. Go figure out some good heuristics, and it will come back and say, well, the cardinality for your country code is currently 255, so let's enforce that. Right? And the 3rd iteration is more of an anomaly detection, fancy ML thing where it's not deriving anything. It's just looking at your data over time and noticing, like, hey, the cardinality of this column has been 255 for the last 30 weeks, and now it's 10,000. That's weird. Like, we should alert you, right, and let you know. And so I think that applying that stuff and really getting rigorous about the data in our data pipelines and the data in our data warehouse and making sure everything is healthy, we need to take that seriously, especially as we start, you know, using data for data products that we're exposing to our customers. That's a big problem when you start giving your customers the wrong data.
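To ground the styles of check described above, here is a toy version of the first two (a fixed expectation plus a naive history-based anomaly check) in plain Python. The column name, thresholds, and three-sigma rule are invented for illustration; real deployments would lean on tools like Great Expectations, Monte Carlo, Bigeye, or Anomalo.

```python
# Toy data quality check: a fixed "unit test" expectation on cardinality plus a
# naive anomaly check against recent history. Thresholds are made up.
import statistics
import pandas as pd

def check_country_code(df: pd.DataFrame, history: list) -> list:
    failures = []
    cardinality = df["country_code"].nunique()

    # 1) Fixed expectation: roughly "how many countries should there be".
    if not 150 <= cardinality <= 300:
        failures.append(f"country_code cardinality {cardinality} outside expected range")

    # 2) Naive anomaly detection: compare today's value against recent history.
    if len(history) >= 10:
        mean, stdev = statistics.mean(history), statistics.stdev(history)
        if stdev and abs(cardinality - mean) > 3 * stdev:
            failures.append(f"cardinality {cardinality} deviates from historical mean {mean:.0f}")
    return failures

# Example run: history says roughly 255 distinct countries, today's load has only 2.
history = [255, 254, 255, 256, 255, 255, 254, 255, 255, 256]
df = pd.DataFrame({"country_code": ["US", "DE"] * 5})
print(check_country_code(df, history))
```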
[00:28:31] Unknown:
As far as the testing element in the kind of application development environment, there have been different formulations of the testing pyramid, where there are different quantities of different layers of tests that you want to have in place to ensure that you can deliver with confidence. And I know that that is a sort of capability that's been adopted at various levels of kind of sophistication or commitment in the data ecosystem, particularly with some of the tools that you mentioned: Great Expectations, Monte Carlo, Anomalo.
And I'm wondering what you see as the useful strategies for determining what are the appropriate targets for where to position those tests and how to understand what is the scope of these tests across the sort of different, you know, components of the data ecosystem, you know, how and when to execute them, where, you know, there's the difference between, like, unit tests that are executed as part of your CICD to say, I'm making this change. Is this safe? Okay. It's in production now. And 1 of the interesting challenges of the data aspect is that there isn't usually just 1 sort of, like, make this check now and it's good for all time. It's make this check now and then keep making it for all time because it's going to change. Yeah. Yeah. Yeah. And I think another thing to consider, and this
[00:29:54] Unknown:
was a huge issue for us at WePay: cost. So, to your point, running a data quality check and having to run that over and over again, because just because it passed once doesn't mean it's gonna pass again tomorrow (like, we need to keep checking data over and over again), can get really expensive in these cloud data warehouses, and so you have to be really, really smart about it unless you wanna spend tens or hundreds of thousands of dollars, which we did for a while, and it was actually kind of pleasant to be able to just ignore it. But at some point, finance comes knocking and it's like, why is half our bill spent on data quality checks? In terms of where to place the checks, we kind of took a very bifurcated approach, and we checked stuff upfront right at the beginning.
And then we, on the other end of the spectrum, check stuff right at the end in the data warehouse. So if you imagine our pipeline is, like, OLTP database, Debezium, Kafka, KCBQ, which is the Kafka Connect BigQuery connector, and then BigQuery. So there's, like, 5 or 6 different moving pieces in that pipeline. We would check stuff sort of pre commit in a CICD pipe, and this is, like, schema compatibility checks. Is your data gonna make it in at all, or is there gonna be, you know, compatibility issues? And then in certain cases, it was also kinda like these contrived tests. I think dbt might support this as well now, where you can, you know, create a dataset, load some data into it, run your queries, and verify that the query result is as expected. So we would do some of that as well. Kinda like smoke test kinda stuff. And then on the other end of the spectrum, we would do essentially checks against the data that landed in our data warehouse. And the theory was, you know, if the data was bad anywhere upstream in the pipeline, it would manifest itself as bad in the data warehouse as well. And then we would alert. And what that doesn't give you is, like, when something goes wrong, you need to figure out where in the pipeline it went wrong. Was it, like, KCBQ that had an issue? Did it drop some messages? Was Debezium having a problem? And that was something that we spent more time on than I would like. And, you know, realistically, we wanna build better sort of binary search tooling to kinda whittle down where in the pipe things broke when they went wrong. But the approach we took, as I said, was bifurcated into CICD, smoke and unit test stuff. It was more like, is the query logic correct?
Is the schema gonna work properly? And then much more robust stuff in the data warehouse, which is like the anomaly detection stuff that I mentioned. The other thing we did was DLP. So Google Cloud has a DLP product, which stands for, I think, data leak or data loss protection, which is a terrible name. But what it really did in our case was it would detect sensitive data. So PII, emails, usernames, passwords, that kind of stuff, and it would alert you if a given table had PII in it. And then we had separately a metadata set of, like, which tables had PII. And so we would do those kind of checks, data quality checks, security checks, and stuff in the data warehouse, and it would alert us. There was something we didn't do that I think we had as a Jira that would get bumped from 1 quarter to the next, which was checksumming. And so this is something that I think the napkin problems blog has written about. It's a fantastic post where, essentially, the idea is that to do the data quality check, you construct something like a Merkle tree of all your rows in your source and your destination, and you compare those checksums.
And if they don't match, then you can kind of traverse the Merkle tree and figure out which rows are incorrect or different. You can then, you know, figure out why they're different and sort of solve the problem. This is actually the way that, like, Cassandra's anti-entropy repair works. And so that was something that we were considering because it would be much cheaper than doing, like, select star in source, select star in destination, and, like, row by row in Python, comparing the rows and columns. So that was something that we were considering but never built out. But there's a really good post on napkin problems. The author, his name is escaping me now, actually has an open source library that he's built that does that. And so for those that are interested, I would highly recommend checking out that blog post and also his open source library. And, again, data-diff is the name of the tool. I actually just did an interview with the author of the library and Gleb Mezhanskiy from Datafold, who helped with supporting that development. Yeah. He's great. I had a chat with him a few weeks back. Very knowledgeable on the subject. We were just vibing because it was like, yeah, this seems easy. And then, you know, you start digging into it. It's such a robust and complicated problem. I think the fascinating thing for me talking to him was that his experience primarily came from the online space. So he was migrating MySQL instances, I think, at Shopify in real time. So production, real time data migration for moving shards of data around from 1 MySQL to another.
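For readers curious what the checksum idea looks like in miniature, here is a toy sketch: hash ranges of primary keys on both sides and only drill into ranges whose checksums differ. The real data-diff tool pushes the hashing down into SQL on each database; the dictionaries here just stand in for a source and a destination table.

```python
# Toy range-checksum comparison in the spirit of the data-diff discussion.
import hashlib

def range_checksum(rows: dict, lo: int, hi: int) -> str:
    h = hashlib.sha256()
    for key in range(lo, hi):
        if key in rows:
            h.update(f"{key}:{rows[key]}".encode())
    return h.hexdigest()

def find_diffs(source: dict, dest: dict, lo: int, hi: int, min_range: int = 4) -> list:
    if range_checksum(source, lo, hi) == range_checksum(dest, lo, hi):
        return []  # whole range matches, so skip row-by-row comparison entirely
    if hi - lo <= min_range:
        return [k for k in range(lo, hi) if source.get(k) != dest.get(k)]
    mid = (lo + hi) // 2
    return (find_diffs(source, dest, lo, mid, min_range)
            + find_diffs(source, dest, mid, hi, min_range))

source = {i: f"row-{i}" for i in range(100)}
dest = dict(source)
dest[42] = "row-42-corrupted"   # simulate a bad copy
del dest[77]                    # simulate a dropped row
print(find_diffs(source, dest, 0, 100))  # -> [42, 77]
```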
My experience obviously is data warehousing. Like, let's make sure the data warehouse matches production, but it's the same problem. And so his tool, I think, is just fantastic. I'm really excited about it. To your point of sort of
[00:34:57] Unknown:
cost being an issue in figuring out how often and when to execute these data validation checks when you're running in a cloud data warehouse ecosystem, I'm curious if you have heard any rumblings of these data warehouse providers starting to acknowledge this as a necessity and, you know, thinking about how do we actually bake in these capabilities as part of our platform so that it doesn't turn into a, you know, tens or hundreds of thousands of dollars problem just to make sure that you're actually delivering what you think you're delivering?
[00:35:26] Unknown:
I have not heard of any rumblings from the cloud providers. You know, during the conversation about the checksumming stuff, we were both dreaming that eventually vendors would all provide checksumming APIs so that, you know, for a given table or set of rows, you could just get the checksum of the data and then compare that, you know, using some spec across other systems. I've heard 0 about that from the actual cloud vendors. Like, from a revenue perspective, they want you to query more, but then obviously from, like, a product satisfaction, you know, net promoter perspective, giving your users the ability to verify their data is good, I think, trumps that. But, no, I haven't heard much about that. I'll be honest. I haven't talked with, like, BigQuery product managers in probably a year or so. So I'm out of the loop when it comes to specifically the GCP stuff. I'm not a Snowflake user, so my experience there is, like, basically what's on Twitter
[00:36:24] Unknown:
or what I hear from people I talk to. So, nope, short answer is I haven't heard much there. Alright. Well, for anybody listening who happens to either work at 1 of these companies or know somebody who works at these companies, definitely an issue to raise to see if that's something that we can factor in as part of the kind of standard operating procedure and, you know, just part of doing business. I'm gonna blast that out into the Twittersphere and see what comes back. And as far as the kind of alignment of sort of goals and objectives between application development and data engineering or data platform teams, or whatever formulation they're taking in your organization or, you know, in this current moment in time at whatever time this happens to be, what do you see as some of the current points of friction between those 2 teams or some of the ways to more easily kind of align their objectives so that there is a sort of smoother handoff between application developers generating this data in the 1st place and providing a useful interface for data management, you know, data professionals to be able to actually consume from those applications and continue that kind of DevOps principle of bringing the entire business into alignment for these given objectives and figuring out how do we actually manage that mapping between these different problem domains?
[00:37:42] Unknown:
2 things come to mind. So the first thing that comes to mind is just a lack of awareness that's creating friction. And the second thing I think is a lack of process, especially around architecture and proper data modeling. I'll dig into what I mean by that in a second. So on the first thing, just lack of awareness. The thing that we discovered at WePay when we started instituting these CICD checks where we would basically say, you can't commit if you are breaking compatibility on your event or database schemas.
That's a very draconian statement. Like, preventing people from committing when they're dropping a required field in their MySQL table is heavy handed. I'm first to admit that. But what we found is when we rolled that out, most of the engineers, you know, first, they were like, wait. Why can't we commit this? And then we would explain to them like, hey. You know, the reason we're preventing you from committing this is because it breaks our data pipeline and, you know, this data makes it into the data warehouse. This table is being queried by data science and sales, and it's making its way into, you know, Zendesk or Salesforce or what have you. And they were like, oh, wow.
Like, that totally makes sense. And they would instantly get it because, again, like, they're coming from the microservice world. They know when another team drops, you know, or changes their API, it breaks their microservice API call. Like, they're unhappy about it. So they understand compatibility actually pretty deeply and have felt that pain sort of from the consumer side, but they weren't thinking a lot about who was consuming their internal data. And so just educating them that, like, hey, the data you think of as internal is used all over the place. You know, at least 80% of the time, they come back to me, and it was not adversarial. They're like, oh, yeah. Okay. Let's figure out how we can make this work. Let's go, you know, work with the other teams, whatever it is, to figure out what needs to get done. So part of it is just educational and awareness, and simply putting a check in place took care of a lot of that for us. The second part was, you know, once they were educated, like, oh, man, we need to help the downstream people, or we need to make this incompatible change. You know, how do we do that?
We didn't have processes and sort of advisory teams or trained people in place to help them navigate that space. And, again, this is something on the microservice side that we went through. You know, when I was at LinkedIn, we were initially a monolith, and as we started breaking it up into a bunch of different repositories and stuff, we had to kind of grow this microservice, RESTful, you know, skill set. How do you do REST modeling? How do you provide the proper APIs and data models for teams to call you? And that involved, initially, a centralized sort of, quote, unquote, council, which, you know, people kinda bristle at. And it's not something I would advise for everybody, but if you're a huge company, maybe it makes some sense. We evolved sort of local experts that could help advise on, like, how do you do the evolution of the schema, what should the schema look like, and just to have the general knowledge of, like, hey, you're creating a payments model. Like, we have a payments model. Use the 1 that we have. Here it is. That kind of stuff. There's a bunch of sort of, you know, process and team sort of expertise that has to get developed and exposed to the broader engineering organization so that once they know there is a problem, they can get help navigating it. Those are kind of the 2 things that we saw.
So, you know, how that manifested at WePay, in particular, is, you know, I talked about all the CICD stuff. We also had the data platform engineering team that you mentioned sort of work and come up with a centralized schema repository. It was protobufs, and it had sort of standardized, you know, payments and, you know, bank account and credit card data models and address data models, and we borrowed some data models from Google. And then we had, you know, these analytics engineer type people that were going to sort of liaise between that repository and the rest of the organization. And, again, it's, like, very 1 to 1 with how I saw things evolve with REST.
[00:42:02] Unknown:
As far as the kind of pain of these kind of schema contracts, data contracts between these different junction points, as the scale of usage grows, as the degree of reliance increases on these different sort of terminal data assets, what are some of the junction points that you see as being the most critical to ensure smooth handoffs? Where, you know, if the application development team changes their database schema in a way that's not compatible and it causes the, you know, Debezium replication to fail, you know, maybe there's a way to buffer that from the downstream data assets. Or, you know, maybe there are super nodes from a, you know, graph analytics perspective of the overall DAGs for where these data processing steps happen. You know, maybe there's a kind of core table in your dbt workflow that everything flows out from, so you wanna make sure that that's the spot you monitor. Like, what are some of the ways that you have approached identifying some of these kind of critical juncture points to ensure that there is as much visibility and fault tolerance as possible to buffer some of the downstream consumers and users of these assets from failure?
[00:43:20] Unknown:
Yeah, I think you hit the nail on the head there. So in terms of junction points or interaction points, I think the most critical 1 is between the application team, essentially the team that is producing the data, and everybody else. So it's usually the application development team. And so, you know, as kinda you and I have said, I think there's room here for separating between their internal schemas and whatever the external schema is that they're exposing to everybody else. So that external schema is really, really important because it breaks everybody downstream of them. Secondarily, 1 of the things that we kind of grew at WePay, and I see this in the data mesh world as well, is a second tier of data model. So oftentimes, the data model the application development team is exposing is relatively, you know, fine grained and sort of close to their internals. It's very detailed, I guess, is the way I would put it. And the organization writ large may not need that level of detail or may need that data augmented with a bunch of other stuff from other teams and so on. So we grew this second tier of data model, which looks more like a data mart in the data warehouse world. We called it the canonical data representation, the CDR. And so our payments team would expose their payments data, and then we had our analytics engineering team sort of define an actual payment data model that took data from the payments team's data model and also from, you know, the reporting team and from, you know, the banking team or whatever it was, and kinda stitched it together into a data model that was really usable for the organization writ large. It was sort of this 2 tier data modeling approach. And so that second tier, I think, is a second critical point.
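A tiny sketch of what that second-tier, canonical model might look like: stitch the application teams' exposed models into one table the rest of the company queries. DuckDB and the table/column names are stand-ins chosen purely for illustration, not WePay's actual CDR.

```python
# Toy "canonical data representation": join two source-team models into one
# organization-facing payment model. Table and column names are invented.
import duckdb

con = duckdb.connect()
con.execute("CREATE TABLE payments_team_payments (payment_id INT, amount_cents BIGINT, merchant_id INT)")
con.execute("CREATE TABLE banking_team_settlements (payment_id INT, settled_at DATE, bank_ref VARCHAR)")
con.execute("INSERT INTO payments_team_payments VALUES (1, 1000, 7), (2, 2500, 9)")
con.execute("INSERT INTO banking_team_settlements VALUES (1, DATE '2022-08-01', 'ACH-123')")

# The second-tier model the rest of the organization actually queries.
con.execute("""
    CREATE VIEW canonical_payment AS
    SELECT p.payment_id, p.merchant_id, p.amount_cents, s.settled_at, s.bank_ref
    FROM payments_team_payments p
    LEFT JOIN banking_team_settlements s USING (payment_id)
""")
print(con.execute("SELECT * FROM canonical_payment ORDER BY payment_id").fetchall())
```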
And if you look at the data mesh world, I think Zhamak calls these, like, quanta or something like that. But she's very clear about having a hierarchy of data products, and we saw that firsthand at WePay, where we had a hierarchy where we had sort of the initial data products from the engineering teams, and we had sort of the 2nd tier analytics engineer data product that the company writ large would use for the most part. I think your second point, I probably should have touched on this earlier when you were asking about where you focus the testing, and I kinda said at the front and at the back, sort of data warehouse and then CICD. But there's sort of a different way to slice it, like you said, which is which data is the most important. And, again, I think you hit the nail on the head. It's like you, the company, need to decide, and it's usually pretty obvious which data is the most important. It's like your primary data. So in our case at WePay, it was our transaction data. At LinkedIn, it was, like, our profile data, and then maybe our advertising, you know, page view advertising data or what have you. So stuff that's tied to revenue. And that's where you spend the most amount of effort doing the checks. Right? So you might do all of the above data quality checks, metrics monitoring, CICD, schema validation stuff, and you might really lock down commits in that portion of the repository around the schema so that when somebody evolves, say, the payment data model, there's a lot of eyes on that. And then before it gets committed, there's a bunch of CI and CD, you know, schema compatibility checks and, you know, dbt checks, the works. Like, everything runs. For stuff that's less important, you know, some of the tracking data, for example, that we had was, like, relatively low value. It was useful, but if we got 99.9% of it, that was plenty. You know, having 0.1% data loss wasn't gonna kill us or anything. Then, you know, you tune that way down. Maybe you don't do as much. You do the compatibility check, but you're not doing any of the cloud data warehouse checks, or maybe you're only doing them once a week or once a month. That was another knob that we tuned at WePay: how frequently we would do the checks on the data warehouse. You know, if you do them half as often, suddenly you're spending half as much money. I don't think that part is really hard. Usually, some knowledge of the business domain will tell you, like, what are the most critical pieces of data. I think there are some surprises. So especially in, like, the data science world, they'll pick up on some data that you just had no idea about, and then suddenly it's really critical to the product they're building. And so I think monitoring on usage is really important, and that's something that we did do. So just, you know, who's querying which tables and how often, and are there any imbalances between data table usage or topic usage, and, you know, monitoring health. We didn't get that sophisticated at WePay, but 1 could imagine assigning some monitoring health score to each table and topic or schema.
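Purely as an illustration of the health-score idea Chris floats (and not how Schemata actually scores models), one could imagine something like weighing how heavily a table is used against how much checking and ownership it has:

```python
# Hypothetical "table health" score: heavily used but lightly checked tables
# sort to the top of the fix-it list. Inputs and weights are invented.
from dataclasses import dataclass

@dataclass
class TableStats:
    name: str
    queries_per_day: int       # e.g. from warehouse audit logs
    downstream_consumers: int  # teams or models reading the table
    quality_checks: int        # data quality rules attached to it
    has_owner: bool

def health_score(t: TableStats) -> float:
    importance = t.queries_per_day + 10 * t.downstream_consumers
    coverage = t.quality_checks + (5 if t.has_owner else 0)
    return coverage / (1 + importance)  # lower score = more urgent to improve

tables = [
    TableStats("payments", 5000, 8, 2, True),
    TableStats("page_views", 200, 1, 0, False),
]
for t in sorted(tables, key=health_score):
    print(f"{t.name}: {health_score(t):.4f}")
```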
So Ananth, the guy that does the Data Engineering Weekly newsletter, he's got this project called Schemata, which is really interesting. 1 of the things that it has in it is it's trying to assign a score to your schema sort of based on its health. And the health that it's kinda deriving there comes from looking, sort of from a graph perspective, at how interconnected it is with other stuff, other models and entities. And I'm gonna do a poor job of describing it, so I'm not gonna attempt it, but it's definitely worth a look. I think stuff like that is really interesting. So if you can detect that a given data model, table, schema, topic, whatever, is relatively unhealthy but also heavily used, you should probably improve the health, the monitoring, the metrics, whatever it is that needs to get done.
[00:48:46] Unknown:
Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it's no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. That's where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability.
Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $5,000 when you become a customer. In your experience, both working across different organizations and communicating with people who are working in different fields and industries, what are some of the most interesting or innovative or unexpected approaches that you've seen to automation and organizational scalability for data operations and data products?
[00:50:09] Unknown:
A couple of things. One is the thing I just mentioned: the health metrics in Schemata caught me by surprise. It really took me a while to kinda, like, read through it and understand what it's doing. So I thought that was really interesting. I would recommend people take a look at it. I can't say it was what I expected; what I was expecting was more like linting. So I think that was one thing that caught me by surprise. I also think about Convoy. I've talked to Chad Sanderson, who's over at Convoy, a number of times, and they've got this really cool tool that helps manage data models and schemas at Convoy.
And the thing that it does really well, from what I've seen, is it's really more of a cultural tool. So it's trying to help the consumers and producers of data collaborate on the schemas and data models and stuff in a constructive way. And it's, in a sense, kind of doing some of the job of the analytics engineer and data engineer, so that if I'm a data scientist and I need some new data, or I need a schema or whatever, I can kinda, like, ask for it, and then I can work with the upstream teams to, you know, suss out where that data is, how I can get it, stuff like that. I can't remember the name of the tool, but that particular tool from Convoy I thought was really interesting as well. Other than that, what else catches my eye? I mean, the automation with Terraform in the data space, I think, is really interesting. And it's not really novel in the sense that it's a new tool, but I think the application of it to the data space is really long overdue.
I was talking with Sarah Krasnik, who used to be at Perpay and is working on Terraform and data automation stuff right now. I thought that was just dead on, like, right up my alley. That's what we were looking at at WePay; it was a really high-dividend kind of thing. And so managing, you know, data pipelines, data access, even Airflow DAG deployment through that world was something I was really excited about. So I guess those are the three things that jump to mind.
[00:52:19] Unknown:
In your own experience of working in this space and building and growing teams and communicating with people who are in similar situations, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:52:31] Unknown:
It's sort of this age-old engineering story where the hardest part tends to be the humans and the culture. I'm not trying to downplay the difficulty of building a scalable system, but I think at the point that we're at in the data ecosystem's life cycle, there's a lot of scalable systems, and sharding and partitioning and quorum-based and, you know, leader-follower designs. Like, it's a fairly well understood, you know, 20 to 50 year old problem. But when it comes to navigating the cultural elements of schema management and data modeling and metrics definitions and stuff like that, I think it's very hard sometimes to convince humans to do things that make it harder for them to ship their software in the immediate sense. Like, I'm an application engineer, and my product manager wants this thing out by Wednesday, and doing this extra work for the data stuff is not gonna make it out by Wednesday.
That is a real challenge. And so there's, like, an element of pragmatism that comes with it, and you need some emotional maturity to work with folks. I won't say that's necessarily surprising. Like I said, it's sort of the age-old story of engineering: the human stuff is the hard stuff, not the technical stuff. But I wouldn't say it's obvious either; it's sort of a non-obvious thing.
[00:53:54] Unknown:
And for people who are building out their own data systems and data platforms or working in this ecosystem, what are the cases where automation is too big of a foot gun and you actually want to keep doing things manually to figure out how it all works?
[00:54:08] Unknown:
Yeah. I think you just answered that, so that was gonna be what I was gonna say. You don't wanna automate too early. You know, if you don't know what the right flow or process is, don't automate. Do it manually until you figure out what the right flow is, and then automate it. I think that's answer number one. Especially engineers, you know, they wanna write their Python scripts and automate everything immediately. But I think doing things manually the first few times and sort of working through, like, who do I talk to? Oh, I need to talk to security. Okay, well, what tool are they using? Okay, they're using this tool to manage their security approvals, and yada yada. That helps mitigate some of the technical debt you accrue if you just go whole hog into automation from the get go and then discover, like, it's the wrong automation, or you're missing stuff, or there's tools you didn't know about, or what have you. That's, I think, thing number one. The second thing is just that, you know, not everything can be automated. So, specifically, some of the stuff I was talking about around data modeling and what the right way is to define, you know, a given data model. That's more of, like, an architectural, design-pattern-y kind of question. And, you know, linting can help, and there's tools that can help and sort of, like, look for other things that have similar names, some of the stuff Schemata is doing, where it's trying to figure out: you have a payment here, but should it be a payment ID? And there's this other payment data model; are these two things interlinked, and should you be referencing payment instead? Some of this tooling can do some of that. But I think there's sort of this meta level of looking at data models and figuring out, you know, how they should be factored. Like, humans just have to do that. Right? There's no tool that's gonna tell you how to factor your data models and whether one of them should be embedded in the other or be a separate data model and so on. So I think when it comes to defining schemas and data models, automation can help, but I think there still needs to be a human in the loop for a lot of that stuff. So those are kind of my two areas where I think automation is not the end-all, be-all.
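A toy version of the naming lint described here might look like the sketch below. The known entities, the `_id` convention, and the schema being linted are all made up for illustration; a real tool such as Schemata does much more than this.

```python
# Toy lint: flag fields that look like they embed another known entity
# and suggest referencing that entity's ID instead. Illustrative names only.
KNOWN_ENTITIES = {"payment", "user", "merchant"}

def lint_schema(schema_name: str, fields: list[str]) -> list[str]:
    warnings = []
    for field in fields:
        base = field.removesuffix("_id")
        if base in KNOWN_ENTITIES and field != f"{base}_id":
            warnings.append(
                f"{schema_name}.{field}: looks like the '{base}' entity; "
                f"consider referencing {base}_id instead."
            )
    return warnings

print(lint_schema("refund", ["refund_id", "payment", "amount"]))
# -> suggests referencing payment_id rather than embedding a 'payment' field
```

Whether that payment really should be flattened into refund or kept as its own model is exactly the judgment call the tooling can't make for you.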
[00:56:13] Unknown:
Yeah. As you continue to work in this space and track the evolution of how automation is being brought in and how DevOps principles are factoring into data engineering and data platforms, what does the future of data engineering look like to you?
[00:56:31] Unknown:
So I think much more federated and automated. You know? Again, it's basically been the whole conversation around this, but I think that's the direction I always wanted to move when I was running the data team at WePay. And it was a direction we were moving in, and I saw it paying a ton of dividends. So I think, you know, the ecosystem is growing more and more tooling in this area that's just gonna make that easier and easier for organizations to roll out. So empowering the upstream and downstream producers and consumers to get the data they need, manage the data they have without involving data engineers.
I think that's kind of the direction we're moving. And I think data engineers are moving away from shoveling data from 1 system into another and into a role where they're just building automation tooling for the organization to use. That's kind of the future that I see on the data engineering side of the world.
[00:57:32] Unknown:
This is probably a topic that's worth a whole other episode, but as data engineers move more into that facilitation role and more of the actual data movement and data management moves up to business users and analytics engineering roles, what are some of the potential pitfalls? What are the elements of education or understanding of the fundamentals of data management, data modeling, and schema evolution that need to either be baked into those tools and systems so they don't have to be exposed, or be actively managed? And how do we translate some of those concepts and lessons to be accessible to users who don't necessarily have that same background?
[00:58:23] Unknown:
I think it's a both-and. In some cases, the tools can handle it. You know, to use that concrete example, we can have the tools that check compatibility: is this schema forwards and backwards compatible? Is this change forwards and backwards compatible? But I think the counterbalance to that is this embedded analytics engineer role that I mentioned. Where, like, okay, the check failed. The application engineer is left sitting there like, well, I can't commit this change, and it's telling me it's an incompatible change because, you know, the field was a string and now it's an int. Like, what do I do? I think that's where the analytics engineer can play a role. And that's sort of, I think, not grunt work, but a tactical thing. I think the second level, which again I think is less automation and more human, is just that the analytics engineers can really help the application engineering teams define, like, what does their public, quote, unquote, data product look like, and help them manage it. You know, so the application engineer is probably gonna be in charge of defining and exposing that public data product, and helping transform the internal data into the external data. So I think it's a both-and. You know, some of it will be tooling, but I think a lot of it really is gonna fall into the, quote, unquote, embedded SRE role, which is really like this analytics engineering role. I don't wanna make it sound like this is, like, pie-in-the-sky, made-up stuff. Like, there are definitely teams that are doing this. So, you know, I mentioned earlier, Dmitriy, my coauthor for my book, his teams at Zymergen actually do have embedded analytics engineers. They're a very data-centric company. They're doing, like, biotech stuff. Hand-wavy.
I'm not super educated on that, but they have analytics engineers that kinda work hand in glove with the rest of the engineering team to do this. It exists in the world. This is not completely made up, and it does work. But I think it's just a very new pattern for this kind of thing. So those are sort of the two ends of the spectrum that I see.
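To make the string-to-int example from a moment ago concrete, here is a minimal sketch of the kind of backward-compatibility check a CI job could run on a schema change. It hand-rolls the comparison over simple field-to-type maps purely for illustration; real pipelines would typically lean on Avro or Protobuf tooling or a schema registry's compatibility API instead.

```python
# Illustrative backward-compatibility check over {field_name: type} maps.
def breaking_changes(old: dict[str, str], new: dict[str, str]) -> list[str]:
    """Report changes to an existing schema that would break old readers."""
    problems = []
    for field, old_type in old.items():
        if field not in new:
            problems.append(f"field '{field}' was removed")
        elif new[field] != old_type:
            problems.append(f"field '{field}' changed type {old_type} -> {new[field]}")
    return problems

old_schema = {"user_id": "string", "amount": "string"}
new_schema = {"user_id": "string", "amount": "int"}

issues = breaking_changes(old_schema, new_schema)
if issues:
    # In CI this would fail the commit and hand the problem to the embedded
    # analytics engineer and application engineer to resolve together.
    raise SystemExit("Incompatible schema change:\n" + "\n".join(issues))
```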
[01:00:20] Unknown:
Are there any other aspects of this subject of scaling of organizational capacity for data management and data evolution and the role that automation plays in building and maintaining that velocity that we didn't discuss yet that you'd like to cover before we close out the show?
[01:00:35] Unknown:
We only touched on it really briefly, but I think the area of security and compliance is, you know, important and becoming more so every day. You know, between GDPR and HIPAA and SOC 2 and SHIELD and CCPA and, you know, on and on and on, there's a million of these things now. So there's a compliance aspect to it. And then there's also just the general moral responsibility you have to your customers to be good stewards of their data. There's a lot of room for automation there too. You know, I mentioned this DLP product. I would love to see a lot more effort spent in that space, in automating security checks and access approvals and stuff like that. One of the things that we were doing at WePay was tagging the source data. So if I have a MySQL table and it has a column and that column has phone numbers in it, the developers had a way of expressing, hey, this table has a column a, and a has phone numbers.
And then downstream, what we would do is run these automated checks that would detect sensitive information, and if it found something that was not already tagged for a given column, it would alert. So, you know, maybe in column a, the one that has phone numbers, maybe it detected there were emails in there too. And so it says, hey, we found emails, and they're not tagged. Maybe that's legitimate, it's okay to have emails in there, in which case the developer needs to go and tag it. Or it was illegitimate, in which case we need to remediate and, like, scrub the data, get rid of the emails, redact access, what have you. So I think that's part of the automation. The second part of the automation is that once you have this metadata tagged and detected, you can start doing more automated access control. So, you know, if I'm a level 1 user, for example, whatever level 1 means. Let's say it means I have access to a certain tier of secure data. When I request access to a given table, you know, if it has emails, I'll be automatically granted access to that table for 24 hours, something like that. And no human needs to be in the loop for that, which, again, getting back to automation and federation, is a big thing. So I think there's a lot to be done there. You know, none of these are, like, novel ideas. If you look at, you know, SSH production access and bastion gateways, all that kind of stuff that operations does. Again, it's taking those ideas and applying them to the data space, and I think it's something that we need to do on the security side of things. So we touched on that a little bit, but that was something that I would definitely wanna highlight as, I think, an opportunity for a ton of automation.
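Here is a hedged sketch of the tag-versus-detection check described above: scan a sample of column values for sensitive patterns and alert when something turns up that was never declared. The regexes, tag names, and sample values are illustrative assumptions, not the actual WePay implementation; the same declared-tag metadata is what would then drive the automated, time-limited access grants.

```python
# Illustrative PII scan: compare what a detector finds in sampled column
# values against the tags the owning team declared for that column.
import re

DETECTORS = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def undeclared_pii(declared_tags: set[str], sample_values: list[str]) -> set[str]:
    """Return sensitive data types detected in the sample but not declared."""
    found = {
        tag for tag, pattern in DETECTORS.items()
        if any(pattern.search(value) for value in sample_values)
    }
    return found - declared_tags

# Column 'a' is declared to hold phone numbers, but emails show up too:
alerts = undeclared_pii({"phone"}, ["+1 415 555 0100", "someone@example.com"])
if alerts:
    print(f"ALERT: untagged sensitive data detected: {alerts}")
```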
[01:03:05] Unknown:
For anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[01:03:21] Unknown:
Yeah. I think it's this thing I kind of alluded to with the Convoy tool, which is that I would really love to see more tooling that helps teams work with each other. Like, engineers have GitHub, and, you know, all the different teams can, like, look at PRs and comment on each other's work and whatnot. I would love to have tooling that allowed organizations to collaborate on data models and schemas. I think that's just a huge gap. And so for me, that would be my big ask.
[01:03:52] Unknown:
Alright. Well, thank you very much for taking the time today to join me and share the work that you've been doing and the experiences that you've had and some of your insights on the opportunities for automation in the data ecosystem. It's definitely a very important and constantly evolving area. So I appreciate all of your help in continuing to push the conversation forward. So thank you again for taking the time, and I hope you enjoy the rest of your day. Yeah. Thank you.
Thank you for listening. Don't forget to check out our other shows, the Data Engineering Podcast, which covers the latest on modern data management, and the Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you have learned something or tried out a project from the show, then tell us about it. Email hosts@pythonpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Guest Introduction
Chris Riccomini's Career Journey
Early Career and Interest in Data
Challenges in Scaling Data Platforms
Opportunities for Automation in Data Pipelines
Microservice Paradigms in Data Management
Adopting DevOps Principles in Data Engineering
Cost and Frequency of Data Quality Checks
Critical Junction Points in Data Pipelines
Innovative Approaches to Data Automation
Lessons Learned in Data Engineering
Future of Data Engineering
Security and Compliance in Data Management
Closing Remarks