Summary
Data quality is at the top of everyone's mind recently, but getting it right is as challenging as ever. One of the contributing factors is the number of people who are involved in the process and the potential impact on the business if something goes wrong. In this episode Maarten Masschelein and Tom Baeyens share the work they are doing at Soda to bring everyone on board to make your data clean and reliable. They explain how they started down the path of building a solution for managing data quality, their philosophy of empowering data engineers with well-engineered open source tools that integrate with the rest of the platform, and how to bring all of the stakeholders onto the same page to make your data great. There are many aspects of data quality management, and it's always a treat to learn from people who are dedicating their time and energy to solving it for everyone.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.
- RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
- Your host is Tobias Macey and today I'm interviewing Maarten Masschelein and Tom Baeyens about the work they are doing at Soda to power data quality management
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by giving an overview of what you are building at Soda?
- What problem are you trying to solve?
- And how are you solving that problem?
- What motivated you to start a business focused on data monitoring and data quality?
- The data monitoring and broader data quality space is a segment of the industry that is seeing a huge increase in attention recently. Can you share your perspective on the current state of the ecosystem and how your approach compares to other tools and products?
- Who have you created Soda for (e.g. platform engineers, data engineers, data product owners, etc.) and what is a typical workflow for each of them?
- How do you go about integrating Soda into your data infrastructure?
- How has the Soda platform been architected?
- Why is this architecture important?
- How have the goals and design of the system changed or evolved as you worked with early customers and iterated toward your current state?
- What are some of the challenges associated with the ongoing monitoring and testing of data?
- What are some of the tools or techniques for data testing used in conjunction with Soda?
- What are some of the most interesting, innovative, or unexpected ways that you have seen Soda being used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while building the technology and business for Soda?
- When is Soda the wrong choice?
- What do you have planned for the future?
Contact Info
- Maarten
- @masscheleinm on Twitter
- Tom
- @tombaeyens on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- Soda Data
- Soda SQL
- Red Hat
- Collibra
- Spark
- Getting Things Done by David Allen (affiliate link)
- Slack
- OpsGenie
- dbt
- Airflow
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey. And today, I'm interviewing Maarten Masschelein and Tom Baeyens about the work they're doing at Soda to power data quality management. So, Tom, can you start by introducing yourself? Thanks, Tobias, by the way, for having us. I've been a fan of the podcast for over 3 years now, and I'm super excited to be
[00:01:14] Unknown:
here. Yeah. So I'm Tom. I started my career building a workflow engine because the existing products at the time were dreadful. Funny thing is that a lot of engineers actually feel the need to write a workflow engine sooner or later. So I happened to be the lucky one, building a large open source community around it and ending up at Red Hat. As an engineer, open source is actually awesome because by sharing your code, you get more feedback, so you learn much faster and build better code. Also, it's very rewarding to know that your code runs across the world at thousands of organizations. One such company, for example, is Collibra.
They use the workflow engine we created to drive one of their key features, which is collaboration and task management. And now at Soda, I'm driving the open source strategy, and my main responsibility is to ensure that data engineers just love using our tools. And, Maarten, how about yourself?
[00:02:11] Unknown:
I started my career at an early stage software company, pretty much to learn all things SaaS. I was fortunate to join an amazing team that ended up growing the company from 5 people when I joined to about 350 people when I left. And as probably some of you know how it goes in startups, you take on a variety of roles. The most interesting one for me personally was heading up the field operations team. Once the company got a bit bigger, field ops was really a diverse team. We had 30 people across customer facing roles, across system and process management, as well as data and analytics. And one of the most memorable things I remember from that time was us launching the company's first data products, really, which were all about improving predictability and efficiency of the go to market. And, Maarten, do you remember how you first got involved in the area of data management?
Yes. For me, that was through the company I was just talking about, which was Collibra. Some of you might be familiar with them as they're active in the data governance and cataloging space, predominantly for large organizations. They created the world's first data governance and data discovery platform back in 2010. And it was early. Some would say too early, because at the time there was only one chief data officer in the world and nobody had heard of data discovery before. Whereas today, there are tens of thousands of CDOs or heads of data and close to 10 commercial offerings in that space. The first 4 years of my career, I predominantly helped companies internally sell and implement capabilities like metadata management, data discovery, data privacy as GDPR came up, data lineage, and then ultimately also quality.
However, a lot of companies were struggling, especially with quality, as the tooling out there was very antiquated and unwieldy. I remember the biggest struggle from back then was all about how we could enable the people that know the data really well, like a subject matter expert or a data owner, to write executable business rules so they could get exception reports on data defects. And at the time, we kind of solved that with a combination of the tooling that existed, but then also with collaboration tooling, because there was a handover between the business and the technology team. And for that, we used workflow. So you can kind of guess how Tom and I met. Because in the engineering team, Tom was considered like the legend of the workflow space.
And I was only aware of kind of the stories back then. But fast forward a couple of years, and I'm on this trip to London on the Eurostar, the train from Brussels to London. And one of my colleagues spots Tom. He was just sitting there coding away. So, yeah, we started chatting and the rest is history. Anyway, you could say that trip is almost partially responsible for bringing Tom over to the data management world.
[00:04:59] Unknown:
Yeah. And that meeting actually started me on a transition from a software engineer to a data engineer. The software engineering principles are basically the same as in the data space, only the landscape is a bit different. And it struck me that in data, the testing aspect was lagging behind. So coming from the engineering background, writing unit tests and monitoring applications in production is a given there. In data, it's quite different. Most organizations are aware that they should test, but they don't know what the good practices are and how to approach it. And when Maarten and I started working together, we just knew that there was such a big need for this.
[00:05:39] Unknown:
Yeah. It's definitely remarkable how much velocity we've been able to gain in the data space while still sort of leaving behind a lot of the testing principles that have become so ubiquitous in the software engineering landscape. And I think part of it is because a lot of the data products that we use came out of academia, where testing hasn't really become as commonplace. So a lot of the tooling was just kind of built around the idea of, here's this thing. You just kind of put it onto a server, and then it's in production. And then it's taken this long for data to become sort of a core element of so many businesses, and that is what has led to the rise in observability and quality and testing as a top level piece of importance in the data ecosystem, because it is so critical to the rest of the business, and so many more businesses are relying on it than were using it in, you know, the early 2000s.
[00:06:34] Unknown:
Yep. Totally.
[00:06:36] Unknown:
After meeting at Collibra and deciding that data was something you wanted to focus on, you both decided to start a business around it. So can you give a bit of an overview about what it is that you're building at Soda and some of the problems that you're trying to solve there? We're building a data monitoring platform that enables organizations to discover, communicate, and prioritize data issues, especially as it relates to the data that goes into their data products, whether that data is streamed or whether it resides in lakes or warehouses.
[00:07:05] Unknown:
Now it's probably a good time to first define what we mean by data products for the context of today. How we think of it is that a data product is a piece of software. It's a software product that uses data as an input to facilitate, but often automate, a decision. And as organizations start building and operationalizing more of these data products, it's becoming increasingly critical to monitor the quality of the data that's being fed into those. So we ultimately think of a data product really as a piece of software that uses a continuous flow of data to drive a business outcome. Think BI reports, price setting algorithms, recommendation engines on ecommerce stores, etcetera.
[00:07:45] Unknown:
Yeah. The overall space of data products is definitely pretty broad. So I appreciate you giving that framing for the context of this conversation because, you know, some people think data product and it's, you know, machine learning models, and other people think data product and it's, you know, the SaaS application that happens to require customer input. You know, even in that context, even in your framing, it's still pretty broad and can encapsulate those different things. You know, other people think a data product is just the one report that I deliver to my CEO on a weekly basis, you know, from some batch jobs. So I noticed you mentioned a continuous stream of data. So I'm wondering if that means that you're focused largely on kind of the streaming aspect or if you're trying to unify the batch versus streaming landscape in terms of the ways that you're approaching data quality and observability?
[00:08:30] Unknown:
Ultimately, what we've done is we've looked at where most data processing happens today, and that is clearly kind of in the SQL landscape. But a lot of data processing is happening in Spark, and there's, of course, a lot of data being streamed. So we have an approach there, which we'll talk about a little bit later. But when it comes to the architecture of our system, we've initially focused really hard on SQL and are now creating open source libraries for both Spark and streaming.
[00:08:59] Unknown:
In terms of the specifics of your approach to data quality, I know that there are a number of different companies and projects in the space, particularly just in the past year or 2 that have been coming up. And some of them orient around the data warehouse as kind of a focal point, where they just look at the data that's coming into the system and orient the entire data quality product around the data as it lives there, versus some companies that are maybe focusing on the data pipelining aspect. And there are a number of different places where data quality issues can crop up. So I'm wondering what your stance is on the best place to focus your efforts, particularly given that you're an early company and have limited time and energy for being able to make an impact, and sort of what your overall philosophy is as to how to approach this large problem.
[00:09:57] Unknown:
So maybe we should start by defining, like, what is the problem that we're after. Right? What is the problem that we're solving? And I think the core problem really is, first of all, that data teams are dealing with silent issues. I think we should first establish that they are silent issues because more often than not, you don't even know that something's off, that something's happening. So your data products keep working, but they produce unexpected and erroneous results. So that's kind of the broader problem as we define it. Some teams call that data quality issues. But, ultimately, the core aspect of it is that they're silent, and it requires the customer of your data product to ring the alarm bell. And that's a very, very annoying feeling.
As a data engineer, you're involved in firefighting a data issue because somebody else renamed an event name, and that led to an email campaign being sent to the wrong people. Or as a product manager, you're losing revenue due to an error kind of in your recommendations on your ecommerce website, for example,
[00:10:59] Unknown:
because they're distorted towards cheaper items, because we mistakenly trained on skewed data. Now this list goes on. The real problem here is actually that most data teams today are flying blind, and they don't have systems and processes in place to detect problems with data. And as a result, these data issues remain silent. So the software keeps working but on bad data, and that leads to all sorts of uncontrolled events. And there are usually 2 key problems. The first is the lack of observability, and the second one is figuring out where to start.
First, like, if you don't monitor, you can't know that something is wrong. That's quite obvious. But in order to start monitoring, the most difficult nut to crack is this: who's doing what and where do we start? Without such an overall approach and a clear way to start data monitoring, many people just park it for now and they move on to the next urgent feature. Also, local initiatives get stuck as they fail to create transparency towards other data people in the organization, and that leaves your data products exposed.
And those exposed data products can lead to all sorts of trouble. The more mission critical your data product, the greater the potential damages. Damages can be things like loss of revenue, for instance, when broken data would cause wrong prices to be calculated; increased costs, like if you have to do a huge cleanup after something went wrong; and risk is another one. That's when you, for instance, publish wrong information on a dashboard to your customers, and it can break the reputation of your brand. I remember that risk was the main driver for our first customer, actually. I think that sums up the core of the problem that we're addressing.
[00:12:49] Unknown:
You were mentioning how there is this need for being able to understand what data you're working with, how the systems are operating. And these are all software products. So at some level, there's the kind of anticipation or expectation that you will use standard unit testing approaches and being able to do sort of production monitoring of your systems. But I also know that data systems are a tricky beast and that they have their own special set of concerns. And I'm wondering what you see as the missing pieces in the overall testing landscape from a software engineer's perspective, and why it is that the existing tools aren't sufficient or aren't easy to apply to these problems of data observability and data quality and just some of the ways that you're working to tackle that problem? I think it's worthwhile kind of going through the capabilities
[00:13:42] Unknown:
end to end, because how we look at the problem first, on a high level, is we basically want to make sure we support the end-to-end data issue workflow. So it starts really with discovering your data issues, and you could do that by defining what good data looks like, because that goes into the how. And I think, to your question, what is lacking today is actually a framework for all of the approaches that we need to apply to actually get to high quality data, to reliable data pipelines. And that's part of the defining what good data looks like for us.
But that's not where it ends. A second piece that's missing is that we create a lot of systems that create alerts, but there's no ownership of them. Right? So we need to be able to communicate incidents to the right people based on the role that they have in a data team, and work on the data culture, where we foster ownership. And finally, it's about prioritizing and resolving issues based on a holistic view of the impact. Because very often, if you only have a small subset, like a small view of, for example, test failures, you cannot really contextualize that very well and determine, oh, this is now the first thing I'm gonna tackle today. So as I said, the majority of our focus is on, like, how do we define what good data looks like? And for us, there are 3 key areas, like, 3 key perspectives.
One is from the data platform team. Right? So you're managing that central data platform for everyone to build data products on. Well, for that team, it's core to really create dataset level observability. Right? So we can automatically warn the teams that are building data products when something is irregular: think data arrival, think changes in data types, think a lot of missing values or data, think schema changes. So, really, as new data comes in, you compare metrics and information to a historical baseline. That's one part of the solution.
A second piece is everything around data testing. So when you, as a data or analytics engineer, are writing data transformations, you want them to be reliable. Right? You want them to keep working as expected. And that's not always the case. So you need tools to be able to quarantine data for review. Finally, we should look at it also from the product manager's perspective. So if you're managing a data product, you want to be able to click an SLA together that says, hey, my data needs to arrive, for example, at 8 AM, and it needs to have all of these core characteristics. Because if that's not the case, the analytical integrity of my data product is in jeopardy.
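As a rough illustration of the quarantine idea described above, here is a minimal Python sketch that sets aside rows failing a few basic checks before they reach a downstream data product. The table names, columns, and checks are hypothetical, and this is not Soda's API; Soda's tooling expresses checks like these declaratively rather than in hand-written code.

```python
# Minimal sketch of "quarantine data for review": rows that fail basic
# validity checks are moved aside instead of silently flowing downstream.
# Table and column names are made up for illustration.
import sqlite3

EXPECTED_COLUMNS = {"order_id", "customer_id", "amount", "created_at"}

def quarantine_bad_rows(conn: sqlite3.Connection) -> int:
    """Move rows that fail validity checks into a quarantine table for review."""
    cur = conn.execute("SELECT * FROM staging_orders LIMIT 0")
    actual_columns = {description[0] for description in cur.description}
    if actual_columns != EXPECTED_COLUMNS:
        raise RuntimeError(f"Schema drift detected: {actual_columns ^ EXPECTED_COLUMNS}")

    bad_row_filter = "order_id IS NULL OR customer_id IS NULL OR amount < 0"
    conn.execute(f"INSERT INTO quarantined_orders SELECT * FROM staging_orders WHERE {bad_row_filter}")
    conn.execute(f"DELETE FROM staging_orders WHERE {bad_row_filter}")
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM quarantined_orders").fetchone()[0]
```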
[00:16:24] Unknown:
Yep. From a product perspective, we solve these problems with a combination of a cloud platform and a set of open source developer tools. The open source developer tools, they're built by and for developers and data engineers. These tools are designed to fit naturally in the data engineer's workflow. They're embedded in the pipelines, and they stop them if needed. Because no data is a lot better than broken data. Observability starts with the developer tools. They compute a set of metrics, and they can also evaluate tests. And the metrics can be sent to our cloud.
So the Soda Cloud enables us to store things like the metrics over time, which are used for checking anomalies automatically, allowing nontechnical people to set thresholds, sending alert notifications, and diagnosing issues, for instance. I think that's the gist of it. That's how we're solving the problem. Yep. So to come back to your question, Tobias,
[00:17:29] Unknown:
I think it's predominantly the overarching framework and the different tools or the different things that everyone actually should be doing
[00:17:36] Unknown:
to get to high quality trusted data. And I really like how you pointed out the challenge of being able to build tests around the system because of the fact that you need representative data to understand if the changes you're making are actually going to have the desired impact. And it's definitely challenging to be able to easily and repeatably generate data in a manner that's going to match what you're seeing in production. And so having the approach of being able to quarantine sample data for then rerunning against the changes that you're making locally to understand what impact that will have when you get to production is definitely valuable.
[00:18:13] Unknown:
Exactly.
[00:18:14] Unknown:
Going back a bit to the overall space of data quality and data monitoring and the level of growth that we've seen in the most recent couple of years in terms of the number of companies and the number of projects, I'm wondering if you can just give your perspective on how you see the current state of the ecosystem and some of the differences in terms of philosophy or capabilities of the Soda platform and the open source tools that you're building there as it relates to some of the other approaches that other businesses and projects are taking and the way that they're trying to address this problem? I think from the level of interest, it's clear, of course, that testing, monitoring, and observability are top of mind for thousands of organizations
[00:18:58] Unknown:
right now. But if you were to analyze the ecosystem, I'd evaluate it across 3 core dimensions. So I think of it as, you know, some systems are active and some are passive. Like, active tools, they run as you run your pipelines, and they take immediate action when something goes wrong. So for active monitoring and testing, performance is key, of course. But also the developer experience is key, as well as the ability to version control this. Passive tools, on the other hand, they, for example, work on top of your data warehouse. They could execute a set of queries on a certain schedule, or read in the logs, and they'll run in parallel and alert you typically after the damage is done, which is already a lot better than not knowing what's going on, but it's not always the ideal scenario because sometimes you really want to stop data from flowing through. So on that first dimension, we believe that both capabilities are really necessary to solve it holistically.
But you can, of course, start where you see the most value yourself. The second one is this notion of manual versus automated. It's a question that we sometimes get. And fully automated systems, even those developed by the greatest minds in the industry, right, will still create alert fatigue. On the other hand, if you're fully manually driven, or rely on manual user involvement, albeit minimal, it could be considered tedious or not scalable, because we have so much data and so many data usages and products. But it is, in our opinion, a valid approach as well, because it's a really good way to get the tacit knowledge of data subject matter experts, who sit in different business functions, into the system as well. And so this balance between automation and manual user involvement, I think, is also critical to get right. And then the third one is around kind of tool versus platform. Because tools are really great because they allow you to solve a problem in a very easy way, and, typically, you get a native solution for that.
Whereas platforms are good because they can provide more end-to-end insights into the health of your data, for example. Or they could provide capabilities that are more collaborative in nature, right, to collaboratively solve problems. So we've been very deliberate about what we consider to be a tool, like, for example, data testing, and what we consider to be part of the platform, like, for example, data collaboration, and how tools can also enrich collaboration in that platform. So that's how I'd look at it. It's kind of these 3 main dimensions.
[00:21:31] Unknown:
Yeah. It's definitely a useful way to look at it. And I like what you're saying too about the paths to adoption of tool based versus platform based systems, where if you have a platform, then you need to kind of buy it entirely to make it useful, and there's a much bigger switching cost both into and out of the platform. Whereas if it's a kind of loose collection of tools that all integrate with a broader platform, then it's easier to get started with one or two of those tools, use them with your existing systems, and then start to realize the value of actually sending it all to that central platform to be able to get the more longitudinal view of how it all sits together.
[00:22:11] Unknown:
Yep. Exactly.
[00:22:12] Unknown:
And so in terms of the ways that you're approaching the overall design of your system, I'm wondering what you are using as kind of your guiding philosophy and who you use as your canonical sort of representation of an end user of the Soda platform and the tools that you're building for, and how that helps to direct the way that you build the interfaces and design the workflow for people who want to onboard into the system?
[00:22:48] Unknown:
We look at this from a few different roles, each with their own perspective on the different areas that data monitoring touches. The first is the data platform engineers. They provide the central infrastructure for the entire data organization. They want to monitor as a service, or they want to enable monitoring as a service on that central platform so that others can build data products on it. For them, automation is key. They want to enable monitoring of all of the data in their platform, and typically, that involves a large number of datasets. The central platform team cannot spend time setting up each dataset individually.
For them, data monitoring must be fully automated, and that's the data platform engineers. Next, there are the data engineers, sometimes also referred to as analytics engineers. Their primary responsibility is to build transformations. Given they prepare the data for the data products, they are often the primary contact in the team if something goes wrong. They often feel the heat if their transformations don't run correctly. With respect to monitoring, their primary concern should be to deliver reliable transformations. That's why they should add data testing to ensure that their transformations keep running as expected.
Before a transformation starts, a data engineer could test preconditions on the input data. Those are all the assumptions on which the transformation relies. These are typical things like schema checks, null checks, referential integrity, and so on. At the end of a transformation, post conditions should verify that the transformation ran correctly. When that is done for each transformation, a decent coverage is achieved, and trust in the data products will go up. This is very similar to unit testing in software engineering. And over the coming years, data testing is going to become a standard practice in the data space as well. That covers the engineers part. And then the last 1 is the product managers.
So product managers, they are responsible for the data products, and they want to ensure that the data products keep working correctly. And to achieve that, they use service level agreements. These SLAs track important aspects of the data that goes into their data products. Often, these represent domain knowledge. Like, for example, on a public holiday, we should only see 10% of the normal traffic for products x, y, and z. Another SLA could be about freshness: the data needs to arrive by 8 AM. These are examples of data product dependencies that product managers want to monitor as a service level agreement.
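To make the testing and SLA ideas above a bit more concrete, here is a small Python sketch of precondition and postcondition checks around a transformation, plus a freshness SLA check. The table names, thresholds, and the 8 AM UTC deadline are assumptions for illustration only; in practice these checks would be declared in your data testing tool rather than hand-coded.

```python
# Sketch of the precondition / postcondition pattern around a transformation,
# plus a freshness SLA check. Names and thresholds are illustrative.
from datetime import datetime, time, timezone

def check_preconditions(conn) -> None:
    """Assumptions the transformation relies on: nulls and referential integrity."""
    null_keys = conn.execute(
        "SELECT COUNT(*) FROM raw_orders WHERE customer_id IS NULL").fetchone()[0]
    assert null_keys == 0, f"{null_keys} orders have no customer_id"

    orphans = conn.execute("""
        SELECT COUNT(*) FROM raw_orders o
        LEFT JOIN customers c ON o.customer_id = c.id
        WHERE c.id IS NULL""").fetchone()[0]
    assert orphans == 0, f"{orphans} orders reference unknown customers"

def check_postconditions(conn, input_rows: int) -> None:
    """Verify the transformation ran correctly before publishing its output."""
    output_rows = conn.execute("SELECT COUNT(*) FROM daily_revenue").fetchone()[0]
    assert output_rows > 0, "transformation produced an empty table"
    assert output_rows <= input_rows, "transformation produced more rows than it received"

def check_freshness_sla(latest_load: datetime) -> None:
    """SLA: today's data needs to have arrived by 8 AM (UTC here, for the example)."""
    now = datetime.now(timezone.utc)
    deadline = datetime.combine(now.date(), time(8, 0), tzinfo=timezone.utc)
    if now >= deadline and latest_load.date() < now.date():
        raise RuntimeError("Freshness SLA violated: today's data did not arrive by 08:00")
```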
[00:25:33] Unknown:
So I think with these 3 roles, we've got a good summary of how to set up a good monitoring system. There's one role we didn't discuss. And, of course, it's important as well that your head of data, your chief data officer, gets a bit of an overview. Because for them, it's all about implementing a strong data culture, right, in which very often data ownership is key. Because without data ownership, who will look at all the alerts? Right? Who will solve the problems? And more often than not, this is a cultural change that needs to happen. So for them, it's more about kind of the oversight and seeing that the data teams and the organization are going in the right direction.
[00:26:13] Unknown:
That's another one of the challenges in the overall space of data quality and data monitoring: there are so many different stakeholders and people who contribute to and are concerned with the vitality of the data. Whereas in a software system, the testing and integration testing is largely the concern of maybe one or two roles, where you have the software developer who's likely writing the tests and monitoring them. And in a larger organization, you might have a QA engineer or a test engineer who's working with the software engineer to build out the whole testing strategy. Versus in data, where you have, as you mentioned, 3 different roles that are working with the data at its core, but then you also have the consumers of the data and, you know, the product managers who are trying to understand different ways that the data can be used across the company, and, you know, other roles whose minutiae we could probably spend a while getting into. But the nature of issues in data is much more insidious than, say, a small bug in your website, where, you know, when everything is completely broken, it's much more obvious.
[00:27:15] Unknown:
Exactly. You know, we can draw analogies, right, to, like, infra or application monitoring, but exactly as you said, the problem domain is much more complicated, because how data systems work, the amount of people involved, and the flow of data make it much more complicated. I think for us, it's super exciting because there are a lot of new capabilities and features to be built that all help in that, and ultimately in finding issues and making sure we can prioritize and resolve them.
[00:27:49] Unknown:
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours or days. DataFold helps data teams gain visibility and confidence in the quality of their analytical data through data profiling, column level lineage, and intelligent anomaly detection. DataFold also helps automate regression testing of ETL code with its data diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. DataFold integrates with all major data warehouses as well as frameworks such as Airflow and DBT and seamlessly plugs into CI workflows.
Go to dataengineeringpodcast.com/datafold today to start a 30 day trial of DataFold. Once you sign up and create an alert in DataFold for your company data, they'll send you a cool water flask. Digging more into the Soda best practices, what is kind of the happy path of systems that you are targeting initially for being able to enable some of this visibility and data quality checks?
[00:29:05] Unknown:
Yeah. That's quite simple, actually, because on the one hand, we have our open source developer tools, right, and they are there to create observability and do data testing. And they can be used standalone. They're a very good way to get started because they give you a framework. And then for companies that want to take the next step, they can choose to take the Soda Cloud, the platform, in either a SaaS or a PaaS kind of deployment. A PaaS deployment being very important because we deploy it automatically in the customer's VPC so that no data leaves it. Because one of those things in data is, like, we wanna see, for example, failed rows, failed records, when things go bad. So we cannot have that data leave the VPC.
And then if you would only be interested in saying, hey, do my automated monitoring, then typically we have teams just say, well, then I'll use the Soda Cloud. I'll point it to a system, and it will figure that out for me. But teams that are coming more from a testing angle, well, they just use Soda SQL. And we mention Soda SQL a lot because, you know, it's the first developer tool, we've just released it a month ago, and we see kind of a lot of data workloads happening there. But in the future, we'll release Soda for Spark, and that's gonna be very soon, as well as Soda for Streaming. So you can quarantine your data basically as it flows through your systems.
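For a sense of what embedding a Soda SQL scan in a pipeline step might look like, here is a small Python wrapper around the CLI. The `soda scan <warehouse.yml> <table scan yml>` shape and the non-zero exit code on test failures reflect the Soda SQL documentation around the time of this episode, but the file paths are placeholders and the exact arguments should be checked against the current docs.

```python
# Rough sketch of wiring a Soda SQL scan into a pipeline step.
# File paths are placeholders; verify CLI arguments against the Soda SQL docs.
import subprocess
import sys

def run_scan(warehouse_yml: str = "warehouse.yml",
             scan_yml: str = "tables/orders.yml") -> None:
    """Run a Soda SQL scan and stop the pipeline if any test fails."""
    result = subprocess.run(["soda", "scan", warehouse_yml, scan_yml])
    if result.returncode != 0:
        # "No data is a lot better than broken data": fail loudly so the
        # orchestrator does not push broken data downstream.
        sys.exit(result.returncode)

if __name__ == "__main__":
    run_scan()
```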
[00:30:27] Unknown:
And in terms of the quarantining, are you using things like schema validation? What are some of the other heuristics that you can use to determine when to quarantine a particular record? Are you largely focused on quarantining invalid data, or are you also sampling data that passes all of the checks to be able to use for doing automated validation to make sure that you don't break your existing fully functioning systems?
[00:30:53] Unknown:
That's a very important point, because when we talked about this notion of active and passive earlier, the active system, like, where you're testing your data and you quarantine, requires you to change your existing data pipeline. So it requires you to change your code base, which is not something everyone wants to do all the time. The alternative is to say, okay, let's have an automated system look at when our data refreshes. And each time it does, let's collect things like what you said. Like, what's actually the frequency it updates at? What is the volume of data, the row counts that we're processing in this interval or time window? What does the schema look like, and what do we have in terms of semantic types for all of those columns?
Based on those types, what are some of the metrics we can calculate? For strings, of course, that's length. Maybe that's patterns. For numerics, those would be things like your averages, etcetera. And based on all of that, you can do an automated analysis, comparing the latest value that comes in, so the values of the current time window, to the historical baseline.
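As a simplified illustration of that baseline comparison, the sketch below flags a metric value (row count, average, missing fraction, and so on) that falls far outside the history of the same metric. The three-sigma threshold and the example numbers are invented; Soda Cloud's anomaly detection is more sophisticated than this.

```python
# Simplified illustration of comparing the current time window's metric
# to a historical baseline. Thresholds and numbers are made up.
from statistics import mean, stdev

def is_anomalous(history: list[float], current: float, sigmas: float = 3.0) -> bool:
    """Flag the current value if it falls outside `sigmas` standard deviations
    of the historical values for the same metric."""
    if len(history) < 2:
        return False  # not enough history to form a baseline yet
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return current != mu
    return abs(current - mu) > sigmas * sigma

# Example: daily row counts for a dataset, then today's count arrives.
daily_row_counts = [98_500, 101_200, 99_800, 100_400, 102_100, 99_100, 100_900]
print(is_anomalous(daily_row_counts, 100_700))  # False: within the baseline
print(is_anomalous(daily_row_counts, 12_300))   # True: likely a silent data issue
```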
[00:31:59] Unknown:
And that will be one way in which we see users really wanting to tackle it. Digging a bit more into the actual core architecture of the system that you're providing to end users as a service, I'm wondering if you can just talk a bit about some of the ways that that's implemented, some of the tools that you're relying on, and maybe some of the ways that you're actually using Soda to validate the product that you're building. That's where I think it gets really interesting.
[00:32:24] Unknown:
We've done a lot of research to deliver a single central monitoring system. And this is important because collaboration, as we discussed earlier, is often the biggest challenge. Product managers, they need an app, for instance, so that they can build their own monitors and SLAs in a self-service UI. That's because product managers are often less technical. At the same time, data engineers must make the data observable. That is often underestimated, and it requires a totally different approach. Data engineers, they want, for instance, readable configuration files that result in optimized queries, and they want to store the configuration files in the same git repository that they also use for their pipeline code. And they want to control the exact SQL statements that are being executed. And they want to run the tests as part of their orchestrations, and that should be delivered in a white box approach that puts the engineers in control. It also has to fit with all the technologies across the data landscape, like, for instance, Snowflake, dbt, Airflow, data lake technologies like Spark, even streaming solutions like Kafka. So from an architectural perspective, I think those are the main components,
[00:33:40] Unknown:
that we're looking at. What were some of the kind of core challenges that you were trying to build around and some of the biggest decision points in creating the system architecture that you ended up settling on, and maybe some of the evolution that it underwent as you started to actually dig further into the problem and understand some of the bigger challenges that were hiding underneath the surface?
[00:34:05] Unknown:
In the early days, we went quite broad because it wasn't really clear for us what every stakeholder wanted and how they could contribute effectively. So it required countless discussions with companies and our customer advisory boards. But after a while, it became crystal clear that the key to success would be to first empower the data engineer so that they could create observability and test data. And we wanted to make sure that they could also easily hand over some of the responsibilities that they very often inherently get to the right people, being the data owners and the product managers.
So as we started focusing on the engineer, the most challenging aspect was to really create kind of frictionless connectivity. That's why we ended up with this Soda SQL approach. It allows us to leave data where it is and always track the metrics through efficient queries, etcetera. But at the same time, we were working with the product managers, and they wanted this smooth experience. They wanted to click some things together. They didn't wanna dive into the code. And so for this piece, we then worked with an amazing design team in California last summer, many, many, many late nights. But we created an amazing kind of user experience, a UI on top of it. And if you're interested, you can just go and check it out, because you can now sign up on our website and check out the UI alongside the open source libraries.
So that was, I think, in general, kind of the biggest challenges and the iterations. We did a couple of these architectural iterations
[00:35:40] Unknown:
until we got it right. In terms of the actual libraries themselves, you know, releasing them as open source is definitely useful for making them accessible to people who don't necessarily want to buy into the full Soda platform, as well as a sort of bottom up adoption path for being able to gain customers. But there's also already a large existing set of open source tooling for various purposes. And I'm wondering, what were some of the gaps in that open source ecosystem that you're working to target, and what are some of the ways that you have looked to design your system to fit nicely within the ecosystem and play well with others rather than trying to supplant existing tools that other people might already be using? So we have a number of open source projects that we're currently launching, but it's not
[00:36:32] Unknown:
our goal to just add them to the mix. The main goal that we have is to make sure that all of these fit together and fit onto one central platform. That's the main goal. They should seamlessly fit on the central platform. And then another reason why we split it up into several open source projects is that we want to address each technology natively. So we don't want a generic abstraction where it gets really hard to map it to the different technologies. We want to make sure that whichever technology you select, everyone can get up and running in less than 5 minutes. So that's the goal for us. You should be up and running very quickly, but then you know that all of these open source projects collectively connect to the same cloud where the collaboration can then happen. So that's the background motivation that we had to go for this approach. Yep.
[00:37:22] Unknown:
And I think on the interoperability, you make a really good point. I think it's one of the key problems in data management today. So what we do is we basically have a wide variety of technologies that we integrate with, but most of that happens through the cloud. So it's kind of plug and play integrations. For example, you wanna show the data quality issues that you have on a certain BI report. Well, great. Let's push them to Tableau. We can show that directly in your Tableau dashboards. You wanna make sure that when people search for data in your organization, they can see that data is trustworthy and that we're actually solving problems.
Well, great. Then we'll just push the results to the data cataloging system. So that's kind of our thinking. At the core, we wanted to make sure we had very slick, very easy to use libraries that are super powerful and that support communication towards the central place, where we then facilitate a lot of the interoperability in the ecosystem.
[00:38:23] Unknown:
In terms of the more long range challenges of dealing with data quality, you know, there's the point solution of, okay, I've got a test, I can see when something fails, or, you know, I have some validation for this pipeline, so now I can feel safe putting it into production. But as you work on these problems more long term and you start to build up sort of an intuition of where things might go wrong, or maybe start to run up against too much structure or too many restrictions in terms of how you can build these different data pipelines, what do you see as the balancing act on some of the long term considerations for teams who are factoring in data quality as a core concern of the work that they're doing?
[00:39:09] Unknown:
I think after the initial getting started, as you're rolling it out and you are really into the next phases, then one of the challenges that you'll meet eventually is alert fatigue. That's a real threat to any monitoring system, and it can lead to completely abandoning the system altogether. Especially when you rely too much on only anomaly detection, this tends to cause problems. Automatic issue discovery often finds a lot of abnormal data that's actually not a problem at all. The way we deal with this is with a different approach that does not use alert notifications.
It's inspired by Getting Things Done. We sort all abnormal observations in a list according to criticality and then map them to the right people. Users can then just plan and timebox their work of dealing with the most critical anomalies. They process them as far as they can, then they go through the list in the time they want to spend on it. And that's just one of the ways that we've identified to tackle this alert fatigue, amongst a couple of others. Another challenge that we see after a while is the scaling of the domain knowledge. Subject matter experts, they have a lot of this domain knowledge, and that can be very helpful in finding the data issues.
The challenge is that it's actually hard to scale this approach when engineers have to be involved to implement the domain knowledge. So the way we solve this is by adding self-service monitors to our app. Subject matter experts can then manage their own monitors without the round trip to the engineers, and that saves time and enables a lot more of the business logic to be monitored.
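To illustrate the worklist idea in the simplest possible terms, here is a small Python sketch where abnormal observations are ranked by criticality and assigned an owner, so they can be worked through in a timeboxed session instead of triggering individual alerts. The data structure and the criticality scores are invented for the example and are not how Soda represents this internally.

```python
# Sketch of a criticality-ranked worklist instead of alert notifications.
# Fields and scores are illustrative only.
from dataclasses import dataclass, field
from heapq import heappush, heappop

@dataclass(order=True)
class Observation:
    priority: float                       # lower value pops first (we store -criticality)
    dataset: str = field(compare=False)
    description: str = field(compare=False)
    owner: str = field(compare=False)

worklist: list[Observation] = []

def record(dataset: str, description: str, owner: str, criticality: float) -> None:
    heappush(worklist, Observation(-criticality, dataset, description, owner))

record("orders", "row count dropped 40% vs. baseline", "data-eng", criticality=0.9)
record("customers", "unusual spike in NULL emails", "crm-team", criticality=0.4)
record("orders", "schema change: new column 'channel'", "data-eng", criticality=0.2)

# During a timeboxed session, work through the most critical items first.
for _ in range(2):
    item = heappop(worklist)
    print(f"[{item.owner}] {item.dataset}: {item.description}")
```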
[00:40:56] Unknown:
Yeah. That's definitely a useful and worthwhile thing to call out and kind of spend some time on: the scalability and sort of the overall coverage of the types of checks that you want to have in a system. Because as you said, an engineer has something that they care about, where I wanna make sure that the pipeline is running so that I don't get paged at 2 AM to kick it and get it going again. Whereas the VP of sales has a completely different set of concerns, to say, I wanna make sure that my KPIs for my salespeople don't drop below this particular level. And when they do, I wanna know early so that I can try and understand what the root cause of this is. It might not even be an issue with the quality of the data or the validity of the data. It's just something that I care about because, you know, this is a metric or a statistic that has massive business impact for me. And so that's another aspect of data quality: it's not even just when things go wrong with the technology, it's also when things go wrong in the business and needing to be able to get an early warning system for that.
[00:42:01] Unknown:
Yeah. Exactly. That's part of it: who's doing what is, for us, crucial. Right? Data quality is a team sport. Everyone needs to be involved. It's not good when too much responsibility is given to one person, right, because that will create a lot of frustration. So we try to spend a lot of time making sure that we kind of split up the who does what, and then just make sure we can work with each of those teams to go and actually execute on that. And that's the fastest way to get from 0 to where you need to be.
[00:42:36] Unknown:
RudderStack's smart customer data pipeline is warehouse first. It builds your customer data warehouse and your identity graph on your data warehouse with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plug ins make event streaming easy, and their integrations with cloud applications like Salesforce and Zendesk help you go beyond event streaming. With RudderStack, you can use all of your customer data to answer more difficult questions and send those insights to your whole customer data stack. Sign up for free at dataengineeringpodcast.com/rudder today.
Going a bit broader too in terms of data quality and visibility, there's a certain set of concerns that you're focused on at Soda, but there are also other components to the overall platform that are useful to be able to integrate with and that are necessary for a more holistic view of your data life cycle. And I'm wondering what you see as being some of the necessary components of the data platform for making sure that Soda is as valuable as possible? And also, just for people who are concerned with data quality, what are some of the other kind of tangential concerns that they should be thinking about, or other systems that they should be aware of, to help make sure that these efforts succeed across the data organization?
[00:44:00] Unknown:
I think there's a broad ecosystem of tools, really, that we're working in conjunction with. So I think some of them we've covered. Right? So your sources, like Snowflake; your Spark DataFrames; your orchestration, like Airflow; your transformation, like dbt; then your version control systems, like GitHub, right, that bring auditability to your tests and monitors. And then it gets into, let's communicate about incidents. So that's Slack, or it could be Opsgenie for, like, those notifications. And then when an incident actually becomes an issue, then we use issue tracking tools like Jira or ServiceNow. And in the end, all of this, you want to disseminate through the organization, because what you've just done is extremely valuable. You've operationally run kind of an issue management process end to end to solve the problem. And currently, we see that that is not being communicated very well in organizations. So then the next thing we do is we make this information available in your BI tools, in your data discovery tools, so that everyone can see, wow, this is a data product here, look at these datasets, we had issues, we immediately resolved them, these are all the people involved. That's fantastic.
That's how we look at the broader ecosystem, because together, all of these integrations allow teams to discover issues very quickly, communicate about them, prioritize them, and do that from a holistic perspective.
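As a sketch of how a few of those pieces might be wired together, here is a minimal Airflow DAG that runs a dbt transformation, then a data check step, with a failure callback standing in for the Slack or Opsgenie notification. The commands, DAG name, and callback are placeholders, not a prescribed setup.

```python
# Rough sketch: transformation -> data checks -> notify on failure.
# Commands and connection details are placeholders.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

def notify_on_failure(context):
    # Placeholder: push to Slack, Opsgenie, or your issue tracker here.
    print(f"Data incident in task {context['task_instance'].task_id}")

with DAG(
    dag_id="orders_pipeline",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={"on_failure_callback": notify_on_failure},
) as dag:
    transform = BashOperator(
        task_id="dbt_run",
        bash_command="dbt run --select orders",
    )
    data_checks = BashOperator(
        task_id="data_checks",
        # e.g. a Soda SQL scan; a non-zero exit code fails the task so broken
        # data does not reach the downstream consumers.
        bash_command="soda scan warehouse.yml tables/orders.yml",
    )
    transform >> data_checks
```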
[00:45:25] Unknown:
Yeah. The celebration aspect is definitely useful, because there's kind of the issue of the invisible hero: if you're doing your job well, then nobody knows you exist. And so I like the fact that you're able to surface some of this quality information to the business intelligence system where people are looking, so that you can see, oh, actually, there is a lot of work going into it. It's not just that everything runs smoothly. It's that this is a hard problem, and there's a lot of work being done to make sure that it happens. And so I like the fact too that you tie it back to sort of the people involved so that there can be some of that visibility and celebration, rather than just, you know, I'm doing my job, nobody has to know I exist because nothing is broken right now.
[00:46:05] Unknown:
Well, I think the first thing that really struck us has nothing to do with the systems that they tied it to, but merely with the fact that we thought this was only kind of a problem for, let's say, companies of, like, 750 to 1,000 employees plus, like, larger teams. But we quickly saw that there are a lot of smaller, younger companies, think SaaS companies, that are already doing so much with data and that are looking for solutions in this space. Another thing that we saw a lot, and even still today in 2021, was homegrown solutions.
So many data teams still develop some parts of this entirely homegrown. For us, that was a bit striking because there were also so many similarities between all of these systems that they were building. So what we basically said was, look, let's take a step back and let's make sure we build a good kind of framework around this. So I think those were ultimately two interesting points. Another one that's interesting, I think, is the first time that you're in a weekly meeting of data subject matter experts and owners, and they go through your software, looking at all of the different issues, and they start saying, oh, wow, this is something that next time we should prioritize faster, etcetera.
So it becomes kind of this collaborative thing that people are looking at together.
[00:47:28] Unknown:
That was really, really nice to see. Digging a bit more too into kind of the prioritization aspect, you mentioned earlier being able to understand, okay, this quality issue is actually affecting these other systems. I'm wondering what the integration looks like for being able to pull in things like lineage graphs, to understand, okay, this step in this pipeline is actually being used to power this report that's very important, versus this is only used for this other report that only gets looked at once a month, so I can, you know, put it a little bit further back in my backlog. And just some of the ways that you're able to surface some of the criticality
[00:48:03] Unknown:
of a given issue. Lineage is something that we've already integrated with, with tools that provide lineage, which is really nice to see, because then you see your datasets kind of move through your overall flow, and you can start to see aggregate scores of quality. That's a very interesting thing to see, and it helps teams, of course, especially your analytics engineer, who's one level deeper and needs to really understand all the dependencies. And I think there's gonna be a lot of new innovation, and a lot more of your data orchestration type tooling will have a lot of those lineage capabilities.
So it's actually something that we should add to that overall list of what we integrate with in the ecosystem. From a Soda perspective, what's interesting for us is that as we collect so many metrics across all of your datasets, we don't necessarily need to go and parse out all the query logs and see how things match together. Or we don't necessarily need to have teams kind of design these lineage flows retroactively. We can just look at how these metrics are degrading as data flows through the system, like data being processed into another dataset and then into another one. And then we see, based on the time sequence as well as the metric degradation, that something is off. So that's one way also to kind of infer data lineage almost, which is a capability that we're thinking about providing.
Today, the easiest way of doing lineage or doing impact assessment is just having a good practice that when you build a data product, you set up your SLAs. Because if you have your SLAs, you can immediately see what the impact is of certain issues in datasets.
[00:49:41] Unknown:
And in terms of your experience of building Soda and working to understand the problem space and working with your customers, what are some of the most interesting or unexpected or challenging lessons that you've learned in that process, both from the technology and the business side? So in the very early days, the challenge was really for us to prioritize across an extremely broad range of feature requests from data engineers,
[00:50:04] Unknown:
engineering managers, for example, data governance teams, to BI leads, product managers, you name it. And as your thinking evolves, of course, your architecture evolves. So we went through 3 major architectural iterations in almost 3 years. Out of the third iteration came Soda SQL, alongside also the framework for how we're gonna tackle the same problem space for Spark and then for streaming. It's really been great to now see that third iteration. It's incredible to see how big of an impact sometimes minor changes have on the adoption.
[00:50:44] Unknown:
If I can add an unexpected or interesting lesson, there was one during our fundraising process recently where I was seriously surprised. The investors explained that they had mapped out their investments in the data landscape, and they pointed out the empty box that says data monitoring. We expected the normal procedure, where a startup pitches and then gets grilled, but it turned out that most investors started pitching to us instead. Back when we started, I totally didn't expect that to happen.
[00:51:18] Unknown:
Definitely interesting to see some of the different strategies that investors take, and it varies by investor too. Like you said, sometimes they wait for a company to come to them, but a lot of investors, particularly in technology and especially in data, are actually chasing after companies, because there are so many out there and it's such a high-value space that if they're not very early to the game, it's easy for them to miss out on a potential investment, or not be the right fit because they haven't built as much rapport with the company and the founders early on. The best investors do their homework so well. They have a broad network of heads of data, they start doing interviews, and they come up with entire presentations with all the data, saying this is the key problem. It really depends investor by investor, but for sure, and to your question earlier, there's a lot happening in this space.
[00:52:10] Unknown:
It's because many of them have really pinpointed this as a problem for which there are no good solutions out there yet today.
[00:52:27] Unknown:
For people who are looking to improve their capabilities for monitoring their data, understanding the overall quality, and adding more useful checks and validations, what are the cases where Soda is the wrong choice?
[00:52:34] Unknown:
We sometimes refrain from using terms like data quality management, and let me explain why. On the one hand, it sounds dull; on the other hand, some would think that we do things like data matching, data cleansing, et cetera. In the past, data quality was predominantly linked to slowly changing dimensional data, like your customer dataset, for example, and to operations like matching and merging records. Instead, we have focused Soda on data monitoring in a production context, so we very often deal with many other data types, like transactional data.
I do think that in the future this could be an area where we, for example, open source a library with common data quality cleaning operations. But today we see that as a bit outside our core flow, which really goes from issue discovery to communication and prioritization.
[00:53:29] Unknown:
As you continue to explore this space and build out the product and the business, what do you have planned for the near to medium term that people might wanna keep an eye out for?
[00:53:39] Unknown:
So, yeah. We just released Soda Cloud and Soda SQL, and that was based on the work we've done with our early customers. Initial traction was beyond our expectations, because we already had significant contributions to our open source project in the first month. Over the course of this year, we want to double the team while maintaining our strong culture, so if you're listening and you're equally excited about the data monitoring space, definitely check out the careers page. From a product perspective, we'll be releasing more native connectivity for Spark and streaming, and our data science team will be releasing new intelligence features.
[00:54:22] Unknown:
And the platform team will add additional ways to monitor data. Beyond this year, I think we'll be expanding our reach by addressing more of the needs of the broader organization. Think of your head of data and your CTO: we've done extensive work with them, and we know they need more of these oversight capabilities as well, to help steer programs internally. So beyond this year, that will be our focus.
[00:54:47] Unknown:
Alright. Well, for anybody who wants to follow along with the work that you're doing or get in touch, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get each of your perspectives on what you see as the biggest gap in the tooling or technology that's available for data management today. Wow. Well, there's been a lot of innovation in the last year or so when it comes to things like data ingestion, transformation, and discovery.
[00:55:11] Unknown:
And of course, we wouldn't be here today if we didn't say that the next thing is really monitoring, testing, and collaboration for data teams. But I think the most interesting observation we've made, and the biggest challenge for the community as a whole, is what you mentioned earlier on the podcast: it's all about how these different technologies work and play together. We've seen a sprawl of new tools in data management, and in terms of fragmentation it's similar to the marketing tech landscape, where there are also loads of different tools that, in the early days, didn't really work well together.
In marketing tech, what we're very often looking for is our Customer 360, and we have technologies like Segment, for example, to help us do that: a hub of customer data that allows you to see and share it across these different systems. Similarly, I think the biggest challenge for data and data management is Data 360. If the health of data is suffering, how can we make that visible everywhere? How can we make people understand, in the tools that they're using, whether data is really trustworthy or not? That's almost a plug-and-play interoperability between metadata tools.
And that's not easy to do. So for us, I mean, we'll keep pushing for more interoperability because it's super important to make sure that silent data issues will ultimately become a thing of the past.
[00:56:38] Unknown:
Well, thank you very much for taking the time today to join me and discuss the work that you've all been doing at Soda. It's definitely a very interesting product, and I like the approach that you're taking with the open source tools to help expand the overall set of options for people who want to do things themselves, while also giving them a way to gradually adopt what you're building. So definitely appreciate all of the time and energy you're putting into the data quality space, and I hope you enjoy the rest of your day. Thank you very much, Tobias. Thanks a lot, Tobias. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Guest Introduction
Tom's Background and Career Journey
Maarten's Background and Career Journey
Transition from Software to Data Engineering
Overview of Soda and Data Quality Management
Core Problems in Data Quality
Capabilities and Approach to Data Quality
Current State of Data Quality Ecosystem
Design Philosophy and User Roles
Soda's Tools and Platform
Core Architecture and Implementation
Long-term Considerations in Data Quality
Broader Ecosystem and Integration
Lineage and Criticality of Data Issues
Lessons Learned and Challenges
Future Plans and Developments
Biggest Gaps in Data Management Tools
Closing Remarks