Summary
Regardless of how data is being used, it is critical that the information is trusted. The practice of data reliability engineering has gained momentum recently to address that need. To help support the efforts of data teams, the folks at Soda Data created the Soda Checks Language and the corresponding Soda Core utility that acts on this new DSL. In this episode Tom Baeyens explains their reasons for creating a new syntax for expressing and validating checks for data assets and processes, as well as how to incorporate it into your own projects.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
- Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.
- Prefect is the modern Dataflow Automation platform for the modern data stack, empowering data practitioners to build, run and monitor robust pipelines at scale. Guided by the principle that the orchestrator shouldn’t get in your way, Prefect is the only tool of its kind to offer the flexibility to write workflows as code. Prefect specializes in gluing together the disparate pieces of a pipeline, and integrating with modern distributed compute libraries to bring power where you need it, when you need it. Trusted by thousands of organizations and supported by over 20,000 community members, Prefect powers over 100MM business critical tasks a month. For more information on Prefect, visit dataengineeringpodcast.com/prefect.
- Data engineers don’t enjoy writing, maintaining, and modifying ETL pipelines all day, every day. Especially once they realize 90% of all major data sources like Google Analytics, Salesforce, Adwords, Facebook, Spreadsheets, etc., are already available as plug-and-play connectors with reliable, intuitive SaaS solutions. Hevo Data is a highly reliable and intuitive data pipeline platform used by data engineers from 40+ countries to set up and run low-latency ELT pipelines with zero maintenance. Boasting more than 150 out-of-the-box connectors that can be set up in minutes, Hevo also allows you to monitor and control your pipelines. You get: real-time data flow visibility, fail-safe mechanisms, and alerts if anything breaks; preload transformations and auto-schema mapping to precisely control how data lands in your destination; models and workflows to transform data for analytics; and reverse-ETL capability to move the transformed data back to your business software to inspire timely action. All of this, plus its transparent pricing and 24×7 live support, makes it consistently voted by users as the Leader in the Data Pipeline category on review platforms like G2. Go to dataengineeringpodcast.com/hevodata and sign up for a free 14-day trial that also comes with 24×7 support.
- Your host is Tobias Macey and today I’m interviewing Tom Baeyens about Soda Data’s new DSL for data reliability
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what SodaCL is and the story behind it?
- What is the scope of functionality that SodaCL is intended to address?
- What are the ways that reliability is measured for data assets? (what is the equivalent to site uptime?)
- What are the core abstractions that you identified for simplifying the declaration of data validations?
- How did you approach the design of the SodaCL syntax to balance flexibility for various use cases, with structure and opinionated application?
- Why YAML?
- Can you describe how the Soda Core utility is implemented?
- How have the design and scope of the SodaCL dialect and the Soda Core framework evolved since you started working on them?
- What are the available integration/extension points for teams who are using Soda Core?
- Can you describe how SodaCL integrates into the workflow of data and analytics engineers?
- What is your process for evolving the SodaCL dialect in a maintainable and sustainable manner?
- What are the most interesting, innovative, or unexpected ways that you have seen SodaCL used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on SodaCL?
- When is SodaCL the wrong choice?
- What do you have planned for the future of SodaCL?
Contact Info
- @tombaeyens on Twitter
- tombaeyens on GitHub
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan's active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today, that's a t l a n, to learn more about how Atlan's active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork, and Unilever achieve extraordinary things with metadata.
When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends at Linode. With their new managed database service, you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes with automated backups, 40 gigabit connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey. And today I'm interviewing Tom Baeyens about Soda Data's new DSL for data reliability.
So, Tom, could you start by introducing yourself for folks who didn't listen to your past appearance on the show and give a bit of a reminder about how you first got involved in data? Sure thing,
[00:01:46] Unknown:
Tobias. Thanks for the invitation. I'm an enthusiastic fan of the podcast, so I'm really glad to be here. So I'm Tom Baeyens. I'm a cofounder and CTO of Soda. I started a long time ago as a software engineer building workflow engines in open source. And as such, I did a lot of SQL. Then I went all the way to NoSQL, so eventually, I was glad to see everything coming back to SQL again. And when I started to focus on data engineering more specifically, that's when I met my cofounder, Maarten, who was at Collibra at the time. So his background in data management led us into data quality, as it was a natural fit with my earlier work in open source. Since then, a lot of other companies have also confirmed that they see this space as the next frontier in data.
And that's the environment I love. It's where things are not yet settled, and where we're pushing the boundaries of what we can do with data.
[00:02:48] Unknown:
In terms of the Soda Checks Language and the DSL that you're building and the library that you're building to be able to take advantage of that definition, I'm wondering if you can just give a bit of an overview about what it is that you're building and some of the story behind why you decided that this was a project that was worth investing in.
[00:03:06] Unknown:
Our driver has always been to build a holistic approach to data quality. We did not really start with the intention of building a language at first, but it quickly became clear that we needed a language for describing what good data looks like. And before diving into the language itself, let's set the scene a bit with some context so that you can see how the language fits into our overall data quality approach. Most data teams, they know that data issues will occur regularly in analytical data. Let me illustrate this with a quick example. Like, imagine you're a bank and there's a legal requirement to report tax information of customers.
So the bank has built a data application that produces a financial compliance report, and that's running fine for a while. Now someone in the mobile application team, they do a change in the operational database. And as a result, a crucial field for the tax information has been lost. So if you miss this in your financial reporting, the company risks legal action from the tax authority. So why this example? It's because it shows the importance of data quality for the business. Data is not something that's only handled by one engineering team. Data needs to be connected to the business.
This example also shows the overall goal of data quality. As data applications are running in production, how can you ensure that changes to the data do not break those data applications? How do you ensure that we can trust the analytical data that flows into the data products? That's the pain a lot of companies are dealing with right now. So when data teams want to take action on data quality, they need to set up systems and processes for three distinct steps. The first one is really all about finding the data issues, being the first to know. When data analysts build reports, they want to express their assumptions about the data that they consume.
And when engineers build pipelines, they want to express their assumptions required for the correct operation of those pipelines. Data stewards, they take responsibility for data in source systems, like, for example, a Salesforce. They wanna get notified if people don't fill in the forms as agreed. SodaCL, the language, is a common foundation for all these use cases to start finding data issues. Now in finding data issues, you have to pay close attention to not generate too many alerts, because it leads to alert fatigue: if there are too many alerts that are not acted upon, people start ignoring the alerts and the data quality initiative fails. So a good signal to noise ratio on data quality alerts is a prerequisite.
So SodaCL plays an important role in making that happen. If you wanna be the first to know about data issues, you need to make sure it's easy for people to express what good data looks like. And that's the crux to making this first step scale. Then, after finding the issues, there is the second step of root cause analysis. For every alert that's raised, it needs to be analyzed. You'll need to bring together all kinds of information that can help find this root cause. This part of data quality is also known as observability. Examples of information that can help find the root cause are all kinds of data metrics. That's one. In our case, they mostly come from the check diagnostics.
All issues found by checks have explainability built in. So that includes the metrics, but it can also include record-level failed and passed rows, for instance. Schema changes can help diagnose. And apart from the data and the metadata, it's useful to correlate this with pipeline code commits, for example, or pipeline execution failures, because those will also help you find the root cause. Then the last and third step is resolving the data issues. Once the root cause has been identified, it needs to be fixed. And as data issues happen frequently, you need systems and processes to drive that collaboration between the teams.
More than half of the data issues actually originate at the source, inside the production systems. So dealing with those data issues is most often not contained within a single engineering team; you have to establish these workflows as part of the process. All too often, we see bad data being extracted into spreadsheets and then being sent around in emails. That's not ideal. And maybe one more thing: there's a clear analogy here with software engineering. That's the world that came before, so it's good to see that these principles are now taking hold as well.
So in order to get reliable software, there are two main ingredients that are quite common right now, which are test suites and observability. If you want to keep releasing software with confidence, you need good automated test suites. And if you want to be able to diagnose your issues in production, you need observability tools like a Datadog or a New Relic. And the same is true for data. Data testing brings the issues to the surface, makes sure you are the first to know, and data observability helps you diagnose the root cause.
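To make "expressing what good data looks like" a bit more concrete, here is a rough, illustrative sketch of a few SodaCL checks for the tax-reporting example above, held as a Python string so they could later be handed to a scan. The dataset and column names (customers, tax_id, customer_id) are invented for illustration, and the exact syntax should be confirmed against the SodaCL documentation.

```python
# Illustrative only: a handful of SodaCL checks expressing "what good data looks
# like" for the bank/tax example. Dataset and column names are made up.
sodacl_checks = """
checks for customers:
  - row_count > 0                     # the table is never empty
  - missing_count(tax_id) = 0         # the crucial tax field is always present
  - duplicate_count(customer_id) = 0  # no duplicate customer records
"""
```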
[00:08:58] Unknown:
In terms of the scope of concerns that the DSL that you're building and the library that you built around it is intended to address, I'm wondering if you can talk to which stages of the data life cycle it is designed around, and also whether it is intended to be something that is stateless, where it will run a check against a particular stage and do validations against that, or if it is intended to be stateful, where it is maybe aggregating checks across different life cycle stages so that you can understand whether a kind of sequence of validations are holding true based on the set of transformations that are being executed on a particular record or batch of records?
[00:09:36] Unknown:
Let me break it down a bit. So the first part is that we made sure that both analysts and engineers can express what good data looks like with SodaCL. That's quite important. Then Soda Core is the library that can execute the SodaCL language. It's implemented as an open source lightweight library, and that gives the engineers the flexibility to embed data testing straight into their pipelines and then stop the pipeline if needed. But also the analysts, they find it easy to read and write these check files. So with SodaCL, they don't have to ask the central data team to implement the checks for them. They can now write their own checks. That's quite crucial as background: data quality has to follow what BI, business intelligence, did. Like, who remembers the days that for a new report, you had to talk to the central data team? That was cumbersome. So data quality has to become self-serve as well to overcome the scalability problem.
And then SodaCL, as the name says, is a language to write checks. And a collection of checks together forms a data contract, or an agreement as we call it. A data contract is what you need as the data team grows bigger. This is actually where a lot of companies are struggling. It starts small and manageable with a handful of pipelines. Engineers leverage a pipeline's intermediate results, for instance, to build the next data pipeline. And the result is long pipelines with many unclear dependencies. And this gets very hard to maintain.
The cause is really these unclear data pipeline dependencies. This interconnected mess of pipeline dependencies is similar to spaghetti code in the old days of software development, so I call them the spaghetti pipelines. And this is typically what happens when the team grows in size and there are more data products being produced. And the concept of data as a product, as outlined in the data mesh, really helps to break this down and scale the data team. So instead of long spaghetti pipelines, the analytical team is split into domains, and each domain can then be a team.
Between the teams, the focus is on these handovers. That's the data as a product. When data is being handed over, that's where the focus needs to be to really productize these datasets. And as a producer of data, you should make sure that your data product is discoverable, that it's documented, and also that it's monitored with a data contract. So that's, in a nutshell, part of the strategy to tackle that complexity as your team grows. Yeah, maybe one more thing. A SodaCL agreement is an executable data contract, actually, that verifies at runtime if the data you produce is as you expect.
And so that's why it's crucial during those handover phases. An agreement or a data contract can also be used to monitor individual usages. So as a consumer of data, you can state, like, we'll be using this data in this specific way, and these are the requirements that we have for this data. So that later, if any of these assumptions fail, then that particular team can be informed about the breach or data issue.
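As a sketch of how an engineer might embed such an agreement in a pipeline step and stop the pipeline when it is breached: the Scan methods below follow the soda-core Python API as we understand it, but the file names, data source name, and exact signatures are assumptions and may differ between versions.

```python
# Sketch: running a SodaCL agreement (an executable data contract) inside a
# pipeline step with soda-core. File names and the data source name are
# illustrative assumptions.
from soda.scan import Scan

scan = Scan()
scan.set_data_source_name("warehouse")                       # matches configuration.yml
scan.add_configuration_yaml_file("configuration.yml")        # connection details
scan.add_sodacl_yaml_file("agreements/orders_contract.yml")  # the agreement / contract
scan.execute()

# Raise if any check failed, so the orchestrator marks this step, and the
# downstream consumers of the data product, as failed.
scan.assert_no_checks_fail()
```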
[00:13:20] Unknown:
In terms of the kind of broader scope of the ecosystem, I'm wondering what you looked to as far as other existing tools and practices, and what was, I don't know if lacking is the right term, but what are the pieces that were absent, either in terms of being a cohesive whole or just entirely missing from the ecosystem, that made the Soda Checks Language and Soda Core a necessary and useful contribution to that space. I'm thinking in particular in terms of things like Great Expectations and some of the different kind of product focused data quality suites, and kind of what the role of SodaCL and Soda Core is given the existing set of technologies that are available.
[00:14:08] Unknown:
Yeah. We definitely looked very closely at what's out there, and there are two things that we wanted to merge. And one of them is, as you mentioned, Great Expectations. It's a programmatic approach to data testing. And for us, the key behind scaling data quality towards the entire team is to make sure that analysts and less technical people can also manage their own checks and their own data quality expectations. That was the driver for us, because we realized that analysts, data stewards, and the less technical are often technical enough to do some SQL or to maybe hack together a Python script, but they're not technical enough to get Python code into production.
That's often a step too far, and that would actually require them to hand over or to communicate to engineering what they wanna get implemented. So that's one of the things we looked at, and that's where we saw the bottleneck happening: we really needed to make this self-serve for analysts and the broader data team to get the business data quality checks involved there as well. And then the other thing we looked at was the older systems. Data quality tools existed a long time ago, which kind of validates that this has been around. But there, it didn't really scale. And that was part of the same thing: it was also a technical approach to how these data quality checks were actually realized and implemented; it wasn't really done as self-serve. And by now, the technology landscape has completely changed, and the amount of data and people involved has also grown. So that set us down the path of exploring SodaCL as a language for all of these use cases.
[00:16:02] Unknown:
Data engineers don't enjoy writing, maintaining, and modifying ETL pipelines all day every day, especially once they realize that 90% of all major data sources like Google Analytics, Salesforce, AdWords, Facebook, and spreadsheets are already available as plug and play connectors with reliable, intuitive SaaS solutions. Hevo Data is a highly reliable and intuitive data pipeline platform used by data engineers from over 40 countries to set up and run low latency ELT pipelines with zero maintenance. Boasting more than 150 out of the box connectors that can be set up in minutes, Hevo also allows you to monitor and control your pipelines. You get real time data flow visibility with fail safe mechanisms and alerts if anything breaks, preload transformations and auto schema mapping to precisely control how data lands in your destination, models and workflows to transform data for analytics, and reverse ETL capability to move the transformed data back to your business software to inspire timely action.
All of this, plus its transparent pricing and 24×7 live support, makes it consistently voted by users as the leader in the data pipeline category on review platforms like G2. Go to dataengineeringpodcast.com/hevodata today and sign up for a free 14 day trial that also comes with 24×7 support. As far as the design of that DSL, what was your approach for identifying what concerns needed to be addressed in it, how to structure it, why you chose YAML as the expression format? It's always the question when somebody adds another YAML DSL. And just the overall process of thinking through what the available syntax should be, how to constrain it appropriately, but also allow for enough flexibility and expressivity, and the kind of iteration process of going from, this is a problem that we want to solve, to, this is how we've decided to solve the problem and structure the language.
[00:17:53] Unknown:
Yeah. That's a great question. Thanks. I'll first dive a bit into the concepts and then gradually go into the language design decisions that we made. So first of all, it's essentially a collection of checks that we want to model, or that you want to author. So for every aspect that you want to check, there's a specific check type, and that was in large part driven by our community, because they really give the input of, this is the type of check that I wanna run on my data. So for every type of check, we were looking for the best possible syntax for how we can express this so that it's also easy to read. That was also very important.
A major improvement we made over the previous, initial versions of the language was that we allowed the organization of the files to be different, so that we now support many-to-many between check files and datasets. Before, we had one file per dataset, and now you can group your checks on different datasets in files as you want. And that was crucial for analysts and also for the use case of data contracts, where one data contract might span multiple datasets. And another concept that we added into the language was the filters. Filters allow you to express checks that only have to be executed on a partition of the data.
So that's also helping the language there. It's an important concept that needs to be in there. And then in terms of the design of the language itself, we actually started from a blank sheet of paper. We tried various alternative styles without any constraint, even before we decided on YAML. But it turned out that what we eventually produced, if we wanted to make it compact and readable, was very close to YAML anyway. We said, rather than having something that looks like YAML and is not really YAML, that's gonna lead to problems. So that was the background to deciding that we'd better take YAML as the basis and then layer the language in there. That probably also helps a little bit with the tooling around it, so that if people have a YAML editor, they can much more easily start working with this.
And then the rest of the design was always centered around readability. That's front and center for us. So this leads us to group all the checks of a certain dataset, or dataset partition actually, together. That really helps the readability. The rest is just working out the check configuration details for each check type, and from there, the rest of the language flowed very naturally.
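A rough illustration of the grouping and filter concepts described here. The dataset names, columns, partition name, and the filter condition are invented, and the filter syntax is approximated from the SodaCL documentation rather than quoted from it.

```python
# Illustrative only: one check file covering two datasets, with a filtered
# "daily" partition on one of them. All names and the WHERE clause are made up.
sodacl_checks = """
filter orders [daily]:
  where: created_at >= DATEADD(day, -1, CURRENT_TIMESTAMP)

checks for orders [daily]:
  - row_count > 0
  - missing_count(customer_id) = 0

checks for customers:
  - duplicate_count(customer_id) = 0
  - freshness(updated_at) < 1d
"""
```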
[00:20:45] Unknown:
In terms of the tooling that you mentioned, 1 of the things that's always nice to have, and some people find it a requirement, is when they're actually working in the language and they're in their editor environment, being able to have sort of syntax highlighting, you know, syntax suggestions and just editor help around how they're actually writing that code. And I'm wondering what level of investment, if any so far, has been made in being able to define some of that syntax and allow for linting and helpers in the overall process of defining a set of checks that you want to write using this DSL?
[00:21:23] Unknown:
There are two angles to this, and there are a lot of different people wanting different styles of that kind of support. Right? And we focus on two use cases. So first of all, there's the engineer. They want to do it in their editor as part of building and coding their pipelines. So we made sure that the files can just be added into any code repository and that you can read them from there, so that they can use their favorite editor to do this. And there, we didn't initially focus on building support for code completion and all that, but we focused on making sure that you can run this and that you get great feedback as to where your error is and what exactly is going on. So it is easy to do a test kind of dry run of the scan, so that you get parsing feedback and very clear highlights as to where the problems are and what exactly is going on. And then the second part is, when we are thinking about more advanced help towards editing, that's where we look at our Soda Cloud offering.
That's where we want to build an experience which is kind of inspired by an IDE way of working, which is, you write and then you try it out. And that's what we want to do on Soda Cloud for the analysts and for the less technical users. That experience is mostly built on a round trip towards the actual data, so that you can have a full round trip of testing. That's quite key to making it self-serve for analysts: they can write some YAML, they can take some snippets, and there's some help in that editor as well. But then the very first step, and that's really crucial, is that we can send it to your data. We can send the check files to your data, we can run them, and analysts get immediate feedback as to whether the check runs okay, and then also the results of the check, whether that's as they expected, before they actually put these checks into production.
[00:23:22] Unknown:
As far as the implementation of the SodaCore library, I'm wondering if you can talk to how that's designed, how it's implemented, and some of the evolution that you went through as you were defining the DSL and then building the utility that was intended to operate on that syntax and specification?
[00:23:43] Unknown:
Thanks. That's a great question, and it gets me super excited, because it's often the work under the hood that doesn't really get exposed, but this was nontrivial and very exciting to work on. So basically, it's Soda Core itself, the core engine, that drives the interpretation of the language. It's really implemented with performance and compute cost in mind. So first, all the checks are parsed, and then all the metrics are computed that are necessary for these checks. And often, it's the same metric that's used in many checks. So we first ensure that these are merged so that you compute every metric only once. And then we compile the minimal number of queries. This avoids the pitfall,
in homegrown solutions for instance, of having one query per check. This strategy has helped our customers save a tremendous amount of compute cost. And this is, of course, often under the hood and not always noticed, but yeah, that was quite a feat. That's the stuff I'm proud of, that we did this.
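A purely conceptual sketch of that idea, not Soda Core's actual internals: collect the metrics that the checks need, de-duplicate them, and fold them into a single aggregation query per dataset.

```python
# Conceptual illustration only (not Soda Core's real implementation): several
# checks can rely on the same metric, so each metric is computed once and all
# metrics for a dataset are compiled into one query.
checks = [
    ("row_count > 0", "row_count"),
    ("missing_count(email) = 0", "missing_count(email)"),
    ("missing_count(email) < 10", "missing_count(email)"),  # same metric, reused
]

unique_metrics = {metric for _, metric in checks}  # each metric computed only once

sql_for_metric = {
    "row_count": "COUNT(*)",
    "missing_count(email)": "SUM(CASE WHEN email IS NULL THEN 1 ELSE 0 END)",
}
query = (
    "SELECT "
    + ", ".join(sql_for_metric[m] for m in sorted(unique_metrics))
    + " FROM customers"
)
print(query)  # one aggregation query serves all three checks
```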
[00:24:56] Unknown:
In terms of the extensibility of the soda core project, just the terminology of it being core implies that there are other things that can and will be built around it. And I'm wondering if you can talk to the interfaces and extension points that you designed into that utility and some of the ways that you envision it being extended or integrated into other projects.
[00:25:18] Unknown:
The core, essentially, is a Python library, and it is open source, so you can embed it in all sorts of ways. And that's the engine that I just explained how it works: it compiles queries, and it runs the checks. Then, also part of the open source core project is a CLI, and it's easy to embed this into, for instance, a Docker container. These are all ways to get it into your Airflow and into your orchestration tools. That's all towards the engineers, making it very easy for them to adopt this. And then the other way that we extended this, in terms of making the difference between core and the rest, is that we have a managed version of Soda Core.
That's our agent. The agent runs Soda Core in your organization next to your data, so the data can always stay where it is. And from there we can connect to Soda Cloud to actually get the whole self-service experience, and the monitoring, the alerting, and the collaboration aspects go through Soda Cloud. So that's the relationship between Soda Core and Soda Cloud.
[00:26:29] Unknown:
For the workflow of somebody who is using the Soda Checks Language and Soda Core, can you talk to just the ways that it fits into their workflow and some of the places in their development cycle that they will be interacting with the checks language, either as somebody who's actually writing it, or even as a consumer who is trying to understand what are the constraints and validations that are being performed on this data asset that I'm consuming?
[00:26:59] Unknown:
Typically, it's embedding into an orchestration tool. That's the most common way it's embedded. Then, each time your pipeline produces new data, that's when the checks are executed, typically before or after a dbt run, for instance. That would be quite typical. And then it's part of the orchestration solutions, like an Airflow or a Dagster, that's where you're gonna enter your checks into. This is indeed something that we had to build a language for to enable this. There's the split between who is actually running the checks and who is authoring them. There's a data engineer building this into their pipelines, and then the people authoring their checks. Sometimes it's the same person, and then it's all self contained, self controlled: an engineer says, I write my checks on my pipeline, and I embed them here in my DAG or in my orchestration tool. But sometimes it's also the case that you have other people, say an analyst, that are just writing a bunch of check files. They get centralized in a folder, for instance. And then the engineer that runs the checks, or runs the scan, points to that folder to say, take all the check files from this folder and run these checks at this stage in the pipeline.
This is gradually how we got to see the value of doing it as a self-service in our cloud, because that's kind of the flavor that we've moved to. So if you have data agreements, for instance, or data contracts, you can manage those in Soda Cloud. You could consider those as a bunch of check files there, which are actually agreements, and these then get executed on a certain schedule or embedded into the pipelines.
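A sketch of that folder-of-check-files pattern inside an Airflow task. The TaskFlow decorator is standard Airflow 2, but the paths, data source name, and the soda-core method names are assumptions for illustration and may differ by version.

```python
# Sketch: an Airflow task that picks up every check file analysts have dropped
# into a folder and runs them as one scan, right after the transform/dbt step.
from pathlib import Path
from airflow.decorators import task


@task
def run_soda_scan(checks_dir: str = "include/soda/checks"):
    from soda.scan import Scan

    scan = Scan()
    scan.set_data_source_name("warehouse")
    scan.add_configuration_yaml_file("include/soda/configuration.yml")
    for check_file in sorted(Path(checks_dir).glob("*.yml")):
        scan.add_sodacl_yaml_file(str(check_file))  # analyst-authored check files
    scan.execute()
    scan.assert_no_checks_fail()  # fail this pipeline stage on bad data
```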
[00:28:43] Unknown:
As far as the design of the language, you mentioned that you wanted it to be something that is accessible beyond just developers. And so I'm wondering what are some of the ways that you tried to encapsulate some of the concepts to make it clear where somebody can look at a file that contains a set of checks and then be able to intuit what the expressed intent is supposed to be, without even necessarily having to read through all the specifics or read through the entire file, and some of the ways that the language is structured to be able to make it scannable, particularly as you move from a small use case to I have thousands of assets, which means I have at least dozens of checks files with maybe hundreds of validations, and being able to make it so that it is something that is useful without being overwhelming.
[00:29:35] Unknown:
So on a check file level, the first thing we did was make sure that we have grouping of those checks, grouping by dataset and by partition. That's the first thing. And then we also made sure that the language is compact, so that you don't have these checks spread out over a gazillion lines and have to scroll through endless files, so that in one screenshot, you can already see a very large overview of all the checks involved. That's also quite important there.
[00:30:07] Unknown:
And then the other piece is as you scale the usage of SodaCL, you know, as with any programming language, there need to be some ways to manage reusability and modularity and being able to kind of encapsulate concerns. So different languages have different ways to do it, usually with some sort of namespacing. And so I'm wondering what your thoughts have been along how to manage projects written in SodaCL as they scale in terms of size and expressed complexity, and some of the ways that you are working with some of the early users of the project to be able to understand what are some of those kind of seams or points of division or units of composability that you want to be able to support?
[00:30:56] Unknown:
Yeah. So, actually, this is interesting, because initially, we had the idea of, look, we need some kind of reusable construct in there. And then it turned out, two things. First of all, when this reuse was happening, it was also often, how do you say, working against them, because then someone changed something in the common package, and it wasn't intended in the second use case, for instance. And it wasn't properly understood that these changes could happen. So, two things. If it's on an engineering level, then reuse is actually better organized on a file level. So then engineers know, I can have one file with the central checks. That's the one I always load. And then for this particular use case, I have a second file with customizations and with extra checks on top of that. So this is the engineering level. Right? Engineers can actually just reuse files. And then when you go to the analyst side, that's where we see it more as a bad practice to leverage them, because typically it's for different usages. So if you have different teams or different data products using the same dataset, you actually want to capture those specific usages as separate check files so that they can also evolve separately.
And the scans just make sure that all these checks get merged together, so there's no performance penalty to pay there.
[00:32:22] Unknown:
Prefect is the data flow automation platform for the modern data stack, empowering data practitioners to build, run, and monitor robust pipelines at scale. Guided by the principle that the orchestrator shouldn't get in your way, Prefect is the only tool of its kind to offer the flexibility to write workflows as code. Prefect specializes in gluing together the disparate pieces of a pipeline and integrating with modern distributed compute libraries to bring power where you need it, when you need it. Trusted by thousands of organizations and supported by over 20,000 community members, Prefect powers over 100,000,000 business critical tasks a month.
For more information on Prefect, go to dataengineeringpodcast.com/prefect today. That's prefect. As far as the kind of collaboration aspect of data and the fact that this is designed to be able to be accessible and usable by the different stakeholders and approachable by people who aren't doing the engineering necessarily, what are some of the ways that you see the Soda Core and SodaCL utilities reflecting in the ways that the work is done as it spans the different roles in a data life cycle, and some of the ways that data teams think about how to structure their work and how to kind of build confidence and trust in the work that they're doing and the information that they're providing to the organization?
[00:33:49] Unknown:
First of all, the language really helps to distribute the work. It creates a much clearer picture of who has to do or contribute what to that overall picture of what good data looks like. So it's the analysts that can, with self-serve, build up their part of the picture, which is usually from the usage perspective. Then there are the engineers that can contribute their part, which is what is needed from the data pipeline operations perspective. They can add that part there. And then, as issues come out of this, and this is where the agreement is quite important, the agreement states that when some problem comes out of the checks in this agreement, there's also a workflow associated with it to start dealing with this. So that's the kind of overall workflows and collaborations that we facilitate and that are based on this language.
[00:34:47] Unknown:
Given the fact that this is still a very early project and a young language and something that is intended to be used by the broader community and that you're obviously hoping to see adoption and growth for. I'm curious how you're thinking about the overall plan for being able to manage the kind of growth and evolution of the language and being able to introduce new syntax and how to do that in a way that is understandable by the community and approachable by the community while being maintainable and sustainable for the core engineering team that's responsible for it? We already have, like, a couple of iterations
[00:35:26] Unknown:
and incorporated already a massive amount of feedback from the community; it's been a lively community for a while now. So in terms of actual big changes to the language itself, of course, there will always be extensions and small changes left and right, but that's not where we anticipate the next big step. The biggest step that we're gonna take next is extending our self-serve and the authoring experience from the cloud, making sure that gets extended. So beyond what we have right now, which is really the testability of your checks in the cloud, towards supporting it with check suggestions, or snippets, or live templates that help you build checks much quicker. That's the track where we see more of the next steps going forward.
[00:36:16] Unknown:
In terms of the usage that you've seen so far and the applications for SodaCL, what are some of the most interesting or innovative or unexpected ways that you've seen it applied?
[00:36:26] Unknown:
I didn't expect to see such a demand for monitoring business metrics at the start. Initially, we were focused on the technical data validations, but our customers pushed us to include monitoring of business metrics. So for example, send me an alert if the total sales volume per country goes down by more than 3% in a month. The biggest surprise for us was that that push came sooner than we thought.
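As a plain-Python illustration of the kind of business-metric rule being described, not SodaCL syntax, with invented numbers:

```python
# Plain-Python illustration (not SodaCL) of the rule described above: alert when
# the total sales volume for a country drops by more than 3% month over month.
last_month = {"BE": 120_000, "NL": 95_000, "FR": 200_000}  # invented figures
this_month = {"BE": 118_000, "NL": 88_000, "FR": 205_000}

for country, previous in last_month.items():
    current = this_month[country]
    change_pct = (current - previous) / previous * 100
    if change_pct < -3:
        print(f"ALERT: sales in {country} dropped {abs(change_pct):.1f}% this month")
```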
[00:36:55] Unknown:
In your experience of working on this language, and helping your team with building the utility around it, and working with your customers to understand what their use cases for it are and how they are actually applying it in the real world, what are some of the most interesting or unexpected or challenging lessons that you learned in the process?
[00:37:14] Unknown:
I think the interesting one was that the agent is kind of a nontrivial thing, and that it was really needed. We adopted the agent principle, the same as the other monitoring technologies, and it was necessary because, of course, companies don't wanna give remote access to their data. So you have to connect to the data and extract the metrics locally, and that needs to be scalable, and that needs to be on-site. And then in order to have the self-serve experience, you need to connect that. So that was for us the biggest lesson learned. It's like, okay, this is nontrivial, and how to set that up was more interesting and more challenging than we initially expected.
[00:37:56] Unknown:
For people who are interested in being able to manage the definition of what quality checks and what types of reliability information they're expecting, what are the cases where SodaCL is the wrong choice?
[00:38:14] Unknown:
Let's say, if you're doing data quality, then, of course, I would argue it's the right choice, but I might be biased there. But maybe if you're looking to set up something like only data lineage, for instance, there are a number of other tools out there. So this is where we also integrate with tools which are more specialized in
[00:38:33] Unknown:
that area. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:38:48] Unknown:
I think many parts of the data stack are getting simpler. If you look at a Fivetran or an Airbyte, they completely remove the need for custom coding to extract and land the data. That really simplifies things. And then there's dbt simplifying the transformations, and we try to add our two cents by adding self-serve data quality. Bundling all of this into an integrated data technology stack is, I would guess, the biggest challenge ahead. And it's something that all the data teams now are actually having to do themselves, so I think that would be a great next step. Like many others, I expect to see a consolidation around a few data platforms. But given how fragmented the current market is and how much innovation is happening on each of the different aspects individually, I think it's gonna be quite a while before all of that gets into nicely packaged data platforms.
[00:39:45] Unknown:
Well, thank you very much for taking the time today to join me and share the work that you're doing on the Soda Checks Language and Soda Core utilities. It's definitely a very useful contribution to the space. It's great to see more investment in tooling to help people gain confidence in the ways that their data is being used and being able to build trust for some of the data consumers. So I appreciate all of the time and energy that you and your team are putting into that, and I hope you enjoy the rest of your day. Thanks, Tobias.
[00:40:14] Unknown:
It was a blast.
[00:40:21] Unknown:
Thank you for listening. Don't forget to check out our other shows: Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com. Subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a product from
[00:40:47] Unknown:
the show, then tell us about it. Email hosts@dataengineeringpodcast.com
[00:40:48] Unknown:
with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Guest Introduction
Holistic Approach to Data Quality
Scope and Design of Soda DSL
Community Feedback and Business Metrics
Implementation and Extensibility of SodaCore
Collaboration and Workflow Integration
Future Plans and Language Evolution