Summary
Regardless of how data is being used, it is critical that the information is trusted. The practice of data reliability engineering has gained momentum recently to address that need. To help support the efforts of data teams, the folks at Soda Data created the Soda Checks Language and the corresponding Soda Core utility that acts on this new DSL. In this episode Tom Baeyens explains their reasons for creating a new syntax for expressing and validating checks for data assets and processes, as well as how to incorporate it into your own projects.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
- Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.
- Prefect is the modern Dataflow Automation platform for the modern data stack, empowering data practitioners to build, run and monitor robust pipelines at scale. Guided by the principle that the orchestrator shouldn’t get in your way, Prefect is the only tool of its kind to offer the flexibility to write workflows as code. Prefect specializes in gluing together the disparate pieces of a pipeline, and integrating with modern distributed compute libraries to bring power where you need it, when you need it. Trusted by thousands of organizations and supported by over 20,000 community members, Prefect powers over 100MM business critical tasks a month. For more information on Prefect, visit dataengineeringpodcast.com/prefect.
- Data engineers don’t enjoy writing, maintaining, and modifying ETL pipelines all day, every day. Especially once they realize 90% of all major data sources like Google Analytics, Salesforce, Adwords, Facebook, Spreadsheets, etc., are already available as plug-and-play connectors with reliable, intuitive SaaS solutions. Hevo Data is a highly reliable and intuitive data pipeline platform used by data engineers from 40+ countries to set up and run low-latency ELT pipelines with zero maintenance. Boasting more than 150 out-of-the-box connectors that can be set up in minutes, Hevo also allows you to monitor and control your pipelines. You get: real-time data flow visibility, fail-safe mechanisms, and alerts if anything breaks; preload transformations and auto-schema mapping to precisely control how data lands in your destination; models and workflows to transform data for analytics; and reverse-ETL capability to move the transformed data back to your business software to inspire timely action. All of this, plus its transparent pricing and 24×7 live support, makes it consistently voted by users as the Leader in the Data Pipeline category on review platforms like G2. Go to dataengineeringpodcast.com/hevodata and sign up for a free 14-day trial that also comes with 24×7 support.
- Your host is Tobias Macey and today I’m interviewing Tom Baeyens about Soda Data’s new DSL for data reliability
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what SodaCL is and the story behind it?
- What is the scope of functionality that SodaCL is intended to address?
- What are the ways that reliability is measured for data assets? (what is the equivalent to site uptime?)
- What are the core abstractions that you identified for simplifying the declaration of data validations?
- How did you approach the design of the SodaCL syntax to balance flexibility for various use cases, with structure and opinionated application?
- Why YAML?
- Can you describe how the Soda Core utility is implemented?
- How have the design and scope of the SodaCL dialect and the Soda Core framework evolved since you started working on them?
- What are the available integration/extension points for teams who are using Soda Core?
- Can you describe how SodaCL integrates into the workflow of data and analytics engineers?
- What is your process for evolving the SodaCL dialect in a maintainable and sustainable manner?
- What are the most interesting, innovative, or unexpected ways that you have seen SodaCL used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on SodaCL?
- When is SodaCL the wrong choice?
- What do you have planned for the future of SodaCL?
Contact Info
- @tombaeyens on Twitter
- tombaeyens on GitHub
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan's active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today, that's a t l a n, to learn more about how Atlan's active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork, and Unilever achieve extraordinary things with metadata.
When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends at Linode. With their new managed database service, you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes with automated backups, 40 gigabit connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey. And today I'm interviewing Tom Baeyens about Soda Data's new DSL for data reliability.
So, Tom, could you start by introducing yourself for folks who didn't listen to your past appearance on the show and give a bit of a reminder about how you first got involved in data? Sure thing,
[00:01:46] Unknown:
Tobias. Thanks for the invitation. I'm an enthusiastic fan of the podcast, so I'm really glad to be here. So I'm Tom Baeyens. I'm a cofounder and CTO of Soda. I started a long time ago as a software engineer building workflow engines in open source. And as such, I did a lot of SQL. Then I went all the way to NoSQL, so eventually, I was glad to see everything coming back to SQL again. And when I started to focus on data engineering more specifically, that's when I met my cofounder, Maarten, who was at Collibra at the time. So his background in data management led us into data quality, as it was a natural fit with my earlier work in open source. Since then, a lot of other companies have also confirmed that they see this space as the next frontier in data.
And that's the environment I love. It's where things are not yet settled, and where we're pushing the boundaries of what we can do with data.
[00:02:48] Unknown:
In terms of the Soda Checks Language and the DSL that you're building and the library that you're building to be able to take advantage of that definition, I'm wondering if you can just give a bit of an overview about what it is that you're building and some of the story behind why you decided that this was a project that was worth investing in.
[00:03:06] Unknown:
Our driver has always been to build a holistic approach to data quality. We did not really start with the intention of building a language at first, but it quickly became clear that we needed a language for describing what good data looks like. And before diving into the language itself, let's set the scene a bit with some context so that you can see how the language fits into our overall data quality approach. Most data teams, they know that data issues will occur regularly in analytical data. Let me illustrate this with a quick example. Like, imagine you're a bank and there's a legal requirement to report tax information of customers.
So the bank has built a data application that produces a financial compliance report, and that's running fine for a while. Now someone in the mobile application team, they do a change in the operational database. And as a result, a crucial field for the tax information has been lost. So if you miss this in your financial reporting, the company risks legal action from the tax authority. So why this example? It's because it shows the importance of data quality for the business. Data is not something that's only handled by one engineering team. Data needs to be connected to the business.
This example also shows the overall goal of data quality. As data applications are running in production, how can you ensure that changes to the data do not break those data applications? How do you ensure that we can trust the analytical data that flows into the data products? That's the pain a lot of companies are dealing with right now. So when data teams want to take action on data quality, they need to set up systems and processes for three distinct steps. The first one is really all about finding the data issues, being the first to know. When data analysts build reports, they want to express their assumptions about the data that they consume.
And when engineers build pipelines, they want to express their assumptions required for the correct operation of those pipelines. Data stewards, they take responsibility for data in source systems, like, for example, a Salesforce. They wanna get notified if people don't fill in the forms as agreed. SodaCL, the language, is a common foundation for all these use cases to start finding data issues. Now in finding data issues, you have to pay close attention to not generate too many alerts, because it leads to alert fatigue: if there are too many alerts that are not acted upon, people start ignoring the alerts and the data quality initiative fails. So a good signal to noise ratio on data quality alerts is a prerequisite.
So SodaCL plays an important role in making that happen. If you wanna be the first to know about data issues, you need to make sure it's easy for people to express what good data looks like. And that's the crux to making this first step scale. Then, after finding the issues, there is the second step of root cause analysis. For every alert that's raised, it needs to be analyzed. You'll need to bring together all kinds of information that can help find this root cause. This part of data quality is also known as observability. Examples of information that can help find the root cause are all kinds of data metrics. That's one. In our case, they mostly come from the check diagnostics.
All issues found by checks have explainability built in. So that includes the metrics, but it can also include record-level failed and passed rows, for instance. Schema changes can help diagnose. And apart from the data and the metadata, it's useful to correlate this with pipeline code commits, for example, or pipeline execution failures, because those will also help you find the root cause. Then the last and third step is resolving the data issues. Once the root cause has been identified, it needs to be fixed. And as data issues happen frequently, you need systems and processes to drive that collaboration between the teams.
More than half of the data issues actually originate at the source, inside the production systems. So dealing with those data issues is most often not contained within a single engineering team; you have to establish these workflows as part of the process. All too often, we see bad data being extracted into spreadsheets and then being sent around in emails. That's not ideal. And maybe one more thing: there's a clear analogy here with software engineering. That's the world that came before, so it's good to see that these principles are now taking hold as well.
So in order to get reliable software, there are two main ingredients that are quite common right now, which are test suites and observability. If you want to keep releasing software with confidence, you need good automated test suites. And if you want to be able to diagnose your issues in production, you need observability tools like a Datadog or a New Relic. And the same is true for data. Data testing brings the issues to the surface, makes sure you are the first to know, and data observability helps you diagnose the root cause.
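To make "expressing what good data looks like" a bit more concrete, here is a rough, illustrative sketch of a few SodaCL checks for the tax-reporting example above, held as a Python string so they could later be handed to a scan. The dataset and column names (customers, tax_id, customer_id) are invented for illustration, and the exact syntax should be confirmed against the SodaCL documentation.

```python
# Illustrative only: a handful of SodaCL checks expressing "what good data looks
# like" for the bank/tax example. Dataset and column names are made up.
sodacl_checks = """
checks for customers:
  - row_count > 0                     # the table is never empty
  - missing_count(tax_id) = 0         # the crucial tax field is always present
  - duplicate_count(customer_id) = 0  # no duplicate customer records
"""
```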
[00:08:58] Unknown:
In terms of the scope of concerns that the DSL that you're building and the library that you built around it is intended to address, I'm wondering if you can talk to which stages of the data life cycle it is designed around, and also whether it is intended to be something that is stateless, where it will run a check against a particular stage and do validations against that, or if it is intended to be stateful, where it is maybe aggregating checks across different life cycle stages so that you can understand whether a kind of sequence of validations are holding true based on the set of transformations that are being executed on a particular record or batch of records?
[00:09:36] Unknown:
Let me break it down a bit. So the first part is that we made sure that both analysts and engineers can express what good data looks like with SodaCL. That's quite important. Then Soda Core is the library that can execute the SodaCL language. It's implemented as an open source lightweight library, and that gives the engineers the flexibility to embed data testing straight into their pipelines and then stop the pipeline if needed. But also the analysts, they find it easy to read and write these check files. So with SodaCL, they don't have to ask the central data team to implement the checks for them. They can now write their own checks. That's quite crucial as background: data quality has to follow what BI, business intelligence, did. Like, who remembers the days that for a new report, you had to talk to the central data team? That was cumbersome. So data quality has to become self-serve as well to overcome the scalability problem.
And then SodaCL, as the name says, is a language to write checks. And a collection of checks together forms a data contract, or an agreement as we call it. A data contract is what you need as the data team grows bigger. This is actually where a lot of companies are struggling. It starts small and manageable with a handful of pipelines. Engineers leverage a pipeline's intermediate results, for instance, to build the next data pipeline. And the result is long pipelines with many unclear dependencies. And this gets very hard to maintain.
The cause is really these unclear data pipeline dependencies. This interconnected mess of pipeline dependencies is similar to spaghetti code in the old days of software development, so I call them the spaghetti pipelines. And this is typically what happens when the team grows in size and there are more data products being produced. And the concept of data as a product, as outlined in the data mesh, really helps to break this down and scale the data team. So instead of long spaghetti pipelines, the analytical team is split into domains, and each domain can then be a team.
Between the teams, the focus is on these handovers. That's the data as a product. When data is being handed over, that's where the focus needs to be to really productize these datasets. And as a producer of data, you should make sure that your data product is discoverable, that it's documented, and also that it's monitored with a data contract. So that's, in a nutshell, part of the strategy to tackle that complexity as your team grows. Yeah, maybe one more thing. A SodaCL agreement is an executable data contract, actually, that verifies at runtime if the data you produce is as you expect.
And so that's why it's crucial during those handover phases. An agreement or a data contract can also be used to monitor individual usages. So as a consumer of data, you can state, like, we'll be using this data in this specific way, and these are the requirements that we have for this data. So that later, if any of these assumptions fail, then that particular team can be informed about the breach or data issue.
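As a sketch of how an engineer might embed such an agreement in a pipeline step and stop the pipeline when it is breached: the Scan methods below follow the soda-core Python API as we understand it, but the file names, data source name, and exact signatures are assumptions and may differ between versions.

```python
# Sketch: running a SodaCL agreement (an executable data contract) inside a
# pipeline step with soda-core. File names and the data source name are
# illustrative assumptions.
from soda.scan import Scan

scan = Scan()
scan.set_data_source_name("warehouse")                       # matches configuration.yml
scan.add_configuration_yaml_file("configuration.yml")        # connection details
scan.add_sodacl_yaml_file("agreements/orders_contract.yml")  # the agreement / contract
scan.execute()

# Raise if any check failed, so the orchestrator marks this step, and the
# downstream consumers of the data product, as failed.
scan.assert_no_checks_fail()
```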
[00:13:20] Unknown:
In terms of the kind of broader scope of the ecosystem, I'm wondering what you looked to as far as other existing tools and practices, and what was, I don't know if lacking is the right term, but what are the pieces that were absent, either in terms of being a cohesive whole or just entirely missing from the ecosystem, that made the Soda Checks Language and Soda Core a necessary and useful contribution to that space. I'm thinking in particular in terms of things like Great Expectations and some of the different kind of product focused data quality suites, and kind of what the role of SodaCL and Soda Core is given the existing set of technologies that are available.
[00:14:08] Unknown:
Yeah. We definitely looked very closely at what's out there, and there are two things that we wanted to merge. And one of them is, as you mentioned, Great Expectations. It's a programmatic approach to data testing. And for us, the key behind scaling data quality towards the entire team is to make sure that analysts and less technical people can also manage their own checks and their own data quality expectations. That was the driver for us, because we realized that analysts, data stewards, and the less technical are often technical enough to do some SQL or to maybe hack together a Python script, but they're not technical enough to get Python code into production.
That's often a step too far, and that would actually require them to hand over or to communicate to engineering what they wanna get implemented. So that's one of the things we looked at, and that's where we saw the bottleneck happening: we really needed to make this self-serve for analysts and the broader data team to get the business data quality checks involved there as well. And then the other thing we looked at was the older systems. Data quality tools existed a long time ago, which kind of validates that this has been around. But there, it didn't really scale. And that was part of the same thing: it was also a technical approach to how these data quality checks were actually realized and implemented; it wasn't really done as self-serve. And by now, the technology landscape has completely changed, and the amount of data and people involved has also grown. So that set us down the path of exploring SodaCL as a language for all of these use cases.
[00:16:02] Unknown:
Data engineers don't enjoy writing, maintaining, and modifying ETL pipelines all day every day, especially once they realize that 90% of all major data sources like Google Analytics, Salesforce, AdWords, Facebook, and spreadsheets are already available as plug and play connectors with reliable, intuitive SaaS solutions. Hevo Data is a highly reliable and intuitive data pipeline platform used by data engineers from over 40 countries to set up and run low latency ELT pipelines with zero maintenance. Boasting more than 150 out of the box connectors that can be set up in minutes, Hevo also allows you to monitor and control your pipelines. You get real time data flow visibility with fail safe mechanisms and alerts if anything breaks, preload transformations and auto schema mapping to precisely control how data lands in your destination, models and workflows to transform data for analytics, and reverse ETL capability to move the transformed data back to your business software to inspire timely action.
All of this, plus its transparent pricing and 24×7 live support, makes it consistently voted by users as the leader in the data pipeline category on review platforms like G2. Go to dataengineeringpodcast.com/hevodata today and sign up for a free 14 day trial that also comes with 24×7 support. As far as the design of that DSL, what was your approach for identifying what concerns needed to be addressed in it, how to structure it, why you chose YAML as the expression format? It's always the question when somebody adds another YAML DSL. And just the overall process of thinking through what the available syntax should be, how to constrain it appropriately, but also allow for enough flexibility and expressivity, and the kind of iteration process of going from, this is a problem that we want to solve, to, this is how we've decided to solve the problem and structure the language.
[00:17:53] Unknown:
Yeah. That's a great question. Thanks. I'll first dive a bit into the concepts and then gradually go into the language design decisions that we made. So first of all, it's essentially a collection of checks that we want to model, or that you want to author. So for every aspect that you want to check, there's a specific check type, and that was in large part driven by our community, because they really give the input of, this is the type of check that I wanna run on my data. So for every type of check, we were looking for the best possible syntax for how we can express this so that it's also easy to read. That was also very important.
A major improvement we made over the previous, initial versions of the language was that we allowed the organization of the files to be different, so that we now support many-to-many between check files and datasets. Before, we had one file per dataset, and now you can group your checks on different datasets in files as you want. And that was crucial for analysts and also for the use case of data contracts, where one data contract might span multiple datasets. And another concept that we added into the language was the filters. Filters allow you to express checks that only have to be executed on a partition of the data.
So that's also helping the language there. It's an important concept that needs to be in there. And then in terms of the design of the language itself, we actually started from a blank sheet of paper. We tried various alternative styles without any constraint, even before we decided on YAML. But it turned out that what we eventually produced, if we wanted to make it compact and readable, was very close to YAML anyway. We said, rather than having something that looks like YAML and is not really YAML, that's gonna lead to problems. So that was the background to deciding that we'd better take YAML as the basis and then layer the language in there. That probably also helps a little bit with the tooling around it, so that if people have a YAML editor, they can much more easily start working with this.
And then the rest of the design was always centered around readability. That's front and center for us. So this leads us to group all the checks of a certain dataset, or dataset partition actually, together. That really helps the readability. The rest is just working out the check configuration details for each check type, and from there, the rest of the language flowed very naturally.
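A rough illustration of the grouping and filter concepts described here. The dataset names, columns, partition name, and the filter condition are invented, and the filter syntax is approximated from the SodaCL documentation rather than quoted from it.

```python
# Illustrative only: one check file covering two datasets, with a filtered
# "daily" partition on one of them. All names and the WHERE clause are made up.
sodacl_checks = """
filter orders [daily]:
  where: created_at >= DATEADD(day, -1, CURRENT_TIMESTAMP)

checks for orders [daily]:
  - row_count > 0
  - missing_count(customer_id) = 0

checks for customers:
  - duplicate_count(customer_id) = 0
  - freshness(updated_at) < 1d
"""
```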
[00:20:45] Unknown:
In terms of the tooling that you mentioned, 1 of the things that's always nice to have, and some people find it a requirement, is when they're actually working in the language and they're in their editor environment, being able to have sort of syntax highlighting, you know, syntax suggestions and just editor help around how they're actually writing that code. And I'm wondering what level of investment, if any so far, has been made in being able to define some of that syntax and allow for linting and helpers in the overall process of defining a set of checks that you want to write using this DSL?
[00:21:23] Unknown:
There are two angles to this, and there are a lot of different people wanting different styles of that kind of support. Right? And we focus on two use cases. So first of all, there's the engineer. They want to do it in their editor as part of building and coding their pipelines. So we made sure that the files can just be added into any code repository and that you can read them from there, so that they can use their favorite editor to do this. And there, we didn't initially focus on building support for code completion and all that, but we focused on making sure that you can run this and that you get great feedback as to where your error is and what exactly is going on. So it is easy to do a test kind of dry run of the scan, so that you get parsing feedback and very clear highlights as to where the problems are and what exactly is going on. And then the second part is, when we are thinking about more advanced help towards editing, that's where we look at our Soda Cloud offering.
That's where we want to build an experience which is kind of inspired by an IDE way of working, which is, you write and then you try it out. And that's what we want to do on Soda Cloud for the analysts and for the less technical users. That experience is mostly built on a round trip towards the actual data, so that you can have a full round trip of testing. That's quite key to making it self-serve for analysts: they can write some YAML, they can take some snippets, and there's some help in that editor as well. But then the very first step, and that's really crucial, is that we can send it to your data. We can send the check files to your data, we can run them, and analysts get immediate feedback as to whether the check runs okay, and then also the results of the check, whether that's as they expected, before they actually put these checks into production.
[00:23:22] Unknown:
As far as the implementation of the SodaCore library, I'm wondering if you can talk to how that's designed, how it's implemented, and some of the evolution that you went through as you were defining the DSL and then building the utility that was intended to operate on that syntax and specification?
[00:23:43] Unknown:
Thanks. That's a great question, and it gets me super excited, because it's often the work under the hood that doesn't really get exposed, but this was nontrivial and very exciting to work on. So basically, it's Soda Core itself, the core engine, that drives the interpretation of the language. It's really implemented with performance and compute cost in mind. So first, all the checks are parsed, and then all the metrics are computed that are necessary for these checks. And often, it's the same metric that's used in many checks. So we first ensure that these are merged so that you compute every metric only once. And then we compile the minimal number of queries. This avoids the pitfall,
in homegrown solutions for instance, of having one query per check. This strategy has helped our customers save a tremendous amount of compute cost. And this is, of course, often under the hood and not always noticed, but yeah, that was quite a feat. That's the stuff I'm proud of, that we did this.
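A purely conceptual sketch of that idea, not Soda Core's actual internals: collect the metrics that the checks need, de-duplicate them, and fold them into a single aggregation query per dataset.

```python
# Conceptual illustration only (not Soda Core's real implementation): several
# checks can rely on the same metric, so each metric is computed once and all
# metrics for a dataset are compiled into one query.
checks = [
    ("row_count > 0", "row_count"),
    ("missing_count(email) = 0", "missing_count(email)"),
    ("missing_count(email) < 10", "missing_count(email)"),  # same metric, reused
]

unique_metrics = {metric for _, metric in checks}  # each metric computed only once

sql_for_metric = {
    "row_count": "COUNT(*)",
    "missing_count(email)": "SUM(CASE WHEN email IS NULL THEN 1 ELSE 0 END)",
}
query = (
    "SELECT "
    + ", ".join(sql_for_metric[m] for m in sorted(unique_metrics))
    + " FROM customers"
)
print(query)  # one aggregation query serves all three checks
```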
[00:24:56] Unknown:
In terms of the extensibility of the soda core project, just the terminology of it being core implies that there are other things that can and will be built around it. And I'm wondering if you can talk to the interfaces and extension points that you designed into that utility and some of the ways that you envision it being extended or integrated into other projects.
[00:25:18] Unknown:
The core, essentially, is a Python library, and it is open source, so you can embed it in all sorts of ways. And that's the engine that I just explained how it works: it compiles queries, and it runs the checks. Then, also part of the open source core project is a CLI, and it's easy to embed this into, for instance, a Docker container. These are all ways to get it into your Airflow and into your orchestration tools. That's all towards the engineers, making it very easy for them to adopt this. And then the other way that we extended this, in terms of making the difference between core and the rest, is that we have a managed version of Soda Core.
That's our agent. The agent runs Soda Core in your organization next to your data, so the data can always stay where it is. And from there we can connect to Soda Cloud to actually get the whole self-service experience, and the monitoring, the alerting, and the collaboration aspects go through Soda Cloud. So that's the relationship between Soda Core and Soda Cloud.
[00:26:29] Unknown:
For the workflow of somebody who is using the Soda Checks Language and Soda Core, can you talk to just the ways that it fits into their workflow and some of the places in their development cycle that they will be interacting with the checks language, either as somebody who's actually writing it, or even as a consumer who is trying to understand what are the constraints and validations that are being performed on this data asset that I'm consuming?
[00:26:59] Unknown:
Typically, it's embedding into an orchestration tool. That's the most common way it's embedded. Then, each time your pipeline produces new data, that's when the checks are executed, typically before or after a dbt run, for instance. That would be quite typical. And then it's part of the orchestration solutions, like an Airflow or a Dagster, that's where you're gonna enter your checks into. This is indeed something that we had to build a language for to enable this. There's the split between who is actually running the checks and who is authoring them. There's a data engineer building this into their pipelines, and then the people authoring their checks. Sometimes it's the same person, and then it's all self contained, self controlled: an engineer says, I write my checks on my pipeline, and I embed them here in my DAG or in my orchestration tool. But sometimes it's also the case that you have other people, say an analyst, that are just writing a bunch of check files. They get centralized in a folder, for instance. And then the engineer that runs the checks, or runs the scan, points to that folder to say, take all the check files from this folder and run these checks at this stage in the pipeline.
This is gradually how we got to see the value of doing it as a self-service in our cloud, because that's kind of the flavor that we've moved to. So if you have data agreements, for instance, or data contracts, you can manage those in Soda Cloud. You could consider those as a bunch of check files there, which are actually agreements, and these then get executed on a certain schedule or embedded into the pipelines.
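A sketch of that folder-of-check-files pattern inside an Airflow task. The TaskFlow decorator is standard Airflow 2, but the paths, data source name, and the soda-core method names are assumptions for illustration and may differ by version.

```python
# Sketch: an Airflow task that picks up every check file analysts have dropped
# into a folder and runs them as one scan, right after the transform/dbt step.
from pathlib import Path
from airflow.decorators import task


@task
def run_soda_scan(checks_dir: str = "include/soda/checks"):
    from soda.scan import Scan

    scan = Scan()
    scan.set_data_source_name("warehouse")
    scan.add_configuration_yaml_file("include/soda/configuration.yml")
    for check_file in sorted(Path(checks_dir).glob("*.yml")):
        scan.add_sodacl_yaml_file(str(check_file))  # analyst-authored check files
    scan.execute()
    scan.assert_no_checks_fail()  # fail this pipeline stage on bad data
```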
[00:28:43] Unknown:
As far as the design of the language, you mentioned that you wanted it to be something that is accessible beyond just developers. And so I'm wondering what are some of the ways that you tried to encapsulate some of the concepts to make it clear where somebody can look at a file that contains a set of checks and then be able to intuit what the expressed intent is supposed to be, without even necessarily having to read through all the specifics or read through the entire file, and some of the ways that the language is structured to be able to make it scannable, particularly as you move from a small use case to I have thousands of assets, which means I have at least dozens of checks files with maybe hundreds of validations, and being able to make it so that it is something that is useful without being overwhelming.
[00:29:35] Unknown:
So on a check file level, the first thing we did was make sure that we have grouping of those checks, grouping by dataset and by partition. That's the first thing. And then we also made sure that the language is compact, so that you don't have these checks spread out over a gazillion lines and have to scroll through endless files, so that in one screenshot, you can already see a very large overview of all the checks involved. That's also quite important there.
[00:30:07] Unknown:
And then the other piece is as you scale the usage of SodaCL, you know, as with any programming language, there need to be some ways to manage reusability and modularity and being able to kind of encapsulate concerns. So different languages have different ways to do it, usually with some sort of namespacing. And so I'm wondering what your thoughts have been along how to manage projects written in SodaCL as they scale in terms of size and expressed complexity, and some of the ways that you are working with some of the early users of the project to be able to understand what are some of those kind of seams or points of division or units of composability that you want to be able to support?
[00:30:56] Unknown:
Yeah. So, actually, this is interesting, because initially, we had the idea of, look, we need some kind of reusable construct in there. And then it turned out, two things. First of all, when this reuse was happening, it was also often, how do you say, working against them, because then someone changed something in the common package, and it wasn't intended in the second use case, for instance. And it wasn't properly understood that these changes could happen. So, two things. If it's on an engineering level, then reuse is actually better organized on a file level. So then engineers know, I can have one file with the central checks. That's the one I always load. And then for this particular use case, I have a second file with customizations and with extra checks on top of that. So this is the engineering level. Right? Engineers can actually just reuse files. And then when you go to the analyst side, that's where we see it more as a bad practice to leverage them, because typically it's for different usages. So if you have different teams or different data products using the same dataset, you actually want to capture those specific usages as separate check files so that they can also evolve separately.
And the scans just make sure that all these checks get merged together, so there's no performance penalty to pay there.
[00:32:22] Unknown:
Prefect is the data flow automation platform for the modern data stack, empowering data practitioners to build, run, and monitor robust pipelines at scale. Guided by the principle that the orchestrator shouldn't get in your way, Prefect is the only tool of its kind to offer the flexibility to write workflows as code. Prefect specializes in gluing together the disparate pieces of a pipeline and integrating with modern distributed compute libraries to bring power where you need it, when you need it. Trusted by thousands of organizations and supported by over 20,000 community members, Prefect powers over 100,000,000 business critical tasks a month.
For more information on Prefect, go to dataengineeringpodcast.com/prefect today. That's prefect. As far as the kind of collaboration aspect of data and the fact that this is designed to be able to be accessible and usable by the different stakeholders and approachable by people who aren't doing the engineering necessarily, what are some of the ways that you see the Soda Core and SodaCL utilities reflecting in the ways that the work is done as it spans the different roles in a data life cycle, and some of the ways that data teams think about how to structure their work and how to kind of build confidence and trust in the work that they're doing and the information that they're providing to the organization?
[00:33:49] Unknown:
First of all, the language really helps to distribute the work. It creates a much clearer picture of who has to do or contribute what to that overall picture of what good data looks like. So it's the analysts that can, with self-serve, build up their part of the picture, which is usually from the usage perspective. Then there are the engineers that can contribute their part, which is what is needed from the data pipeline operations perspective. They can add that part there. And then, as issues come out of this, and this is where the agreement is quite important, the agreement states that when some problem comes out of the checks in this agreement, there's also a workflow associated with it to start dealing with this. So that's the kind of overall workflows and collaborations that we facilitate and that are based on this language.
[00:34:47] Unknown:
Given the fact that this is still a very early project and a young language and something that is intended to be used by the broader community and that you're obviously hoping to see adoption and growth for. I'm curious how you're thinking about the overall plan for being able to manage the kind of growth and evolution of the language and being able to introduce new syntax and how to do that in a way that is understandable by the community and approachable by the community while being maintainable and sustainable for the core engineering team that's responsible for it? We already have, like, a couple of iterations
[00:35:26] Unknown:
and incorporated already a massive amount of feedback from the community; it's been a lively community for a while now. So in terms of actual big changes to the language itself, of course, there will always be extensions and small changes left and right, but that's not where we anticipate the next big step. The biggest step that we're gonna take next is extending our self-serve and the authoring experience from the cloud, making sure that gets extended. So beyond what we have right now, which is really the testability of your checks in the cloud, towards supporting it with check suggestions, or snippets, or live templates that help you build checks much quicker. That's the track where we see more of the next steps going forward.
[00:36:16] Unknown:
In terms of the usage that you've seen so far and the applications for SodaCL, what are some of the most interesting or innovative or unexpected ways that you've seen it applied?
[00:36:26] Unknown:
I didn't expect to see such a demand for monitoring business metrics at the start. Initially, we were focused on the technical data validations, but our customers pushed us to include monitoring of business metrics. So for example, send me an alert if the total sales volume per country goes down by more than 3% in a month. The biggest surprise for us was that that push came sooner than we thought.
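As a plain-Python illustration of the kind of business-metric rule being described, not SodaCL syntax, with invented numbers:

```python
# Plain-Python illustration (not SodaCL) of the rule described above: alert when
# the total sales volume for a country drops by more than 3% month over month.
last_month = {"BE": 120_000, "NL": 95_000, "FR": 200_000}  # invented figures
this_month = {"BE": 118_000, "NL": 88_000, "FR": 205_000}

for country, previous in last_month.items():
    current = this_month[country]
    change_pct = (current - previous) / previous * 100
    if change_pct < -3:
        print(f"ALERT: sales in {country} dropped {abs(change_pct):.1f}% this month")
```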
[00:36:55] Unknown:
In your experience of working on this language, and helping your team with building the utility around it, and working with your customers to understand what their use cases for it are and how they are actually applying it in the real world, what are some of the most interesting or unexpected or challenging lessons that you learned in the process?
[00:37:14] Unknown:
I think the interesting one was that the agent is kind of a nontrivial thing, and that it was really needed. We adopted the agent principle, the same as the other monitoring technologies, and it was necessary because, of course, companies don't wanna give remote access to their data. So you have to connect to the data and extract the metrics locally, and that needs to be scalable, and that needs to be on-site. And then in order to have the self-serve experience, you need to connect that. So that was for us the biggest lesson learned. It's like, okay, this is nontrivial, and how to set that up was more interesting and more challenging than we initially expected.
[00:37:56] Unknown:
For people who are interested in being able to manage the definition of what quality checks and what types of reliability information they're expecting, what are the cases where SodaCL is the wrong choice?
[00:38:14] Unknown:
Let's say, if you're doing data quality, then, of course, I would argue it's the right choice, but I might be biased there. But maybe if you're looking to set up something like only data lineage, for instance, there are a number of other tools out there. So this is where we also integrate with tools which are more specialized in
[00:38:33] Unknown:
that area. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:38:48] Unknown:
I think many parts of the data stack are getting simpler. If you look at a Fivetran or an Airbyte, they completely remove the need for custom coding to extract and land the data. That really simplifies things. And then there's dbt simplifying the transformations, and we try to add our two cents by adding self-serve data quality. Bundling all of this into an integrated data technology stack is, I would guess, the biggest challenge ahead. And it's something that all the data teams now are actually having to do themselves, so I think that would be a great next step. Like many others, I expect to see a consolidation around a few data platforms. But given how fragmented the current market is and how much innovation is happening on each of the different aspects individually, I think it's gonna be quite a while before all of that gets into nicely packaged data platforms.
[00:39:45] Unknown:
Well, thank you very much for taking the time today to join me and share the work that you're doing on the Soda Checks Language and Soda Core utilities. It's definitely a very useful contribution to the space. It's great to see more investment in tooling to help people gain confidence in the ways that their data is being used and being able to build trust for some of the data consumers. So I appreciate all of the time and energy that you and your team are putting into that, and I hope you enjoy the rest of your day. Thanks, Tobias.
[00:40:14] Unknown:
It was a blast.
[00:40:21] Unknown:
Thank you for listening. Don't forget to check out our other shows: Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com. Subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a product from
[00:40:47] Unknown:
the show, then tell us about it. Email hosts@dataengineeringpodcast.com
[00:40:48] Unknown:
with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Guest Introduction
Holistic Approach to Data Quality
Scope and Design of Soda DSL
Community Feedback and Business Metrics
Implementation and Extensibility of SodaCore
Collaboration and Workflow Integration
Future Plans and Language Evolution