Summary
Working with data is a complicated process, with numerous chances for something to go wrong. Identifying and accounting for those errors is a critical piece of building trust across the organization that your data is accurate and up to date. While there are numerous products available to provide that visibility, they each focus on different technologies and workflows. To bring observability to dbt projects, the team at Elementary embedded themselves into the workflow. In this episode Maayan Salom explores the approach that she has taken to bring observability, enhanced testing capabilities, and anomaly detection into every step of the dbt developer experience.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
- Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster today to get started. Your first 30 days are free!
- This episode is brought to you by Datafold – a testing automation platform for data engineers that prevents data quality issues from entering every part of your data workflow, from migration to dbt deployment. Datafold has recently launched data replication testing, providing ongoing validation for source-to-target replication. Leverage Datafold's fast cross-database data diffing and monitoring to test your replication pipelines automatically and continuously. Validate consistency between source and target at any scale, and receive alerts about any discrepancies. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold.
- Your host is Tobias Macey and today I'm interviewing Maayan Salom about how to incorporate observability into a dbt-oriented workflow and how Elementary can help
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by outlining what elements of observability are most relevant for dbt projects?
- What are some of the common ad-hoc/DIY methods that teams develop to acquire those insights?
- What are the challenges/shortcomings associated with those approaches?
- Over the past ~3 years, numerous data observability systems/products have been created. What are some of the ways that the specifics of dbt workflows are not covered by those generalized tools?
- What are the insights that can be more easily generated by embedding into the dbt toolchain and development cycle?
- Can you describe what Elementary is and how it is designed to enhance the development and maintenance work in dbt projects?
- How is Elementary designed/implemented?
- How have the scope and goals of the project changed since you started working on it?
- What are the engineering challenges/frustrations that you have dealt with in the creation and evolution of Elementary?
- Can you talk us through the setup and workflow for teams adopting Elementary in their dbt projects?
- How does the incorporation of Elementary change the development habits of the teams who are using it?
- What are the most interesting, innovative, or unexpected ways that you have seen Elementary used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Elementary?
- When is Elementary the wrong choice?
- What do you have planned for the future of Elementary?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- Starburst: ![Starburst Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/UpvN7wDT.png) This episode is brought to you by Starburst - a data lake analytics platform for data engineers who are battling to build and scale high quality data pipelines on the data lake. Powered by Trino, Starburst runs petabyte-scale SQL analytics fast at a fraction of the cost of traditional methods, helping you meet all your data needs ranging from AI/ML workloads to data applications to complete analytics. Trusted by the teams at Comcast and Doordash, Starburst delivers the adaptability and flexibility a lakehouse ecosystem promises, while providing a single point of access for your data and all your data governance allowing you to discover, transform, govern, and secure all in one place. Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Try Starburst Galaxy today, the easiest and fastest way to get started using Trino, and get $500 of credits free. [dataengineeringpodcast.com/starburst](https://www.dataengineeringpodcast.com/starburst)
- Datafold: ![Datafold](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/zm6x2tFu.png) This episode is brought to you by Datafold – a testing automation platform for data engineers that prevents data quality issues from entering every part of your data workflow, from migration to dbt deployment. Datafold has recently launched data replication testing, providing ongoing validation for source-to-target replication. Leverage Datafold's fast cross-database data diffing and monitoring to test your replication pipelines automatically and continuously. Validate consistency between source and target at any scale, and receive alerts about any discrepancies. Learn more about Datafold by visiting https://get.datafold.com/replication-de-podcast.
- Dagster: ![Dagster Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/jz4xfquZ.png) Data teams are tasked with helping organizations deliver on the premise of data, and with ML and AI maturing rapidly, expectations have never been this high. However data engineers are challenged by both technical complexity and organizational complexity, with heterogeneous technologies to adopt, multiple data disciplines converging, legacy systems to support, and costs to manage. Dagster is an open-source orchestration solution that helps data teams rein in this complexity and build data platforms that provide unparalleled observability and testability, all while fostering collaboration across the enterprise. With enterprise-grade hosting on Dagster Cloud, you gain even more capabilities, adding cost management, security, and CI support to further boost your teams' productivity. Go to [dagster.io](https://dagster.io/lp/dagster-cloud-trial?source=data-eng-podcast) today to get your first 30 days free!
Hello, and welcome to the Data Engineering podcast, the show about modern data management. Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte scale SQL analytics fast at a fraction of the cost of traditional methods so that you can meet all of your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and DoorDash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data.
Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. Dagster offers a new approach to building and running data platforms and data pipelines. It is an open source, cloud native orchestrator for the whole development lifecycle with integrated lineage and observability, a declarative programming model, and best in class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise class hosted solution that offers serverless and hybrid deployments, enhanced security, and on demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster today to get started, and your first 30 days are free.
Your host is Tobias Macey. And today, I'm interviewing Maayan Salom about how to incorporate observability into a dbt-oriented workflow and some of the ways that Elementary can help. So, Maayan, can you start by introducing yourself?
[00:01:52] Unknown:
Yeah. Sure. So, happy to be here. I'm Maayan. My Starbucks name is Maya. It's much easier to pronounce. I'm the CEO and cofounder of Elementary. Some people know us as Elementary Data. I'd been in data roles for 12 years before starting Elementary, mainly in cybersecurity companies. I actually got into data much earlier, because I was a kid that was obsessed with sports. I'm originally Argentinian, and my dad wanted a boy. And when he didn't get a boy, he's like, you're gonna watch football with me. So I was really obsessed with stats and everything. Got all the way to databases when I reached kind of the limits of Excel.
So that's how I started. And, obviously, later on in my career, I handled more critical data pipelines in more intense environments. And data quality was a problem from when I was doing stats for my own fun in sports all the way to much bigger, more complicated tasks later on. So that's what got me started with Elementary.
[00:02:57] Unknown:
You mentioned already how you first got interested in working with data. I'm wondering if you can give a bit of a sense of what it is about the space that has kept you interested, and why you want to focus your time and energy on that problem space.
[00:03:11] Unknown:
So I think, in general, I have a big passion for data. It's kind of the right way to make decisions. And I think everyone who is a data professional probably feels that in many aspects of their lives, not just in their professional lives. And it's something you trust. Right? When there's data, you know you're gonna make the right decisions. And when you can't trust it, when you feel like it's lying, when you see the way stats are sometimes used in media, maybe to create wrong messages, then it really breaks your heart. So that's a very frustrating part of working intensely with data. In my last role before Elementary, I was doing cybersecurity incident response.
That's a very, very intense role. There's a big crisis that you're there to solve, and it's time sensitive. There's a lot of pressure, and you need to be very, very accurate with everything. There are a lot of consequences. And just the amount of time we spent there on validating and revalidating and trying to understand if everything was okay was so frustrating that it felt like something I wanted to focus on and solve.
[00:04:28] Unknown:
And now digging into the question of observability, in particular for dbt projects: data observability started coming to the fore in the data space maybe 2 or 3 years ago. And I'm just wondering if you can talk to some of the elements of observability that are most applicable to people who are using dbt for managing their transformations in a SQL context.
[00:04:53] Unknown:
Yeah. So we started Elementary a bit over two years ago, and we saw the revolution, let's call it, that dbt is bringing to how people build, how it makes things so much easier and abstracts so much of the complexity. And we felt that when it comes to observability, the same simplicity needs to apply, and the same kind of change needs to apply. And we felt that there wasn't a tool out there that we would use, if we were building a dbt project, that would make observability really easy. In terms of what your needs are when it comes to observability for a dbt project, I think it has three aspects. The first is not unique to dbt. It's the data itself. You need to validate it. You need to monitor it. You need to understand if there are unexpected changes, if it really adheres to your expectations.
Then there is the operational part. I think part of what makes working with dbt nice is that it lets you take all these small steps in your pipeline and turn them into, like, one big job that you run. But then in terms of observability, you really need the details and to focus on each step: the performance of each step, the execution time, the results, and also to understand the trends over time, which is hard to do because the operations in dbt are very standalone. Each run is on its own. And also there's a lot of metadata. dbt lets you really structure the way you build, which creates a lot of metadata, and it also facilitates you creating a lot of metadata around your pipelines.
And that's gold. If you can take that context and really leverage it, then you can build a much more comprehensive plan for how to monitor, how to govern everything, what the importance of each incident is. And those are kind of the three aspects that we really try to help with.
[00:06:58] Unknown:
And for people who are using dbt, they're trying to gain some visibility into the overall metrics of their project. They're trying to understand what are the things that are going well, how can I improve, what are the reasons for these different failures, what are the anomalies that I have to deal with. What are some of the ad hoc or DIY approaches that teams are likely to attempt in the process of trying to obtain those insights?
[00:07:28] Unknown:
So many teams, when they're small, when they're just starting, they're gonna do things like taking the log files of dbt and taking the manifest file, all the different outputs you can use, and trying to parse them and load them to wherever it's comfortable for them to work. So either sending them to some log processing tool like Datadog or Splunk or something like that, or taking them and even uploading them to the warehouse, because that's where you feel most comfortable, right, with SQL. And then maybe even working with your BI tool and creating some dashboards on top of it. We also saw users doing stuff like breaking down their dbt project to run each model as, like, a different step of the orchestrator, to kind of build for better observability.
So all kinds of hacks. And some teams have a really good setup that is working for them. The question is really, how does it hold up over time? Right? How much maintenance does it require? How does it hold up with version upgrades, with changes, with more and more needs, and how does it scale?
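For illustration, here is a minimal sketch of the kind of DIY artifact-parsing script described above. It follows the documented layout of dbt's run_results.json; the destination table named in the comments is hypothetical, and a real version would add the warehouse-loading step.

```python
# Sketch: flatten dbt's run_results.json into rows for a warehouse table.
# Relies on the artifact's documented fields (metadata.generated_at,
# results[].unique_id, .status, .execution_time); the destination table
# name is illustrative.
import json

def parse_run_results(path: str = "target/run_results.json") -> list[dict]:
    with open(path) as f:
        artifact = json.load(f)
    generated_at = artifact["metadata"]["generated_at"]
    rows = []
    for result in artifact["results"]:
        rows.append({
            "invocation_time": generated_at,
            "node_id": result["unique_id"],      # e.g. model.my_project.orders
            "status": result["status"],          # success / error / skipped
            "execution_time_s": result["execution_time"],
        })
    return rows

if __name__ == "__main__":
    # In practice these rows would be loaded into something like a
    # dbt_run_history table and charted in a BI tool.
    for row in parse_run_results():
        print(row)
```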
[00:08:42] Unknown:
And for teams who are scaling their usage of dbt, a lot of the work that the dbt product team is focused on is trying to move them into the cloud environment as a means of getting some of that visibility, some of the ease of use, and developer experience enhancements. And I'm curious what you see as some of the tension for teams who are evaluating that approach: do I just go with dbt Cloud and they're gonna solve all my problems? Or do I value the fact that I have full control over my project, because dbt from the CLI is self hosted, I can do whatever I want, and I don't have to worry about the cost scaling with my usage? I'm wondering if you can talk to some of the tensions that teams address in that question, and maybe some of the ways that these self-service approaches to observability can mitigate that potential pain point.
[00:09:40] Unknown:
Yeah. So I think dbt Cloud has its value. And as you said, a lot of it has to do with the user experience and the development experience. I think they did a great job of helping users who are maybe less technical, less comfortable with a development environment, and haven't worked with code in the past to work with it very easily. In terms of scaling, I think it does work for organizations, because they can invite more people to collaborate on the project. And it's very easy to start. Right? No setup in terms of getting orchestration infra.
I do think that when it comes to observability, we still see a lot of the users of Elementary using dbt Cloud, so it doesn't answer their needs. I think the main reason for that, beyond some gaps they still have and I'm sure they're gonna address, is that although all of your logic is in your dbt project and, as I said, there's a lot of gold in that context, what really impacts the health of your data and the performance is a lot of moving parts. There's the underlying data warehouse, and there's the orchestrator, and there are the sources, and there are the tools that pull data from the warehouse. And there are a lot of other elements. And as long as dbt Cloud looks only at that single element of the pipeline, you're still gonna miss stuff.
[00:11:16] Unknown:
And on the other side of the scale are these generalized data observability systems, or in some cases people will lean on their observability stacks to try and get visibility into their overall data platform execution. And I'm curious, what are some of the shortcomings in the experience, particularly for dbt projects, that teams are battling with when trying to adopt these either larger scale or more generalized systems for data observability?
[00:11:47] Unknown:
Yeah. So in my past, I tried to utilize systems like, as you said, application monitoring, like Datadog and Splunk, to monitor data. It was hard. I think it's easier to develop the ad hoc solutions we talked about than to make those platforms work for you when it comes to data observability. And then when it comes to data observability tools that are not built for this workflow, what drove us to build the way we did is that I think observability has a lot to do with habits and with investing in implementing best practices.
It's not a pure tech problem. Right? It's a tech and people and processes problem, and tech only takes you so far. It's kind of like with sports: you know it's good for you, you know you need to work out, but if you can't find the setting that is comfortable and works for you, like if the gym is not close enough to home or anything like that, then you're not actually gonna do it. So we really try to build into the way you already work, into your development workflow. I think that for other tools in the market, the barrier to entry for someone who's an analytics engineer is very high. You need a lot of setup, you need permissions, you probably rely on your DevOps team or your data platform administrators or someone to actually give you access. And then you would need to replicate a lot of the configuration you already invested in building into that tool.
And then you kind of need to make that tool aware of things like: this is my production environment, and these are just dev tables, you should ignore them; this is how frequently you should monitor this pipeline; and this is a table that loads incrementally. There's a lot of context that you need to load. And everything is so external to how you work, to your code, to your environments, to your logic. When you develop, you need to go to a different system and remember to do it, and it's kind of scattered all over the place.
Or you say, okay, I have my dbt tests. This is what they give me, and I'm gonna stick to it because it's very convenient. I think the adoption of dbt tests is very, very wide. And it speaks to how easy it is to use them and how incorporated they are into your workflow. So if you end up using both dbt tests and an external tool, then you get this mess where nothing is consolidated and everything is even harder to monitor in terms of processes. Lastly, another big difference is that being part of the pipeline kind of gives you powers.
You can stop the pipeline, and you can prevent that data from propagating further. You can monitor right when your data is loaded. So it's the most timely monitoring and also the most efficient one. So that was another big incentive to really build into the workflow and build into the pipeline.
[00:15:06] Unknown:
In terms of that aspect of embedding into the workflow: a lot of these more generalized observability systems will use the data warehouse as their focal point for identifying activity and figuring out which signals are going to be useful for determining whether everything is healthy, particularly if they're trying to do any sort of anomaly detection across the data. But as you pointed out, that leaves out a whole chunk of the work being done, where you only know there's a problem after you've already pushed it into production. For people who are building with dbt, and for the case where you are able to embed into that development workflow and the CI/CD workflow, I'm curious what are some of the useful signals for raising that early warning to teams, to say this change that you're making is likely to cause these downstream problems, and some of the types of insights that you're able to generate for people so that they can reduce the cycle time for identifying and addressing problems?
[00:16:10] Unknown:
Yeah. So what we see a lot of our users do is that they work with Elementary in different environments, just like they work with dbt. They have their dbt project, which they run in dev, which they run in staging, which they run in production. And the fact that Elementary and our monitors and the testing and everything is so incorporated into your dbt project means that you also have three Elementary environments that are equivalent to your dbt environments. And we see all kinds of deployments. Right? That's also part of being part of your code. You can really have the same flexibility.
So some of our users only use our monitors in staging, because they only load data to production after they validate it in staging and see that everything is okay. Others monitor in production, but they use dbt build, and they use a lot of the Elementary tests as tests that actually stop the pipeline. So if there is a problem, data only loads up to the table where the problem was detected and doesn't propagate further. Very often, the problems are in the sources. Right? So the pipeline doesn't even start, because the source has issues.
So this is kind of how it's used today. We have some plans around it. We want to provide more options for how you can use Elementary to prevent issues. But right now, I think we're still at the phase where working with the different environments is already very valuable. And I think a lot of teams that incorporated that successfully when they built their dbt project already got a huge benefit in reducing the number of incidents they have in production.
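As a concrete illustration of that dbt build pattern, here is a hedged sketch of a schema file using anomaly tests from Elementary's dbt package (test names per its documentation; the source, columns, and timestamp field are illustrative). With severity set to error, a failing test keeps downstream models from running:

```yaml
version: 2

sources:
  - name: raw
    tables:
      - name: raw_orders   # illustrative source table
        tests:
          # Anomaly monitors from the elementary dbt package. With
          # severity: error, a failure here stops `dbt build` from
          # running anything downstream of this source.
          - elementary.volume_anomalies:
              timestamp_column: loaded_at
              config:
                severity: error
          - elementary.freshness_anomalies:
              timestamp_column: loaded_at
              config:
                severity: error
```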
[00:18:03] Unknown:
And then, for that earlier-in-the-development-cycle problem, there is also another set of tools that have been developed in particular for dbt: the various linters, pre-commit checks, and best-practice and sanity checks for the code style and the structural elements of the dbt project. And I'm curious how that overlaps with the more generalized observability and data quality and developer quality issues that teams are addressing?
[00:18:37] Unknown:
I think something very powerful that happens when users start using Elementary heavily is that they actually start getting more benefit from implementing best practices. When I say best practices, it's things like assigning owners to the different models and the different tests, using tags, using descriptions, even reducing the number of tests that nobody actually addresses and adding to the tests that people actually care about. So we see a lot of that impact. And I think the teams that implement Elementary at the highest level also start enforcing that in their development process. They start enforcing that you can't add a new model without defining an owner, without defining which channel alerts should go to, without defining what they consider baseline observability.
So it can be volume anomalies and freshness anomalies and schema monitoring and things that are the absolute baseline for them. So we actually see teams leverage the fact that they can enforce that as policies in their CI to maintain a high standard over time.
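One way such a CI policy might look, as a sketch: a script that scans dbt's manifest.json and fails the build when a model is missing required metadata. The meta keys here are hypothetical team conventions, not Elementary requirements.

```python
# Sketch of a CI gate over dbt's manifest.json: fail if any model lacks
# the team's required metadata. The meta keys are hypothetical conventions.
import json
import sys

REQUIRED_META = ("owner", "alert_channel")

def missing_meta(path: str = "target/manifest.json") -> list[str]:
    with open(path) as f:
        manifest = json.load(f)
    problems = []
    for node_id, node in manifest["nodes"].items():
        if node.get("resource_type") != "model":
            continue
        # meta may live at the node level or under config, depending on version
        meta = {**(node.get("config", {}).get("meta") or {}), **(node.get("meta") or {})}
        for key in REQUIRED_META:
            if not meta.get(key):
                problems.append(f"{node_id}: missing meta.{key}")
    return problems

if __name__ == "__main__":
    errors = missing_meta()
    print("\n".join(errors) or "all models carry required metadata")
    sys.exit(1 if errors else 0)
```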
[00:19:56] Unknown:
This episode is brought to you by Datafold, a testing automation platform for data engineers that prevents data quality issues from entering every part of your data workflow, from migration to dbt deployment. Datafold has recently launched data replication testing, providing ongoing validation for source-to-target replication. Leverage Datafold's fast cross-database data diffing and monitoring to test your replication pipelines automatically and continuously. Validate consistency between source and target at any scale, and receive alerts about any discrepancies. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold today.
And digging into the Elementary toolchain and the technology stack, I'm curious if you can talk to some of the design aspects that you were focused on for the initial development process, some of the core goals that you focused on as you built out the product and the open source side of the system, some of the specific challenges and problems that you're addressing first and foremost, and some of the ways that that has evolved as you build out more capability?
[00:21:07] Unknown:
Yeah. So our main design principle was that we want to give our users the ability to use the product without learning anything new. Right? They don't need a learning curve to start using Elementary. So you need to really stick to the tech they already know and the tools they already know. And you need to make it as easy as possible for them, without any barriers, without relying on anyone else. And that was really challenging. It was one of the biggest challenges in building Elementary. So we started with a dbt package, because we're like, that's where they live. So we must be part of their development cycle, part of the project.
And I don't know, did you ever try to develop a dbt package or something?
[00:21:59] Unknown:
I haven't done my own development of dbt packages. I've looked a little bit into them structurally, and I've started to consider using them as a way to separate out some of the core business rules around a particular product, so that they can live in the codebase of the application where that data originates, but I haven't actually gone down that path yet. So I'm curious to hear your experience of building and maintaining dbt packages and some of the sharp edges that you've run up against.
[00:22:32] Unknown:
Yeah. So I think at first, when we heard about the concept of dbt packages, we were like, oh, we can just build a plugin. Right? But a dbt package is actually just a dbt project. So it's like another project that is attached to your own project. And it means that you're limited to what dbt was designed for. Right? And dbt wasn't designed to facilitate plugins. It was designed to facilitate dbt projects and data modeling and simple macros. So it was really challenging to do complex engineering there. And I think probably some of our team knows the dbt codebase better than some of the developers at dbt, because they had to understand so well what the different possibilities are and what's actually exposed to you. We also made several contributions to dbt Core to enable stuff we needed. But I think it was a very, very good decision, because we paid the engineering price in order to build something that is so easy for our users to start with. It's just a two minute setup with the code they already know, the permissions they already have, the data warehouse they already have access to. Everything is there. They can get all the outputs very easily, query them in SQL, work with their BI tool to analyze them. Everything is super simple for them to start.
And then when we moved on from there to other needs, like visualization and alerting and all that, we also tried to maintain the same principles. So, for example, we have a UI in the open source offering, but you don't need a server or anything to run it. It's a file, basically; some of our users even send it over Slack. They don't even host it anywhere. So that was a decision to keep things very, very simple and keep our users very independent. And then, obviously, as you scale and your needs scale, we got to the limits of what you can do with it, and we wanted to give more advanced solutions, so we also built the cloud offering. But we still try to keep the same principles and keep as much code as possible shared with the open source.
And one of the big benefits of building like this is that our cloud product doesn't require access to your data. The way it works is that you deploy our dbt package, it writes all the outputs to an Elementary schema, and the cloud only requires access to that schema and to other metadata, like query history and the information schema, stuff like that, in your data warehouse. And this way, we kept the same principle of removing as much friction as possible when you're adopting the tool, to actually make it easy to start and easy to adopt.
[00:25:36] Unknown:
Another interesting aspect of this space right now is that dbt was one of the earliest entrants that helped to define the overall space of analytics engineering. And as it has grown, it has helped to elevate that workflow and those capabilities. But now that that success has been gained, there are a number of other projects coming along to try and capitalize on that growth and offer additional enhancements or a better user experience in different aspects. And I'm curious, as somebody who is so deeply integrated into the dbt ecosystem, how you're thinking about keeping your options open to also integrate with some of those other systems as they grow and gain adoption, picking things like SQLMesh, Malloy, SDF, etcetera.
[00:26:27] Unknown:
Yeah. So I do think one of the powers of standards, and I think dbt became the de facto standard, is not only the tool itself or the framework itself, but also the ecosystem around it. And I do think that today you're gonna get so much value out of other tools in the ecosystem when you adopt dbt, and that makes it very hard to switch to any other solution. But obviously, as those solutions get more traction and get adopted more widely, an ecosystem will be created around them as well. And I think that at the end of the day, the same principles we applied to dbt can be applied to other tools as well. It's kind of a similar workflow.
Eventually, at the end of the day, Elementary runs SQL queries against your datasets. So the fact that today we construct them with very complicated dbt macros can still be translated to any other, hopefully simpler, coding language than Jinja. So in that sense, we do try to build generically, and we are open to adopting other solutions, but it's not something I see in the near future, at least. We like the fact that we're focused, and we still have a large user base to serve by being focused on dbt.
[00:28:00] Unknown:
And so, for teams who are interested in adopting Elementary for their workflow, I'm curious if you can talk through the overall process of setting it up, getting it integrated, and starting to adopt the various capabilities as part of the development cycle?
[00:28:18] Unknown:
Yeah. So the question of, I started building a dbt project, or I have a dbt project, when should I start using Elementary? The answer is yesterday. So right when you start, at least with the dbt package. And you can really think of it as a gradual approach. You can start with the dbt package. It's gonna take you two minutes. It has zero friction, zero cost, zero setup, and you're gonna start getting value. You're gonna start seeing the outputs of what Elementary produces. It's gonna give you visibility that you didn't have before, and it's gonna give you the ability to do anomaly detection and, like, advanced tests that aren't offered with the built-in dbt tests.
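For reference, the package setup she refers to looks roughly like this (the version number is illustrative; check Elementary's docs for the current release and recommended configuration):

```yaml
# packages.yml
packages:
  - package: elementary-data/elementary
    version: 0.16.1   # illustrative; pin the current release

# dbt_project.yml -- route Elementary's models to their own schema
models:
  elementary:
    +schema: elementary
```

After `dbt deps` and `dbt run --select elementary`, the package's output tables land in that schema, and Elementary's `edr` CLI can generate the standalone HTML report mentioned earlier (`edr report`).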
And then from there, your needs are gonna start growing. You're gonna start saying, oh, I wish I could get alerts about this stuff. I wish I could route these alerts to different people and tag them and leverage all this metadata. And I wish I could see all these results on a lineage graph, go down to the column level, and see the impact on my dashboards. There's a lot of room to use the capabilities that help you reduce the time to resolution when you have an issue, or avoid making breaking changes, or really take a more proactive approach to data issues. And that's where you should consider one of our other offerings, like the cloud offering or the CLI tool.
When we do POCs with users who start adopting the cloud product, we structure it in three phases. First, we try to get them to a baseline of observability: let's make sure that we prevent all the super embarrassing stuff, right, so those don't go undetected anymore. Let's get you to this basic coverage of freshness and volume and schema and uniqueness and null checks. Let's get you to that level, and let's talk about the most embarrassing incidents you had and see that they're covered.
Then we do an enhancement phase where we say, okay, let's focus on your critical models, what can go wrong with each, and try to build a plan for that. And then lastly, the advanced part is getting to the process and the enforcement: how do you maintain that over time, how do you incorporate that into your dev process, and how do you enforce a governance policy? It's not enough to have this onboarding with Elementary, which is really cool, where you make a lot of progress in two weeks, but then a year later your project is different and you've lost everything. Right? You need to find a way to easily maintain that over time. So that's the three-phase approach. And I think a lot of our open source users are trying to incorporate the same phases on their own.
[00:31:32] Unknown:
And once somebody is using Elementary, leaning on the insights that it's able to provide and incorporating them into their development workflow and their team review process, I'm curious how you've seen that impact the overall approach to development, some of the ways it shifts the thinking and the planning, and just the overall experience of working on a dbt project, in ways that cause teams to either accelerate their delivery pace or change the way that they design their systems, etcetera?
[00:32:04] Unknown:
So I think something we see is that they become a lot more intentional when they're testing. When they're making changes, when they're building a new pipeline or a new model, they're already thinking about, okay, who should be alerted, and on what? What are the things that, if they happen, mean this pipeline shouldn't be relied on anymore, so we need to tell people, okay, don't trust it anymore, we'll fix it? What are the things we should validate? And going back to my incident response background, I used to give a talk saying that your incident response doesn't start when you have an incident.
It starts today. Right? The incident will happen, and you need to think about how you can be prepared and how you can proactively make the impact of incidents lower. Because incidents will happen. Right? Data breaks; that's just gonna keep happening. The only way you can have data pipelines that don't break is if you don't make any changes, you don't work with any complex datasets, and you probably don't bring any value, if that's where you're at. So the more complexity, the more value you bring, the more issues are gonna happen. And the thing is how you can build your system to proactively think about it and be prepared. And the more preparation there is, the more it's part of the development cycle, and the more you adopt those best practices, the less time you're gonna spend on each issue.
And your stakeholders are gonna trust you. They're gonna accept that issues happen, but you're gonna be the one telling them, rather than them doubting you and trying to understand if stuff went wrong and you didn't notify them. So we really see that come into play with users who adopt Elementary heavily.
[00:34:08] Unknown:
Another thing about these types of tools, in particular the pre-commit style checks, but also just the tools that bring additional rigor to the process: if you're not careful about how you implement it and roll it out to the team, it can actually cause you to either stall out in terms of the velocity that you're able to build up, or it can cause the team to discard the tool wholesale because they don't want to deal with the pain of adapting to the practices it's trying to encourage. And I'm curious how you're approaching that side of the problem as well: making sure that the overall burden of extra work doesn't cause teams to try out Elementary, say this is going to add too much work to my plate, and just get rid of it, not bother, and ignore all these issues that it's trying to highlight.
[00:35:06] Unknown:
Yeah. So I think one thing is that, until now, we've kind of had the privilege that users who come to Elementary are already paying a price for not investing enough in their observability. That time is already spent. Right? So our job is to convince them that if they invest it differently, then over time they're gonna reduce it significantly. And also, you'd rather invest in the positive steps than the negative steps. Right? It's better to not have fires, to invest in building buildings that don't burn, than to deal with fires and try to put water on them as early as possible all the time. So I do think that users have more awareness today of the return on investment of spending time on those best practices.
We don't enforce things ourselves. We work with our users and help them plan what they should enforce, what's working for them, and what's not. Actually, something we're working on now is giving them more visibility. They already have visibility in Elementary to see which tests fail often and what the failure rates and success rates of their monitors are. But right now, we're trying to help them also monitor whether people address the problems. And our recommendation is general: if no one will address this test when it fails, then you shouldn't test for it. Right? Because nobody cares.
So we help with both. We don't tell them to add thousands of tests for the sake of testing. Right? We try to help them make conscious decisions and create a coverage that works for them, so they don't work for the coverage. Right? It serves their goals.
[00:37:12] Unknown:
As you have been investing in this space of observability, developer experience improvement, and data quality for people who are invested in the dbt ecosystem and using that as their de facto approach for managing transformations, what are some of the most interesting or innovative or unexpected ways that you've seen the Elementary toolchain used?
[00:37:38] Unknown:
That's an interesting question. We've seen a lot of super creative stuff that people do. I think something very cool about Elementary is that it saves all of the outputs to your warehouse, to that Elementary schema I spoke of. And then it's accessible to our users, and we've seen a lot of use cases that our users solve with it. We saw teams that use it to run an automated data warehouse cleanup workflow, to keep everything clean and reduce cost. And we've seen it used for cost analysis, to understand exactly how much each pipeline or each business domain costs, or to do stuff around change management.
So we saw a lot of ad hoc use cases that users solve with Elementary. An interesting use case was migrations: we saw users, when they were migrating between data warehouses with the same dbt project, run the exact same tests, also monitor the pipeline itself, and then compare the results they got in Elementary from the two different data warehouses to validate the migration. And we also saw users do things that we didn't expect in data quality, like monitoring trends. So they have patterns of configuring tests to warning and ignoring the results, not sending any alert on them, but then they create an alert if a certain test warns more than three times in the same week, or twice in the same day, kind of creating this leveled approach to how they test the data.
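That leveled pattern could be implemented as a scheduled query over Elementary's results table, along these lines. The table and column names follow the elementary schema's elementary_test_results model but may vary by version, and the date function here is Snowflake-style:

```sql
-- Sketch: page someone only when the same test has warned repeatedly.
-- Assumes Elementary's elementary_test_results table with status and
-- detected_at columns; adjust names to your version and warehouse.
select
    test_unique_id,
    count(*) as warnings_last_7_days
from elementary.elementary_test_results
where status = 'warn'
  and detected_at >= dateadd('day', -7, current_timestamp)
group by test_unique_id
having count(*) >= 3
```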
[00:39:34] Unknown:
And in your experience of investing in this ecosystem, putting in the engineering time and effort to build this suite of capabilities, and working with end users, I'm curious what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:39:51] Unknown:
So being a startup founder in general is a very humbling experience, and building a product is a very humbling experience. I think the biggest lesson is that you need to be very, very attentive to the users, and you need to keep experimenting, and you need to always listen, because it's shocking to realize how little you can predict what will actually make an impact and what users will actually react to. You think you know, and you think you're already an expert in the space and you've been through a lot, but you keep having surprises, both positive ones and negative ones.
So I think every time we lose sight of that and do things without getting enough feedback, experimenting fast, failing fast, and getting feedback fast, it's always a mistake. So that's something we keep doing. I can say that even when we started Elementary, we were very, very focused on the anomaly detection part and the data observability part. And we actually created a lot of the metadata tables and all that collection just so we could route the alerts or add that information to the alerts. And then we saw that most users actually adopt Elementary for that, and only later discover that we have anomaly detection and adopt it. So that's just an example of a super positive surprise that we had no way of predicting. That became a super big part of the product without us planning it.
[00:41:36] Unknown:
And for teams who are building their dbt projects and trying to improve their overall productivity and uptime and capabilities, what are the cases where Elementary is the wrong choice?
[00:41:52] Unknown:
So, obviously, if you don't work with dbt and all of your critical pipelines are non-dbt, then it's probably the wrong choice for you. Also, I think we did meet some teams out there that did incorporate dbt and work with it, but it's challenging for them. The coding part, the deployment part, the development process: it's not empowering; it's more of a struggle. Elementary isn't looking to abstract that away. We're looking to be part of it, so we're really a much better fit for teams that feel empowered by dbt and can leverage the span of possibilities that it opens up. And also, if you don't have any data issues, or you don't care about them, if you have those unicorn datasets where everything is always okay, then you're lucky, and you don't need us.
[00:42:47] Unknown:
And as you continue to build and iterate on the technology and the product, what are some of the things you have planned for the near to medium term, or any projects or problem areas you're excited to explore?
[00:42:59] Unknown:
Yeah. That's always a big dilemma in a startup. Right? Because things change so rapidly. We're very open with our users that we only have a road map that goes two quarters out, max. But that's also an opportunity, because they have a lot of impact on what we build, and the feedback from them is super valuable. A big dilemma we faced, and I think we will probably keep facing as we grow, is: should we go wide or should we go deep? Like the question you asked me before about platforms that aren't dbt, other frameworks out there, and the question about teams that Elementary is the wrong choice for, teams that are not using dbt heavily.
So in terms of the problems we solve, the users we serve, the stack we support, should we go wider or should we go deeper? And our lesson so far has been that we're at our best when we're very focused and we go deep. So that's still our plan. And despite the progress that teams that incorporate Elementary experience, they still have a lot of challenges around data observability that we want to solve. At the moment, we're focusing on three areas. One is that we're really trying to learn how our users decide what to monitor. We look at the testing they add, and we ask them, and we try to understand the decision-making process so we can make it easier for them moving forward. We've already automated some of it, but we still have a lot of aspirations there. We also see that they really struggle with communication of data health and data issues, so that's the people and processes part of the problem; we can still make a lot of progress there and help them with that. And then we keep trying to measure the time to resolution when they do have incidents, and we're trying to make a positive impact there. We have a lot of ideas and areas that we're exploring around that. But I can promise the users of Elementary that we're gonna keep making data observability easier for you, and we're gonna keep refusing your requests for us to solve the other issues you face. We wanna solve them, but we're not there yet.
[00:45:22] Unknown:
And are there any other aspects of the overall space of data observability for dbt projects, the work that you're doing at Elementary, or the ways that you see this overall challenge of data quality and data observability evolving as the ecosystem grows and matures, that we didn't discuss yet that you'd like to cover before we close out the show?
[00:45:44] Unknown:
Yeah. I think this whole ecosystem is still growing, and I think there was a phase of doing more and more. And now people are trying to consolidate, do less, and be more focused on the valuable things. I think that with observability, we need to be able to support that process and do the same: help them with priorities, understanding what's actually critical, reducing the noise, and helping them monitor what's actually important. And I also think that still the big problem in data modeling, data observability, in analytics maybe, is the depth of the business context that people have, and that's probably just not something we can ever automate.
Sometimes we see users add tests, and we have no idea why they decided to add them or why they decided to model their data in a certain way. And then we ask them, and it becomes super clear. But we still need that context. We still need to ask them. So we won't have, I don't know, the magical AI bot that could replace that context. But I think the big progress we can make is in how we create the interface that they can feed that context into, to get, as easily as possible, the coverage that they need, the coverage that works for them, and the coverage that really supports their goals. So that's an area, I think, where we can make big progress. And I think for other domains in data, if they're able to create better interfaces for users to input context and get an easier workflow out, then that's definitely gonna create progress.
And maybe someday someone will figure out the time zone differences issue that creates so many data quality problems. But I think that's just too far ahead. We're not there yet in terms of technology.
[00:47:54] Unknown:
Everybody just needs to use UTC all the time.
[00:47:58] Unknown:
Yeah. Not gonna happen, I think. I'm afraid.
[00:48:02] Unknown:
Unfortunately not. Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you and your team are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as the biggest gap in the tooling or technology that's available for data management today.
[00:48:19] Unknown:
Yeah. So I think, going back to that context question: how can we make it easy for people to share why they made the decisions they made in data modeling, why they made the decisions they made in data observability, why they made the decisions they made in documenting or not documenting stuff? Things would make more sense to the new members on your team, to your stakeholders, to everyone you collaborate with, and even to the vendors you work with. Right? If we had more context from our users about what drove their decisions, then we could give them better advice and better outcomes.
And that's still something that I don't think anyone has figured out: how can we communicate better around the decisions and the design patterns that we applied and why we did it.
[00:49:17] Unknown:
Alright. Well, thank you very much for taking the time today to join me and share your work on Elementary and your experience and perspective on the overall space of data observability for dbt projects. It's definitely a very interesting and complex problem area, so I appreciate the time and energy that you and your team are putting into helping to solve for that. And I hope you enjoy the rest of your day.
[00:49:41] Unknown:
Yeah. Thank you for having me. And I also hope our listeners enjoy. I do want to point out that English is my third language, so I hope people will forgive my mistakes and enjoy listening.
[00:50:03] Unknown:
Thank you for listening. Don't forget to check out our other shows: Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Guest Introduction
Maayan's Journey into Data
Challenges in Data Quality and Observability
DIY Approaches to Data Observability
DBT Cloud vs. Self-Hosted DBT
Shortcomings of Generalized Observability Systems
Embedding Observability into Development Workflow
Best Practices and Sanity Checks in DBT Projects
Elementary's Design Principles and Technology Stack
Adopting Elementary in Your Workflow
Impact of Elementary on Development Workflow
Interesting Use Cases of Elementary
Lessons Learned in Building Elementary
When Elementary is the Wrong Choice
Future Plans for Elementary
Evolving Challenges in Data Observability
Final Thoughts and Contact Information