Summary
Working with data is a complicated process, with numerous chances for something to go wrong. Identifying and accounting for those errors is a critical piece of building trust across the organization that your data is accurate and up to date. While there are numerous products available to provide that visibility, they each focus on different technologies and workflows. To bring observability to dbt projects, the team at Elementary embedded themselves into the workflow. In this episode Maayan Salom explores the approach that she has taken to bring observability, enhanced testing capabilities, and anomaly detection into every step of the dbt developer experience.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
- Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster today to get started. Your first 30 days are free!
- This episode is brought to you by Datafold – a testing automation platform for data engineers that prevents data quality issues from entering every part of your data workflow, from migration to dbt deployment. Datafold has recently launched data replication testing, providing ongoing validation for source-to-target replication. Leverage Datafold's fast cross-database data diffing and monitoring to test your replication pipelines automatically and continuously. Validate consistency between source and target at any scale, and receive alerts about any discrepancies. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold.
- Your host is Tobias Macey and today I'm interviewing Maayan Salom about how to incorporate observability into a dbt-oriented workflow and how Elementary can help
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by outlining what elements of observability are most relevant for dbt projects?
- What are some of the common ad-hoc/DIY methods that teams develop to acquire those insights?
- What are the challenges/shortcomings associated with those approaches?
- Over the past ~3 years, numerous data observability systems/products have been created. What are some of the ways that the specifics of dbt workflows are not covered by those generalized tools?
- What are the insights that can be more easily generated by embedding into the dbt toolchain and development cycle?
- Can you describe what Elementary is and how it is designed to enhance the development and maintenance work in dbt projects?
- How is Elementary designed/implemented?
- How have the scope and goals of the project changed since you started working on it?
- What are the engineering challenges/frustrations that you have dealt with in the creation and evolution of Elementary?
- Can you talk us through the setup and workflow for teams adopting Elementary in their dbt projects?
- How does the incorporation of Elementary change the development habits of the teams who are using it?
- What are the most interesting, innovative, or unexpected ways that you have seen Elementary used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Elementary?
- When is Elementary the wrong choice?
- What do you have planned for the future of Elementary?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- Starburst: ![Starburst Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/UpvN7wDT.png) This episode is brought to you by Starburst - a data lake analytics platform for data engineers who are battling to build and scale high quality data pipelines on the data lake. Powered by Trino, Starburst runs petabyte-scale SQL analytics fast at a fraction of the cost of traditional methods, helping you meet all your data needs ranging from AI/ML workloads to data applications to complete analytics. Trusted by the teams at Comcast and Doordash, Starburst delivers the adaptability and flexibility a lakehouse ecosystem promises, while providing a single point of access for your data and all your data governance allowing you to discover, transform, govern, and secure all in one place. Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Try Starburst Galaxy today, the easiest and fastest way to get started using Trino, and get $500 of credits free. [dataengineeringpodcast.com/starburst](https://www.dataengineeringpodcast.com/starburst)
- Datafold: ![Datafold](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/zm6x2tFu.png) This episode is brought to you by Datafold – a testing automation platform for data engineers that prevents data quality issues from entering every part of your data workflow, from migration to dbt deployment. Datafold has recently launched data replication testing, providing ongoing validation for source-to-target replication. Leverage Datafold's fast cross-database data diffing and monitoring to test your replication pipelines automatically and continuously. Validate consistency between source and target at any scale, and receive alerts about any discrepancies. Learn more about Datafold by visiting https://get.datafold.com/replication-de-podcast.
- Dagster: ![Dagster Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/jz4xfquZ.png) Data teams are tasked with helping organizations deliver on the premise of data, and with ML and AI maturing rapidly, expectations have never been this high. However data engineers are challenged by both technical complexity and organizational complexity, with heterogeneous technologies to adopt, multiple data disciplines converging, legacy systems to support, and costs to manage. Dagster is an open-source orchestration solution that helps data teams rein in this complexity and build data platforms that provide unparalleled observability and testability, all while fostering collaboration across the enterprise. With enterprise-grade hosting on Dagster Cloud, you gain even more capabilities, adding cost management, security, and CI support to further boost your teams' productivity. Go to [dagster.io](https://dagster.io/lp/dagster-cloud-trial?source=data-eng-podcast) today to get your first 30 days free!
Hello, and welcome to the Data Engineering podcast, the show about modern data management. Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte scale SQL analytics fast at a fraction of the cost of traditional methods so that you can meet all of your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and DoorDash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data.
Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. Dagster offers a new approach to building and running data platforms and data pipelines. It is an open source, cloud native orchestrator for the whole development lifecycle with integrated lineage and observability, a declarative programming model, and best in class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise class hosted solution that offers serverless and hybrid deployments, enhanced security, and on demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster today to get started, and your first 30 days are free.
Your host is Tobias Macey. And today, I'm interviewing Maayan Salom about how to incorporate observability into a dbt-oriented workflow and some of the ways that Elementary can help. So, Maayan, can you start by introducing yourself?
[00:01:52] Unknown:
Yeah. Sure. So, happy to be here. I'm Maayan. My Starbucks name is Maya. It's much easier to pronounce. I'm the CEO and cofounder of Elementary. Some people know us as Elementary Data. I'd been in data roles for 12 years before starting Elementary, mainly in cybersecurity companies. I actually got into data much earlier, because I was a kid that was obsessed with sports. I'm originally Argentinian, and my dad wanted a boy. And when he didn't get a boy, he's like, you're gonna watch football with me. So I was really obsessed with stats and everything. Got all the way to databases when I reached kind of the limits of Excel.
So that's how I started. And, obviously, later on in my career, I handled more critical data pipelines in more intense environments. And data quality was a problem from when I was doing stats for my own fun in sports all the way to much bigger, more complicated tasks later on. So that's what got me started with Elementary.
[00:02:57] Unknown:
You mentioned already how you first got interested in working with data. I'm wondering if you can give a bit of a sense of what it is about the space that has kept you interested, and why you want to focus your time and energy on that problem space.
[00:03:11] Unknown:
So I think, in general, I have a big passion for data. It's kind of the right way to make decisions. And I think everyone who is a data professional probably feels that in many aspects of their lives, not just in their professional lives. And it's something you trust. Right? When there's data, you know you're gonna make the right decisions. And when you can't trust it, when you feel like it's lying, when you see the way stats are sometimes used in media, maybe to create wrong messages, then it really breaks your heart. So that's a very frustrating part of working intensely with data. In my last role before Elementary, I was doing cybersecurity incident response.
That's a very, very intense role. There's a big crisis that you're there to solve, and it's time sensitive. There's a lot of pressure, and you need to be very, very accurate with everything. There are a lot of consequences. And just the amount of time we spent there on validating and revalidating and trying to understand if everything was okay was so frustrating that it felt like something I wanted to focus on and solve.
[00:04:28] Unknown:
And now digging into the question of observability, in particular for dbt projects: data observability started coming to the fore in the data space maybe 2 or 3 years ago. And I'm just wondering if you can talk to some of the elements of observability that are most applicable to people who are using dbt for managing their transformations in a SQL context.
[00:04:53] Unknown:
Yeah. So we started Elementary a bit over two years ago, and we saw the revolution, let's call it, that dbt is bringing to how people build, how it makes things so much easier and abstracts so much of the complexity. And we felt that when it comes to observability, the same simplicity needs to apply, and the same kind of change needs to apply. And we felt that there wasn't a tool out there that we would use, if we were building a dbt project, that would make observability really easy. In terms of what your needs are when it comes to observability for a dbt project, I think it has three aspects. The first is not unique to dbt. It's the data itself. You need to validate it. You need to monitor it. You need to understand if there are unexpected changes, if it really adheres to your expectations.
Then there is the operational part. I think part of what makes working with dbt nice is that it lets you take all these small steps in your pipeline and turn them into, like, one big job that you run. But then in terms of observability, you really need the details and to focus on each step: the performance of each step, the execution time, the results, and also to understand the trends over time, which is hard to do because the operations in dbt are very standalone. Each run is on its own. And also there's a lot of metadata. dbt lets you really structure the way you build, which creates a lot of metadata, and it also facilitates you creating a lot of metadata around your pipelines.
And that's gold. If you can take that context and really leverage it, then you can build a much more comprehensive plan for how to monitor, how to govern everything, what the importance of each incident is. And those are kind of the three aspects that we really try to help with.
[00:06:58] Unknown:
And for people who are using dbt, they're trying to gain some visibility into the overall metrics of their project. They're trying to understand what are the things that are going well, how can I improve, what are the reasons for these different failures, what are the anomalies that I have to deal with. What are some of the ad hoc or DIY approaches that teams are likely to attempt in the process of trying to obtain those insights?
[00:07:28] Unknown:
So many teams, when they're small, when they're just starting, they're gonna do things like taking the log files of dbt and taking the manifest file, all the different outputs you can use, and trying to parse them and load them to wherever it's comfortable for them to work. So either sending them to some log processing tool like Datadog or Splunk or something like that, or taking them and even uploading them to the warehouse, because that's where you feel most comfortable, right, with SQL. And then maybe even working with your BI tool and creating some dashboards on top of it. We also saw users doing stuff like breaking down their dbt project to run each model as, like, a different step of the orchestrator, to kind of build for better observability.
So all kinds of hacks. And some teams have a really good setup that is working for them. The question is really, how does it hold up over time? Right? How much maintenance does it require? How does it hold up with version upgrades, with changes, with more and more needs, and how does it scale?
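For illustration, here is a minimal sketch of the kind of DIY artifact-parsing script described above. It follows the documented layout of dbt's run_results.json; the destination table named in the comments is hypothetical, and a real version would add the warehouse-loading step.

```python
# Sketch: flatten dbt's run_results.json into rows for a warehouse table.
# Relies on the artifact's documented fields (metadata.generated_at,
# results[].unique_id, .status, .execution_time); the destination table
# name is illustrative.
import json

def parse_run_results(path: str = "target/run_results.json") -> list[dict]:
    with open(path) as f:
        artifact = json.load(f)
    generated_at = artifact["metadata"]["generated_at"]
    rows = []
    for result in artifact["results"]:
        rows.append({
            "invocation_time": generated_at,
            "node_id": result["unique_id"],      # e.g. model.my_project.orders
            "status": result["status"],          # success / error / skipped
            "execution_time_s": result["execution_time"],
        })
    return rows

if __name__ == "__main__":
    # In practice these rows would be loaded into something like a
    # dbt_run_history table and charted in a BI tool.
    for row in parse_run_results():
        print(row)
```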
[00:08:42] Unknown:
And for teams who are scaling their usage of dbt, a lot of the work that the dbt product team is focused on is trying to move them into the cloud environment as a means of getting some of that visibility, some of the ease of use, and developer experience enhancements. And I'm curious what you see as some of the tension for teams who are evaluating that approach: do I just go with dbt Cloud and they're gonna solve all my problems? Or do I value the fact that I have full control over my project, because dbt from the CLI is self hosted, I can do whatever I want, and I don't have to worry about the cost scaling with my usage? I'm wondering if you can talk to some of the tensions that teams address in that question, and maybe some of the ways that these self-service approaches to observability can mitigate that potential pain point.
[00:09:40] Unknown:
Yeah. So I think dbt Cloud has its value. And as you said, a lot of it has to do with the user experience and the development experience. I think they did a great job of helping users who are maybe less technical, less comfortable with a development environment, and haven't worked with code in the past to work with it very easily. In terms of scaling, I think it does work for organizations, because they can invite more people to collaborate on the project. And it's very easy to start. Right? No setup in terms of getting orchestration infra.
I do think that when it comes to observability, we still see a lot of the users of Elementary using dbt Cloud, so it doesn't answer their needs. I think the main reason for that, beyond some gaps they still have and I'm sure they're gonna address, is that although all of your logic is in your dbt project and, as I said, there's a lot of gold in that context, what really impacts the health of your data and the performance is a lot of moving parts. There's the underlying data warehouse, and there's the orchestrator, and there are the sources, and there are the tools that pull data from the warehouse. And there are a lot of other elements. And as long as dbt Cloud looks only at that single element of the pipeline, you're still gonna miss stuff.
[00:11:16] Unknown:
And on the other side of the scale are these generalized data observability systems, or in some cases people will lean on their observability stacks to try and get visibility into their overall data platform execution. And I'm curious, what are some of the shortcomings in the experience, particularly for dbt projects, that teams are battling with when trying to adopt these either larger scale or more generalized systems for data observability?
[00:11:47] Unknown:
Yeah. So in my past, I tried to utilize systems like, as you said, application monitoring, like Datadog and Splunk, to monitor data. It was hard. I think it's easier to develop the ad hoc solutions we talked about than to make those platforms work for you when it comes to data observability. And then when it comes to data observability tools that are not built for this workflow, what drove us to build the way we did is that I think observability has a lot to do with habits and with investing in implementing best practices.
It's not a pure tech problem. Right? It's a tech and people and processes problem, and tech only takes you so far. It's kind of like with sports: you know it's good for you, you know you need to work out, but if you can't find the setting that is comfortable and works for you, like if the gym is not close enough to home or anything like that, then you're not actually gonna do it. So we really try to build into the way you already work, into your development workflow. I think that for other tools in the market, the barrier to entry for someone who's an analytics engineer is very high. You need a lot of setup, you need permissions, you probably rely on your DevOps team or your data platform administrators or someone to actually give you access. And then you would need to replicate a lot of the configuration you already invested in building into that tool.
And then you kind of need to make that tool aware of things like: this is my production environment, and these are just dev tables, you should ignore them; this is how frequently you should monitor this pipeline; and this is a table that loads incrementally. There's a lot of context that you need to load. And everything is so external to how you work, to your code, to your environments, to your logic. When you develop, you need to go to a different system and remember to do it, and it's kind of scattered all over the place.
Or you say, okay, I have my dbt tests. This is what they give me, and I'm gonna stick to it because it's very convenient. I think the adoption of dbt tests is very, very wide. And it speaks to how easy it is to use them and how incorporated they are into your workflow. So if you end up using both dbt tests and an external tool, then you get this mess where nothing is consolidated and everything is even harder to monitor in terms of processes. Lastly, another big difference is that being part of the pipeline kind of gives you powers.
You can stop the pipeline, and you can prevent that data from propagating further. You can monitor right when your data is loaded. So it's the most timely monitoring and also the most efficient one. So that was another big incentive to really build into the workflow and build into the pipeline.
[00:15:06] Unknown:
In terms of that aspect of embedding into the workflow: a lot of these more generalized observability systems will use the data warehouse as their focal point for identifying activity and figuring out which signals are going to be useful for determining whether everything is healthy, particularly if they're trying to do any sort of anomaly detection across the data. But as you pointed out, that leaves out a whole chunk of the work being done, where you only know there's a problem after you've already pushed it into production. For people who are building with dbt, and for the case where you are able to embed into that development workflow and the CI/CD workflow, I'm curious what are some of the useful signals for raising that early warning to teams, to say this change that you're making is likely to cause these downstream problems, and some of the types of insights that you're able to generate for people so that they can reduce the cycle time for identifying and addressing problems?
[00:16:10] Unknown:
Yeah. So what we see a lot of our users do is that they work with Elementary in different environments, just like they work with dbt. They have their dbt project, which they run in dev, which they run in staging, which they run in production. And the fact that Elementary and our monitors and the testing and everything is so incorporated into your dbt project means that you also have three Elementary environments that are equivalent to your dbt environments. And we see all kinds of deployments. Right? That's also part of being part of your code. You can really have the same flexibility.
So some of our users only use our monitors in staging, because they only load data to production after they validate it in staging and see that everything is okay. Others monitor in production, but they use dbt build, and they use a lot of the Elementary tests as tests that actually stop the pipeline. So if there is a problem, data only loads up to the table where the problem was detected and doesn't propagate further. Very often, the problems are in the sources. Right? So the pipeline doesn't even start, because the source has issues.
So this is kind of how it's used today. We have some plans around it. We want to provide more options for how you can use Elementary to prevent issues. But right now, I think we're still at the phase where working with the different environments is already very valuable. And I think a lot of teams that incorporated that successfully when they built their dbt project already got a huge benefit in reducing the number of incidents they have in production.
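As a concrete illustration of that dbt build pattern, here is a hedged sketch of a schema file using anomaly tests from Elementary's dbt package (test names per its documentation; the source, columns, and timestamp field are illustrative). With severity set to error, a failing test keeps downstream models from running:

```yaml
version: 2

sources:
  - name: raw
    tables:
      - name: raw_orders   # illustrative source table
        tests:
          # Anomaly monitors from the elementary dbt package. With
          # severity: error, a failure here stops `dbt build` from
          # running anything downstream of this source.
          - elementary.volume_anomalies:
              timestamp_column: loaded_at
              config:
                severity: error
          - elementary.freshness_anomalies:
              timestamp_column: loaded_at
              config:
                severity: error
```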
[00:18:03] Unknown:
And then, for that earlier-in-the-development-cycle problem, there is also another set of tools that have been developed in particular for dbt: the various linters, pre-commit checks, and best-practice and sanity checks for the code style and the structural elements of the dbt project. And I'm curious how that overlaps with the more generalized observability and data quality and developer quality issues that teams are addressing?
[00:18:37] Unknown:
I think something very powerful that happens when users start using Elementary heavily is that they actually start getting more benefit from implementing best practices. When I say best practices, it's things like assigning owners to the different models and the different tests, using tags, using descriptions, even reducing the number of tests that nobody actually addresses and adding to the tests that people actually care about. So we see a lot of that impact. And I think the teams that implement Elementary at the highest level also start enforcing that in their development process. They start enforcing that you can't add a new model without defining an owner, without defining which channel alerts should go to, without defining what they consider baseline observability.
So it can be volume anomalies and freshness anomalies and schema monitoring and things that are the absolute baseline for them. So we actually see teams leverage the fact that they can enforce that as policies in their CI to maintain a high standard over time.
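One way such a CI policy might look, as a sketch: a script that scans dbt's manifest.json and fails the build when a model is missing required metadata. The meta keys here are hypothetical team conventions, not Elementary requirements.

```python
# Sketch of a CI gate over dbt's manifest.json: fail if any model lacks
# the team's required metadata. The meta keys are hypothetical conventions.
import json
import sys

REQUIRED_META = ("owner", "alert_channel")

def missing_meta(path: str = "target/manifest.json") -> list[str]:
    with open(path) as f:
        manifest = json.load(f)
    problems = []
    for node_id, node in manifest["nodes"].items():
        if node.get("resource_type") != "model":
            continue
        # meta may live at the node level or under config, depending on version
        meta = {**(node.get("config", {}).get("meta") or {}), **(node.get("meta") or {})}
        for key in REQUIRED_META:
            if not meta.get(key):
                problems.append(f"{node_id}: missing meta.{key}")
    return problems

if __name__ == "__main__":
    errors = missing_meta()
    print("\n".join(errors) or "all models carry required metadata")
    sys.exit(1 if errors else 0)
```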
[00:19:56] Unknown:
This episode is brought to you by Datafold, a testing automation platform for data engineers that prevents data quality issues from entering every part of your data workflow, from migration to dbt deployment. Datafold has recently launched data replication testing, providing ongoing validation for source-to-target replication. Leverage Datafold's fast cross-database data diffing and monitoring to test your replication pipelines automatically and continuously. Validate consistency between source and target at any scale, and receive alerts about any discrepancies. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold today.
And digging into the Elementary toolchain and the technology stack, I'm curious if you can talk to some of the design aspects that you were focused on for the initial development process, some of the core goals that you focused on as you built out the product and the open source side of the system, some of the specific challenges and problems that you're addressing first and foremost, and some of the ways that that has evolved as you build out more capability?
[00:21:07] Unknown:
Yeah. So our main design principle was that we want to give our users the ability to use the product without learning anything new. Right? They don't need a learning curve to start using Elementary. So you need to really stick to the tech they already know and the tools they already know. And you need to make it as easy as possible for them, without any barriers, without relying on anyone else. And that was really challenging. It was one of the biggest challenges in building Elementary. So we started with a dbt package, because we're like, that's where they live. So we must be part of their development cycle, part of the project.
And I don't know, did you ever try to develop a dbt package or something?
[00:21:59] Unknown:
I haven't done my own development of dbt packages. I've looked a little bit into them structurally, and I've started to consider using them as a way to separate out some of the core business rules around a particular product, so that they can live in the codebase of the application where that data originates, but I haven't actually gone down that path yet. So I'm curious to hear your experience of building and maintaining dbt packages and some of the sharp edges that you've run up against.
[00:22:32] Unknown:
Yeah. So I think at first, when we heard about the concept of dbt packages, we were like, oh, we can just build a plugin. Right? But a dbt package is actually just a dbt project. So it's like another project that is attached to your own project. And it means that you're limited to what dbt was designed for. Right? And dbt wasn't designed to facilitate plugins. It was designed to facilitate dbt projects and data modeling and simple macros. So it was really challenging to do complex engineering there. And I think probably some of our team knows the dbt codebase better than some of the developers at dbt, because they had to understand so well what the different possibilities are and what's actually exposed to you. We also made several contributions to dbt Core to enable stuff we needed. But I think it was a very, very good decision, because we paid the engineering price in order to build something that is so easy for our users to start with. It's just a two minute setup with the code they already know, the permissions they already have, the data warehouse they already have access to. Everything is there. They can get all the outputs very easily, query them in SQL, work with their BI tool to analyze them. Everything is super simple for them to start.
And then when we moved on from there to other needs, like visualization and alerting and all that, we also tried to maintain the same principles. So, for example, we have a UI in the open source offering, but you don't need a server or anything to run it. It's a file, basically; some of our users even send it over Slack. They don't even host it anywhere. So that was a decision to keep things very, very simple and keep our users very independent. And then, obviously, as you scale and your needs scale, we got to the limits of what you can do with it, and we wanted to give more advanced solutions, so we also built the cloud offering. But we still try to keep the same principles and keep as much code as possible shared with the open source.
And one of the big benefits of building like this is that our cloud product doesn't require access to your data. The way it works is that you deploy our dbt package, it writes all the outputs to an Elementary schema, and the cloud only requires access to that schema and to other metadata, like query history and the information schema, stuff like that, in your data warehouse. And this way, we kept the same principle of removing as much friction as possible when you're adopting the tool, to actually make it easy to start and easy to adopt.
[00:25:36] Unknown:
Another interesting aspect of this space right now is that dbt was one of the earliest entrants that helped to define the overall space of analytics engineering. And as it has grown, it has helped to elevate that workflow and those capabilities. But now that that success has been gained, there are a number of other projects coming along to try and capitalize on that growth and offer additional enhancements or a better user experience in different aspects. And I'm curious, as somebody who is so deeply integrated into the dbt ecosystem, how you're thinking about keeping your options open to also integrate with some of those other systems as they grow and gain adoption, picking things like SQLMesh, Malloy, SDF, etcetera.
[00:26:27] Unknown:
Yeah. So I do think one of the powers of standards, and I think dbt became the de facto standard, is not only the tool itself or the framework itself, but also the ecosystem around it. And I do think that today you're gonna get so much value out of other tools in the ecosystem when you adopt dbt, and that makes it very hard to switch to any other solution. But obviously, as those solutions get more traction and get adopted more widely, an ecosystem will be created around them as well. And I think that at the end of the day, the same principles we applied to dbt can be applied to other tools as well. It's kind of a similar workflow.
Eventually, at the end of the day, Elementary runs SQL queries against your datasets. So the fact that today we construct them with very complicated dbt macros can still be translated to any other, hopefully simpler, coding language than Jinja. So in that sense, we do try to build generically, and we are open to adopting other solutions, but it's not something I see in the near future, at least. We like the fact that we're focused, and we still have a large user base to serve by being focused on dbt.
[00:28:00] Unknown:
And so, for teams who are interested in adopting Elementary for their workflow, I'm curious if you can talk through the overall process of setting it up, getting it integrated, and starting to adopt the various capabilities as part of the development cycle?
[00:28:18] Unknown:
Yeah. So the question of, I started building a dbt project, or I have a dbt project, when should I start using Elementary? The answer is yesterday. So right when you start, at least with the dbt package. And you can really think of it as a gradual approach. You can start with the dbt package. It's gonna take you two minutes. It has zero friction, zero cost, zero setup, and you're gonna start getting value. You're gonna start seeing the outputs of what Elementary produces. It's gonna give you visibility that you didn't have before, and it's gonna give you the ability to do anomaly detection and, like, advanced tests that aren't offered with the built-in dbt tests.
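For reference, the package setup she refers to looks roughly like this (the version number is illustrative; check Elementary's docs for the current release and recommended configuration):

```yaml
# packages.yml
packages:
  - package: elementary-data/elementary
    version: 0.16.1   # illustrative; pin the current release

# dbt_project.yml -- route Elementary's models to their own schema
models:
  elementary:
    +schema: elementary
```

After `dbt deps` and `dbt run --select elementary`, the package's output tables land in that schema, and Elementary's `edr` CLI can generate the standalone HTML report mentioned earlier (`edr report`).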
And then from there, your needs are gonna start growing. You're gonna start saying, oh, I wish I could get alerts about this stuff. I wish I could route these alerts to different people and tag them and leverage all this metadata. And I wish I could see all these results on a lineage graph, go down to the column level, and see the impact on my dashboards. There's a lot of room to use the capabilities that help you reduce the time to resolution when you have an issue, or avoid making breaking changes, or really take a more proactive approach to data issues. And that's where you should consider one of our other offerings, like the cloud offering or the CLI tool.
When we do POCs with users who start adopting the cloud product, we structure it in three phases. First, we try to get them to a baseline of observability: let's make sure that we prevent all the super embarrassing stuff, right, so those don't go undetected anymore. Let's get you to this basic coverage of freshness and volume and schema and uniqueness and null checks. Let's get you to that level, and let's talk about the most embarrassing incidents you had and see that they're covered.
Then we do an enhancement phase where we say, okay, let's focus on your critical models, what can go wrong with each, and try to build a plan for that. And then lastly, the advanced part is getting to the process and the enforcement: how do you maintain that over time, how do you incorporate that into your dev process, and how do you enforce a governance policy? It's not enough to have this onboarding with Elementary, which is really cool, where you make a lot of progress in two weeks, but then a year later your project is different and you've lost everything. Right? You need to find a way to easily maintain that over time. So that's the three-phase approach. And I think a lot of our open source users are trying to incorporate the same phases on their own.
[00:31:32] Unknown:
And once somebody is using Elementary, leaning on the insights that it's able to provide and incorporating them into their development workflow and their team review process, I'm curious how you've seen that impact the overall approach to development, some of the ways it shifts the thinking and the planning, and just the overall experience of working on a dbt project, in ways that cause teams to either accelerate their delivery pace or change the way that they design their systems, etcetera?
[00:32:04] Unknown:
So I think something we see is that they become a lot more intentional when they're testing. When they're making changes, when they're building a new pipeline or a new model, they're already thinking about, okay, who should be alerted, and on what? What are the things that, if they happen, mean this pipeline shouldn't be relied on anymore, so we need to tell people, okay, don't trust it anymore, we'll fix it? What are the things we should validate? And going back to my incident response background, I used to give a talk saying that your incident response doesn't start when you have an incident.
It starts today. Right? The incident will happen, and you need to think about how you can be prepared and how you can proactively make the impact of incidents lower. Because incidents will happen. Right? Data breaks; that's just gonna keep happening. The only way you can have data pipelines that don't break is if you don't make any changes, you don't work with any complex datasets, and you probably don't bring any value, if that's where you're at. So the more complexity, the more value you bring, the more issues are gonna happen. And the thing is how you can build your system to proactively think about it and be prepared. And the more preparation there is, the more it's part of the development cycle, and the more you adopt those best practices, the less time you're gonna spend on each issue.
And your stakeholders are gonna trust you. They're gonna accept that issues happen, but you're gonna be the one telling them, rather than them doubting you and trying to understand if stuff went wrong and you didn't notify them. So we really see that come into play with users who adopt Elementary heavily.
[00:34:08] Unknown:
Another thing about these types of tools, in particular the pre-commit style checks, but also just the tools that bring additional rigor to the process: if you're not careful about how you implement it and roll it out to the team, it can actually cause you to either stall out in terms of the velocity that you're able to build up, or it can cause the team to discard the tool wholesale because they don't want to deal with the pain of adapting to the practices it's trying to encourage. And I'm curious how you're approaching that side of the problem as well: making sure that the overall burden of extra work doesn't cause teams to try out Elementary, say this is going to add too much work to my plate, and just get rid of it, not bother, and ignore all these issues that it's trying to highlight.
[00:35:06] Unknown:
Yeah. So I think one thing is that, until now, we've kind of had the privilege that users who come to Elementary are already paying a price for not investing enough in their observability. That time is already spent. Right? So our job is to convince them that if they invest it differently, then over time they're gonna reduce it significantly. And also, you'd rather invest in the positive steps than the negative steps. Right? It's better to not have fires, to invest in building buildings that don't burn, than to deal with fires and try to put water on them as early as possible all the time. So I do think that users have more awareness today of the return on investment of spending time on those best practices.
We don't enforce things ourselves. We work with our users and help them plan what they should enforce, what's working for them, and what's not. Actually, something we're working on now is giving them more visibility. They already have visibility in Elementary to see which tests fail often and what the failure rates and success rates of their monitors are. But right now, we're trying to help them also monitor whether people address the problems. And our recommendation is general: if no one will address this test when it fails, then you shouldn't test for it. Right? Because nobody cares.
So we help with both. We don't tell them to add thousands of tests for the sake of testing. Right? We try to help them make conscious decisions and create a coverage that works for them, so they don't work for the coverage. Right? It serves their goals.
[00:37:12] Unknown:
As you have been investing in this space of observability, developer experience improvement, and data quality for people who are invested in the dbt ecosystem and using that as their de facto approach for managing transformations, what are some of the most interesting or innovative or unexpected ways that you've seen the Elementary toolchain used?
[00:37:38] Unknown:
That's an interesting question. We've seen a lot of super creative stuff that people do. I think something very cool about Elementary is that it saves all of the outputs to your warehouse, to that Elementary schema I spoke of. And then it's accessible to our users, and we've seen a lot of use cases that our users solve with it. We saw teams that use it to run an automated data warehouse cleanup workflow, to keep everything clean and reduce cost. And we've seen it used for cost analysis, to understand exactly how much each pipeline or each business domain costs, or to do stuff around change management.
So we saw a lot of ad hoc use cases that users solve with Elementary. An interesting use case was migrations: we saw users, when they were migrating between data warehouses with the same dbt project, run the exact same tests, also monitor the pipeline itself, and then compare the results they got in Elementary from the two different data warehouses to validate the migration. And we also saw users do things that we didn't expect in data quality, like monitoring trends. So they have patterns of configuring tests to warning and ignoring the results, not sending any alert on them, but then they create an alert if a certain test warns more than three times in the same week, or twice in the same day, kind of creating this leveled approach to how they test the data.
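That leveled pattern could be implemented as a scheduled query over Elementary's results table, along these lines. The table and column names follow the elementary schema's elementary_test_results model but may vary by version, and the date function here is Snowflake-style:

```sql
-- Sketch: page someone only when the same test has warned repeatedly.
-- Assumes Elementary's elementary_test_results table with status and
-- detected_at columns; adjust names to your version and warehouse.
select
    test_unique_id,
    count(*) as warnings_last_7_days
from elementary.elementary_test_results
where status = 'warn'
  and detected_at >= dateadd('day', -7, current_timestamp)
group by test_unique_id
having count(*) >= 3
```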
[00:39:34] Unknown:
And in your experience of investing in this ecosystem, putting in the engineering time and effort to build this suite of capabilities, and working with end users, I'm curious what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:39:51] Unknown:
So being a startup founder in general is a very humbling experience, and building a product is a very humbling experience. I think the biggest lesson is that you need to be very, very attentive to the users, and you need to keep experimenting, and you need to always listen, because it's shocking to realize how little you can predict what will actually make an impact and what users will actually react to. You think you know, and you think you're already an expert in the space and you've been through a lot, but you keep having surprises, both positive ones and negative ones.
So I think every time we lose sight of that and do things without getting enough feedback, experimenting fast, failing fast, and getting feedback fast, it's always a mistake. So that's something we keep doing. I can say that even when we started Elementary, we were very, very focused on the anomaly detection part and the data observability part. And we actually created a lot of the metadata tables and all that collection just so we could route the alerts or add that information to the alerts. And then we saw that most users actually adopt Elementary for that, and only later discover that we have anomaly detection and adopt it. So that's just an example of a super positive surprise that we had no way of predicting. That became a super big part of the product without us planning it.
[00:41:36] Unknown:
And for teams who are building their dbt projects and trying to improve their overall productivity and uptime and capabilities, what are the cases where Elementary is the wrong choice?
[00:41:52] Unknown:
So, obviously, if you don't work with dbt and all of your critical pipelines are non-dbt, then it's probably the wrong choice for you. Also, I think we did meet some teams out there that did incorporate dbt and work with it, but it's challenging for them. The coding part, the deployment part, the development process: it's not empowering; it's more of a struggle. Elementary isn't looking to abstract that away. We're looking to be part of it, so we're really a much better fit for teams that feel empowered by dbt and can leverage the span of possibilities that it opens up. And also, if you don't have any data issues, or you don't care about them, if you have those unicorn datasets where everything is always okay, then you're lucky, and you don't need us.
[00:42:47] Unknown:
And as you continue to build and iterate on the technology and the product, what are some of the things you have planned for the near to medium term, or any projects or problem areas you're excited to explore?
[00:42:59] Unknown:
Yeah. That's always a big dilemma in a startup. Right? Because things change so rapidly. We're very open with our users that we only have a road map that goes two quarters out, max. But that's also an opportunity, because they have a lot of impact on what we build, and the feedback from them is super valuable. A big dilemma we faced, and I think we will probably keep facing as we grow, is: should we go wide or should we go deep? Like the question you asked me before about platforms that aren't dbt, other frameworks out there, and the question about teams that Elementary is the wrong choice for, teams that are not using dbt heavily.
So in terms of the problems we solve, the users we serve, the stack we support, should we go wider or should we go deeper? And our lesson so far has been that we're at our best when we're very focused and we go deep. So that's still our plan. And despite the progress that teams that incorporate Elementary experience, they still have a lot of challenges around data observability that we want to solve. At the moment, we're focusing on three areas. One is that we're really trying to learn how our users decide what to monitor. We look at the testing they add, and we ask them, and we try to understand the decision-making process so we can make it easier for them moving forward. We've already automated some of it, but we still have a lot of aspirations there. We also see that they really struggle with communication of data health and data issues, so that's the people and processes part of the problem; we can still make a lot of progress there and help them with that. And then we keep trying to measure the time to resolution when they do have incidents, and we're trying to make a positive impact there. We have a lot of ideas and areas that we're exploring around that. But I can promise the users of Elementary that we're gonna keep making data observability easier for you, and we're gonna keep refusing your requests for us to solve the other issues you face. We wanna solve them, but we're not there yet.
[00:45:22] Unknown:
And are there any other aspects of the overall space of data observability for dbt projects, the work that you're doing at Elementary, or the ways that you see this overall challenge of data quality and data observability evolving as the ecosystem grows and matures, that we didn't discuss yet that you'd like to cover before we close out the show?
[00:45:44] Unknown:
Yeah. I think this whole ecosystem is still growing, and I think there was a phase of doing more and more. And now people are trying to consolidate, do less, and be more focused on the valuable things. I think that with observability, we need to be able to support that process and do the same: help them with priorities, understanding what's actually critical, reducing the noise, and helping them monitor what's actually important. And I also think that still the big problem in data modeling, data observability, in analytics maybe, is the depth of the business context that people have, and that's probably just not something we can ever automate.
Sometimes we see users add tests, and we have no idea why they decided to add them or why they decided to model their data in a certain way. And then we ask them, and it becomes super clear. But we still need that context. We still need to ask them. So we won't have, I don't know, the magical AI bot that could replace that context. But I think the big progress we can make is in how we create the interface that they can feed that context into, to get, as easily as possible, the coverage that they need, the coverage that works for them, and the coverage that really supports their goals. So that's an area, I think, where we can make big progress. And I think for other domains in data, if they're able to create better interfaces for users to input context and get an easier workflow out, then that's definitely gonna create progress.
And maybe someday someone will figure out the time zone differences issue that creates so many data quality problems. But I think that's just too far ahead. We're not there yet in terms of technology.
[00:47:54] Unknown:
Everybody just needs to use UTC all the time.
[00:47:58] Unknown:
Yeah. Not gonna happen, I think. I'm afraid.
[00:48:02] Unknown:
Unfortunately not. Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you and your team are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as the biggest gap in the tooling or technology that's available for data management today.
[00:48:19] Unknown:
Yeah. So I think, going back to that context question: how can we make it easy for people to share why they made the decisions they made in data modeling, why they made the decisions they made in data observability, why they made the decisions they made in documenting or not documenting stuff? Things would make more sense to the new members on your team, to your stakeholders, to everyone you collaborate with, and even to the vendors you work with. Right? If we had more context from our users about what drove their decisions, then we could give them better advice and better outcomes.
And that's still something that I don't think anyone has figured out: how can we communicate better around the decisions and the design patterns that we applied and why we did it.
[00:49:17] Unknown:
Alright. Well, thank you very much for taking the time today to join me and share your work on Elementary and your experience and perspective on the overall space of data observability for dbt projects. It's definitely a very interesting and complex problem area, so I appreciate the time and energy that you and your team are putting into helping to solve for that. And I hope you enjoy the rest of your day.
[00:49:41] Unknown:
Yeah. Thank you for having me. And I also hope our listeners enjoy. I do want to point out that English is my third language, so I hope people will forgive my mistakes and enjoy listening.
[00:50:03] Unknown:
Thank you for listening. Don't forget to check out our other shows: Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Guest Introduction
Maayan's Journey into Data
Challenges in Data Quality and Observability
DIY Approaches to Data Observability
DBT Cloud vs. Self-Hosted DBT
Shortcomings of Generalized Observability Systems
Embedding Observability into Development Workflow
Best Practices and Sanity Checks in DBT Projects
Elementary's Design Principles and Technology Stack
Adopting Elementary in Your Workflow
Impact of Elementary on Development Workflow
Interesting Use Cases of Elementary
Lessons Learned in Building Elementary
When Elementary is the Wrong Choice
Future Plans for Elementary
Evolving Challenges in Data Observability
Final Thoughts and Contact Information