Summary
As businesses increasingly invest in technology and talent focused on data engineering and analytics, they want to know whether they are benefiting. So how do you calculate the return on investment for data? In this episode Barr Moses and Anna Filippova explore that question and provide useful exercises to start answering it in your own company.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack
- Your host is Tobias Macey and today I'm interviewing Barr Moses and Anna Filippova about how and whether to measure the ROI of your data team
Interview
- Introduction
- How did you get involved in the area of data management?
- What are the typical motivations for measuring and tracking the ROI for a data team?
- Who is responsible for collecting that information?
- How is that information used and by whom?
- What are some of the downsides/risks of tracking this metric? (law of unintended consequences)
- What are the inputs to the number that constitutes the "investment"? infrastructure, payroll of employees on team, time spent working with other teams?
- What are the aspects of data work and its impact on the business that complicate a calculation of the "return" that is generated?
- How should teams think about measuring data team ROI?
- What are some concrete ROI metrics data teams can use?
- What level of detail is useful? What dimensions should be used for segmenting the calculations?
- How can visibility into this ROI metric be best used to inform the priorities and project scopes of the team?
- With so many tools in the modern data stack today, what is the role of technology in helping drive or measure this impact?
- How do your respective solutions, Monte Carlo and dbt, help teams measure and scale data value?
- With generative AI on the upswing of the hype cycle, what are the impacts that you see it having on data teams?
- What are the unrealistic expectations that it will produce?
- How can it speed up time to delivery?
- What are the most interesting, innovative, or unexpected ways that you have seen data team ROI calculated and/or used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on measuring the ROI of data teams?
- When is measuring ROI the wrong choice?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- Rudderstack: ![Rudderstack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png) Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at [dataengineeringpodcast.com/rudderstack](https://www.dataengineeringpodcast.com/rudderstack)
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack.
Your host is Tobias Macey, and today I'm interviewing Barr Moses and Anna Filippova about how and whether to measure the ROI of your data team. So, Barr, welcome back. For people who haven't heard your past appearances, can you just give a quick introduction?
[00:00:57] Unknown:
Yeah. Absolutely. Great to be here again. I think this is our third time. My name is Barr Moses. I'm the CEO and cofounder of Monte Carlo, the data observability platform. The best way to think about us is: just like engineering teams use solutions like Datadog and New Relic to make sure that their applications are reliable, we help data teams make sure that their data products are reliable. We're fortunate to work with hundreds of customers, including folks like CNN, JetBlue, Gusto, and PagerDuty, among many others. And I'm stoked to be here today because the ROI of data teams is something that I get asked about maybe daily at this point, so I'm really looking forward to the discussion.
[00:01:36] Unknown:
And, Anna, can you introduce yourself?
[00:01:38] Unknown:
Absolutely. Thanks so much for having me here. My name is Anna Filippova, and I run the data team at dbt Labs. dbt Labs is the company behind the dbt open source project, and we also have a cloud offering that is the best and definitive way to run dbt. For folks who aren't familiar, dbt is the industry standard for data transformation and an increasingly useful collaboration platform for data teams to work and share code together. And we work really closely with Barr and her company.
[00:02:22] Unknown:
And going back to you, Barr, if you can, just share how you first got started working in data.
[00:02:28] Unknown:
Yeah. So, how did I get started with data? I guess very early, as a kid, actually. My dad is a physics professor, and my mom is a meditation teacher. From a very early age I was hanging out in my dad's lab trying to blow things up and, you know, create things in petri dishes and whatnot. And I remember as a kid playing lots of guesstimation games with my dad. Growing up, that continued to be a hobby, and over time I just got more and more interested in those topics. I ended up studying math and stats in college, and then worked with data teams throughout, you know, the last decade and a half or so.
I've always been intrigued by what is possible with data, and I think the last couple of years have definitely shown us a whole new world as it relates to that. One of my favorite subjects in college was actually the application of mathematics to magic tricks. We basically had to make up a magic trick using a mathematical concept, which I loved. So it's been a theme throughout my life, and I've been really fortunate to work with some of the best data teams and learn from them every single day. And, obviously, at Monte Carlo, we get to partner with the best, see what challenges they have, and be a part of their journey, and that's what gives me the most pleasure in my day-to-day work.
[00:03:53] Unknown:
And, Anna, do you remember how you first got started working in data?
[00:03:57] Unknown:
I do. And, like many things in my life, it was a happy accident. I started out as a researcher, like many data scientists before me. I was in academia for a long time, and I landed in a data role doing something that I had spent quite a bit of time doing in academia: studying open source software, the people who develop it, the motivations behind it, and the team dynamics behind it. One of the really fun things about the area I was doing research in was that there are a lot of public datasets available, in particular datasets around open source activity on GitHub. That was my first exposure to really big data. I also accidentally fell into data management more than analytics and data science. Like many people before me, I got frustrated by the lack of tooling that was available to me at the time. I tried to build a lot of it myself, as I think many people have, and that's what landed me at dbt Labs, when I realized that there was a company building a lot of the things that I was trying to do.
[00:05:07] Unknown:
And now in terms of the subject at hand, we're talking about the idea of ROI or return on investment for data teams. And before we get too much into defining what that even means, I'm wondering if we can start by getting an understanding of what the typical motivations are for an organization to think about measuring and tracking that ROI.
[00:05:28] Unknown:
Anna, do you wanna start? Sure. I'd love to. Putting on my cynic hat for a second: I think a question about tracking data team ROI very often comes a little bit later than it probably should in the life cycle of most data teams and organizations, at a point where folks are starting to think about business costs. And I think that the earlier you start thinking about it, the more useful that question becomes. Just like it's important to understand the efficacy and efficiency of your engineering team (or, if you work in an EPD kind of model, your EPD team), you need to be thinking about the efficiency of all the different parts of the business and have a really clear mental model of where and how different pieces of your business fit into the overall trajectory.
So I think, typically, the motivations are are well intentioned, but sometimes a little bit late in the game.
[00:06:28] Unknown:
Barr, what have you observed? Yeah. I agree with your somewhat cynical start. You know, data teams are oftentimes hired or created on a promise to really accelerate the business, and oftentimes those data teams attract really big investments in the form of people, management, infrastructure, time, and resources. And when organizations are confronted with the question of, well, what did we deliver? Or, put another way, what have you done for me lately? Then data teams have to ask themselves about ROI. So oftentimes, I actually think when people ask a question about ROI, what they really mean is: are you aligned with the priorities of the organization? There are very many ways to measure that and to go about that. But if you ask the five whys, or even just one or two whys, behind that question, it's actually: hey, is the data team working on something that's actually driving and pushing the business or not? When you're one or two people in an organization, it's a lot more obvious whether the work is aligned or not. And when the organization is bigger and more complex, it's a lot harder to answer that question. And when it
[00:07:40] Unknown:
comes to understanding ROI, there are a number of different ways to track both the return aspect of it as well as what it means to even invest in the data team and the data platform. I'm wondering if you can talk to some of the areas where those different responsibilities lie. Who is responsible for collecting the information of how much am I investing into this overall process and business capability? Who is responsible for calculating what the return looks like? And what are some of the main ways that that information might get used within the organization?
[00:08:15] Unknown:
Yeah. I can take a shot at that. I think at the end of the day, maybe taking even a step back before looking at the tactics of who's collecting the data and who's measuring it, the data team should be in a position where they are interested in helping to make sure the ROI on their work is very clear. And that's because what we do matters. Right? So we should be working on data products that matter. We should be working with consumers that matter. We should be working on answering questions that matter. And if we're not, I'm not sure what we're doing here; we should reevaluate and change course. Right? So at the end of the day, there is an aligned incentive there in actually making sure that what we're working on is important.
And, you know, I think this is very well known, but just to mention it in case it's not: there are entire industries where data teams have literally changed the face of the organization with the power of data. One of the most interesting examples recently in the news is media. If you look at companies like Vox or The New York Times or CNN or many others, their data teams have literally changed the business models for media, from print to advertising to user subscriptions, by understanding that in depth. Data teams have been fueling that; I can't imagine that industry without some strong data capabilities. So the ROI there, for example, is very, very clear and apparent. And there are other industries changing their face, their entire business model, as a result as well. I'm even seeing CPG turning more to ecommerce, with stronger usage of data there and very sophisticated analysis of supply chains and optimization of operations.
So there are so many examples where data can literally propel your business forward. I know that sounds very high level, but it really varies by industry, and we see it everywhere. I think at the start, it's making sure that you're clear on how the data team aligns with that. And then to your question, Tobias, of who actually collects and who actually builds the business case, I do think that's the data team as well, in collaboration with the stakeholders. And it starts with defining who the stakeholders are internally. For many data teams, you start with one internal stakeholder. For example, it could be the executive team in the very early days of a company. Maybe the CEO wants to answer the question of how many customers do we have or how much are they paying, really basic questions about the business. And so your main stakeholder is the ability to answer that. Then over time, you can obviously build more capabilities and be a good partner to marketing teams and sales teams and R&D teams. And something that we're seeing more and more is having external customers, where data teams are responsible for data products, whether that be machine learning models in production or reports and data shares that get sent to customers.
In all of those instances, there is room to actually measure the ROI of those activities and get conviction that what you're doing is improving the business.
[00:11:09] Unknown:
Yeah. I really agree with that, Barr, and it really resonates with me as a practitioner and as a data leader. The way that I think about measuring ROI tends to differ depending on the type of data role that someone is filling, and I tend to think about this in three buckets. There are folks that are working explicitly on data products that are part of a company offering. And there, it's a very similar process, in terms of thinking about ROI, to the way that you would think about the ROI of your engineering, product, and design function, because it is directly customer facing. It is impacting the business in some way. Maybe that is the way that you show advertising on your platform.
Maybe that is augmentation, increasingly AI. Right? There's also the internal analytics component, and that tends to be where folks find it the hardest to measure ROI, in my experience, because there folks are thinking about how fast are we answering questions, how fast are we delivering insight to the business, and what does it mean to be data informed and data enabled? Usually, when I think about framing this for a data team, I take a step back and I encourage folks to actually switch from that reactive mindset of measuring success to a proactive means of measuring success. There, you're thinking more about how do I set up this team and the work that I'm doing to help the business anticipate things that it is not yet anticipating.
How do I prepare for the question that isn't being asked yet? Because that is what impacts and dramatically changes the trajectory and velocity of a team in the examples that you've described, Barr. Usually, when that works well, it's because the data team is really well connected to decision making. Data teams are in the room, giving recommendations and saying: hey, we see this thing happening in six months or in a year, or we see an opportunity here in this area of the business, and we should go after it, and here's what that looks like. And I think that is where it becomes really easy to measure that ROI, because you're not thinking about it post hoc; you're thinking about it upfront in terms of how you spend and allocate your time. And then the third bucket I tend to think about is the unsung heroes of data work: data platform folks, data engineering folks.
There's a really interesting trade-off between growing teams and buying software, and the trade-offs on a platform team are not necessarily about how much money you spend versus a particular business outcome, but about how you allocate those resources. Are you spending more resources on humans to build things yourself, because you have a very bespoke solution that you need to create in your business, or are you investing in tooling and resources and the integration of those things? Either way, that spend is almost a requirement for business operations, the same way that infrastructure spend on AWS for engineering teams is almost a requirement.
[00:14:28] Unknown:
And as far as the inputs to the overall calculation of return on investment, starting with the investment aspect, what do you see as the major contributors to that investment calculation? Is it largely payroll? Is it infrastructure? Are there ways that they incorporate the cost of maintenance and the storage volumes? I'm just curious what the biggest drivers of that investment calculation are when teams are starting to go through this exploration.
[00:14:58] Unknown:
Yeah. Just like for an engineering team, I think that the biggest driver of the cost aspect of the ROI calculation tends to be people cost, more so than platform and tooling costs. It is much easier to focus on the platform and tooling cost as something that is easy to control, but we sometimes forget about the inherent cost of staffing a team. For example, a really common thing that I see on the data platform side of the house is a consideration between: should I use an open source framework, or should I pay a particular vendor to do something? And we often forget about the cost of building a team around maintenance of a thing. Anytime you add a new piece of infrastructure that isn't a cloud service, you have to have folks maintaining it and making sure that it's staying up, and that tends to be more than folks anticipate at the beginning. It takes more than one or two people. So that tends to be one of the variables that I come across, and what I encourage folks to think about is, rather than thinking about the volume of people you're hiring or how much you're spending on queries, thinking about how you are leveraging the investments that you've already made. Start small, hire a couple of people, and make sure they're working on the most important problems, and make sure they are there in the room with you to have that impact from the beginning, because then it makes it really easy for you to see how to scale that process going forward.
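To make that people-versus-tooling trade-off concrete, here is a minimal sketch of the kind of total-cost-of-ownership comparison Anna describes. All names and figures are illustrative assumptions, not numbers from the episode.

```python
# Hypothetical total-cost-of-ownership comparison for "build on open source"
# versus "buy a managed service". All figures are illustrative assumptions.

def annual_tco(license_cost: float, infra_cost: float,
               maintainers: float, loaded_salary: float) -> float:
    """Annual cost = software + infrastructure + the people who run it.

    `maintainers` is the fraction of full-time engineers devoted to
    keeping the system up (the term that is often forgotten).
    """
    return license_cost + infra_cost + maintainers * loaded_salary

# Self-hosted open source: no license fee, but ~1.5 engineers on upkeep.
self_hosted = annual_tco(license_cost=0, infra_cost=60_000,
                         maintainers=1.5, loaded_salary=200_000)

# Managed vendor: a real subscription cost, but ~0.25 engineers on upkeep.
managed = annual_tco(license_cost=120_000, infra_cost=20_000,
                     maintainers=0.25, loaded_salary=200_000)

print(f"self-hosted: ${self_hosted:,.0f}/yr, managed: ${managed:,.0f}/yr")
# self-hosted: $360,000/yr, managed: $190,000/yr
```

Under these assumed numbers, the people term dominates both sides of the comparison, which is exactly the point being made.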
[00:16:36] Unknown:
I was just gonna add two more points to that. One, I think folks are starting to think about ROI a little bit differently, from what we're seeing. I totally agree with Anna on what the inputs to that are. Then on the question of what the output is, one of the most interesting trends we're seeing is that folks are actually looking at ROI by data product. One of the things that's been very effective is having a data product dashboard that allows you to evaluate the effectiveness and efficiency, as well as the reliability and quality, of a particular data product. I think that's a very interesting lens because it brings information from very different systems and very different perspectives into one view. That's actually fairly new for data teams. Typically, data teams will look at a particular table in Snowflake or a particular job or a particular pipeline. Actually starting with, hey, what's a data product that we're responsible for, and making sure that its freshness, accuracy, and reliability, as well as the value it's driving for the organization, are tight, is something that I think is new and interesting, and that more and more teams are doing. And the second thing I'm seeing is that more and more folks are looking for automated ways to measure ROI.
In the past, that was a lot of discussion in a, you know, maybe long and cumbersome process. Folks are looking for more automated ways to have that information surfaced. So I'm seeing more and more tools lean into that particular aspect and make it easier for data teams to answer that question themselves.
[00:18:06] Unknown:
And then on the other side of the equation, calculating the return, I'm wondering what are some of the useful heuristics that you've seen folks lean on to even think about that. Because the return could mean: because I knew that this thing was going to happen, or because I was aware of this stock shortage, I was able to get enough inventory to make sure that I could sell this much. Most of the ways that data is being used are for predictive reasons, so you're making a projection of how things might be versus how they do end up being, or you end up dealing with a lot of what-ifs. So I'm curious how you see folks trying to tackle that challenge of being able to say: okay, based on the information I was able to retrieve and the insights I was able to gain from this overall effort of the data team, this is how much money I think I made, different from what I would have made otherwise.
[00:18:58] Unknown:
It's a really interesting problem because, if you do this well and you're actually able to help the business anticipate problems, it becomes almost invisible. It's kind of like preventative health care versus health care that happens after you've actually diagnosed an issue. It's really hard to measure the value of preventative health care and the ROI on it until you actually have a bad outcome. So I don't know that there are silver bullets here necessarily, but what I have found to be really helpful is a combination of qualitative and quantitative information. One of the things that we sometimes forget as data folks is that qualitative data is also data, and gathering feedback, in particular from stakeholders, from customers, from areas of the business where you have had an impact, can say a lot about the value of what the data team is producing at any given moment. In particular, if you've made an influence on an area of the business, getting direct feedback from stakeholders about how you have worked with them and how you have contributed to that outcome, maybe growing sales or growing the total addressable market because you have identified an opportunity, or things like that. And there are sometimes also really direct things that you can measure, like saving costs.
In one of my experiences, in a past role, our team was building a data model to reflect pricing changes for a particular business. In the process of doing that, you know, we had a managed business and a self-service business, and those datasets were split: one was in Salesforce, one was in Stripe. You had to bring all those things together. And as you started doing that reconciliation, you started to realize that there were a lot of gaps in billing happening on the product side that weren't obvious, and it's literally like finding money. Hey, it turns out we haven't been billing this cohort of customers, and here's an extra $4,000,000 that you didn't know you could have. So sometimes it can actually be really direct, and it's important not to write that off as well.
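A minimal sketch of the kind of billing reconciliation Anna describes might look like the following. The Salesforce and Stripe extracts, column names, and amounts are all hypothetical.

```python
import pandas as pd

# Hypothetical extracts: contracted customers from Salesforce and
# actual charges from Stripe. Columns and values are illustrative.
salesforce = pd.DataFrame({
    "customer_id": ["a1", "a2", "a3", "a4"],
    "contracted_mrr": [500, 1200, 300, 800],
})
stripe = pd.DataFrame({
    "customer_id": ["a1", "a2", "a4"],
    "billed_mrr": [500, 1000, 800],
})

# Outer join so customers missing from either system still show up.
recon = salesforce.merge(stripe, on="customer_id", how="outer")

# Gap = what we contracted minus what we actually billed.
recon["gap"] = recon["contracted_mrr"].fillna(0) - recon["billed_mrr"].fillna(0)
unbilled = recon[recon["gap"] > 0]

print(unbilled)               # a2 is under-billed, a3 was never billed at all
print(unbilled["gap"].sum())  # the "found money" in Anna's example
```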
[00:21:10] Unknown:
And so once you have some insight into the overall ratio of return versus investment, what are some of the ways that data teams can use that to inform the areas of focus that they should be paying attention to, the types of work that they need to invest further into, some of the ways that they can try to generate more return for the business, and maybe even some negative incentives that can come out of this overall exercise?
[00:21:37] Unknown:
Yeah. So I think there are maybe a couple of different ways to think about this. The first, which is the holy grail, which I think Anna spoke about, is: this particular activity that I did drove X millions of dollars of improvement to the business. And, you know, we all celebrate and sing Kumbaya, and we're so happy because we made this huge impact. And, by the way, maybe we also made a huge impact on the world, improved social outcomes; health care in particular is a good example, where data can literally drive an impact on people's lives. I think that's the ultimate holy grail that we're after. Right? That is very hard to measure. Some teams do that and can put a number to it, but it's incredibly hard to do. I think there are some tactical steps that teams take toward that, though. So I'll give you a very specific example from the Monte Carlo world. Many data teams that we work with, before we started working with them, would spend weeks or months attending to data fire drills. That would be their lives. They would wake up, and they would spend most of their time, maybe 40 to 80% of it, trying to figure out: why is that person yelling at me, on Slack or just verbally?
What went wrong here? Why is my CMO so upset that the marketing dashboard is not working? Where is the problem? Starting to peel back the layers of the onion: where exactly is the issue, going upstream, etcetera. That will literally take a ton of time, during which the business is losing money, both because there's wrong data out there that your customers are using for their ad campaigns, and because your team could be doing a lot of other things instead. Many teams reduce that and actually look at three specific metrics. The first is status update rate: how many incidents did they respond to, making sure that they are on top of all of them. Two, reducing the time to resolution from weeks or months to a couple of hours. And three, reducing the time to fix from, you know, months to literally five to six hours. That is big impact on the business. That means multiple people who can focus their time elsewhere, improving efficiencies for their organization overall.
So that, I think, is a very tactical way that you could measure a particular impact that the data team has, because they can take all of that time and go look at a new segment of the business that you wanna look at, or a new experimentation platform that they're considering, or many other things that could actually add value to the business. So even within the path to ROI, there's a maturity curve, if you will, to actually get to where you wanna go, but there are multiple steps along the way.
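As a rough illustration, the three tactical metrics Barr names could be computed from an incident log along these lines. The schema and timestamps are hypothetical, and time to fix would be computed the same way as time to resolution, just from a fix timestamp.

```python
import pandas as pd

# Hypothetical incident log for a data team; columns are illustrative.
incidents = pd.DataFrame({
    "detected_at":  pd.to_datetime(["2023-06-01 09:00", "2023-06-03 14:00"]),
    "resolved_at":  pd.to_datetime(["2023-06-01 12:30", "2023-06-04 08:00"]),
    "status_posted": [True, False],  # did the team acknowledge/update it?
})

# 1. Status update rate: share of incidents the team responded to.
status_update_rate = incidents["status_posted"].mean()

# 2. Time to resolution in hours (the "weeks or months -> hours" goal).
ttr_hours = (incidents["resolved_at"]
             - incidents["detected_at"]).dt.total_seconds() / 3600

print(f"status update rate: {status_update_rate:.0%}")
print(f"median time to resolution: {ttr_hours.median():.1f}h")
```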
[00:24:22] Unknown:
That's such a great overview, Barr. The only thing I might add to that is a framework that I have in my head for how I reason about different types of data investments, because it really varies based on the stage of company you're in and where your business is at this moment in time. If you are early on in your journey, if you're an early-stage startup, you're probably investing a lot more time in greenfield exploration, building product for the first time, and understanding: is this useful for users?
And you're spending a lot less time on tech debt. That reflects the activities that your data team should be prioritizing as well: your data team should be focusing more on those things. But as you grow as an organization, you start to need to make trade-offs between the amount of time you spend on technical debt, making sure that your systems work really well, and making sure that you have the visibility so that you're not spending all of that time firefighting, like you've just mentioned, Barr, because that is a really huge drag on data team productivity. It is probably the biggest thing holding a lot of data teams back today, from my experience.
And making sure that you are reasoning about that investment in a way that allows you to continue to prioritize some of the most important things in the business. It's really helpful to be aware of where the business is at a moment in time, to understand how much of that you need to spend your time on versus pushing things forward and helping the business move as well. Yeah. And just to build on that, probably the strongest framework that I've seen
[00:25:57] Unknown:
is actually something that we worked on with the chief data officer of Poshmark a few years ago. It's probably the most robust framework that I've seen, both for being able to answer what's going well, and also, to your question, Tobias, where we should focus. Basically, the idea with this framework is that it's a two-by-two. On one axis, it lays out the different roles of the data team. And to your point earlier, Anna, in some instances that could be driving new growth initiatives.
For some later-stage, more mature companies, it could be actually improving and optimizing the operations of the company. For some organizations, it could be driving scale initiatives in particular; that's also seen more at later stages. And in many organizations, it's actually looking into totally new capabilities. So for example, for a marketing team, it could be a totally new multi-touch attribution model that the marketing team has never used before. Right? So you actually categorize all the different things that data teams can be doing into maybe three to four buckets of that type, and then cross that against all the different teams that they could be working with. I'm using marketing because I think it's the most tangible example.
But, you know, data teams can drive similar growth, optimization, scale, and new initiatives across more than just marketing. It could be across customer success and support. It can be across people and HR; I think a new area of innovation is how to use data to drive employee retention and employee success. It could be across product development. It can be across engineering and infrastructure scaling, for example figuring out how to optimize your engineering operations and your cloud infrastructure cost, which is something that a ton of data teams are doing right now: figuring out how to scale without spending as much on infrastructure. Right? And so you actually start mapping (I'm not sure if folks can see this, but I'm showing with my hands): on one side it's the different teams, you cross that against the different activities, and then you actually start to plot numbers against that. So you could be really simple and say: okay, zero to ten, we're really effective at driving new initiatives for the marketing team, but you know what, we really suck at doing that for engineering. On the other hand, we're really great at supporting scaling initiatives for the engineering team. And, actually, you can even imagine a heat map, and that helps you plot: okay, what are the things that we've done to date, and what are the areas that we need to focus on? And I'm actually not sure that the outcome should be green everywhere. I'm not sure that's the best strategy. I would actually encourage focusing on where you think you could add the most value in the shortest amount of time as a data team, because I think more than ever, data teams are expected to deliver these days. And so if it's gonna take you five years to deliver some outcome for a particular team versus a bigger or similar outcome in maybe half or a third of that time, I think that should go into consideration.
But actually having that view of the world is one of the most powerful things that I've seen data teams do, because it's able to communicate both how we're helping the organization and also where we should focus and why.
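A minimal sketch of the grid Barr describes, with the data team's activity buckets on one axis, the teams it serves on the other, and hypothetical zero-to-ten scores filled in:

```python
import pandas as pd

# Hypothetical 0-10 effectiveness scores: rows are activity buckets,
# columns are the internal teams the data team serves.
scores = pd.DataFrame(
    {
        "marketing":   [9, 6, 4, 7],
        "engineering": [2, 5, 8, 3],
        "people/HR":   [3, 4, 2, 6],
    },
    index=["new growth", "optimization", "scale", "new capabilities"],
)

# The "heat map" view: where are we strong, and where should we focus next?
print(scores)

# Lowest-scoring cells are candidate focus areas -- weighted, per Barr's
# caveat, by how quickly the team could plausibly deliver value there.
print(scores.stack().nsmallest(3))
```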
[00:29:10] Unknown:
I'm nodding very aggressively, for folks who are not going to be able to see this; I've been nodding very aggressively the entire time that Barr has been talking, because I completely agree. I think that's a really important way of looking at it, and I've actually constructed versions of those heat maps before to think about coverage and data maturity: do you have good data models in place, and when can you let the business work with some of the data directly? This is also where self-service becomes a really interesting piece of the puzzle, because you can reason about investment in terms of: these are areas of the business where things are extremely raw, and there's a lot we can do to clean them up and deliver value really quickly; and these are much more mature areas of the business where we've built out a lot of reporting and models and things. There, we can focus on enablement. We can focus on giving our stakeholders more tools to be able to do some of the work themselves so that they don't have to go through the data team. And letting go of some of that work is also tremendous ROI, because it shortens the time to insight in many ways.
Barr, you've mentioned so many different examples of really useful activities that a data team does. There's one that isn't always at the forefront, but I think it can be incredibly impactful, and that is the way that data teams participate in processes around IPOs and in public companies: making sure that the data you're putting out is accurate, is reliable, is consistent with what you've been reporting in the past, and also setting you up for audits. This is when you start to work more with your security team and your legal team as a data team, and it's an entirely new world of interesting problems to solve. It can be really, really exciting and really different for folks who wanna develop their careers.
[00:31:13] Unknown:
One hundred percent. I would say the stakes are way higher when that happens, for everyone involved. You know, the number of times that someone has mentioned to me, oh, we almost reported the wrong numbers to Wall Street, and we found out about that, like, 24 hours before, is probably way higher than I'd like to admit. And you're right, a lot of that has to do with data teams preparing for years, actually, to make sure that the data we're providing to the Street and to stakeholders is consistent, reliable, and accurate. It's a huge deal. Exactly. And this is why, once again, it becomes really important to start reasoning about the investment
[00:31:50] Unknown:
in uptime, reliability, and tech debt as you get more and more mature as a business, because that sets you up to be able to do these sorts of things much more easily when the business gets to that stage.
[00:32:02] Unknown:
Another interesting aspect of collecting and computing this information: I'm wondering if you have seen it provide the opportunity to identify and understand when your team is getting bogged down with issues around tech debt, and to identify areas of optimization, as in: this dataset is completely useless, nobody cares about it, and it's taking all of our time; or the workflow around this process takes way more time than it needs to, and we really need to invest in cleaning it up so that we can generate more value from it; situations like that. So, you know, I'll give you an answer from our space just to make it a little bit more tangible. But,
[00:32:42] Unknown:
you know, at Monte Carlo we think a lot about data observability and data reliability, and that means making sure that the customers of your data team, whether those be internal or external users, actually have a strong experience, meaning they have highly reliable, accurate, and trusted data. And we think there are three different reasons why that could be at risk. The first is when there are changes to your data. That's the classic data quality notion. That can be a schema change, which is basically a change in the structure of the data. A lot of things can go wrong with the data itself that could impact the quality and trust of the data that you're delivering.
The second is that there could be changes to code that impact that; in particular, someone could make a change somewhere that affects the data downstream. Looking at code changes, reviewing them, and looking at query logs can help build a view of what that looks like before it impacts your downstream consumers. And then the third area, which is actually interesting, is infrastructure changes. This relates, by the way, to your question: oftentimes there are queries that are taking too long, or very heavy queries.
Those not only contribute to a bad experience, because your users may not be looking at the most up-to-date data, or they might be looking at plain wrong data. It's also probably an area where you want to rethink how you're building your pipeline, because potentially it's done in a very inefficient or ineffective way. So those give us clues for how we should be building stronger, more reliable pipelines, both for the purpose of delivering a reliable product and because folks are thinking about how to build pipelines that are reliable and trusted while sticking within the frame of the budget that we need to be accountable for as data teams. And so we see, more and more in the area that we're working on, that data teams are looking at that as well and asking themselves: can we build this pipeline in a more thoughtful way? Can we run this query differently, or structure it for different purposes?
And I think in particular, when folks are thinking about building with AI, the question of cost and structure becomes even more important, just given the work that's done there. So, 100%, Tobias, I think that question is spot on, and we are seeing folks doing that. I think there's an additional dimension to this from the point of view of data developers themselves, which is the cost of poor developer experience
[00:35:13] Unknown:
for a data team. This is something that I don't know that I've seen a lot of data teams start to measure, but there are a lot of really good patterns, again from software engineering, that can tell us a lot about how to think about this and how to reason about it. Things like: how long does it take to merge a PR after someone has opened it? How many revisions does it take before code goes into production? How fast can a data team recover from an issue in their data pipeline? Those things can really easily stack up. I've had experiences earlier in my career where things would take weeks that should probably take days because of the friction in the developer experience.
And those are nontrivial costs that add up really quickly over time. So one of the things that I always encourage folks to do is to think about measuring developer velocity in some way. And there are really good frameworks out there, again from software engineering, that can be really useful.
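As a sketch of what measuring developer velocity might look like in practice, here is a minimal example computing time-to-merge and revision counts from a hypothetical pull-request history. The schema is illustrative, not any specific tool's API.

```python
import pandas as pd

# Hypothetical pull-request history for the data team's repo.
prs = pd.DataFrame({
    "opened_at": pd.to_datetime(["2023-06-01", "2023-06-02", "2023-06-05"]),
    "merged_at": pd.to_datetime(["2023-06-02", "2023-06-09", "2023-06-06"]),
    "revisions": [1, 6, 2],  # review rounds before merge
})

# Time-to-merge: the "weeks that should probably take days" signal.
prs["days_to_merge"] = (prs["merged_at"] - prs["opened_at"]).dt.days

print(f"median days to merge: {prs['days_to_merge'].median()}")
print(f"median revisions:     {prs['revisions'].median()}")
```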
[00:36:24] Unknown:
In terms of making this an easy process for teams, I'm wondering what are some of the methods or utilities that you have seen be most useful for organizations who want to start collecting and calculating this information, and some of the useful visualizations of it. Are there off-the-shelf systems where you can just say, I wanna know what my ROI is, and it will sift through all of the information? Or is this always going to be a bespoke situation where, based on your specific organizational requirements and operations, every team is going to have to go through this exercise on their own? It's a really good question.
[00:37:09] Unknown:
I can tell you how I think about doing this in teams that I've managed, and I'm really curious what Barr sees more broadly across your customer base. I tend to think about starting with the real basics. What is the easiest thing that you can do to start measuring and generating data? Because a lot of times, what you're doing is generating data that didn't exist before in some form. Can you start recording how much time you're spending on managing and keeping your data pipelines up to date, relative to the amount of time that you are able to spend on analytics and insights work and actual business-generating activities?
And can you track progress on that over time, for example? It helps to start really simple and really easy, because one of the things you don't wanna do when you're first rolling something out is spend several months implementing a new process and trying to wire up a new tool before you can get to that insight. It's really helpful to calibrate and get a sense of where some of the biggest challenges are, build a mental map of where things are going really well and where the areas of focus are, and then develop more rigorous systems that can help you track that over time as you get more sophisticated about the process. So, for example, maybe you start out with just people logging how much time they spend on uptime and other keep-the-lights-on activities.
And, eventually, maybe you hook up a tool like Datadog and start building dashboards around that, showing things like platform uptime. But I wouldn't start there, because it may not be the thing that is most important. So I really like, Barr, and I keep coming back to, your analogy of that heat map. Build the heat map of the things that you know are going really well and the areas of investment, and then think about
[00:39:12] Unknown:
going deeper. Yeah. I would say, and this goes back to your earlier question, it's tied somewhat to the unintended consequences of measuring ROI: I think as people, we tend to want to do great on the specific thing that we measure. And so if we're measuring something, we might go to great lengths to accomplish that thing even if it's the wrong thing to accomplish. So if we decided to really prove ROI in a particular area, to improve a particular metric, we have to remember that that metric might have been the wrong one to choose, and potentially, along the way, we might need to course correct. So I would just say, whatever it is that we're measuring, remember what the goal is.
Again, this sounds obvious, but it's so easy to forget. You know, on top of the hard metrics, if you will, for application uptime and data uptime and reliability and all that stuff, there's also another metric that data teams use, which is actually team performance and team motivation.
[00:40:15] Unknown:
And so... Yes.
[00:40:16] Unknown:
Yeah. It's so common for data teams to measure the NPS of both their data teams and their data consumers. People love this. Now, obviously, it's the most subjective metric on the planet: how happy are people with your services and with your teams? But it's a very powerful metric, and people love it. For example, JetBlue, which is one of the biggest users of both dbt and Monte Carlo, actually presented at the Snowflake conference a few weeks ago. Thanks to a lot of the work that they've done on observability and performance and operations, they were able to increase their users' NPS by 16 points, which is a huge achievement, incredibly remarkable for the team. And that is a lot of how they measure their success and their ROI. So sometimes, looking at some of those other aspects can help complement the picture of what you're looking for. It can give you additional color on whether you're actually helping people do their jobs better at the end of the day.
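For reference, NPS itself is simple to compute: the percentage of promoters (scores of 9 or 10) minus the percentage of detractors (scores of 0 through 6). A minimal sketch with hypothetical survey responses:

```python
# Minimal NPS calculation for a data-consumer survey.
# Standard definition: % promoters (9-10) minus % detractors (0-6).
responses = [10, 9, 8, 7, 9, 4, 10, 6, 9, 3]  # hypothetical 0-10 scores

promoters = sum(1 for r in responses if r >= 9)
detractors = sum(1 for r in responses if r <= 6)
nps = 100 * (promoters - detractors) / len(responses)

print(f"NPS: {nps:+.0f}")  # track this over time, as in JetBlue's +16
```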
[00:41:14] Unknown:
Yeah. Absolutely. That's such a good point, Barr, about making sure that you're focusing on the right things and not just the things that are easy to measure and optimize. For example, an easy thing to measure and optimize is query cost. And that might be something that feels good to optimize in the moment because, hey, we're saving money because we have made this thing run much more quickly. But how much effort went into optimizing that query, and is this something that is being used actively in the business? Right? Asking those questions upfront and making sure that the things you're focused on are actually very closely aligned with what the business is trying to do.
[00:41:52] Unknown:
And that also brings up questions of predictive maintenance: I see that this query is costing this much and producing this much value. We definitely don't wanna delete it, but maybe we wanna optimize it. But what is the actual cost to do that optimization, versus the overall cost that the query is going to incur over a given period of time? And what is the point where it tips the balance to actually being worth that investment?
[00:42:17] Unknown:
Exactly. I don't like to speak in absolute terms, but I think this is one of those examples where it's very rarely an ROI that is worth the investment. If the query works, and if it's not delaying the insight that's being produced, then the effort of optimizing it, rather than just budgeting for it and focusing on something new, can actually take you away from being more proactive and helping the business think ahead.
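The break-even arithmetic behind Anna's point is straightforward. A minimal sketch, with every figure an illustrative assumption:

```python
# Break-even test for "should we optimize this query?", the trade-off
# Tobias and Anna are weighing. All numbers are illustrative assumptions.

eng_hours_to_optimize = 40      # effort to rewrite and test the query
eng_hourly_cost = 120           # loaded cost per engineering hour
monthly_query_cost = 300        # current warehouse spend on this query
expected_savings_pct = 0.60     # how much the rewrite would save

one_time_cost = eng_hours_to_optimize * eng_hourly_cost      # $4,800
monthly_savings = monthly_query_cost * expected_savings_pct  # $180/month

breakeven_months = one_time_cost / monthly_savings
print(f"break-even after {breakeven_months:.0f} months")  # ~27 months

# Per Anna: if the query works and isn't delaying insight, a multi-year
# break-even usually argues for budgeting for it and moving on.
```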
[00:42:53] Unknown:
And another interesting wrinkle in recent days is the massive uptick in generative AI and these natural text interfaces, which can be used for speeding up the development process as well as making some of the self-serve capabilities a little bit more natural and easier to understand, while also having the potential to give massively wrong results. I'm wondering what you see as the net benefit or net negative of generative AI in the context of data teams and organizational usage of data, and some of the unrealistic expectations that it might produce.
[00:43:36] Unknown:
This is me flexing my fingers as I get ready to tackle this. So one of the things that I think folks are most excited about when it comes to AI and data workloads is: oh, cool, this is going to make the simple questions really simple, and I can focus on the complex stuff. But the thing about large language models, the way that they're designed today, is that without perfect precision, you can't actually give an end user who isn't extremely comfortable with SQL, who isn't extremely comfortable with debugging, an 80% finished product, because that will lead to outcomes like incorrect analysis. That will lead to outcomes like folks needing to debug it and actually asking you more questions than if you were to do it yourself. So I think about this as really enhancing the data team's own experience and development. For example, can you automatically generate documentation for datasets that you're producing?
Can you build the scaffolding for a particular data model so that you don't have to type out everything, and then work on refining it, more so than passing off some of the easiest questions? I think there's a lot of opportunity in that space, and I've talked about this before in different forums, but I think what we're missing on the data side for this to really work very well is a semantic representation of what's happening in your data warehouse, in order for some of these models to become incredibly useful. Once you start bringing those pieces together, then you're gonna be able to do really great things, like enabling people to ask questions in a very user-friendly way and get answers from data that they can trust. And I know that there are lots of really great companies working on that today, so I'm excited about that future. But right now, today, the thing that I'm most excited about is the efficiency this can give data teams themselves in doing more in less time.
Barr, what do you think?
[00:45:47] Unknown:
Yeah. So, I've got to flex mine as well. Look, I think in general, generative AI is definitely what I would call the new "data driven," in quotations. I think almost every data leader right now is being asked by someone in the organization: hey, what are we doing on generative AI, and how quickly can you deliver that? It has become, for many, the priority. However, there are many questions that those data leaders need to answer before they're able to even get started. One: what is the customer problem that I'm looking to solve here? Whose life am I making easier? Which use case will actually add the most value?
What does that mean for our team's ability to deliver on that? What tools do we have to actually deliver on it? There are a lot of questions that folks have to answer before they're actually able to do that. Obviously, we're very excited about it as well, since a lot of what this means is that the accuracy and the quality of the data is more important. We're seeing that every single customer that's doubling down on generative AI is even more worried about the reliability and accuracy of the data, because the stakes are just higher, both for the data itself and for the data teams. You know, I think Frank Slootman, CEO of Snowflake, had a really interesting comment about how data needs to be reliable and trustworthy to actually realize the investment in LLMs. I think that's very well said, and I see our customers worried about that as well. In terms of the specific areas it will impact in the short term, there are two core areas that I see being disrupted. The first is basically data engineering productivity.
I believe that in the same way that engineering productivity has been improved with solutions like Copilot and others, data engineers' productivity will improve as well. The second area that I'm excited about is BI being disrupted in some way. I think the way in which we'll ask questions, get answers, and think about making data accessible to nontechnical users will change, and there are already advancements in that area that I'm excited about. So I think those are the most immediate areas that we're seeing impacted.
And I actually think the question of ROI is even more important in this context. If we're building a lot of things with generative AI but we can't tie that to business value, we'll be in big trouble later. So I do encourage organizations who are doubling down on that to make sure, first and foremost, that the outputs of what you're building are highly reliable so that you can use them. Like, the number one problem that folks have, and I don't know if you all have this, but I go into ChatGPT and ask for an itinerary for a trip that I wanna take this weekend, and it gives me suggestions for places that are closed, or that are totally wrong: hallucinations. There's a lot of output, and it completely ruins the experience. I'm like, okay, I can never actually use that to create an itinerary; I might as well go do it myself. So the reliability and the accuracy of that data is more important than ever. And the second thing is that I actually need to care about the question I'm asking. If I'm not even going on a trip, I don't need that itinerary at all. So we need to make sure that what we're building is actually answering a real question that's top of mind for our end users. And in your experience of
[00:49:10] Unknown:
working in the space, talking with the teams that you work with, and serving the customers of the products that you're building, what are some of the most interesting or innovative or unexpected ways that you've seen them thinking about this question of ROI, both in terms of how to calculate it and how to use it to drive the work that they do?
[00:49:29] Unknown:
I have one example that I think is relatively new, and this is something that dbt Labs does internally, actually, and that is the level to which we can optimize internal business processes with data and with some of our tooling. A really good example of this is business system sprawl, and this happens a lot in people teams and people processes. You have one tool for payroll. You have another tool for recruiting. You have yet another tool for permissions, and you have networks of integrations that all need to talk to one another in order for those tools to effectively pass information along.
And each tool is a source of truth for that particular thing. So you can end up in situations where you have information that isn't the same, and there's a lot of manual effort that goes into reconciling that for your business functions, for your people functions, for your security and IT functions. If you think about a source of truth differently, if you think about maybe passing some of that information through the data warehouse in secure ways, all of a sudden you can have different tools relying on the same grounding and the same source of truth, to be able to sanity check some of that work. So one of the things that we're seeing, and that we're talking a lot more to customers about, is how to start leveraging your existing data workflows for more of these new and interesting use cases, and I can really see the opportunity in terms of efficiency.
An example that I really love: we've been able to shave down the amount of time it takes to close the books for the business, because we are starting to integrate all of these things in the data warehouse. It now takes four to five days, and, you know, the industry standard is a lot longer. That obviously didn't come for free, and it took a lot of effort, but it is going to continue to pay dividends for years and years.
[00:51:44] Unknown:
Barr? Yeah. I love those examples, and I'm not sure I have a ton to add. One specific example that comes to mind is The New York Times. It has had a very interesting journey from an ad business to a subscription model, and they were actually looking at their experimentation platform as a way to drive ROI. I've never seen that done before, and I think it's pretty interesting. That's an example tied more to the revenue side of the house, so it can have a very significant impact because it's focused on a particular revenue goal around subscriptions: being able to tie what exactly helps drive subscriptions, down to the level where you can ask, am I looking for users who read the newspaper less often but spend more time on it, or those who visit very often but don't spend much time on it? Being able to ask very targeted, specific questions about user behavior ultimately drives additional revenue. So I thought that was a different take on ROI that I appreciated.
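A hedged sketch of that kind of behavioral segmentation, using an invented clickstream table: split readers by how often they visit versus how long they stay, then study each segment's subscription behavior separately.

```python
# Hypothetical illustration of splitting readers into segments by visit
# frequency versus time spent per visit. A real analysis would run on the
# publisher's own clickstream tables and use finer cohorts.
import pandas as pd

events = pd.DataFrame({
    "user_id":         [1, 1, 2, 2, 2, 3],
    "session_minutes": [25.0, 30.0, 3.0, 2.5, 4.0, 12.0],
})

per_user = events.groupby("user_id").agg(
    visits=("session_minutes", "size"),
    avg_minutes=("session_minutes", "mean"),
)

# Median splits give four coarse segments; each segment could then be
# joined against subscription conversions to see which behavior pays off.
freq_cut = per_user["visits"].median()
time_cut = per_user["avg_minutes"].median()

per_user["segment"] = (
    per_user["visits"].ge(freq_cut).map({True: "frequent", False: "infrequent"})
    + "/"
    + per_user["avg_minutes"].ge(time_cut).map({True: "engaged", False: "skimming"})
)
print(per_user)
```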
[00:52:54] Unknown:
And in your experience of working on these problems and working with people who are trying to tackle this question of ROI, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:53:06] Unknown:
I'm not sure this is unexpected, but I would say there's something about just starting the journey. Oftentimes data teams say, oh, this is so hard, we're just not going to do it; we'll fly by the seat of our pants. And what I would say is that it is hard, and yet you can go a really long way by simply starting: asking yourselves, what's important? What do we care about? Am I aligned to what the business needs? Putting some framework and thought around that will get you a long way. So I would say it's expectedly hard, but there's unexpected value in getting started.
[00:53:44] Unknown:
I think that's a really good way of putting it. Maybe the only other thing I can add that is sometimes unexpected is that, as you start digging deeper, you might be surprised by the kind of feedback you get about the work you're doing. As an example, it's very common to see no progress in an aggregate NPS score from your stakeholders. But if you break it down by business function and align that with where you're actually investing time, you will see a really different story. Because one thing that is universally true about data teams is that there can always be more people doing more things. So you have to prioritize, and whenever you prioritize, some folks are going to get more of your time than others, and some folks are going to be happier with you than others. And that's okay,
as long as you're communicating outwardly, making sure you're focused on the most impactful things, and keeping abreast of the distribution of your investments across the business. And you work with folks to paint a picture of how you can help them, and how they can work with you on getting those investments, so that you can collaborate together.
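A toy illustration of that breakdown, with invented survey numbers: the same responses can produce an unremarkable aggregate NPS while individual business functions diverge sharply.

```python
# Hypothetical survey data showing how a single aggregate NPS hides
# per-function differences. NPS = % promoters (9-10) minus % detractors (0-6).
import pandas as pd

surveys = pd.DataFrame({
    "function": ["finance", "finance", "marketing", "marketing", "sales", "sales"],
    "score":    [9, 10, 3, 5, 8, 9],
})

def nps(scores: pd.Series) -> float:
    promoters = (scores >= 9).mean()
    detractors = (scores <= 6).mean()
    return round(100 * (promoters - detractors), 1)

print("aggregate NPS:", nps(surveys["score"]))          # ~16.7, looks flat
print(surveys.groupby("function")["score"].apply(nps))  # 100 / -100 / 50
```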
[00:55:04] Unknown:
I would just add to that. One thing that folks are surprised by time and time again is how critical storytelling and narrative are in this whole thing. And it's funny, because as data people we should know better. Right? You can take really any kind of data and tell an entirely different story based on what you want to accomplish and achieve. That was kind of an existential breakdown moment for me when I realized it. I thought, oh no, this is the same data, you're just telling a different story; how is that possible? But as you work more and more with data, you understand that's also the power of data, and there are different ways to look at it. So remember that, even when you're making the case or building an ROI story internally, the narrative and your ability to tell a story are very important, because at the end of the day it's people on the other side of the table that you're communicating with. Putting yourself in their shoes, having empathy for the decision they need to make, and thinking about how to align with that is really critical. There are lots of resources around storytelling for data teams, and I'm surprised time and time again by how important it is and how often we neglect that part.
[00:56:16] Unknown:
And for teams who are starting to think about investing in these ROI calculations, are there any situations where it's not the right choice, or where investing the time in figuring out that information is not going to produce any new value? And for people who have decided that, yes, they do want to invest in calculating ROI, what are some references you recommend they look to for figuring out how to start tackling that problem?
[00:56:46] Unknown:
It's an interesting question: when do you not measure this? I think there are periods in a data team's development and an organization's development where it's much more important to have clear expectations upfront rather than post hoc measurement of ROI, because those are two pieces that work hand in hand. Sometimes the most important thing is to be really clear about what you're doing with the limited amount of time that you have, and to stay in close contact. That awareness allows you to back into measuring ROI in the future, when the team grows and the organization grows. But having a clear understanding upfront of what you're working towards, and aligning on those expectations with your stakeholders and your management or executive layer, is probably the most important thing. And if you don't have that, then it doesn't really make sense to measure ROI just yet.
[00:57:49] Unknown:
Are there any other aspects of this question of ROI, or the processes around actually collecting and calculating it, or the ways that information is used to drive action, that we didn't discuss yet and that you would like to cover before we close out the show?
[00:58:03] Unknown:
No, I think maybe the only other thing to reemphasize is that, as data teams, what we do matters. Remembering that and leaning into it is most important. I can't think of an industry or an organization that's more important at this moment in history, and I'm just excited to be a part of this and looking forward to what's to come.
[00:58:25] Unknown:
Same. I think the only other thing I might add is a little bit of advice for other data leaders. This is a time when you're going to get pushed a lot on cost, ROI, and optimization. And it's also the time of the biggest opportunity for you as a data team to have impact. So balancing those things carefully is probably my advice.
[00:58:53] Unknown:
Well, for anybody who wants to get in touch with either of you and follow along with the work that you're up to, I'll have you each add your preferred contact information to the show notes. And as the final question, I'd like to get your perspectives on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:59:10] Unknown:
I don't know that this is a gap in any particular tool or layer, but the thing that I'm looking forward to as the data space continues to mature is interoperability and better developer experience. That's something we're starting to think about, but there are lots more places to go, both in terms of interoperability between tools and in terms of the relative novelty of some of the solutions folks are using compared to what has existed in the software engineering space for a really long time. So we have to remember that we're still early in this process and early in this journey.
[00:59:57] Unknown:
I would agree, and I'm biased, but I obviously think data trust is the number one problem that folks have. No matter what you're building, in what realm, in what industry, if the data is shit, excuse my language, then I'm not sure what we're doing here. So I continue to see that as most important. Making data and AI more reliable, which covers CI/CD, testing, observability, the whole suite of things that engineering teams have had for a long time, is, I think, critical for data looking forward.
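As a small, hypothetical illustration of the kind of automated check that suite implies, here is a sketch of a data test that could run in CI/CD before a table ships downstream. The table name, columns, and checks are invented for illustration.

```python
# Hypothetical data tests that could gate a pipeline in CI/CD.
# An in-memory SQLite database stands in for a real warehouse connection.
import sqlite3

def check_table(conn, table, key):
    failures = []
    cur = conn.cursor()

    # Volume check: an empty table usually means an upstream load failed.
    rows = cur.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    if rows == 0:
        failures.append(f"{table} is empty")

    # Completeness check: the key column should never be NULL.
    nulls = cur.execute(
        f"SELECT COUNT(*) FROM {table} WHERE {key} IS NULL"
    ).fetchone()[0]
    if nulls > 0:
        failures.append(f"{table}.{key} has {nulls} NULL values")

    return failures

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")
conn.execute("INSERT INTO orders VALUES (1, 9.99), (NULL, 5.00)")
print(check_table(conn, "orders", "order_id") or "all checks passed")
```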
[01:00:29] Unknown:
Well, thank you both very much for taking the time today to join me and share your thoughts on this important question of measuring the ROI of data work. It's definitely an interesting and complex problem space, and I appreciate you both helping to bring some sense of order and understanding to it. So thank you again for taking the time today, and I hope you enjoy the rest of your days.
[01:00:54] Unknown:
Thank you for having me. Thanks, Tobias.
[01:00:57] Unknown:
Thanks, Anna. It was fun to be on a panel with you.
[01:01:00] Unknown:
It was really fun to be on a panel with you too, Barr. Thanks for the invitation.
[01:01:04] Unknown:
Yeah, for sure. Thank you for listening. Don't forget to check out our other shows: Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used, and The Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it: email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Guest Introductions
Barr Moses' Journey into Data
Anna Filippova's Journey into Data
Understanding ROI for Data Teams
Tracking Investment and Return
Heuristics for Measuring Return
Using ROI to Guide Data Team Focus
Frameworks for Data Team Investments
Optimizing Data Pipelines and Developer Experience
Impact of Generative AI on Data Teams
Innovative Approaches to Measuring ROI
Lessons Learned in Measuring ROI
When Not to Measure ROI
Biggest Gaps in Data Management Tooling
Closing Remarks