Summary
All of the fancy data platform tools and shiny dashboards that you use are pointless if the consumers of your analysis don’t have trust in the answers. Stemma helps you establish and maintain that trust by giving visibility into who is using what data, annotating the reports with useful context, and understanding who is responsible for keeping it up to date. In this episode Mark Grover explains what he is building at Stemma, how it expands on the success of the Amundsen project, and why trust is the most important asset for data teams.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
- We’ve all been asked to help with an ad-hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. Go to dataengineeringpodcast.com/census today to get a free 14-day trial.
- Your host is Tobias Macey and today I’m interviewing Mark Grover about his work at Stemma to bring the Amundsen project to a wider audience and increase trust in their data.
Interview
- Introduction
- Can you describe what Stemma is and the story behind it?
- Can you give me more context into how and why Stemma fits into the current data engineering world? Among the popular tools of today for data warehousing and other products that stitch data together – what is Stemma’s place? Where does it fit into the workflow?
- How has the explosion in options for data cataloging and discovery influenced your thinking on the necessary feature set for that class of tools? How do you compare to your competitors?
- With how long we have been using data and building systems to analyze it, why do you think that trust in the results is still such a momentous problem?
- Tell me more about Stemma and how it compares to Amundsen?
- Can you tell me more about the impact of Stemma/Amundsen to companies that use it?
- What are the opportunities for innovating on top of Stemma to help organizations streamline communication between data producers and consumers?
- Beyond the technological capabilities of a data platform, the bigger question is usually the social/organizational patterns around data. How have the "best practices" around the people side of data changed in the recent past?
- What are the points of friction that you continue to see?
- A majority of conversations around data catalogs and discovery are focused on analytical usage. How can these platforms be used in ML and AI workloads?
- How has the data engineering world changed since you left Lyft/since we last spoke? How do you see it evolving in the future?
- Imagine 5 years down the line and let’s say Stemma is a household name. How have data analysts’ lives improved? Data engineers? Data scientists?
- What are the most interesting, innovative, or unexpected ways that you have seen Stemma used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Stemma?
- When is Stemma the wrong choice?
- What do you have planned for the future of Stemma?
Contact Info
- @mark_grover on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- Stemma
- Amundsen
- CSAT == Customer Satisfaction
- Data Mesh
- Feast open source feature store
- Supergrain
- Transform
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
RudderStack is the smart customer data pipeline. Easily build pipelines connecting your whole customer data stack, then make them smarter by ingesting and activating enriched data from your warehouse, enabling identity stitching and advanced use cases like lead scoring and in-app personalization. Start building a smarter customer data pipeline today. Sign up for free at dataengineeringpodcast.com/rudder. Your host is Tobias Macey. And today, I'm interviewing Mark Grover about his work at Stemma to bring the Amundsen project to a wider audience and increase trust in their data. So, Mark, can you start by introducing yourself? For sure. Thank you for having me, Tobias. It's great to be back. I am
[00:01:36] Unknown:
Mark Grover. I am the cofounder and CEO of Stemma. And Stemma is a managed data discovery and metadata platform. I also created the leading open source data discovery and metadata platform called Amundsen at Lyft. And when I did that, I was a PM at Lyft. And prior to that, I worked at Cloudera as a software developer, and I worked on Hive and Spark and Apache Bigtop. Super excited to be here today.
[00:02:05] Unknown:
And you've been on the show, it's been a couple of years ago now, talking about the Amundsen project around the time that it was open sourced. And so now you've built the Stemma business to help continue the work that you've done on Amundsen. I'm wondering if you can just give a bit more of an overview about what it is that you're building at Stemma and some of the story behind how you decided to create a company around the Amundsen project, and what it is about this overall space that has inspired you to spend so much time and energy on continuing to bring it forward.
[00:02:38] Unknown:
Absolutely. So the story starts at Lyft. And when I got to Lyft, Lyft was in this heavy growth phase where they were doubling every year, including the number of data users, and Lyft had a very data-driven culture. The problem at Lyft wasn't that we didn't have the right ingestion streams to bring data in, that we didn't have the right data in the warehouse, or that we didn't have a warehouse. You know? The problem wasn't that we didn't have the tools in order to consume data. Like, we had Airflow set up for doing derived data processing and generating, you know, derived data from raw data.
We had an internal streaming platform that was bringing in events from the Lyft web application, Lyft services, and, most importantly, the Lyft mobile applications. We had a warehouse that was built off of Presto, an ETL engine that was running Hive and Spark. We had tools to consume analytical data, Tableau, Looker, Mode, Superset, and then we had some internal tools for consuming data for ML purposes. So the problem wasn't that we didn't have the tools to create data or use data. The big problem at Lyft was we had so much data that no one had any idea what data existed, where it was, is it trustworthy, can I use it for my use case, and how do I use it? Right? These were all the questions that were bogging down data analysts and data scientists as well as data engineers.
And I remember this key moment when I first got there, talking to a data scientist, and they were trying to optimize ETAs, the time a driver takes to pick you up when you order a Lyft ride. And if you've taken a Lyft ride, you know, you open the app. It tells you, like, hey, Tobias, your ride is 2 minutes away. And then you go through this request funnel and that ETA number changes. Unfortunately, it goes up sometimes. Right? So we measure, like, ETA 5 times in a session, all the way from the first time you open the app to the actual ETA when the driver shows up, and this data scientist was looking for the actual ETAs so they could compare the new predictions with the past actual ETAs.
And to make things worse, we often had these models that were running in shadow mode. Right? So, like, there are 2 models running when you open the app. One is showing you the ETA results, but the other one is just, like, running and logging some results, but it's never shown to the users. So the long story short is, like, we had a bunch of past models that had persisted data in the data warehouse. We had shadow models where we didn't quite know which one was in production and which one was not being shown to the users. So what happens is your warehouse now has, I don't know, 200-odd columns that have something to do with ETA, and you're like, well, which one of these is the source of truth? Right?
The canonical way to solve this problem in the past has been, oh, well, Alice works on the ETA team, so I'm gonna give, like, Alice the responsibility to, like, tag the right column as the source of truth for ETA. And the problem with that is, like, in a fast-growing organization, a, Alice has real responsibilities as a software developer, data scientist, or data engineer, and b, things are changing so often that you can't keep that source of truth always up to date. Right? And so these past curated approaches to data curation were not gonna work out at Lyft, and that led me to look around at various different solutions, commercial, open source, and internal at other companies, around an automated data catalog. Right? Something that can use information about how this dataset is used by others, by processes, by systems, how often it is generated, are those people on my same team, how many dashboards are built on top of it, all this information to determine what could be useful for you. Right? So we'll never be able to tell you, like, oh my god, I have a 100% guarantee this is the right thing for you, and there are still caveats to doing that. But maybe for the 80% case, the 90% case, we can get there and say, like, hey, 90% of the people are using this dataset, and here's all the ways they're using this particular dataset. And then you can determine, like, yeah, maybe that's good enough for me. Right? And that led me to create Amundsen, which is an open source automated data catalog. I will also say I use the terms data catalog, data discovery tool, and metadata engine interchangeably, and I can talk a little more, separately, about how these terms get used in the industry, why they are different, and why I use them interchangeably.
But Amundsen is a data catalog. I created that at Lyft. It was super successful there. It is, to date, the highest CSAT-scoring data product at Lyft. It has 750 weekly active users; 80% of the data engineers, data scientists, and data analysts use it every week at Lyft. And then we open sourced this product. There's 40 companies that use this product in the open. There's Instacart, Brex, Asana, Square, Workday, ING, and many more. Right? And the answer to, like, the story behind Stemma is twofold. Right? One is that I believe that this trust-in-data problem is one of the key problems to solve in the data community. And if we have to solve this problem in the larger number of enterprises, simply having an open source solution and having them deploy it isn't gonna be effective for them.
And that was one of the primary reasons I started Stemma: to solve this problem for the larger market. The second one is, as we started solving this problem, what happened at Lyft was we started to solve this problem from a trust angle for the data scientists and data analysts, the consumers of data, and then realized that there are producers of data, data engineers, that also need help, actually in slightly different ways, but the same product can provide them the help. And that help is usually around, like, migrations and who's using my data and debugging a failure that happened in my job. We'll talk more about that today too. So that's the second persona, data engineers. And the third use cases are around the company. CCPA rolled in in 2020, and in late 2019, Lyft did a whole lot of work to understand what data was out there, where the personal information was stored, how we were gonna, like, classify it, and how we were gonna handle it when somebody requests deletion.
And the same metadata was very useful for enabling those privacy and compliance needs. And so as I look forward at Stemma, I see two areas of work. One is enabling and solving this problem of trust in data for various different personas, starting with data analysts and data scientists, then data engineers, and thirdly, business users; and secondarily, solving this problem of data privacy and classification, in this more and more heavily regulated data space, for the larger enterprises.
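The usage-based ranking Mark describes earlier in this answer, surfacing the dataset most people actually rely on instead of asking an owner to curate a source-of-truth tag, can be sketched in a few lines. Everything here is hypothetical: the table names, the signals, and the naive distinct-users-plus-dashboards score are illustrative only, not how Amundsen or Stemma actually weight metadata.

```python
from collections import Counter

# Hypothetical usage records pulled from query logs: (user, table) pairs.
query_log = [
    ("alice", "rides.eta_actuals"),
    ("bob", "rides.eta_actuals"),
    ("carol", "rides.eta_actuals"),
    ("dave", "rides.eta_shadow_v2"),
]

# Hypothetical count of dashboards built on each table.
dashboards = {"rides.eta_actuals": 7, "rides.eta_shadow_v2": 0}

def rank_tables(log, dashboards):
    """Score each table by distinct users plus dashboards built on it,
    and return table names ordered from most to least used."""
    users = {}
    for user, table in log:
        users.setdefault(table, set()).add(user)
    scores = {
        table: len(u) + dashboards.get(table, 0)
        for table, u in users.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

print(rank_tables(query_log, dashboards))
# The widely used ETA table ranks first; the shadow-model table ranks last.
```

That is the 80–90% answer from the interview: no guarantee it is the right table, but the one most of the organization is already using floats to the top.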
[00:10:02] Unknown:
In terms of the sort of overall landscape of data workflows and the tools that are available to data engineers and the problems that they're being faced with and the growth in sort of data discovery and data catalogs as a category of tools. Can you just give your perspective on what you see as being Stemma's place and position in this overall ecosystem and some of the ways that it fits into the workflow for data engineers and data producers and the data consumers that they're supporting?
[00:10:36] Unknown:
So I think it's great to dig into some of the key data engineering workflows. Right? So the first key data engineering workflow is creating a new dataset. Right? And that comes from when you are working with a data analyst or a data scientist and you are maybe instrumenting a new metric. So to make a concrete example, say, like, you're a business like Lyft and you're launching, like, bikes and scooters in a new city. Right? So what you need to do is you need to have a dataset for bikes and scooters, and the way this works is that you gotta make sure that the scooters are instrumented, and there's a product engineering team, often, who's, like, actually instrumenting the scooters and then sending events to the data warehouse for that. Right? And then you have to understand, like, what events are being logged, what do they mean, when do these events get triggered.
You may have questions for the product engineer who worked on this. You may actually ask the data scientist a bunch of questions because, sometimes, they know more about the domain and what an event may mean or what a particular column in an event may mean. So there's a question of, like, what data do I use? Do I understand this? A question of, like, if you have multiple events, a question around, like, trusting that event. Then there's the real hard work of, like, actually building a pipeline. So this may involve, like, writing some code, which could be DAG code, could be, like, a dbt-style parametrized SQL script, and then there's work around orchestrating this in something like Airflow or Prefect or dbt Cloud, and then, like, sharing it with your stakeholders, both on the product side and the data side, and getting feedback on it. So, like, that's one of the key workflows.
The second key workflow, and I will say that there's another parallel workflow here of, like, data that gets replicated from the production databases. It's pretty similar, but there, you are talking to somebody who's creating, like, a Dynamo table or a MySQL or Postgres table in the app database, and that's getting replicated over, and sometimes you're even responsible for configuring the replication system and making sure it's up to date. The second workflow is, let's say bikes and scooters have been launched for a year. Right? Now you have to maintain this dataset because, you know, life evolves. So some changes happen, like somebody deprecated a column.
If you're lucky, they told you that they were going to deprecate a column in the upstream store. If you're unlucky, which is mostly the case, you find out when someone wakes you up. Either a pager, and that's also the lucky part, but most likely somebody else, like a data scientist or, you know, some exec, saw some dashboard and was like, oh my god, this dashboard looks off. I'm pretty sure, like, XYZ metric wasn't, like, 10% of what it was last week. So I'm certain there's something wrong in the data, and then you get into a fire drill. Right? So this fire drill involves two things. It involves you debugging what the hell is going on, and then the second thing involves you figuring out how to fix it. Sometimes the fix is in your hands; sometimes the fix is in somebody else's hands. So these are the two main workflows. And to talk about the question that you asked, where does Stemma fit into each of these workflows? I'll answer that now. So in the first workflow, where you're creating or instrumenting a new dataset or a new metric, Stemma fits in the first part, where you are understanding what events are out there related to bikes and scooters.
What do they mean? How often do they get triggered? Are they still up to date? What do column metrics look like for every single column in that particular event? Who else is using that? Has someone else already built an ETL in this area? Has someone else, like, built a dashboard in this area that I can learn from? Those are the kinds of questions around context, understanding, and trusting that get answered in that first workflow of creating a new table, a new event, or instrumenting a new metric. On the second side, which is, okay, this is data maintenance. Right? The tables exist, the pipelines exist, and you have either been notified by a human that something is wrong, or something will be wrong if you don't change something, or you have been notified by a pager that your job failed. And this requires two things. This requires you looking upstream and seeing, okay, I understand that this particular table is off, and usually a particular column in that table is off, and I need to look up to know: has anything awkward happened in the things that this stems from? Right? And so you're looking at what tables or fields this comes from, have there been any issues in those tables or fields, so on and so forth. That's one. Once you have figured that out, maybe you have to make a change to your existing pipeline to accommodate that. An example is, like, oh, we only supported, like, two kinds of OSes in the past. Right? iOS and Android. And let's say a new hypothetical OS launches, and we have to support that. You were only expecting two values in a column, but now there are three values in a column, and you gotta update your ETL to accommodate that. Right? Now you gotta notify all your downstream people that that's a change that's happened, and when you have to do that, you have to do three things. Okay, who are the people who ad hoc query this data? I will notify them. Who are the people who have built dashboards on this stuff?
I would need to notify them. And who are the people who have built further derived ETL on this stuff? I would notify them too. So the place where Stemma fits in here is to give you insight into the upstream and the downstream information so you can, a, use it for debugging and, b, use it for notification in this data maintenance workflow.
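The upstream/downstream lookup described in this maintenance workflow is, at its core, a traversal over a lineage graph. A minimal sketch, with entirely made-up table and dashboard names standing in for whatever a real catalog would store:

```python
from collections import deque

# Toy lineage graph: each asset maps to its direct downstream dependents.
# Names are hypothetical; a real catalog would build this from query logs,
# orchestrator metadata, and dashboard APIs.
lineage = {
    "raw.scooter_events": ["derived.scooter_rides"],
    "derived.scooter_rides": ["dash.city_metrics", "derived.rider_ltv"],
    "derived.rider_ltv": ["dash.finance"],
}

def downstream(table, lineage):
    """Breadth-first walk collecting every transitive downstream asset,
    i.e. everything (and everyone) to notify when `table` changes."""
    seen, queue = set(), deque([table])
    while queue:
        for child in lineage.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

print(downstream("raw.scooter_events", lineage))
# Yields the derived tables plus both dashboards built on top of them.
```

The same traversal run against a reversed graph answers the debugging question ("what upstream thing changed?") instead of the notification question.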
[00:16:12] Unknown:
In terms of the actual data discovery workflow, there have been a number of new tools and offerings that have come up in the past couple of years because of this issue around trust in data and the growth of data sources and just the complexity of the workflows that are involved. And I'm wondering how the evolution of the landscape, and the particular areas of focus that some of the different players have settled on, has influenced your thinking about what you're building at Stemma and how Amundsen is able to address the overall space of problems and adapt to situations that you hadn't encountered yourself, or that some of the users of Amundsen hadn't encountered, but that are being focused on by some of these other players?
[00:16:59] Unknown:
I would classify, like, the space in terms of competition into three categories. The first, biggest competition, in my opinion, is organizations doing nothing about this. Right? And this manifests itself, if you take the example of a data engineer notifying downstream users that this particular thing is gonna change, in a spray-and-pray mentality. And so what happens is our data engineer will, like, spam everybody: like, oh, all of this stuff is gonna change here. All analysts and data scientists and data engineers, be aware. Right? You do enough of these, and within a week, like, people will stop paying attention to these emails. So they're as good as useless. The biggest competitor, in my opinion, is doing nothing.
The second biggest competitor is there is a tendency to build something from scratch. Right? And I think that there are different cultural reasons in organizations why they do that. Sometimes there are, like, customization reasons that are more to do with, you know, very, very high security needs, where you have to do something absolutely on prem and walled off, or something like that. That leads to it. But in the majority of cases, like, I think the concepts and paradigms established in the products that already exist in the market, both open source like Amundsen and, you know, other commercial offerings, are there, like, tables, and you can extend them to models, which is what's happened in Amundsen as well, and I'll share more of that later on.
There's, like, very little need to do this thing from scratch. Right? So that's, like, the second kind of competition. And the third is there are, like, a few companies in the space. Two of them have existed for a while, Alation and Collibra, who provide data catalog and data discovery solutions. The problem with those is that those solutions almost always fail, and there are two reasons why they fail. The first one is there is this mindset of curation. Right? They rely on this army of data stewards who will go in and tag the column as ETA and the source of truth and make sure, like, it actually flows and stays up to date over time. The problem is, like, modern fast-growing companies don't have an army of data stewards. Right? And you can't have volunteers keep the thing up, because it gets out of date. These volunteers have other jobs. Right? The second reason why I think it's hard for them to succeed is that there's a tendency to build a product that's too big and behemoth. Right? So not only can you discover data and understand data, but you can now also, in these tools, query data. You can have conversations about data. You can write Wikipedia-style articles about data in these tools. At first glance, this looks great. Right? Like, you're like, oh my god, I can do this all in one single place. Right? But it's a terrible thing. Why? Because there are best-of-breed tools that already exist to do this stuff. So, for example, querying data: you already have a BI tool, or your Snowflake query editor, that can do that. So now you have two tools that users can query data from, and the data team now has to maintain both these tools. You have Slack for having conversations, where you're trying to, like, figure out whether I should have them in my data catalog or in Slack.
You have Confluence for writing wiki articles, and you're figuring out, should I write them in Confluence or in my data catalog? Right? So I think the two main reasons, and I'm publishing a blog post on this today, that catalogs fail is because they either rely too much on curation and require an army of volunteers or dedicated personnel, data stewards, to maintain that information, or because they are too bulky and broad and they lead to fragmentation in the data organization. And our approach at Stemma, the way we are different, is that we focus on automation. So can we get as much metadata as possible through automated means, by integrating with your Airflow, your Snowflake, your query logs in Snowflake, your dashboarding system's API, your conversations in Slack?
And then the second one is that we are lean. Like, I have no desire to build, like, a bulky data catalog that's a crappy suite of products. Right? My desire is to integrate with the best-of-breed products. Right? So if you use Slack for conversations, like, we will integrate with your Slack and link the conversations that are happening in Slack to the data catalog pages, so they are related and don't get lost. Right? But no desire to, like, keep piling on a crappy suite of, like, oh, you can write wiki page articles, or you can, like, have conversations in this, or you can query your data. I don't think that's the right way to solve this problem.
[00:21:41] Unknown:
Another interesting element of this space is that we have been using computers for decades, but as humans for even longer than that. And yet we still keep running into this issue of trust in the data and the analysis that we're building on top of it. And I'm wondering why you see that as still being such a momentous problem despite the levels of sophistication that we're able to achieve across so many different areas.
[00:22:09] Unknown:
I agree. This problem still exists. I think the severity of this problem has gotten worse over the years. Right? And if you look back, like, it's no surprise that it's gotten worse. Why? Because we've done a ton of innovation in ingesting data. Right? Fivetran and Stitch enable you to get data from various different sources into one centralized place. We have built a lot of great technology to store all this data in one centralized place, and Snowflake, BigQuery, and Redshift are examples of that. Then we have built technologies that help us process and further derive data from it; Airflow, Prefect, and dbt are examples of that. Then we've built technologies that, like, really help us analyze the data, which is, like, Tableau, Mode, Looker, things of that sort. And what we have done additionally is, like, we have ingrained a culture of data-driven decision making in companies and really hired people who either have, like, a very heavy analytical skill set and that's their only job, data scientists and analysts, or hired executives, product managers, and engineers who have an analytical mindset, and while their primary job is doing something else, they have a huge appetite for data. Right? Okay. So what's happened is there's a bunch of investment in getting data into the organization, there's a bunch of people hungry for data, and they have a bunch of tools to actually use this data, but the data lakes and data warehouses are so huge that nobody knows what's out there and what can be trusted. Right? So it's no surprise, looking back, that the severity of this problem has increased tenfold, if not more, over just the last few years.
[00:23:50] Unknown:
As far as Stemma and Amundsen, so Stemma is building on top of the Amundsen project, which is open source, and people are able to take it and use it and modify it for their own purposes. So what is it that you are adding to Amundsen or building alongside it to help people gain more trust in their data and streamline some of these workflows and the coordination and collaboration between data producers and data consumers and the overall business?
[00:24:20] Unknown:
Absolutely. Yeah. So Amundsen is the open source project I cocreated at Lyft, and that's an almost 2,000-person community. Stemma is a managed version of that product and has three things that are additionally available on top of Amundsen. So the first one is there's a managed offering with enterprise-grade security and two different deployment models that are offered based on your preference of, like, data residency. And the second thing is this category of intelligence, or further automation. Right? And so this includes us, for example, parsing your query logs and understanding what are the common ways this data gets joined or filtered.
And that's because once you have established that, okay, I want to use this dataset, your question is, okay, well, I've got this bikes and scooters data, but I wanna link it to the region the bike ride was taken in. But the region mapping is, like, a lookup table that's, like, some other table, but I don't know where that table is. Right? But the thing is, like, everybody who's done that analysis in the past already knows where that table is because, presumably, they've done that. And instead of you documenting all the foreign keys and creating, like, these ER diagrams, like, all of this information is in your query logs. Right? So the things that Stemma does in this intelligence category are, like, parse your query logs and make suggestions on, like, what are the most common join and filter conditions based on what everybody else is using. There's some intelligence here around, like, linking Slack conversations with a special Slack bot. The list goes on, but the whole idea is, can we reduce further the need for you to curate information in the data catalog? And that's what I'm talking about here in the second category, called intelligence.
And the third category is, at the end of the day, your data catalog is changing behavior. Right? Your data engineer has to remember to look at the data catalog to understand and notify the downstream consumers instead of just spamming everybody on the Slack channel. Your data analyst has to have this habit of, like, looking at the data in the data catalog instead of just the shoulder-tapping techniques that they've been using. And there's a bunch of, like, organizational learning that I've had, that we at Stemma have had, that enables us to make sure, like, which personas to prioritize as primary users for the data catalog first. How do you actually embed this in their workflow?
Where do you integrate first? And how do you make sure that the data catalog gets adopted and is, like, a high CSAT product for all data users at the company? And that's the 3rd category of, like, organizational adoption and success learnings that comes with Stemma. Some of that is in the product, and some of that is, like, organizational work that we do with our customers to deploy in their organizations.
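The "notify the downstream consumers instead of spamming everybody on Slack" idea can be sketched with a toy lineage graph. The lineage structure, table names, and owner mapping below are hypothetical stand-ins for what a catalog's metadata API would provide:

```python
# Assumed metadata: table -> tables derived from it, and table -> owning team.
LINEAGE = {
    "raw.rides": ["core.daily_rides", "core.rider_ltv"],
    "core.daily_rides": ["dash.exec_summary"],
}
OWNERS = {
    "core.daily_rides": "analytics-team",
    "core.rider_ltv": "growth-team",
    "dash.exec_summary": "exec-reporting",
}

def downstream_owners(table, lineage=LINEAGE, owners=OWNERS):
    """Walk the lineage graph and collect owners of every downstream asset,
    so a schema change only pings the teams actually affected."""
    seen, stack, found = set(), [table], set()
    while stack:
        for child in lineage.get(stack.pop(), []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
                if child in owners:
                    found.add(owners[child])
    return sorted(found)

print(downstream_owners("raw.rides"))
```

A data engineer changing `raw.rides` would message exactly these teams rather than a company-wide channel.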
[00:27:09] Unknown:
Now that you have this platform in the form of Stemma to build on top of and to help organizations establish trust in their data workflows and the analyses that they're building on top of it, what are some of the additional opportunities for innovation and ways to streamline the communications and workflows between those data producers and data consumers and the business users within the organization who don't necessarily have the background as a data professional to be able to do the sort of deep critique of the information that they're interacting with?
[00:27:48] Unknown:
I put these personas in 4 categories. Like, there's the product engineers who are creating production data or, like, instrumenting events. Then there's the data engineers getting this data into the warehouse, building pipelines, and delivering derived data. The next 1 is data consumers of the derived data, so your analysts, data scientists. And then the 4th 1 is business users, and that term is rather vague. So when I say business users, I use the examples of, like, a finance analyst who's maybe savvy with Excel, but not quite as savvy with SQL, a marketing analyst, a customer service analyst, and at some companies it could be your CFO. Right? Examples are those people. And so depending on the kind of data we are talking about, the producers and consumers are different. So if you're talking about raw data, the producer is the product engineer, the consumer is the data engineer, sometimes an analyst, but mostly a data engineer.
If you're talking about derived data, the producer is the data engineer, the consumer is the data scientist, data analyst. If you're talking about insights or dashboards, then the producer is the analyst or data scientist, the consumer is a business user. Right? So to answer your question, like, there are these gaps that exist between each of these verticals, and the questions at a high level are similar: what's out there? What information do I have that I can trust? And what information do I not have, so I know I don't have it? Right? And there's this concept of data mesh that's been talked about quite a lot, which all aims to, like, bridge the gap and clarify the ownership between these producers and consumers.
So the opportunity here lies in having a crisp understanding of what are the assets in question between producers and consumers, and I'm talking within the organization, and which ones do I need to prioritize first in terms of gaps. And then having metadata and a shared understanding of that metadata be there, ideally something that is not manually curated. Right? Ideally automated as much as possible. Now I'll give you examples where automation is gonna fail. Right? So if you look at the producer and consumer at the far right end of the spectrum, these are your data analysts, data scientists producing metrics or dashboards or insights and your business user consuming them.
What does revenue mean at a company? What does attrition mean, or churn, at a company? Those are problems where you need to talk to a few people and align: somebody is considered churned if they don't show up for the last 30 days and won't show up in the next 15, right, recurring revenue includes these kinds of things and not those. So those things, for example, you can't automate. And I think what's important then is to hook into the time when the metric is being created, to hook into the data scientist's workflow. Right? And when they're, like, writing or instrumenting that metric, when they're creating that dashboard, you're asking all the right questions in that flow, and you're ingesting that metadata into a data catalog, which is then exposing that to the business users, who can quickly figure out, like, what are my key metrics for the company or my team? How are they changing?
And where are they represented in dashboards? Right? So those are some of the ways that I think about it. I'm happy to dig into any of these areas if that's interesting to you. Yeah. I think that it's definitely interesting to talk about some of the sort of social aspects of
[00:31:29] Unknown:
trust in data because a lot of it is in the sort of organizational sense and not so much in the technical qualities of the processing and delivery of the data. It's more in the sort of social and contextual understanding around what the purpose and meaning is of the data, and that's why I think it continues to be such an issue as far as being able to establish and maintain this trust.
[00:31:55] Unknown:
Yeah. Absolutely. I'll share another example of this around, like, the social issues that you were talking about. Right? Like, often there's a desire for let's say you create an event in the phone app, and this event it's maybe a protobuf schema or a Segment schema, and this event needs to be documented. Right? And what happens is, like, you don't often encourage this product engineer to document this stuff. So this event flows to your data warehouse. Your data engineer has to use it. They don't have any clue as to what this means. Right? Or when it's populated, who owns this, all that stuff.
And 1 way is to, like, actually understand where this is coming from, figure out the right owner within the organization, and get them this information. Right? Another way is, like, when they were actually creating that event, that product engineer has a lot of context in their head. Right? Sometimes they write a spec. Sometimes they're writing a protobuf definition, and they may write comments in the protobuf definition that are actually, like, comments on the fields. And I think there's a lot of sort of organizational stuff that gets in the way. But 1 of the things that we did at Lyft that was super successful is that we said, like, for certain tables, the descriptions and column descriptions would be read only, because the source of truth is the protobuf file where they were defined by the product engineer. And we built some tooling to, like, encourage and ensure the product engineers are putting that information in there. And I think it was 1 of the best decisions for getting that information. Right? So the learning we've had is that it's best to get this information in the flow of the users, and, also, it's best to show this information in the flow of the users. And so exposing the information in the data catalog into the various different tools that a data analyst, data scientist, data engineer would use is, like, super meaningful for a product in this category.
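The protobuf-comments-as-source-of-truth pattern might look roughly like this. The message, fields, and regex extraction are illustrative; in practice you would parse the file with the protobuf toolchain rather than with a regex:

```python
import re

# Hypothetical event definition, with field comments written by the
# product engineer at the moment they have the context in their head.
PROTO = """
message RideEvent {
  // Unique id assigned at ride creation.
  string ride_id = 1;
  // City region the ride started in; joins to the regions lookup.
  string region_id = 2;
}
"""

def field_comments(proto_text):
    """Map field name -> the // comment on the preceding line, so the
    catalog can ingest these as read-only column descriptions."""
    pattern = re.compile(r"//\s*(.+?)\n\s*\w+\s+(\w+)\s*=\s*\d+;")
    return {field: comment.strip() for comment, field in pattern.findall(proto_text)}

print(field_comments(PROTO))
```

The catalog then displays these descriptions as read-only, since the .proto file, not the catalog, is the source of truth.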
[00:33:52] Unknown:
And so for organizations that have built up a level of sophistication around how they're processing data, and they have the contextual information about the purpose and intent of the data, what are some of the areas where you still see friction come up in terms of the workflows around the data or the requests for analysis that continue to be challenging to fulfill despite the understanding that might already be established in the organization?
[00:35:05] Unknown:
So I see 2 constant sets of friction that come up that I don't think we have solved, and I have an opinion on them, so I'll share them. The first 1 is around analysis. So a PM may come to a data analyst or a data scientist and be like, I would like you to help me find, like, how this particular budget is being used. And the thing is, analyst time is very valuable, right, like all knowledge worker time. And the interesting question that doesn't get asked often is: what decision are you gonna make from this information, and what are the options for that decision? Right? Because if you are looking for something just as an FYI, in a fast growing company that may not be a good enough reason; the important reason would be to actually make a decision.
And having that level of clarity, like, oh, I'm looking for this metric. If this is more than x percent above the norm, then I would like to change the product in this way. And if it's less than that, then I'd like to keep it or change the product in this other way. Right? That's a very good answer. And so I think that level of conversation is something we're not having in the analytics world. Now on the data engineering side, which may be more relevant for this podcast, there's still a tendency to hire your way out of creating derived data models. Right? So it's like every area in the company will have a data engineer who's writing pipelines and maintaining pipelines.
And I find that that approach is not scalable. Right? You are much better off having your data engineers write pipelines and data for certain core parts of the company. So let's say, like, maybe 20% of the company, and they are doing that for maybe core financial metrics or the key company wide metrics that everybody's gonna use. Right? But then you are better off democratizing, like, all the various, you know, marketing data scientists, marketing data analysts to write derived models from them. Why? Because, a, they have all the domain context. Right? And, b, it's just a matter of, like, them picking up a few skills, or you building or deploying some technology or tools that enable them to do that. Right? If I were to back up and think about, like, the data engineering workflow, there are 2 parts to this in terms of what are the skills, like, what's the taste that's required in order to be a good data engineer. And I may be off here, so if anyone disagrees with this or thinks differently about this, please comment on the podcast.
1 skill is this taste of modeling. Right? What is the breadth of my table? What's the grain of my table? What's the depth of my table? So on and so forth. Right? Where do I split them apart? The second thing is, like, okay, I've done the modeling. Now I need to figure out how do I actually implement this in an efficient and effective way, so it runs on time and it's partitioned properly, so on and so forth. Right? I find that the first category of modeling is very much a taste thing. Like, this is something you need a human to do. I don't think this can be, like, automated by any means. Right? And I think this is a skill that's unique to data engineers, but I think there is opportunity to democratize this particular modeling skill to the larger technical data consumers, those mostly being data scientists and data analysts.
And data scientists and data analysts also have a lot of domain expertise. So a marketing data analyst may actually know a ton about marketing. And in some ways, if you just, like, help them understand some of the modeling best practices, they can craft a really, really good data model. Right? So that part can be democratized. The second part, around creating efficient jobs that are partitioned well and run well, can be solved with tooling. Right? Whereas historically, what we've tried to do is we've tried to push analysts and data scientists to deal with, like, all the configuration properties and what parameters for dynamic partitioning you need to use for what kind of jobs. All of that stuff is, like, bogus. Like, we need to stop doing that. We need to elevate the abstraction so that the data analysts and data scientists, once the model is figured out, can actually start writing their own pipelines. And dbt, for example, has gone a long way in moving that level of abstraction up, but I think there's more to do here. Right? And that is where I see this going.
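One way to picture "elevating the abstraction" is a declarative model spec that tooling compiles into a partitioned table definition, loosely in the spirit of dbt. The spec format and the generated SQL dialect here are invented for illustration, not any real tool's API:

```python
# Hypothetical sketch: an analyst declares a derived model, and the
# tooling handles partitioning and SQL generation for them.
MODEL = {
    "name": "daily_rides_by_region",
    "source": "core.rides",
    "dimensions": ["ride_date", "region_id"],
    "measures": {"ride_count": "COUNT(*)"},
    "partition_by": "ride_date",  # engine-specific details hidden by the tool
}

def compile_model(model):
    """Turn the declarative spec into a partitioned CREATE TABLE statement."""
    cols = ", ".join(model["dimensions"] +
                     [f"{expr} AS {name}" for name, expr in model["measures"].items()])
    return (f"CREATE TABLE {model['name']} "
            f"PARTITION BY ({model['partition_by']}) AS "
            f"SELECT {cols} FROM {model['source']} "
            f"GROUP BY {', '.join(model['dimensions'])}")

print(compile_model(MODEL))
```

The analyst supplies the domain knowledge (what to model); the partitioning and execution details live in the tooling, which is the division of labor Mark is arguing for.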
[00:39:34] Unknown:
Another interesting element of this overall conversation about trust in data and data catalogs and data discovery, particularly in the past 6 months, year, somewhere thereabouts, has been focused on the use of data for analytical purposes with the focus being on things like analytics engineers, end users of the business, interactions with business intelligence and dashboarding tools, and where the data is largely being sourced from a data warehouse that has, you know, a decent amount of structure to it. And I'm wondering what your experience has been as far as the applicability and opportunities for data catalogs and specifically to Amundsen and what you're doing at Stemma for more machine learning and AI oriented workflows and sourcing data from unstructured sources or data lakes?
[00:40:26] Unknown:
Yeah. I think you are right. These tools started out serving more analytical workloads, then moved towards more data engineering, like migration heavy or consumption heavy, like notification heavy, sorts of workloads. And what we are finding now is, like, we're beginning this journey of helping the ML users. Right? So, for example, 1 thing that's happened in Amundsen is, like, we are on version 2 of having feature discovery as a part of Amundsen. And the first version was done by this company called GetInData, and what they did was they took the data model for tables in Amundsen, and they extended that to include features and specific feature attributes.
But it was a little bit of, like, sort of shoehorning that concept into the table concept. And what's happening now is that Amundsen has, thanks to the team at Lyft, the second variant of this ML feature discovery. And so there, you could easily extend it. So the first version integrated with things like Feast, and the second version is, like, much more flexible and can integrate with a variety of different feature stores. And that's the way I see this evolving. Right? Any data catalog has to index and automate a bunch of different resources. What resources you choose depends on what persona you are catering to. So the initial resources were from, like, data warehouses and data lakes, structured SQL analytical data, and indexing, like, Tableau and Mode and Looker, that kind of thing. And now we're moving on to resources that are more like notebooks, and then features that cater to more of the ML users. I see that evolving as the product goes on, and there's a much larger sort of road map around what kind of resources we want to index at what time, based on which persona is the 1 that we are interacting with. The next up is, like, the business user. Right? The thing that's happening with business users is that there's this marketplace of insights between the data scientists and data analysts and the business users.
Data scientists, data analysts produce these insights. Business users consume them. But there's no, like, standard way of defining what a metric is. Right? And it usually requires, like, a bunch of organizational stuff: you tell the data scientist, like, okay, this is a metric. The data scientist instruments it and puts it in a dashboard. It shows up in a dashboard, and then maybe you build v2 of the dashboard and people are still using v1. Pretty bad. Right? So there's no standard for defining metrics, and there are 2 companies, start ups, that are trying to standardize on a platform for defining metrics and be able to serve that to the business users and the data scientists and data analysts.
And 1 of those companies is called Supergrain. I believe the website is supergrain.com. And the second 1 is called Transform Data, which is, I think, transformdata.io. And they're both trying to define, like, what is the standard way of defining metrics. So at the point where those become the standard way of defining metrics, a data catalog then becomes a read only view sharing the metric definitions that are defined there. And I think these are the kinds of innovations and problems that we need to solve in the future. In terms
[00:43:38] Unknown:
of the applications of Amundsen and Stemma now that you've launched it, what are some of the most interesting or innovative or unexpected ways that you've seen it
[00:43:47] Unknown:
used? Yeah. Some of the very interesting things that are happening here are using a data catalog to do classification and auditing of sensitive data in the warehouse. So Square is an example of a company that does this with open source Amundsen. They have taken Amundsen, whose original intent was to cater to data discovery and data trust for producers, data engineers, and consumers, data scientists, data analysts. And they've started tagging, using automation from Google DLP, similar to AWS Macie, what data is sensitive, and they are tagging columns with PII name, PII email. And then if the confidence is low, the data catalog becomes a place where they go approve or reject that. It is the data owner who does that. And once this is all there, then you are alerting, notifying when PII shows up in places where it's not supposed to show up. And that's, like, 1 of the most interesting use cases that's come up, and that's 1 of the things that I was alluding to earlier: the same metadata that's being used for the discovery and cataloguing use cases can power adjacent use cases like privacy and classification and CCPA, GDPR, which is exactly the direction that Square is headed. And in your experience
[00:45:09] Unknown:
of building the business of Stemma and building on top of the Amundsen project and continuing to be engaged with that community, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:45:22] Unknown:
1 thing that's been interesting for me is that, coming in to start the business, I thought that everything was on the cloud, and I guess it is still true that everything is on the cloud. What I didn't realize was that there would be a lot more conversations about what's on my cloud and what's on your cloud. That was something that was unexpected to me, and we've learned and we've changed the product and the deployment models based on that. But it was a really great learning. The second thing, from a product perspective, I would say, is that there's still a lot to be done here. Like, there are 4 personas plus the organizational work. So there's product engineers, data engineers, data scientists and analysts, business users.
And each of these personas has distinct needs, so there's a lot of thought on, like, which of these personas we're gonna cater to. By the way, Stemma caters first and foremost to the middle ones, data scientists, data analysts, and data engineers. And so, like, that doesn't mean we don't have products for the others, but those 2 remain our first focus. Right? And so there's a lot of work that needs to be done: even if you pick just these 2 personas, you're like, man, I can go deep and still, like, have a ton of work to do in 5 years. Right? So there's a lot of work to do for each of these 4 personas, as well as for the organization that wants to, like, make sure that it tracks where sensitive data is in the company, that it's locked down, that there's auditing and alerting when it shows up in places it doesn't belong.
And that is, like, another place we need to keep investing.
[00:46:53] Unknown:
For people who are experiencing issues with data trust and they're investigating the offerings for different data catalogs and data discovery platforms, what are the cases where Stemma is the wrong choice?
[00:47:05] Unknown:
I think the place where Stemma is the wrong choice for these people is if you want a very, like, high control data catalog. Right? So imagine a world in which you want everything to be curated, and more importantly, you want only the owner to update the descriptions, and you wanna get a notification, or an approval request, when someone wants to update a description. In a world where you want to tightly control and manage the entirety of your data, Stemma is the wrong tool for you. Because Stemma is built from the perspective where you don't have, like, an army of data stewards. You want to democratize data within the company, and your company has, like, these pockets of excellence in certain domains, and you want to use those pockets, mostly automation with curation on the top 20%, to enable them to be better at making data driven decisions. Right?
And so if you want a high control environment for your entire data, with workflows for approval and rejection on simple things like description updates, then Stemma isn't the right product for you.

And so as you continue to build out Stemma and interact with your customers and engage with the Amundsen community, what are some of the things that you have planned for the near to medium term, and any projects that you're especially excited for?

Yeah. So I talked about the 4 personas in the organization a moment ago. Talking about those 2 personas, right, specifically for data engineers, I'll dig into just that. The full answer to this question is a little long for this podcast. But for the data engineers, I think there's, like, a lot of work we are doing and continue to do in making their job easier, both in terms of trusting this new product data that I'm gonna use in order to derive pipelines, but more interestingly and recently, like, migrations, and how can I help derisk and speed up migrations by looking at data and how it's being used, by understanding and figuring out which groups of data need to move together, by understanding what are the sources this data is coming from, because maybe they need to be migrated first before I move this? And so, like, that's what we are planning in the future for Stemma for data engineers.
More broadly, each of these 4 personas, like I said before, has different needs around, like, understanding what assets they usually use and whether they are trustworthy. And our goal is, for each of them, to get to a point where we are doing 80% automation and then 20% curation for the most heavily used data assets. Right? And those are the areas we're investing in. The second thing we wanna do is enable the organization to know very easily where their sensitive data is and then be able to classify, understand, and derisk that information.
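The classify-then-review flow for sensitive data described here, and in the earlier Square example, can be sketched as a simple confidence triage. The scan results, tag names, and threshold are hypothetical; a real setup would pull these from a DLP-style scanner:

```python
# Hypothetical scanner output: an automated classifier (standing in for a
# DLP service) has tagged columns as possible PII, with a confidence score.
SCAN_RESULTS = [
    {"column": "users.email", "tag": "pii_email", "confidence": 0.98},
    {"column": "users.note", "tag": "pii_name", "confidence": 0.41},
]
THRESHOLD = 0.8  # illustrative cutoff between auto-tag and human review

def triage(results, threshold=THRESHOLD):
    """Split scan results: high confidence is tagged automatically,
    low confidence goes to the data owner's approve/reject queue."""
    auto, review = [], []
    for r in results:
        (auto if r["confidence"] >= threshold else review).append(r)
    return auto, review

auto_tagged, needs_review = triage(SCAN_RESULTS)
print([r["column"] for r in auto_tagged])
print([r["column"] for r in needs_review])
```

Once tags are in place, the same metadata can drive alerting when PII shows up in tables where it shouldn't, which is the adjacent use case Mark describes.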
[00:50:05] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I would just like to briefly get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:50:22] Unknown:
The change of who's writing pipelines in the company. Historically, data engineers have been the only people who can write the pipelines, and now, more and more, that skill is getting democratized, and analysts and data scientists can write pipelines. I think the part that I still see missing is the data modeling part. So back to the 2 skills that I think are unique to a data engineer: the data modeling is a very taste driven exercise, and data analysts and data scientists have huge domain expertise. And I think to the extent we can democratize this data modeling skill, data engineers can focus on higher impact, higher value problems only for that 20% of data within the company, like the core financial metrics or company wide metrics, or more platform infrastructure things to bring data into the warehouse and out of the warehouse, and enable,
[00:51:17] Unknown:
like, more leverage for a data engineering role if we are able to democratize that. So that's, like, 1 trend I'm seeing, and there's 1 sort of gap that I see related to that that I wanted to share with you before we end. Alright. Well, thank you very much for taking the time today to join me and share the work that you've been doing with Amundsen and the business that you're building with Stemma. It's definitely a very interesting and important problem domain. So I appreciate all the time and energy you've put into it, and I hope you enjoy the rest of your day. Thank you, Tobias. Had a really good time. Thank you for having me. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction to Mark Grover and Stemma
Challenges at Lyft and the Birth of Amundsen
Stemma’s Role in Data Workflows
Data Discovery and Catalog Landscape
Trust in Data: Persistent Challenges
Opportunities for Innovation in Data Workflows
Social Aspects of Data Trust
Data Catalogs for Machine Learning and AI
Unexpected Uses and Lessons Learned
Future Plans and Exciting Projects