Summary
All of the fancy data platform tools and shiny dashboards that you use are pointless if the consumers of your analysis don’t have trust in the answers. Stemma helps you establish and maintain that trust by giving visibility into who is using what data, annotating the reports with useful context, and understanding who is responsible for keeping it up to date. In this episode Mark Grover explains what he is building at Stemma, how it expands on the success of the Amundsen project, and why trust is the most important asset for data teams.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
- We’ve all been asked to help with an ad-hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. Go to dataengineeringpodcast.com/census today to get a free 14-day trial.
- Your host is Tobias Macey and today I’m interviewing Mark Grover about his work at Stemma to bring the Amundsen project to a wider audience and increase trust in their data.
Interview
- Introduction
- Can you describe what Stemma is and the story behind it?
- Can you give me more context into how and why Stemma fits into the current data engineering world? Among the popular tools of today for data warehousing and other products that stitch data together – what is Stemma’s place? Where does it fit into the workflow?
- How has the explosion in options for data cataloging and discovery influenced your thinking on the necessary feature set for that class of tools? How do you compare to your competitors?
- With how long we have been using data and building systems to analyze it, why do you think that trust in the results is still such a momentous problem?
- Tell me more about Stemma and how it compares to Amundsen?
- Can you tell me more about the impact of Stemma/Amundsen to companies that use it?
- What are the opportunities for innovating on top of Stemma to help organizations streamline communication between data producers and consumers?
- Beyond the technological capabilities of a data platform, the bigger question is usually the social/organizational patterns around data. How have the "best practices" around the people side of data changed in the recent past?
- What are the points of friction that you continue to see?
- A majority of conversations around data catalogs and discovery are focused on analytical usage. How can these platforms be used in ML and AI workloads?
- How has the data engineering world changed since you left Lyft/since we last spoke? How do you see it evolving in the future?
- Imagine 5 years down the line and let’s say Stemma is a household name. How have data analysts’ lives improved? Data engineers? Data scientists?
- What are the most interesting, innovative, or unexpected ways that you have seen Stemma used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Stemma?
- When is Stemma the wrong choice?
- What do you have planned for the future of Stemma?
Contact Info
- @mark_grover on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- Stemma
- Amundsen
- CSAT == Customer Satisfaction
- Data Mesh
- Feast open source feature store
- Supergrain
- Transform
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
RudderStack is the smart customer data pipeline. Easily build pipelines connecting your whole customer data stack, then make them smarter by ingesting and activating enriched data from your warehouse, enabling identity stitching and advanced use cases like lead scoring and in-app personalization. Start building a smarter customer data pipeline today. Sign up for free at dataengineeringpodcast.com/rudder. Your host is Tobias Macey. And today, I'm interviewing Mark Grover about his work at Stemma to bring the Amundsen project to a wider audience and increase trust in their data. So, Mark, can you start by introducing yourself? For sure. Thank you for having me, Tobias. It's great to be back. I am
[00:01:36] Unknown:
Mark Grover. I am the cofounder and CEO of Stemma. And Stemma is a managed data discovery and metadata platform. I also created the leading open source data discovery and metadata platform called Amundsen at Lyft. And when I did that, I was a PM at Lyft. And prior to that, I worked at Cloudera as a software developer, and I worked on Hive and Spark and Apache Bigtop. Super excited to be here today.
[00:02:05] Unknown:
And you've been on the show, it's been a couple of years ago now, talking about the Amundsen project around the time that it was open sourced. And so now you've built the Stemma business to help continue the work that you've done on Amundsen. I'm wondering if you can just give a bit more of an overview about what it is that you're building at Stemma and some of the story behind how you decided to create a company around the Amundsen project, and what it is about this overall space that has inspired you to spend so much time and energy on continuing to bring it forward.
[00:02:38] Unknown:
Absolutely. So the story starts at Lyft. And when I got to Lyft, Lyft was in this heavy growth phase where they were doubling every year, including the number of data users, and Lyft had a very data-driven culture. The problem at Lyft wasn't that we didn't have the right ingestion streams to bring data in, that we didn't have the right data in the warehouse, or that we didn't have a warehouse. You know? The problem wasn't that we didn't have the tools in order to consume data. Like, we had Airflow set up for doing derived data processing and generating, you know, derived data from raw data.
We had an internal streaming platform that was bringing in events from the Lyft web application, Lyft services, and, most importantly, the Lyft mobile applications. We had a warehouse that was built off of Presto, an ETL engine that was running Hive and Spark. We had tools to consume analytical data, Tableau, Looker, Mode, Superset, and then we had some internal tools for consuming data for ML purposes. So the problem wasn't that we didn't have the tools to create data or use data. The big problem at Lyft was we had so much data that no one had any idea what data existed, where it was, is it trustworthy, can I use it for my use case, and how do I use it? Right? These were all the questions that were bogging down data analysts and data scientists as well as data engineers.
And I remember this key moment when I first got there, talking to a data scientist, and they were trying to optimize ETAs, the time a driver takes to pick you up when you order a Lyft ride. And if you've taken a Lyft ride, you know, you open the app. It tells you, like, hey, Tobias, your ride is 2 minutes away. And then you go through this request funnel and that ETA number changes. Unfortunately, it goes up sometimes. Right? So we measure, like, ETA 5 times in a session, all the way from the first time you open the app to the actual ETA when the driver shows up, and this data scientist was looking for the actual ETAs so they could compare the new predictions with the past actual ETAs.
And to make things worse, we often had these models that were running in shadow mode. Right? So, like, there are 2 models running when you open the app. One is showing you the ETA results, but the other one is just, like, running and logging some results, but it's never shown to the users. So the long story short is, like, we had a bunch of past models that had persisted data in the data warehouse. We had shadow models where we didn't quite know which one was in production and which one was not being shown to the users. So what happens is your warehouse now has, I don't know, 200-odd columns that have something to do with ETA, and you're like, well, which one of these is the source of truth? Right?
The canonical way to solve this problem in the past has been, oh, well, Alice works on the ETA team, so I'm gonna give, like, Alice the responsibility to, like, tag the right column as the source of truth for ETA. And the problem with that is, like, in a fast-growing organization, a, Alice has real responsibilities as a software developer, data scientist, or data engineer, and b, things are changing so often that you can't keep that source of truth always up to date. Right? And so these past curated approaches to data curation were not gonna work out at Lyft, and that led me to look around at various different solutions, commercial, open source, and internal at other companies, around an automated data catalog. Right? Something that can use information about how this dataset is used by others, by processes, by systems, how often it is generated, are those people on my same team, how many dashboards are built on top of it, all this information to determine what could be useful for you. Right? So we'll never be able to tell you, like, oh my god, I have a 100% guarantee this is the right thing for you, and there are still caveats to doing that. But maybe for the 80% case, the 90% case, we can get there and say, like, hey, 90% of the people are using this dataset, and here's all the ways they're using this particular dataset. And then you can determine, like, yeah, maybe that's good enough for me. Right? And that led me to create Amundsen, which is an open source automated data catalog. I will also say I use the terms data catalog, data discovery tool, and metadata engine interchangeably, and I can talk a little more, separately, about how these terms get used in the industry, why they are different, and why I use them interchangeably.
But Amundsen is a data catalog. I created that at Lyft. It was super successful there. It is, to date, the highest CSAT-scoring data product at Lyft. It has 750 weekly active users; 80% of the data engineers, data scientists, and data analysts use it every week at Lyft. And then we open sourced this product. There's 40 companies that use this product in the open. There's Instacart, Brex, Asana, Square, Workday, ING, and many more. Right? And the answer to, like, the story behind Stemma is twofold. Right? One is that I believe that this trust-in-data problem is one of the key problems to solve in the data community. And if we have to solve this problem in the larger number of enterprises, simply having an open source solution and having them deploy it isn't gonna be effective for them.
And that was one of the primary reasons I started Stemma: to solve this problem for the larger market. The second one is, as we started solving this problem, what happened at Lyft was we started to solve this problem from a trust angle for the data scientists and data analysts, the consumers of data, and then realized that there are producers of data, data engineers, that also need help, actually in slightly different ways, but the same product can provide them the help. And that help is usually around, like, migrations and who's using my data and debugging a failure that happened in my job. We'll talk more about that today too. So that's the second persona, data engineers. And the third use cases are around the company. CCPA rolled in in 2020, and in late 2019, Lyft did a whole lot of work to understand what data was out there, where the personal information was stored, how we were gonna, like, classify it, and how we were gonna handle it when somebody requests deletion.
And the same metadata was very useful for enabling those privacy and compliance needs. And so as I look forward at Stemma, I see two areas of work. One is enabling and solving this problem of trust in data for various different personas, starting with data analysts and data scientists, then data engineers, and thirdly, business users; and secondarily, solving this problem of data privacy and classification, in this more and more heavily regulated data space, for the larger enterprises.
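The usage-based ranking Mark describes earlier in this answer, surfacing the dataset most people actually rely on instead of asking an owner to curate a source-of-truth tag, can be sketched in a few lines. Everything here is hypothetical: the table names, the signals, and the naive distinct-users-plus-dashboards score are illustrative only, not how Amundsen or Stemma actually weight metadata.

```python
from collections import Counter

# Hypothetical usage records pulled from query logs: (user, table) pairs.
query_log = [
    ("alice", "rides.eta_actuals"),
    ("bob", "rides.eta_actuals"),
    ("carol", "rides.eta_actuals"),
    ("dave", "rides.eta_shadow_v2"),
]

# Hypothetical count of dashboards built on each table.
dashboards = {"rides.eta_actuals": 7, "rides.eta_shadow_v2": 0}

def rank_tables(log, dashboards):
    """Score each table by distinct users plus dashboards built on it,
    and return table names ordered from most to least used."""
    users = {}
    for user, table in log:
        users.setdefault(table, set()).add(user)
    scores = {
        table: len(u) + dashboards.get(table, 0)
        for table, u in users.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

print(rank_tables(query_log, dashboards))
# The widely used ETA table ranks first; the shadow-model table ranks last.
```

That is the 80–90% answer from the interview: no guarantee it is the right table, but the one most of the organization is already using floats to the top.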
[00:10:02] Unknown:
In terms of the sort of overall landscape of data workflows and the tools that are available to data engineers and the problems that they're being faced with and the growth in sort of data discovery and data catalogs as a category of tools. Can you just give your perspective on what you see as being Stemma's place and position in this overall ecosystem and some of the ways that it fits into the workflow for data engineers and data producers and the data consumers that they're supporting?
[00:10:36] Unknown:
So I think it's great to dig into some of the key data engineering workflows. Right? So the first key data engineering workflow is creating a new dataset. Right? And that comes from when you are working with a data analyst or a data scientist and you are maybe instrumenting a new metric. So to make a concrete example, say, like, you're a business like Lyft and you're launching, like, bikes and scooters in a new city. Right? So what you need to do is you need to have a dataset for bikes and scooters, and the way this works is that you gotta make sure that the scooters are instrumented, and there's a product engineering team, often, who's, like, actually instrumenting the scooters and then sending events to the data warehouse for that. Right? And then you have to understand, like, what events are being logged, what do they mean, when do these events get triggered.
You may have questions for the product engineer who worked on this. You may actually ask the data scientist a bunch of questions because, sometimes, they know more about the domain and what an event may mean or what a particular column in an event may mean. So there's a question of, like, what data do I use? Do I understand this? A question of, like, if you have multiple events, a question around, like, trusting that event. Then there's the real hard work of, like, actually building a pipeline. So this may involve, like, writing some code, which could be DAG code, could be, like, a dbt-style parametrized SQL script, and then there's work around orchestrating this in something like Airflow or Prefect or dbt Cloud, and then, like, sharing it with your stakeholders, both on the product side and the data side, and getting feedback on it. So, like, that's one of the key workflows.
The second key workflow, and I will say that there's another parallel workflow here of, like, data that gets replicated from the production databases. It's pretty similar, but there, you are talking to somebody who's creating, like, a Dynamo table or a MySQL or Postgres table in the app database, and that's getting replicated over, and sometimes you're even responsible for configuring the replication system and making sure it's up to date. The second workflow is, let's say bikes and scooters have been launched for a year. Right? Now you have to maintain this dataset because, you know, life evolves. So some changes happen, like somebody deprecated a column.
If you're lucky, they told you that they were going to deprecate a column in the upstream store. If you're unlucky, which is mostly the case, you find out when someone wakes you up. Either a pager, and that's also the lucky part, but most likely somebody else, like a data scientist or, you know, some exec, saw some dashboard and was like, oh my god, this dashboard looks off. I'm pretty sure, like, XYZ metric wasn't, like, 10% of what it was last week. So I'm certain there's something wrong in the data, and then you get into a fire drill. Right? So this fire drill involves two things. It involves you debugging what the hell is going on, and then the second thing involves you figuring out how to fix it. Sometimes the fix is in your hands; sometimes the fix is in somebody else's hands. So these are the two main workflows. And to talk about the question that you asked, where does Stemma fit into each of these workflows? I'll answer that now. So in the first workflow, where you're creating or instrumenting a new dataset or a new metric, Stemma fits in the first part, where you are understanding what events are out there related to bikes and scooters.
What do they mean? How often do they get triggered? Are they still up to date? What do column metrics look like for every single column in that particular event? Who else is using that? Has someone else already built an ETL in this area? Has someone else, like, built a dashboard in this area that I can learn from? Those are the kinds of questions around context, understanding, and trusting that get answered in that first workflow of creating a new table, a new event, or instrumenting a new metric. On the second side, which is, okay, this is data maintenance. Right? The tables exist, the pipelines exist, and you have either been notified by a human that something is wrong, or something will be wrong if you don't change something, or you have been notified by a pager that your job failed. And this requires two things. This requires you looking upstream and seeing, okay, I understand that this particular table is off, and usually a particular column in that table is off, and I need to look up to know: has anything awkward happened in the things that this stems from? Right? And so you're looking at what tables or fields this comes from, have there been any issues in those tables or fields, so on and so forth. That's one. Once you have figured that out, maybe you have to make a change to your existing pipeline to accommodate that. An example is, like, oh, we only supported, like, two kinds of OSes in the past. Right? iOS and Android. And let's say a new hypothetical OS launches, and we have to support that. You were only expecting two values in a column, but now there are three values in a column, and you gotta update your ETL to accommodate that. Right? Now you gotta notify all your downstream people that that's a change that's happened, and when you have to do that, you have to do three things. Okay, who are the people who ad hoc query this data? I will notify them. Who are the people who have built dashboards on this stuff?
I would need to notify them. And who are the people who have built further derived ETL on this stuff? I would notify them too. So the place where Stemma fits in here is to give you insight into the upstream and the downstream information so you can, a, use it for debugging and, b, use it for notification in this data maintenance workflow.
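The upstream/downstream lookup described in this maintenance workflow is, at its core, a traversal over a lineage graph. A minimal sketch, with entirely made-up table and dashboard names standing in for whatever a real catalog would store:

```python
from collections import deque

# Toy lineage graph: each asset maps to its direct downstream dependents.
# Names are hypothetical; a real catalog would build this from query logs,
# orchestrator metadata, and dashboard APIs.
lineage = {
    "raw.scooter_events": ["derived.scooter_rides"],
    "derived.scooter_rides": ["dash.city_metrics", "derived.rider_ltv"],
    "derived.rider_ltv": ["dash.finance"],
}

def downstream(table, lineage):
    """Breadth-first walk collecting every transitive downstream asset,
    i.e. everything (and everyone) to notify when `table` changes."""
    seen, queue = set(), deque([table])
    while queue:
        for child in lineage.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

print(downstream("raw.scooter_events", lineage))
# Yields the derived tables plus both dashboards built on top of them.
```

The same traversal run against a reversed graph answers the debugging question ("what upstream thing changed?") instead of the notification question.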
[00:16:12] Unknown:
In terms of the actual data discovery workflow, there have been a number of new tools and offerings that have come up in the past couple of years because of this issue around trust in data and the growth of data sources and just the complexity of the workflows that are involved. And I'm wondering how the evolution of the landscape, and the particular areas of focus that some of the different players have settled on, has influenced your thinking about what you're building at Stemma and how Amundsen is able to address the overall space of problems and adapt to situations that you hadn't encountered yourself, or that some of the users of Amundsen hadn't encountered, but that are being focused on by some of these other players?
[00:16:59] Unknown:
I would classify, like, the space in terms of competition into three categories. The first, biggest competition, in my opinion, is organizations doing nothing about this. Right? And this manifests itself, if you take the example of a data engineer notifying downstream users that this particular thing is gonna change, in a spray-and-pray mentality. And so what happens is our data engineer will, like, spam everybody: like, oh, all of this stuff is gonna change here. All analysts and data scientists and data engineers, be aware. Right? You do enough of these, and within a week, like, people will stop paying attention to these emails. So they're as good as useless. The biggest competitor, in my opinion, is doing nothing.
The second biggest competitor is there is a tendency to build something from scratch. Right? And I think that there are different cultural reasons in organizations why they do that. Sometimes there are, like, customization reasons that are more to do with, you know, very, very high security needs, where you have to do something absolutely on prem and walled off, or something like that. That leads to it. But in the majority of cases, like, I think the concepts and paradigms established in the products that already exist in the market, both open source like Amundsen and, you know, other commercial offerings, are there, like, tables, and you can extend them to models, which is what's happened in Amundsen as well, and I'll share more of that later on.
There's, like, very little need to do this thing from scratch. Right? So that's, like, the second kind of competition. And the third is there are, like, a few companies in the space. Two of them have existed for a while, Alation and Collibra, who provide data catalog and data discovery solutions. The problem with those is that those solutions almost always fail, and there are two reasons why they fail. The first one is there is this mindset of curation. Right? They rely on this army of data stewards who will go in and tag the column as ETA and the source of truth and make sure, like, it actually flows and stays up to date over time. The problem is, like, modern fast-growing companies don't have an army of data stewards. Right? And you can't have volunteers keep the thing up, because it gets out of date. These volunteers have other jobs. Right? The second reason why I think it's hard for them to succeed is that there's a tendency to build a product that's too big and behemoth. Right? So not only can you discover data and understand data, but you can now also, in these tools, query data. You can have conversations about data. You can write Wikipedia-style articles about data in these tools. At first glance, this looks great. Right? Like, you're like, oh my god, I can do this all in one single place. Right? But it's a terrible thing. Why? Because there are best-of-breed tools that already exist to do this stuff. So, for example, querying data: you already have a BI tool, or your Snowflake query editor, that can do that. So now you have two tools that users can query data from, and the data team now has to maintain both these tools. You have Slack for having conversations, where you're trying to, like, figure out whether I should have them in my data catalog or in Slack.
You have Confluence for writing wiki articles, and you're figuring out, should I write them in Confluence or in my data catalog? Right? So I think the two main reasons, and I'm publishing a blog post on this today, that catalogs fail is because they either rely too much on curation and require an army of volunteers or dedicated personnel, data stewards, to maintain that information, or because they are too bulky and broad and they lead to fragmentation in the data organization. And our approach at Stemma, the way we are different, is that we focus on automation. So can we get as much metadata as possible through automated means, by integrating with your Airflow, your Snowflake, your query logs in Snowflake, your dashboarding system's API, your conversations in Slack?
And then the second one is that we are lean. Like, I have no desire to build, like, a bulky data catalog that's a crappy suite of products. Right? My desire is to integrate with the best-of-breed products. Right? So if you use Slack for conversations, like, we will integrate with your Slack and link the conversations that are happening in Slack to the data catalog pages, so they are related and don't get lost. Right? But no desire to, like, keep piling on a crappy suite of, like, oh, you can write wiki page articles, or you can, like, have conversations in this, or you can query your data. I don't think that's the right way to solve this problem.
[00:21:41] Unknown:
Another interesting element of this space is that we have been using computers for decades, but as humans for even longer than that. And yet we still keep running into this issue of trust in the data and the analysis that we're building on top of it. And I'm wondering why you see that as still being such a momentous problem despite the levels of sophistication that we're able to achieve across so many different areas.
[00:22:09] Unknown:
I agree. This problem still exists. I think the severity of this problem has gotten worse over the years. Right? And if you look back, like, it's no surprise that it's gotten worse. Why? Because we've done a ton of innovation in ingesting data. Right? Fivetran and Stitch enable you to get data from various different sources into one centralized place. We have built a lot of great technology to store all this data in one centralized place, and Snowflake, BigQuery, and Redshift are examples of that. Then we have built technologies that help us process and further derive data from it; Airflow, Prefect, and dbt are examples of that. Then we've built technologies that, like, really help us analyze the data, which is, like, Tableau, Mode, Looker, things of that sort. And what we have done additionally is, like, we have ingrained a culture of data-driven decision making in companies and really hired people who either have, like, a very heavy analytical skill set and that's their only job, data scientists and analysts, or hired executives, product managers, and engineers who have an analytical mindset, and while their primary job is doing something else, they have a huge appetite for data. Right? Okay. So what's happened is there's a bunch of investment in getting data into the organization, there's a bunch of people hungry for data, and they have a bunch of tools to actually use this data, but the data lakes and data warehouses are so huge that nobody knows what's out there and what can be trusted. Right? So it's no surprise, looking back, that the severity of this problem has increased tenfold, if not more, over just the last few years.
[00:23:50] Unknown:
As far as Stemma and Amundsen, so Stemma is building on top of the Amundsen project, which is open source, and people are able to take it and use it and modify it for their own purposes. So what is it that you are adding to Amundsen or building alongside it to help people gain more trust in their data and streamline some of these workflows and the coordination and collaboration between data producers and data consumers and the overall business?
[00:24:20] Unknown:
Absolutely. Yeah. So Amundsen is the open source project I cocreated at Lyft, and that's an almost 2,000-person community. Stemma is a managed version of that product and has three things that are additionally available on top of Amundsen. So the first one is there's a managed offering with enterprise-grade security and two different deployment models that are offered based on your preference of, like, data residency. And the second thing is this category of intelligence, or further automation. Right? And so this includes us, for example, parsing your query logs and understanding what are the common ways this data gets joined or filtered.
And that's because once you have established that, okay, I want to use this dataset, your question is, okay, well, I've got this bikes and scooters data, but I wanna link it to the region the bike ride was taken in. But the region mapping is, like, a lookup table that's, like, some other table, but I don't know where that table is. Right? But the thing is, like, everybody who's done that analysis in the past already knows where that table is because, presumably, they've done that. And instead of you documenting all the foreign keys and creating, like, these ER diagrams, like, all of this information is in your query logs. Right? So the things that Stemma does in this intelligence category are, like, parse your query logs and make suggestions on, like, what are the most common join and filter conditions based on what everybody else is using. There's some intelligence here around, like, linking Slack conversations with a special Slack bot. The list goes on, but the whole idea is, can we reduce further the need for you to curate information in the data catalog? And that's what I'm talking about here in the second category, called intelligence.
And the third category is, at the end of the day, your data catalog is changing behavior. Right? Your data engineer has to remember to look at the data catalog to understand and notify the downstream consumers instead of just spamming everybody on the Slack channel. Your data analyst has to have this habit of, like, looking at the data in the data catalog instead of just the shoulder-tapping techniques that they've been using. And there's a bunch of, like, organizational learning that I've had, that we at Stemma have had, that enables us to make sure, like, which personas to prioritize as primary users for the data catalog first. How do you actually embed this in their workflow?
Where do you integrate first? And how do you make sure that the data catalog gets adopted and is, like, a high CSAT product for all data users at the company? And that's the 3rd category of, like, organizational adoption and success learnings that comes with Stemma. Some of that is in the product, and some of that is, like, organizational work that we do with our customers to deploy in their organizations.
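The "notify the downstream consumers instead of spamming everybody on Slack" idea can be sketched with a toy lineage graph. The lineage structure, table names, and owner mapping below are hypothetical stand-ins for what a catalog's metadata API would provide:

```python
# Assumed metadata: table -> tables derived from it, and table -> owning team.
LINEAGE = {
    "raw.rides": ["core.daily_rides", "core.rider_ltv"],
    "core.daily_rides": ["dash.exec_summary"],
}
OWNERS = {
    "core.daily_rides": "analytics-team",
    "core.rider_ltv": "growth-team",
    "dash.exec_summary": "exec-reporting",
}

def downstream_owners(table, lineage=LINEAGE, owners=OWNERS):
    """Walk the lineage graph and collect owners of every downstream asset,
    so a schema change only pings the teams actually affected."""
    seen, stack, found = set(), [table], set()
    while stack:
        for child in lineage.get(stack.pop(), []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
                if child in owners:
                    found.add(owners[child])
    return sorted(found)

print(downstream_owners("raw.rides"))
```

A data engineer changing `raw.rides` would message exactly these teams rather than a company-wide channel.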
[00:27:09] Unknown:
Now that you have this platform in the form of Stemma to build on top of and to help organizations establish trust in their data workflows and the analyses that they're building on top of it, what are some of the additional opportunities for innovation and ways to streamline the communications and workflows between those data producers and data consumers and the business users within the organization who don't necessarily have the background as a data professional to be able to do the sort of deep critique of the information that they're interacting with?
[00:27:48] Unknown:
I put these personas in 4 categories. Like, there's the product engineers who are creating production data or, like, instrumenting events. Then there's the data engineers getting this data into the warehouse, building pipelines, and delivering derived data. The next 1 is data consumers of the derived data, so your analysts, data scientists. And then the 4th 1 is business users, and that term is rather vague. So when I say business users, I use the examples of, like, a finance analyst who's maybe savvy with Excel, but not quite as savvy with SQL, a marketing analyst, a customer service analyst, and at some companies it could be your CFO. Right? Examples are those people. And so depending on the kind of data we are talking about, the producers and consumers are different. So if you're talking about raw data, the producer is the product engineer, the consumer is the data engineer, sometimes an analyst, but mostly a data engineer.
If you're talking about derived data, the producer is the data engineer, the consumer is the data scientist, data analyst. If you're talking about insights or dashboards, then the producer is the analyst or data scientist, the consumer is a business user. Right? So to answer your question, like, there are these gaps that exist between each of these verticals, and the questions at a high level are similar: what's out there? What information do I have that I can trust? And what information do I not have, so I know I don't have it? Right? And there's this concept of data mesh that's been talked about quite a lot, which all aims to, like, bridge the gap and clarify the ownership between these producers and consumers.
So the opportunity here lies in having a crisp understanding of what are the assets in question between producers and consumers, and I'm talking within the organization, and which ones do I need to prioritize first in terms of gaps. And then having metadata and a shared understanding of that metadata be there, ideally something that is not manually curated. Right? Ideally automated as much as possible. Now I'll give you examples where automation is gonna fail. Right? So if you look at the producer and consumer at the far right end of the spectrum, these are your data analysts, data scientists producing metrics or dashboards or insights and your business user consuming them.
What does revenue mean at a company? What does attrition mean, or churn, at a company? Those are problems where you need to talk to a few people and align: somebody is considered churned if they don't show up for the last 30 days and won't show up in the next 15, right, recurring revenue includes these kinds of things and not those. So those things, for example, you can't automate. And I think what's important then is to hook into the time when the metric is being created, to hook into the data scientist's workflow. Right? And when they're, like, writing or instrumenting that metric, when they're creating that dashboard, you're asking all the right questions in that flow, and you're ingesting that metadata into a data catalog, which is then exposing that to the business users, who can quickly figure out, like, what are my key metrics for the company or my team? How are they changing?
And where are they represented in dashboards? Right? So those are some of the ways that I think about it. I'm happy to dig into any of these areas if that's interesting to you. Yeah. I think that it's definitely interesting to talk about some of the sort of social aspects of
[00:31:29] Unknown:
trust in data because a lot of it is in the sort of organizational sense and not so much in the technical qualities of the processing and delivery of the data. It's more in the sort of social and contextual understanding around what the purpose and meaning is of the data, and that's why I think it continues to be such an issue as far as being able to establish and maintain this trust.
[00:31:55] Unknown:
Yeah. Absolutely. I'll share another example of this around, like, the social issues that you were talking about. Right? Like, often there's a desire for let's say you create an event in the phone app, and this event it's maybe a protobuf schema or a Segment schema, and this event needs to be documented. Right? And what happens is, like, you don't often encourage this product engineer to document this stuff. So this event flows to your data warehouse. Your data engineer has to use it. They don't have any clue as to what this means. Right? Or when it's populated, who owns this, all that stuff.
And 1 way is to, like, actually understand where this is coming from, figure out the right owner within the organization, and get them this information. Right? Another way is, like, when they were actually creating that event, that product engineer has a lot of context in their head. Right? Sometimes they write a spec. Sometimes they're writing a protobuf definition, and they may write comments in the protobuf definition that are actually, like, comments on the fields. And I think there's a lot of sort of organizational stuff that gets in the way. But 1 of the things that we did at Lyft that was super successful is that we said, like, for certain tables, the descriptions and column descriptions would be read only, because the source of truth is the protobuf file where they were defined by the product engineer. And we built some tooling to, like, encourage and ensure the product engineers are putting that information in there. And I think it was 1 of the best decisions for getting that information. Right? So the learning we've had is that it's best to get this information in the flow of the users, and, also, it's best to show this information in the flow of the users. And so exposing the information in the data catalog into the various different tools that a data analyst, data scientist, data engineer would use is, like, super meaningful for a product in this category.
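The protobuf-comments-as-source-of-truth pattern might look roughly like this. The message, fields, and regex extraction are illustrative; in practice you would parse the file with the protobuf toolchain rather than with a regex:

```python
import re

# Hypothetical event definition, with field comments written by the
# product engineer at the moment they have the context in their head.
PROTO = """
message RideEvent {
  // Unique id assigned at ride creation.
  string ride_id = 1;
  // City region the ride started in; joins to the regions lookup.
  string region_id = 2;
}
"""

def field_comments(proto_text):
    """Map field name -> the // comment on the preceding line, so the
    catalog can ingest these as read-only column descriptions."""
    pattern = re.compile(r"//\s*(.+?)\n\s*\w+\s+(\w+)\s*=\s*\d+;")
    return {field: comment.strip() for comment, field in pattern.findall(proto_text)}

print(field_comments(PROTO))
```

The catalog then displays these descriptions as read-only, since the .proto file, not the catalog, is the source of truth.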
[00:33:52] Unknown:
And so for organizations that have built up a level of sophistication around how they're processing data, and they have the contextual information about the purpose and intent of the data, what are some of the areas where you still see friction come up in terms of the workflows around the data or the requests for analysis that continue to be challenging to fulfill despite the understanding that might already be established in the organization?
[00:35:05] Unknown:
So I see 2 constant sets of friction that come up that I don't think we have solved, and I have an opinion on them, so I'll share them. The first 1 is around analysis. So a PM may come to a data analyst or a data scientist and be like, I would like you to help me find, like, how this particular budget is being used. And the thing is, analyst time is very valuable, right, like all knowledge worker time. And the interesting question that doesn't get asked often is: what decision are you gonna make from this information, and what are the options for that decision? Right? Because if you are looking for something just as an FYI, in a fast growing company that may not be a good enough reason; the important reason would be to actually make a decision.
And having that level of clarity, like, oh, I'm looking for this metric. If this is more than x percent above the norm, then I would like to change the product in this way. And if it's less than that, then I'd like to keep it or change the product in this other way. Right? That's a very good answer. And so I think that level of conversation is something we're not having in the analytics world. Now on the data engineering side, which may be more relevant for this podcast, there's still a tendency to hire your way out of creating derived data models. Right? So it's like every area in the company will have a data engineer who's writing pipelines and maintaining pipelines.
And I find that that approach is not scalable. Right? You are much better off having your data engineers write pipelines and data for certain core parts of the company. So let's say, like, maybe 20% of the company, and they are doing that for maybe core financial metrics or the key company wide metrics that everybody's gonna use. Right? But then you are better off democratizing, like, all the various, you know, marketing data scientists, marketing data analysts to write derived models from them. Why? Because, a, they have all the domain context. Right? And, b, it's just a matter of, like, them picking up a few skills, or you building or deploying some technology or tools that enable them to do that. Right? If I were to back up and think about, like, the data engineering workflow, there are 2 parts to this in terms of what are the skills, like, what's the taste that's required in order to be a good data engineer. And I may be off here, so if anyone disagrees with this or thinks differently about this, please comment on the podcast.
1 skill is this taste of modeling. Right? What is the breadth of my table? What's the grain of my table? What's the depth of my table? So on and so forth. Right? Where do I split them apart? The second thing is, like, okay, I've done the modeling. Now I need to figure out how do I actually implement this in an efficient and effective way, so it runs on time and it's partitioned properly, so on and so forth. Right? I find that the first category of modeling is very much a taste thing. Like, this is something you need a human to do. I don't think this can be, like, automated by any means. Right? And I think this is a skill that's unique to data engineers, but I think there is opportunity to democratize this particular modeling skill to the larger technical data consumers, those mostly being data scientists and data analysts.
And data scientists and data analysts also have a lot of domain expertise. So a marketing data analyst may actually know a ton about marketing. And in some ways, if you just, like, help them understand some of the modeling best practices, they can craft a really, really good data model. Right? So that part can be democratized. The second part, around creating efficient jobs that are partitioned well and run well, can be solved with tooling. Right? Whereas historically, what we've tried to do is we've tried to push analysts and data scientists to deal with, like, all the configuration properties and what parameters for dynamic partitioning you need to use for what kind of jobs. All of that stuff is, like, bogus. Like, we need to stop doing that. We need to elevate the abstraction so that the data analysts and data scientists, once the model is figured out, can actually start writing their own pipelines. And dbt, for example, has gone a long way in moving that level of abstraction up, but I think there's more to do here. Right? And that is where I see this going.
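One way to picture "elevating the abstraction" is a declarative model spec that tooling compiles into a partitioned table definition, loosely in the spirit of dbt. The spec format and the generated SQL dialect here are invented for illustration, not any real tool's API:

```python
# Hypothetical sketch: an analyst declares a derived model, and the
# tooling handles partitioning and SQL generation for them.
MODEL = {
    "name": "daily_rides_by_region",
    "source": "core.rides",
    "dimensions": ["ride_date", "region_id"],
    "measures": {"ride_count": "COUNT(*)"},
    "partition_by": "ride_date",  # engine-specific details hidden by the tool
}

def compile_model(model):
    """Turn the declarative spec into a partitioned CREATE TABLE statement."""
    cols = ", ".join(model["dimensions"] +
                     [f"{expr} AS {name}" for name, expr in model["measures"].items()])
    return (f"CREATE TABLE {model['name']} "
            f"PARTITION BY ({model['partition_by']}) AS "
            f"SELECT {cols} FROM {model['source']} "
            f"GROUP BY {', '.join(model['dimensions'])}")

print(compile_model(MODEL))
```

The analyst supplies the domain knowledge (what to model); the partitioning and execution details live in the tooling, which is the division of labor Mark is arguing for.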
[00:39:34] Unknown:
Another interesting element of this overall conversation about trust in data and data catalogs and data discovery, particularly in the past 6 months, year, somewhere thereabouts, has been focused on the use of data for analytical purposes with the focus being on things like analytics engineers, end users of the business, interactions with business intelligence and dashboarding tools, and where the data is largely being sourced from a data warehouse that has, you know, a decent amount of structure to it. And I'm wondering what your experience has been as far as the applicability and opportunities for data catalogs and specifically to Amundsen and what you're doing at Stemma for more machine learning and AI oriented workflows and sourcing data from unstructured sources or data lakes?
[00:40:26] Unknown:
Yeah. I think you are right. These tools started out serving more analytical workloads, then moved towards more data engineering, like migration heavy or consumption heavy, like notification heavy, sorts of workloads. And what we are finding now is, like, we're beginning this journey of helping the ML users. Right? So, for example, 1 thing that's happened in Amundsen is, like, we are on version 2 of having feature discovery as a part of Amundsen. And the first version was done by this company called GetInData, and what they did was they took the data model for tables in Amundsen, and they extended that to include features and specific feature attributes.
But it was a little bit of, like, sort of shoehorning that concept into the table concept. And what's happening now is that Amundsen has, thanks to the team at Lyft, the second variant of this ML feature discovery. And so there, you could easily extend it. So the first version integrated with things like Feast, and the second version is, like, much more flexible and can integrate with a variety of different feature stores. And that's the way I see this evolving. Right? Any data catalog has to index and automate a bunch of different resources. What resources you choose depends on what persona you are catering to. So the initial resources were from, like, data warehouses and data lakes, structured SQL analytical data, and indexing, like, Tableau and Mode and Looker, that kind of thing. And now we're moving on to resources that are more like notebooks, and then features that cater to more of the ML users. I see that evolving as the product goes on, and there's a much larger sort of road map around what kind of resources we want to index at what time, based on which persona is the 1 that we are interacting with. The next up is, like, the business user. Right? The thing that's happening with business users is that there's this marketplace of insights between the data scientists and data analysts and the business users.
Data scientists, data analysts produce these insights. Business users consume them. But there's no, like, standard way of defining what a metric is. Right? And it usually requires, like, a bunch of organizational stuff: you tell the data scientist, like, okay, this is a metric. The data scientist instruments it and puts it in a dashboard. It shows up in a dashboard, and then maybe you build v2 of the dashboard and people are still using v1. Pretty bad. Right? So there's no standard for defining metrics, and there are 2 companies, start ups, that are trying to standardize on a platform for defining metrics and be able to serve that to the business users and the data scientists and data analysts.
And 1 of those companies is called Supergrain. I believe the website is supergrain.com. And the second 1 is called Transform Data, which is, I think, transformdata.io. And they're both trying to define, like, what is the standard way of defining metrics. So at the point where those become the standard way of defining metrics, a data catalog then becomes a read only view sharing the metric definitions that are defined there. And I think these are the kinds of innovations and problems that we need to solve in the future. In terms
[00:43:38] Unknown:
of the applications of Amundsen and Stemma now that you've launched it, what are some of the most interesting or innovative or unexpected ways that you've seen it
[00:43:47] Unknown:
used? Yeah. Some of the very interesting things that are happening here are using a data catalog to do classification and auditing of sensitive data in the warehouse. So Square is an example of a company that does this with open source Amundsen. They have taken Amundsen, whose original intent was to cater to data discovery and data trust for producers, data engineers, and consumers, data scientists, data analysts. And they've started tagging, using automation from Google DLP, similar to AWS Macie, what data is sensitive, and they are tagging columns with PII name, PII email. And then if the confidence is low, the data catalog becomes a place where they go approve or reject that. It is the data owner who does that. And once this is all there, then you are alerting, notifying when PII shows up in places where it's not supposed to show up. And that's, like, 1 of the most interesting use cases that's come up, and that's 1 of the things that I was alluding to earlier: the same metadata that's being used for the discovery and cataloguing use cases can power adjacent use cases like privacy and classification and CCPA, GDPR, which is exactly the direction that Square is headed. And in your experience
[00:45:09] Unknown:
of building the business of Stemma and building on top of the Amundsen project and continuing to be engaged with that community, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:45:22] Unknown:
1 thing that's been interesting for me is that, coming in to start the business, I thought that everything was on the cloud, and I guess it is still true that everything is on the cloud. What I didn't realize was that there would be a lot more conversations about what's on my cloud and what's on your cloud. That was something that was unexpected to me, and we've learned and we've changed the product and the deployment models based on that. But it was a really great learning. The second thing, from a product perspective, I would say, is that there's still a lot to be done here. Like, there are 4 personas plus the organizational work. So there's product engineers, data engineers, data scientists and analysts, business users.
And each of these personas has distinct needs, so there's a lot of thought on, like, which of these personas we're gonna cater to. By the way, Stemma caters first and foremost to the middle ones, data scientists, data analysts, and data engineers. And so, like, that doesn't mean we don't have products for the others, but those 2 remain our first focus. Right? And so there's a lot of work that needs to be done: even if you pick just these 2 personas, you're like, man, I can go deep and still, like, have a ton of work to do in 5 years. Right? So there's a lot of work to do for each of these 4 personas, as well as for the organization that wants to, like, make sure that it tracks where sensitive data is in the company, that it's locked down, that there's auditing and alerting when it shows up in places it doesn't belong.
And that is, like, another place we need to keep investing.
[00:46:53] Unknown:
For people who are experiencing issues with data trust and they're investigating the offerings for different data catalogs and data discovery platforms, what are the cases where Stemma is the wrong choice?
[00:47:05] Unknown:
I think the place where Stemma is the wrong choice for these people is if you want a very, like, high control data catalog. Right? So imagine a world in which you want everything to be curated, and more importantly, you want only the owner to update the descriptions, and you wanna get a notification, or an approval request, when someone wants to update a description. In a world where you want to tightly control and manage the entirety of your data, Stemma is the wrong tool for you. Because Stemma is built from the perspective where you don't have, like, an army of data stewards. You want to democratize data within the company, and your company has, like, these pockets of excellence in certain domains, and you want to use those pockets, mostly automation with curation on the top 20%, to enable them to be better at making data driven decisions. Right?
And so if you want a high control environment for your entire data, with workflows for approval and rejection on simple things like description updates, then Stemma isn't the right product for you.

And so as you continue to build out Stemma and interact with your customers and engage with the Amundsen community, what are some of the things that you have planned for the near to medium term, and any projects that you're especially excited for?

Yeah. So I talked about the 4 personas in the organization a moment ago. Talking about those 2 personas, right, specifically for data engineers, I'll dig into just that. The full answer to this question is a little long for this podcast. But for the data engineers, I think there's, like, a lot of work we are doing and continue to do in making their job easier, both in terms of trusting this new product data that I'm gonna use in order to derive pipelines, but more interestingly and recently, like, migrations, and how can I help derisk and speed up migrations by looking at data and how it's being used, by understanding and figuring out which groups of data need to move together, by understanding what are the sources this data is coming from, because maybe they need to be migrated first before I move this? And so, like, that's what we are planning in the future for Stemma for data engineers.
More broadly, each of these 4 personas, like I said before, has different needs around, like, understanding what assets they usually use and whether they are trustworthy. And our goal is, for each of them, to get to a point where we are doing 80% automation and then 20% curation for the most heavily used data assets. Right? And those are the areas we're investing in. The second thing we wanna do is enable the organization to know very easily where their sensitive data is and then be able to classify, understand, and derisk that information.
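The classify-then-review flow for sensitive data described here, and in the earlier Square example, can be sketched as a simple confidence triage. The scan results, tag names, and threshold are hypothetical; a real setup would pull these from a DLP-style scanner:

```python
# Hypothetical scanner output: an automated classifier (standing in for a
# DLP service) has tagged columns as possible PII, with a confidence score.
SCAN_RESULTS = [
    {"column": "users.email", "tag": "pii_email", "confidence": 0.98},
    {"column": "users.note", "tag": "pii_name", "confidence": 0.41},
]
THRESHOLD = 0.8  # illustrative cutoff between auto-tag and human review

def triage(results, threshold=THRESHOLD):
    """Split scan results: high confidence is tagged automatically,
    low confidence goes to the data owner's approve/reject queue."""
    auto, review = [], []
    for r in results:
        (auto if r["confidence"] >= threshold else review).append(r)
    return auto, review

auto_tagged, needs_review = triage(SCAN_RESULTS)
print([r["column"] for r in auto_tagged])
print([r["column"] for r in needs_review])
```

Once tags are in place, the same metadata can drive alerting when PII shows up in tables where it shouldn't, which is the adjacent use case Mark describes.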
[00:50:05] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I would just like to briefly get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:50:22] Unknown:
The change of who's writing pipelines in the company. Historically, data engineers have been the only people who can write the pipelines, and now, more and more, that skill is getting democratized, and analysts and data scientists can write pipelines. I think the part that I still see missing is the data modeling part. So back to the 2 skills that I think are unique to a data engineer: the data modeling is a very taste driven exercise, and data analysts and data scientists have huge domain expertise. And I think to the extent we can democratize this data modeling skill, data engineers can focus on higher impact, higher value problems only for that 20% of data within the company, like the core financial metrics or company wide metrics, or more platform infrastructure things to bring data into the warehouse and out of the warehouse, and enable,
[00:51:17] Unknown:
like, more leverage for a data engineering role if we are able to democratize that. So that's, like, 1 trend I'm seeing, and there's 1 sort of gap that I see related to that that I wanted to share with you before we end. Alright. Well, thank you very much for taking the time today to join me and share the work that you've been doing with Amundsen and the business that you're building with Stemma. It's definitely a very interesting and important problem domain. So I appreciate all the time and energy you've put into it, and I hope you enjoy the rest of your day. Thank you, Tobias. Had a really good time. Thank you for having me. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction to Mark Grover and Stemma
Challenges at Lyft and the Birth of Amundsen
Stemma’s Role in Data Workflows
Data Discovery and Catalog Landscape
Trust in Data: Persistent Challenges
Opportunities for Innovation in Data Workflows
Social Aspects of Data Trust
Data Catalogs for Machine Learning and AI
Unexpected Uses and Lessons Learned
Future Plans and Exciting Projects