Solving Data Discovery At Lyft

Welcome to the Data Engineering podcast, the show about modern data management.

When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends at Linode.

With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof And for your machine learning workloads, they just announced dedicated

CPU.

And for your machine learning workloads, they just announced dedicated CPU instances.

So go to data engineering podcast.com/linode,

that's

linode,

today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show.

And finding the data that you need is tricky, but Amundson will help you solve that problem.

And as your data grows in volume and

foundational

principles that you can follow to keep data workflows streamlined.

Mode, the advanced analytics platform that Lyft trusts, has compiled 3 reasons to rethink data discovery.

You can read them at mode.com/lift.

That's mode.com

/lyft.

And you listen to this show to learn and stay up to date with what's happening in databases,

streaming platforms, big data, and everything else you need to know about modern data management.

For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season.

We have partnered with organizations such as O'Reilly Media, Dataversity,

the Open Data Science Conference, and Corinium Intelligence.

Upcoming events include the O'Reilly AI Conference, the Strata Data Conference, and the combined events of the Data Architecture Summit in Graf Forum.

Go to dataengineeringpodcast.com

slash conferences to learn more about these and other events and take advantage of our partner discounts when you register.

And go to dataengineeringpodcast

dotcom to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.

Your host is Tobias Macy. And today, I'm interviewing Mark Grover and Tao Feng about Amundsen, the data discovery platform and metadata engine that powers self-service data access access at

Lyft. So, Mark, can you start by introducing yourself? For sure. Thank you for having us, Tobias. I'm Mark. I'm a product manager at Lyft. I have previously worked

at Cloudera as an engineer working on big data systems, having,

contributed and a committer to many open source projects. At Lyft, I focus on,

data products,

and including products for data discovery and data trust, which is the core of the conversation today. And do you remember how you first got involved in the area of data management?

Yeah. It was by chance. It was,

I've been working in the big data space for a while and I came to Lyft and

wasn't this general space of data products, which is like what products can you build to make data users that look more effective and productive.

And during the course of user interviews and the speed at which the business was growing, that's how I really got into the space of data discovery and data quality. And, Tao, can you introduce yourself as well? Sure. Thanks for having me. So my name is Tao. I'm an engineer at the data platform team. So I primarily working on Amazon projects, also on Apache Airflow, which is a workflow managers,

system.

So before my time at Lyft, I worked at Lingyin

on infrastructure performance and data related projects. And do you remember how you got involved in the area of data management?

Mostly,

my my time to get involved in data management is that we see a pain point in the data discovery and data

managements, and we see a need to build something for the user

to solve this pain point and and lift. And so

the Amundsen project has recently been released as an open source project, and you both have presented on it at a couple of different venues. And so I'm wondering if you can just start by explaining a bit about what it is and the problems that it was designed to address at Lyft. Yeah. For sure. Let's start with the problem. I think what was happening at Lyft

about a few years ago was that the traffic on the Lyft app was increasing exponentially.

That led to an increase in the amount of data that the organization had to store as well as to process. So that was the first problem of scale, the scale and the modern data. The second problem of scale was the number of employees and people who were using data day to day to make decisions to drive the business forward was also

increasing exponentially. So there was a point in time where a lift doubled in size every year. And that those 2 combinations

of scale

led to a problem where people who have been around for a long time contains tribal knowledge in their head around various different data systems, data sources in the company.

Newer employees who were still very smart, effective individuals

weren't able to

have that context to do their jobs well. Right? So that was 1 problem.

So then people started solving this problem through

the various ways of figuring out answers to questions people have. The most hated way is to use Slack. So productivity of people went down because people are using Slack to ask these questions like, oh, I just joined. I want to optimize ETAs for Lyft. Where do I find the source of truth for ETA? Right?

And so someone will tell you something. Hopefully, a bunch of people will tell you the same answer, but many times different people will tell you different answers. And then you had to figure out the much harder question is, is this data trustworthy?

Right? And once you have figured out this data is trustworthy, then you had to figure out, okay, what's the right model for this table? Like, is it ETA at what time? Right? Because you get ETA is shown on your app multiple different times. You get shown before you request the ride, after you request the ride, and so on and so forth. So you had to build this model of the data and figure out, do I join with something else? What keys do I use to join? And there was this whole problem, first of discovery and trust, and second of understanding. And that really drove the productivity of our data users

down, and that was the problem we're trying to solve. We wanted to really have an experience,

the equivalent of what Google is on the web where you see relevant results really well on the top and you have a very quick swift experience for search and discovery

as you go. And we wanted to have that same experience

at Lyft, and that led to a product called Avanxin.

So if I may, I'll go into

describing in just a very high level what that form factor is now, and then we can talk a little more about it later. So the form factor is you go in and there's just a search box where you can search for any data asset at Lyft and you search,

say you search for ETA as an example I was sharing earlier, you get a list of different data assets. Now these assets can be tables. So these are tables or views in various different databases. And soon they will be dashboards as well. So you can see work that's been done in dashboards and other analysis or notebooks,

in the system. But later on in the road map, you will find it in various different other ways like streaming applications or Kafka topics and so on and so forth. But focusing on the table experience,

you click on the table,

you, see all the information around this table. So the schema, the, the name of the table, the descriptions, and the columns, and the profile of the columns, so the min and the maxes and so on and so forth. The frequent users of this data,

and then you see a preview of the data as well as a skeleton query on this data to kinda get you started. And the key point we wanted to keep in mind is that we wanted all as much as possible to be automated metadata. Right? So no 1 is going and saying this is the right table to use. And we we put a lot of thought into what kind of metadata and what kind of opinions do we need to form on top of this metadata

in order for building that model of trust and serving that model of trust right from the very get go. And 1 of the things that I'm curious about coming out of that description there is some of the

internal vetting or quality control that goes into determining

how a dataset ends up being listed in the Amazon catalog and then any sort of ranking information or user feedback for being able to determine what the correct placement is based on any type of query that you might be issuing to find a given dataset? Yeah. Fantastic question. Right? And I think we are on this this scale. Right? I consider

curation where a BI engineer or a data engineer is curating every single

data model that's coming out

and making sure it's high quality and it remains high quality all the time. To a place of complete chaos where you have so many, so much data. It's growing in a more democratic fashions. Not everyone is aware of

what all the places to go to are and so on and so forth. And I would say most companies are somewhere in that spectrum. Right? Hopefully, you're not on the chaos side, and I don't think the, growing company can be on the on the curation side. So the examples that you chose, Tobias, for example, about the ranking,

what Amazon does,

is that it uses

2 axes for ranking. 1 is popularity and the other 1 is relevance. So when you type ETA,

it matches ETA with a bunch of different fields within the table. The name of the table, the description of the table, the column names, and the column descriptions,

and the tags to surface all the the things that match with the term ETA. But then it also

ranks them based on popularity.

Popularity being the amount of SQL querying that happens on that table. So tables that get queried more show up higher in the search results

than table to get queried less. And we also attach different weights to automatic queries like an ETL job versus an ad hoc query like human querying a table. And that's how we started to build this this intelligence around

what is a good proxy for trust. And in this example, we just quoted, it is the popularity or the amount of SQL queries that get written against the table as, as a measure of trust for that. And you said that before Amundsen

came into

production use at Lyft, a lot of the way that data discovery happened was just by word-of-mouth

or knowing who to ask.

And I'm wondering,

at the time that you first started building Amundsen, what the landscape looked like as far as products or systems that were available for solving this data cataloging discovery problem,

and what was lacking in those existing options that led you down the road of building something from scratch?

Yeah. For sure. So at Lyft, there were a few attempts to solving this problem. There was a Wiki page which had a a page per table and somebody kept it up to date, but that was very hard to enforce and make sure it was up to date. The most common way was twofold. So for discovery, people would just ask on Slack, and it really depended on

how mature your team had been formed, if the person who was leading that team or worked on the data for that team is still around. So if you're a new scientist, you will go ask other scientists in the team. Sometimes you gotta figure out how this is instrumented, so you will go to the engineering team. There's just a lot of time being wasted

in figuring that stuff out and the tools people used were 1 of 3. They used this outdated Wiki page. They used Slack to ask these questions. And the last thing is they would go to various different sources of metadata themselves. So they'll go read the ETL code themselves or look at the airflow DAX and try to make the connection between the tables and the corresponding airflow tags. And all that was super painful. Like, what we learned is that people were spending 30 to 35%

of their,

upwards of that of their entire workflow. And, you know, these people are hired to write ML models or write SQL queries.

They spend all that time actually just finding out what is the right thing for me to use, what is the right thing for me to join.

And in terms of the overall data ecosystem that Amundsen is fitting into,

what are some of the source systems and downstream systems that you are integrating with?

And I'm also curious how it compares to some of the other work that's being done in parallel.

Most notably that comes to my mind is the Marquez project from WeWork. Yeah. So I'll I'll help answer the first part of that question, and Tao,

has a lot more thoughts on the on the latter. In terms of the integration,

Lyft is a very heavy s 3,

user, and this s 3 data is then stored in the Hive Metastore.

So we use Hive and Presto to access this data. So the the

topmost integration for us was with Hive and Presto and essentially anything that used Hive, Metastore, and that could be Spark and so on. That that is definitely the case I left. So that was the first integration we built.

We also, as a part of the second integration, we can talk about more about this later, we built integration of people. So, essentially, we have a graph of data

and have nodes

and edges in this graph. And then we have people added to this graph very recently. And so we integrate with our HR system to get that information

in this,

table in this graph as well.

And, in terms of what has happened more recently is after we've open sourced it, a lot of community members have started to use it and have contributed to various different, data sources. So now we have integration with Postgres, which brings an integration with things like Redshift because it follows the same,

same model. And then we also have integration with, BigQuery from Google. And both of these are thanks to the amazing community members that we have. Yeah. I could take the questions

to, say, compare Amazon with with your work,

on on its project, like, markets. Before going to that,

let me do a brief introduction on the overall architecture of Amundsen and then see the comparisons.

So Amundsen is more focusing on the data discovery, perspective. So it has 3 microservices,

like, 1 front end services for user to data dis, do the data discovery,

and you power,

back by 2 other back end services. 1 is called metadata services. The other is search services. Metadata services is

very, modularized and plugging a book, so you could talk to any,

persistent layer. By default, we ship with Neo 4j as a persistent layer, and the community contribute,

Apache Atlas proxy to be the persistent layer for this metadata.

And the search service is the

search services, which power the search query,

run by the front end services.

To compare with Marcus from WeWork, So my understanding of Marcus project is, Marcus is a metadata project focusing on, metadata,

data governance, and it's more

line in the scope, like, similar to Apache Atlas. So

in this case, if we had could have a proxy layer to talk to Marques, it's possible to use Amazon with the Marques as a back end engines.

And

you mentioned the overall architecture of how Amazon is currently implemented, and I'm wondering how it has evolved since you began working on it and some of the primary assumptions that you had going into the project that have been challenged or updated in the process of building and using it. To talk about, like, how Amazon architecture has, evolved,

we started this project, like, last last year around April or May.

And

Amazon, initially, we have done a lot of architecture discussion, and we've come down into a stream microservice architectures

like the front end services, the metadata service, and search services. And we decided to use a poor approach

to get the metadata.

We for that, we do a generic data ingestion library called DataBuilder,

which, could,

we implement the interface and which we could talk to a different kind of metadata source. It could talk to a BigQuery, Postgres,

or HIVE. And we start with HIVE because HIVE is the most used data system in Lyft.

And since then, the overall architecture has stayed the same, but the implementation changed quite a bit. For example, initially, we want to keep in sync

between Amazon,

the metadata stored in Amazon with the upstream source. What it means is that we, we pull the metadata from high meta store, for example, about the table name, column name, and column description, and then we persist in Neo4j graph as and expose

by metadata and used by the front end.

And when our user try to modify the description, for example,

you will not only persist in our neo4j graph, we will also persist,

back to high metastore.

We found

a a cup, this kind of coupling has a lot of limitations.

For example, it doesn't work with full rebuild table because, like,

if user,

try to do a full rebuild table again,

all this modified discussion will be be lost because I,

the original discussion, it persists in some GitHub file.

And second change we did is, like, initially, we start with a pool model to get the metadata.

Now we're involved into a pool and push,

mixed model.

For example, like, to pool model is great to get started to get metadata.

But once we currently reach a scale, let's say, a lot of team, a lot of organization we live want to push metadata

into Amazon. It's hard to build a different

index of crawler job for everyone. So we start to leverage Kafka to build a put metadata pushing model

to

for for these purposes.

And

for being able to

populate the metadata story, you said that you're using this Kafka engine. I'm wondering if you can just talk through the overall life cycle of maybe a new table being created

or a new dataset being published, and then how that would flow through Amundsen

to be discoverable,

and then the workflow on the other side of somebody using Amundsen to be able to find it. And then

any assistance that Amundson provides in terms of being able to specify the connection information for somebody to be able to then just start querying that information?

So,

the the workflow at Lyft is typically is that wherever analysis, for example, they try to create dataset or data engineer, they try to create dataset.

They will start with certain prototyping on their personal schema, write their own SQL using certain BI tool like Mo,

Tableau, or Superset to get some, get the SQL query

running, get the expected data format more data model

going. Once they satisfy the the data model, they will create Afro normally, they will create

a AfroTec,

which our ETL workflow management system use a lift

to populating this, table in a daily or

cron job,

cronetype or batch job fashions.

Once this table has been populated,

this table is created under certain schema, for example,

it could be work, core schema or,

important schema

use

lib. Amazon has another

index job called,

data builder, which is run inside a echo deck, which pull this metadata from High Meta Store in,

twice per days fashions.

For example, once this table has been created and show us a record inside Highmeta Store,

once we our data builder is running,

it will prove this record from Highmeta Store and index

persist this information into Neo 4j. So Neo 4j is a graph database.

It will create certain graph node. For example, you have, for table name, then you create a table node. For column, you create column that builds certain relationship,

so on and so forth. And once this table has been created, you will view a search index to

make sure this table could be,

searchable

from from the front end service. Once this index job has been finished,

meaning, like, this table is,

available for user for Amazon user to consume.

When user go to Amazon UI, they search their table based on their relevancy and popularity.

It will show up in the search result.

And now that Amundsen is in general use at Lyft, I'm curious what types of feedback you have received from your teammates and people in other data oriented teams as far as how it has impacted their overall workflow and productivity

versus what the state of affairs was before it became,

generally available and accessible? Yeah. For sure. So we've noticed and, this is to qualitative surveys that,

the productivity of the analysis workflow has increased by about a third,

primarily because we've reduced that large chunk of time that used to be spent on data discovery

down to almost nothing compared to what it was before.

To back it up with some data, and this actually ties into your previous question around,

assumptions that have been challenged. We built this tool mainly for the data scientists and,

how at a company

lift size at the time, there were about 200 or so people in the science team,

and that's what we were targeting for, like, really savvy,

heavy data users who would wanna use tables and views as a first pass.

What we have learned is Amazon now has weekly active users

of over 700 people. Right? So these are people who

have, because of us democratizing data,

wanted to use more and more data. And that's been a huge assumption

challenge for us in a good way. And we are seeing that

that adoption number as a good proxy for people loving this tool. We also measure CSAT both through within the tool as well as through a quarterly survey, and that CSAT

has been high over 9 out of 10. People have just been loving this tool, and that's really,

shown in their increased productivity as well. As I mentioned at the beginning, you recently open sourced the Amundsen project

and have made it publicly available for other people to be able to run their own instances.

And I'm wondering what the motivation was for releasing that publicly, and how much effort was involved in cleaning up the code base to make it accessible to the public without having too many internal

assumptions

about how it was going to be deployed and the systems that it was going to be interacting with? Yeah. For sure.

Actually, this ties back to how we made the decision to build Amazon from scratch, and I think that's a good topic to cover as well. When we were looking, we were really looking for an experience where you can discover and trust data really quickly.

And we looked at vendor products, things that had been built in the past to and that were available off market. So examples were Alation,

Cloudera Navigator and so on. We looked at closed source tools that companies had built,

for solving similar problems. In that category, it was Airbnb's data portal.

Facebook had a tool called IDA.

We looked at open source tools in the same space as well. So that was Apache Atlas

as well as, Marquez pretty early on. Actually, Marquez didn't exist back then. It it only started much later. But Apache Atlas as well as LinkedIn's warehouse. So we looked at all these tools,

and we were looking for that experience where you could discover and have very little curation

human curation of data required

and figure out really quickly and nimbly what it's that you're gonna trust

and then have a vision for including people in the graph as well as dashboards and and Kafka topics

and schemas and so on. And we found that that experience was there was much to be desired in that experience and in all those options. And that's how we decided to build something.

We also

have learned and gotten a lot from open source, both in our careers as well as in our organizations.

And we wanted this to be an open source standard for doing data discovery. So we knew from very beginning, we wanted to build Amazon not just for Lyft, but for

everybody else who's solving the same problem. So we wanted to do do it in open source. And that's why when Tao talks about microservice architecture and the repositories,

we do all our development in the open. We don't have a fork within Lyft and outside of Lyft that we try to cap. We have an overlay of repository that we do just for our custom configuration.

The important point being

that we really reduced the amount of cleanup that was required later on when we wanted to go open source with it. It. And do you have any sense of the level of coverage that you have for the overall data that's available within Lyft and what's represented within Amundsen,

and I'm wondering what

the process is for being able to find any other remaining datasets that aren't present, or if there have been any issues as far as the data,

not being able to be easily represented or accessible to Amundsen for cataloging

and,

surfacing for other people to discover?

Yeah. So broadly speaking, we can define the data out of Lyft in 2 categories. 1 is the online stores, which power the Lyft apps, that you see. These are powered by databases like Dynamo and Mongo and so on.

Then there is the offline store,

the world that,

we all hang out more in. So these are your analytical systems and off those we have Hive and Presto.

We have an old Redshift cluster.

We have Druid. We have,

Bitcoin installation.

And so those are the systems that,

are the in the offline store. So,

Amundsen has historically focused more on the offline world.

So it doesn't cover

databases, the NoSQL style databases in the online world.

In the offline world, we have built integrations as, as we were talking about earlier. We now have integrations for Redshift, Hive, and Presto, the s 3 world as well as

for BigQuery.

And we have indexed all those data.

We are sometimes selective if we know some schemas,

are just temporary schemas or personal schemas. We are very selective about them, and we never bring them to Amazon.

But in general, we have a pretty large footprint. I I don't know if I have a percentage number for you to quote, but that's that's the way we think about the coverage of him. So, Taro, do you wanna add anything? Sure.

Yeah. So, as as Mark mentioned, is, like, Lyft has a lot of heterogeneous,

data sources like BigQuery, Druid,

Postgres,

Hive, or Redshift. So we try to build, the our our vision is I try to build a comprehensive data map for all the data sources, also, like, bring relevance to the user. Like, for us, we try to index all the relevant,

tables

for users.

So also for compliance purposes, like, we currently we already index all the managed schema within Hive

so that you could be,

not only used by the user, also using by compliance auditing purposes.

And that brings up another interesting question as far as how you determine whether or not a given dataset should be surfaced to somebody based on compliance or regulatory

reasons or just read what the general access control is for that dataset.

Because if somebody is searching for something and their role

is not going to grant them access to it, but then they see it listed in Amundsen. I'm wondering what the just overall process is for being able to integrate and surface that information at the appropriate time.

That's a great question, and we have an opinion about that. We feel that

discovery of datasets

at a high level should be available to all.

Accessing some higher level of high granularity

metadata around datasets. So this would be example of

distinct values or seeing a preview of the data or accessing the data, of course, depends on the dataset and should be limited to some, those who only have access. So the purview the approach that Amundsen has taken is that it will get metadata and allow you to search for whatever you wanna search for regardless of whether that's privileged or not. And then once you get to a table that's privileged that you don't have access, it integrates with it delegates that that access control to another component,

superset for us. And only then are you able to see based on your access controls

the preview of the data or be able to query that data. And the nice benefit we've seen from this is historically what used to happen is people

would want to see if this is the table they should ask for access for,

but they can't know that until they request access. So their request access find that that's not the right table for them to use, and then they start discovering more tables and then getting access and trying to figure out if now this new table is the right table for them to use. So that became a chicken and egg problem. And we try to solve that with Amundsen

by saying discovery. You can figure out that there is a table out there which is described this way. But if you do want to go further and actually start using this table, this is now when you start requesting access.

And what have been some of the other types

of feature requests, either internally or externally,

that you

been most intrigued by and some of the ones that you have consciously decided not to implement and that are out of scope for Amundsen?

Yeah. So the core feature set was around discovery

and trust of data, and that has gone really well both at Lyft and the open source community.

We see a few feature requests in that application

and those are around additions to the graph. So as I was talking earlier,

we've added

that node. Previously, there were nodes for tables. We've added nodes for people. And that was a very big request within Lyft because when you join a team if I join Tau's team, I gotta look up Tau and I can see what tables does Tau access every day,

which ones has the bookmark,

and which ones does he frequently use. Right? And so that information is all again populated automatically,

and it helps me get up to speed really fast. So that was a pretty popularly requested feature, app lift. Few features have been requested in the open source community and are also relevant to app lift are lineage. So you can see various kinds of use cases come through,

for this. An example would be you can figure out if 2 tables are exactly the same thing or if you're gonna change 1 table, which tables is gonna impact downstream. So a lot of data engineering use cases showing up. Then the next 1 in the line is dashboards. So

you can

currently search for tables and people, but people wanna see what previous analysis has been done. So maybe because they don't, they can learn from that work or maybe they don't even have to do the work because someone else has already done that. And that's the next feature that's currently in scoping,

phase right now and we wanna add. Well, I wanna stress that that's just 1 application, that application of data discovery and trust. And our vision for this project is that we've actually gathered a whole bunch of metadata, which has become holy grail for a lot of applications.

And Tau was referring to another application we are building on Amazon, which is compliance.

So what we we've tried to build the Amazon application, gathered a bunch of metadata, and now we are seeing that we can use this for a lot of other interesting use cases. So we're built we we wanna build a compliance application on top of it, and then later on, build, like, data quality and ETL style applications on top of it. So we actually improve the quality of the data, not just help with discovery or compliance.

For somebody who is interested in getting Amundsen running on their own infrastructure, I'm wondering what the overall process looks like of getting it deployed and integrated into their systems to start being able to gain value from it.

Sure. So

first of all, Amundsen, we want to be easily accessible or, for everyone. So when we open source, we we create a lot of document about how to install,

Amundsen and using our quick start

guideline. Like and we provide, like, sample sample script for user to,

ingest certain dummy data

and ingest and persist into the Neo4j graph showing into the front end to keep people feeling. Once user once user is,

understand, like, this whole system, like, have 3 microservices

and a data builder ingestion library, they quickly change this, setup. Like, for example, the sample loader script to using,

based on the metadata you

based on that data environment, let's say, if they're using Redshift mostly, they could change to use a Redshift extractor to get the metadata for Redshift

and persist into the,

Neo4j graph. And for Quickstart, we we provide, like, a doc, docker,

compose, which allow you to boost using Docker container,

but you could easily

kind of deploy and install using others like AWS ECS or

directly deploy in native

EC 2 instance, that's also doable.

And what have you found to be some of the most challenging aspects of being able to build and maintain Amundsen

and support it as you integrate it into more systems and start feeding more data and workflows through it? Sure.

So,

when building

Amazon,

so there's a lot of, like, challenging,

design discussions. So first of all, as Mark mentioned, we build or design Amazon,

with open source in mind from the day ones. So we are thinking, like, how to make every interface to be generic,

which work with Lyft,

also could work with other system. For example, like, the data builder library just talked about. So it has 4 phases, a extractor, transformer,

loader, and publishers.

So extractor mostly is is used. I extract the metadata from different source.

And we build, like, high metadata

extractor,

But that interface allow user or community, for example, later on, contribute a big query extractor as well post web spreadsheet extractor. And secondly,

is that

we we think very hard to say

how to make this project work with lift internal,

infrastructure

as well as good work in externally.

So,

when we start this project, we for every service, for all these 3 microservice, we have 1 repo mostly for open source,

we have another repo which

ingest or,

include all the the configs for this open source repo as well as those department code. So that once we open source, it's easily

to kind of split the leaf specific section only in still

live inside the prime repo

while we continue to open source,

make the proper repo open source. And another challenge we, we have seen is, like, when we design Amazon is how to make the search result more balanced. For example, you say there are there could be many different tables,

like, for for example, ETA table or RISE table.

Different forms could have different,

could have different copy. Right? How to make this search result more relevant to to user? We we think very hard to to see how to keep the result relevant. When we, for example, initially, when we design that search algorithm,

we only take into account, say,

we only take into account, say, like, search based on,

column name, table name, table name, and tag,

so on and so forth.

And we we found that, actually,

when user try this table hasn't been used a lot. It doesn't have much usage. When they type the raw table name to search,

it didn't show up in the search result in the first few pages because, like, we initially ranked all the result based on we initially ranked all the result based on,

their usage.

Then we improve the search

ranking, saying if the user try to search with the actual

real table name,

the search ranking will be pulled up so that, like, even this table has been seldomly used, it will still show up in the first page so that user could, find out itself, like, the later page in the initial versions. Another, like, challenge you where we build Amazon is, like, because we use Neo4j graph database

to, as a persistent layer, How do we design a data model

to,

to fit into this data discovery? For example, like, we we have, like, table node. As Mark mentioned before, we have table node. We have a column node. Later on, we add Amazon people, which we have a user node.

How do we design, let's say,

make column node as a separate node instead of putting

inside as an attribute for this table node? We made a lot of this kind of

design decision.

Now we are,

the reason we have different separate node in this case is because we want to make the graph traversal much faster.

For example, we we wanna say once we,

go to a table node, we wanna figure out what are the column nodes is just 1 hop to see all the columns.

And we if we want to say, well, which user is, using this table, we go to also 1 hop to base on the graph relationship to find out all the user. And what have been some of the most interesting or unexpected lessons that you've learned in the process of building and using Amundsen and seeing others use it and some of the

unexpected

ways that users have put it to use? 1 of the goals for Amazon was to take this tribal knowledge off from people's minds and put that in a centralized place. And we thought just because we removed the friction from adding this information

that people will start doing it and all of a sudden overnight,

we would have all these comments and rich metadata in the tool. Turns out,

that was overly optimistic of us, that

just because there is no friction,

doesn't mean that people are gonna do that. In fact, like, there's a whole bunch of sort of cultural and social cues that need to be used in order to either get ownership of the data or just encourage people to put the descriptions in there,

which is 1 of the very few pieces that are human curated.

So that's been something that's been an insight for us. And from a product perspective, we're looking at what kind of incentives or social

behavior we want to

inculcate in the tool,

so people are actually putting in,

in descriptions and so on in the table. The second thing we we saw is that the use for compliance in other applications

was,

a surprise to us. We

totally start wanted to solve the data discovery and trust problem. It was only along the way that we figured out,

with the heat of CCPA here in California, which is GDPR equivalent,

here in California.

And that having this comprehensive data map that we could start using this for governance and compliance, and that was a really

pleasant surprise for us. And that's something that we've we've been using.

Members of the open source community have also been using. For example, Square is a member of the Amazon community, and they're using Amazon for compliance.

The rest of the other folks, there are about a 150 people in the open source community with companies the

like of ING,

Square,

WeWork.

I'm sure for forgetting a whole bunch of what they Workday. Yes.

And all most of the companies are using it for data discovery and trust while Square, for example, is using compliance and that's been really good. Another thing that I was thinking of as you were discussing

the

use cases

is

the

overall process of user experience design that has gone into how you build and use Amundsen and what the interface looks like for people to be able to find it approachable because you can build a fabulous tool that does everything that you want. But if it's got a terrible interface, then nobody's going to actually take use of it. And so I'm just curious what the feedback has been on that front

and any modifications or updates that you had to make after you first launched it. Yeah. That's a good question. I

I I think that's actually a really good cue to discuss something broader as well where a lot of these data tools are 2 ways. 1, they don't pay enough emphasis

to that experience. So many of them end up having a clunky UI or really terrible experience.

And 2 and and they are not opinionated enough sometimes in how they want to,

structure things or represent that experience or what the experience should be. And that's something we are lucky,

in the team at Lyft that we got help with from the very beginning. So we had designers,

work on it,

who,

worked on it as a full time job, and they they did a bunch of user interviews to actually make sure their design was in line with what users were expecting. And then we are also very opinionated about,

search ranking, for example, that the idea that we will use

querying

activity on a table as

proxy for trust is important. And what weights would we choose for you to use,

ad hoc query versus,

an ETL query.

That is something I think we don't do enough in the data space where we need to build opinionated tools,

so folks can have an experience that's,

easier to maneuver and get gets them more productive.

That's 1. In terms of answering your more direct question around

surprises, I think while the main design has remained same,

there are a few smaller things that we have changed along the way. For example,

when you click on a column name in Amazon, it shows you profile of the column. So this is mean and standard deviation of that column from a recent partition.

It shows mins and max for integer columns or where it applies

for,

for initial, like, averages, where it applies, initial string length. And if it has less than a certain number of distinct values and you have the access to it, it'll

show you distinct values.

However, it didn't show what date that,

metadata was based off of, and that was something that users wanted to see. So we actually ended up adding like a small tagline underneath. This was last calculated on this date. So small changes like that. In terms of big changes that are evolving in the experience,

we're finding that,

on currently, we have a lot of data around schema and then a lot of data around behavior of the table. So we have frequent users and owners and which airflow DAG generated this table.

And we have a link to the lineage metadata. So you could see, like, what are the downstream jobs. We have a link to the preview button and a skeleton query. And we're finding that this data is like overloading that page and making it very hard to digest. So we're doing a lot of sort of quote and quote scaling of the experience exercises where we're moving this metadata around. So it's actually like the most important metadata

first and then the less relevant

later on,

mostly evolutionary stuff for the project. Having the sample query, I think, is definitely a good

way to make the overall tool accessible as well because somebody might be able to see the table and understand what the different columns might be. But then having that snippet already ready to go where they can just run it and then start experimenting with it and adding to it, I think definitely

adds a lot more

value for somebody who may not necessarily be as comfortable at the outset with a new dataset.

Yeah. Absolutely. And I it actually reminds me 1 more thing.

These it's amazing how small little things help so much. We put on the bottom right corner of Amazon, we have this message box. Essentially, you click on it. It's a bright pink button, of course, and all lit colors. And you click on it, you're easily able to send some feedback in the tool. And that skeleton query, thing that you mentioned was not a part of Amundsen. We didn't know about it. We didn't think about it. But it was like 1 of those requests that you get from a user, you know, like clicking on that button. They're just saying like, hey. It'd be really cool to do this. Right? And I think sometimes having those channels open so you can have a very low friction way of

passing that feedback,

to the people who build this tool plays a huge role. And that that 1 is a prime example of that. It's definitely a great example

of the fact that

while we're building tools that aren't necessarily meant for external consumption

to the people who are paying the business money, it's still a customer interaction where we're building something that is providing value to somebody else

and being able to have those user feedback cycles to improve the overall

utility of the tool and its effectiveness for the people who,

who it's intended to serve is valuable no matter what segment of the business you're in and whether the tools are internal or external.

Yeah. 100%.

So what do you have planned for the future of Amundsen?

Yeah. So

I I I classify this in various applications that are built on top of metadata.

The first application

is data discovery and trust, and we have lots to do there. We have we currently have

users as well as tables in that graph. We are working on adding dashboards in that graph. And as time goes on, we wanna add more streaming features too. So the same problem of data discovery that exists in the analytics world of tables and views and dashboards, it exists in the streaming world too, and it would only get worse with time. So you wanna be able to discover trustworthy streams

and topics and schemas and so on. We want to go in that territory and solve that discovery and trust problem for all data components.

But it doesn't stop there. We are building and want to build other applications on top of this. An example of an application that Tau earlier mentioned was compliance. So how can you use all the metadata we have since we already know what are all the tables at Lyft,

and who's using them? We can tag them as PII or NPI and be able to figure out if there is anomalous activity happening based on this and alert the right people. So that's like an application that we wanna build. And then

down down the road, we have applications around downstream impacts. So if you were a data engineer wanting to change a column type or add a new call column or trying to make a backwards

incompatible change to a table, you can figure out who to notify and who to keep posted of those changes.

But overall, if we were to step back, I think what's really missing from the space today

is like a data pool, like that 1 place where you where you go as a data user and it is that place where you can get all your information about

what changes are happening to their tables. So think of Facebook feed for your data. Right? So if I

commonly use an a table and Tau just made a change to it yesterday,

I'll get a notification saying this change was made. And then I also have a Google like experience in the same thing where if I want to discover a new area

of work, I can just type in ETA and find a trustworthy source for me to,

use that information and learn from what work has been done in the past and so on. Are there any other aspects

of the Amundsen project itself or the ways that it's being used at Lyft and in the open source community or the engineering work that has gone into it that we didn't discuss yet that you'd like to cover before we close out the show? So for example, like,

1 thing I'm not sure if have been mentioned before is that initially,

we, the metadata service is only serving for the it's only serving for the front end,

service for the metadata request. Now have we have been involved in a place like the metadata also serve as a standalone service, not only serving the front end services

as well as serving some of the other,

external service being lived for their metadata request or either get or put.

1 use case will be compliant. The other use case is that Some of the other teams, they want to ingest certain relevance,

metadata. For example,

1 example could be, like, the feature services

being Lift Machine Learning Platform. They want to they want to allow the feature table

to be more easily discoverable

and grouped into certain category. They have their feature service directly ingest the relevant,

metadata, like the tagging, the team name, which inform, which tag created this table

directly in, call our metadata API and ingest this metadata information into our graph. So that later on, you for the machine learning user, they could they could once they search their feature table, they go already see actually, this has come from which deck which team

is created

and is created,

for what purposes?

Well, for anybody who wants to follow along with the work that you're doing or get in touch or provide any feedback on the tool that you've built in the form of Amundsen. I'll have you add your preferred contact information to the show notes.

And as a final question, I'd like to get the perspective of each of you on what you see as being the biggest gap in the tooling or technology that's available for data management today.

So, Mark, if you wanna go first and answer that. Sure. Yeah. I think the big gap is having a standard for how we discover and trust data. This

is crucial to the experience

of data users that they are always up to date on what's happening in the organization. They have a trustworthy way to figure out if this is the right source for them

to rely on for making their decisions.

And then expanding on this metadata,

engine

and building applications and making metadata this holy grail of whether it's discovery

or compliance or downstream impact analysis, like making metadata the holy grail for this experience.

Yep. So for me, I 1 1 big gap I could see is, like,

once we have for example, once we have all the different heterogeneous

metadata index in

in our data managed system like Amazon,

how to make the search relevant

for across all these heterogeneous

system. For example, initially,

we we start with, like, high table, and we got some

tape user,

assess log and get the user

assess where we would see for this Hive table. Later on, for once, for example, if there are other heterogeneous table like postgres, they don't have usage informations.

How do we still make sure that search where we see is, applied to those post web tables

or other query table

or big query tables.

So this is something we need to think about and see how to adjust that. Thank you both very much for taking the time today to join me and discuss the work that you've put into the Amundsen project. It's definitely an interesting problem space and 1 that is absolutely necessary for the continued success of data platforms and data teams as the

overall complexity of our systems continues to grow and evolve. So I appreciate the work that you've both put into that, and I hope you enjoy the rest of your day. Thank you very much. Thanks for having us,

Data Engineering Podcast

Summary

Announcements

Interview

Contact Info

Parting Question

Links