Summary
Building clean datasets with reliable and reproducible ingestion pipelines is completely useless if it’s not possible to find them and understand their provenance. The solution to discoverability and tracking of data lineage is to incorporate a metadata repository into your data platform. The metadata repository serves as a data catalog and a means of reporting on the health and status of your datasets when it is properly integrated into the rest of your tools. At WeWork they needed a system that would provide visibility into their Airflow pipelines and the outputs produced. In this episode Julien Le Dem and Willy Lulciuc explain how they built Marquez to serve that need, how it is architected, and how it compares to other options that you might be considering. Even if you already have a metadata repository this is worth a listen to learn more about the value that visibility of your data can bring to your organization.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- You work hard to make sure that your data is clean, reliable, and reproducible throughout the ingestion pipeline, but what happens when it gets to the data warehouse? Dataform picks up where your ETL jobs leave off, turning raw data into reliable analytics. Their web based transformation tool with built in collaboration features lets your analysts own the full lifecycle of data in your warehouse. Featuring built in version control integration, real-time error checking for their SQL code, data quality tests, scheduling, and a data catalog with annotation capabilities it’s everything you need to keep your data warehouse in order. Sign up for a free trial today at dataengineeringpodcast.com/dataform and email team@dataform.co with the subject "Data Engineering Podcast" to get a hands-on demo from one of their data experts.
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference, the Strata Data conference, and PyCon US. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
- Your host is Tobias Macey and today I’m interviewing Willy Lulciuc and Julien Le Dem about Marquez, an open source platform to collect, aggregate, and visualize a data ecosystem’s metadata
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by describing what Marquez is?
- What was missing in existing metadata management platforms that necessitated the creation of Marquez?
- How do the capabilities of Marquez compare with tools and services that bill themselves as data catalogs?
- How does it compare to the Amundsen platform that Lyft recently released?
- What are some of the tools or platforms that are currently integrated with Marquez and what additional integrations would you like to see?
- What are some of the capabilities that are unique to Marquez and how are you using them at WeWork?
- What are the primary resource types that you support in Marquez?
- What are some of the lowest common denominator attributes that are necessary and useful to track in a metadata repository?
- Can you explain how Marquez is architected and how the design has evolved since you first began working on it?
- Many metadata management systems are simply a service layer on top of a separate data storage engine. What are the benefits of using PostgreSQL as the system of record for Marquez?
- What are some of the complexities that arise from relying on a relational engine as opposed to a document store or graph database?
- How is the metadata itself stored and managed in Marquez?
- How much up-front data modeling is necessary and what types of schema representations are supported?
- Can you talk through the overall workflow of someone using Marquez in their environment?
- What is involved in registering and updating datasets?
- How do you define and track the health of a given dataset?
- What are some of the interesting questions that can be answered from the information stored in Marquez?
- What were your assumptions going into this project and how have they been challenged or updated as you began using it for production use cases?
- For someone who is interested in using Marquez what is involved in deploying and maintaining an installation of it?
- What have you found to be the most challenging or unanticipated aspects of building and maintaining a metadata repository and data discovery platform?
- When is Marquez the wrong choice for a metadata repository?
- What do you have planned for the future of Marquez?
Contact Info
- Julien Le Dem
- @J_ on Twitter
- julienledem on GitHub
- Willy Lulciuc
- @wslulciuc on Twitter
- wslulciuc on GitHub
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- Marquez
- WeWork
- Canary
- Yahoo
- Dremio
- Hadoop
- Pig
- Parquet
- Airflow
- Apache Atlas
- Amundsen
- Uber DataBook
- LinkedIn DataHub
- Iceberg Table Format
- Delta Lake
- Great Expectations data pipeline unit testing framework
- Redshift
- SnowflakeDB
- Apache Kafka Schema Registry
- Open Tracing
- Jaeger
- Zipkin
- DropWizard Java framework
- Marquez UI
- Cayley Graph Database
- Kubernetes
- Marquez Helm Chart
- Marquez Docker Container
- Dagster
- Luigi
- DBT
- Thrift
- Protocol Buffers
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform. If you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances, and they've got GPU instances as well.
Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. And you work hard to make sure that your data is clean, reliable, and reproducible throughout the ingestion pipeline. But what happens when it gets to the data warehouse? Dataform picks up where your ETL jobs leave off, turning raw data into reliable analytics. Their web based transformation tool with built in collaboration features lets your analysts own the full life cycle of data in your warehouse. Featuring built in version control integration, real time error checking for their SQL code, data quality tests, scheduling, and a data catalog with annotation capabilities.
It's everything you need to keep your data warehouse in order. Sign up for a free trial today at dataengineeringpodcast.com/dataform and email team@dataform.co with the subject Data Engineering Podcast to get a hands on demo from one of their data experts. And you listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Corinium Global Intelligence, ODSC, and Data Council.
Upcoming events include the Software Architecture Conference, the Strata Data Conference, and PyCon US. Go to dataengineeringpodcast.com/conferences to learn more about these and other events and take advantage of our partner discounts to save money when you register today.
[00:02:25] Unknown:
Your host is Tobias Macey. And today, I'm interviewing Willy Lulciuc and Julien Le Dem about Marquez, an open source platform to collect, aggregate, and visualize a data ecosystem's metadata. So, Willy, can you start by introducing yourself?
[00:02:37] Unknown:
Yeah. Sure. So I'm Willy, a software engineer at WeWork, and I've been with the company for just over a year now. Since joining WeWork, I've been working on the Marquez team in San Francisco. But previously, I worked on a real time streaming data platform that was powering behavioral marketing software. And before that, I designed and scaled sensor data streams at Canary, which is an IoT company based in New York City. And Julien, how about yourself?
[00:03:03] Unknown:
Hi. I'm Julien. I've been at WeWork for about 2 years. I'm the principal engineer for the data platform, which means that I focus more on the architecture side of the data platform. And before that, I was at Yahoo, then Twitter, then Dremio.
[00:03:20] Unknown:
And going back to you, Willy, do you remember how you first got involved in the area of data management?
[00:03:25] Unknown:
Yeah. So I feel my involvement has been a bit unconventional. What I mean by that is I owe a lot of my understanding of data management to Julien. You know, I draw a lot of my inspiration on the topic from the earlier conversations that we had. So before Marquez was really a thing, Marquez was this really thin data abstraction layer on a diagram that Julien and I discussed. And it really cut across multiple concerns: you think about ingest, you think about storage and compute, and how these components interact.
So back then, we called it the metadata layer. I know the name wasn't as cool, but this abstraction layer would eventually be called Marquez and become a critical core component of WeWork's data platform. So, you know, now, over a year later since we had that discussion, we have the opportunity to tell others about our journey, why organizations invest in tooling around data management,
[00:04:18] Unknown:
and what we've learned, building Marquez at WeWork. And, Julian, do you remember how you first got involved in the area of data management?
[00:04:24] Unknown:
Yes. So, 12 years ago, I was working at Yahoo, building platforms on top of Hadoop. That was the very beginning of the Hadoop ecosystem, and we were building batch processing on top of it. And so that was very interesting. We built schedulers and some new things. After that, I started contributing to open source projects like Pig, and I joined Twitter. At Twitter, I worked on the data platform. I also got involved with building metadata systems to improve how we share data, and I also built Parquet when I was over there. And that was the beginning of having to deal with: how do we scale the organization? How do we manage data at scale and build platforms on top of it? And so that's how I got to finally join WeWork to work on the architecture for the data platform, thinking about those data management problems and getting them right from the beginning.
[00:05:24] Unknown:
And for anybody interested, you were actually on a previous episode, so I'll add a link to that in the show notes as well. And so, as we've mentioned, we're talking about the Marquez engine that you've both been working on now. So I'm wondering if you can just start by describing a bit about what Marquez is and some of the problems you were trying to solve by creating it?
[00:05:46] Unknown:
So Marquez is a metadata management and storage layer. And what it is about is really capturing all the jobs, all the datasets, and, for each job, what datasets it reads and writes to. And this is really about understanding operations: which version of my job consumed what version of a dataset and produced what version of a dataset? And helping with questions like: is it taking longer and longer over time? Who do I depend on? Who is depending on me? And, you know, this problem of data freshness, data quality, all of that, having better visibility and capabilities to ensure you have good quality.
And around that, it also enables a bunch of use cases around data governance, data discovery, and data cataloging. And so it's really about capturing the state of your data environment. So that's kind of the basics of what Marquez is, and it's really about the data lineage, but really from this big graph perspective of jobs and datasets.
[00:06:56] Unknown:
And what was missing in the existing solutions for metadata management that were available at the time you first began working on this project that you felt you could do a better job of addressing with Marquez, rather than trying to build some supplemental resources to tie into those existing engines?
[00:07:14] Unknown:
So I think if you look at the tooling side, at WeWork we use Airflow, for example, which is one of the main open source schedulers around. And Airflow focuses a lot on the job lineage and doesn't know much about datasets. And if you look at other things like Atlas, they know a lot about data lineage and focus more on governance, but they don't really have this precise model of connecting jobs and datasets. So there's kind of the operations side of things, and really having a precise model of those dependencies is missing. And that's kind of why we started Marquez. Right? You also have things like the Hive metastore, which knows about all the datasets and their partitions, and they focus a lot on the dataset, not too much on how jobs depend on that dataset and how people depend on each other. So I think a lot of components exist that touch on the metadata, but they don't really connect all the dots together. And that's kind of what we were trying to achieve with Marquez.
[00:08:19] Unknown:
And so in terms of the capabilities that you have built into it, I'm wondering if you can give a bit of a compare and contrast with some of the other tools and services that bill themselves as data catalogs or metadata layers, and maybe talk a bit about how it relates to the Amundsen project from Lyft that we had on the show previously.
[00:08:39] Unknown:
Yeah. So before we can compare and contrast the differences and similarities between the features enabled by Marquez, we first have to ask ourselves: why do organizations take on the engineering challenge to build their own in house data catalog solution? So for example, Uber has their own internal data catalog called Databook; Lyft, which I think was on a previous episode, has Amundsen; and then LinkedIn recently open sourced DataHub. Mainly, these solutions focus on 3 core features. So you can think about data lineage, which is how you track the transformation of your datasets over time, what the intermediate processes are that touch that data, and the derived datasets.
The other core component is data discovery. So how do you democratize data? How do you get to a point where employees within your organization can trust your data and, if they want to access a dataset, know how to connect and pull that data? The other component is data governance: really understanding who can access what data and whether they have the right privileges to interact with that data. So if we take a few steps back and draw a Venn diagram of the intersection of those features, Marquez is at the center. Right?
But the unique thing that we built out in Marquez is this versioning capability, both for datasets and also for jobs. When I talk about Marquez, that's the real differentiator: the versioning logic that we built in. For example, for datasets, versioning ensures a historical log of changes to datasets. With Marquez, if the schema for a dataset changes, we tie that to a dataset version. If a column is added to a table or a column is removed, that's important and we wanna track that. Similarly for jobs: if the business logic changes, maybe you're adding a filter to a dataset or applying additional join logic, we wanna capture and keep a unique reference, a link to the source code, that allows us to reproduce the actual artifact of the job from the source code itself.
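As a rough illustration of the versioning Willy describes, a dataset version can be thought of as derived from its schema, and a job version from its code reference plus its inputs and outputs. The sketch below is a hypothetical Python illustration of that idea, not Marquez's actual versioning functions.

```python
# Hypothetical sketch: a dataset version changes when its schema changes, and a
# job version changes when its source code reference or its inputs/outputs change.
# Marquez's real versioning functions may differ; this only illustrates the idea.
import hashlib
import json

def dataset_version(namespace: str, name: str, fields: list) -> str:
    """Derive a stable version from the dataset identity and its schema fields."""
    payload = json.dumps(
        {"namespace": namespace, "name": name, "fields": fields}, sort_keys=True
    )
    return hashlib.sha256(payload.encode()).hexdigest()[:12]

def job_version(namespace: str, name: str, source_code_url: str,
                inputs: list, outputs: list) -> str:
    """Derive a version from the job's code reference and its input/output datasets."""
    payload = json.dumps(
        {"namespace": namespace, "name": name, "source": source_code_url,
         "inputs": sorted(inputs), "outputs": sorted(outputs)},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()[:12]

v1 = dataset_version("billing", "public.invoices",
                     [{"name": "id", "type": "BIGINT"}])
v2 = dataset_version("billing", "public.invoices",
                     [{"name": "id", "type": "BIGINT"},
                      {"name": "amount", "type": "DECIMAL"}])
assert v1 != v2  # adding a column yields a new dataset version
```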
[00:11:16] Unknown:
Amundsen focuses more on the data discovery and visualization part of metadata management, and Marquez is focusing more on the operational lineage of data and jobs. And so we actually had a quick hack project where we connected the two as a proof of concept of using them together. So I think that's an interesting thing we could approach in the future, to see how those communities can collaborate and how we can build on top of each other.
[00:11:48] Unknown:
Yeah. Exactly. So before Amundsen was open sourced, we actually had an opportunity to speak with the Amundsen team at Lyft. It was this amazing in person jam session where we talked about metadata, and it ended with a deep technical whiteboard discussion on how those efforts could be combined. So if we scan the features of Amundsen, it supports associating owners with datasets, data lineage powered by Apache Atlas, and data discovery, which is backed by Elasticsearch.
For Marquez, we do have our own UI that we use to search for datasets and explore the metadata that has been collected by our APIs. But the cool thing with Amundsen, and something that Julien touched on, is that they have an API contract, which makes pulling metadata from a back end metadata service into the Amundsen UI very easy. So that becomes a pluggable component in their architecture. And one of our goals is to provide Marquez as a pluggable back end for Amundsen.
[00:12:57] Unknown:
And what are some of the other integrations that you're currently using on top of Marquez, some of the ways that you're consuming the metadata, and maybe some of the downstream effects of having this available that have simplified or improved your ability to identify and utilize these datasets for your analytics?
[00:13:18] Unknown:
Yeah. Sure. So, as Julien mentioned, at WeWork, Airflow has quickly become an important component of our data platform, powering billing as well as space inventory. So, internally, we naturally prioritized adding Airflow support for Marquez. The integration allows us to capture metadata for our workflows managed and scheduled by Airflow, enabling data scientists and data engineers to better debug problems as they come up. One question that a lot of our data scientists and analysts really care about, a common question that's really hard to answer, is: why was my workflow failing? One solution to this, and one key feature of Marquez, is the data lineage graph that's maintained on the back end. So the integration allows us to checkpoint the run state of a workflow, understand the run arguments to the pipeline itself, and conveniently keep a pointer to the workflow definition in version control.
Some of the other integrations that we've been focusing on are with Iceberg. It's a really exciting project that was open sourced by Netflix and is now incubating as an Apache project. Iceberg is a table abstraction for datasets that are stored across multiple partitions in a file system. So with that, Iceberg does allow us to begin to version files in S3 and capture metadata around file systems.
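The Airflow integration described here works as a drop-in wrapper around Airflow's DAG class, so that each task run is reported to Marquez along with its run arguments and a pointer to the code in version control. The sketch below assumes the marquez_airflow package and the one-line import change mentioned later in the episode; treat the names as illustrative rather than a guaranteed match for the current library.

```python
# Illustrative only: swapping Airflow's DAG for a metadata-aware one so every
# task run is reported to Marquez. Package/class names are assumptions.
from datetime import datetime

from marquez_airflow import DAG  # assumed drop-in replacement for airflow.DAG
from airflow.operators.postgres_operator import PostgresOperator

dag = DAG(
    dag_id="billing_rollup",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    description="Loads daily billing rollups into the warehouse",
)

# The wrapper is expected to record, for each run: the run arguments, the run
# state (started/failed/completed), and a pointer to this DAG's version in
# source control, linking the run to the dataset versions it read and wrote.
build_rollup = PostgresOperator(
    task_id="build_rollup",
    postgres_conn_id="analytics_warehouse",
    sql="""
        INSERT INTO analytics.billing_rollup
        SELECT org_id, SUM(amount) FROM billing.invoices GROUP BY org_id
    """,
    dag=dag,
)
```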
[00:14:54] Unknown:
And as far as the capabilities that are unique to Marquez, I know that you have mentioned this idea of linking the jobs that produce given datasets to the datasets themselves and being able to version them together. And I'm wondering if you can talk through the overall benefits that has for being able to consume datasets and ensure the health of the data, and ensure that you have some visibility into when a schema mismatch occurs as a job is deployed, or some of the other information that you're able to obtain by using Marquez as this unifying layer across all of your different jobs and datasets?
[00:15:34] Unknown:
Yeah. So there are a couple of use cases where that becomes very handy. One is, of course, when something goes wrong. I think when you see data processing in companies, a lot of those frameworks and environments are designed with the best case scenario in mind. People know what happens if the job is successful: you produce data and you trigger downstream processing. However, when something goes wrong, it becomes hard to debug. Or if you need to reprocess something, it becomes hard. So Marquez is capturing very precise metadata about when the job ran, what version of the code ran, and what version of the dataset was written, especially if you use a storage layer like Iceberg or Delta Lake, where you have a precise definition of each version of the dataset.
And so when your job fails, or it's taking too long, or the job is successful but the data looks wrong, you can start looking at what changed. Right? You can see, for your particular job, did the version of the code change since the last time it ran, or did the shape of the input dataset change? You could use things like Great Expectations, which is an open source framework for defining declarative properties of your dataset, and verify that they're still valid or that they didn't change significantly. And you could look at that not only for your job, but for all the upstream jobs, because you understand the dependencies.
So often, you have simple things happening, like: why is my job not running? Well, it's not running because your input is not showing up, and your input is not showing up because the job that's producing it is not running. Right? So you can walk that graph upstream until you find the source of your problem. And it may be that there's some input data that's wrong, or it may be that there's a bug that got introduced, and you can figure out what's going on. So first, you have a lot of information depending on what's happening. And second, since you have a precise model and you know for each run what version of a dataset it ran on, if you need to restate a partition in a dataset, you can improve your triggering. You know exactly what jobs need to rerun.
So I think the state of the industry is often that people have to do a lot of manual work when they need to restate something and rerun all downstream jobs. And the first capability that is required is having visibility and understanding all the dependencies: what to rerun. And in the future, you could even imagine using that very precise model to trigger automatically all the things that need to be rerun. Or, if something is too expensive to rerun and it's not worth it, you could flag the data as dirty and something that should not be used. So there are a lot of aspects like this that are important. And I think in a world where you see more and more machine learning jobs happening on data, having the information that a particular training job ran on this version of the training set, using those hyperparameters, and produced that version of the model that was then used in that experiment with an experiment ID, and tying everything together, has a lot of usefulness. Right? Because people need to be able to reproduce the same model. So capturing this information, or, if the model is drifting over time, having the proper metrics and being able to get back to that version of the training set or understand what has changed, whether in the data or in the parameters, is really important. So those are some of the specific things we have in mind when we're looking at this very precise model of jobs and datasets and what's running.
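Julien's description of walking the graph upstream until you find the source of the problem maps onto a simple traversal. Here is a minimal sketch over an invented toy graph; it is not Marquez's internal representation.

```python
# Minimal sketch of the debugging walk described above: starting from a late or
# failed job, follow the lineage graph upstream to find the failed producer.
from collections import deque

# Edges point from each node to what it depends on: a job depends on its input
# datasets, and a dataset depends on the job that produces it.
upstream = {
    "job:daily_report":       ["dataset:invoices_clean"],
    "dataset:invoices_clean": ["job:clean_invoices"],
    "job:clean_invoices":     ["dataset:invoices_raw"],
    "dataset:invoices_raw":   ["job:ingest_invoices"],
    "job:ingest_invoices":    [],
}

status = {
    "job:daily_report": "WAITING",
    "job:clean_invoices": "WAITING",
    "job:ingest_invoices": "FAILED",  # the actual root cause
}

def find_failed_upstream(start: str) -> list:
    """Walk upstream breadth-first and collect every failed job on the way."""
    failed, seen, queue = [], {start}, deque([start])
    while queue:
        node = queue.popleft()
        if status.get(node) == "FAILED":
            failed.append(node)
        for parent in upstream.get(node, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return failed

print(find_failed_upstream("job:daily_report"))  # ['job:ingest_invoices']
```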
[00:19:29] Unknown:
Yeah. And if I could add to that: a lot of what happens as a data engineer is that you work on a pipeline and you deploy changes periodically. But if you update the logic of your pipeline, usually what happens about a week or so later is when you start seeing downstream issues with your dashboards. It's like, hey, is the data wrong? Why do I see a sudden drop in my graph or my dashboard? And that could be related to a number of things. So with Marquez, you have this highly multidimensional model which allows you to say: okay, which job version? At what point was this bug introduced? And also, what were the downstream jobs that were affected by the output of this particular job version? Which allows you to make backfilling a lot more straightforward than what we see now. And, really, I think a lot of data engineering teams tend to avoid that and say, oh yeah, let's just write it off as something we could address
[00:20:30] Unknown:
when the pipeline runs again. Yeah. Being able to identify some of the downstream consumers that are gonna be impacted by a job change, I can see as being very valuable, because it might inform whether or not you actually want to push that job to production now, or maybe wait until somebody else is done using a particular version of a dataset, or at least, as you said, having that visibility into what all the potential impacts are. Whereas if you're just focusing on the one job, it can be easy to ignore the fact that there are downstream consumers of the data that you're dealing with. And then in terms of the inputs to Marquez, we've been talking a lot about discrete jobs and batch oriented workflows, but I'm curious too if there is any capability for recording metadata for things like streaming event pipelines, where you have a continuous flow of data into a data lake or a given table, or that might be fed into a batch job that's maybe doing some sort of windowing functions, and how the breakdown falls as far as batch versus streaming workloads?
[00:21:29] Unknown:
So we do have that in the model. The core entities are this notion of jobs and datasets. Right? And they're attached to a namespace, and that's our modeling for ownership and multi tenancy: jobs and datasets live in a namespace, which captures who's producing them. And then for each job and dataset, we do have types attached to them. And depending on the type, we capture slightly different metadata. So on the dataset types, we have the batch dataset, which could be Iceberg or Delta Lake, usually stored in a distributed file system like S3 or something similar. And we have the more table-like dataset, like if you use a warehouse such as Redshift or Snowflake or Vertica. In that sense, we have a less precise model because we can't really pinpoint a particular version of a dataset. We can't go back to a specific version of the table, but we can version the changes in the schema, so we do capture that. And then the third type is a streaming dataset, so typically something like a Kafka topic, which has a schema as well if you're using the schema registry with Avro like we do. And so we can version that.
And, similarly, we don't have that precise pinpointing of the version, because the job is continuously running instead of having the discrete runs that a batch dataset has. So we have those 3 types of datasets at the moment: a SQL table in a warehouse, a streaming dataset in Kafka, or a batch dataset in S3. And then on the job side, similarly, you have batch jobs and streaming jobs. A batch job has discrete runs, and for both types, we capture the version of the code and when the job started and stopped. For batch jobs, you have discrete runs that are tied to a version of a dataset. And for a streaming job, you still have runs because the streaming job starts and ends, but you have fewer of them, and they're more continuous.
And so you don't have this tracking of versions of datasets. But we do track when the schema evolves, if you update your streaming job, for example, and you add a field to the output. So we do capture those different types of information. So that's the higher level model, and then depending on the type of dataset or the type of job, we try to be more precise in what we capture, depending on each environment.
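To make the three dataset types concrete, here is a small hypothetical sketch of how their metadata might differ. The field names are invented for illustration and are not taken from Marquez's actual schema.

```python
# Illustrative only: the three dataset types discussed here carry slightly
# different metadata. Field names are assumptions made for this sketch.
warehouse_table = {
    "type": "DB_TABLE",
    "namespace": "analytics",
    "name": "public.room_bookings",
    "schema_version": "v7",      # schema changes are versioned...
    "snapshot_version": None,    # ...but the table can't be pinned to a point in time
}

kafka_stream = {
    "type": "STREAM",
    "namespace": "events",
    "name": "booking-events",
    "schema_registry_subject": "booking-events-value",  # Avro schema via the registry
    "snapshot_version": None,    # continuously written, so no discrete run-level version
}

batch_files = {
    "type": "BATCH",
    "namespace": "lake",
    "name": "s3://data-lake/bookings/",
    "table_format": "iceberg",          # Iceberg/Delta give a precise version per write
    "snapshot_version": "snap-000123",  # so each run can pin the exact input version
}
```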
[00:23:55] Unknown:
And I'm wondering if you can dig a bit more into the specifics of the data model for Marquez. I know you mentioned the different entities as far as datasets and jobs, and I'm wondering both what some of the lowest common denominator attributes are that are necessary for it to be useful within the metadata repository, and if there's any option for extending the data models for use cases outside of what you in particular are concerned with at WeWork.
[00:24:29] Unknown:
So we have this notion of job and dataset, and I think maybe job is a little bit of an overloaded term. But when you define a system like this, you always have some terms that have a specific meaning in one area and a different meaning in another area. So by job, we really mean something that consumes and produces data. And so the common denominator is really this notion of inputs and outputs, and having jobs that consume and produce data. So the thing that's always common is you have inputs and outputs, you have a version of the code that was deployed, and you have parameters.
And for a dataset, there's a physical location and an owner attached to it, same as for the job. Right? So this notion of ownership and dependencies is common to everything. And then what we do is specialize the model: we have specialized tables for each type of dataset and job to capture where we can be more precise in one environment, because what we capture in a streaming environment versus a batch environment is not the same. So there's a higher level model that's similar, with the inputs and outputs. And some of the other things we've been thinking about: of course, upstream from your data processing, there are services that depend on each other as well, but the model is slightly different. In our model, you always have this notion of something consuming datasets and producing datasets. So you always have the datasets in between, dependencies between components and the artifacts that people build.
And in the service world, usually, it's direct service to service dependencies. So it's something we haven't really spent a lot of time on, but that people start asking about sometimes: how do you connect both worlds and have the dependency tracking, which people often do with OpenTracing, things like Jaeger and Zipkin, in the service world? Because there's like a duality between the data processing world and the service world, and there are a lot of those concepts that align. And so how do we connect the dots between those things?
[00:26:48] Unknown:
And can you talk a bit about how Marquez itself is actually implemented, some of the overall system architecture, and maybe some of how that's evolved since you first began working on it? Yeah. Sure.
[00:27:01] Unknown:
So Marquez itself is a modular system. When we first designed the original source code and also the back end data store, we wanted to make sure that, first of all, the API and also the back end data model were platform agnostic. When I think of Marquez, I always talk about 3 system components. First, we have our metadata repository, and the repository itself stores all dataset and job metadata, but also tracks the complete history of dataset changes. So you can think of it as: when a system or a team updates their schema, we wanna track that, so we keep a complete history of it; and when a job runs, it also updates the dataset itself, and Marquez on the back end creates those relationships.
The other component is the REST API itself. And if I can talk a little bit about the stack: it's written in Java, and we use DropWizard pretty extensively on the project to expose the REST API but also to interact with the back end database itself. And really the API drives the integrations; one example that we talked about is the Airflow integration that we've done. And then finally, we have the UI itself, which is used to explore and discover datasets as well as the dependencies between jobs themselves, and allows our end users at WeWork to navigate the different sources that we've collected, as well as the datasets and jobs that Marquez has cataloged.
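Because the REST API is the integration surface, exploring the catalog reduces to a couple of HTTP calls. The endpoint paths and default port below are assumptions for the sketch; the Marquez API documentation is the source of truth.

```python
# Illustrative only: listing namespaces and their datasets over the REST API.
# The paths below are assumed for this sketch; check the API docs for the real ones.
import requests

BASE = "http://localhost:5000/api/v1"  # assumed default address for the API server

namespaces = requests.get(f"{BASE}/namespaces").json()
for ns in namespaces.get("namespaces", []):
    datasets = requests.get(f"{BASE}/namespaces/{ns['name']}/datasets").json()
    for ds in datasets.get("datasets", []):
        print(ns["name"], ds["name"], ds.get("description", ""))
```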
[00:28:40] Unknown:
And when I was going through the documentation, it looks like the actual underlying storage engine, at least for your implementation, is Postgres. I'm wondering what the motivation was for relying on a relational database for this, any other supported back ends that you have, and what the benefits are of using a relational engine versus a document store or a graph store for this type of data?
[00:29:05] Unknown:
Sure. You know, for us, Postgres gets us pretty far. When we whiteboarded the data model for Marquez, it was a relational model, so we kind of went with that. There is going to be a point where a relational database cannot get us to the scale that we need, but when we designed the system, we wanted to make sure that it was simple to operate and that there weren't too many dependencies you had to pull in to get up and running. So as we see more and more usage of Marquez internally, we will naturally transition to a graph database, because that gives us richer relationships and allows us to pinpoint, at a node in a graph, what the relationships between a job and a dataset are.
But that doesn't mean Marquez doesn't have a graph database. We actually do. It's called Cayley, which was open sourced by Google, and that's what we use to drive the data lineage graph, which is a key component and really a huge feature of the API itself. A document store, I think, would be a little hard. If you look at what we're trying to model, with a document store, if you think of something like DynamoDB, you do have to do a lot of prefetching and filtering yourself within the application, or you push that down to the actual NoSQL database itself. So for us, it just made sense to use Postgres and then transition over to a graph database as we scale out. And I think one of the obvious pieces
[00:30:39] Unknown:
where you can help scale that model is that we capture all the runs of a job, and when people look at what's happening, they're mainly interested in what has been happening recently. So you can archive all the old runs to a more key value store type of model that would scale easily to storing all the historical runs of all the jobs and all the old versions of datasets. And we're still talking about metadata here, so it's not that much data, but it does accumulate over time. And so from that perspective, I think the relational database gets you pretty far for the number of datasets you have and capturing the metadata for them. And we can adapt as we see people using it in larger and larger environments and data ecosystems.
You can start archiving the historical runs of the jobs to secondary storage that scales better in volume, for something that you may want to look at
[00:31:42] Unknown:
more in aggregate or something like that. And for somebody who's interested in using Marquez, can you talk through some of the overall workflow of getting it set up and getting it integrated into a data platform and maybe some of the work involved in actually populating it with the different metadata objects and records?
[00:32:00] Unknown:
Yes. So Marquez is open source, so you do have the option of just building the JAR itself. If you have a running Postgres instance and you want to apply the Marquez data model, you just point it at that database and Marquez will run the migration scripts that we have, which apply the schema to that database. So that's one option. The other one is: at WeWork, we are heavily invested in Kubernetes, so that is an option as well. We do use a Helm chart to deploy the UI as well as the back end API itself.
So those are two options that someone who wants to get up and running with Marquez has. We also publish a Docker image. So if your organization is in an environment that runs containers and manages them through Kubernetes or some other container management system, you can get up and running that way.
[00:32:55] Unknown:
And then as far as getting the job information and everything in, I know that there are Airflow connectors and you have native clients for Python, as well as an integration that I noticed is a fairly recent addition. So I'm wondering if you can just talk through some of the other work, once you've got it up and running, of actually integrating it into the rest of a data platform to record metadata and job and dataset information, and then also, on the downstream side, setting up consumers to take advantage of that information?
[00:33:26] Unknown:
Right. So as you mentioned, we do have a Python client. We also have a Java client, and we're working on a Go client as well, because there are a lot of applications written in Golang at WeWork. So really the integrations use these clients, which implement the REST API. A lot of the time, when we do integrations with our internal platform components or integrations with open source projects like Airflow, what we end up doing is using the REST API. We have an API for registering source metadata and metadata around datasets, but also an API around jobs. So really, it comes down to just understanding when your pipeline or your application is running and what the friction points are. What we care about is when your application reads data and when it writes data. So those are the two key integration points that we care about.
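To give a sense of what those read/write integration points look like from the client side, here is a hedged sketch using the Python client mentioned above. The import path, method names, and arguments are assumptions for illustration, not the client's documented API.

```python
# Illustrative only: registering a dataset and the job that produces it.
# Method names and signatures are assumptions standing in for the REST calls.
from marquez_client import MarquezClient  # assumed import path

client = MarquezClient(url="http://localhost:5000")

client.create_namespace("billing", owner_name="data-platform")

client.create_dataset(
    namespace_name="billing",
    dataset_name="public.invoices",
    description="One row per invoice, refreshed daily",
)

client.create_job(
    namespace_name="billing",
    job_name="build_invoices",
    input_dataset=["public.raw_payments"],
    output_dataset=["public.invoices"],
    location="https://github.com/acme/pipelines/blob/abc123/build_invoices.py",
)
```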
[00:34:25] Unknown:
Yeah. And as those integrations are contributed to the project, there's less and less work for people to do to integrate. So today, if you use Airflow, you have the Airflow support available right away. But some other companies use a scheduler called Luigi, and currently we don't have Luigi support. So someone who wants to use Luigi with Marquez would have to write a Luigi integration to send the same information. But once that is done, everybody using the Luigi scheduler would benefit from it. And the same applies to Spark. We have integrations for Snowflake and Redshift SQL, and that's something that everybody can leverage. And, really, that's one of the reasons for open sourcing Marquez: it's something that becomes more valuable the more it's used in the open, because people contribute those integrations.
And then the more we have, the easier it is for anyone to use it right away without much work. And so that's kind of
[00:35:33] Unknown:
the advantage of open source in this kind of project. Yeah. And continuing on that, one exciting integration that we've done with Airflow is that we provide a SQL parser. A lot of the time, what we see is that Airflow is used for ETL workloads, mainly reading from S3 and then writing to your warehouse. So what we ended up doing was building this built-in SQL parser that understands which tables are part of your SQL statement, which tables are part of your join, and also which tables you are writing to. And the key thing when we were looking at integrating with Airflow was that we wanted it to be really easy, just plug and play. If you just have to do a one line change to modify which library you're importing, we wanted to make that really simple. So it's just a one line change and, by default, you get all of this rich metadata sent to Marquez.
And, by default, you get a lineage graph that cuts across multiple Airflow instances. Depending on your deployment, you could do a multi tenancy deployment in Airflow or you could have single instances, so there is that opportunity to stitch together the interdependencies between workflows.
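The built-in parser isn't walked through in the episode, but the idea of deriving inputs and outputs from a SQL statement can be shown with a toy, regex-based stand-in; the real parser is more robust than this sketch.

```python
# A toy illustration of the SQL parsing idea: pull input tables (FROM/JOIN) and
# output tables (INSERT INTO) out of a statement so lineage can be built
# automatically. Only handles simple statements; not the actual parser.
import re

def extract_tables(sql: str) -> dict:
    sql = re.sub(r"\s+", " ", sql)
    inputs = re.findall(r"(?:FROM|JOIN)\s+([\w\.]+)", sql, flags=re.IGNORECASE)
    outputs = re.findall(r"INSERT\s+INTO\s+([\w\.]+)", sql, flags=re.IGNORECASE)
    return {"inputs": sorted(set(inputs)), "outputs": sorted(set(outputs))}

sql = """
INSERT INTO analytics.daily_bookings
SELECT b.org_id, COUNT(*) AS bookings
FROM raw.bookings b
JOIN raw.organizations o ON o.id = b.org_id
GROUP BY b.org_id
"""
print(extract_tables(sql))
# {'inputs': ['raw.bookings', 'raw.organizations'], 'outputs': ['analytics.daily_bookings']}
```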
[00:36:57] Unknown:
And in terms of the actual separation there, do you have a different deployment of Marquez for production versus preproduction workflows? Or do you have it all in 1 UI so you can view the entirety of your datasets across all of your environments?
[00:37:12] Unknown:
Yeah. So we follow a fairly standard deployment process. We do have a staging environment for Marquez, and most of what's in it is sort of dummy data, but also, if someone's testing out a new pipeline, we do have that reported to the Marquez back end. And then we also have a deployment process for production. We sometimes sync metadata from production just to provide more populated metadata in staging. That way we can start querying: okay, we added this new field, does it really make sense? Should we drop it? Does it really answer the question that we've been trying to ask? But, yeah, we hooked into CI and we have continuous deployment to both staging and production.
[00:38:00] Unknown:
And as far as the assumptions that you made and the ideas that you had going into this project, what are some of the ways that those have been challenged or updated as you've actually started using it in production and exposed it to other organizations that have started employing it for their environments?
[00:38:16] Unknown:
One of the other metrics for the success of Marquez is looking at coverage of lineage. And when we look at that, sometimes it's a little bit of a moving target. In the Airflow integration, we integrate with Airflow and we have multiple instances of Airflow for multiple teams. So right away, as you deploy the Airflow integration, you see all the jobs. But you may not see all the lineage right away, because to capture the lineage, we have extractors that figure out the lineage for each type of operator people are using inside of Airflow. So we define targets in terms of needing to cover all the operators that people are using, and we start working on that. Meanwhile, of course, people keep innovating and using more operators. And so making sure you define a more standardized way of working together, and making sure that as we include more operators we don't have more and more that needs to be integrated, is a challenge that we've seen in the past.
And so it's important to work with your users: how do you make sure that your lineage coverage target doesn't become a moving target? The more coverage you add, the more coverage you need to have. At the beginning, it was a bit challenging, but as soon as you start paying attention to it, it actually works pretty well. We've seen some efforts, like people starting to use DBT to have lineage information in their jobs. But then they have lineage information just inside the team. Right?
And Marquez gives you lineage information across the entire organization. And so just working together has been important, and making sure we have aligned goals on how we build that. So that's been a little bit challenging from that aspect.
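The per-operator extractors Julien mentions suggest a simple registry pattern: each supported operator type maps to a function that derives inputs and outputs, with a fallback so unsupported operators still show up as jobs without lineage. The sketch below is an illustration of that pattern, not the integration's real API.

```python
# Illustrative extractor registry: lineage coverage grows operator by operator,
# and unknown operators fall back to "job visible, lineage unknown".
import re

def postgres_extractor(task) -> dict:
    """SQL-based operators: derive inputs/outputs by parsing the task's SQL."""
    inputs = re.findall(r"(?:FROM|JOIN)\s+([\w\.]+)", task.sql, flags=re.IGNORECASE)
    outputs = re.findall(r"INSERT\s+INTO\s+([\w\.]+)", task.sql, flags=re.IGNORECASE)
    return {"inputs": sorted(set(inputs)), "outputs": sorted(set(outputs))}

def python_extractor(task) -> dict:
    """Arbitrary Python callables: no lineage unless hints are provided explicitly."""
    return {"inputs": [], "outputs": []}

EXTRACTORS = {
    "PostgresOperator": postgres_extractor,
    "PythonOperator": python_extractor,
}

def extract_lineage(task) -> dict:
    extractor = EXTRACTORS.get(type(task).__name__)
    if extractor is None:
        # Unknown operator: the job is still registered, but coverage drops.
        return {"inputs": None, "outputs": None}
    return extractor(task)
```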
[00:40:19] Unknown:
Yeah. And, you know, it's funny. We do version the database schema that we have for Marquez, and I think we're on version maybe 21. But if you look back at what we initially had, it was just, I think, 3 entities: jobs, datasets, and runs. And if you fast forward to where we are now, we have a far richer data model where we capture not only the run logs, but also the context around the job itself. So recently, with our Airflow integration, we wanted to capture the SQL so that we could display it on the Marquez front end. So we added this job context field, which is just a set of key value pairs that allows you to store additional information about the job itself. When we first started, I think the trickiest part for me was to really understand how we were going to provide this extensive metadata model that allows us to version datasets.
It was always theoretical, but once we got it running in production, our first integration with Airflow allowed us to really expand and implement that versioning logic, which, looking back now, was a far bigger task than I thought it would be. And right now, it's just fairly simple versioning functions depending on the dataset itself. And, also, we did expand on ownership of metadata with namespaces. A namespace allows you to group metadata by context. Initially, we tracked it at the job level, but then we moved that up one level, where we now tie ownership to datasets and jobs. So, really, there have been so many additions and modifications that we've made in the past year, from our first whiteboard session and the first data model that we had for Marquez. Yeah, I think it's really important to have those entities and their relationships right.
[00:42:04] Unknown:
Because from that, it's really easy to add more metadata around each entity. But evolving the entities themselves and the relations between them is a bit harder, especially once you're in production. And so having this notion of jobs, job versions, runs, datasets, dataset versions, and inputs and outputs, and really having the right modeling of what the world looks like, enables a lot of this.
[00:42:33] Unknown:
Yeah. And one last thing: when we thought about the metadata repository, we didn't really want to store schemas. We didn't wanna become a schema registry that stored all the dataset fields, but what we ended up seeing was the need for that. So Marquez now is able to version the fields of a dataset and tie those to a version. When we capture metadata for a dataset, we also capture its fields: the name, the type, and also the description itself. Which is a direction that I didn't think we would take, but it's really paying off and we're seeing some really cool usage
[00:43:08] Unknown:
based off that. And in terms of the description, I know that 1 of the most valuable aspects of having a metadata repository and a data catalog is being able to capture the context of the datasets so that you can understand what their intended purpose is and some of the information that went into the decisions as to how it was produced and some of the schema that was formed. And I'm curious what level of additional annotation is possible beyond just a free form description field or some of the interesting ways that you've seen that leveraged?
[00:43:39] Unknown:
So we have some tagging features, and they can be leveraged to implement privacy or security aspects, or to encode SLAs. Right? Is my data experimental? Is my data production ready? Those are the kinds of aspects people can use it for. Another aspect is adding data quality metrics to the dataset. So we've been experimenting with Great Expectations to do this. And then people can decide; usually it's used in two ways. When you're producing the data, you have some declarative properties enforced on your dataset, and you fail if they don't hold. You don't want to let anybody see that dataset: the code may run and not report any errors, but the result is not correct. And so that can be used as a circuit breaker to not start the downstream jobs and not publish this dataset.
The other way people use it is that the consumers may have different opinions of what the data quality should be for them to run their job. So they can also use it as a pre-validation check, enforcing certain data quality metrics before consuming a dataset and preventing bad data from percolating through the system. Right? Because that can be expensive or have an impact in production, especially if you're doing machine learning or a recommendation engine or things like that. If you have bad data going in, then you have bad recommendations coming out. Right? And that has a real impact on the production systems.
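A hedged sketch of the circuit-breaker use of Great Expectations described here: assert declarative properties on a dataset and refuse to publish it (or to start downstream jobs) when they fail. The expectation calls follow Great Expectations' pandas API; exact return shapes vary between versions.

```python
# Validate declarative properties of a dataset before publishing it.
# A negative amount should trip the check and stop downstream processing.
import great_expectations as ge
import pandas as pd

df = pd.DataFrame({
    "invoice_id": [1, 2, 3],
    "amount": [120.0, 89.5, -5.0],
})
gdf = ge.from_pandas(df)

gdf.expect_column_values_to_not_be_null("invoice_id")
gdf.expect_column_values_to_be_between("amount", min_value=0)

results = gdf.validate()
if not results["success"]:
    # Circuit breaker: don't publish the dataset / don't trigger downstream jobs.
    raise RuntimeError("Data quality checks failed; refusing to publish dataset")
```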
So those are some of the ways people are using it. There are always two aspects: either you have more generic tagging or a flexible type of metadata added to an existing entity, or, if it's something that can benefit from being included in the core model, then it can become an actual attribute
[00:45:46] Unknown:
or an entity in the model. Yeah. And one way we plan on using descriptions is for our search results. So if someone's searching for a dataset, and the owner happened to provide a description for that dataset, we wanna reward the owners of those datasets by moving those datasets up the search results. We do make descriptions optional, but, like I said, we do wanna reward our end users for putting in the extra effort to annotate their datasets.
[00:46:13] Unknown:
And we've talked a couple of times about the health of a dataset. And you mentioned, Julian, the idea of using something like grid expectations for being able to populate some of these data quality metrics. And I'm wondering what are some of the other useful signals as to the overall health of a dataset? And then also things like the last updated field for indicating, when something might be stale or when you might want to get some additional information about why it's not up to date or why it's in a particular state as far as the health of the quality? So, data freshness is often a a property of data
[00:46:51] Unknown:
that you see. So, yes, to me, data freshness is really more an attribute of the pipeline producing the data. Right? People look at data freshness when all they see is their dataset, and they say: when was the last time this dataset was updated? But, really, the other thing you can look into is: is it taking longer and longer to produce this dataset? Does it retry? Does the system fail and retry a couple of times before working? Those are all attributes of the jobs producing the data.
And so that's part of the importance of understanding that graph. And a lot of those data transformations are not linear. Most people start with a certain dataset size, and as they're being successful, their input size will grow and grow. And the job that consumes that data and does something with it may take longer and longer. A join is not a linear time operation: the bigger your dataset, the time it takes is not proportional to the input. And so those are the kinds of things where you will have to maintain your pipeline as you go. Something that was working early on in the life of your product may not work later, just because the processing time doesn't scale linearly with the size of your input. So that's one basic one: data freshness, and understanding why it takes time to do something.
Also, as you get more users or more data sources, the shape of the data may change, right, the distribution of values. And that can also impact processing or data quality. So Great Expectations is one way to get more information about the shape of your input. Another one is looking at how long it takes to process the data. If you have failures, it's important to correlate them with how the code is changing, because you may have changed an algorithm and added some functionality, but broken something else. And as your organization grows and more and more people are involved in modifying the pipelines, the more you have different conflicting changes that may have an impact on the overall system. So several of those are interesting attributes of the data, in data freshness and data quality.
And sometimes it's important to also look at the business metrics that derive from it, not just the data properties themselves. If you do a recommendation engine based on that data, just having Great Expectations metrics on how the distribution of a column is evolving may not be sufficient. You may want to track metrics downstream from that: how does it affect user engagement in some way, and connect that all the way back to how the input dataset changed.
[00:49:54] Unknown:
And what are some of the interesting or unexpected or challenging aspects of building and maintaining the Marquez project that you have learned in the process of going through it?
[00:50:05] Unknown:
Yeah. There's been some growth. Willy mentioned before how we evolved the model: how do we get to this precise and good model of those entities, and then start the integrations. I think once you have this good model, you can start building more integrations in parallel, because once the model is more stable, it's easier to build more integrations, whether it's schedulers or processing frameworks like Spark and Flink and Kafka and all those things. So that's one challenge. The other challenge: one thing we did early on was make sure we talked to other companies to validate the use cases and validate the model, and so start building that community. The first aspect of talking to other companies is whether they want to use the open source project, and then the next level is: do they want to contribute to the project? And so making sure that we are all on an equal footing, building that community. So we started with having this design doc in the open and validating the use cases, validating the model, working with people at other companies, trying it out, figuring out how we work together, and making sure we do all the development in the open so that everyone feels we're all on an equal footing building that project.
So I think that's part of the challenge. Right? How do we make sure that with this project, which is going to become more valuable the more people use it, we all have a feeling of ownership of it, and it's really a community driven project?
[00:51:55] Unknown:
And so Marquez definitely looks like it provides a lot of value and utility for being able to manage the health and visibility of different datasets across an organization. But what are the cases where it's the wrong choice and you'd be better served with a different solution?
[00:52:10] Unknown:
So one thing we keep mentioning in this model is that there's a strong notion of jobs and datasets. Marquez relies on the notion that you have things that depend on each other through datasets. It's an asynchronous type of communication: you produce a dataset, whether it's a streaming or a batch dataset, and someone else consumes that dataset. That's how we model dependencies, and it works well for any kind of batch or stream processing job; the whole data ecosystem kind of works like that, and that's the model, so that's the information we capture. If you're in an environment where every request looks different, and depending on the request you may be sending events to a lot of different things or talking to different types of services, then Marquez is not necessarily the best model for it. For that, look at things like OpenTracing, or projects like Jaeger and Zipkin and other similar projects that look at how requests flow through a system.
Those request flows may not look the same from one request to the next, and you may have a lot of dependencies between the microservices. Then Marquez is not necessarily the best model. We'll definitely look in the future at how we connect those two worlds, because there's a lot of interest in understanding the lineage of the data, not just from when it enters Kafka or whatever data collection system you have, but also understanding upstream where the data is coming from. But it's still a different model. So in that case, Marquez is not necessarily the best system to understand how your microservices depend on each other. It's a related world, but our model is really about this more asynchronous communication between systems, through the datasets.
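As a rough illustration of the jobs-and-datasets dependency model Julien describes, here is a small, hypothetical sketch; the dataset names and the graph-walking helper are invented for the example and are not the Marquez API:

```python
# Hypothetical sketch of a jobs-and-datasets lineage graph: datasets are nodes,
# and each job consumes input datasets and produces output datasets.
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    inputs: list   # names of datasets this job reads
    outputs: list  # names of datasets this job writes

def downstream_datasets(jobs, dataset):
    """Return every dataset that transitively depends on `dataset`."""
    affected, frontier = set(), {dataset}
    while frontier:
        produced = set()
        for job in jobs:
            if frontier & set(job.inputs):
                produced.update(o for o in job.outputs if o not in affected)
        affected |= produced
        frontier = produced
    return affected

jobs = [
    Job("clean_orders", inputs=["raw.orders"], outputs=["analytics.orders"]),
    Job("daily_report", inputs=["analytics.orders"], outputs=["reports.daily_revenue"]),
]

print(downstream_datasets(jobs, "raw.orders"))
# {'analytics.orders', 'reports.daily_revenue'}
```

Modeling dependencies this way makes questions like "which downstream datasets are affected if this input is late or wrong?" directly answerable, which is much harder to express in a request-oriented tracing model.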
[00:54:13] Unknown:
Yeah. So what I found most challenging is controlling the story around Marquez, because every time we went to different teams internally, they had different assumptions about what Marquez was and about the type of metadata Marquez was storing. Depending on who you talked to, it would be metadata around services, or metadata that was very general where you could store whatever you wanted in the repository. The key thing I always had to drive was that Marquez is relevant, and most useful, within the context of data processing. So that was probably the most difficult part: educating our end users on why this is important, what it unlocks, and what they can actually do with the metadata that's stored in Marquez.
[00:54:57] Unknown:
And looking to the future of the project, what are some of the plans that you have, both from a technical standpoint and from an organizational and community aspect, as you continue to evolve and grow it?
[00:55:09] Unknown:
So from a technical standpoint, now that the internal model is stable, it's about having more integrations, like I mentioned: Luigi as another scheduler, and all the things people are using for processing data, so we understand the lineage. That's a part of the project that can really scale in parallel. Different users can contribute different integrations in parallel, and that scales very well in an open source project. For example, with Parquet, once a core model and format representation existed, building integrations with a lot of different things, whether it's Avro, Thrift, Protobuf, Spark, Hive, all of those things, was really easy to do in parallel. I think we are at that step with Marquez, and that's really the next step: building all those integrations so that it becomes more valuable.
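For a sense of what a scheduler integration looked like, the marquez-airflow package was, as best I recall, designed as a drop-in replacement for Airflow's DAG class; the sketch below assumes that import path, the DAG id, tasks, and schedule are made-up examples, and the Marquez endpoint itself was configured out of band (for example via environment variables) rather than in the DAG file:

```python
# Rough sketch of instrumenting an Airflow DAG with the marquez-airflow package.
# The DAG id, tasks, and schedule are illustrative; only the import swap matters.
from marquez_airflow import DAG  # instead of: from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.utils.dates import days_ago

default_args = {
    "owner": "data-platform",
    "start_date": days_ago(1),
}

dag = DAG(
    "orders_etl",
    schedule_interval="@daily",
    default_args=default_args,
    description="Example DAG whose runs and dataset metadata are reported to Marquez",
)

extract = DummyOperator(task_id="extract_orders", dag=dag)
load = DummyOperator(task_id="load_orders", dag=dag)
extract >> load
```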
Another next step, which to me is a natural next step for a project like this, is to possibly move to a foundation. If you want to really show that this project is community driven, not owned by any particular entity and not controlled by any particular entity, and that everybody is on an equal footing in helping evolve the mission of the project and making it successful, then being owned by an open source foundation is a good testament to that. It's how you can help drive community involvement and more contributors, because they know they're going to be on an equal footing with everybody else in the community. So that's also, to me, a next step we're thinking about.
[00:56:48] Unknown:
Yeah, and for me, the next step is building on top of the metadata that we've collected so far, because that unlocks a really cool feature that we've been discussing: data triggers. Since Marquez is aware of when a job modifies a dataset, imagine if Marquez also wrote that change log to a queue somewhere, which a back end system would then listen on and trigger a job based off the dataset being modified. The other thing you can think about is having some sort of health or quality check before the job is triggered, so that before I actually kick off this job, I can ask: are all of the partitions required for this job to run actually present? We could do those types of health checks at that point. So for me, there are so many more things we can do with just the metadata we've collected so far, and I'm very excited about the future of the project.
Well, for anybody who wants to follow along with the work that you're doing and get in touch, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. And I'll start with you, Willy.
Yeah. For me, it would have to be the tooling around ensuring the quality of the dataset that is part of the input of your job, and also the output of your job. I think we've seen over the years amazing tooling around code and visibility into code: you have logging for your application to understand the runtime, and you have metrics for your system to understand its performance and the load on it. There's very little of that in open source around datasets themselves, and I think that's where Marquez really fits in and the problem it's trying to solve. And as Julien mentioned, Great Expectations is one of those really exciting open source projects that allows you to define the shape of your data as well as the expectations you'd like to see met before you actually process that dataset.
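The data-trigger idea Willy describes above was not something Marquez shipped at the time, so purely as a hypothetical sketch, a consumer of such a change log might look like this; the event shape, queue semantics, and helper functions are all assumptions for illustration:

```python
# Hypothetical consumer of a dataset-change log that triggers a downstream job,
# with a partition-presence health check before triggering. Everything here,
# including the event shape and helpers, is invented for illustration.
import json

REQUIRED_PARTITIONS = {"2020-01-01", "2020-01-02"}  # partitions the downstream job needs

def present_partitions(dataset: str) -> set:
    """Stand-in for a metadata lookup of which partitions currently exist."""
    return {"2020-01-01", "2020-01-02"}

def trigger_job(job_name: str) -> None:
    """Stand-in for kicking off a downstream job, e.g. through a scheduler API."""
    print(f"triggering {job_name}")

def handle_event(raw_event: str) -> None:
    # e.g. {"dataset": "analytics.orders", "type": "MODIFIED"} read off a queue
    event = json.loads(raw_event)
    if event["dataset"] == "analytics.orders" and event["type"] == "MODIFIED":
        # Health check before triggering: are all required partitions present?
        if REQUIRED_PARTITIONS <= present_partitions(event["dataset"]):
            trigger_job("daily_report")

handle_event('{"dataset": "analytics.orders", "type": "MODIFIED"}')
```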
[00:58:46] Unknown:
And, Julien, how about yourself?
So, related to what we just said, I think data operations in general is a big missing piece. In the services world there's a very mature way of running your unit tests, deploying, monitoring your application, and running your on-call rotation. In the data world, there's not much in the way of either tooling or even defined best practices. Part of building Marquez is really about how you take ownership of your jobs: how you understand what you depend on, who owns the dataset you're depending on and the job that produces it, and who depends on the datasets you are responsible for.
And as companies grow and you have more and more teams that depend on each other through shared datasets, how do we build a really good culture of data ownership, of depending on each other, and of being on call for it? Especially in a world where machine learning is becoming more prominent, problems in data affect production more and more. It used to be that when a service is down, you most likely impact something right now, whereas when a batch process doesn't work, maybe you impact something in a few hours or the next day, so maybe it's less urgent. But it's becoming more and more urgent and important to have good production practices around data processing. So I think that's one of the gaps, and that's where Marquez helps. And it also connects with all those other aspects of governance and discovery.
[01:00:37] Unknown:
But also, how you take ownership of datasets and jobs and how they're produced.
Well, thank you both very much for taking the time today to join me and discuss your work on Marquez. It's a pretty interesting project and one that I look forward to taking advantage of in my environment. So thank you for your efforts on that front, and I hope you enjoy the rest of your day.
Thank you, Tobias. You too.
Yeah, thanks. I always enjoy talking about metadata, so this was a great discussion.
[01:01:05] Unknown:
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it: email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Sponsor Messages
Interview with Willy Lulciuc and Julien Le Dem
Julien's Background and Experience
Origins of Marquez
What is Marquez?
Missing Features in Existing Metadata Solutions
Capabilities and Use Cases of Marquez
Integrations with Marquez
Benefits of Marquez's Versioning Capabilities
Handling Batch and Streaming Workloads
Data Model of Marquez
Implementation and Architecture of Marquez
Setting Up and Integrating Marquez
Managing Multiple Environments
Challenges and Learnings
When Marquez is Not the Right Choice
Future Plans for Marquez
Biggest Gaps in Data Management Tooling
Conclusion and Closing Remarks