Summary
Building clean datasets with reliable and reproducible ingestion pipelines is completely useless if it’s not possible to find them and understand their provenance. The solution to discoverability and tracking of data lineage is to incorporate a metadata repository into your data platform. The metadata repository serves as a data catalog and a means of reporting on the health and status of your datasets when it is properly integrated into the rest of your tools. At WeWork they needed a system that would provide visibility into their Airflow pipelines and the outputs produced. In this episode Julien Le Dem and Willy Lulciuc explain how they built Marquez to serve that need, how it is architected, and how it compares to other options that you might be considering. Even if you already have a metadata repository this is worth a listen to learn more about the value that visibility of your data can bring to your organization.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- You work hard to make sure that your data is clean, reliable, and reproducible throughout the ingestion pipeline, but what happens when it gets to the data warehouse? Dataform picks up where your ETL jobs leave off, turning raw data into reliable analytics. Their web based transformation tool with built in collaboration features lets your analysts own the full lifecycle of data in your warehouse. Featuring built in version control integration, real-time error checking for their SQL code, data quality tests, scheduling, and a data catalog with annotation capabilities it’s everything you need to keep your data warehouse in order. Sign up for a free trial today at dataengineeringpodcast.com/dataform and email team@dataform.co with the subject "Data Engineering Podcast" to get a hands-on demo from one of their data experts.
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference, the Strata Data conference, and PyCon US. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
- Your host is Tobias Macey and today I’m interviewing Willy Lulciuc and Julien Le Dem about Marquez, an open source platform to collect, aggregate, and visualize a data ecosystem’s metadata
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by describing what Marquez is?
- What was missing in existing metadata management platforms that necessitated the creation of Marquez?
- How do the capabilities of Marquez compare with tools and services that bill themselves as data catalogs?
- How does it compare to the Amundsen platform that Lyft recently released?
- What are some of the tools or platforms that are currently integrated with Marquez and what additional integrations would you like to see?
- What are some of the capabilities that are unique to Marquez and how are you using them at WeWork?
- What are the primary resource types that you support in Marquez?
- What are some of the lowest common denominator attributes that are necessary and useful to track in a metadata repository?
- Can you explain how Marquez is architected and how the design has evolved since you first began working on it?
- Many metadata management systems are simply a service layer on top of a separate data storage engine. What are the benefits of using PostgreSQL as the system of record for Marquez?
- What are some of the complexities that arise from relying on a relational engine as opposed to a document store or graph database?
- How is the metadata itself stored and managed in Marquez?
- How much up-front data modeling is necessary and what types of schema representations are supported?
- Can you talk through the overall workflow of someone using Marquez in their environment?
- What is involved in registering and updating datasets?
- How do you define and track the health of a given dataset?
- What are some of the interesting questions that can be answered from the information stored in Marquez?
- What were your assumptions going into this project and how have they been challenged or updated as you began using it for production use cases?
- For someone who is interested in using Marquez what is involved in deploying and maintaining an installation of it?
- What have you found to be the most challenging or unanticipated aspects of building and maintaining a metadata repository and data discovery platform?
- When is Marquez the wrong choice for a metadata repository?
- What do you have planned for the future of Marquez?
Contact Info
- Julien Le Dem
- @J_ on Twitter
- julienledem on GitHub
- Willy Lulciuc
- @wslulciuc on Twitter
- wslulciuc on GitHub
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- Marquez
- WeWork
- Canary
- Yahoo
- Dremio
- Hadoop
- Pig
- Parquet
- Airflow
- Apache Atlas
- Amundsen
- Uber DataBook
- LinkedIn DataHub
- Iceberg Table Format
- Delta Lake
- Great Expectations data pipeline unit testing framework
- Redshift
- SnowflakeDB
- Apache Kafka Schema Registry
- Open Tracing
- Jaeger
- Zipkin
- DropWizard Java framework
- Marquez UI
- Cayley Graph Database
- Kubernetes
- Marquez Helm Chart
- Marquez Docker Container
- Dagster
- Luigi
- DBT
- Thrift
- Protocol Buffers
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform. If you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances, and they've got GPU instances as well.
Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. And you work hard to make sure that your data is clean, reliable, and reproducible throughout the ingestion pipeline. But what happens when it gets to the data warehouse? Dataform picks up where your ETL jobs leave off, turning raw data into reliable analytics. Their web based transformation tool with built in collaboration features lets your analysts own the full life cycle of data in your warehouse. Featuring built in version control integration, real time error checking for their SQL code, data quality tests, scheduling, and a data catalog with annotation capabilities.
It's everything you need to keep your data warehouse in order. Sign up for a free trial today at dataengineeringpodcast.com/dataform and email team@dataform.co with the subject Data Engineering Podcast to get a hands on demo from one of their data experts. And you listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Corinium Global Intelligence, ODSC, and Data Council.
Upcoming events include the Software Architecture Conference, the Strata Data Conference, and PyCon US. Go to dataengineeringpodcast.com/conferences to learn more about these and other events and take advantage of our partner discounts to save money when you register today.
[00:02:25] Unknown:
Your host is Tobias Macey. And today, I'm interviewing Willy Lulciuc and Julien Le Dem about Marquez, an open source platform to collect, aggregate, and visualize a data ecosystem's metadata. So, Willy, can you start by introducing yourself?
[00:02:37] Unknown:
Yeah. Sure. So I'm Willy, a software engineer at WeWork, and I've been with the company for just over a year now. Since joining WeWork, I've been working on the Marquez team in San Francisco. But previously, I worked on a real time streaming data platform that was powering behavioral marketing software. And before that, I designed and scaled sensor data streams at Canary, which is an IoT company based in New York City. And Julien, how about yourself?
[00:03:03] Unknown:
Hi. I'm Julien. I've been at WeWork for about 2 years. I'm the principal engineer for the data platform, which means that I focus more on the architecture side of the data platform. And before that, I was at Yahoo, then Twitter, then Dremio.
[00:03:20] Unknown:
And going back to you, Willy, do you remember how you first got involved in the area of data management?
[00:03:25] Unknown:
Yeah. So I feel my involvement has been a bit unconventional. What I mean by that is I owe a lot of my understanding of data management to Julien. You know, I draw a lot of my inspiration on the topic from the earlier conversations that we had. So before Marquez was really a thing, Marquez was this really thin data abstraction layer on a diagram that Julien and I discussed. And it really cut across multiple concerns: you think about ingest, you think about storage and compute, and how these components interact.
So back then, we called it the metadata layer. I know the name wasn't as cool, but this abstraction layer would eventually be called Marquez and become a critical core component of WeWork's data platform. So, you know, now, over a year later since we had that discussion, we have the opportunity to tell others about our journey, why organizations invest in tooling around data management,
[00:04:18] Unknown:
and what we've learned, building Marquez at WeWork. And, Julian, do you remember how you first got involved in the area of data management?
[00:04:24] Unknown:
Yes. So, 12 years ago, I was working at Yahoo, building platforms on top of Hadoop. That was the very beginning of the Hadoop ecosystem, and we were building batch processing on top of it. And so that was very interesting. We built schedulers and some new things. After that, I started contributing to open source projects like Pig, and I joined Twitter. At Twitter, I worked on the data platform. I also got involved with building metadata systems to improve how we share data, and I also built Parquet when I was over there. And that was the beginning of having to deal with: how do we scale the organization? How do we manage data at scale and build platforms on top of it? And so that's how I got to finally join WeWork to work on the architecture for the data platform, thinking about those data management problems and getting them right from the beginning.
[00:05:24] Unknown:
And for anybody interested, you were actually on a previous episode, so I'll add a link to that in the show notes as well. And so, as we've mentioned, we're talking about the Marquez engine that you've both been working on now. So I'm wondering if you can just start by describing a bit about what Marquez is and some of the problems you were trying to solve by creating it?
[00:05:46] Unknown:
So Marquez is a metadata management and storage layer. And what it is about is really capturing all the jobs, all the datasets, and, for each job, what datasets it reads and writes to. And this is really about understanding operations: which version of my job consumed what version of a dataset and produced what version of a dataset? And helping with questions like: is it taking longer and longer over time? Who do I depend on? Who is depending on me? And, you know, this problem of data freshness, data quality, all of that, having better visibility and capabilities to ensure you have good quality.
And around that, it also enables a bunch of use cases around data governance, data discovery, and data cataloging. And so it's really about capturing the state of your data environment. So that's kind of the basics of what Marquez is, and it's really about the data lineage, but really from this big graph perspective of jobs and datasets.
[00:06:56] Unknown:
And what was missing in the existing solutions for metadata management that were available at the time you first began working on this project that you felt you could do a better job of addressing with Marquez, rather than trying to build some supplemental resources to tie into those existing engines?
[00:07:14] Unknown:
So I think if you look at the tooling side, at WeWork we use Airflow, for example, which is one of the main open source schedulers around. And Airflow focuses a lot on the job lineage and doesn't know much about datasets. And if you look at other things like Atlas, they know a lot about data lineage and focus more on governance, but they don't really have this precise model of connecting jobs and datasets. So there's kind of the operations side of things, and really having a precise model of those dependencies is missing. And that's kind of why we started Marquez. Right? You also have things like the Hive metastore, which knows about all the datasets and their partitions, and they focus a lot on the dataset, not too much on how jobs depend on that dataset and how people depend on each other. So I think a lot of components exist that touch on the metadata, but they don't really connect all the dots together. And that's kind of what we were trying to achieve with Marquez.
[00:08:19] Unknown:
And so in terms of the capabilities that you have built into it, I'm wondering if you can give a bit of a compare and contrast with some of the other tools and services that bill themselves as data catalogs or metadata layers, and maybe talk a bit about how it relates to the Amundsen project from Lyft that we had on the show previously.
[00:08:39] Unknown:
Yeah. So before we can compare and contrast the differences and similarities between the features enabled by Marquez, we first have to ask ourselves: why do organizations take on the engineering challenge to build their own in house data catalog solution? So for example, Uber has their own internal data catalog called Databook; Lyft, which I think was on a previous episode, has Amundsen; and then LinkedIn recently open sourced DataHub. Mainly, these solutions focus on 3 core features. So you can think about data lineage, which is how you track the transformation of your datasets over time, what the intermediate processes are that touch that data, and the derived datasets.
The other core component is data discovery. So how do you democratize data? How do you get to a point where employees within your organization can trust your data and, if they want to access a dataset, know how to connect and pull that data? The other component is data governance: really understanding who can access what data and whether they have the right privileges to interact with that data. So if we take a few steps back and draw a Venn diagram of the intersection of those features, Marquez is at the center. Right?
But the unique thing that we built out in Marquez is this versioning capability, both for datasets and also for jobs. When I talk about Marquez, that's the real differentiator: the versioning logic that we built in. For example, for datasets, versioning ensures a historical log of changes to datasets. With Marquez, if the schema for a dataset changes, we tie that to a dataset version. If a column is added to a table or a column is removed, that's important and we wanna track that. Similarly for jobs: if the business logic changes, maybe you're adding a filter to a dataset or applying additional join logic, we wanna capture and keep a unique reference, a link to the source code, that allows us to reproduce the actual artifact of the job from the source code itself.
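As a rough illustration of the versioning Willy describes, a dataset version can be thought of as derived from its schema, and a job version from its code reference plus its inputs and outputs. The sketch below is a hypothetical Python illustration of that idea, not Marquez's actual versioning functions.

```python
# Hypothetical sketch: a dataset version changes when its schema changes, and a
# job version changes when its source code reference or its inputs/outputs change.
# Marquez's real versioning functions may differ; this only illustrates the idea.
import hashlib
import json

def dataset_version(namespace: str, name: str, fields: list) -> str:
    """Derive a stable version from the dataset identity and its schema fields."""
    payload = json.dumps(
        {"namespace": namespace, "name": name, "fields": fields}, sort_keys=True
    )
    return hashlib.sha256(payload.encode()).hexdigest()[:12]

def job_version(namespace: str, name: str, source_code_url: str,
                inputs: list, outputs: list) -> str:
    """Derive a version from the job's code reference and its input/output datasets."""
    payload = json.dumps(
        {"namespace": namespace, "name": name, "source": source_code_url,
         "inputs": sorted(inputs), "outputs": sorted(outputs)},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()[:12]

v1 = dataset_version("billing", "public.invoices",
                     [{"name": "id", "type": "BIGINT"}])
v2 = dataset_version("billing", "public.invoices",
                     [{"name": "id", "type": "BIGINT"},
                      {"name": "amount", "type": "DECIMAL"}])
assert v1 != v2  # adding a column yields a new dataset version
```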
[00:11:16] Unknown:
Amundsen focuses more on the data discovery and visualization part of metadata management, and Marquez is focusing more on the operational lineage of data and jobs. And so we actually had a quick hack project where we connected the two as a proof of concept of using them together. So I think that's an interesting thing we could approach in the future, to see how those communities can collaborate and how we can build on top of each other.
[00:11:48] Unknown:
Yeah. Exactly. So before Amundsen was open sourced, we actually had an opportunity to speak with the Amundsen team at Lyft. It was this amazing in person jam session where we talked about metadata, and it ended with a deep technical whiteboard discussion on how those efforts could be combined. So if we scan the features of Amundsen, it supports associating owners with datasets, data lineage powered by Apache Atlas, and data discovery, which is backed by Elasticsearch.
For Marquez, we do have our own UI that we use to search for datasets and explore the metadata that has been collected by our APIs. But the cool thing with Amundsen, and something that Julien touched on, is that they have an API contract, which makes pulling metadata from a back end metadata service into the Amundsen UI very easy. So that becomes a pluggable component in their architecture. And one of our goals is to provide Marquez as a pluggable back end for Amundsen.
[00:12:57] Unknown:
And what are some of the other integrations that you're currently using on top of Marquez, some of the ways that you're consuming the metadata, and maybe some of the downstream effects of having this available that have simplified or improved your ability to identify and utilize these datasets for your analytics?
[00:13:18] Unknown:
Yeah. Sure. So, as Julien mentioned, at WeWork, Airflow has quickly become an important component of our data platform, powering billing as well as space inventory. So, internally, we naturally prioritized adding Airflow support for Marquez. The integration allows us to capture metadata for our workflows managed and scheduled by Airflow, enabling data scientists and data engineers to better debug problems as they come up. One question that a lot of our data scientists and analysts really care about, a common question that's really hard to answer, is: why was my workflow failing? One solution to this, and one key feature of Marquez, is the data lineage graph that's maintained on the back end. So the integration allows us to checkpoint the run state of a workflow, understand the run arguments to the pipeline itself, and conveniently keep a pointer to the workflow definition in version control.
Some of the other integrations that we've been focusing on are with Iceberg. It's a really exciting project that was open sourced by Netflix and is now incubating as an Apache project. Iceberg is a table abstraction for datasets that are stored across multiple partitions in a file system. So with that, Iceberg does allow us to begin to version files in S3 and capture metadata around file systems.
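The Airflow integration described here works as a drop-in wrapper around Airflow's DAG class, so that each task run is reported to Marquez along with its run arguments and a pointer to the code in version control. The sketch below assumes the marquez_airflow package and the one-line import change mentioned later in the episode; treat the names as illustrative rather than a guaranteed match for the current library.

```python
# Illustrative only: swapping Airflow's DAG for a metadata-aware one so every
# task run is reported to Marquez. Package/class names are assumptions.
from datetime import datetime

from marquez_airflow import DAG  # assumed drop-in replacement for airflow.DAG
from airflow.operators.postgres_operator import PostgresOperator

dag = DAG(
    dag_id="billing_rollup",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    description="Loads daily billing rollups into the warehouse",
)

# The wrapper is expected to record, for each run: the run arguments, the run
# state (started/failed/completed), and a pointer to this DAG's version in
# source control, linking the run to the dataset versions it read and wrote.
build_rollup = PostgresOperator(
    task_id="build_rollup",
    postgres_conn_id="analytics_warehouse",
    sql="""
        INSERT INTO analytics.billing_rollup
        SELECT org_id, SUM(amount) FROM billing.invoices GROUP BY org_id
    """,
    dag=dag,
)
```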
[00:14:54] Unknown:
And as far as the capabilities that are unique to Marquez, I know that you have mentioned this idea of linking the jobs that produce given datasets to the datasets themselves and being able to version them together. And I'm wondering if you can talk through the overall benefits that has for being able to consume datasets and ensure the health of the data, and ensure that you have some visibility into when a schema mismatch occurs as a job is deployed, or some of the other information that you're able to obtain by using Marquez as this unifying layer across all of your different jobs and datasets?
[00:15:34] Unknown:
Yeah. So there are a couple of use cases where that becomes very handy. One is, of course, when something goes wrong. I think when you see data processing in companies, a lot of those frameworks and environments are designed with the best case scenario in mind. People know what happens if the job is successful: you produce data and you trigger downstream processing. However, when something goes wrong, it becomes hard to debug. Or if you need to reprocess something, it becomes hard. So Marquez is capturing very precise metadata about when the job ran, what version of the code ran, and what version of the dataset was written, especially if you use a storage layer like Iceberg or Delta Lake, where you have a precise definition of each version of the dataset.
And so when your job fails, or it's taking too long, or the job is successful but the data looks wrong, you can start looking at what changed. Right? You can see, for your particular job, did the version of the code change since the last time it ran, or did the shape of the input dataset change? You could use things like Great Expectations, which is an open source framework for defining declarative properties of your dataset, and verify that they're still valid or that they didn't change significantly. And you could look at that not only for your job, but for all the upstream jobs, because you understand the dependencies.
So often, you have simple things happening, like: why is my job not running? Well, it's not running because your input is not showing up, and your input is not showing up because the job that's producing it is not running. Right? So you can walk that graph upstream until you find the source of your problem. And it may be that there's some input data that's wrong, or it may be that there's a bug that got introduced, and you can figure out what's going on. So first, you have a lot of information depending on what's happening. And second, since you have a precise model and you know for each run what version of a dataset it ran on, if you need to restate a partition in a dataset, you can improve your triggering. You know exactly what jobs need to rerun.
So I think the state of the industry is often that people have to do a lot of manual work when they need to restate something and rerun all downstream jobs. And the first capability that is required is having visibility and understanding all the dependencies: what to rerun. And in the future, you could even imagine using that very precise model to trigger automatically all the things that need to be rerun. Or, if something is too expensive to rerun and it's not worth it, you could flag the data as dirty and something that should not be used. So there are a lot of aspects like this that are important. And I think in a world where you see more and more machine learning jobs happening on data, having the information that a particular training job ran on this version of the training set, using those hyperparameters, and produced that version of the model that was then used in that experiment with an experiment ID, and tying everything together, has a lot of usefulness. Right? Because people need to be able to reproduce the same model. So capturing this information, or, if the model is drifting over time, having the proper metrics and being able to get back to that version of the training set or understand what has changed, whether in the data or in the parameters, is really important. So those are some of the specific things we have in mind when we're looking at this very precise model of jobs and datasets and what's running.
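Julien's description of walking the graph upstream until you find the source of the problem maps onto a simple traversal. Here is a minimal sketch over an invented toy graph; it is not Marquez's internal representation.

```python
# Minimal sketch of the debugging walk described above: starting from a late or
# failed job, follow the lineage graph upstream to find the failed producer.
from collections import deque

# Edges point from each node to what it depends on: a job depends on its input
# datasets, and a dataset depends on the job that produces it.
upstream = {
    "job:daily_report":       ["dataset:invoices_clean"],
    "dataset:invoices_clean": ["job:clean_invoices"],
    "job:clean_invoices":     ["dataset:invoices_raw"],
    "dataset:invoices_raw":   ["job:ingest_invoices"],
    "job:ingest_invoices":    [],
}

status = {
    "job:daily_report": "WAITING",
    "job:clean_invoices": "WAITING",
    "job:ingest_invoices": "FAILED",  # the actual root cause
}

def find_failed_upstream(start: str) -> list:
    """Walk upstream breadth-first and collect every failed job on the way."""
    failed, seen, queue = [], {start}, deque([start])
    while queue:
        node = queue.popleft()
        if status.get(node) == "FAILED":
            failed.append(node)
        for parent in upstream.get(node, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return failed

print(find_failed_upstream("job:daily_report"))  # ['job:ingest_invoices']
```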
[00:19:29] Unknown:
Yeah. And if I could add to that: a lot of what happens as a data engineer is that you work on a pipeline and you deploy changes periodically. But if you update the logic of your pipeline, usually what happens about a week or so later is when you start seeing downstream issues with your dashboards. It's like, hey, is the data wrong? Why do I see a sudden drop in my graph or my dashboard? And that could be related to a number of things. So with Marquez, you have this highly multidimensional model which allows you to say: okay, which job version? At what point was this bug introduced? And also, what were the downstream jobs that were affected by the output of this particular job version? Which allows you to make backfilling a lot more straightforward than what we see now. And, really, I think a lot of data engineering teams tend to avoid that and say, oh yeah, let's just write it off as something we could address
[00:20:30] Unknown:
when the pipeline runs again. Yeah. Being able to identify some of the downstream consumers that are gonna be impacted by a job change, I can see as being very valuable, because it might inform whether or not you actually want to push that job to production now, or maybe wait until somebody else is done using a particular version of a dataset, or at least, as you said, having that visibility into what all the potential impacts are. Whereas if you're just focusing on the one job, it can be easy to ignore the fact that there are downstream consumers of the data that you're dealing with. And then in terms of the inputs to Marquez, we've been talking a lot about discrete jobs and batch oriented workflows, but I'm curious too if there is any capability for recording metadata for things like streaming event pipelines, where you have a continuous flow of data into a data lake or a given table, or that might be fed into a batch job that's maybe doing some sort of windowing functions, and how the breakdown falls as far as batch versus streaming workloads?
[00:21:29] Unknown:
So we do have that in the model. The core entities are this notion of jobs and datasets. Right? And they're attached to a namespace, and that's our modeling for ownership and multi tenancy: jobs and datasets live in a namespace, which captures who's producing them. And then for each job and dataset, we do have types attached to them. And depending on the type, we capture slightly different metadata. So on the dataset types, we have the batch dataset, which could be Iceberg or Delta Lake, usually stored in a distributed file system like S3 or something similar. And we have the more table-like dataset, like if you use a warehouse such as Redshift or Snowflake or Vertica. In that sense, we have a less precise model because we can't really pinpoint a particular version of a dataset. We can't go back to a specific version of the table, but we can version the changes in the schema, so we do capture that. And then the third type is a streaming dataset, so typically something like a Kafka topic, which has a schema as well if you're using the schema registry with Avro like we do. And so we can version that.
And, similarly, we don't have that precise pinpointing of the version, because the job is continuously running instead of having the discrete runs that a batch dataset has. So we have those 3 types of datasets at the moment: a SQL table in a warehouse, a streaming dataset in Kafka, or a batch dataset in S3. And then on the job side, similarly, you have batch jobs and streaming jobs. A batch job has discrete runs, and for both types, we capture the version of the code and when the job started and stopped. For batch jobs, you have discrete runs that are tied to a version of a dataset. And for a streaming job, you still have runs because the streaming job starts and ends, but you have fewer of them, and they're more continuous.
And so you don't have this tracking of versions of datasets. But we do track when the schema evolves, if you update your streaming job, for example, and you add a field to the output. So we do capture those different types of information. So that's the higher level model, and then depending on the type of dataset or the type of job, we try to be more precise in what we capture, depending on each environment.
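To make the three dataset types concrete, here is a small hypothetical sketch of how their metadata might differ. The field names are invented for illustration and are not taken from Marquez's actual schema.

```python
# Illustrative only: the three dataset types discussed here carry slightly
# different metadata. Field names are assumptions made for this sketch.
warehouse_table = {
    "type": "DB_TABLE",
    "namespace": "analytics",
    "name": "public.room_bookings",
    "schema_version": "v7",      # schema changes are versioned...
    "snapshot_version": None,    # ...but the table can't be pinned to a point in time
}

kafka_stream = {
    "type": "STREAM",
    "namespace": "events",
    "name": "booking-events",
    "schema_registry_subject": "booking-events-value",  # Avro schema via the registry
    "snapshot_version": None,    # continuously written, so no discrete run-level version
}

batch_files = {
    "type": "BATCH",
    "namespace": "lake",
    "name": "s3://data-lake/bookings/",
    "table_format": "iceberg",          # Iceberg/Delta give a precise version per write
    "snapshot_version": "snap-000123",  # so each run can pin the exact input version
}
```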
[00:23:55] Unknown:
And I'm wondering if you can dig a bit more into the specifics of the data model for Marquez. I know you mentioned the different entities as far as datasets and jobs, and I'm wondering both what some of the lowest common denominator attributes are that are necessary for it to be useful within the metadata repository, and if there's any option for extending the data models for use cases outside of what you in particular are concerned with at WeWork.
[00:24:29] Unknown:
So we have this notion of job and dataset, and I think maybe job is a little bit of an overloaded term. But when you define a system like this, you always have some terms that have a specific meaning in one area and a different meaning in another area. So by job, we really mean something that consumes and produces data. And so the common denominator is really this notion of inputs and outputs, and having jobs that consume and produce data. So the thing that's always common is you have inputs and outputs, you have a version of the code that was deployed, and you have parameters.
And for a dataset, there's a physical location and an owner attached to it, same as for the job. Right? So this notion of ownership and dependencies is common to everything. And then what we do is specialize the model: we have specialized tables for each type of dataset and job to capture where we can be more precise in one environment, because what we capture in a streaming environment versus a batch environment is not the same. So there's a higher level model that's similar, with the inputs and outputs. And some of the other things we've been thinking about: of course, upstream from your data processing, there are services that depend on each other as well, but the model is slightly different. In our model, you always have this notion of something consuming datasets and producing datasets. So you always have the datasets in between, dependencies between components and the artifacts that people build.
And in the service world, usually, it's direct service to service dependencies. So it's something we haven't really spent a lot of time on, but that people start asking about sometimes: how do you connect both worlds and have the dependency tracking, which people often do with OpenTracing, things like Jaeger and Zipkin, in the service world? Because there's like a duality between the data processing world and the service world, and there are a lot of those concepts that align. And so how do we connect the dots between those things?
[00:26:48] Unknown:
And can you talk a bit about how Marquez itself is actually implemented, some of the overall system architecture, and maybe some of how that's evolved since you first began working on it? Yeah. Sure.
[00:27:01] Unknown:
So Marquez itself is a modular system. When we first designed the original source code and also the back end data store, we wanted to make sure that, first of all, the API and also the back end data model were platform agnostic. When I think of Marquez, I always talk about 3 system components. First, we have our metadata repository, and the repository itself stores all dataset and job metadata, but also tracks the complete history of dataset changes. So you can think of it as: when a system or a team updates their schema, we wanna track that, so we keep a complete history of it; and when a job runs, it also updates the dataset itself, and Marquez on the back end creates those relationships.
The other component is the REST API itself. And if I can talk a little bit about the stack: it's written in Java, and we use DropWizard pretty extensively on the project to expose the REST API but also to interact with the back end database itself. And really the API drives the integrations; one example that we talked about is the Airflow integration that we've done. And then finally, we have the UI itself, which is used to explore and discover datasets as well as the dependencies between jobs themselves, and allows our end users at WeWork to navigate the different sources that we've collected, as well as the datasets and jobs that Marquez has cataloged.
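Because the REST API is the integration surface, exploring the catalog reduces to a couple of HTTP calls. The endpoint paths and default port below are assumptions for the sketch; the Marquez API documentation is the source of truth.

```python
# Illustrative only: listing namespaces and their datasets over the REST API.
# The paths below are assumed for this sketch; check the API docs for the real ones.
import requests

BASE = "http://localhost:5000/api/v1"  # assumed default address for the API server

namespaces = requests.get(f"{BASE}/namespaces").json()
for ns in namespaces.get("namespaces", []):
    datasets = requests.get(f"{BASE}/namespaces/{ns['name']}/datasets").json()
    for ds in datasets.get("datasets", []):
        print(ns["name"], ds["name"], ds.get("description", ""))
```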
[00:28:40] Unknown:
And when I was going through the documentation, it looks like the actual underlying storage engine, at least for your implementation, is Postgres. I'm wondering what the motivation was for relying on a relational database for this, any other supported back ends that you have, and what the benefits are of using a relational engine versus a document store or a graph store for this type of data?
[00:29:05] Unknown:
Sure. You know, for us, Postgres gets us pretty far. When we whiteboarded the data model for Marquez, it was a relational model, so we kind of went with that. There is going to be a point where a relational database cannot get us to the scale that we need, but when we designed the system, we wanted to make sure that it was simple to operate and that there weren't too many dependencies you had to pull in to get up and running. So as we see more and more usage of Marquez internally, we will naturally transition to a graph database, because that gives us richer relationships and allows us to pinpoint, at a node in a graph, what the relationships between a job and a dataset are.
But that doesn't mean Marquez doesn't have a graph database. We actually do. It's called Cayley, which was open sourced by Google, and that's what we use to drive the data lineage graph, which is a key component and really a huge feature of the API itself. A document store, I think, would be a little hard. If you look at what we're trying to model, with a document store, if you think of something like DynamoDB, you do have to do a lot of prefetching and filtering yourself within the application, or you push that down to the actual NoSQL database itself. So for us, it just made sense to use Postgres and then transition over to a graph database as we scale out. And I think one of the obvious pieces
[00:30:39] Unknown:
where you can help scale that model is that we capture all the runs of a job, and when people look at what's happening, they're mainly interested in what has been happening recently. So you can archive all the old runs to a more key value store type of model that would scale easily to storing all the historical runs of all the jobs and all the old versions of datasets. And we're still talking about metadata here, so it's not that much data, but it does accumulate over time. And so from that perspective, I think the relational database gets you pretty far for the number of datasets you have and capturing the metadata for them. And we can adapt as we see people using it in larger and larger environments and data ecosystems.
You can start archiving the historical runs of the jobs to secondary storage that scales better in volume, for something that you may want to look at
[00:31:42] Unknown:
more in aggregate or something like that. And for somebody who's interested in using Marquez, can you talk through some of the overall workflow of getting it set up and getting it integrated into a data platform and maybe some of the work involved in actually populating it with the different metadata objects and records?
[00:32:00] Unknown:
Yes. So Marquez is open source, so you do have the option of just building the JAR itself. If you have a running Postgres instance and you want to apply the Marquez data model, you just point it at that database and Marquez will run the migration scripts that we have, which apply the schema to that database. So that's one option. The other one is: at WeWork, we are heavily invested in Kubernetes, so that is an option as well. We do use a Helm chart to deploy the UI as well as the back end API itself.
So those are two options that someone who wants to get up and running with Marquez has. We also publish a Docker image. So if your organization is in an environment that runs containers and manages them through Kubernetes or some other container management system, you can get up and running that way.
[00:32:55] Unknown:
And then as far as getting the job information and everything in, I know that there are Airflow connectors and you have native clients for Python, as well as an integration that I noticed is a fairly recent addition. So I'm wondering if you can just talk through some of the other work, once you've got it up and running, of actually integrating it into the rest of a data platform to record metadata and job and dataset information, and then also, on the downstream side, setting up consumers to take advantage of that information?
[00:33:26] Unknown:
Right. So as you mentioned, we do have a Python client. We also have a Java client, and we're working on a Go client as well, because there are a lot of applications written in Golang at WeWork. So really the integrations use these clients, which implement the REST API. A lot of the time, when we do integrations with our internal platform components or integrations with open source projects like Airflow, what we end up doing is using the REST API. We have an API for registering source metadata and metadata around datasets, but also an API around jobs. So really, it comes down to just understanding when your pipeline or your application is running and what the friction points are. What we care about is when your application reads data and when it writes data. So those are the two key integration points that we care about.
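To give a sense of what those read/write integration points look like from the client side, here is a hedged sketch using the Python client mentioned above. The import path, method names, and arguments are assumptions for illustration, not the client's documented API.

```python
# Illustrative only: registering a dataset and the job that produces it.
# Method names and signatures are assumptions standing in for the REST calls.
from marquez_client import MarquezClient  # assumed import path

client = MarquezClient(url="http://localhost:5000")

client.create_namespace("billing", owner_name="data-platform")

client.create_dataset(
    namespace_name="billing",
    dataset_name="public.invoices",
    description="One row per invoice, refreshed daily",
)

client.create_job(
    namespace_name="billing",
    job_name="build_invoices",
    input_dataset=["public.raw_payments"],
    output_dataset=["public.invoices"],
    location="https://github.com/acme/pipelines/blob/abc123/build_invoices.py",
)
```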
[00:34:25] Unknown:
Yeah. And as those integrations are contributed to the project, there's less and less work for people to do to integrate. So today, if you use Airflow, you have the Airflow support available right away. But some other companies use a scheduler called Luigi, and currently we don't have Luigi support. So someone who wants to use Luigi with Marquez would have to write a Luigi integration to send the same information. But once that is done, everybody using the Luigi scheduler would benefit from it. And the same applies to Spark. We have integrations for Snowflake and Redshift SQL, and that's something that everybody can leverage. And, really, that's one of the reasons for open sourcing Marquez: it's something that becomes more valuable the more it's used in the open, because people contribute those integrations.
And then the more we have, the easier it is for anyone to use it right away without much work. And so that's kind of
[00:35:33] Unknown:
the advantage of open source in this kind of project. Yeah. And continuing on that, one exciting integration that we've done with Airflow is that we provide a SQL parser. A lot of the time, what we see is that Airflow is used for ETL workloads, mainly reading from S3 and then writing to your warehouse. So what we ended up doing was building this built-in SQL parser that understands which tables are part of your SQL statement, which tables are part of your join, and also which tables you are writing to. And the key thing when we were looking at integrating with Airflow was that we wanted it to be really easy, just plug and play. If you just have to do a one line change to modify which library you're importing, we wanted to make that really simple. So it's just a one line change and, by default, you get all of this rich metadata sent to Marquez.
And, by default, you get a lineage graph that cuts across multiple Airflow instances. Depending on your deployment, you could do a multi tenancy deployment in Airflow or you could have single instances, so there is that opportunity to stitch together the interdependencies between workflows.
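The built-in parser isn't walked through in the episode, but the idea of deriving inputs and outputs from a SQL statement can be shown with a toy, regex-based stand-in; the real parser is more robust than this sketch.

```python
# A toy illustration of the SQL parsing idea: pull input tables (FROM/JOIN) and
# output tables (INSERT INTO) out of a statement so lineage can be built
# automatically. Only handles simple statements; not the actual parser.
import re

def extract_tables(sql: str) -> dict:
    sql = re.sub(r"\s+", " ", sql)
    inputs = re.findall(r"(?:FROM|JOIN)\s+([\w\.]+)", sql, flags=re.IGNORECASE)
    outputs = re.findall(r"INSERT\s+INTO\s+([\w\.]+)", sql, flags=re.IGNORECASE)
    return {"inputs": sorted(set(inputs)), "outputs": sorted(set(outputs))}

sql = """
INSERT INTO analytics.daily_bookings
SELECT b.org_id, COUNT(*) AS bookings
FROM raw.bookings b
JOIN raw.organizations o ON o.id = b.org_id
GROUP BY b.org_id
"""
print(extract_tables(sql))
# {'inputs': ['raw.bookings', 'raw.organizations'], 'outputs': ['analytics.daily_bookings']}
```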
[00:36:57] Unknown:
And in terms of the actual separation there, do you have a different deployment of Marquez for production versus preproduction workflows? Or do you have it all in 1 UI so you can view the entirety of your datasets across all of your environments?
[00:37:12] Unknown:
Yeah. So we follow a fairly standard deployment process. We do have a staging environment for Marquez, and most of what's in it is sort of dummy data, but also, if someone's testing out a new pipeline, we do have that reported to the Marquez back end. And then we also have a deployment process for production. We sometimes sync metadata from production just to provide more populated metadata in staging. That way we can start querying: okay, we added this new field, does it really make sense? Should we drop it? Does it really answer the question that we've been trying to ask? But, yeah, we hooked into CI and we have continuous deployment to both staging and production.
[00:38:00] Unknown:
And as far as the assumptions that you made and the ideas that you had going into this project, what are some of the ways that those have been challenged or updated as you've actually started using it in production and exposed it to other organizations that have started employing it for their environments?
[00:38:16] Unknown:
One of the other metrics for the success of Marquez is looking at coverage of lineage. And when we look at that, sometimes it's a little bit of a moving target. In the Airflow integration, we integrate with Airflow and we have multiple instances of Airflow for multiple teams. So right away, as you deploy the Airflow integration, you see all the jobs. But you may not see all the lineage right away, because to capture the lineage, we have extractors that figure out the lineage for each type of operator people are using inside of Airflow. So we define targets in terms of needing to cover all the operators that people are using, and we start working on that. Meanwhile, of course, people keep innovating and using more operators. And so making sure you define a more standardized way of working together, and making sure that as we include more operators we don't have more and more that needs to be integrated, is a challenge that we've seen in the past.
And so it's important to work with your users: how do you make sure that your lineage coverage target doesn't become a moving target? The more coverage you add, the more coverage you need to have. At the beginning, it was a bit challenging, but as soon as you start paying attention to it, it actually works pretty well. We've seen some efforts, like people starting to use DBT to have lineage information in their jobs. But then they have lineage information just inside the team. Right?
And Marquez gives you lineage information across the entire organization. And so just working together has been important, and making sure we have aligned goals on how we build that. So that's been a little bit challenging from that aspect.
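The per-operator extractors Julien mentions suggest a simple registry pattern: each supported operator type maps to a function that derives inputs and outputs, with a fallback so unsupported operators still show up as jobs without lineage. The sketch below is an illustration of that pattern, not the integration's real API.

```python
# Illustrative extractor registry: lineage coverage grows operator by operator,
# and unknown operators fall back to "job visible, lineage unknown".
import re

def postgres_extractor(task) -> dict:
    """SQL-based operators: derive inputs/outputs by parsing the task's SQL."""
    inputs = re.findall(r"(?:FROM|JOIN)\s+([\w\.]+)", task.sql, flags=re.IGNORECASE)
    outputs = re.findall(r"INSERT\s+INTO\s+([\w\.]+)", task.sql, flags=re.IGNORECASE)
    return {"inputs": sorted(set(inputs)), "outputs": sorted(set(outputs))}

def python_extractor(task) -> dict:
    """Arbitrary Python callables: no lineage unless hints are provided explicitly."""
    return {"inputs": [], "outputs": []}

EXTRACTORS = {
    "PostgresOperator": postgres_extractor,
    "PythonOperator": python_extractor,
}

def extract_lineage(task) -> dict:
    extractor = EXTRACTORS.get(type(task).__name__)
    if extractor is None:
        # Unknown operator: the job is still registered, but coverage drops.
        return {"inputs": None, "outputs": None}
    return extractor(task)
```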
[00:40:19] Unknown:
Yeah. And, you know, it's funny. We do version the database schema that we have for Marquez, and I think we're on version maybe 21. But if you look back at what we initially had, it was just, I think, 3 entities: jobs, datasets, and runs. And if you fast forward to where we are now, we have a far richer data model where we capture not only the run logs, but also the context around the job itself. So recently, with our Airflow integration, we wanted to capture the SQL so that we could display it on the Marquez front end. So we added this job context field, which is just a set of key value pairs that allows you to store additional information about the job itself. When we first started, I think the trickiest part for me was to really understand how we were going to provide this extensive metadata model that allows us to version datasets.
It was always theoretical, but once we got it running in production, our first integration with Airflow allowed us to really expand and implement that versioning logic, which, looking back now, was a far bigger task than I thought it would be. And right now, it's just fairly simple versioning functions depending on the dataset itself. And, also, we did expand on ownership of metadata with namespaces. A namespace allows you to group metadata by context. Initially, we tracked it at the job level, but then we moved that up one level, where we now tie ownership to datasets and jobs. So, really, there have been so many additions and modifications that we've made in the past year, from our first whiteboard session and the first data model that we had for Marquez. Yeah, I think it's really important to have those entities and their relationships right.
[00:42:04] Unknown:
Because from that, it's really easy to add more metadata around each entity. But evolving the entities themselves and the relations between them is a bit harder, especially once you're in production. And so having this notion of jobs, job versions, runs, datasets, dataset versions, and inputs and outputs, and really having the right modeling of what the world looks like, enables a lot of this.
[00:42:33] Unknown:
Yeah. And one last thing: when we thought about the metadata repository, we didn't really want to store schemas. We didn't wanna become a schema registry that stored all the dataset fields, but what we ended up seeing was the need for that. So Marquez now is able to version the fields of a dataset and tie those to a version. When we capture metadata for a dataset, we also capture its fields: the name, the type, and also the description itself. Which is a direction that I didn't think we would take, but it's really paying off and we're seeing some really cool usage
[00:43:08] Unknown:
based off that. And in terms of the description, I know that 1 of the most valuable aspects of having a metadata repository and a data catalog is being able to capture the context of the datasets so that you can understand what their intended purpose is and some of the information that went into the decisions as to how it was produced and some of the schema that was formed. And I'm curious what level of additional annotation is possible beyond just a free form description field or some of the interesting ways that you've seen that leveraged?
[00:43:39] Unknown:
So we have some tagging features, and they can be leveraged to implement privacy or security aspects, or to encode SLAs. Right? Is my data experimental? Is my data production ready? Those are the kinds of aspects people can use it for. Another aspect is adding data quality metrics to the dataset. So we've been experimenting with Great Expectations to do this. And then people can decide; usually it's used in two ways. When you're producing the data, you have some declarative properties enforced on your dataset, and you fail if they don't hold. You don't want to let anybody see that dataset: the code may run and not report any errors, but the result is not correct. And so that can be used as a circuit breaker to not start the downstream jobs and not publish this dataset.
The other way people use it is that the consumers may have different opinions of what the data quality should be for them to run their job. So they can also use it as a pre-validation check, enforcing certain data quality metrics before consuming a dataset and preventing bad data from percolating through the system. Right? Because that can be expensive or have an impact in production, especially if you're doing machine learning or a recommendation engine or things like that. If you have bad data going in, then you have bad recommendations coming out. Right? And that has a real impact on the production systems.
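A hedged sketch of the circuit-breaker use of Great Expectations described here: assert declarative properties on a dataset and refuse to publish it (or to start downstream jobs) when they fail. The expectation calls follow Great Expectations' pandas API; exact return shapes vary between versions.

```python
# Validate declarative properties of a dataset before publishing it.
# A negative amount should trip the check and stop downstream processing.
import great_expectations as ge
import pandas as pd

df = pd.DataFrame({
    "invoice_id": [1, 2, 3],
    "amount": [120.0, 89.5, -5.0],
})
gdf = ge.from_pandas(df)

gdf.expect_column_values_to_not_be_null("invoice_id")
gdf.expect_column_values_to_be_between("amount", min_value=0)

results = gdf.validate()
if not results["success"]:
    # Circuit breaker: don't publish the dataset / don't trigger downstream jobs.
    raise RuntimeError("Data quality checks failed; refusing to publish dataset")
```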
So those are some of the ways people are using it. There are always two aspects: either you have more generic tagging or a flexible type of metadata added to an existing entity, or, if it's something that can benefit from being included in the core model, then it can become an actual attribute
[00:45:46] Unknown:
or an entity in the model. Yeah. And one way we plan on using descriptions is for our search results. So if someone's searching for a dataset, and the owner happened to provide a description for that dataset, we wanna reward the owners of those datasets by moving those datasets up the search results. We do make descriptions optional, but, like I said, we do wanna reward our end users for putting in the extra effort to annotate their datasets.
[00:46:13] Unknown:
And we've talked a couple of times about the health of a dataset. And you mentioned, Julian, the idea of using something like grid expectations for being able to populate some of these data quality metrics. And I'm wondering what are some of the other useful signals as to the overall health of a dataset? And then also things like the last updated field for indicating, when something might be stale or when you might want to get some additional information about why it's not up to date or why it's in a particular state as far as the health of the quality? So, data freshness is often a a property of data
[00:46:51] Unknown:
that you see. So, yes, to me, data freshness is really more an attribute of the pipeline producing the data. Right? People look at data freshness when all they see is their dataset, and they say: when was the last time this dataset was updated? But, really, the other thing you can look into is: is it taking longer and longer to produce this dataset? Does it retry? Does the system fail and retry a couple of times before working? Those are all attributes of the jobs producing the data.
And so that's part of the importance of understanding that graph. And a lot of those data transformations are not linear. Most people start with a certain dataset size, and as they're being successful, their input size will grow and grow. And the job that consumes that data and does something with it may take longer and longer. A join is not a linear time operation: the bigger your dataset, the time it takes is not proportional to the input. And so those are the kinds of things where you will have to maintain your pipeline as you go. Something that was working early on in the life of your product may not work later, just because the processing time doesn't scale linearly with the size of your input. So that's one basic one: data freshness, and understanding why it takes time to do something.
Also, as you get more users or more data sources, the shape of the data may change, right, the distribution of values. And that can also impact processing or data quality. So Great Expectations is one way to get more information about the shape of your input. Another one is looking at how long it takes to process the data. If you have failures, it's important to correlate them with how the code is changing, because you may have changed an algorithm and added some functionality, but broken something else. And as your organization grows and more and more people are involved in modifying the pipelines, the more you have different conflicting changes that may have an impact on the overall system. So several of those are interesting attributes of the data, in data freshness and data quality.
And sometimes it's important to also look at the business metrics that derive from it, not just the data properties themselves. If you do a recommendation engine based on that data, just having Great Expectations metrics on how the distribution of a column is evolving may not be sufficient. You may want to track metrics downstream from that: how does it affect user engagement in some way, and connect that all the way back to how the input dataset changed.
[00:49:54] Unknown:
And what are some of the interesting or unexpected or challenging aspects of building and maintaining the Marquez project that you have learned in the process of going through it?
[00:50:05] Unknown:
Yeah. There's been some growth. Willy mentioned before how we evolved the model: how do we get to this precise and good model of those entities, and then start the integrations. I think once you have this good model, you can start building more integrations in parallel, because once the model is more stable, it's easier to build more integrations, whether it's schedulers or processing frameworks like Spark and Flink and Kafka and all those things. So that's one challenge. The other challenge: one thing we did early on was make sure we talked to other companies to validate the use cases and validate the model, and so start building that community. The first aspect of talking to other companies is whether they want to use the open source project, and then the next level is: do they want to contribute to the project? And so making sure that we are all on an equal footing, building that community. So we started with having this design doc in the open and validating the use cases, validating the model, working with people at other companies, trying it out, figuring out how we work together, and making sure we do all the development in the open so that everyone feels we're all on an equal footing building that project.
So I think that's part of the challenge. Right? How do we make sure that with this project, which is going to become more valuable the more people use it, we all have a feeling of ownership of it, and it's really a community driven project?
[00:51:55] Unknown:
And so Marquez definitely looks like it provides a lot of value and utility for being able to manage the health and visibility of different datasets across an organization. But what are the cases where it's the wrong choice and you'd be better served with a different solution?
[00:52:10] Unknown:
So one thing we keep mentioning in this model is that there's a strong notion of jobs and datasets. Marquez relies on the notion that you have things that depend on each other through datasets. It's an asynchronous type of communication: you produce a dataset, whether it's a streaming or a batch dataset, and someone else consumes that dataset. That's how we model dependencies, and it works well for any kind of batch or stream processing job; the whole data ecosystem kind of works like that, and that's the model, so that's the information we capture. If you're in an environment where every request looks different, and depending on the request you may be sending events to a lot of different things or talking to different types of services, then Marquez is not necessarily the best model for it. For that, look at things like OpenTracing, or projects like Jaeger and Zipkin and other similar projects that look at how requests flow through a system.
Those request flows may not look the same from one request to the next, and you may have a lot of dependencies between the microservices. Then Marquez is not necessarily the best model. We'll definitely look in the future at how we connect those two worlds, because there's a lot of interest in understanding the lineage of the data, not just from when it enters Kafka or whatever data collection system you have, but also understanding upstream where the data is coming from. But it's still a different model. So in that case, Marquez is not necessarily the best system to understand how your microservices depend on each other. It's a related world, but our model is really about this more asynchronous communication between systems, through the datasets.
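As a rough illustration of the jobs-and-datasets dependency model Julien describes, here is a small, hypothetical sketch; the dataset names and the graph-walking helper are invented for the example and are not the Marquez API:

```python
# Hypothetical sketch of a jobs-and-datasets lineage graph: datasets are nodes,
# and each job consumes input datasets and produces output datasets.
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    inputs: list   # names of datasets this job reads
    outputs: list  # names of datasets this job writes

def downstream_datasets(jobs, dataset):
    """Return every dataset that transitively depends on `dataset`."""
    affected, frontier = set(), {dataset}
    while frontier:
        produced = set()
        for job in jobs:
            if frontier & set(job.inputs):
                produced.update(o for o in job.outputs if o not in affected)
        affected |= produced
        frontier = produced
    return affected

jobs = [
    Job("clean_orders", inputs=["raw.orders"], outputs=["analytics.orders"]),
    Job("daily_report", inputs=["analytics.orders"], outputs=["reports.daily_revenue"]),
]

print(downstream_datasets(jobs, "raw.orders"))
# {'analytics.orders', 'reports.daily_revenue'}
```

Modeling dependencies this way makes questions like "which downstream datasets are affected if this input is late or wrong?" directly answerable, which is much harder to express in a request-oriented tracing model.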
[00:54:13] Unknown:
Yeah. So what I found most challenging is controlling the story around Marquez, because every time we went to different teams internally, they had different assumptions about what Marquez was and about the type of metadata Marquez was storing. Depending on who you talked to, it would be metadata around services, or metadata that was very general where you could store whatever you wanted in the repository. The key thing I always had to drive was that Marquez is relevant, and most useful, within the context of data processing. So that was probably the most difficult part: educating our end users on why this is important, what it unlocks, and what they can actually do with the metadata that's stored in Marquez.
[00:54:57] Unknown:
And looking to the future of the project, what are some of the plans that you have, both from a technical standpoint and from an organizational and community aspect, as you continue to evolve and grow it?
[00:55:09] Unknown:
So from a technical standpoint, now that the internal model is stable, it's about having more integrations, like I mentioned: Luigi as another scheduler, and all the things people are using for processing data, so we understand the lineage. That's a part of the project that can really scale in parallel. Different users can contribute different integrations in parallel, and that scales very well in an open source project. For example, with Parquet, once a core model and format representation existed, building integrations with a lot of different things, whether it's Avro, Thrift, Protobuf, Spark, Hive, all of those things, was really easy to do in parallel. I think we are at that step with Marquez, and that's really the next step: building all those integrations so that it becomes more valuable.
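For a sense of what a scheduler integration looked like, the marquez-airflow package was, as best I recall, designed as a drop-in replacement for Airflow's DAG class; the sketch below assumes that import path, the DAG id, tasks, and schedule are made-up examples, and the Marquez endpoint itself was configured out of band (for example via environment variables) rather than in the DAG file:

```python
# Rough sketch of instrumenting an Airflow DAG with the marquez-airflow package.
# The DAG id, tasks, and schedule are illustrative; only the import swap matters.
from marquez_airflow import DAG  # instead of: from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.utils.dates import days_ago

default_args = {
    "owner": "data-platform",
    "start_date": days_ago(1),
}

dag = DAG(
    "orders_etl",
    schedule_interval="@daily",
    default_args=default_args,
    description="Example DAG whose runs and dataset metadata are reported to Marquez",
)

extract = DummyOperator(task_id="extract_orders", dag=dag)
load = DummyOperator(task_id="load_orders", dag=dag)
extract >> load
```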
Another next step, which to me is a natural next step for a project like this, is to possibly move to a foundation. If you want to really show that this project is community driven, not owned by any particular entity and not controlled by any particular entity, and that everybody is on an equal footing in helping evolve the mission of the project and making it successful, then being owned by an open source foundation is a good testament to that. It's how you can help drive community involvement and more contributors, because they know they're going to be on an equal footing with everybody else in the community. So that's also, to me, a next step we're thinking about.
[00:56:48] Unknown:
Yeah, and for me, the next step is building on top of the metadata that we've collected so far, because that unlocks a really cool feature that we've been discussing: data triggers. Since Marquez is aware of when a job modifies a dataset, imagine if Marquez also wrote that change log to a queue somewhere, which a back end system would then listen on and trigger a job based off the dataset being modified. The other thing you can think about is having some sort of health or quality check before the job is triggered, so that before I actually kick off this job, I can ask: are all of the partitions required for this job to run actually present? We could do those types of health checks at that point. So for me, there are so many more things we can do with just the metadata we've collected so far, and I'm very excited about the future of the project.
Well, for anybody who wants to follow along with the work that you're doing and get in touch, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. And I'll start with you, Willy.
Yeah. For me, it would have to be the tooling around ensuring the quality of the dataset that is part of the input of your job, and also the output of your job. I think we've seen over the years amazing tooling around code and visibility into code: you have logging for your application to understand the runtime, and you have metrics for your system to understand its performance and the load on it. There's very little of that in open source around datasets themselves, and I think that's where Marquez really fits in and the problem it's trying to solve. And as Julien mentioned, Great Expectations is one of those really exciting open source projects that allows you to define the shape of your data as well as the expectations you'd like to see met before you actually process that dataset.
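The data-trigger idea Willy describes above was not something Marquez shipped at the time, so purely as a hypothetical sketch, a consumer of such a change log might look like this; the event shape, queue semantics, and helper functions are all assumptions for illustration:

```python
# Hypothetical consumer of a dataset-change log that triggers a downstream job,
# with a partition-presence health check before triggering. Everything here,
# including the event shape and helpers, is invented for illustration.
import json

REQUIRED_PARTITIONS = {"2020-01-01", "2020-01-02"}  # partitions the downstream job needs

def present_partitions(dataset: str) -> set:
    """Stand-in for a metadata lookup of which partitions currently exist."""
    return {"2020-01-01", "2020-01-02"}

def trigger_job(job_name: str) -> None:
    """Stand-in for kicking off a downstream job, e.g. through a scheduler API."""
    print(f"triggering {job_name}")

def handle_event(raw_event: str) -> None:
    # e.g. {"dataset": "analytics.orders", "type": "MODIFIED"} read off a queue
    event = json.loads(raw_event)
    if event["dataset"] == "analytics.orders" and event["type"] == "MODIFIED":
        # Health check before triggering: are all required partitions present?
        if REQUIRED_PARTITIONS <= present_partitions(event["dataset"]):
            trigger_job("daily_report")

handle_event('{"dataset": "analytics.orders", "type": "MODIFIED"}')
```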
[00:58:46] Unknown:
And, Julien, how about yourself?
So, related to what we just said, I think data operations in general is a big missing piece. In the services world there's a very mature way of running your unit tests, deploying, monitoring your application, and running your on-call rotation. In the data world, there's not much in the way of either tooling or even defined best practices. Part of building Marquez is really about how you take ownership of your jobs: how you understand what you depend on, who owns the dataset you're depending on and the job that produces it, and who depends on the datasets you are responsible for.
And as companies grow and you have more and more teams that depend on each other through shared datasets, how do we build a really good culture of data ownership, of depending on each other, and of being on call for it? Especially in a world where machine learning is becoming more prominent, problems in data affect production more and more. It used to be that when a service is down, you most likely impact something right now, whereas when a batch process doesn't work, maybe you impact something in a few hours or the next day, so maybe it's less urgent. But it's becoming more and more urgent and important to have good production practices around data processing. So I think that's one of the gaps, and that's where Marquez helps. And it also connects with all those other aspects of governance and discovery.
[01:00:37] Unknown:
But also, how you take ownership of datasets and jobs and how they're produced.
Well, thank you both very much for taking the time today to join me and discuss your work on Marquez. It's a pretty interesting project and one that I look forward to taking advantage of in my environment. So thank you for your efforts on that front, and I hope you enjoy the rest of your day.
Thank you, Tobias. You too.
Yeah, thanks. I always enjoy talking about metadata, so this was a great discussion.
[01:01:05] Unknown:
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it: email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Sponsor Messages
Interview with Willy Lulciuc and Julien Le Dem
Julien's Background and Experience
Origins of Marquez
What is Marquez?
Missing Features in Existing Metadata Solutions
Capabilities and Use Cases of Marquez
Integrations with Marquez
Benefits of Marquez's Versioning Capabilities
Handling Batch and Streaming Workloads
Data Model of Marquez
Implementation and Architecture of Marquez
Setting Up and Integrating Marquez
Managing Multiple Environments
Challenges and Learnings
When Marquez is Not the Right Choice
Future Plans for Marquez
Biggest Gaps in Data Management Tooling
Conclusion and Closing Remarks