What "Data Lineage Done Right" Looks Like And How They're Doing It At Manta

Hello, and welcome to the Data Engineering Podcast, the show about modern data management.

Atlan is the metadata hub for your data ecosystem.

Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan's active metadata capabilities.

Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value.

Go to dataengineeringpodcast.com/atlan today, that's A-T-L-A-N, to learn more about how Atlan's active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork, and Unilever achieve extraordinary things with metadata.

When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends at Linode.

With their new managed database service, you can launch a production ready MySQL, Postgres or MongoDB cluster in minutes with automated backups, 40 gigabit connections from your application hosts, and high throughput SSDs.

Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show.

Your host is Tobias Macey. And today, I'm interviewing Ernie Ostic about Manta, an automated data lineage service for managing visibility and quality of your data workflows. So, Ernie, can you start by introducing yourself?

Sure, Tobias. Thank you. Ernie Ostic, SVP of products here at Manta Software.

And do you remember how you first got started working in data?

Wow. Since the beginning. I started my career 40 years ago with a small software company in the reporting domain and immediately found myself building data warehouses.

Of course, we didn't call them data warehouses back then, but it was building these, you know, extracts for analysis with extracted mainframe data. And I've been intimate with the bits and bytes of transaction and business data ever since.

In terms of the Manta product, I'm wondering if you can share a bit about what it is that you're building there and some of the story behind it and why you decided that this was the problem space that you wanted to focus your time and energy on.

Great story there. First of all, from a Manta perspective, Manta offers what we call a unified platform for data lineage. We provide deep analysis of a customer's application environment, and we deliver insights on how data flows through that environment. We were incubated as a consultative solution in the Czech Republic almost 10 years ago, and then we expanded into the US around the 2016 time frame.

For me, lineage has been sort of a core part of my career since I got involved heavily in ETL back around the turn of the century. And people were just starting to think about lineage.

But it wasn't really on their minds. It was kind of, you know, almost a religious sell, if you know what I mean. But when governance started to come into play,

I started blogging on the idea of lineage back in, like,

2008 when governance got really heavy in my previous roles. I was with IBM for a long time before coming here about 3 years ago.

It sort of became this new passion and I got introduced to Manta

and knew Manta for several years before I got together with our CEO and just said, you know what? This would be a cool place for me to go. I wanted to get back to a small software environment.

In terms of the main focus of the Manta product, I'm curious what are the core problems that you're aiming to solve and some of the overarching goals of how the product aims to tackle those problems?

The core problems that Manta aims to solve, pipeline visibility.

That's really the primary objective for our customers.

And 3 major use cases that we see. And the first is impact analysis and problem resolution.

And problem resolution is really all about what happens

when things break. Right? We have lots of data pipelines floating around the environment.

A manager's report doesn't appear on their desk, or it comes up blank or with zero values.

You know, there's gonna be phone calls. There's gonna be screaming. Stuff's gonna happen. So how quickly can we determine what went wrong,

fix it, and make sure it doesn't happen again? But on the other end, developers that wanna be proactive

wanna be able to say, I'm about to make this change, the change to a table, a stored procedure, an ETL program.

What am I gonna impact? What might I break?

Or what should I be careful of? Because it will break it. At the very least, it's just notifying people that there is a change that's coming that could cause an issue.

The second and extremely big use case is regulatory compliance. Driven a lot by the governance initiatives over the last 10 years, catalogs are blossoming all over the place.

And there are all kinds of scenarios where people not only need to have better trust in the data that's flowing through their environment, certainly a big initiative for governance, but also

the regulator comes knocking on their door. And they need to be able to

sign off or prove that they understand the data that is being used. They control that data. And without good lineage, there's no way to deliver that. And lastly, we've got a lot of customers that use us for application modernization.

Now what modernization

means for them

kinda depends on their context. But if they're migrating a legacy application up to the cloud and that legacy application was built 20 years ago on some particular database, they wanna get the ROI of the cloud, but they're stuck because they don't know what they need to move. And lineage kind of helps them do what I like to call a metadata inventory.

Right? Understand what you have

and which parts of it you really need to move. There are so many situations where people have, you know, purchased the model 20 years ago. It has thousands and thousands of tables and views and reports and things in it, and there's no real clear idea of which of those pieces they're still using because the subject matter experts have all retired, moved on, or gotten promoted. And so, you know, when you really boil it down, they can look and say, wow, out of these original 176 reports, we're only really using these 27.

So by doing lineage on those 27, we can have a better idea of what we should move and not just have to do a complete lift and shift. It's kind of like, you know, looking at the boxes that you have in your attic as you move from place to place to place. And what happens? A lot of those boxes you never open. They just get moved from place to place to place. So in trying to avoid that, lineage will really help.

In the overall data ecosystem right now, data lineage and metadata systems are definitely a very active topic. And I'm wondering if you can talk to your view of what that market looks like and the differentiating factors between lineage as a problem category and metadata as an overarching umbrella, and maybe some of the points of confusion that are coming up because of the various different conversations and different motivations and agendas that are percolating through the space, and how you view the role of Manta in that context.

Whew. That's a loaded set of things to think about. But,

you know, you started about talking about, you know, metadata and lineage being on everybody's tongues right now. And I think my first reaction to that is

to say, finally.

I mentioned it a little bit earlier, but, you know, 20 plus years ago, you had to preach the idea of solid metadata management. Nobody cared. Nobody had time.

Yeah, people cared a little bit about it, but they weren't really thinking about metadata per se, right? 20 plus years ago, we were worried about Y2K, and that certainly helped kind of drive the need for metadata definition.

But I told you I started blogging about this back in 2008, and lineage was still a mystery to a lot of people. And so 15 years later, we're still here and the problem still remains.

And we have exponential data growth that has, you know, completely taken over, many times more than what we had back then,

and

a whole lot more technologies that are doing it. But I would have to say that, you know, trends overall,

I think in one respect,

governance and catalogs

have helped in pushing this and giving people some level of structure and definition

because people wanna know what they have,

and they wanna appreciate

what it means and where it exists, but also kind of where it flows in their organization.

That's kind of a more of a broad brush look at lineage, but it certainly helps drive it because now business people care about it, and it's not just a techie type of thing.

There is the whole discussion now

that didn't exist before.

And I think it kind of evolved out of, you know, data quality analysis,

but the analysis of privacy, private data,

GDPR, all the new regulations that are coming out now, people really have to care about where that privacy data lives. Certainly, they wanna know because a lot of times it's customer information, and they should do a better job of understanding how they're checking their customers. But they also have to be sure that they know for all the rules that come along with GDPR.

And people are tracking lineage simply to answer that problem alone.

There's so much attention now to operational

runtime metadata. You hear about data ops and observability

and being able to see kind of in real time how data is flowing

largely for the problem resolution

use case. Right? So we can make sure that we can see, you know, hey. There's pipes that are broken. Let's go fix them right now and make sure that nothing goes wrong.

But

you, you know, you kind of talk about

where things are going and things that have our attention. And I'll kind of get to where Manta fits into that in a moment. But the last piece I would talk about is evolving open source metadata initiatives such as Egeria, OpenLineage, and Apache Atlas, which have been coming around now. And I think people just paid lip service to them, but now they finally seem to be turning a corner and more people are kind of aware. And, you know, Manta is in the thick of every single one of those cases in terms of our current development as well as how customers are applying us right now.

In terms of the view of lineage, one of the things you mentioned in there was OpenLineage. I also know that there's the OpenMetadata effort, which has its own perspective of what lineage should look like.

And one of the interesting things that I'm curious to dig into is that on your website, one of the things that you call out is that Manta is, quote, lineage done right.

And I'm wondering if you can talk to some examples of what lineage done wrong looks like.

No. Great question.

Our biggest competitor, in the way I would describe lineage not done right or lineage done wrong, is not doing anything at all. In reality, every single company across the globe, I don't care if they're, you know, 40,000 employees or 4 employees, everybody does lineage. And the way that they typically do it is good old fashioned sneakernet and the game we played in class when we were in grade school, you know, the telephone game. You pick it up and you're, you know, trying to track someone down about this report. Something's wrong with it. You know, in a small place, you can just yell across the wall.

I'm gonna reach out the door and say, where'd you get this information from? But that doesn't happen in a large enterprise, you know, where someone has to call and say, you know, the CEO looked at this report. Something's screwed up in it. Can you tell us where it came from? And they say, well, I didn't write that particular report. So let me go talk to, you know, Candace, and she's not around right now, but I'll leave her a voicemail. You know, or now that we're all remote, she's in Europe somewhere, so it's 6 hours ahead, so we'll have to reach her tomorrow. And then it just goes on and on and on and on all the way down the line.

And so lineage ends up taking forever. I love going through that

kind of interview with people and say, how do you do lineage today?

And, you know, most people laugh at me because they go through that whole rigmarole.

And I always ask them again, well, then how long does it take?

And, you know, delay in trying to make a decision is significant or the trust that someone has in that particular data.

So they're either

doing it in such a fashion that they're not really doing it at all, so it's not really effective,

or they're doing it manually.

They're bringing in all kinds of consultants or they have somebody who's full time just chasing down spreadsheets and trying to come up with their own, you know, textual lineage results in historic

forensics, really on the lineage,

that they can't deliver the kinds of information they need in a short period of time to do the use cases we just talked about.

As far as the problem space of data lineage,

you know, 1 of the ways to think about constructing it is by saying, okay. I'm going to

build the data lineage graph based on the tasks that I am explicitly

executing

in my workflow orchestration engine, or I'm going to build this lineage graph based on the query log in my data warehouse to understand what are the queries that are populating different tables from which tables, etcetera.

And I'm wondering what are some of the gaps

in the overall view of lineage that

are prone to come up when you're using that view of how do I construct this lineage.

And some of the ways that you think about

both leaning into direct integrations where I'm going to use this tool to push lineage information, but also the cases where I need to actually go out and crawl information to

build the missing pieces of that lineage view across the entirety of a data system?

Fantastic question. And 1 that, you know, we work with quite a bit here.

Manta is really known for our core competency of being able to go out into code, existing SQL code, and ETL, and reporting code, and actually chewing through it, and coming out with the actual lineage and the transformations.

But runtime lineage is also a very important thing and being able to go into logs. In fact, this is something that we've started to invest heavily in as we get into the deeper research into things like OpenLineage, which is very event oriented.

And 1 of the concerns that we have for

exclusively

event oriented lineage

is that there are things that could be missing. Right? So

by not having all of the current code that's currently operating, if there's something that only runs once a year, then you may not have a record of that. And, you know, of course, then you get into ideas of how long do you keep your logs around for? And can we afford to save all that disk space?

And so, you know, a particular process that ran only at the end of Q1, and now it's, you know, the middle of August, and that thing only runs again at the end of the year. I have no record of it because I don't have those logs anymore. So we're worried about missing holes.
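The retention gap described here can be made concrete with a small sketch. Everything in it is hypothetical (the job names, the dates, the 90-day retention window) and it is not how Manta or OpenLineage work internally; it only illustrates that lineage rebuilt from run events alone drops any job whose last run is older than the logs you kept, while a scan of the deployed code still sees it.

```python
from datetime import date, timedelta

# Hypothetical run log: (job, source, target, date the job ran).
RUN_EVENTS = [
    ("nightly_orders_load", "orders_raw", "orders", date(2022, 8, 14)),
    ("quarter_end_close", "ledger", "finance_summary", date(2022, 3, 31)),
]

# Hypothetical edges found by scanning deployed code, whether or not it ran.
STATIC_EDGES = {
    ("nightly_orders_load", "orders_raw", "orders"),
    ("quarter_end_close", "ledger", "finance_summary"),
}


def edges_from_events(events, today, retention_days):
    """Rebuild lineage edges using only events still inside the log retention window."""
    cutoff = today - timedelta(days=retention_days)
    return {(job, src, dst) for job, src, dst, ran in events if ran >= cutoff}


today = date(2022, 8, 15)  # "the middle of August"
observed = edges_from_events(RUN_EVENTS, today, retention_days=90)

# Edges that exist in code but are invisible to event-only lineage.
missing = STATIC_EDGES - observed
print(sorted(missing))
```

With these made-up dates, the quarter-end job is the one edge that the event-only view loses.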

Another thing that, you know, is really important is to be able to do the impact analysis before code is ever in production.

So during the design process, either because I'm doing an update and doing maintenance,

or during an early part of the construction of an application, I wanna see what the lineage looks like. And with code analysis, I can do that right up front. We also have customers that look at prototyping

and actually laying out what will lineage actually look like before I've even written a single stitch of code. And Manta provides the ability to be able to do that. Things that could not be done until I've actually run a process. And we really want to move

the lineage

ability in both directions. Right? Move it in the direction so that it can be powerful in a runtime scenario,

but also move it in a direction so that an architect can start to lay out what the lineage is before anyone has actually put a finger in their IDE.

That sounds like a dangerous operation.

It's almost like sticking it in the light socket.

Yeah. Right.

Digging a bit more into the work that the OpenLineage team is doing and just the overall concepts of what pieces of information do you need to be able to have

a useful and comprehensive view of lineage is what are the attributes

that are necessary

to be able to be tracked consistently

across the board from all the different systems that you're working in? And what are the attributes that are

nice to have because they provide extra context, but aren't a core requirement to be able to actually operate on the lineage graph that you're building?

Yeah. Great point. And, you know, one of the things that we're excited about with OpenLineage is that it's going to provide an opportunity for vendors to sort of prepare other people for consuming lineage from their solution right off the bat. And that's important, especially for newer vendors that are, you know, trying to put their footprint in the market, and they wanna get out there, and OpenLineage can give them a really nice checkbox to go ahead and do so.

At the very bare minimum,

the events of what you wrote to and what you read from and what is the object. Is it a notebook? Is it something you call a job? Is it something that you call a workflow? I mean, everybody's got a different word for it. But in the end, it's probably reading something and writing something.

You know, you take the Spark world, and it's a data frame. If it's a piece of code that's pulling JSON documents out of Kafka and pushing them into MQSeries, right, that's a whole other paradigm to look at, but it's still reading somewhere, writing somewhere.

And so you need to have at least those 3 things. What is doing the reading and writing?

What am I reading from?

And what am I writing to? Now

everything changes when you start to dig deeper into that because that's the least, and that could be

more than good enough for a lot of situations.
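As a rough illustration of that bare minimum, here is a sketch of a lineage record with three required parts (what did the work, what it read, what it wrote) and everything else optional. The class and field names are made up for this example; this is not the OpenLineage schema or Manta's data model.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List, Optional


@dataclass
class LineageEvent:
    """Hypothetical minimal lineage record; field names are illustrative only."""

    job: str            # what is doing the reading and writing (job, notebook, workflow)
    inputs: List[str]   # what it reads from
    outputs: List[str]  # what it writes to
    # Optional detail that may or may not be available for a given technology:
    column_map: Dict[str, List[str]] = field(default_factory=dict)  # target column -> source columns
    ran_at: Optional[datetime] = None
    rows_written: Optional[int] = None

    def is_minimal(self) -> bool:
        """True when the three bare-minimum pieces are all present."""
        return bool(self.job and self.inputs and self.outputs)


# The Kafka-to-queue example from the conversation, with invented identifiers.
event = LineageEvent(
    job="kafka_to_mq_bridge",
    inputs=["kafka://orders"],
    outputs=["mq://orders.queue"],
)
print(event.is_minimal())
```

Keeping the richer fields optional means a source that can only report "read here, wrote there" still produces a usable record.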

But the next question that comes along is, well, can you tell me which column it is?

And for a lot of those different use cases, you can. So columns become sort of the next level, but it's not like the MVP or it's not the bare minimum. Bare minimum, you need this thing it's writing and what it's writing to. Then I wanna get into columns.

Alright? Understand if it's, you know, some complex JSON schema, of course, I can dive into the columns. If it's a database like Snowflake, of course, I can get into the columns. But then there's also lots of gray areas where that can't be true.

Or you get situations where it's

completely unstructured.

I had a customer ask me the other day, how do I track the MP4s that I'm moving around on different locations within Amazon S3?

Well, there's no columns in an MP4. Right? So things change, but you have to kinda look at it. So back to your properties. Right? So you've gotta know what it's reading from, what it's writing to, and what is actually doing the work.

We potentially want to get down into columns if we can. It's kind of like a luxury if you can do that.

Then the next step says, okay.

Can I get exactly the details on the time? I wanna know the time it ran. If it only ran every 6 months like my previous example, is that important to me? Maybe I only wanna look at the lineages in the last, you know, week because everything else is immaterial to me. And then we start getting to the really cool optional stuff, which is

how many records did it actually load?

Right? And so when you get that question from the manager who's looking at a report and he's not sure he should trust it, knowing when it was loaded,

but also exactly how many records were loaded

might give him some indication

or validations.

And so as you go further and further and deeper, those are the kinds of things that you get. When you get into the real luxury items, you wanna know a little bit more about the data quality score that went along with it. And in fact, you know, was there a check against the different algorithms that were run during that particular transformation?

And finally,

and this is where we get more into design lineage,

it's picking up the actual transformation expressions themselves.

Right? There are use cases where someone says, I need to see the actual rule that was used

to come up with this particular calculation. And without looking at source code, that becomes very difficult to do.

I think it's interesting that you are factoring some of the data quality

and

some of the observability

elements into that lineage question because in a lot of the conversations I've had, they're generally disjoint because

mostly the vendors who are trying to work in those different spaces haven't started to edge into the overlapping areas. And it's all metadata at the end of the day, which is another interesting element of

your core focus or the way that you're advertising your product is around this concept of lineage, but you are branching out into just broader metadata

and things that wouldn't necessarily be considered lineage by a number of people who are,

at least passingly familiar with the space.

And I'm wondering if you can talk a bit to some of the, maybe, the product design or the ways that you and your team are thinking about lineage and data quality and metadata and the overlaps

and how each of those may

be potentially disjoint and how you think about the boundaries between them, or if that is an artificial construct and it's actually natural to say data quality and data lineage are actually part of the

same system, not even 2 sides of the same coin.

They're definitely

super tightly related

and I think 1 of the most interesting

discoveries

that

I made when I was first getting into Manta

was working with 1 of our customers

who had their own

quality analysis work.

And they used a feature at that time, a very little-known and little-used feature of Manta

to stuff their own

custom data quality scores back into Manta.

And

Manta ourselves, we're focused on code. We don't look at data. Right? There are lots of companies out there that do a fantastic job of reading data, and we're starting to partner with some of them. Some of these blossomed out of this idea that came out of one of our customers a long time ago. So they took data quality scores, and they put them into the actual nodes of lineage that Manta was already scanning and reporting on.

So if they can take a data quality score, they would actually put it into, let's say, an Oracle table that they had detected, which is now in the middle of a lineage path. And so that the user who is looking at this lineage could then go see, oh, well, last night's data quality score for this particular table or this particular column is, you know, 87?

Alright. That's a good score. But if the score shows up as 42,

then there's something wrong here. And, you know, the analogy is basically saying and what this user had kind of aligned for me is that if I have a lineage and it's like, if I drew it on a whiteboard,

you know, complex picture with all kinds of boxes and lines,

if I had yellow or red sticky notes that I could stick onto that lineage in the particular places where they're appropriate,

I could have a better idea of judging which of these data quality problems I really should address first, which ones lead into other ones, and possibly set up prioritization on which of these I should address first. It actually inspired us

to create a feature that we call active tags. So

users have this ability to add metadata into Manta, and we can put in an active tag that recognizes it in the visualization and actually puts a big fat red icon on whatever column or whatever piece it is that that particular score corresponds to. It has since inspired

integration that we have been working on with several of our partners whose core competency is not lineage,

but they're very good at diving in and coming up with all kinds of algorithms that analyze data in terms of quality

or in terms of privacy.
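The "sticky notes on a whiteboard" idea can be sketched as quality scores attached to nodes of a lineage graph, with the low-scoring nodes ranked by how much sits downstream of them. The graph, the scores, and the threshold of 60 below are all invented for illustration; this is not Manta's active tags feature or its API.

```python
from collections import deque

# Hypothetical lineage graph: node -> nodes it feeds.
EDGES = {
    "oracle.customers": ["stage.customers"],
    "stage.customers": ["dw.customer_dim"],
    "dw.customer_dim": ["report.churn"],
    "oracle.orders": ["dw.order_fact"],
    "dw.order_fact": ["report.churn"],
}

# Hypothetical quality scores pushed in from an external data quality tool.
SCORES = {"oracle.customers": 42, "dw.customer_dim": 55, "dw.order_fact": 87}
THRESHOLD = 60  # assumed cut-off below which a node gets a "red sticky note"


def downstream(node, edges):
    """Return every node reachable from `node` (breadth-first walk)."""
    seen, queue = set(), deque([node])
    while queue:
        for nxt in edges.get(queue.popleft(), []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def prioritize(scores, edges, threshold):
    """Flag low-scoring nodes and order them by how many nodes they feed."""
    flagged = [n for n, s in scores.items() if s < threshold]
    return sorted(flagged, key=lambda n: len(downstream(n, edges)), reverse=True)


priority = prioritize(SCORES, EDGES, THRESHOLD)
print(priority)
```

Ranking by downstream reach is only one possible heuristic for deciding which problem leads into the others; it simply favors fixing the issue closest to the source first.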

And now digging more into the Manta platform itself, can you talk through some of the implementation

of how you've approached this challenge of being able to

build and maintain these lineage graphs and provide these insights and some of the overall

design and capabilities of the platform and how you went about building them?

Sure. Absolutely.

Manta started as a code analytics tool

for helping with code standardization.

And it became very clear that in the work that we're doing there, being able to lace together,

you know, column to column, source to target types of

pictures and streaming

with edges for connecting lineage and all the different nodes was something that we could make use of in this growing governance space. And that's kinda how we first got started.

And in doing so,

really being able to help

customers that were doing work in governance early on 6, 7 years ago,

who needed to be able to examine the code that they have. So that gave us an opportunity to say, well, what code are you talking about? Well, in so many cases early on, it was about their analytic systems.

So the fallout of the list of technologies that we needed to support became very clear. Yeah. Relational databases,

Teradata, Oracle, etcetera.

The ETL tools like Informatica and DataStage and Microsoft SSIS that move data between them. And, of course, the reporting tools that are out there, the BusinessObjects and the Cognoses of the world.

Well, you know, fast forward,

we had to build teams

that were able to work on the actual

analytics. Right? That hardcore

work to figure out exactly how does data flow through a particular tool. What does its metadata look like?

How does this column flow into that column? Number 1.

The merging ability,

because 1 of the keys to our customers getting end to end lineage is how do you reconcile through a target? You know, you have a stored procedure that's in Oracle that writes to 6 different tables.

But then

the ETL tool, like Informatica, has to read from those tables. So somehow you have to reconcile that you understand that, okay, tool X writes to a table

called person

and tool Y

reads from a table that's called person. Well, that's fine if it's Oracle, but what if it's a sequential file? You know, maybe it got copied and renamed.
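A toy version of that reconciliation step might look like the following. The tool names, the case-insensitive matching rule, and the hand-maintained alias table for a renamed file copy are all assumptions made for this sketch; the actual merging logic in Manta is not described in the conversation.

```python
# Hypothetical scan results from two different tools.
WRITES = [("oracle_stored_proc", "PERSON")]  # (tool, dataset it writes)
READS = [
    ("informatica_mapping", "person"),       # (tool, dataset it reads)
    ("file_loader", "person_copy.dat"),
]

# Assumed hand-maintained aliases for copies that were renamed along the way.
ALIASES = {"person_copy.dat": "person"}


def stitch(writes, reads, aliases=None):
    """Join writers to readers wherever they refer to the same dataset."""
    aliases = aliases or {}

    def key(name):
        # Assumed matching rule: trim, lowercase, then resolve known aliases.
        normalized = name.strip().lower()
        return aliases.get(normalized, normalized)

    writers = {}
    for tool, target in writes:
        writers.setdefault(key(target), []).append(tool)

    links = []
    for tool, source in reads:
        for writer in writers.get(key(source), []):
            links.append((writer, key(source), tool))
    return links


links = stitch(WRITES, READS, ALIASES)
for writer, dataset, reader in links:
    print(f"{writer} -> {dataset} -> {reader}")
```

Without the alias entry, the renamed file would stay disconnected, which is the kind of break in end-to-end lineage being described.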

But, ultimately, being able to reconcile that is all part of our platform. But the real key then is,

how do you then show it to users, which is a real key for us is making lineage easier to consume and do it in a way that's performant.

Right? So it meant an evolution

of

the technology that's under the covers. We used to put it in a relational database

that ran out of gas because the kind of customers that we're talking to, they have massive amounts of lineage metadata.

They ran out of gas. We moved to a graph database.

Right? We put it on a graph database.

Finally, it ran out of gas. It was an open source graph database, a good one, but it couldn't handle the kind of volumes and performance that we needed. We've since, you know, gone through the engineering to put the premier graph database, Neo4j, it's public, underneath the covers, because we needed the kind of radical performance so that people who have hundreds of millions of connections in their lineage would still be able to get performance when they're loading it and, of course, when they are retrieving lineage information for their research.

So that's just a few of the areas, but our platform,

and that's the reason we call it a unified platform for lineage.

It's not just about pretty colors and laying out the lineage in a graphical fashion, but all the research that you need to do and the infrastructure needed to support it. I think the last thing I'll say is that as new technologies

come out, it feels like there's a new one every week, a new ETL tool or reporting tool on the web

or

old legacy scenarios

still get, you know, products from

when I started my career that people want to do lineage for, even if it's just for a small number.

And so, you know, being able to expand,

you know, our team is able to

study a new technology

and then decide how to support it best within this paradigm.

And that includes OpenLineage.

Bigeye is an industry leading data observability platform that gives data engineering and data science teams the tools they need to ensure their data is always fresh, accurate and reliable.

Companies like Instacart, Clubhouse and Udacity use Bigeye's automated data quality monitoring,

ML powered anomaly detection, and granular root cause analysis to proactively detect and resolve issues before they impact the business.

Go to dataengineeringpodcast.com/bigeye

today to learn more and keep an eye on your data.

As far as the

interesting challenges that you have come across as you've grown the size of your customer base and been exposed to more varied operating environments and varied requirements. I'm curious, what were some of the,

I guess, interesting surprises that you encountered as to,

wait, people wanna do what with lineage, or they wanna store what information, how? Just some of the kind of

discoveries that you've made along that path that have directed the focus of your product and the evolution of its implementation?

I think that the

mention that I made earlier

about

moving MP3s or MP4s around from place to place

is an example

of 1 of the surprises

And what I mean by that is, you know, people come to us, they just purchased an ETL tool that's new on the market. And I'll go back 2 years.

We were about to release a scanner

for Azure Data Factory.

And, you know, 2 years ago, we had customers saying, okay, you guys do Informatica. You do all these other ETL tools. We really, really like Azure Data Factory. Is that one you can put on your list? You talk to enough people, and you weigh the market, and you put it on your list. But we would get requests for things like, well, we wanna have you scan Amazon S3.

And, of course, our immediate question always is like, but there's no lineage there, right? The lineage to S3 is done by Snowflake. It's done by Informatica. It's done by DataStage. It's done by people's Python code, right? So there's no lineage in S3 itself,

and people have said to us but you guys do lineage. We have stuff out there that we wanna be able to track.

And so for us

and this isn't the only thing that we've been exploring lately, but it's 1 that I think that we're excited about now delivering is more generic things so that people can build lineage that they need

for things that aren't necessarily scannable. I mean, I had somebody ask me about, well, I have FTP scenarios where people still FTP manually.

Well, there's no scanning of that. Somebody gets in, they log in, they run an FTP. What's there to scan? There's nothing to scan, but you still need to be able to illustrate that kind of lineage.

So 1 of the things that we provide for people is the ability to build their own lineage, and do it as easily as they can. Yes.

We want a huge percentage to be automated, because if it's not automated, then what are you gonna gain? But there's always those little snippets and pieces where you're gonna need to be able to take the lineage from somebody's head. What are they thinking about? And be able to

illustrate that lineage, because the subject matter expert is the only person that knows that this particular,

we'll use MP4 as an example since it's so obtuse,

knows that these MP4 files that have to do with, I don't know, human resources

and all of the in person interviews that we do have to get moved from here to here as we, you know, go through the recruitment process.

That's something they wanna track for lineage. It's personal data. There's a hundred other things that it might go along with, and it simply needs to be dragged and dropped

and establish lineage

without it. So I think that that's been 1 big surprise for sure.

And so as far as the process of getting onboarded onto Manta and integrating it into an existing data platform,

what is involved in actually

being able

to both push

the lineage metadata into the Manta platform, as well as being able to do that discovery of these various different assets and how they're flowing through the different workflows and

maybe some of the

design and modeling considerations that go into how you want to compose your different lineage subgraphs?

Well, basically, for a person that gets started with Manta, they install it.

Today, it is an on-premises solution, although we're very close to

releasing our first fully

SaaS offering.

But, ultimately, in either case,

somebody has to sit down and decide, okay. What is it we want to first look at for lineage?

And, you know, our customers will typically have some particular path that they need to take. Right? A large bank wants to support their risk department, and their risk department is, you know, doing analysis on

commercial loans. So they have

a set of technologies that support them. There may be a data warehouse

or other scenario that's sitting in Oracle with data at some particular point in time. That data is being manipulated by some ETL tools and gets put up into Snowflake. Maybe from Snowflake, it's being read by a tool like Tableau. So it's just a couple of tools out there. But if they're examining that particular application,

it already gives them sort of a subset of which tools they need to focus on. So they might say, okay. We're gonna start with Oracle. So the first challenge is which Oracle database do we need? Can we get there?

Establish what the credentials are. Can we get there directly?

And can I deal with the internal politics? You name it. You know, somebody's gonna say, well, I don't want you accessing that particular data space at this particular time,

or I'm not gonna give you credentials to that database because you're not allowed to actually take a look at it and see it. And this really depends on who's driving

the lineage requirement and how high up in the organization do they have sponsorship?

But ultimately, connectivity becomes one challenge. Then, which schemas do we care about? Do we want the whole database? Do we want all 22,000 assets that are in it? Oh, all we need for the loan stuff is these 2 schemas. Alright. So we start to focus on that.

That all is kind of a one-time thing. But, of course, depending on how many firewalls there are or the permissions that you get, that could be, you know, a 5 minute exercise, or it might take you 5 hours as you chase down, you know, the politics within your organization.

Ultimately,

Manta goes out, does an extraction, pulls out the whole pile of pieces that are necessary, performs that analysis, and puts it into the graph database where it can now be reviewed. The next kind of challenge that people go through is to say, okay.

Fine.

Now I have all this Oracle lineage. I have all those pieces. Let me move on to the next tool in the process. And so you're gonna go through the same thing.

But as I did mention at one point earlier in the call, once I bring in the ETL tool, let's say that it's a DataStage ETL tool, then it's gotta read from the Oracle table. So we gotta make sure that there's a reconciliation and meshing. If the DataStage tool calls the Oracle system something different, okay, then we have to have settings that can figure that out. But once you figure out how the Oracle system is called, pretty much everything else follows down the line, simply because at least most tools in this marketplace will look at a database and understand database, schema, table, column. That's all the same. I just got to figure out, well, this guy connected with ODBC, this guy connected with JDBC, this guy called the resource, you know, commercial loan prod, and this one called it commercial loan p. Okay. Once you reconcile that, the rest is the same. So it's kind of like one-time settings.

From that point, it's just gonna work.
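The reconciliation step described here can be sketched in a few lines. This is only an illustration, not Manta's actual settings format: the alias table, the function name `canonical_column`, and the choice to lowercase identifiers are all assumptions made for the example.

```python
# Hypothetical alias table: every name a tool uses for a resource
# maps to one canonical resource name. Not a real Manta setting.
ALIASES = {
    "commercial_loan_prod": "commercial_loan",
    "commercial_loan_p": "commercial_loan",
}

def canonical_column(resource, schema, table, column, aliases=ALIASES):
    """Return a tool-independent identifier for a column.

    Assumes identifiers can be compared case-insensitively.
    """
    key = resource.strip().lower().replace(" ", "_")
    resource_name = aliases.get(key, key)
    return (resource_name, schema.lower(), table.lower(), column.lower())

# The ETL job and the database scan now resolve to the same node.
from_etl = canonical_column("Commercial Loan P", "RISK", "LOANS", "BALANCE")
from_db = canonical_column("commercial_loan_prod", "risk", "loans", "balance")
```

Once both tools resolve to the same tuple, the edges they report can be stitched into one graph, which is why the mapping only has to be set up once.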

So those 2 things are crucial. And that continues as you go through the pipeline. I think the next thing that people start to look at is, well, how often? How often do we change this code?

And the Tableau reports might be changing

daily,

but that Oracle database

written 10 years ago,

maybe not so much. Right? So maybe we don't have to scan that 1 as frequently

and, you know, so they start looking at timings like that. Does that help?

As teams are going through this process of onboarding different systems

and maybe integrating custom lineage information for different

sources that aren't supported out of the box,

What are some of the points of friction that they're likely to run up against either because of some of those internal politics or because of some of the

technical requirements that are inherent to being able to

collect and maintain this information.

So great point. When we get into a hybrid scenario,

right, you've got a lot of automated lineage,

and then there's some pieces that need to be bridged.

Bridged because it's a technology that Manta doesn't support

or might be

some technology that we do support, but it was a new feature that was put into that utility and, you know, we don't support it yet.

Or like the, you know, the MP4 example that I brought up earlier.

So the first thing is to

kind of follow a methodology,

which is what is it that my user wants to see?

And, you know, if it's some particular,

let's say, homegrown ETL tool that has been,

you know, in the company for a long time,

Asking what your user wants to see is important, kinda comes back to the discussion that we had earlier about attributes. So does your user,

mister report writer or the CEO, if she wants to actually,

you know, see column information,

what do they actually need to see in terms of the lineage? If they only need to see that it came from this application goes to that application,

that may be as deep as you need to go. And some people have asked us for super high level lineage.

But if I have a technical person involved, they might say, well, I need to know exactly which

ETL

program, which workflow it actually was, and maybe which module within that workflow.

Okay. That's fine. But if you want that kind of lineage and you need to do it manually,

that means you have to dig deeper. It's gonna be a little bit more complex.

Now, if you do application-to-application lineage, you could sit someone down with a

CSV file in notepad or Excel, and they could just drop the lineage in an hour. The subject matter expert could lay the whole thing out. But if you need to get into the depths of the individual modules

or you want individual column lineage or you want transformations, now you have to find out where do I get that information?

Is the code even accessible?

Right? So you have to decide where can I get that from? Is it in a spec? Is it in some metadata somewhere?

Or is it, again, up in somebody's head?
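As a rough illustration of the application-to-application case, here is a minimal sketch of turning a spreadsheet-style CSV into lineage edges. The column names (`source`, `target`, `note`), the sample rows, and the function `load_edges` are assumptions for the example, not a documented Manta import format.

```python
import csv
import io
from collections import defaultdict

# Hypothetical application-level lineage, as a subject matter expert
# might type it into Excel or Notepad.
MANUAL_LINEAGE_CSV = """source,target,note
HR Interview Recorder,HR File Share,MP4 files moved manually
HR File Share,Recruiting Archive,weekly manual FTP
"""

def load_edges(csv_text):
    """Parse source,target rows into a downstream adjacency map."""
    downstream = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        downstream[row["source"].strip()].append(row["target"].strip())
    return dict(downstream)

edges = load_edges(MANUAL_LINEAGE_CSV)
```

Column-level or transformation-level lineage would need more columns per row, which is where the effort starts to climb.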

And finally, the last question I think needs to get asked when you get into those hybrid situations is how many?

My favorite example on this one is somebody did ask me about Easytrieve, which was a mainframe, JCL-based report writer back from the nineties. And they said,

well, we only have 27 reports, and we are going to replace them at some point with Tableau. But we have 18 months where we need some lineage. And I said, well, if it's only 18, how often do they get edited? And they laughed at me because, since they're from the nineties, they're not getting edited at all. Alright. So once you know what the volume is, then you can decide:

is this something where I can sit a subject matter expert down in front of a spreadsheet and just lay out what the lineage is? Or should I start thinking about writing my own parser,

you know, similar to what Manta does, but for this old technology that probably neither Manta nor any other third-party vendor is gonna bother writing lineage for? So do I have to go through that exercise? Well, if it's only 27,

no. What's the threshold?

Probably somewhere around, you know, a hundred, or 50 to 100, right? That's when somebody says, well, we better do the effort of writing a little program to parse through this because that's just too many to have someone, you know, update on a manual basis.

I think that's a good opportunity for me to talk quickly about partners. We have partners who

build

scanners

for technologies that they're really deep experts on and that we have not yet had a chance to write because of resources or the market or the number of customers. In fact, we've already listed some of these on our website. One of our partners wrote

a scanner for dbt.

We didn't have that yet, but we provide the facility for them to go ahead and do so and inject that lineage.

And so that expands our footprint.

Once you have a lineage graph built, that's all well and good. You can go and point and click around and say, oh, this is fancy. I can look and see things, but I'm curious how that

materially impacts

the work of the different stakeholders of that data and some of the ways that it gets factored into their workflows or integrated into their tool chains to be able to

provide some of those rich insights or detailed debugging capabilities

as you are building and scaling and maintaining a data system?

So for us, the visualization is certainly the flashy thing, and it demos well. It's nice to actually go and see.

But when we talk about the workflow for the different people, you know, certainly maybe not the decision makers, but probably the people who support the decision makers are going in to look at lineage that way. I've talked about developers who might go in and look at a stored procedure and see how it's gonna impact things downstream.

But all of the lineage that you see visually

is also accessible

through APIs

and other methods that you can go get access.

So in some cases,

people will actually have a predefined program

that will go in and do a comparison

for a particular object

and actually see what the lineage looks like

between this week and last week, and not use our GUI to do it, but get an extract that they can look through programmatically

to actually say, okay, I'm watching this particular stored procedure.

I see that it actually changed.

Right? You know, Manta finished its scan, so I kick off this program that goes and automatically checks the lineage, and then can send an alert to someone that says,

hey, Tobias. Your stored procedure that you cared about has changed.

Right? Manta's flagged that. We got it through an API.

You might wanna go look at it or go ahead and do things. So it's not something that is exclusive to looking at it from a visual perspective.
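A minimal sketch of that kind of post-scan check follows, assuming the lineage for the watched object has already been extracted as a list of (source, target) pairs. The function names and sample data are hypothetical; the actual Manta API calls are not shown in this conversation and are not reproduced here.

```python
def diff_lineage(previous_edges, current_edges):
    """Compare two extracts of (source, target) edges for one watched object."""
    previous, current = set(previous_edges), set(current_edges)
    return {
        "added": sorted(current - previous),
        "removed": sorted(previous - current),
    }

def alert_message(object_name, diff):
    """Return a notification string, or None when nothing changed."""
    if not diff["added"] and not diff["removed"]:
        return None
    return (
        f"Lineage for {object_name} changed: "
        f"{len(diff['added'])} edge(s) added, {len(diff['removed'])} removed"
    )

# Hypothetical extracts from last week's and this week's scans.
last_week = [("dw.loans.balance", "sp_risk_report"), ("dw.loans.rate", "sp_risk_report")]
this_week = [("dw.loans.balance", "sp_risk_report"), ("dw.loans_v2.rate", "sp_risk_report")]

changes = diff_lineage(last_week, this_week)
message = alert_message("sp_risk_report", changes)
```

The message could then be handed to whatever notification channel the team already uses; that part is left out of the sketch.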

And so how does that affect a person's daily workflow? Well, in some cases, you might get an alert that tells you about something, and you might not even really care about Manta for that matter

because it's gonna be something that's done behind the scenes. And then for other cases, we have people that go into, you know, Manta on a regular basis, and it's why we continue to look at ways to make lineage easier to understand with color coding and

the abilities to what we call basically aggregate the lineage. Right? You can see it at a higher level.

What you were talking about being able to see the changes in that lineage

brings me to another interesting capability that I noticed while I was preparing for the show is the versioning of that lineage graph. And I'm wondering if you can talk to some of the

motivations for that and some of the

use cases that it unlocks and just why is that an important feature, and what extra capabilities and value does it provide for being able to have that versioning and be able to go back in time and see

what was the lineage at any given point?

I'm so glad that you asked. That is certainly a key thing that has been in the Manta architecture since day 1. We call it revisions.

And what it basically is, is a complete slice across all of the

assets that you're scanning,

where we are able to compare

what the lineage of a particular object looked like

this week versus last week or last quarter.

Or when that report person who's, you know, dealing with the regulators

comes into your office with their hair on fire and says, oh my gosh, I have to see what the lineage looked like on

December 31,

2021.

Okay. Fine. Let's just go. Let's click here. Boom. Open up the lineage revision that we did on December 31st, and here you go. Here's all your lineage. So they can actually see what it looked like at that point, but also be able to compare it. So we do a comparison

that allows you to see how the lineage has changed and what that usually leads to

is a property or transformation

change that occurred upstream.

It could be as simple as being able to see, oh, look at this. The

stored procedure or the ETL

last quarter

was reading from

these 3 columns in our data warehouse, and now it's reading from these other columns. Well, which one's right or wrong? Well, somebody made a decision that in this giant data lake of ours, there was some redundant data and we were pulling it from the wrong place. Okay. That's 1 possibility.

1 of our customers

has a packaged application

where they have a schema. It's got, you know, hundreds of tables in it.

And

their biggest concern or issue that they have is every time this application vendor puts out a new release,

it breaks a lot of their custom reports.

So they actually use it to compare

the schemas

of that particular source.

And in their cases,

they're not really using it for lineage. You talk about surprising uses for Manta. They're using it just to compare properties. Properties that are in this entire schema of tables that have been added or changed

from the prior scenario.

And our technology can do that too. Right? And being able to do it within the context of lineage helps them out because of the reports that they fear are gonna get screwed up when schema changes come and they didn't know it.
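The schema comparison that customer does can be approximated like this. It is a sketch under assumptions: each release's schema is represented as a plain `{table: {column: data_type}}` dictionary, and the names and data are invented for illustration rather than taken from any Manta revision format.

```python
def compare_schemas(old, new):
    """Report columns that a new vendor release added, removed, or changed.

    Each schema is a dict of {table: {column: data_type}}.
    """
    report = {"added": [], "removed": [], "changed": []}
    for table in sorted(set(old) | set(new)):
        old_cols, new_cols = old.get(table, {}), new.get(table, {})
        for column in sorted(set(old_cols) | set(new_cols)):
            name = f"{table}.{column}"
            if column not in old_cols:
                report["added"].append(name)
            elif column not in new_cols:
                report["removed"].append(name)
            elif old_cols[column] != new_cols[column]:
                report["changed"].append(name)
    return report

# Hypothetical snapshots from before and after a vendor release.
release_1 = {"LOANS": {"ID": "NUMBER", "BALANCE": "NUMBER(10,2)"}, "BRANCH": {"ID": "NUMBER"}}
release_2 = {"LOANS": {"ID": "NUMBER", "BALANCE": "NUMBER(12,2)", "CURRENCY": "CHAR(3)"}}

schema_report = compare_schemas(release_1, release_2)
```

Joining the removed and changed columns against downstream lineage would then point at the reports most likely to break.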

Prefect is the data flow automation platform for the modern data stack, empowering data practitioners to build, run and monitor robust pipelines at scale.

Guided by the principle that the orchestrator shouldn't get in your way, Prefect is the only tool of its kind to offer the flexibility to write workflows as code.

Prefect specializes in gluing together the disparate pieces of a pipeline and integrating with modern distributed compute libraries to bring power where you need it, when you need it.

Trusted by thousands of organizations and supported by over 20,000 community members, Prefect powers over 100,000,000 business critical tasks a month.

For more information on Prefect, go to dataengineeringpodcast.com/prefect

today.

That's prefect.

Another interesting

use case for maybe the versioning, but especially the automated discovery of data lineage is that when you're doing that discovery,

it will help to understand, okay, what are the source assets that are fed into my system and where are they going?

And I'm wondering if there have been any examples of

the discovery of those source assets

being something that led to a moment in the sense of, oh, shoot, that never should have been fed into this system, and just some of the ways that this lineage information

can

surface potential

mistakes or

compliance issues because of the fact that certain datasets are maybe being consumed downstream where they should never have left the system that they were originally stored in.

My favorite example of that has been in this

growing area of concerns about privacy

and PII data. And, you know, it helps to back up a little bit and think about lineage. Right? And, you know, for any listeners that haven't actually played with any lineage tool, ours or anybody else's,

there's typically

kind of an idea that I like to refer to as standing on an island

when you ask for lineage.

Because you're typically asking for it from the perspective of some object within your world.

So you're standing on a report. So when you ask for lineage,

well, typically, the report is at the end of the stream or at the glass as people like to say. So everything is gonna be upstream,

right, or to your left if you're talking about typical left to right paradigm.

If you ask for something in the middle, you'll see lineage in both directions. If you ask for it when you're way over at the source, everything's gonna be in the other direction. So, again, that's why I say standing on an island. What is your perspective when you request lineage?
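The "standing on an island" idea maps onto a graph walk from a chosen starting node. Here is a minimal sketch using breadth-first search over a small invented edge list; the object names and the `reachable` function are hypothetical and say nothing about how Manta queries its graph database.

```python
from collections import deque

# Hypothetical lineage edges, each pointing from source to target.
EDGES = [
    ("crm.customers", "etl.load_customers"),
    ("etl.load_customers", "dw.dim_customer"),
    ("dw.dim_customer", "report.customer_summary"),
    ("dw.dim_customer", "extract.audit_file"),
]

def reachable(start, edges, direction):
    """Breadth-first walk from start; direction is 'upstream' or 'downstream'."""
    neighbors = {}
    for source, target in edges:
        a, b = (source, target) if direction == "downstream" else (target, source)
        neighbors.setdefault(a, []).append(b)
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbors.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

# Standing on the report: everything is upstream, nothing downstream.
upstream_of_report = reachable("report.customer_summary", EDGES, "upstream")
# Standing on the source table: everything is downstream.
downstream_of_source = reachable("crm.customers", EDGES, "downstream")
```

The same edge list gives a different picture depending on where you stand, which is the point of the island metaphor.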

So in a scenario where I have a customer report that's completely packed full of, you know, personal information.

I know it's got personal information. So 1 thing I might do is do a lineage and look upstream.

And that might show me all the lookups in the source tables and transaction systems that it comes from. But now,

once I've found 1 of those tables,

it's so eye opening to then switch the gear and stand on that location on the island. So now I'm way over on the left, right? And I say, show me lineage.

And of course, it's gonna show the other reports.

But this is what's been really interesting at sites that have now had their eyes open to say, oh,

not maliciously.

These aren't breaches,

but we've been

spraying

customer information

to audit files,

extracts,

and 20 other places that this information should not be going anymore. And that's a perfect example of the sort of awakening that people have. You know, there's been other scenarios too where people will find a lineage that goes back to a table in their data mart or data warehouse, and then there's no lineage on the other side, and they think, well, wait a minute, who's loading that? And they've had to then go do some investigation to find out, you know, in fact, there's a whole other system where people have been loading this table, and they didn't even realize it. And so you get these situations, and we have APIs to do so, to look for

what are objects that you have that have

no lineage at all. Right? There's nothing connecting them. So who's using it? It's like it's an inert object in the middle of your environment.

Or scenarios where

it's got hundreds of outbound connections, but no inbound connections.
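Both checks, objects with no lineage at all and objects with outbound but no inbound connections, reduce to simple set operations once the edges are in hand. This is a sketch with invented names and data; it does not use or describe the real Manta API.

```python
def classify_nodes(nodes, edges):
    """Flag objects with no lineage, and objects with consumers but no producer."""
    has_inbound = {target for _, target in edges}
    has_outbound = {source for source, _ in edges}
    isolated = sorted(n for n in nodes if n not in has_inbound and n not in has_outbound)
    no_inbound = sorted(n for n in nodes if n in has_outbound and n not in has_inbound)
    return isolated, no_inbound

# Hypothetical scan result: a warehouse table feeds two targets,
# but nothing scanned so far is loading it.
NODES = ["dw.dim_customer", "report.customer_summary", "extract.audit_file", "stage.old_extract"]
SCANNED_EDGES = [
    ("dw.dim_customer", "report.customer_summary"),
    ("dw.dim_customer", "extract.audit_file"),
]

isolated, no_inbound = classify_nodes(NODES, SCANNED_EDGES)
```

A genuine source system will also show up in the second list, so someone still has to decide which entries are expected and which point at an unscanned loader.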

Now that means something isn't being scanned, and it kinda opens up some new discoveries that have to be made.

Throwing you a bit of a curveball here. Most of these conversations around data lineage and lineage systems are in the analytical context where I wanna understand what are the data flows from an application database table through to

a Tableau report, for instance. But

another case where you have data flows that can be challenging to map and understand what are all the different usages of this piece of information

is in the

systems operations context where I have a configuration

element

that maybe has a database password that is then getting passed to an application somewhere,

or I have some attribute that says this is a certain piece of information about an operating environment,

and these are all the different places where it needs to be used to be able to make sure that everything gets linked up properly together. And I'm curious if you've ever seen Manta or similar systems applied to that context where I am a systems engineer, and I wanna understand what are the configuration elements that are being used in which locations and how are they being propagated.

I would say that that's a place

where people have looked to subject matter experts to sort of paint a higher level metadata picture

because a lot of those things aren't necessarily something that is so easily parsable.

Right?

It'd be neat if we kind of addressed our technology there, and I think that we could.

But where I've seen that potentially come up is with our customers that really wanna do planned metadata.

And so they've discovered this. And instead of putting it into

Visio or PowerPoint,

they wanna put it into Manta so that it can be seen within the context of the lineage that they are doing automatically.

But that would be a place where I would say it's probably mostly manual.

Just thinking about that because as somebody who is also responsible for running infrastructure, it's always interesting to try and look and see, okay, what are the cases where this piece of information is not actually getting used anymore and I can just delete it? Or what are the cases where this piece of information is depended on everywhere and maybe I need to refactor that into

a single canonical reference, and everything else, you know, pulls that from a key value store or something.

For sure. Yeah. There's some definitely interesting lineage paradigms that, you know, we toy with exploring a little bit more. One is process lineage.

And I think we may start to see that as we play with OpenLineage. But process lineage is, like, not really how the data flows, but what's the actual execution

of a sequence of things in Oozie in the Hadoop environment or,

you know, old legacy scheduling tools that you still see out there once in a while, Control-M, et cetera,

Where the flow

between

nodes is not a flow of data, but it's a flow of control. You know, run process a.

If process a finishes successfully,

it's gonna run processes, you know, x, y, and z. But if process a fails, it's actually gonna run process q.

And, you know, I wanna show that potential behavior

and have it laid out as it's

specified,

not just from a runtime perspective.
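To make the contrast with data lineage concrete, here is a small sketch of a control-flow specification where edges carry an outcome instead of data. The dictionary layout, the `on_success` and `on_failure` keys, and the function names are assumptions for illustration, not the format of Oozie, Control-M, or Manta.

```python
# Hypothetical specification: which processes run after process "a",
# depending on whether it succeeds or fails.
CONTROL_FLOW = {
    "a": {"on_success": ["x", "y", "z"], "on_failure": ["q"]},
}

def control_edges(flow):
    """Flatten the specification into (from, outcome, to) edges."""
    return [
        (process, outcome, nxt)
        for process, branches in flow.items()
        for outcome in ("on_success", "on_failure")
        for nxt in branches.get(outcome, [])
    ]

def next_processes(flow, process, succeeded):
    """Look up what the specification says runs next for a given outcome."""
    outcome = "on_success" if succeeded else "on_failure"
    return flow.get(process, {}).get(outcome, [])

all_edges = control_edges(CONTROL_FLOW)
```

Because the edges come from the specification, both branches are visible even if process q has never actually run.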

And then I've been getting requests recently, especially with all the work that's going on with containers

of, you know, what is the arrangement of containers within this whole Kubernetes infrastructure that I have, and how do they call each other?

And that's solving another different kind of need, but it's certainly something that we wanna explore.

We've brought up a few different interesting use cases, but I'm curious if there are any other examples of interesting or innovative or unexpected ways that you've seen Manta used.

I've brought up a couple of them so far. I thought that the customer that was injecting data quality scores into Manta was a really creative use that we've seen. And some of the time checking, right, the idea of traveling in time with the revisions that we talked about, has been some of the other really creative situations.

Well, in terms of your own experience of working at Manta and working in the space of lineage and data quality management, what are some of the most interesting or unexpected or challenging lessons that you've learned personally?

Well, certainly, all of us for the last 20 plus years have been

dealing with

remote teams, remote working. So that's nothing new, but every company has a different culture, right? And I have to get used to that. So that was certainly 1 thing when I first got on board Manta 3 years ago, just working closely with the team that we have here in the United States, as well as we have engineering offices in the Czech Republic, as well as just opened 1 in Lisbon, Portugal.

So dealing with that. But I have to say probably one of the biggest challenges, and this is not a technical one, is the transition that Manta and all of our fellow vendors in this space, competitors or not, have made to a subscription licensing model.

That was really an interesting transition that requires everyone, from engineering through our sales force and our customer success teams, to rethink how we address working with a customer when it's not a perpetual license that has a maintenance fee.

But in fact, it's a whole new license subscription for every single year. And I got my start in my career in customer support.

So I like this because what it does is it puts a whole new focus on the customer

that everyone in the company

needs to ensure that the customer is satisfied and happy.

And, you know, a passion of mine is providing value to the customer. With the subscription model, it's forced everyone to focus on the user

and their experience.

And I know that's not a technical scenario, but it certainly impacts everybody. And I think, you know, listeners who are dealing with different software companies, if they've been through this transition, and I know some companies are still going through it,

I would expect that they will find a better relationship with their vendor over time, for sure. That's a real big 1.

For people who are interested in being able to take advantage of these lineage capabilities

and be able to populate a lineage graph both from push based and discovery based workflows or

being able to just understand what are the data flows in their organization?

What are the cases where Manta is the wrong choice?

I would say

looking at who we help the most

are the sites that have

huge heterogeneous

lineage issues because they have tools across the board

on many different platforms

with many different skill sets

and across kind of the history of their organization.

A site that has

very narrow scenarios,

you know, that only has Snowflake.

There's gonna be other solutions out there that are gonna help that site, probably cheaper,

that can focus on just what they need in that 1 little space and provide the kind of lineage that they require because their number of variables are very small.

So I'd say, you know, small organizations

with narrow technology

we'd love to help them, but I think that they'll also find a lot of good solutions out there that don't need

the Neo4j

for millions of assets, and they don't need the broad

spectrum

of scanner types that we offer.

And as you continue

to build and iterate on the Manta

platform and explore the different use cases for Lineage and some of the different types of information that people want to track, what are some of the things you have in store for the near to medium term or any projects that you're excited to dig into?

A couple of things. I think first and foremost, I did mention that, you know, we're very close to putting out a commercial SaaS offering.

We've done the work that is required to be containerized so we can fully support it.

And so we're excited to do that. We think that'll open up new avenues for Manta even within the current technology framework that we have today.

But

really where we'll see a lot of new development for us in particular is in making lineage

continually

easier to consume

and to automate the kinds of things that it can identify for you. Right? So

today,

you can get information out of Manta that would help in my example before,

you know, that Tobias wants to understand that a particular stored procedure that he cares about has changed.

But to actually,

you know, put that into the tooling so you can create your subscription yourself,

and Manta will just automatically notify you or send you a Slack message.

So being able to do some of those automated rules is crucial.

Being able

to increase

the number of people in a company that are comfortable with lineage. Lineage

still tends to be

something that

leans technical,

but being able to

branch that out so that users

who are

maybe not pure business users, like the CEO themselves, but more business-oriented users, are going to feel comfortable looking at lineage.

And so our color coding pioneered that early on, and continuing to make that easier is significant for us.

The next thing really is continuing to expand our portfolio.

We talked a little bit about, you know, OpenLineage,

but there are also so many other new technologies that are coming out today and tomorrow

that our customers are requesting of us. They're all across the board in different categories.

And the last 1 I would say

is starting to see the need for

customers that wanna track not just classic tools,

but be able to look at something like the APIs that they have, like having a crawler that will read through a swagger page and be able to help illustrate the lineage

of those APIs that are established.
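As a rough idea of what the first step of such a crawler might look like, here is a sketch that reads an OpenAPI (Swagger) document and lists its operations. The sample document and the `list_operations` function are invented for the example; how Manta would turn these into lineage is not described in this conversation.

```python
import json

# A path item can also hold non-operation keys such as "parameters",
# so only recognized HTTP methods are treated as operations.
HTTP_METHODS = {"get", "put", "post", "delete", "patch", "head", "options", "trace"}

# Hypothetical, heavily trimmed OpenAPI document.
SAMPLE_SPEC = """
{
  "openapi": "3.0.0",
  "paths": {
    "/loans": {"get": {"summary": "List loans"}, "post": {"summary": "Create loan"}},
    "/loans/{id}": {"parameters": [], "get": {"summary": "Get one loan"}}
  }
}
"""

def list_operations(spec_text):
    """Return sorted (METHOD, path) pairs found under the document's paths."""
    doc = json.loads(spec_text)
    operations = []
    for path, item in doc.get("paths", {}).items():
        for method in item:
            if method.lower() in HTTP_METHODS:
                operations.append((method.upper(), path))
    return sorted(operations)

operations = list_operations(SAMPLE_SPEC)
```

Each operation could then become a node, with request and response schemas supplying the fields that flow through it.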

That's not like, you know, just going into the next ETL tool that was invented,

but really kind of looking at the things that people are doing to try and create standardized interfaces across their organization.

Are there any other aspects of the Manta product or the overall space of data lineage and its applications to data systems that we didn't discuss yet that you'd like to cover before we close out the show?

I think exploring the rich application information that we have,

you know,

doing it visually is 1 thing, doing with APIs is another, trying to make intelligent

decisions about notifying people when changes happen. And, you know, that's all important stuff. But we know so much about someone's application,

that being able

to measure its complexity,

being able to

identify

opportunities for better performance because we know all the paths that data is flowing through an organization.

That type of deep analysis

is where we're really also excited to take the Manta platform.

Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.

I think the integration

between all the different tools and technology

and, you know, coming from my experience at IBM

where we had a single platform with lots of integrated, you know, tools that did governance and ETL and data quality and stuff.

Even there

with a single platform,

customers still

had

lots and lots and lots of other solutions. They were pulled into their company either because it was historical

or because people have their own budgets and can do what they want, or simply there was a desire to have some best-of-breed solution that is dealing with data.

So even when you're in that environment,

you're dealing with so many different

platforms and tools that trying to get them all together and talk to each other is a massive problem.

And, you know, we talked about open lineage a lot today.

Another initiative

that Manta is involved with, and where I had the pleasure of being with the team

when I was with IBM

is working with an initiative called Egeria.

It's also sponsored by the ODPI and the Linux Foundation.

And this is

a standard for open metadata.

And if everybody really could cooperate there,

then it would alleviate so much friction that companies deal with as they try and get tools to talk to each other and share information.

Now, of course, you know, there's a lot of roadblocks there, right? People have proprietary models. People don't want to share everything.

And so, you know, these initiatives

have been slowly evolving,

but they're starting to get more momentum. And

we certainly believe in those efforts. I think they're

a long way from

wholesale industry adoption.

But with that being the biggest gap,

we need that in order to help customers

really get, I wanna say frictionless

connections between all these tools.

And until we do so, that gap will continue to remain.

Alright. Well, thank you very much for taking the time today to join me and share the work that you and your team at Manta are doing. It's definitely a

interesting product and an important space for the data ecosystem. So I appreciate all of the time and energy that you and your team are putting into building the product that you are and contributing to help

grow and expand and improve that ecosystem,

and hope you enjoy the rest of your day.

Thank you, Tobias. Really appreciate the time.

Thank you for listening. Don't forget to check out our other shows, Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast,

which helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a product from the show, then tell us about it. Email hosts@dataengineeringpodcast.com

with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
