Summary
Data is useless if it isn’t being used, and you can’t use it if you don’t know where it is. Data catalogs were the first solution to this problem, but they are only helpful if you know what you are looking for. In this episode Shinji Kim discusses the challenges of data discovery and how to collect and preserve additional context about each piece of information so that you can find what you need when you don’t even know what you’re looking for yet.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
- Data stacks are becoming more and more complex. This brings infinite possibilities for data pipelines to break and a host of other issues, severely deteriorating the quality of the data and causing teams to lose trust. Sifflet solves this problem by acting as an overseeing layer to the data stack – observing data and ensuring it’s reliable from ingestion all the way to consumption. Whether the data is in transit or at rest, Sifflet can detect data quality anomalies, assess business impact, identify the root cause, and alert data teams on their preferred channels. All thanks to 50+ quality checks, extensive column-level lineage, and 20+ connectors across the Data Stack. In addition, data discovery is made easy through Sifflet’s information-rich data catalog with a powerful search engine and real-time health statuses. Listeners of the podcast will get $2000 to use as platform credits when signing up to use Sifflet. Sifflet also offers a 2-week free trial. Find out more at dataengineeringpodcast.com/sifflet today!
- The biggest challenge with modern data systems is understanding what data you have, where it is located, and who is using it. Select Star’s data discovery platform solves that out of the box, with an automated catalog that includes lineage from where the data originated, all the way to which dashboards rely on it and who is viewing them every day. Just connect it to your database/data warehouse/data lakehouse/whatever you’re using and let them do the rest. Go to dataengineeringpodcast.com/selectstar today to double the length of your free trial and get a swag package when you convert to a paid plan.
- Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.
- Your host is Tobias Macey and today I’m interviewing Shinji Kim about data discovery and what is required to build and maintain useful context for your information assets
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you share your definition of "data discovery" and the technical/social/process components that are required to make it viable?
- What are the differences between "data discovery" and the capabilities of a "data catalog" and how do they overlap?
- discovery of assets outside the bounds of the warehouse
- capturing and codifying tribal knowledge
- creating a useful structure/framework for capturing data context and operationalizing it
- What are the most interesting, innovative, or unexpected ways that you have seen data discovery implemented?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on data discovery at SelectStar?
- When might a data discovery effort be more work than is required?
- What do you have planned for the future of SelectStar?
Contact Info
- @shinjikim on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends at Linode. With their new managed database service, you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes with automated backups, 40 gigabit connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show.
Data stacks are becoming more and more complex. This brings infinite possibilities for data pipelines to break and a host of other issues, severely deteriorating the quality of the data and causing teams to lose trust. Sifflet solves this problem by acting as an overseeing layer to the data stack, observing data and ensuring it's reliable from ingestion all the way to consumption. Whether the data is in transit or at rest, Sifflet can detect data quality anomalies, assess business impact, identify the root cause, and alert data teams on their preferred channels, all thanks to over 50 quality checks, extensive column level lineage, and over 20 connectors across the data stack. In addition, data discovery is made easy through Sifflet's information rich data catalog with a powerful search engine and real time health statuses. Listeners of the podcast will get $2,000 to use as platform credits when signing up to use Sifflet. Sifflet also offers a 2 week free trial. Find out more at dataengineeringpodcast.com/sifflet today.
That's S-I-F-F-L-E-T. Your host is Tobias Macey, and today I'm interviewing Shinji Kim about data discovery and what is required to build and maintain useful context for your information assets. So, Shinji, can you start by introducing yourself?
[00:02:00] Unknown:
Sure. Thanks for having me back here again, Tobias. Excited to be here. My name is Shinji Kim. I'm the founder and CEO of Select Star. Select Star is an automated data discovery tool that helps everyone to be able to find and understand their data.
[00:02:16] Unknown:
For folks who didn't listen to your previous interview, I'll add a link in the show notes, but can you briefly give an overview of how you got started working in data?
[00:02:25] Unknown:
So I have a computer science background and worked as a software engineer, data scientist, and product manager in the past, where I was a direct data producer, someone who brings in and also creates data models, and also an active data consumer, building models and making business decisions based on the analysis that we create. I started a company in 2014 called Concord Systems focused on distributed stream processing, which we sold to Akamai, and it's now an IoT platform called IoT EdgeConnect. So I've been working primarily in data platform infrastructure technologies for about 6 to 7 years prior to Select Star.
And I started Select Star because I saw a lot of needs around data understanding, for everyone to be able to find the right data. Now it's not just the large enterprises, but every company that has hundreds and thousands of datasets in their Snowflake, BigQuery, or Redshift as they are plugging their source applications, not just their production databases, but also different SaaS applications, into the data warehouse. And a lot of this, I would say, is actually a really good movement towards having more people inside the organization be able to make data driven decisions, for them to be able to run their own analysis of their customers and make better decisions faster.
At the same time, navigating through your data warehouse or, you know, your BI tool has been more challenging than ever today because of the amount of data and the context that everyone now needs to have in order for them to truly use that data.
[00:04:25] Unknown:
Before we get too much into some of the product updates from the last time we talked, I'm interested in digging into this term of data discovery that I brought up at the open and that you mentioned in the description of what you're building. And it's a term that has been relatively new in terms of widespread usage in the data ecosystem. And I'm wondering if you can give your definition of what it means and the technical and social and process aspects that it encapsulates and that are necessary to make a data discovery capability viable?
[00:05:01] Unknown:
Yeah. I think that's a great question. I think this definition and how it's translated in different organizations is still forming. But the way that we define data discovery at Select Star is all about finding and understanding the data that you have. So basically, in order to make that happen, first and foremost on the technical side of it, you do need to ensure all your metadata is available in 1 place so that you have a structured way to find the data and spot where exactly it's located. And then on top of that, in order for you to truly understand the data, you would want to have the context of that data asset.
The context of a data asset can be things like when was this data asset created or updated, where it came from, who's using it the most, and how is this data being used, in what types of joins or queries. And also, even beyond just the database, whether there are other applications that are leveraging that specific data. And this is, I think, where a lot of the tooling, like Select Star, really tries to help out, to automate the process of bringing this, what Gartner calls active metadata, to be available and searchable within 1 platform.
As for the social components of data discovery, I think it really comes from ensuring everyone knows where to go to find information or ask questions. This has primarily been done through a lot of Slack channels or 1 on 1 messages, is what we've observed from our customers before they adopted Select Star. And in a way, I think adoption of any new tool requires a bit of change management for people to start utilizing that tool. And as people start utilizing it, the part where data discovery can also really help around the social components would be allowing people to comment on that data or tag other people about that data, having that be integrated directly with your Slack channels, or getting a notification on email or Slack. Those, I think, are the other components that can be included in data discovery.
The important part about this social component of data discovery is that most of the time, this semantic level of information, more of this tribal knowledge that gets discussed within conversations or Slack messages, is usually a lot harder to capture from a metadata perspective. Being able to have that integration with Slack or email, and so on and so forth, is important so that it can also be searchable within that 1 platform. Last but not least, you also mentioned the process side of this. And I think the process side is what eventually brings a lot of this ad hoc knowledge sharing together.
What we see from a lot of our customers is that they may use Select Star at first as just a primary go to place, like a Google for data. But as you continue utilizing the platform, you will start adding different descriptions or tagging or ownership. And having these processes, understanding which datasets are marked to be deprecated, whether this is a gold, silver, or bronze table, who the main owners are, and having templates for your data documentation. These are all part of the process perspective that we recommend every customer have so that there is a standard put into place that people can trust
[00:09:09] Unknown:
as they are utilizing a data discovery tool. Does that make sense? Yeah. It makes perfect sense. So definitely thank you for sharing that perspective on what this term is being used for and how you're adopting it for your own work at Select Star. And an interesting comparison is, maybe a year or 2 ago, the word of the day was data catalog, and that was the kind of clearinghouse for how you figure out what data an organization has and maybe figure out what is the popularity ranking for a given table or something like that. And I'm wondering if you can talk to some of the differences between what people have in mind when they use the term data catalog versus data discovery and where that overlap
[00:09:57] Unknown:
sits? I think the main difference of data discovery really comes from providing this active metadata, or the automated data context around how the data is currently being used inside the company. Traditionally, data catalogs have existed since databases have existed, to give you a full schema and map of all the metadata of any sources that you connect to. In a way, the whole purpose of a data catalog is to create an inventory of all your data, which I would say a lot of enterprise data catalog tools are still focused on. Whereas companies like Select Star, more of a data, quote, unquote, discovery platform, have an emphasis on trying to direct people to find the right dataset and giving them the right way to use that dataset.
So it's really more focused on the consumption side of data, how to use that data better. And if you're looking for certain types of data, which 1 is the right data to use? That's kind of how we see the market, as the main differentiation. A catalog overall, or just any metadata catalog, I think is almost now a baseline feature that many other data tools also have, including observability or quality types of tools. The aspect of discovery really comes from combining all the usage data and the insights of the multiple apps together to provide a better way of using the data, is how we define it. I think it's an interesting evolution because
[00:11:49] Unknown:
the overarching agreement has been that metadata is the lifeblood of any data system. You need to be able to collect and understand the different applications of that metadata to be able to build something that is truly flexible and adaptable for the evolving data needs and data tools and simplifying some of that integration pain. And I think that data catalogs were something that was hit on early on because it's something that is understandable and relatively well scoped for being able to capture that metadata and make it useful. And I think that now that we have that as a basis point that everybody can understand and that a lot of people are starting to adopt, it gives the opportunity for folks like yourself and other people who are working in the kind of metadata arena to branch out from there and figure out what are the next set of capabilities and features that we can build on top of now that we have this unified view of metadata, now that we have gotten everybody on board to saying, okay. We are actually going to build a centralized view of metadata and make it useful. What can we do with that now?
[00:13:00] Unknown:
When we talk to companies that have tried to adopt more legacy data catalog players, the way they started was as some kind of data governance project. And the first thing that they were doing is trying to just get all metadata in 1 place. That itself may be like a year long project because each connector has a different metadata format. Everything may take a while to load. But then, once the data is loaded, what are you actually doing with that metadata? If there isn't much context around that metadata, will people actually use the catalog to find what they're looking for?
And it goes to that question again: I'm looking for data around x, y, z, and if I don't know what the table is called or what the column is called, can I still find that dataset? With the traditional way of only cataloging the physical metadata, that is really, really hard to tell. Whereas by looking into more of how the data is being used and how each dataset is connected to another, you can get additional context, which really helps you to actually use that data beyond just having 1 place to search.
[00:14:27] Unknown:
On that point of actually using some of this additional context and social information to figure out what is the data asset that I'm actually looking for. To your point, if you don't know the table name, then you don't really even know what you're looking for necessarily. And so 1 of the early solutions was to just say, we'll just rank by popularity, and so whatever the most widely used table is, is probably the 1 that you're looking for, and then you can just kind of branch out from there and maybe use the lineage view to understand what are the tables feeding this, what is this feeding into. I'm curious if you can talk to some of the other aspects of context and how that helps with some of the detective work of saying, I'm trying to solve this problem. I don't even know what data is out there. How do I figure out the piece of information that I'm looking for?
[00:15:14] Unknown:
So like you mentioned, there's looking at popularity based on how other people are utilizing data: which data is being referenced the most, which datasets are being selected the most. And you can also go into the level of who is querying this data, or look at what a given analyst or team uses the most. Without really having any idea about the datasets, this is actually 1 of the places that a lot of our users start from. When they're new to the team, being able to look up what their team uses the most, or what their managers or peers use the most, is definitely 1 place to start.
Another way to start is by observing what may be happening inside the data discovery platform. So earlier we mentioned the social aspect of this platform. People may be talking about or discussing a specific theme or word or table that you didn't know the name of. But because there is a discussion going on, you are now exposed to more of the context about the dataset. And you can think of other ways that keyword may match up with other datasets as well. Another part that we look into is lineage. Lineage primarily shows you the whole data model: where each column was generated, where it's flowing to, how it becomes either a reporting table or more of a materialized view.
Other ways that we can also look at lineage, or different angles to look at lineage, is what are the tables that are actually derived from the same parents or same sources of the data? So you can start with a certain KPI, look at the sources, and then find what is almost like a sibling, if you're looking at it as a graph. So these are other ways to discover other datasets or dashboards that are created on top of that data. And last but not least, another part of discovering other datasets, I would say, comes from noticing the joins that are happening. So you may just start from 1 table, and it can be anywhere, like any popular table. Right?
A lot of those tables will have other dimension tables that you may see being joined on. And this is a great way to discover other datasets that you may not have been aware of in the past, but that are actually being joined with those tables. But most importantly, I think search is the most important part here if you have a certain dataset you are looking for. And there are many ways to look into search, not just at the level of indexing on the name of the table, but also looking into the table comments, column comments, any docs that it might be related to, tags that are attached to the tables, and the actual people that you think might know about this table. So being able to search through any of those aspects of the dataset, I think, is also important for discovery purposes.
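As a rough illustration of the popularity idea described above, here is a minimal sketch, not Select Star's actual model, that scores tables from a warehouse query log, blending recency-weighted query counts with the number of distinct users. The log format, decay curve, and weights are all assumptions made up for this example.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Hypothetical query-log records: (table_name, user, timestamp).
# In practice these might come from a warehouse's query history.
QUERY_LOG = [
    ("analytics.orders", "amy", datetime(2022, 9, 1, tzinfo=timezone.utc)),
    ("analytics.orders", "bo", datetime(2022, 9, 20, tzinfo=timezone.utc)),
    ("analytics.customers", "amy", datetime(2022, 6, 5, tzinfo=timezone.utc)),
]

def popularity_scores(log, now=None, half_life_days=90.0):
    """Score each table by recency-weighted query count plus distinct users."""
    now = now or datetime.now(timezone.utc)
    weighted = defaultdict(float)
    users = defaultdict(set)
    for table, user, ts in log:
        age_days = (now - ts).total_seconds() / 86400
        # Exponential decay: a query loses half its weight every half-life.
        weighted[table] += 0.5 ** (age_days / half_life_days)
        users[table].add(user)
    # Blend query volume with breadth of usage; the 2.0 weight is arbitrary.
    return {t: weighted[t] + 2.0 * len(users[t]) for t in weighted}

for table, score in sorted(popularity_scores(QUERY_LOG).items(),
                           key=lambda kv: -kv[1]):
    print(f"{table}: {score:.2f}")
```

Ranking the output of a function like this is one plausible way to surface "what my team uses most" as a starting point for someone new.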
[00:18:51] Unknown:
The biggest challenge with modern data systems is understanding what data you have, where it is located, and who is using it. Select Star's data discovery platform solves that out of the box with a fully automated catalog that includes lineage from where the data originated all the way to which dashboards rely on it and who is viewing them every day. Just connect it to your dbt, Snowflake, Tableau, Looker, or whatever you are using, and Select Star will set everything up in just a few hours. Go to dataengineeringpodcast.com/selectstar today to double the length of your free trial and get a swag package when you convert to a paid plan.
1 of the other elements that you mentioned is that in the data creation and data production environment, so a lot of the applications that we build, the interactions with end users that generate some of these data points, there is a lot of implicit context that exists in terms of the application logic that feeds into the specific records that are generated, the sequencing of particular engagement events with a customer. And I'm wondering if you can talk to some of the ways that that context can be captured and propagated into this discovery system to feed a richer understanding of what the data actually semantically means once you do come across the table and are trying to figure out, how am I going to use this information, what are the valid aggregations or transformations that I can perform on it while maintaining the original meaning of the information as it was generated?
[00:20:27] Unknown:
Great question. And it is something that we are continuing to work on at Select Star, to provide more of these places to apply the context to data analysts' and data engineers' day to day workflows. A few things I can point out here. Regarding queries and joins, by giving you a sense of which are the most used types of select queries and also joins, meaning the join conditions, which tables are being joined, and which join keys are being used, you can map out which datasets you can utilize together. And, this may require a little more than just the metadata, either by you bringing out which may be the foreign keys or primary keys on the table, or by us detecting that for you.
We can easily create almost like an entity relationship diagram where you can see a full data model of the database based on that specific table. And this is a new feature that we released that we're starting to see more analysts utilizing, because most of the time, when you have a data lake or data warehouse environment, the relationships around the primary keys and foreign keys get lost. And many times people are just guessing which are the right join keys for that particular table. So by highlighting which joins have already been done in the past and letting you look up that query, this is a concrete example and context you can get right away. Another part that we are introspecting regarding data lineage is how the data has been transferred along the lineage. So from the column lineage perspective, we can tell you whether the data has been propagated as is, whether it has been transformed, or whether it's been aggregated into the next column.
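To make the join-mining idea concrete, here is a toy sketch of extracting join conditions from raw SQL in a query log with a regular expression. This is only an illustration: a real system would use a proper SQL parser, and the pattern below only catches simple `a.x = b.y` equality joins. The sample queries are invented.

```python
import re
from collections import Counter

# Hypothetical sample of query texts pulled from a warehouse query log.
QUERIES = [
    "SELECT * FROM orders o JOIN customers c ON o.customer_id = c.id",
    "SELECT o.id FROM orders o JOIN customers c ON o.customer_id = c.id",
    "SELECT * FROM orders o JOIN payments p ON o.id = p.order_id",
]

# Matches only simple "JOIN table alias ON alias.col = alias.col" patterns.
JOIN_ON = re.compile(
    r"JOIN\s+\S+\s+(\w+)\s+ON\s+(\w+)\.(\w+)\s*=\s*(\w+)\.(\w+)",
    re.IGNORECASE,
)

join_keys = Counter()
for sql in QUERIES:
    for m in JOIN_ON.finditer(sql):
        _, lt, lc, rt, rc = m.groups()
        join_keys[((lt, lc), (rt, rc))] += 1

# The most frequent join conditions are good candidates for implicit
# foreign-key relationships when inferring an ER diagram.
for (left, right), n in join_keys.most_common():
    print(f"{left[0]}.{left[1]} = {right[0]}.{right[1]}  ({n} queries)")
```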
And utilizing this context, customers can also understand, when it's aggregated, how it's being aggregated. If it's a column that is just using the same data as is, we also utilize that relationship to propagate any descriptions that you might have from upstream and propagate any tags that it might have. And also, we are starting to work on putting in more of the workflow so that you can also propagate ownership or a notification chain with that. So I think these are some of the things where we are now starting to scratch the surface as we introspect further into how the data has been composed and how the data is currently being used.
By putting them into these structures like lineage or our popularity model, we can, you know, programmatically start giving you more context.
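A minimal sketch of the propagation idea just described: walk a column-level lineage graph and copy tags only across edges where the data moved as-is, stopping at transformed or aggregated columns. The graph shape and edge labels here are made up for illustration, not Select Star's internal model.

```python
from collections import deque

# Hypothetical column-level lineage: column -> list of (target, transform).
# Tags are only safe to propagate when a column is copied as-is.
LINEAGE = {
    ("raw.users", "email"): [
        (("staging.users", "email"), "as_is"),
        (("analytics.user_stats", "n_users"), "aggregated"),
    ],
    ("staging.users", "email"): [
        (("marts.dim_users", "email"), "as_is"),
    ],
}

def propagate_tags(source, tags):
    """BFS downstream from `source`, carrying tags over as-is edges only."""
    tagged = {source: set(tags)}
    queue = deque([source])
    while queue:
        col = queue.popleft()
        for target, transform in LINEAGE.get(col, []):
            if transform != "as_is" or target in tagged:
                continue
            tagged[target] = set(tags)
            queue.append(target)
    return tagged

print(propagate_tags(("raw.users", "email"), {"pii"}))
# Tags reach staging.users.email and marts.dim_users.email, but not the
# aggregated analytics column, which no longer contains raw values.
```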
[00:23:42] Unknown:
Your mention of the foreign key relationships as they exist in the source databases, which are often lost when you're just pulling the data directly from that database into your warehouse or into your lake, that's an interesting observation. And I'm wondering if you have seen any of these systems such as Airbyte and Fivetran, etcetera, able to capture some of that context as well in the process of doing the extract and load, where you say, this is the table I'm loading from, this column has a foreign key relationship with this other table. And then maybe also things like understanding and introspecting the fact that this table also has a compound index on these 2 or 3 columns, so that that way you can understand, okay, these 3 columns have some sort of implicit relationship with each other because they're often used for fetching a specific record, and being able to feed more of that information into the downstream analysis so that analysts don't have to go digging back into the source database or digging through the source code that generated those records to understand what those application level operations and requirements are and how you can reflect that into the analysis that you're performing.
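To make the "lost foreign keys" point concrete, here is a sketch of how an EL tool or a discovery platform could read foreign-key relationships out of a Postgres source via `information_schema` before the rows land in a warehouse. The DSN is a placeholder, the query handles only single-column foreign keys, and this is an illustration rather than what any particular vendor actually ships.

```python
import psycopg2  # assumes a Postgres source and the psycopg2 driver installed

FK_SQL = """
SELECT
    tc.table_name,
    kcu.column_name,
    ccu.table_name  AS referenced_table,
    ccu.column_name AS referenced_column
FROM information_schema.table_constraints tc
JOIN information_schema.key_column_usage kcu
  ON tc.constraint_name = kcu.constraint_name
JOIN information_schema.constraint_column_usage ccu
  ON tc.constraint_name = ccu.constraint_name
WHERE tc.constraint_type = 'FOREIGN KEY';
"""

def extract_foreign_keys(dsn):
    """Return (table, column, referenced_table, referenced_column) tuples."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(FK_SQL)
        return cur.fetchall()

# Shipping these tuples alongside the extracted rows would let the
# destination catalog reconstruct the source ER diagram.
for fk in extract_foreign_keys("postgresql://localhost/appdb"):  # placeholder DSN
    print(fk)
```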
[00:24:58] Unknown:
This has been 1 of our asks to Fivetran in the past, because they already have that ERD for many of their source connectors. But today, it's not directly replicated to the destination. You know, there may be something that they are starting to work on to try to expose more of that source metadata to the destination, so that there is also a clear lineage and metadata transfer beyond just the data transformation and load itself. But I think they are in a great position
[00:25:38] Unknown:
to do so to bridge that gap. And I think that that layer is the right place to do it. Yeah. I could definitely see that being very useful, particularly as you're maybe starting to build out your dbt models to go from, here are my raw data assets, to the intermediate tables where you're trying to do some domain object modeling of the application objects and trying to recombine them from these normalized tables into something that is a more unified view of that concept, and just being able to use those relationships from the source database to even automate some of the dbt SQL that you might need to write to recombine those tables?
[00:26:19] Unknown:
But it is something that we have requested, because, yeah, I think there are asks on both the source side and also the destination. Also, beyond the data warehouse side, companies are now moving data back to the applications. Having that lineage back on the application side is also a new area that is starting to emerge, to add to the discovery perspective.
[00:26:44] Unknown:
Yeah. That's an interesting point too of being able to capture that source metadata. This is how these tables existed at the time that we pulled them out of the source system. We've done these transformations and enrichment from other applications. Now we wanna feed that back into the source application, being able to split that back out into the normalized models based on the information you had on the way in.
[00:27:06] Unknown:
Yeah. I'm closing the full circle now.
[00:27:11] Unknown:
Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it's no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. That's where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability.
Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $5,000 when you become a customer. And another interesting element of data discovery and data catalogs is that a lot of the conversations around them are oriented around the data warehouse or the data lake or these tabular representations of information.
And I'm wondering how you have seen this aspect of data discovery branch out into some unstructured data that maybe is related in some way to the tabular assets that you're working with, but is not something that you can shove into the data warehouse. So maybe you have some customer interaction events that relate to an image or an ebook that you're trying to use as inbound marketing, and you wanna be able to process that PDF in some way, maybe feed it into a machine learning model to do some semantic understanding of it and just being able to link the tabular data of these are the customer events that we're working with to the specific asset that they are related to as far as the original behavior that you're trying to do the analysis on?
[00:29:18] Unknown:
I think that's an area that we haven't explored so much yet. The extent that we are starting to get exposed to is around the metadata of that unstructured data, regarding the access patterns or the events that have happened, which come in various formats, or it could be happening through more like an S3 bucket. So it's more of the file access. But eventually, in order to have our data lineage model and also our metadata model standardized within Select Star, we basically try to compose a model that can fit into a relational model in general. And this is something that we are starting to notice: when you have all these JSON events coming through and you want to track through different events, how that would eventually convert into different columns. That's basically the extent of what we're starting to think about right now. I think the other side where the data is being used a lot, but we don't talk as much about the metadata and are just starting to, is the BI side.
Because for the rest of the company, data consumption happens through BI tools like Tableau, Power BI, or Looker. A lot of these complex BI tools have their own data models underneath. So how is that transferred as its own data model within BI? And then from there, it's exposed as part of a chart or dashboard or some sheet or workbook. So that side of the model we've been building a lot, to expose that model even though there wasn't anything defined as a relational model in the past. And we try to make it so that it's still unified with our metadata model. So, you know, every dashboard will have some kind of a chart or some kind of subcomponent.
And within those subcomponents and also in the dashboard, we will display several queries: how many people are viewing this dashboard and what interactions are there, what are the filters or group bys that people are running, which gives you another set of eyes on how the data is being consumed in your ecosystem.
[00:31:59] Unknown:
Digging now into the work that you've been doing at SelectStar, I believe it's been at least a year since the last time we spoke. And I'm wondering if you can share some of the ways that your platform and product has evolved and some of the ways that these broader conversations around data catalogs versus data discovery versus metadata management has influenced the overall product focus and the ways that you think about prioritizing effort and maybe even some of the ways that the customer journey goes from, I have this need of understanding what is the catalog of all of my assets through to where we are now of I need to do more broad based discovery and context management around these assets.
[00:32:41] Unknown:
So, you know, we started with a data discovery platform designed for any data team member to be able to easily find, understand, and utilize their data. Kind of them using Select Star as if it's their Google for data. And the pillars of the technologies that we built for that included the automated metadata catalog, column level lineage, and then this usage model, like a popularity model. And starting already this year, we've been starting to leverage all 3 parts more in combination and started building new features. And this also kind of maps to where our customers are heading. So a lot of companies that adopted Select Star initially wanted to gather the insights around metadata.
And as their data teams grow, they are now starting to bring on people outside of the data team to start utilizing Select Star, and have Select Star as the go to place if anyone wants to ask questions or look up information about data. So 1 part that we added early this year is the whole notion around docs and metrics. So this is to provide, beyond just the physical data model, a way for our users to be able to put together what their business process looks like and how the data models are mapped. And on top of that, they can define their KPIs as metrics and also note that this metric can be calculated with this SQL query, or is represented by a measure in Looker or Tableau, or by looking into this column or table. So that's kind of the 1 major part that we've been upgrading, so that data teams can share the context beyond what we are giving them automatically. They can really start adding more of a semantic level and business level understanding of data that they can share with everyone else.
And 1 of the important parts around that is providing them with a way to be able to create this augmentation. But we also wanted to make sure it has a good connection back to their data, so referencing a table, or being able to mention tables, columns, dashboards, and users, will now create a backlink connection. So for anything that you have defined as a metric, if you go to its table, then you can see a metrics label, and the column is already marked because it's mentioned as a KPI. You can see for a table that it was mentioned within a data concept doc within Select Star. And having that link between the high level documentation and the data model has been a major part that's starting to drive Select Star usage beyond the data team.
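A small sketch of the backlink idea just described: metric docs mention physical assets, and inverting those mentions yields the "mentioned as a KPI" labels on the asset's page. The metric format and names here are invented for illustration, not Select Star's actual schema.

```python
from collections import defaultdict

# Hypothetical metric definitions that reference physical assets.
METRICS = {
    "weekly_active_users": {
        "sql": "SELECT COUNT(DISTINCT user_id) FROM analytics.events",
        "mentions": ["analytics.events", "analytics.events.user_id"],
    },
    "gross_margin": {
        "sql": "SELECT SUM(r.amount - c.amount) FROM finance.revenue r ...",
        "mentions": ["finance.revenue", "finance.costs"],
    },
}

def build_backlinks(metrics):
    """Invert metric -> asset mentions into asset -> metrics backlinks."""
    backlinks = defaultdict(list)
    for name, metric in metrics.items():
        for asset in metric["mentions"]:
            backlinks[asset].append(name)
    return backlinks

# On an asset's page, these backlinks surface every KPI that depends on it.
print(build_backlinks(METRICS)["analytics.events"])  # ['weekly_active_users']
```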
So that's 1 part. The second part is that, as we are starting to notice that beyond the data team, different parts of the organization are starting to utilize Select Star, we've added more enterprise level, or enterprise grade, access control capabilities. Most of the time, a lot of the emerging enterprises that we've been working with, companies like Opendoor or Handshake, have basically open access to their data warehouses for their data team. At the same time, as more companies are opening up their data warehouse, 1 thing is that they want to make sure that people are not confused by all the data they have. And, also, there is always a set of data that people are legally not supposed to be exposed to, or where you want to gate access to that information.
So we are releasing more fine grained access control within Select Star so that you can define who can see what. You can define this at the team level, or you can define this by certain attributes like tags, and it is much easier for you to define what the overall experience of using data discovery will look like per user, depending on which team they belong to and what the dataset is tagged with or entails. And last but not least, another new part that we're releasing pretty soon is all around exposing more of this context about data beyond Select Star. We've noticed a lot of customers also utilizing our API within their workflow, but the parts that we provide, the automated descriptions or lineage or even discussion items, we'll be starting to expose that through a Chrome plugin so that you don't have to always be in selectstar.com.
You can just use it while you're browsing through your BI tool or you're in a SQL IDE or wherever you want. So those are kind of the major changes or major updates that have happened and are also coming up for us.
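For the tag- and team-based visibility rules described a moment ago, here is a deliberately simplified sketch. Real policy engines are far richer, and every rule, team name, and tag here is invented for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Asset:
    name: str
    tags: set = field(default_factory=set)

@dataclass
class Policy:
    # Maps a restricting tag to the set of teams allowed to see it.
    tag_allow: dict = field(default_factory=dict)

    def can_view(self, user_team: str, asset: Asset) -> bool:
        """An asset is visible unless one of its tags restricts it."""
        for tag in asset.tags:
            allowed = self.tag_allow.get(tag)
            if allowed is not None and user_team not in allowed:
                return False
        return True

# Only the data platform and legal teams may see PII-tagged assets.
policy = Policy(tag_allow={"pii": {"data-platform", "legal"}})
salaries = Asset("hr.salaries", tags={"pii"})
print(policy.can_view("marketing", salaries))      # False
print(policy.can_view("data-platform", salaries))  # True
```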
[00:38:25] Unknown:
Yeah. In particular, what you were just saying about being able to use the metadata that you have in Select Star and surface that in the BI environment, but also in the SQL IDE is definitely very interesting. And, also, the integration that you're doing with DBT so that you can have that information feeding in both directions where you're working on your DBT model, so you're able to see in your IDE, this is some of the information that I have about all the other tables that I need to, you know, understand. Or, like, as I'm starting to say this table name, I can understand, okay, what are some of the columns and some of the associated metadata about that?
And then I build my dbt model, and that feeds back into what you have in Select Star so that people can understand downstream, okay, this is actually the set of dbt models that generated this table that I need to understand. And it's definitely great seeing more kind of cross pollination and bidirectional integrations and interactions with more of the tools that people are using to be able to build up these analyses and build up these assets, so that context isn't something that is the responsibility of 1 system. It's something that everybody can collaborate on and feed in both directions as people are building and generating that additional context.
[00:39:39] Unknown:
Yeah. For sure. I mean, the API is a really interesting part. We're starting to see, and get surprised by, the use cases and how much customers are doing with the API. You know, that gives us really cool ideas too. So yeah. I agree.
[00:39:54] Unknown:
As you have been exploring the space further and iterating on your platform, what are some of the most interesting or innovative or unexpected ways that you have seen folks building or using these data discovery capabilities, and in particular, some of the ways that context is able to be captured and managed and propagated throughout?
[00:40:16] Unknown:
I'll elaborate a little bit more on that API usage. We have customers that are starting to use our lineage API in their CI pipeline just in order to not have any more downtime in their data. If you think about how a lot of companies utilize lineage today, they are primarily using it to introspect and find the root cause of why a dashboard broke or why a data pipeline has an issue. The company that did this is called Xometry. It's a public company that runs a marketplace for manufacturers and suppliers.
And we actually just released a case study about this, because they were looking for a data lineage partner for more than a year, because they wanted to put this into their CI pipeline. This was a pretty critical issue for them, where their data engineers end up spending hours and hours fixing issues that their production engineers didn't really know they were creating. So by integrating the lineage API, for any metadata changes, like column deletions or name changes, it will ping our API and we will return how many downstream objects may get affected. And if that's more than 0, then it will basically send an automated comment on their git to say, hey, there are issues that are gonna happen.
Check out this page on Select Star, and utilizing our API, it will auto generate the lineage link that they have to go to. And it's pretty remarkable to see how they just don't have these pages anymore. And by saving a lot of time on this, their data team gets to really focus on more proactive, forward looking projects rather than people spending time on just triaging or fixing production issues in their data pipeline. We've also seen customers starting to generate their legal security reports around PII tagged data.
This also leverages lineage a lot, because part of our lineage will also give you the usage information of the downstream effects, or whoever has touched that data recently. So building more of this reporting capability is another part that we didn't necessarily intend in the beginning, but we are seeing really interesting usage of it.
[00:43:04] Unknown:
Yeah. It's definitely great being able to automate any sort of compliance documentation so that you don't have to do all the tedious work of building it yourself. I'm also interested in understanding a bit more about the use case you were referring to with feeding the lineage information back to the production engineers. So just to make sure I'm getting this right, it sounds like, using terminology and projects that I'm familiar with, say you have a Django web application. It has the database models in the ORM. A developer says, I'm going to add a column or rename a column using a migration in the ORM. They are using your API in their CI/CD to understand, okay, this ORM model maps to this database table, which is getting consumed into this downstream report eventually.
So if you rename this column, then this is actually going to break these tables in this downstream report, and so, you know, make sure that this is communicated to all the people who need to care about it. Is that correct?
[00:44:01] Unknown:
Yeah. I believe it blocks the PR, the way they implemented it. Basically, it will look at the diffs of the code that is getting merged. Any metadata changes will be taken into Select Star and run through the lineage API to see if there is any response. And if there is, then it comes back with, you know, here's a link that you need to go check for the downstream effects, because there are more than 0 objects that are getting affected because of this.
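A hedged sketch of the CI integration described above: take the columns a migration renames or drops, ask a lineage API for their downstream objects, and fail the build when anything is affected. The endpoint, response shape, and environment variables are all hypothetical, not Select Star's documented API, and the diff parsing is assumed to have happened already.

```python
import os
import sys

import requests  # generic HTTP client; the lineage API itself is hypothetical

LINEAGE_URL = os.environ.get("LINEAGE_API_URL", "https://example.com/api/lineage")
API_TOKEN = os.environ["LINEAGE_API_TOKEN"]

def downstream_count(table: str, column: str) -> int:
    """Ask the (hypothetical) lineage API how many downstream objects
    depend on a column that this change renames or drops."""
    resp = requests.get(
        f"{LINEAGE_URL}/columns/{table}/{column}/downstream",
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        timeout=30,
    )
    resp.raise_for_status()
    return len(resp.json().get("objects", []))

def main() -> int:
    # Imagine an earlier CI step parsed the migration diff into
    # (table, column) pairs that are being renamed or deleted.
    changed = [("public.orders", "customer_id")]
    broken = {
        (t, c): n for t, c in changed
        if (n := downstream_count(t, c)) > 0
    }
    for (t, c), n in broken.items():
        print(f"{t}.{c}: {n} downstream objects affected", file=sys.stderr)
    return 1 if broken else 0  # a non-zero exit code blocks the PR

if __name__ == "__main__":
    sys.exit(main())
```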
[00:44:31] Unknown:
Definitely very cool and something that I would like to see more investment in across the board of being able to feed that information in both directions where as developers are modifying and manipulating the source systems that data is being consumed from for downstream reports, they are brought into the conversation about what impact is this change going to have rather than having that be the responsibility and burden of the data engineers who have to be in constant firefighting mode once that change propagates and they have no more control over it.
[00:45:02] Unknown:
Exactly. Yeah. I mean, production or product engineers don't know what the impact is of changing that small column or deprecating a table that nobody seems to be using. Right? And they may not all have heard of Select Star, but because the API is integrated, they can easily check out that page, because their CI will give them that information.
[00:45:29] Unknown:
Very cool. So in your own experience of going on this journey of launching your product and then going through the past year or so since we last talked, evolving the platform, and expanding into this data discovery capability and conversation that's happening across the ecosystem, what are some of the most interesting, unexpected, or challenging lessons that you have learned?
[00:45:53] Unknown:
There are interesting things I could say about all 3 of those. I would say the interesting part, and what's also starting to be a challenging part of the industry that we are in right now, is that more people are starting to wake up to the fact that they do need better data discovery, because they have just migrated to their data warehouses, or cloud data warehouses, and are realizing that it's not super easy to use when you have hundreds and thousands of tables and so many different databases that you have to sift through. So the awareness around the importance of data discovery is growing, and I think that's really awesome to see. At the same time, it's also hard for people to start thinking about, how are we actually going to utilize data discovery?
How can we communicate this to our management in order to adopt the tool or invest in a tool or capabilities? And I think this is something that we as an industry are starting to really develop, and it can be confusing for very early customers that haven't thought about this as a capability in the past. Because this is definitely 1 of the newer areas that has emerged; once you have this modern data stack running, this starts to become a much clearer issue. But how to make that part of the standard stack, I think, does take some time for everyone to get on the same page about.
[00:47:29] Unknown:
And so as you continue to build out and iterate on the platform and the product vision and direction for Select Star, and continue talking to people in the ecosystem who are getting up to speed on the capabilities and use cases for metadata and discovery capabilities, what are some of the things you have planned for the near to medium term, or problem areas that you're excited to dig into?
[00:47:53] Unknown:
I mentioned a couple things of what's coming up in our road map. I'm also really excited to continue leveraging and building on this context metadata, or active metadata, structure that we have. So a couple things that we've recently done are around automating your documentation. So you document in 1 place, and based on lineage, or if the dataset is duplicated, or any new places that we see it fit, we will start propagating the documentation or tags or ownership information throughout the platform. This, combined with allowing you to also link your business documentation, is something that we are starting to develop more, so that you can actually transmit and share the knowledge around data beyond first having your data team make it, and have it also be understandable by people that are outside of the data team. And the whole notion around allowing you to transfer this or see this within your BI tool or your SQL IDE is all part of that, so that this context metadata is more ubiquitous and versatile.
And there is always also the higher level context that you can add. These are some of the parts that I'm very excited about, to enable more people to be able to understand and use data better.
[00:49:30] Unknown:
Are there any other aspects of the work that you're doing at Select Star or this overall conversation of data discovery that we didn't discuss yet that you'd like to cover before we close out the show? I'm very excited about, like, everything that's happening in the ecosystem.
[00:49:44] Unknown:
When I first started the company, I was told, from multiple, I guess, investors and other people, that data discovery seems like a vitamin. It's not a painkiller. Like, it seems like just a nice to have tool. Why not just use a, you know, Notion doc or whatnot? And for me, over the last couple of years of us being in the market, we've seen so many amazing use cases of how this really changes the data team's work culture, how much time they've saved, and how they can really now focus more on forward looking projects and also enable the rest of the company.
So, yeah, I'm really excited for what's coming, also as we are noticing more companies moving towards self-service analytics and enabling more of their employees to leverage data better.
[00:50:42] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:50:58] Unknown:
I think that's a great question. Data management overall, it's not just a tooling perspective. There's also a big part around change management, the process perspective, and the social perspective, in order for data to be managed well. And 1 part where, regarding tooling and technology, we can do better, and this is definitely 1 of the areas we're continuing to work on, is bridging the gap between the business processes and the data models that support those business processes. Today, I think there are ways to try to document this in rich text documentation. But to really fully map it, have it automated, and have it understandable by everyone,
I think that there are just definitely more ways to go, and I'm curious to find out what other solutions are maybe out there in about a year or 2 and also kind of the road map of how we wanna tackle this problem to really bring the understanding of data in 1 place.
[00:52:10] Unknown:
Alright. Well, thank you very much for taking the time today to join me and share the work that you've been doing at Select Star and your perspectives on the different challenges and use cases for data discovery and how we can use that information to power upstream and downstream work and use cases. So I appreciate all the time and energy that you and your team are putting into contributing to this ecosystem, and I hope you enjoy the rest of your day. Thanks so much, Tobias.
[00:52:43] Unknown:
Thank you for listening. Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story.
And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Sponsor Messages
Interview with Shinji Kim: Introduction and Background
Defining Data Discovery
Data Catalog vs Data Discovery
Context and Social Information in Data Discovery
Metadata and Data Lineage
Evolution of Select Star
Innovative Uses of Data Discovery
Challenges and Future Directions
Closing Remarks