Summary
Companies of all sizes and industries are trying to use the data that they and their customers generate to survive and thrive in the modern economy. As a result, they are relying on a constantly growing number of data sources being accessed by an increasingly varied set of users. In order to help data consumers find and understand the data that is available, and help data producers understand how to prioritize their work, SelectStar has built a data discovery platform that brings everyone together. In this episode Shinji Kim shares her experience as a data professional struggling to collaborate with her colleagues and how that led her to founding a company to address that problem. She also discusses the combination of technical and social challenges that need to be solved for everyone to gain context and comprehension around their most valuable asset.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- You listen to this show to learn about all of the latest tools, patterns, and practices that power data engineering projects across every domain. Now there’s a book that captures the foundational lessons and principles that underlie everything that you hear about here. I’m happy to announce I collected wisdom from the community to help you in your journey as a data engineer and worked with O’Reilly to publish it as 97 Things Every Data Engineer Should Know. Go to dataengineeringpodcast.com/97things today to get your copy!
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
- Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a Data Engineering Podcast listener, you get credits worth $3000 on an annual subscription.
- Your host is Tobias Macey and today I’m interviewing Shinji Kim about SelectStar, an intelligent data discovery platform that helps you understand your data
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what SelectStar is and the story behind it?
- What are the core challenges that organizations are facing around data cataloging and discovery?
- There has been a surge in tools and services for metadata collection, data catalogs, and data collaboration. How would you characterize the current state of the ecosystem?
- What is SelectStar’s role in the space?
- Who are your target customers and how does that shape your prioritization of features and the user experience design?
- Can you describe how SelectStar is architected?
- How have the goals and design of the platform shifted or evolved since you first began working on it?
- I understand that you have built integrations with a number of BI and dashboarding tools such as Looker, Tableau, Superset, etc. What are the use cases that those integrations enable?
- What are the challenges or complexities involved in building and maintaining those integrations?
- What are the other categories of integration that you have had to implement to make SelectStar a viable solution?
- Can you describe the workflow of a team that is using SelectStar to collaborate on data engineering and analytics?
- What have been the most complex or difficult problems to solve for?
- What are the most interesting, innovative, or unexpected ways that you have seen SelectStar used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on SelectStar?
- When is SelectStar the wrong choice?
- What do you have planned for the future of SelectStar?
Contact Info
- @shinjikim on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- SelectStar
- University of Waterloo
- Kafka
- Storm
- Concord Systems
- Akamai
- Snowflake
- BigQuery
- Looker
- Tableau
- dbt
- OpenLineage
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Have you ever woken up to a crisis because a number on a dashboard is broken and no one knows why? Or sent out frustrating Slack messages trying to find the right dataset? Or tried to understand what a column name means? Our friends at Atlan started out as a data team themselves and faced all this collaboration chaos. They started building Atlan as an internal tool for themselves. Atlan is a collaborative workspace for data driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets and code, Atlan enables teams to create a single source of truth for all of their data assets and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker, and more.
Go to dataengineeringpodcast.com/atlan today. That's A-T-L-A-N, and sign up for a free trial. If you're a Data Engineering Podcast listener, you get credits worth $3,000 on an annual subscription. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.
Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. You listen to this show to learn about all of the latest tools, patterns, and practices that power data engineering projects across every domain. Now there's a book that captures the foundational lessons and principles that underlie everything that you hear about here. I'm happy to announce I collected wisdom from the community to help you in your journey as a data engineer and worked with O'Reilly to publish it as 97 Things Every Data Engineer Should Know. Go to dataengineeringpodcast.com/97things today to get your copy. Your host is Tobias Macey, and today I'm interviewing Shinji Kim about Select Star, an intelligent data discovery platform that helps you understand your data. So, Shinji, can you start by introducing yourself? Well, thanks for having me here,
[00:02:34] Unknown:
Tobias. Really excited. Yeah. I'm Shinji. I'm the founder and CEO of Select Star. We build an automated data discovery platform for everyone to be
[00:02:46] Unknown:
able to find and understand their own data. And do you remember how you first got involved in data management?
[00:02:52] Unknown:
So I studied software engineering at the University of Waterloo and have worked with a lot of companies in Silicon Valley since 2007 as a co-op intern, which has been an amazing experience. One of my very first internships in the Bay Area was working at Sun Microsystems Research Labs in sales forecasting as a statistical analyst, building out models and crunching about 10 years' worth of sales, marketing, and operations data. It basically showed what our forecast was compared to the plan versus the actuals. That, I would say, is how I got involved in data. I've also worked at Barclays Capital,
where I was building an application for global IT database consolidation. So I guess that was my first experience of doing data management. I built a .NET program that would scan through all the development databases in the bank that hadn't been used for more than a year. We had a whole list of databases that would get decommissioned, which saved a lot of money for the bank. So that's another experience where I vividly remember how much impact you can make from data management overall. I also worked at Facebook on the growth team, primarily doing keyword optimization for the ad campaigns we were running at Facebook back in 2009 for user acquisition and advertiser acquisition. I wrote a lot of ETL jobs there, and, yeah, just ran through the analysis.
I guess back in the day that combination of being a data engineer slash data analyst was kind of what I was doing. Then I moved to New York, worked in management consulting for a little bit, and worked at a mobile ad network called Yieldmo, which grew very quickly. There, we were processing about 10 billion events a day on our stream processing flow on top of Kafka, Storm, and HDFS, which were breaking at the time. This is back in 2013. The lead engineer from the company and I decided to start a new company on a modern way of doing distributed stream processing, called Concord Systems.
That was the first company that I started, in 2014, in the data platform and data infrastructure space. We had a stream processor that ran 10 to 20 times faster than the alternatives at the time, which were Apache Storm and Spark Streaming. Later, I sold the company to Akamai, and now it's an IoT data platform called IoT EdgeConnect, primarily designed to process sensor data coming from devices all around the world for consumer electronics companies and automotive companies that already have millions of devices like connected cars, smart TVs, and game consoles. It was basically running Concord on top of a distributed MQTT broker that's hosted on Akamai's CDN edge network.
I spent some time off traveling after I left Akamai, then moved to San Francisco about 2 years ago, and I started Select Star about a year ago based on a lot of the observations that I had in the data space, as well as my experience of being the end user as both a data producer and a data consumer. I felt like data discovery is an area and a problem that more people are starting to run into, and a place where I can also make an impact. So, yeah, there you go. Yeah. It's definitely a very
[00:06:35] Unknown:
interesting progression of challenges that you're dealing with. And I agree that data discovery is one of the big headline issues that people are trying to tackle at the moment. So I'm wondering if you can give a bit of an overview of what it is that you're building at Select Star now, and maybe add some nuance to what your thoughts are on the data discovery term versus data catalogs or metadata management, which are other elements in the space that people are using to try and address some of this discovery complexity?
[00:07:08] Unknown:
So what we are doing at Select Star, our main mission, is to make data easy. When you have data access but a lot of different things to sift through, how do you know, and how can you find, the right datasets you're looking for? The first angle we are starting from is data discovery, which I define as finding and understanding data. Finding data means that, even though you may not know what it's called, you should be able to find the column or table or dashboard or chart or metric that you are thinking about. Understanding data means having all the context around that data object or data asset, such as who's using it, where did it come from, where does it live today, and what are the ways that this data has been used in the past?
So I think there are a few things that are happening, or have been happening, in the market that are starting to make data discovery more painful. There are also smaller features and different angles of how we use data that weren't given a lot of attention in the past by the old data catalog tools. And this is not just because the old tools are not great. I mean, the old tools are not great, but the main difference, I would say, is the new world that we are living in. I see three main things happening in the industry that make today's data discovery and today's data catalogs very different, and they should be very different.
First and foremost, what I see in the industry, and this is very clear to everyone, is that companies are collecting more data. And the data is not coming from, and not being collected just from, your websites and apps anymore. I mean, you are already collecting a lot more data from those two. But you are also getting data from Salesforce, Marketo, Stripe. Data from all the tools that you currently use to store other types of operational data is now coming directly into the same data warehouse that you are also copying your production data into. This is primarily so that you can have one place that stores everything, so you can join different tables and make new observations and analyses on top of it, which is great, but it does have a lot of impact. So a few things. One is that you have a lot more data than before.
A lot of this data is entering the data warehouse or the data lake in a raw form. When it arrives, you cannot use it directly. So you have to transform it. You have to, like, match your customer ID with your customer name, and then you have some version that you can use as a dimension table or fact table. And then on top of that, in order for you to actually generate business reporting or any other analysis, you still have to run some aggregation, some materialized views and tables that you create on top. That also makes more tables and views inside the data warehouse.
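The raw-to-dimension-to-aggregate layering described here can be sketched in a few lines. This is an illustrative example using plain Python dicts in place of warehouse tables; all table and column names are made up, not any real schema.

```python
# "Raw" layer: order events as they land from an ingestion tool.
raw_orders = [
    {"customer_id": 1, "amount": 120.0},
    {"customer_id": 2, "amount": 75.0},
    {"customer_id": 1, "amount": 30.0},
]

# Dimension table: maps customer IDs to customer names.
dim_customers = {1: "Acme Corp", 2: "Globex"}

# "Fact" layer: raw rows enriched by joining in the dimension.
fct_orders = [
    {**row, "customer_name": dim_customers[row["customer_id"]]}
    for row in raw_orders
]

# Aggregated reporting layer: revenue per customer, analogous to a
# materialized view built on top of the fact table.
revenue_by_customer: dict[str, float] = {}
for row in fct_orders:
    name = row["customer_name"]
    revenue_by_customer[name] = revenue_by_customer.get(name, 0.0) + row["amount"]

print(revenue_by_customer)  # {'Acme Corp': 150.0, 'Globex': 75.0}
```

Each layer here corresponds to another set of tables and views in the warehouse, which is exactly what multiplies the objects users have to sift through.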
Eventually, it just becomes too confusing, with too many things that you have to sift through, because there isn't necessarily, like, one place to go see all of that today. And a lot of the old data catalogs are not necessarily designed for the operation of today's cloud data warehouses and cloud data lakes. So that's one part that I see. The second part I see is what I call the decentralization of data ownership. It used to be, 5 to 10 years ago, that you would go to the data platform team, and they would load the data, transform the data, store the data, and make the report for you. And if you wanted to change anything, they'd change it.
Now most organizations, especially larger ones, and also a lot of modern organizations today, have their own data teams in different divisions, but they are just not called data teams; they're called the ops teams. So you have sales ops, marketing ops, finance, product analytics, marketing analytics. Each business division has its own analysts, or people that will create their own dashboards, reports, and also some of their own tables and views on top of raw data and other materialized data. What this means is you don't just go to the data team to ask questions about data anymore.
You have to go to the finance team, or you have to ask the product team. And if you're trying to marry multiple datasets together, you will have to talk to multiple people. And sometimes you may get different answers from different people. That is confusing not just to someone who's trying to find the answer, but also to everyone else. Oh, I didn't know you were calculating revenue like that. You may also end up in a position where you get wrong answers because you didn't talk to everyone. So no one person or team holds the, you know, single source of truth or a single answer, and that is starting to, I would say, become a problem.
Last but not least, I also see a trend of what I would call the democratization of data access contributing to this issue. It's not just the engineers that are accessing data anymore; it's a lot of business stakeholders and what we call citizen data scientists and citizen data analysts that are accessing data directly today. 5 to 10 years ago, a lot of business stakeholders would get email reports of how their business was doing. Today, they have direct access to the data warehouse through Tableau, Looker, or Mode, and they can create their own reports.
They are trying to learn the tools so that they can slice and dice the data in different ways. What this means is they are now starting to have questions about the data, asking, can I slice this number by this dimension? Which dimension should I use to filter? What are the right dashboards that I should look at in order to answer this type of question? The answers to that are not always clear, and all of these questions, whether they're coming from ops teams, direct business stakeholders, or other engineering teams, are starting to turn the data team, who's been supporting everyone, into almost an internal IT help desk for everyone else trying to utilize their data.
That's just not great. I don't know how else I should put it. So I see issues happening around, like, a lot of ad hoc support that burns people out, and yet the tribal knowledge of data is still hard to, like, transfer to everyone. And hence, sure, hiring more analysts could be an answer, but ramping up new analysts always takes time. So, yeah, these are really the core challenges that I'm seeing in a lot of organizations today,
[00:14:41] Unknown:
especially with companies that have grown quickly in the last couple of years. To your point, there are a lot of different facets of this problem. And one of the things that jumped out at me as you were talking through the array of complexities is the idea of the single source of truth for metrics. And I know that there's been some motion in that space in terms of introducing the concept of a metrics layer, with the Minerva project from Airbnb being one of the notable examples, and then there's the Transform company that has recently launched to be a managed service to make that accessible to people so they don't have to build it themselves in house. I'm wondering if you can give your perspective on the relative utility of the metrics layer as it compares to the set of features that you are offering in Select Star, and just the utility of having metrics as a point solution versus being integrated into a more holistic approach to the discovery, access, analysis, and sort of social aspects of data within the company? I personally believe that
[00:15:49] Unknown:
the concept of metrics, like, you know, what is revenue, what is activation, and so on, is already defined somewhere in a lot of companies, whether that's in the database or the BI tool or in a SQL query. A lot of what companies like Transform bring to the table as a dedicated metrics platform is governing those metrics and, more importantly, being able to efficiently run those metrics queries. And I think it's really that efficiency, and the parallelization and everything else that they do underneath to calculate those metrics, that means no one has to wait a long time for something to load when they slice and dice a metric. Regarding the role that we play, and how we see the metrics players in the ecosystem integrating with us: we want to become a centralized place where you can find all the decentralized data, like metadata, around the ecosystem.
So we are starting with data warehouse and BI integrations today. And in between, we do have a concept called metrics that our customers can define by either adding a SQL query or pointing to a Looker measure field or a Tableau measure field or a column, and so on. But we do not execute any queries or metrics. It's really designed to convey that, hey, when somebody says revenue, the way that they get that data is by using this field or this column or this SQL query. The part where we really add value from that point of view, in addition to having a single place where people can find the definitions, like what it means, what business problem it solves, this customized documentation that they can add, is really connecting that definition back to where that metric currently exists in the data warehouse and in the BI tools.
So today, when our customers define a metric in Select Star in the form of a SQL query or a field in the BI tools, we will surface all the dashboards where it shows up today, so they get visibility into where that metric currently lives. And the way that we are thinking about integrating with the other metrics platforms, like Transform, is having interoperability so that if the customer defines their metrics in Transform, we can bring out the descriptions, the documentation, how it's defined, which table it is, and then they can move over to Transform or other tools to slice and dice the metric as they need to, or save it in their workflow, and so on.
That's what we are currently doing with a lot of BI tools today. You search for a keyword, we will find you the specific chart or dashboard or Explore field or data source field, and from there you can look at the top users, where that data comes from, what other dashboards are related, and how popular it is. And then from there, we always have a button on top called open in Looker, open in Tableau, open in Mode. So you can go back to that tool and do your deeper analysis or your own workflow afterwards. In that case, we are still the discovery platform where a lot of users start from Select Star to find what they're looking for. They will explore around, and then they will jump off to their main tool, whether that's Snowflake or some SQL IDE or BI tools.
[00:20:11] Unknown:
And that's actually an interesting point to dig into more: the organizational challenge of getting everybody to collaborate in the same space, where, you know, an analyst is going to be living in Snowflake or their dbt IDE, a data engineer is going to be living in their orchestration tool, and a business end user is going to be looking at the dashboarding system. How do you actually create those integration points to bring people into the same space for being able to ask and answer questions about the data in these different modes and these different contexts, without forcing them to change the workflow that they're used to, but still being able to reach out into those different systems and bring everyone together? I'd be interested to talk through a bit more how you're approaching that with Select Star, and some of the difficulties or interesting learnings that you've come across as you build out these integrations and work with customers to figure out what it is that they're looking for in a data discovery and collaboration tool.
Yeah. I think that's a really important point because, you know, I mean, no one wants to change what they're already
[00:21:18] Unknown:
doing just so that they can use a new tool, especially in data. You also don't want to have issues when you are syncing the data: what is the actual right version that we're going to use? So, initially, when we first started Select Star, we made this mostly read-only. Really, the magic that we have underneath is parsing through the SQL queries, analyzing the metadata, and putting them all into one place so that you are finding these insights that you actually didn't know about before. But sometimes, some of the customers have come to us and said, hey, I just want to change the description in Select Star.
Like, this is much easier to use. Our analysts don't want to always, like, you know, make a pull request just to change the spelling of a description, and so on. So we do have a UI to update that documentation in Select Star. And what we tell our customers is that if you are going to start doing that, that's totally fine, but what that means is we're not going to try to update the descriptions directly from Snowflake. So we want you to choose where you're going to add that data. Do you want to do it through dbt, because that's where you are currently updating your documentation?
Or do you want to do it through Select Star? So, I mean, as of today, we do read directly from Snowflake, BigQuery, dbt, you know, Looker, Tableau; we will read it and we will always update it every day. But in about a couple of months, at the end of Q3, we plan to release our API so that our customers can retrieve the metadata from Select Star directly. So once analysts have updated their documentation in Select Star, that latest doc you can, you know, push into Snowflake, Looker, Tableau, however you want.
So overall, we don't want to change our users' workflows, but once a lot of customers start using Select Star, they do want to collaborate on or add different details about their data in Select Star. That data, we want to make available for our customers to query. We also plan to make our metadata, like popularity or column-level lineage, available for our customers to retrieve so that they can utilize it programmatically in their Airflow jobs or dbt jobs or, you know, their data quality platforms, and so on.
[00:24:03] Unknown:
In terms of the actual SelectStar platform, can you talk through some of the architectural elements that you've built into it and how you're managing the sort of integration points across the different layers of the data stack to be able to provide this cross cutting view to the organization?
[00:24:19] Unknown:
So the way Select Star works is, we get access to the metadata through service accounts for the data warehouses and BI tools. For ETL models like dbt, we just get the dbt model, or we hook it up to the customer's dbt repo in GitHub. What we have underneath is what we would call a unified metadata store that will basically consolidate the different models into our version of the metadata. What that means is, for any data warehouse, like Snowflake or BigQuery, even though BigQuery calls them projects, datasets, and tables, we will treat them as database, schema, and table.
Similarly, for BI tools, we will treat, like, a Looker dashboard as the same kind of element as a Mode report, because a dashboard to us is a set of charts or queries. Each of these concepts in what I would call our unified metadata model has a custom data model underneath, with a specific integration model that defines each connector. With that, that's how we are able to aggregate and say, this is the, you know, popularity, or this is the data lineage. So that's how we treat the metadata. On top of that, we have our query parser.
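The unified metadata model described here, where warehouse-specific names like BigQuery's project/dataset/table all collapse into one database/schema/table hierarchy, can be sketched as below. The class and mapping are hypothetical illustrations, not Select Star's actual internal model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TablePath:
    """One unified three-level address for a warehouse table."""
    database: str  # BigQuery calls this a "project", Snowflake a "database"
    schema: str    # BigQuery calls this a "dataset", Snowflake a "schema"
    table: str


def normalize(source: str, parts: tuple) -> TablePath:
    # Both warehouses expose a three-level hierarchy, so the mapping is
    # positional; only the vendor-specific names differ.
    if source in ("bigquery", "snowflake"):
        return TablePath(*parts)
    raise ValueError(f"unknown source: {source}")


bq = normalize("bigquery", ("my-project", "analytics", "events"))
print(bq.database, bq.schema, bq.table)  # my-project analytics events
```

With every connector normalized this way, features like popularity and lineage can be computed once against the unified model instead of per-integration.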
Our query parser has support for different SQL dialects, as well as understanding whether, you know, this is a custom SQL query from Tableau or a query from Mode, and it also includes parsing through LookML, for instance. Combining all of that with the metadata model that we have, we emit our own popularity model. It also depends on the metadata you're looking at. For example, for a table, it's how many people are referencing this table in their SELECT queries. For dashboards, it will be how many people have viewed this dashboard in the last 30 days or last 90 days. So these are somewhat customizable models that we give our users control over, but we run our popularity model.
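A toy sketch of this kind of reference-count popularity is below. It counts how often each table appears in query logs and scales by the most-referenced peer, so scores are comparable within an asset type. The scoring formula is an assumption for illustration, not Select Star's actual model.

```python
from collections import Counter


def relative_popularity(table_refs: list) -> dict:
    """Score each table by its query-log reference count, normalized so the
    most-referenced table in the peer group scores 1.0."""
    counts = Counter(table_refs)
    top = max(counts.values())
    return {table: round(n / top, 2) for table, n in counts.items()}


# Table references extracted from, say, 30 days of SELECT query logs.
refs = ["orders", "orders", "orders", "orders",
        "customers", "customers", "events"]
print(relative_popularity(refs))
# {'orders': 1.0, 'customers': 0.5, 'events': 0.25}
```

Normalizing within the peer group is what makes a score like 0.5 meaningful: it says "half as referenced as the most popular table", regardless of the company's absolute query volume.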
We have a rolling window that we aggregate over. And at the end, we compute almost like a relative measure, so that for any data asset you can always see how popular it is inside the company relative to its own peers. Like, if I'm looking at a table, then table popularity is always relative to the other tables; same with columns, same with dashboards, and so on. On top of that, we have data lineage. Data lineage primarily comes from the model that we generate from our query parser, focused on all the DML and DDL queries that include a SELECT: so CREATE TABLE AS SELECT, UPDATE, MERGE, and so on. We will parse those through and put it together.
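The lineage-from-query-logs idea can be illustrated with a deliberately simplified parser. A production system handles full SQL dialects; this regex only covers the simplest CREATE TABLE AS SELECT shape, and the query text is made up.

```python
import re

# Matches "CREATE TABLE <target> AS SELECT ... FROM <source>" in its
# simplest form (single source table, no joins, no schema qualifiers).
CTAS = re.compile(
    r"create\s+table\s+(\w+)\s+as\s+select\b.*?\bfrom\s+(\w+)",
    re.IGNORECASE | re.DOTALL,
)


def lineage_edges(queries: list) -> list:
    """Return (source_table, target_table) edges extracted from CTAS queries."""
    edges = []
    for q in queries:
        m = CTAS.search(q)
        if m:
            target, source = m.group(1), m.group(2)
            edges.append((source, target))
    return edges


queries = [
    "CREATE TABLE fct_orders AS SELECT * FROM raw_orders",
    "CREATE TABLE rpt_revenue AS SELECT customer_id, sum(amount) "
    "FROM fct_orders GROUP BY 1",
]
print(lineage_edges(queries))
# [('raw_orders', 'fct_orders'), ('fct_orders', 'rpt_revenue')]
```

Chaining these edges is what yields the end-to-end raw-table-to-dashboard lineage view: each parsed statement contributes one hop in the graph.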
One part of data lineage that a lot of our customers really like is being able to see the end-to-end lineage from the raw table inside your data warehouse, to your transient table, to your reporting table and materialized view, to your Looker view, to your Explore, to the dashboard. And similarly from, you know, a Mode report, or from a Tableau data source to the embedded data source to the workbooks, and being able to see that at the sheet level or view level for dashboards. That's how we are seeing the world. And utilizing both of those is how we allow our customers to define their metrics.
So when you define a metric, we can tell you right away what the popularity of that metric is and which dashboards include that metric, and so on. And in the future, after we have the API support so that our customers can leverage our metadata model, we want to provide a way for them to build automated workflows on top, which I'm excited about.
[00:28:49] Unknown:
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you're looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch. The other interesting thing to dig into is the linkage of columns and queries to the downstream dashboards that are consuming them. And I know that that's usually one of the major goals in a data platform: being able to see, if I change this column, which reports is that going to change? And then, having that built-in popularity of how this dashboard is being viewed and, you know, being able to see who's viewing it, now I can get a better understanding of the impact of changing the formatting of this integer column here, or changing the precision of this float from 5 to 3 decimal points.
To the point of the integration with the dashboarding tools, I'm wondering if you can just talk through some of the challenges that are inherent in being able to work with a variety of different business intelligence systems and some of the conceptual boundaries that you've had to overcome in terms of how to think about modeling the interactions with the dashboard and feed that back into Select Star and just that whole space of integration complexity?
[00:30:33] Unknown:
So first of all, the part that you're mentioning about data lineage is actually a really useful and very interesting part of Select Star. With data lineage in general, you get to see and do that impact analysis so quickly, especially on the engineering side, what's gonna, you know, crash if I change this, versus from the data analyst side, oh, this dashboard is not loading correctly. Which table is this loading the data from? And is the data in each table actually up to date? And I think that's, like, a really valuable thing to have. On top of that, the part that BI integration adds is to go 1 step further in telling you who's gonna get impacted if this dashboard crashes.
So for us, 1 thing that we show on each database table page is the downstream impact list, where it shows which dashboards are using this table, what the popularity of each dashboard looks like, and who the top users of the dashboard are. So you can attach that directly to the table or column that you are looking at. It was actually 1 feature that 1 of our customers had requested, and it has been very useful for a lot of other customers too. Regarding the integration, yeah, it's been challenging. 1 part is mapping the models so that they all fit into, you know, the similar shapes that our users are used to seeing.
The other part is really just getting around what is actually supported versus not between different APIs. So for Tableau, today, when you give us access, like a service account and an API token, we actually utilize both the REST API and the Metadata API in order to fetch all the data we need. And then we usually do have to also connect to the Tableau Server Postgres instance that contains the activity data. That's just 1 example. And we are, you know, working along with all of the BI partners on this. They are very aware of the issue, and they're working on it. So I think this will definitely get better over time. Yeah. I mean, same with Looker. We get a lot of metadata from the Looker API, but we still do have to get the LookML repo separately today because that part is not exposed through the API. So I would say getting around different APIs has definitely been the challenging part that we've spent a lot of time on.
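As a rough illustration of the Tableau side mentioned above: the Tableau Metadata API accepts GraphQL queries over HTTP, authenticated with a session token header. The sketch below only assembles the request pieces without sending anything; the query fields follow the published schema, but treat the exact shapes as assumptions to verify against your Tableau version, not as Select Star's implementation:

```python
import json

# GraphQL query asking each workbook for its upstream warehouse tables.
# Field names (workbooks, upstreamTables) follow Tableau's Metadata API schema.
WORKBOOK_LINEAGE_QUERY = """
{
  workbooks {
    name
    upstreamTables { name schema database { name } }
  }
}
"""

def build_request(server: str, auth_token: str) -> dict:
    """Assemble URL, headers, and body for a Metadata API call (no network I/O)."""
    return {
        "url": f"{server}/api/metadata/graphql",
        "headers": {
            "X-Tableau-Auth": auth_token,  # session token from the REST API sign-in
            "Content-Type": "application/json",
        },
        "body": json.dumps({"query": WORKBOOK_LINEAGE_QUERY}),
    }
```

Actually sending the request would be a `requests.post` with these pieces; it is omitted here to keep the sketch self-contained.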
[00:33:21] Unknown:
Beyond the dashboarding layer and the data warehouse, for being able to do the query parsing and the table analysis, what are some of the other integration points that you've either had to build out or are working on building out to be able to make Select Star viable for presenting this data discovery and collaboration layer for the organization?
[00:33:43] Unknown:
Yeah. A lot of our integration is driven by the customers that we're working with. So with that, there are other, you know, integrations that we want to make happen. Conceptually, a few things that haven't been as important to our previous customers, but are more important for new customers that we're gonna be working with, are something like struct support for BigQuery, which I think is a really interesting way to retrieve data in BigQuery, but is not necessarily used a lot in the Snowflake world, for instance. So that's fun. Dbt is another 1. We are supporting, basically, you know, being able to create Select Star documentation just out of your dbt catalog files. But if you already have a data warehouse connection through dbt and you already have persisted docs, the dbt docs or the YAML files or manifest.json are not necessarily needed.
What we are starting to look at is, what is the other metadata from dbt that we can make sure our users benefit from? Such as, when's the last time this model has run? What are the dbt tests that run against this table? So those are some of the things that we are starting to think through and try to add on to Select Star.
[00:35:03] Unknown:
As far as the workflow of a team that is onboarding onto Select Star, can you just talk through some of the steps that are involved in not necessarily just getting set up, but working on a single data analysis project together, and how the interaction with Select Star spans the different roles and stakeholders in the company?
[00:35:26] Unknown:
So I see generally 2, like, camps when we work with customers. 1 camp is companies with a fairly large or strong data team, like, you know, a lot more people involved and much deeper, you know, data models and many different tables, so on and so forth. And we also have companies that have, like, a smaller number of tables, but whose focus is really to enable support for everyone else inside the company. Actually, I would say both camps go through a very similar framework, but maybe on a different time horizon. In the beginning, once the data is loaded so usually, once we connect to the data warehouse and the BI tool, it takes just a couple hours or so to load all the data and bring out the lineage and popularity, so on and so forth. Currently, we usually give it about 24 hours for the tool to settle.
And then what we recommend to our users is, okay, take a look at what it shows, and tell us if there's anything missing and whether it looks alright. The part that we ask them to fill out is, first of all, what are the service accounts that you're currently running? Because if you have ETL jobs that are creating tables every hour, it's gonna mess up the popularity. So we give them a tool to check off which accounts are the service accounts, so that we can adjust the popularity weight based on that information. And then, based on the overall popularity, a lot of data teams actually decide to onboard Select Star to use as a data discovery tool or data catalog for the rest of the team.
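The service-account adjustment described here can be pictured as a weighted count over a query log: queries from known service accounts are down-weighted (or zeroed out) so that hourly ETL jobs don't dominate the popularity signal. Account and table names below are invented for illustration:

```python
from collections import Counter

# Accounts the team has flagged as automated rather than human.
SERVICE_ACCOUNTS = {"etl_loader", "airflow_svc"}

def popularity(query_log, service_weight=0.0):
    """Score tables by query count, weighting service-account queries separately."""
    scores = Counter()
    for user, table in query_log:
        scores[table] += service_weight if user in SERVICE_ACCOUNTS else 1.0
    return scores

log = [("alice", "orders"), ("etl_loader", "orders"),
       ("etl_loader", "orders"), ("bob", "customers")]
```

With the default weight of 0, the two ETL hits on `orders` are ignored entirely, leaving `orders` and `customers` equally popular despite the automated traffic.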
For larger companies, they still do that, because they don't have a lot of time to dedicate a lot of effort in the beginning when they are just trying out the tool. So they just start giving out access to other analysts, and those analysts still find value because they can find their own existing tables. They can see who's actually using each table, all the lineage and the dashboards, and so on and so forth. So they start using that, and that's when they start adding their tags or documentation, so on and so forth. So we take it a little bit slowly with the larger companies because there's a lot of coordination within the company. It takes some time, but we usually have, like, an initial onboarding session for a lot of our customers, where they onboard, you know, 5 to 10 or up to 20 data analysts so that they can start working on it. And then they ask us different questions or give us feedback through Slack channels, and then we hold, like, office hours. For companies that say that their data model is a little bit more manageable, when they first see their data in Select Star, many actually find that, oh, I see all the places that I need to clean up first before I give access to everyone.
So they say, I'm going to actually take the next month or 2, and I'm going to dedicate some of my colleagues' time, so that we can deprecate the old database tables that we don't really need, put the right tags on what we are going to govern and manage, and start defining metrics. We are in process with a couple of customers that are doing that right now. Like, they initially wanted the data catalog, and they're like, well, we are gonna now start a data governance project with Select Star. So the project has shifted a little bit, but I also do agree with them. Like, that is the right way to open up the data discovery platform to the rest of the company. So, yeah, those are the 2 kind of different paths that we see. But, eventually, the main flow is that once you first load the data, you are going to get this bird's eye view plus some insights into how the data is currently being used inside the organization.
Utilizing those insights, the data team will clean up or add documentation, so on and so forth, to make it more consumable for either the rest of the data team or the rest of the company. And then that's when they start adding more users in the organization. That's kind of how we've been seeing Select Star grow.
[00:39:40] Unknown:
Given the sort of breadth of scope and user base that you're building for, what have been some of the most complex or difficult aspects of building out the product, whether in terms of the technical underpinnings or the design elements or the sort of social aspects of building a product that so many people need to interact with?
[00:40:02] Unknown:
Yeah. I mean, we're still an early stage startup, but, you know, we have hundreds of users using it now. And most of them are, I would say, on data teams, so data analysts or ops analysts, PMs, and data engineers. So they are fairly familiar with either basic SQL or, at least, using, you know, drag and drop in Looker. I would say it's really everything. The API integration has its own challenges. On the application side, designing an interface that holds and shows a lot of data, but is still simple enough for a new user to start using, is always an ongoing challenge for us.
There's a lot of different things we can show, but how can we distill it down so that we are showing the most important or most useful thing for different sets of users?
[00:41:08] Unknown:
And then in terms of the adoption of Select Star and some of the ways that it's being employed, what are some of the most interesting or innovative or unexpected ways that you've seen it used?
[00:41:18] Unknown:
I would say Select Star has definitely evolved in a more open and wider direction than I initially thought it was going to go. Initially, I thought this would be a great tool to, like, utilize between just the data warehouse and BI tools, primarily for data analysts. And still, I would say 70% of the use cases are there. But allowing or empowering data teams to be able to run their data governance on their own terms has been a really eye opening experience for me, because usually data teams get roped into data governance through security and compliance, and they are not sure exactly what they need to do. But, like, Select Star can kind of open up and give them different insights for it, so that's been really interesting.
The other thing, around how people actually want to add different metadata on top of Select Star, was another bit of a surprise, but it's a very interesting part today. So 1 thing that we've gotten requests for is being able to have discussions and Q&As around datasets in Select Star, because a lot of companies have an analytics channel, you know, where everybody comes and asks questions. But it's very hard to search through. A lot of people ask the same questions, or they ask the same kind of question but for a different data set. So for things like that, we now have what we call discussions attached to every single data asset, where anyone can ask questions, which notifies the owner of the data asset, and they can reply on the thread, and that sends another notification to the person who originally asked the question.
And any of those comments, questions, and answers, all of that is indexed and searchable on top of the normal metadata. That feature, along with the Slack app integration we now have that feeds into the workflows of more users, I think has been a very interesting development for us as we are starting to move towards not just servicing the data team, but also helping the data team serve the rest of the company better. Yeah. And with that, I would say in the future, we also want to start integrating directly into different applications and workflows beyond there.
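The "indexed and searchable" discussions described above can be pictured as a small inverted index over thread text attached to each data asset. This is a toy sketch with invented threads, not Select Star's search implementation:

```python
import re
from collections import defaultdict

# Hypothetical discussion threads keyed by the data asset they're attached to.
discussions = {
    "table.orders": ["Why is revenue null for March?", "Backfill finished Tuesday"],
    "dashboard.revenue": ["Why is revenue null for March?"],
}

# Build an inverted index: each lowercased token maps to the assets mentioning it.
index = defaultdict(set)
for asset, threads in discussions.items():
    for text in threads:
        for token in re.findall(r"\w+", text.lower()):
            index[token].add(asset)

def search(term: str) -> set:
    """Return every asset whose discussions mention the term."""
    return index.get(term.lower(), set())
```

A production system would use a real search engine with ranking and phrase queries, but the core idea, tokenizing discussion text and mapping terms back to assets alongside the regular metadata, is the same.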
[00:43:48] Unknown:
In terms of your experience of building Select Star, what have been some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:43:56] Unknown:
I think the really interesting part that's been eye opening and really fun about building Select Star is seeing how it also impacts the workflow of not just, like, the traditional data team that I was thinking about, but other use cases as well. So I would say a couple of things. 1 is that the bird's eye view and the context of the data unlock a lot of value for our customers, whether they are small or large. Most of our customers fall into mid to large companies, and they would start with even, like, data migration, or just doing remodeling of data, or data governance and cataloging, which are, like, very expected use cases.
But we also have very small companies using Select Star because they want to make sure that they are building their data models in the right way, and that they are not leaving old or broken dashboards around, you know, for long periods of time. That was, like, 1 kind of discovery that we had that we didn't realize before. The other part, related to the different use cases where data discovery can be helpful, that has been a bit of a surprise for us, is how the usage data can be useful not just for, you know, understanding or finding data, but also for industries like financial services.
For them, it's being able to see the ROI of the data that they are currently buying, based on the internal usage, and for auditing, and so on and so forth. So those are some unexpected uses of Select Star that we started learning about, which we are also very excited to support.
[00:45:49] Unknown:
As you continue to build out Select Star and iterate on the problem space and work with your current set of customers and onboard new ones, what are some of the things that you have planned for the near to medium term of the company? I think I alluded to this before, but we are planning to release an API so our customers can pull and push metadata from and to Select Star, which I'm actually really excited about. This, I think, will also open up a lot of different integration points
[00:46:14] Unknown:
for us to integrate with more tools, so that, you know, BI tools and other tools can also be updated with the metadata and the usage information that we can generate. The other part that we're planning for is self-service. So we initially thought, and it still is the case, that it's mostly the mid to large companies that have these data discovery problems. At the same time, after we did the soft launch back in March, we've had a lot of requests from smaller companies. And, also to our surprise, a lot of these companies, I would say, are between 100 and 300 employees in size, or even at, like, 50 person company size.
They were able to all onboard themselves really quickly. And from day 1, they create tags, they add dashboard descriptions, and then they start inviting others. So we want to now open up Select Star to more people so that anyone can sign up and try Select Star on their own. So, yeah, we're just, like, in the process of building the sign up workflow now, after we finish with the API.
[00:47:23] Unknown:
And are there any other aspects of the overall problem space of data discovery and collaboration and the work that you're doing at Select Star that we didn't discuss yet that you'd like to cover before we close out the show? It's been really interesting to
[00:47:37] Unknown:
start seeing how the context of data can impact and help a lot of companies. I'm also really excited to see, with the rise of a lot of tools in the data ecosystem, you know, different use cases that I haven't seen yet that could happen in the future. And, yeah, I think overall interoperability is something that, as an industry, we need to work through more, and we want to be a good citizen on that front, contributing back to the community around integrations, metadata, and the way that different tools exchange
[00:48:18] Unknown:
the data. So on that point, have you been working with the folks on the OpenLineage project to add support for that, both as an ingest and export mechanism for what you're building at Select Star?
[00:48:30] Unknown:
Yeah. We are starting to look into it. It's been a while since I talked to Julien, but I plan to ping him soon, once we have the API part ready. And we are also in discussions with a few BI as well as other metrics companies around some, like, metrics protocol
[00:48:51] Unknown:
type initiative as well. So that's kind of where we are starting from. Very cool. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I would like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:49:11] Unknown:
I gotta say it's interoperability. I would say a lot of the previous generation of data companies have built proprietary data formats and the ways to process them, which promotes vendor lock in and long processes for external integration. I mean, at the same time, because it was proprietary within those companies, I'm sure they were able to run their product development much faster. But today, I think there are a lot more initiatives around being an open platform, like dbt, which is amazing. And, also, there are a lot of amazing point solutions for all the parts of the different data stacks.
But the migration always feels like a pain, like, even for our customers to try out a new tool as a POC, you know, just because you don't know what that migration or interoperability is like. So, yeah, that I feel is the gap that I see in the industry that I'm hoping will be improved in the next couple of years.
[00:50:11] Unknown:
Alright. Well, thank you very much for taking the time today to join me and share the work that you're doing at Select Star. It's definitely a very interesting product and an interesting problem space, so I'm definitely excited to see where it takes you and where you're able to take the platform. So thank you for all the time and effort you're putting into that, and I hope you enjoy the rest of your day. Thanks, Tobias. This was fun. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Overview
Interview with Shinji Kim: Introduction and Background
Data Discovery and Select Star's Mission
Industry Trends Impacting Data Discovery
Metrics Layer and Select Star's Features
Organizational Collaboration and Integration Challenges
Architectural Elements of Select Star
Integration with BI Tools and Data Lineage
Additional Integration Points and Customer Use Cases
Challenges in Building Select Star
Future Plans and API Development
Interoperability and Industry Collaboration