Summary
Companies of all sizes and industries are trying to use the data that they and their customers generate to survive and thrive in the modern economy. As a result, they are relying on a constantly growing number of data sources being accessed by an increasingly varied set of users. In order to help data consumers find and understand the data that is available, and help data producers understand how to prioritize their work, SelectStar has built a data discovery platform that brings everyone together. In this episode Shinji Kim shares her experience as a data professional struggling to collaborate with her colleagues and how that led her to founding a company to address that problem. She also discusses the combination of technical and social challenges that need to be solved for everyone to gain context and comprehension around their most valuable asset.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- You listen to this show to learn about all of the latest tools, patterns, and practices that power data engineering projects across every domain. Now there’s a book that captures the foundational lessons and principles that underlie everything that you hear about here. I’m happy to announce I collected wisdom from the community to help you in your journey as a data engineer and worked with O’Reilly to publish it as 97 Things Every Data Engineer Should Know. Go to dataengineeringpodcast.com/97things today to get your copy!
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
- Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a Data Engineering Podcast listener, you get credits worth $3000 on an annual subscription.
- Your host is Tobias Macey and today I’m interviewing Shinji Kim about SelectStar, an intelligent data discovery platform that helps you understand your data
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what SelectStar is and the story behind it?
- What are the core challenges that organizations are facing around data cataloging and discovery?
- There has been a surge in tools and services for metadata collection, data catalogs, and data collaboration. How would you characterize the current state of the ecosystem?
- What is SelectStar’s role in the space?
- Who are your target customers and how does that shape your prioritization of features and the user experience design?
- Can you describe how SelectStar is architected?
- How have the goals and design of the platform shifted or evolved since you first began working on it?
- I understand that you have built integrations with a number of BI and dashboarding tools such as Looker, Tableau, Superset, etc. What are the use cases that those integrations enable?
- What are the challenges or complexities involved in building and maintaining those integrations?
- What are the other categories of integration that you have had to implement to make SelectStar a viable solution?
- Can you describe the workflow of a team that is using SelectStar to collaborate on data engineering and analytics?
- What have been the most complex or difficult problems to solve for?
- What are the most interesting, innovative, or unexpected ways that you have seen SelectStar used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on SelectStar?
- When is SelectStar the wrong choice?
- What do you have planned for the future of SelectStar?
Contact Info
- @shinjikim on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- SelectStar
- University of Waterloo
- Kafka
- Storm
- Concord Systems
- Akamai
- Snowflake
- BigQuery
- Looker
- Tableau
- dbt
- OpenLineage
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Have you ever woken up to a crisis because a number on a dashboard is broken and no one knows why? Or sent out frustrating Slack messages trying to find the right dataset? Or tried to understand what a column name means? Our friends at Atlan started out as a data team themselves and faced all this collaboration chaos. They started building Atlan as an internal tool for themselves. Atlan is a collaborative workspace for data driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets and code, Atlan enables teams to create a single source of truth for all of their data assets and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker, and more.
Go to dataengineeringpodcast.com/atlan today. That's A-T-L-A-N, and sign up for a free trial. If you're a Data Engineering Podcast listener, you get credits worth $3,000 on an annual subscription. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.
Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. You listen to this show to learn about all of the latest tools, patterns, and practices that power data engineering projects across every domain. Now there's a book that captures the foundational lessons and principles that underlie everything that you hear about here. I'm happy to announce I collected wisdom from the community to help you in your journey as a data engineer and worked with O'Reilly to publish it as 97 Things Every Data Engineer Should Know. Go to dataengineeringpodcast.com/97things today to get your copy. Your host is Tobias Macey, and today I'm interviewing Shinji Kim about Select Star, an intelligent data discovery platform that helps you understand your data. So, Shinji, can you start by introducing yourself? Well, thanks for having me here,
[00:02:34] Unknown:
Tobias. Really excited. Yeah. I'm Shinji. I'm the founder and CEO of Select Star. We build an automated data discovery platform for everyone to be
[00:02:46] Unknown:
able to find and understand their own data. And do you remember how you first got involved in data management?
[00:02:52] Unknown:
So I studied software engineering at the University of Waterloo and have worked with a lot of companies in Silicon Valley since 2007 as a co-op intern, which has been an amazing experience. One of my very first internships in the Bay Area was working at Sun Microsystems Research Labs in sales forecasting as a statistical analyst, building out models and crunching about 10 years' worth of sales, marketing, and operations data. It basically showed what our forecast was compared to the plan versus the actuals. That, I would say, is how I got involved in data. I've also worked at Barclays Capital,
where I was building an application for global IT database consolidation. So I guess that was my first experience of doing data management. I built a .NET program that would scan through all the development databases in the bank that hadn't been used for more than a year. We had a whole list of databases that would get decommissioned, which saved a lot of money for the bank. So that's another experience where I vividly remember how much impact you can make from data management overall. I also worked at Facebook on the growth team, primarily doing keyword optimization for the ad campaigns we were running at Facebook back in 2009 for user acquisition and advertiser acquisition. I wrote a lot of ETL jobs there, and, yeah, just ran through the analysis.
I guess back in the day that combination of being a data engineer slash data analyst was kind of what I was doing. Then I moved to New York, worked in management consulting for a little bit, and worked at a mobile ad network called Yieldmo, which grew very quickly. There, we were processing about 10 billion events a day on our stream processing flow on top of Kafka, Storm, and HDFS, which were breaking at the time. This is back in 2013. The lead engineer from the company and I decided to start a new company on a modern way of doing distributed stream processing, called Concord Systems.
That was the first company that I started, in 2014, in the data platform and data infrastructure space. We had a stream processor that ran 10 to 20 times faster than the alternatives at the time, which were Apache Storm and Spark Streaming. Later, I sold the company to Akamai, and now it's an IoT data platform called IoT EdgeConnect, primarily designed to process sensor data coming from devices all around the world for consumer electronics companies and automotive companies that already have millions of devices like connected cars, smart TVs, and game consoles. It was basically running Concord on top of a distributed MQTT broker that's hosted on Akamai's CDN edge network.
I spent some time off traveling after I left Akamai, then moved to San Francisco about 2 years ago, and I started Select Star about a year ago based on a lot of the observations that I had in the data space, as well as my experience of being the end user as both a data producer and a data consumer. I felt like data discovery is an area and a problem that more people are starting to run into, and a place where I can also make an impact. So, yeah, there you go. Yeah. It's definitely a very
[00:06:35] Unknown:
interesting progression of challenges that you're dealing with. And I agree that data discovery is one of the big headline issues that people are trying to tackle at the moment. So I'm wondering if you can give a bit of an overview of what it is that you're building at Select Star now, and maybe add some nuance to what your thoughts are on the data discovery term versus data catalogs or metadata management, which are other elements in the space that people are using to try and address some of this discovery complexity?
[00:07:08] Unknown:
So what we are doing at Select Star, our main mission, is to make data easy. When you have data access but a lot of different things to sift through, how do you know, and how can you find, the right datasets you're looking for? The first angle we are starting from is data discovery, which I define as finding and understanding data. Finding data means that, even though you may not know what it's called, you should be able to find the column or table or dashboard or chart or metric that you are thinking about. Understanding data means having all the context around that data object or data asset, such as who's using it, where did it come from, where does it live today, and what are the ways that this data has been used in the past?
So I think there are a few things that are happening, or have been happening, in the market that are starting to make data discovery more painful. There are also smaller features and different angles of how we use data that weren't given a lot of attention in the past by the old data catalog tools. And this is not just because the old tools are not great. I mean, the old tools are not great, but the main difference, I would say, is the new world that we are living in. I see three main things happening in the industry that make today's data discovery and today's data catalogs very different, and they should be very different.
First and foremost, what I see in the industry, and this is very clear to everyone, is that companies are collecting more data. And the data is not coming from, and not being collected just from, your websites and apps anymore. I mean, you are already collecting a lot more data from those two. But you are also getting data from Salesforce, Marketo, Stripe. Data from all the tools that you currently use to store other types of operational data is now coming directly into the same data warehouse that you are also copying your production data into. This is primarily so that you can have one place that stores everything, so you can join different tables and make new observations and analyses on top of it, which is great, but it does have a lot of impact. So a few things. One is that you have a lot more data than before.
A lot of this data is entering the data warehouse or the data lake in a raw form. When it arrives, you cannot use it directly. So you have to transform it. You have to, like, match your customer ID with your customer name, and then you have some version that you can use as a dimension table or fact table. And then on top of that, in order for you to actually generate business reporting or any other analysis, you still have to run some aggregation, some materialized views and tables that you create on top. That also makes more tables and views inside the data warehouse.
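The raw-to-dimension-to-aggregate layering described here can be sketched in a few lines. This is an illustrative example using plain Python dicts in place of warehouse tables; all table and column names are made up, not any real schema.

```python
# "Raw" layer: order events as they land from an ingestion tool.
raw_orders = [
    {"customer_id": 1, "amount": 120.0},
    {"customer_id": 2, "amount": 75.0},
    {"customer_id": 1, "amount": 30.0},
]

# Dimension table: maps customer IDs to customer names.
dim_customers = {1: "Acme Corp", 2: "Globex"}

# "Fact" layer: raw rows enriched by joining in the dimension.
fct_orders = [
    {**row, "customer_name": dim_customers[row["customer_id"]]}
    for row in raw_orders
]

# Aggregated reporting layer: revenue per customer, analogous to a
# materialized view built on top of the fact table.
revenue_by_customer: dict[str, float] = {}
for row in fct_orders:
    name = row["customer_name"]
    revenue_by_customer[name] = revenue_by_customer.get(name, 0.0) + row["amount"]

print(revenue_by_customer)  # {'Acme Corp': 150.0, 'Globex': 75.0}
```

Each layer here corresponds to another set of tables and views in the warehouse, which is exactly what multiplies the objects users have to sift through.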
Eventually, it just becomes too confusing, with too many things that you have to sift through, because there isn't necessarily, like, one place to go see all of that today. And a lot of the old data catalogs are not necessarily designed for the operation of today's cloud data warehouses and cloud data lakes. So that's one part that I see. The second part I see is what I call the decentralization of data ownership. It used to be, 5 to 10 years ago, that you would go to the data platform team, and they would load the data, transform the data, store the data, and make the report for you. And if you wanted to change anything, they'd change it.
Now most organizations, especially larger ones, and also a lot of modern organizations today, have their own data teams in different divisions, but they are just not called data teams; they're called the ops teams. So you have sales ops, marketing ops, finance, product analytics, marketing analytics. Each business division has its own analysts, or people that will create their own dashboards, reports, and also some of their own tables and views on top of raw data and other materialized data. What this means is you don't just go to the data team to ask questions about data anymore.
You have to go to the finance team, or you have to ask the product team. And if you're trying to marry multiple datasets together, you will have to talk to multiple people. And sometimes you may get different answers from different people. That is confusing not just to someone who's trying to find the answer, but also to everyone else. Oh, I didn't know you were calculating revenue like that. You may also end up in a position where you get wrong answers because you didn't talk to everyone. So no one person or team holds the, you know, single source of truth or a single answer, and that is starting to, I would say, become a problem.
Last but not least, I also see a trend of what I would call the democratization of data access contributing to this issue. It's not just the engineers that are accessing data anymore; it's a lot of business stakeholders and what we call citizen data scientists and citizen data analysts that are accessing data directly today. 5 to 10 years ago, a lot of business stakeholders would get email reports of how their business was doing. Today, they have direct access to the data warehouse through Tableau, Looker, or Mode, and they can create their own reports.
They are trying to learn the tools so that they can slice and dice the data in different ways. What this means is they are now starting to have questions about the data, asking, can I slice this number by this dimension? Which dimension should I use to filter? What are the right dashboards that I should look at in order to answer this type of question? The answers to that are not always clear, and all of these questions, whether they're coming from ops teams, direct business stakeholders, or other engineering teams, are starting to turn the data team, who's been supporting everyone, into almost an internal IT help desk for everyone else trying to utilize their data.
That's just not great. I don't know how else I should put it. So I see issues happening around, like, a lot of ad hoc support that burns people out, and yet the tribal knowledge of data is still hard to, like, transfer to everyone. And hence, sure, hiring more analysts could be an answer, but ramping up new analysts always takes time. So, yeah, these are really the core challenges that I'm seeing in a lot of organizations today,
[00:14:41] Unknown:
especially with companies that have grown quickly in the last couple of years. To your point, there are a lot of different facets of this problem. And one of the things that jumped out at me as you were talking through the array of complexities is the idea of the single source of truth for metrics. And I know that there's been some motion in that space in terms of introducing the concept of a metrics layer, with the Minerva project from Airbnb being one of the notable examples, and then there's the Transform company that has recently launched to be a managed service to make that accessible to people so they don't have to build it themselves in house. I'm wondering if you can give your perspective on the relative utility of the metrics layer as it compares to the set of features that you are offering in Select Star, and just the utility of having metrics as a point solution versus being integrated into a more holistic approach to the discovery, access, analysis, and sort of social aspects of data within the company? I personally believe that
[00:15:49] Unknown:
the concept of metrics, like, you know, what is revenue, what is activation, and so on, is already defined somewhere in a lot of companies, whether that's in the database or the BI tool or in a SQL query. A lot of what companies like Transform bring to the table as a dedicated metrics platform is governing those metrics and, more importantly, being able to efficiently run those metrics queries. And I think it's really that efficiency, and the parallelization and everything else that they do underneath to calculate those metrics, that means no one has to wait a long time for something to load when they slice and dice a metric. Regarding the role that we play, and how we see the metrics players in the ecosystem integrating with us: we want to become a centralized place where you can find all the decentralized data, like metadata, around the ecosystem.
So we are starting with data warehouse and BI integrations today. And in between, we do have a concept called metrics that our customers can define by either adding a SQL query or pointing to a Looker measure field or a Tableau measure field or a column, and so on. But we do not execute any queries or metrics. It's really designed to convey that, hey, when somebody says revenue, the way that they get that data is by using this field or this column or this SQL query. The part where we really add value from that point of view, in addition to having a single place where people can find the definitions, like what it means, what business problem it solves, this customized documentation that they can add, is really connecting that definition back to where that metric currently exists in the data warehouse and in the BI tools.
So today, when our customers define a metric in Select Star in the form of a SQL query or a field in the BI tools, we will surface all the dashboards where it shows up today, so they get visibility into where that metric currently lives. And the way that we are thinking about integrating with the other metrics platforms, like Transform, is having interoperability so that if the customer defines their metrics in Transform, we can bring out the descriptions, the documentation, how it's defined, which table it is, and then they can move over to Transform or other tools to slice and dice the metric as they need to, or save it in their workflow, and so on.
That's what we are currently doing with a lot of BI tools today. You search for a keyword, we will find you the specific chart or dashboard or Explore field or data source field, and from there you can look at the top users, where that data comes from, what other dashboards are related, and how popular it is. And then from there, we always have a button on top called open in Looker, open in Tableau, open in Mode. So you can go back to that tool and do your deeper analysis or your own workflow afterwards. In that case, we are still the discovery platform where a lot of users start from Select Star to find what they're looking for. They will explore around, and then they will jump off to their main tool, whether that's Snowflake or some SQL IDE or BI tools.
[00:20:11] Unknown:
And that's actually an interesting point to dig into more: the organizational challenge of getting everybody to collaborate in the same space, where, you know, an analyst is going to be living in Snowflake or their dbt IDE, a data engineer is going to be living in their orchestration tool, and a business end user is going to be looking at the dashboarding system. How do you actually create those integration points to bring people into the same space for being able to ask and answer questions about the data in these different modes and these different contexts, without forcing them to change the workflow that they're used to, but still being able to reach out into those different systems and bring everyone together? I'd be interested to talk through a bit more how you're approaching that with Select Star, and some of the difficulties or interesting learnings that you've come across as you build out these integrations and work with customers to figure out what it is that they're looking for in a data discovery and collaboration tool.
Yeah. I think that's a really important point because, you know, I mean, no one wants to change what they're already
[00:21:18] Unknown:
doing just so that they can use a new tool, especially in data. You also don't want to have issues when you are syncing the data: what is the actual right version that we're going to use? So, initially, when we first started Select Star, we made this mostly read-only. Really, the magic that we have underneath is parsing through the SQL queries, analyzing the metadata, and putting them all into one place so that you are finding these insights that you actually didn't know about before. But sometimes, some of the customers have come to us and said, hey, I just want to change the description in Select Star.
Like, this is much easier to use. Our analysts don't want to always, like, you know, make a pull request just to change the spelling of a description, and so on. So we do have a UI to update that documentation in Select Star. And what we tell our customers is that if you are going to start doing that, that's totally fine, but what that means is we're not going to try to update the descriptions directly from Snowflake. So we want you to choose where you're going to add that data. Do you want to do it through dbt, because that's where you are currently updating your documentation?
Or do you want to do it through Select Star? So, I mean, as of today, we do read directly from Snowflake, BigQuery, dbt, you know, Looker, Tableau; we will read it and we will always update it every day. But in about a couple of months, at the end of Q3, we plan to release our API so that our customers can retrieve the metadata from Select Star directly. So once analysts have updated their documentation in Select Star, that latest doc you can, you know, push into Snowflake, Looker, Tableau, however you want.
So overall, we don't want to change our users' workflows, but once a lot of customers start using Select Star, they do want to collaborate on or add different details about their data in Select Star. That data, we want to make available for our customers to query. We also plan to make our metadata, like popularity or column-level lineage, available for our customers to retrieve so that they can utilize it programmatically in their Airflow jobs or dbt jobs or, you know, their data quality platforms, and so on.
[00:24:03] Unknown:
In terms of the actual SelectStar platform, can you talk through some of the architectural elements that you've built into it and how you're managing the sort of integration points across the different layers of the data stack to be able to provide this cross cutting view to the organization?
[00:24:19] Unknown:
So the way Select Star works is, we get access to the metadata through service accounts for the data warehouses and BI tools. For ETL models like dbt, we just get the dbt model, or we hook it up to the customer's dbt repo in GitHub. What we have underneath is what we would call a unified metadata store that will basically consolidate the different models into our version of the metadata. What that means is, for any data warehouse, like Snowflake or BigQuery, even though BigQuery calls them projects, datasets, and tables, we will treat them as database, schema, and table.
Similarly, for BI tools, we will treat, like, a Looker dashboard as the same kind of element as a Mode report, because a dashboard to us is a set of charts or queries. Each of these concepts in what I would call our unified metadata model has a custom data model underneath, with a specific integration model that defines each connector. With that, that's how we are able to aggregate and say, this is the, you know, popularity, or this is the data lineage. So that's how we treat the metadata. On top of that, we have our query parser.
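The unified metadata model described here, where warehouse-specific names like BigQuery's project/dataset/table all collapse into one database/schema/table hierarchy, can be sketched as below. The class and mapping are hypothetical illustrations, not Select Star's actual internal model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TablePath:
    """One unified three-level address for a warehouse table."""
    database: str  # BigQuery calls this a "project", Snowflake a "database"
    schema: str    # BigQuery calls this a "dataset", Snowflake a "schema"
    table: str


def normalize(source: str, parts: tuple) -> TablePath:
    # Both warehouses expose a three-level hierarchy, so the mapping is
    # positional; only the vendor-specific names differ.
    if source in ("bigquery", "snowflake"):
        return TablePath(*parts)
    raise ValueError(f"unknown source: {source}")


bq = normalize("bigquery", ("my-project", "analytics", "events"))
print(bq.database, bq.schema, bq.table)  # my-project analytics events
```

With every connector normalized this way, features like popularity and lineage can be computed once against the unified model instead of per-integration.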
Our query parser has support for different SQL dialects, as well as understanding whether, you know, this is a custom SQL query from Tableau or a query from Mode, and it also includes parsing through LookML, for instance. Combining all of that with the metadata model that we have, we emit our own popularity model. It also depends on the metadata you're looking at. For example, for a table, it's how many people are referencing this table in their SELECT queries. For dashboards, it will be how many people have viewed this dashboard in the last 30 days or last 90 days. So these are somewhat customizable models that we give our users control over, but we run our popularity model.
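A toy sketch of this kind of reference-count popularity is below. It counts how often each table appears in query logs and scales by the most-referenced peer, so scores are comparable within an asset type. The scoring formula is an assumption for illustration, not Select Star's actual model.

```python
from collections import Counter


def relative_popularity(table_refs: list) -> dict:
    """Score each table by its query-log reference count, normalized so the
    most-referenced table in the peer group scores 1.0."""
    counts = Counter(table_refs)
    top = max(counts.values())
    return {table: round(n / top, 2) for table, n in counts.items()}


# Table references extracted from, say, 30 days of SELECT query logs.
refs = ["orders", "orders", "orders", "orders",
        "customers", "customers", "events"]
print(relative_popularity(refs))
# {'orders': 1.0, 'customers': 0.5, 'events': 0.25}
```

Normalizing within the peer group is what makes a score like 0.5 meaningful: it says "half as referenced as the most popular table", regardless of the company's absolute query volume.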
We have a rolling window that we aggregate over. And at the end, we compute almost like a relative measure, so that for any data asset you can always see how popular it is inside the company relative to its own peers. Like, if I'm looking at a table, then table popularity is always relative to the other tables; same with columns, same with dashboards, and so on. On top of that, we have data lineage. Data lineage primarily comes from the model that we generate from our query parser, focused on all the DML and DDL queries that include a SELECT: so CREATE TABLE AS SELECT, UPDATE, MERGE, and so on. We will parse those through and put it together.
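The lineage-from-query-logs idea can be illustrated with a deliberately simplified parser. A production system handles full SQL dialects; this regex only covers the simplest CREATE TABLE AS SELECT shape, and the query text is made up.

```python
import re

# Matches "CREATE TABLE <target> AS SELECT ... FROM <source>" in its
# simplest form (single source table, no joins, no schema qualifiers).
CTAS = re.compile(
    r"create\s+table\s+(\w+)\s+as\s+select\b.*?\bfrom\s+(\w+)",
    re.IGNORECASE | re.DOTALL,
)


def lineage_edges(queries: list) -> list:
    """Return (source_table, target_table) edges extracted from CTAS queries."""
    edges = []
    for q in queries:
        m = CTAS.search(q)
        if m:
            target, source = m.group(1), m.group(2)
            edges.append((source, target))
    return edges


queries = [
    "CREATE TABLE fct_orders AS SELECT * FROM raw_orders",
    "CREATE TABLE rpt_revenue AS SELECT customer_id, sum(amount) "
    "FROM fct_orders GROUP BY 1",
]
print(lineage_edges(queries))
# [('raw_orders', 'fct_orders'), ('fct_orders', 'rpt_revenue')]
```

Chaining these edges is what yields the end-to-end raw-table-to-dashboard lineage view: each parsed statement contributes one hop in the graph.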
One part of data lineage that a lot of our customers really like is being able to see the end-to-end lineage from the raw table inside your data warehouse, to your transient table, to your reporting table and materialized view, to your Looker view, to your Explore, to the dashboard. And similarly from, you know, a Mode report, or from a Tableau data source to the embedded data source to the workbooks, and being able to see that at the sheet level or view level for dashboards. That's how we are seeing the world. And utilizing both of those is how we allow our customers to define their metrics.
So when you define a metric, we can tell you right away what the popularity of that metric is and which dashboards include that metric, and so on. And in the future, after we have the API support so that our customers can leverage our metadata model, we want to provide a way for them to build automated workflows on top, which I'm excited about.
[00:28:49] Unknown:
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you're looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch. The other interesting thing to dig into is the linkage of columns and queries to the downstream dashboards that are consuming them. And I know that that's usually one of the major goals in a data platform: being able to see, if I change this column, which reports is that going to change? And then, having that built-in popularity of how this dashboard is being viewed and, you know, being able to see who's viewing it, now I can get a better understanding of the impact of changing the formatting of this integer column here, or changing the precision of this float from 5 to 3 decimal points.
To the point of the integration with the dashboarding tools, I'm wondering if you can just talk through some of the challenges that are inherent in being able to work with a variety of different business intelligence systems and some of the conceptual boundaries that you've had to overcome in terms of how to think about modeling the interactions with the dashboard and feed that back into Select Star and just that whole space of integration complexity?
[00:30:33] Unknown:
So first of all, the part that you're mentioning about data lineage is actually a really useful and very interesting part of Select Star. With data lineage in general, you get to see and do that impact analysis so quickly, especially on the engineering side, what's gonna, you know, crash if I change this, versus from the data analyst side, oh, this dashboard is not loading correctly. Which table is this loading the data from? And is the data in each table actually up to date? And I think that's, like, a really valuable thing to have. On top of that, the part that BI integration adds is to go 1 step further in telling you who's gonna get impacted if this dashboard crashes.
So for us, 1 thing that we show on each database table page is the downstream impact list, where it shows which dashboards are using this table, what the popularity of each dashboard looks like, and who the top users of the dashboard are. So you can attach that directly to the table or column that you are looking at. It was actually 1 feature that 1 of our customers had requested, and it has been very useful for a lot of other customers too. Regarding the integration, yeah, it's been challenging. 1 part is mapping the models so that they all fit into, you know, the similar shapes that our users are used to seeing.
The other part is really just getting around what is actually supported versus not between different APIs. So for Tableau, today, when you give us access, like a service account and an API token, we actually utilize both the REST API and the Metadata API in order to fetch all the data we need. And then we usually do have to also connect to the Tableau Server Postgres instance that contains the activity data. That's just 1 example. And we are, you know, working along with all of the BI partners on this. They are very aware of the issue, and they're working on it. So I think this will definitely get better over time. Yeah. I mean, same with Looker. We get a lot of metadata from the Looker API, but we still do have to get the LookML repo separately today because that part is not exposed through the API. So I would say getting around different APIs has definitely been the challenging part that we've spent a lot of time on.
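As a rough illustration of the Tableau side mentioned above: the Tableau Metadata API accepts GraphQL queries over HTTP, authenticated with a session token header. The sketch below only assembles the request pieces without sending anything; the query fields follow the published schema, but treat the exact shapes as assumptions to verify against your Tableau version, not as Select Star's implementation:

```python
import json

# GraphQL query asking each workbook for its upstream warehouse tables.
# Field names (workbooks, upstreamTables) follow Tableau's Metadata API schema.
WORKBOOK_LINEAGE_QUERY = """
{
  workbooks {
    name
    upstreamTables { name schema database { name } }
  }
}
"""

def build_request(server: str, auth_token: str) -> dict:
    """Assemble URL, headers, and body for a Metadata API call (no network I/O)."""
    return {
        "url": f"{server}/api/metadata/graphql",
        "headers": {
            "X-Tableau-Auth": auth_token,  # session token from the REST API sign-in
            "Content-Type": "application/json",
        },
        "body": json.dumps({"query": WORKBOOK_LINEAGE_QUERY}),
    }
```

Actually sending the request would be a `requests.post` with these pieces; it is omitted here to keep the sketch self-contained.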
[00:33:21] Unknown:
Beyond the dashboarding layer and the data warehouse, for being able to do the query parsing and the table analysis, what are some of the other integration points that you've either had to build out or are working on building out to be able to make Select Star viable for presenting this data discovery and collaboration layer for the organization?
[00:33:43] Unknown:
Yeah. A lot of our integration is driven by the customers that we're working with. So with that, there are other, you know, integrations that we want to make happen. Conceptually, a few things that haven't been as important to our previous customers, but are more important for new customers that we're gonna be working with, are something like struct support for BigQuery, which I think is a really interesting way to retrieve data in BigQuery, but is not necessarily used a lot in the Snowflake world, for instance. So that's fun. Dbt is another 1. We are supporting, basically, you know, being able to create Select Star documentation just out of your dbt catalog files. But if you already have a data warehouse connection through dbt and you already have persisted docs, the dbt docs or the YAML files or manifest.json are not necessarily needed.
What we are starting to look at is, what is the other metadata from dbt that we can make sure our users benefit from? Such as, when's the last time this model has run? What are the dbt tests that run against this table? So those are some of the things that we are starting to think through and try to add on to Select Star.
[00:35:03] Unknown:
As far as the workflow of a team that is onboarding onto Select Star, can you just talk through some of the steps that are involved in not necessarily just getting set up, but working on a single data analysis project together, and how the interaction with Select Star spans the different roles and stakeholders in the company?
[00:35:26] Unknown:
So I see generally 2, like, camps when we work with customers. 1 camp is companies with a fairly large or strong data team, like, you know, a lot more people involved and much deeper, you know, data models and many different tables, so on and so forth. And we also have companies that have, like, a smaller number of tables, but whose focus is really to enable support for everyone else inside the company. Actually, I would say both camps go through a very similar framework, but maybe on a different time horizon. In the beginning, once the data is loaded so usually, once we connect to the data warehouse and the BI tool, it takes just a couple hours or so to load all the data and bring out the lineage and popularity, so on and so forth. Currently, we usually give it about 24 hours for the tool to settle.
And then what we recommend to our users is, okay, take a look at what it shows, and tell us if there's anything missing and whether it looks alright. The part that we ask them to fill out is, first of all, what are the service accounts that you're currently running? Because if you have ETL jobs that are creating tables every hour, it's gonna mess up the popularity. So we give them a tool to check off which accounts are the service accounts, so that we can adjust the popularity weight based on that information. And then, based on the overall popularity, a lot of data teams actually decide to onboard Select Star to use as a data discovery tool or data catalog for the rest of the team.
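The service-account adjustment described here can be pictured as a weighted count over a query log: queries from known service accounts are down-weighted (or zeroed out) so that hourly ETL jobs don't dominate the popularity signal. Account and table names below are invented for illustration:

```python
from collections import Counter

# Accounts the team has flagged as automated rather than human.
SERVICE_ACCOUNTS = {"etl_loader", "airflow_svc"}

def popularity(query_log, service_weight=0.0):
    """Score tables by query count, weighting service-account queries separately."""
    scores = Counter()
    for user, table in query_log:
        scores[table] += service_weight if user in SERVICE_ACCOUNTS else 1.0
    return scores

log = [("alice", "orders"), ("etl_loader", "orders"),
       ("etl_loader", "orders"), ("bob", "customers")]
```

With the default weight of 0, the two ETL hits on `orders` are ignored entirely, leaving `orders` and `customers` equally popular despite the automated traffic.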
For larger companies, they still do that, because they don't have a lot of time to dedicate a lot of effort in the beginning when they are just trying out the tool. So they just start giving out access to other analysts, and those analysts still find value because they can find their own existing tables. They can see who's actually using each table, all the lineage and the dashboards, and so on and so forth. So they start using that, and that's when they start adding their tags or documentation, so on and so forth. So we take it a little bit slowly with the larger companies because there's a lot of coordination within the company. It takes some time, but we usually have, like, an initial onboarding session for a lot of our customers, where they onboard, you know, 5 to 10 or up to 20 data analysts so that they can start working on it. And then they ask us different questions or give us feedback through Slack channels, and then we hold, like, office hours. For companies that say that their data model is a little bit more manageable, when they first see their data in Select Star, many actually find that, oh, I see all the places that I need to clean up first before I give access to everyone.
So they say, I'm going to actually take the next month or 2, and I'm going to dedicate some of my colleagues' time, so that we can deprecate the old database tables that we don't really need, put the right tags on what we are going to govern and manage, and start defining metrics. We are in process with a couple of customers that are doing that right now. Like, they initially wanted the data catalog, and they're like, well, we are gonna now start a data governance project with Select Star. So the project has shifted a little bit, but I also do agree with them. Like, that is the right way to open up the data discovery platform to the rest of the company. So, yeah, those are the 2 kind of different paths that we see. But, eventually, the main flow is that once you first load the data, you are going to get this bird's eye view plus some insights into how the data is currently being used inside the organization.
Utilizing those insights, the data team will clean up or add documentation, so on and so forth, to make it more consumable for either the rest of the data team or the rest of the company. And then that's when they start adding more users in the organization. That's kind of how we've been seeing Select Star grow.
[00:39:40] Unknown:
Given the sort of breadth of scope and user base that you're building for, what have been some of the most complex or difficult aspects of building out the product, whether in terms of the technical underpinnings or the design elements or the sort of social aspects of building a product that so many people need to interact with?
[00:40:02] Unknown:
Yeah. I mean, we're still an early stage startup, but, you know, we have hundreds of users using it now. And most of them are, I would say, on data teams, so data analysts or ops analysts, PMs, and data engineers. So they are fairly familiar with either basic SQL or, at least, using, you know, drag and drop in Looker. I would say it's really everything. The API integration has its own challenges. On the application side, designing an interface that holds and shows a lot of data, but is still simple enough for a new user to start using, is always an ongoing challenge for us.
There's a lot of different things we can show, but how can we distill it down so that we are showing the most important or most useful thing for different sets of users?
[00:41:08] Unknown:
And then in terms of the adoption of Select Star and some of the ways that it's being employed, what are some of the most interesting or innovative or unexpected ways that you've seen it used?
[00:41:18] Unknown:
I would say Select Star has definitely evolved in a more open and wider direction than I initially thought it was going to go. Initially, I thought this would be a great tool to, like, utilize between just the data warehouse and BI tools, primarily for data analysts. And still, I would say 70% of the use cases are there. But allowing or empowering data teams to be able to run their data governance on their own terms has been a really eye opening experience for me, because usually data teams get roped into data governance through security and compliance, and they are not sure exactly what they need to do. But, like, Select Star can kind of open up and give them different insights for it, so that's been really interesting.
The other thing, around how people actually want to add different metadata on top of Select Star, was another bit of a surprise, but it's a very interesting part today. So 1 thing that we've gotten requests for is being able to have discussions and Q&As around datasets in Select Star, because a lot of companies have an analytics channel, you know, where everybody comes and asks questions. But it's very hard to search through. A lot of people ask the same questions, or they ask the same kind of question but for a different data set. So for things like that, we now have what we call discussions attached to every single data asset, where anyone can ask questions, which notifies the owner of the data asset, and they can reply on the thread, and that sends another notification to the person who originally asked the question.
And any of those comments, questions, and answers, all of that is indexed and searchable on top of the normal metadata. That feature, along with the Slack app integration we now have that feeds into the workflows of more users, I think has been a very interesting development for us as we are starting to move towards not just servicing the data team, but also helping the data team serve the rest of the company better. Yeah. And with that, I would say in the future, we also want to start integrating directly into different applications and workflows beyond there.
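The "indexed and searchable" discussions described above can be pictured as a small inverted index over thread text attached to each data asset. This is a toy sketch with invented threads, not Select Star's search implementation:

```python
import re
from collections import defaultdict

# Hypothetical discussion threads keyed by the data asset they're attached to.
discussions = {
    "table.orders": ["Why is revenue null for March?", "Backfill finished Tuesday"],
    "dashboard.revenue": ["Why is revenue null for March?"],
}

# Build an inverted index: each lowercased token maps to the assets mentioning it.
index = defaultdict(set)
for asset, threads in discussions.items():
    for text in threads:
        for token in re.findall(r"\w+", text.lower()):
            index[token].add(asset)

def search(term: str) -> set:
    """Return every asset whose discussions mention the term."""
    return index.get(term.lower(), set())
```

A production system would use a real search engine with ranking and phrase queries, but the core idea, tokenizing discussion text and mapping terms back to assets alongside the regular metadata, is the same.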
[00:43:48] Unknown:
In terms of your experience of building Select Star, what have been some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:43:56] Unknown:
I think the really interesting part that's been eye opening and really fun about building Select Star is seeing how it also impacts the workflow of not just, like, the traditional data team that I was thinking about, but other use cases as well. So I would say a couple of things. 1 is that the bird's eye view and the context of the data unlock a lot of value for our customers, whether they are small or large. Most of our customers fall into mid to large companies, and they would start with even, like, data migration, or just doing remodeling of data, or data governance and cataloging, which are, like, very expected use cases.
But we also have very small companies using Select Star because they want to make sure that they are building their data models in the right way, and that they are not leaving old or broken dashboards around, you know, for long periods of time. That was, like, 1 kind of discovery that we had that we didn't realize before. The other part, related to the different use cases where data discovery can be helpful, that has been a bit of a surprise for us, is how the usage data can be useful not just for, you know, understanding or finding data, but also for industries like financial services.
For them, it's being able to see the ROI of the data that they are currently buying, based on the internal usage, and for auditing, and so on and so forth. So those are some unexpected uses of Select Star that we started learning about, which we are also very excited to support.
[00:45:49] Unknown:
As you continue to build out Select Star and iterate on the problem space and work with your current set of customers and onboard new ones, what are some of the things that you have planned for the near to medium term of the company? I think I alluded to this before, but we are planning to release an API so our customers can pull and push metadata from and to Select Star, which I'm actually really excited about. This, I think, will also open up a lot of different integration points
[00:46:14] Unknown:
for us to integrate with more tools, so that, you know, BI tools and other tools can also be updated with the metadata and the usage information that we can generate. The other part that we're planning for is self-service. So we initially thought, and it still is the case, that it's mostly the mid to large companies that have these data discovery problems. At the same time, after we did the soft launch back in March, we've had a lot of requests from smaller companies. And, also to our surprise, a lot of these companies, I would say, are between 100 and 300 employees in size, or even at, like, 50 person company size.
They were able to all onboard themselves really quickly. And from day 1, they create tags, they add dashboard descriptions, and then they start inviting others. So we want to now open up Select Star to more people so that anyone can sign up and try Select Star on their own. So, yeah, we're just, like, in the process of building the sign up workflow now, after we finish with the API.
[00:47:23] Unknown:
And are there any other aspects of the overall problem space of data discovery and collaboration and the work that you're doing at Select Star that we didn't discuss yet that you'd like to cover before we close out the show? It's been really interesting to
[00:47:37] Unknown:
start seeing how the context of data can impact and help a lot of companies. I'm also really excited to see, with the rise of a lot of tools in the data ecosystem, you know, different use cases that I haven't seen yet that could happen in the future. And, yeah, I think overall interoperability is something that, as an industry, we need to work through more, and we want to be a good citizen on that front, contributing back to the community around integrations, metadata, and the way that different tools exchange
[00:48:18] Unknown:
the data. So on that point, have you been working with the folks on the OpenLineage project to add support for that, both as an ingest and export mechanism for what you're building at Select Star?
[00:48:30] Unknown:
Yeah. We are starting to look into it. It's been a while since I talked to Julien, but I plan to ping him soon, once we have the API part ready. And we are also in discussions with a few BI as well as other metrics companies around some, like, metrics protocol
[00:48:51] Unknown:
type initiative as well. So that's kind of where we are starting from. Very cool. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I would like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:49:11] Unknown:
I gotta say it's interoperability. I would say a lot of the previous generation of data companies have built proprietary data formats and the ways to process them, which promotes vendor lock in and long processes for external integration. I mean, at the same time, because it was proprietary within those companies, I'm sure they were able to run their product development much faster. But today, I think there are a lot more initiatives around being an open platform, like dbt, which is amazing. And, also, there are a lot of amazing point solutions for all the parts of the different data stacks.
But the migration always feels like a pain, like, even for our customers to try out a new tool as a POC, you know, just because you don't know what that migration or interoperability is like. So, yeah, that I feel is the gap that I see in the industry that I'm hoping will be improved in the next couple of years.
[00:50:11] Unknown:
Alright. Well, thank you very much for taking the time today to join me and share the work that you're doing at Select Star. It's definitely a very interesting product and an interesting problem space, so I'm definitely excited to see where it takes you and where you're able to take the platform. So thank you for all the time and effort you're putting into that, and I hope you enjoy the rest of your day. Thanks, Tobias. This was fun. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Overview
Interview with Shinji Kim: Introduction and Background
Data Discovery and Select Star's Mission
Industry Trends Impacting Data Discovery
Metrics Layer and Select Star's Features
Organizational Collaboration and Integration Challenges
Architectural Elements of Select Star
Integration with BI Tools and Data Lineage
Additional Integration Points and Customer Use Cases
Challenges in Building Select Star
Future Plans and API Development
Interoperability and Industry Collaboration