Summary
Data is useless if it isn’t being used, and you can’t use it if you don’t know where it is. Data catalogs were the first solution to this problem, but they are only helpful if you know what you are looking for. In this episode Shinji Kim discusses the challenges of data discovery and how to collect and preserve additional context about each piece of information so that you can find what you need when you don’t even know what you’re looking for yet.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
- Data stacks are becoming more and more complex. This brings infinite possibilities for data pipelines to break and a host of other issues, severely deteriorating the quality of the data and causing teams to lose trust. Sifflet solves this problem by acting as an overseeing layer to the data stack – observing data and ensuring it’s reliable from ingestion all the way to consumption. Whether the data is in transit or at rest, Sifflet can detect data quality anomalies, assess business impact, identify the root cause, and alert data teams on their preferred channels. All thanks to 50+ quality checks, extensive column-level lineage, and 20+ connectors across the Data Stack. In addition, data discovery is made easy through Sifflet’s information-rich data catalog with a powerful search engine and real-time health statuses. Listeners of the podcast will get $2000 to use as platform credits when signing up to use Sifflet. Sifflet also offers a 2-week free trial. Find out more at dataengineeringpodcast.com/sifflet today!
- The biggest challenge with modern data systems is understanding what data you have, where it is located, and who is using it. Select Star’s data discovery platform solves that out of the box, with an automated catalog that includes lineage from where the data originated, all the way to which dashboards rely on it and who is viewing them every day. Just connect it to your database/data warehouse/data lakehouse/whatever you’re using and let them do the rest. Go to dataengineeringpodcast.com/selectstar today to double the length of your free trial and get a swag package when you convert to a paid plan.
- Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.
- Your host is Tobias Macey and today I’m interviewing Shinji Kim about data discovery and what is required to build and maintain useful context for your information assets
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you share your definition of "data discovery" and the technical/social/process components that are required to make it viable?
- What are the differences between "data discovery" and the capabilities of a "data catalog" and how do they overlap?
- discovery of assets outside the bounds of the warehouse
- capturing and codifying tribal knowledge
- creating a useful structure/framework for capturing data context and operationalizing it
- What are the most interesting, innovative, or unexpected ways that you have seen data discovery implemented?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on data discovery at SelectStar?
- When might a data discovery effort be more work than is required?
- What do you have planned for the future of SelectStar?
Contact Info
- @shinjikim on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends at Linode. With their new managed database service, you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes with automated backups, 40 gigabit connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show.
Data stacks are becoming more and more complex. This brings infinite possibilities for data pipelines to break and a host of other issues, severely deteriorating the quality of the data and causing teams to lose trust. Sifflet solves this problem by acting as an overseeing layer to the data stack, observing data and ensuring it's reliable from ingestion all the way to consumption. Whether the data is in transit or at rest, Sifflet can detect data quality anomalies, assess business impact, identify the root cause, and alert data teams on their preferred channels, all thanks to over 50 quality checks, extensive column level lineage, and over 20 connectors across the data stack. In addition, data discovery is made easy through Sifflet's information rich data catalog with a powerful search engine and real time health statuses. Listeners of the podcast will get $2,000 to use as platform credits when signing up to use Sifflet. Sifflet also offers a 2 week free trial. Find out more at dataengineeringpodcast.com/sifflet today.
That's S-I-F-F-L-E-T. Your host is Tobias Macey, and today I'm interviewing Shinji Kim about data discovery and what is required to build and maintain useful context for your information assets. So, Shinji, can you start by introducing yourself?
[00:02:00] Unknown:
Sure. Thanks for having me back here again, Tobias. Excited to be here. My name is Shinji Kim. I'm the founder and CEO of Select Star. Select Star is an automated data discovery tool that helps everyone to be able to find and understand their data.
[00:02:16] Unknown:
For folks who didn't listen to your previous interview, I'll add a link in the show notes, but can you briefly give an overview of how you got started working in data?
[00:02:25] Unknown:
So I have a computer science background and worked as a software engineer, data scientist, and product manager in the past, where I was a direct data producer, someone who brings in and also creates data models, and also an active data consumer, building models and making business decisions based on the analysis that we create. I started a company in 2014 called Concord Systems focused on distributed stream processing, which we sold to Akamai, and it's now an IoT platform called IoT EdgeConnect. So I've been working primarily in data platform infrastructure technologies for about 6 to 7 years prior to Select Star.
And I started Select Star because I saw a lot of needs around data understanding, for everyone to be able to find the right data. Now it's not just the large enterprises, but every company that has hundreds and thousands of datasets in their Snowflake, BigQuery, or Redshift as they are plugging their source applications, not just their production databases, but also different SaaS applications, into the data warehouse. And a lot of this, I would say, is actually a really good movement towards having more people inside the organization be able to make data driven decisions, for them to be able to run their own analysis of their customers and make better decisions faster.
At the same time, navigating through your data warehouse or, you know, your BI tool has been more challenging than ever today because of the amount of data and the context that everyone now needs to have in order for them to truly use that data.
[00:04:25] Unknown:
Before we get too much into some of the product updates from the last time we talked, I'm interested in digging into this term of data discovery that I brought up at the open and that you mentioned in the description of what you're building. And it's a term that has been relatively new in terms of widespread usage in the data ecosystem. And I'm wondering if you can give your definition of what it means and the technical and social and process aspects that it encapsulates and that are necessary to make a data discovery capability viable?
[00:05:01] Unknown:
Yeah. I think that's a great question. I think this definition and how it's translated in different organizations is still forming. But the way that we define data discovery at Select Star is all about finding and understanding the data that you have. So basically, in order to make that happen, first and foremost on the technical side of it, you do need to ensure all your metadata is available in 1 place so that you have a structured way to find the data and spot where exactly it's located. And then on top of that, in order for you to truly understand the data, you would want to have the context of that data asset.
The context of a data asset can be things like when was this data asset created or updated, where it came from, who's using it the most, and how is this data being used, in what types of joins or queries. And also, even beyond just the database, whether there are other applications that are leveraging that specific data. And this is, I think, where a lot of the tooling, like Select Star, really tries to help out, to automate the process of bringing this, what Gartner calls active metadata, to be available and searchable within 1 platform.
As for the social components of data discovery, I think it really comes from ensuring everyone knows where to go to find information or ask questions. This has primarily been done through a lot of Slack channels or 1 on 1 messages, is what we've observed from our customers before they adopted Select Star. And in a way, I think adoption of any new tool requires a bit of change management for people to start utilizing that tool. And as people start utilizing it, the part where data discovery can also really help around the social components would be allowing people to comment on that data or tag other people about that data, having that be integrated directly with your Slack channels, or getting a notification on email or Slack. Those, I think, are the other components that can be included in data discovery.
The important part about this social component of data discovery is that most of the time, this semantic level of information, more of this tribal knowledge that gets discussed within conversations or Slack messages, is usually a lot harder to capture from a metadata perspective. Being able to have that integration with Slack or email, and so on and so forth, is important so that it can also be searchable within that 1 platform. Last but not least, you also mentioned the process side of this. And I think the process side is what eventually brings a lot of this ad hoc knowledge sharing together.
What we see from a lot of our customers is that they may use Select Star at first as just a primary go to place, like a Google for data. But as you continue utilizing the platform, you will start adding different descriptions or tagging or ownership. And having these processes, understanding which datasets are marked to be deprecated, whether this is a gold, silver, or bronze table, who the main owners are, and having templates for your data documentation. These are all part of the process perspective that we recommend every customer have so that there is a standard put into place that people can trust
[00:09:09] Unknown:
as they are utilizing a data discovery tool. Does that make sense? Yeah. It makes perfect sense. So definitely thank you for sharing that perspective on what this term is being used for and how you're adopting it for your own work at Select Star. And an interesting comparison is, maybe a year or 2 ago, the word of the day was data catalog, and that was the kind of clearinghouse for how you figure out what data an organization has and maybe figure out what is the popularity ranking for a given table or something like that. And I'm wondering if you can talk to some of the differences between what people have in mind when they use the term data catalog versus data discovery and where that overlap
[00:09:57] Unknown:
sits? I think the main difference of data discovery really comes from providing this active metadata, or the automated data context around how the data is currently being used inside the company. Traditionally, data catalogs have existed since databases have existed, to give you a full schema and map of all the metadata of any sources that you connect to. In a way, the whole purpose of a data catalog is to create an inventory of all your data, which I would say a lot of enterprise data catalog tools are still focused on. Whereas companies like Select Star, more of a data, quote, unquote, discovery platform, have an emphasis on trying to direct people to find the right dataset and giving them the right way to use that dataset.
So it's really more focused on the consumption side of data, how to use that data better. And if you're looking for certain types of data, which 1 is the right data to use? That's kind of how we see the market, as the main differentiation. A catalog overall, or just any metadata catalog, I think is almost now a baseline feature that many other data tools also have, including observability or quality types of tools. The aspect of discovery really comes from combining all the usage data and the insights of the multiple apps together to provide a better way of using the data, is how we define it. I think it's an interesting evolution because
[00:11:49] Unknown:
the overarching agreement has been that metadata is the lifeblood of any data system. You need to be able to collect and understand the different applications of that metadata to be able to build something that is truly flexible and adaptable for the evolving data needs and data tools and simplifying some of that integration pain. And I think that data catalogs were something that was hit on early on because it's something that is understandable and relatively well scoped for being able to capture that metadata and make it useful. And I think that now that we have that as a basis point that everybody can understand and that a lot of people are starting to adopt, it gives the opportunity for folks like yourself and other people who are working in the kind of metadata arena to branch out from there and figure out what are the next set of capabilities and features that we can build on top of now that we have this unified view of metadata, now that we have gotten everybody on board to saying, okay. We are actually going to build a centralized view of metadata and make it useful. What can we do with that now?
[00:13:00] Unknown:
When we talk to companies that have tried to adopt more legacy data catalog players, the way they started was as some kind of data governance project. And the first thing that they were doing is trying to just get all metadata in 1 place. That itself may be like a year long project because each connector has a different metadata format. Everything may take a while to load. But then, once the data is loaded, what are you actually doing with that metadata? If there isn't much context around that metadata, will people actually use the catalog to find what they're looking for?
And it goes to that question again: I'm looking for data around x, y, z, and if I don't know what the table is called or what the column is called, can I still find that dataset? With the traditional way of only cataloging the physical metadata, that is really, really hard to tell. Whereas by looking into more of how the data is being used and how each dataset is connected to another, you can get additional context, which really helps you to actually use that data beyond just having 1 place to search.
[00:14:27] Unknown:
On that point of actually using some of this additional context and social information to figure out what is the data asset that I'm actually looking for. To your point, if you don't know the table name, then you don't really even know what you're looking for necessarily. And so 1 of the early solutions was to just say, we'll just rank by popularity, and so whatever the most widely used table is, is probably the 1 that you're looking for, and then you can just kind of branch out from there and maybe use the lineage view to understand what are the tables feeding this, what is this feeding into. I'm curious if you can talk to some of the other aspects of context and how that helps with some of the detective work of saying, I'm trying to solve this problem. I don't even know what data is out there. How do I figure out the piece of information that I'm looking for?
[00:15:14] Unknown:
So like you mentioned, there's looking at popularity based on how other people are utilizing data: which data is being referenced the most, which datasets are being selected the most. And you can also go into the level of who is querying this data, or look at what a given analyst or team uses the most. Without really having any idea about the datasets, this is actually 1 of the places that a lot of our users start from. When they're new to the team, being able to look up what their team uses the most, or what their managers or peers use the most, is definitely 1 place to start.
Another way to start is by observing what may be happening inside the data discovery platform. So earlier we mentioned the social aspect of this platform. People may be talking about or discussing a specific theme or word or table that you didn't know the name of. But because there is a discussion going on, you are now exposed to more of the context about the dataset. And you can think of other ways that keyword may match up with other datasets as well. Another part that we look into is lineage. Lineage primarily shows you the whole data model: where each column was generated, where it's flowing to, how it becomes either a reporting table or more of a materialized view.
Other ways that we can also look at lineage, or different angles to look at lineage, is what are the tables that are actually derived from the same parents or same sources of the data? So you can start with a certain KPI, look at the sources, and then find what is almost like a sibling, if you're looking at it as a graph. So these are other ways to discover other datasets or dashboards that are created on top of that data. And last but not least, another part of discovering other datasets, I would say, comes from noticing the joins that are happening. So you may just start from 1 table, and it can be anywhere, like any popular table. Right?
A lot of those tables will have other dimension tables that you may see being joined on. And this is a great way to discover other datasets that you may not have been aware of in the past, but that are actually being joined with those tables. But most importantly, I think search is the most important part here if you have a certain dataset you are looking for. And there are many ways to look into search, not just at the level of indexing on the name of the table, but also looking into the table comments, column comments, any docs that it might be related to, tags that are attached to the tables, and the actual people that you think might know about this table. So being able to search through any of those aspects of the dataset, I think, is also important for discovery purposes.
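As a rough illustration of the popularity idea described above, here is a minimal sketch, not Select Star's actual model, that scores tables from a warehouse query log, blending recency-weighted query counts with the number of distinct users. The log format, decay curve, and weights are all assumptions made up for this example.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Hypothetical query-log records: (table_name, user, timestamp).
# In practice these might come from a warehouse's query history.
QUERY_LOG = [
    ("analytics.orders", "amy", datetime(2022, 9, 1, tzinfo=timezone.utc)),
    ("analytics.orders", "bo", datetime(2022, 9, 20, tzinfo=timezone.utc)),
    ("analytics.customers", "amy", datetime(2022, 6, 5, tzinfo=timezone.utc)),
]

def popularity_scores(log, now=None, half_life_days=90.0):
    """Score each table by recency-weighted query count plus distinct users."""
    now = now or datetime.now(timezone.utc)
    weighted = defaultdict(float)
    users = defaultdict(set)
    for table, user, ts in log:
        age_days = (now - ts).total_seconds() / 86400
        # Exponential decay: a query loses half its weight every half-life.
        weighted[table] += 0.5 ** (age_days / half_life_days)
        users[table].add(user)
    # Blend query volume with breadth of usage; the 2.0 weight is arbitrary.
    return {t: weighted[t] + 2.0 * len(users[t]) for t in weighted}

for table, score in sorted(popularity_scores(QUERY_LOG).items(),
                           key=lambda kv: -kv[1]):
    print(f"{table}: {score:.2f}")
```

Ranking the output of a function like this is one plausible way to surface "what my team uses most" as a starting point for someone new.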
[00:18:51] Unknown:
The biggest challenge with modern data systems is understanding what data you have, where it is located, and who is using it. Select Star's data discovery platform solves that out of the box with a fully automated catalog that includes lineage from where the data originated all the way to which dashboards rely on it and who is viewing them every day. Just connect it to your dbt, Snowflake, Tableau, Looker, or whatever you are using, and Select Star will set everything up in just a few hours. Go to dataengineeringpodcast.com/selectstar today to double the length of your free trial and get a swag package when you convert to a paid plan.
1 of the other elements that you mentioned is that in the data creation and data production environment, so a lot of the applications that we build, the interactions with end users that generate some of these data points, there is a lot of implicit context that exists in terms of the application logic that feeds into the specific records that are generated, the sequencing of particular engagement events with a customer. And I'm wondering if you can talk to some of the ways that that context can be captured and propagated into this discovery system to feed a richer understanding of what the data actually semantically means once you do come across the table and are trying to figure out, how am I going to use this information, what are the valid aggregations or transformations that I can perform on it while maintaining the original meaning of the information as it was generated?
[00:20:27] Unknown:
Great question. And it is something that we are continuing to work on at Select Star, to provide more of these places to apply the context to data analysts' and data engineers' day to day workflows. A few things I can point out here. Regarding queries and joins, by giving you a sense of which are the most used types of select queries and also joins, meaning the join conditions, which tables are being joined, and which join keys are being used, you can map out which datasets you can utilize together. And, this may require a little more than just the metadata, either by you bringing out which may be the foreign keys or primary keys on the table, or by us detecting that for you.
We can easily create almost like an entity relationship diagram where you can see a full data model of the database based on that specific table. And this is a new feature that we released that we're starting to see more analysts utilizing, because most of the time, when you have a data lake or data warehouse environment, the relationships around the primary keys and foreign keys get lost. And many times people are just guessing which are the right join keys for that particular table. So by highlighting which joins have already been done in the past and letting you look up that query, this is a concrete example and context you can get right away. Another part that we are introspecting regarding data lineage is how the data has been transferred along the lineage. So from the column lineage perspective, we can tell you whether the data has been propagated as is, whether it has been transformed, or whether it's been aggregated into the next column.
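To make the join-mining idea concrete, here is a toy sketch of extracting join conditions from raw SQL in a query log with a regular expression. This is only an illustration: a real system would use a proper SQL parser, and the pattern below only catches simple `a.x = b.y` equality joins. The sample queries are invented.

```python
import re
from collections import Counter

# Hypothetical sample of query texts pulled from a warehouse query log.
QUERIES = [
    "SELECT * FROM orders o JOIN customers c ON o.customer_id = c.id",
    "SELECT o.id FROM orders o JOIN customers c ON o.customer_id = c.id",
    "SELECT * FROM orders o JOIN payments p ON o.id = p.order_id",
]

# Matches only simple "JOIN table alias ON alias.col = alias.col" patterns.
JOIN_ON = re.compile(
    r"JOIN\s+\S+\s+(\w+)\s+ON\s+(\w+)\.(\w+)\s*=\s*(\w+)\.(\w+)",
    re.IGNORECASE,
)

join_keys = Counter()
for sql in QUERIES:
    for m in JOIN_ON.finditer(sql):
        _, lt, lc, rt, rc = m.groups()
        join_keys[((lt, lc), (rt, rc))] += 1

# The most frequent join conditions are good candidates for implicit
# foreign-key relationships when inferring an ER diagram.
for (left, right), n in join_keys.most_common():
    print(f"{left[0]}.{left[1]} = {right[0]}.{right[1]}  ({n} queries)")
```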
And utilizing this context, customers can also understand, when it's aggregated, how it's being aggregated. If it's a column that is just using the same data as is, we also utilize that relationship to propagate any descriptions that you might have from upstream and propagate any tags that it might have. And also, we are starting to work on putting in more of the workflow so that you can also propagate ownership or a notification chain with that. So I think these are some of the things where we are now starting to scratch the surface as we introspect further into how the data has been composed and how the data is currently being used.
By putting them into these structures like lineage or our popularity model, we can, you know, programmatically start giving you more context.
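A minimal sketch of the propagation idea just described: walk a column-level lineage graph and copy tags only across edges where the data moved as-is, stopping at transformed or aggregated columns. The graph shape and edge labels here are made up for illustration, not Select Star's internal model.

```python
from collections import deque

# Hypothetical column-level lineage: column -> list of (target, transform).
# Tags are only safe to propagate when a column is copied as-is.
LINEAGE = {
    ("raw.users", "email"): [
        (("staging.users", "email"), "as_is"),
        (("analytics.user_stats", "n_users"), "aggregated"),
    ],
    ("staging.users", "email"): [
        (("marts.dim_users", "email"), "as_is"),
    ],
}

def propagate_tags(source, tags):
    """BFS downstream from `source`, carrying tags over as-is edges only."""
    tagged = {source: set(tags)}
    queue = deque([source])
    while queue:
        col = queue.popleft()
        for target, transform in LINEAGE.get(col, []):
            if transform != "as_is" or target in tagged:
                continue
            tagged[target] = set(tags)
            queue.append(target)
    return tagged

print(propagate_tags(("raw.users", "email"), {"pii"}))
# Tags reach staging.users.email and marts.dim_users.email, but not the
# aggregated analytics column, which no longer contains raw values.
```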
[00:23:42] Unknown:
Your mention of the foreign key relationships as they exist in the source databases, which are often lost when you're just pulling the data directly from that database into your warehouse or into your lake, that's an interesting observation. And I'm wondering if you have seen any of these systems such as Airbyte and Fivetran, etcetera, able to capture some of that context as well in the process of doing the extract and load, where you say, this is the table I'm loading from, this column has a foreign key relationship with this other table. And then maybe also things like understanding and introspecting the fact that this table also has a compound index on these 2 or 3 columns, so that that way you can understand, okay, these 3 columns have some sort of implicit relationship with each other because they're often used for fetching a specific record, and being able to feed more of that information into the downstream analysis so that analysts don't have to go digging back into the source database or digging through the source code that generated those records to understand what those application level operations and requirements are and how you can reflect that into the analysis that you're performing.
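To make the "lost foreign keys" point concrete, here is a sketch of how an EL tool or a discovery platform could read foreign-key relationships out of a Postgres source via `information_schema` before the rows land in a warehouse. The DSN is a placeholder, the query handles only single-column foreign keys, and this is an illustration rather than what any particular vendor actually ships.

```python
import psycopg2  # assumes a Postgres source and the psycopg2 driver installed

FK_SQL = """
SELECT
    tc.table_name,
    kcu.column_name,
    ccu.table_name  AS referenced_table,
    ccu.column_name AS referenced_column
FROM information_schema.table_constraints tc
JOIN information_schema.key_column_usage kcu
  ON tc.constraint_name = kcu.constraint_name
JOIN information_schema.constraint_column_usage ccu
  ON tc.constraint_name = ccu.constraint_name
WHERE tc.constraint_type = 'FOREIGN KEY';
"""

def extract_foreign_keys(dsn):
    """Return (table, column, referenced_table, referenced_column) tuples."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(FK_SQL)
        return cur.fetchall()

# Shipping these tuples alongside the extracted rows would let the
# destination catalog reconstruct the source ER diagram.
for fk in extract_foreign_keys("postgresql://localhost/appdb"):  # placeholder DSN
    print(fk)
```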
[00:24:58] Unknown:
This has been 1 of our asks to Fivetran in the past, because they already have that ERD for many of their source connectors. But today, it's not directly replicated to the destination. You know, there may be something that they are starting to work on to try to expose more of that source metadata to the destination, so that there is also a clear lineage and metadata transfer beyond just the data transformation and load itself. But I think they are in a great position
[00:25:38] Unknown:
to do so to bridge that gap. And I think that that layer is the right place to do it. Yeah. I could definitely see that being very useful, particularly as you're maybe starting to build out your dbt models to go from, here are my raw data assets, to the intermediate tables where you're trying to do some domain object modeling of the application objects and trying to recombine them from these normalized tables into something that is a more unified view of that concept, and just being able to use those relationships from the source database to even automate some of the dbt SQL that you might need to write to recombine those tables?
[00:26:19] Unknown:
But it is something that we have requested, because, yeah, I think there are asks on both the source side and also the destination. Also, beyond the data warehouse side, companies are now moving data back to the applications. Having that lineage back on the application side is also a new area that is starting to emerge, to add to the discovery perspective.
[00:26:44] Unknown:
Yeah. That's an interesting point too of being able to capture that source metadata. This is how these tables existed at the time that we pulled them out of the source system. We've done these transformations and enrichment from other applications. Now we wanna feed that back into the source application, being able to split that back out into the normalized models based on the information you had on the way in.
[00:27:06] Unknown:
Yeah. I'm closing the full circle now.
[00:27:11] Unknown:
Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it's no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. That's where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability.
Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $5,000 when you become a customer. And another interesting element of data discovery and data catalogs is that a lot of the conversations around them are oriented around the data warehouse or the data lake or these tabular representations of information.
And I'm wondering how you have seen this aspect of data discovery branch out into some unstructured data that maybe is related in some way to the tabular assets that you're working with, but is not something that you can shove into the data warehouse. So maybe you have some customer interaction events that relate to an image or an ebook that you're trying to use as inbound marketing, and you wanna be able to process that PDF in some way, maybe feed it into a machine learning model to do some semantic understanding of it and just being able to link the tabular data of these are the customer events that we're working with to the specific asset that they are related to as far as the original behavior that you're trying to do the analysis on?
[00:29:18] Unknown:
I think that's an area that we haven't explored so much yet. The extent that we are starting to get exposed to is around the metadata of that unstructured data, regarding the access patterns or the events that have happened, which come in various formats, or it could be happening through more like an S3 bucket. So it's more of the file access. But eventually, in order to have our data lineage model and also our metadata model standardized within Select Star, we basically try to compose a model that can fit into a relational model in general. And this is something that we are starting to notice: when you have all these JSON events coming through and you want to track through different events, how that would eventually convert into different columns. That's basically the extent of what we're starting to think about right now. I think the other side where the data is being used a lot, but we don't talk as much about the metadata and are just starting to, is the BI side.
Because for the rest of the company, data consumption happens through BI tools like Tableau, Power BI, or Looker. A lot of these complex BI tools have their own data models underneath. So how is that transferred as its own data model within BI? And then from there, it's exposed as part of a chart or dashboard or some sheet or workbook. So that side of the model we've been building a lot, to expose that model even though there wasn't anything defined as a relational model in the past. And we try to make it so that it's still unified with our metadata model. So, you know, every dashboard will have some kind of a chart or some kind of subcomponent.
And within those subcomponents and also in the dashboard, we will display several queries: how many people are viewing this dashboard and what interactions are there, what are the filters or group bys that people are running, which gives you another set of eyes on how the data is being consumed in your ecosystem.
[00:31:59] Unknown:
Digging now into the work that you've been doing at SelectStar, I believe it's been at least a year since the last time we spoke. And I'm wondering if you can share some of the ways that your platform and product has evolved and some of the ways that these broader conversations around data catalogs versus data discovery versus metadata management has influenced the overall product focus and the ways that you think about prioritizing effort and maybe even some of the ways that the customer journey goes from, I have this need of understanding what is the catalog of all of my assets through to where we are now of I need to do more broad based discovery and context management around these assets.
[00:32:41] Unknown:
So, you know, we started with a data discovery platform designed for any data team member to be able to easily find, understand, and utilize their data. Kind of them using Select Star as if it's their Google for data. And the pillars of the technologies that we built for that included the automated metadata catalog, column level lineage, and then this usage model, like a popularity model. And starting already this year, we've been starting to leverage all 3 parts more in combination and started building new features. And this also kind of maps to where our customers are heading. So a lot of companies that adopted Select Star initially wanted to gather the insights around metadata.
And as their data teams grow, they are now starting to bring on people outside of the data team to start utilizing Select Star, and have Select Star as the go to place if anyone wants to ask questions or look up information about data. So 1 part that we added early this year is the whole notion around docs and metrics. So this is to provide, beyond just the physical data model, a way for our users to be able to put together what their business process looks like and how the data models are mapped. And on top of that, they can define their KPIs as metrics and also note that this metric can be calculated with this SQL query, or is represented by a measure in Looker or Tableau, or by looking into this column or table. So that's kind of the 1 major part that we've been upgrading, so that data teams can share the context beyond what we are giving them automatically. They can really start adding more of a semantic level and business level understanding of data that they can share with everyone else.
And 1 of the important parts around that is providing them with a way to be able to create this augmentation. But we also wanted to make sure it has a good connection back to their data, so referencing a table, or being able to mention tables, columns, dashboards, and users, will now create a backlink connection. So for anything that you have defined as a metric, if you go to its table, then you can see a metrics label, and the column is already marked because it's mentioned as a KPI. You can see for a table that it was mentioned within a data concept doc within Select Star. And having that link between the high level documentation and the data model has been a major part that's starting to drive Select Star usage beyond the data team.
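A small sketch of the backlink idea just described: metric docs mention physical assets, and inverting those mentions yields the "mentioned as a KPI" labels on the asset's page. The metric format and names here are invented for illustration, not Select Star's actual schema.

```python
from collections import defaultdict

# Hypothetical metric definitions that reference physical assets.
METRICS = {
    "weekly_active_users": {
        "sql": "SELECT COUNT(DISTINCT user_id) FROM analytics.events",
        "mentions": ["analytics.events", "analytics.events.user_id"],
    },
    "gross_margin": {
        "sql": "SELECT SUM(r.amount - c.amount) FROM finance.revenue r ...",
        "mentions": ["finance.revenue", "finance.costs"],
    },
}

def build_backlinks(metrics):
    """Invert metric -> asset mentions into asset -> metrics backlinks."""
    backlinks = defaultdict(list)
    for name, metric in metrics.items():
        for asset in metric["mentions"]:
            backlinks[asset].append(name)
    return backlinks

# On an asset's page, these backlinks surface every KPI that depends on it.
print(build_backlinks(METRICS)["analytics.events"])  # ['weekly_active_users']
```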
So that's 1 part. The second part is that, as we are starting to notice that beyond the data team, different parts of the organization are starting to utilize Select Star, we've added more enterprise level, or enterprise grade, access control capabilities. Most of the time, a lot of the emerging enterprises that we've been working with, companies like Opendoor or Handshake, have basically open access to their data warehouses for their data team. At the same time, as more companies are opening up their data warehouse, 1 thing is that they want to make sure that people are not confused by all the data they have. And, also, there is always a set of data that people are legally not supposed to be exposed to, or where you want to gate access to that information.
So we are releasing more fine grained access control within Select Star so that you can define who can see what. You can define this at the team level, or you can define this by certain attributes like tags, and it is much easier for you to define what the overall experience of using data discovery will look like per user, depending on which team they belong to and what the dataset is tagged with or entails. And last but not least, another new part that we're releasing pretty soon is all around exposing more of this context about data beyond Select Star. We've noticed a lot of customers also utilizing our API within their workflow, but the parts that we provide, the automated descriptions or lineage or even discussion items, we'll be starting to expose that through a Chrome plugin so that you don't have to always be in selectstar.com.
You can just use it while you're browsing through your BI tool or you're in a SQL IDE or wherever you want. So those are kind of the major changes or major updates that have happened and are also coming up for us.
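For the tag- and team-based visibility rules described a moment ago, here is a deliberately simplified sketch. Real policy engines are far richer, and every rule, team name, and tag here is invented for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Asset:
    name: str
    tags: set = field(default_factory=set)

@dataclass
class Policy:
    # Maps a restricting tag to the set of teams allowed to see it.
    tag_allow: dict = field(default_factory=dict)

    def can_view(self, user_team: str, asset: Asset) -> bool:
        """An asset is visible unless one of its tags restricts it."""
        for tag in asset.tags:
            allowed = self.tag_allow.get(tag)
            if allowed is not None and user_team not in allowed:
                return False
        return True

# Only the data platform and legal teams may see PII-tagged assets.
policy = Policy(tag_allow={"pii": {"data-platform", "legal"}})
salaries = Asset("hr.salaries", tags={"pii"})
print(policy.can_view("marketing", salaries))      # False
print(policy.can_view("data-platform", salaries))  # True
```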
[00:38:25] Unknown:
Yeah. In particular, what you were just saying about being able to use the metadata that you have in Select Star and surface that in the BI environment, but also in the SQL IDE is definitely very interesting. And, also, the integration that you're doing with DBT so that you can have that information feeding in both directions where you're working on your DBT model, so you're able to see in your IDE, this is some of the information that I have about all the other tables that I need to, you know, understand. Or, like, as I'm starting to say this table name, I can understand, okay, what are some of the columns and some of the associated metadata about that?
And then I build my dbt model, and that feeds back into what you have in Select Star so that people can understand downstream, okay, this is actually the set of dbt models that generated this table that I need to understand. And it's definitely great seeing more kind of cross pollination and bidirectional integrations and interactions with more of the tools that people are using to be able to build up these analyses and build up these assets, so that context isn't something that is the responsibility of 1 system. It's something that everybody can collaborate on and feed in both directions as people are building and generating that additional context.
[00:39:39] Unknown:
Yeah. For sure. I mean, the API is a really interesting part. We're starting to see, and get surprised by, the use cases and how much customers are doing with the API. You know, that gives us really cool ideas too. So yeah. I agree.
[00:39:54] Unknown:
As you have been exploring the space further and iterating on your platform, what are some of the most interesting or innovative or unexpected ways that you have seen folks building or using these data discovery capabilities, and in particular, some of the ways that context is able to be captured and managed and propagated throughout?
[00:40:16] Unknown:
I'll elaborate a little bit more on that API usage. We have customers that are starting to use our lineage API in their CI pipeline just in order to not have any more downtime in their data. If you think about how a lot of companies utilize lineage today, they are primarily using it to introspect and find the root cause of why a dashboard broke or why a data pipeline has an issue. The company that did this is called Xometry. It's a public company that runs a marketplace for manufacturers and suppliers.
And we actually just released a case study about this, because they were looking for a data lineage partner for more than a year, because they wanted to put this into their CI pipeline. This was a pretty critical issue for them, where their data engineers end up spending hours and hours fixing issues that their production engineers didn't really know they were creating. So by integrating the lineage API, for any metadata changes, like column deletions or name changes, it will ping our API and we will return how many downstream objects may get affected. And if that's more than 0, then it will basically send an automated comment on their git to say, hey, there are issues that are gonna happen.
Check out this page on Select Star, and utilizing our API, it will auto generate the lineage link that they have to go to. And it's pretty remarkable to see how they just don't have these pages anymore. And by saving a lot of time on this, their data team gets to really focus on more proactive, forward looking projects rather than people spending time on just triaging or fixing production issues in their data pipeline. We've also seen customers starting to generate their legal security reports around PII tagged data.
This also leverages lineage a lot, because part of our lineage will also give you the usage information of the downstream effects, or whoever has touched that data recently. So building more of this reporting capability is another part that we didn't necessarily intend in the beginning, but we are seeing really interesting usage of it.
[00:43:04] Unknown:
Yeah. It's definitely great being able to automate any sort of compliance documentation so that you don't have to do all the tedious work of building it yourself. I'm also interested in understanding a bit more about the use case you were referring to with feeding the lineage information back to the production engineers. So just to make sure I'm getting this right, it sounds like, using terminology and projects that I'm familiar with, say you have a Django web application. It has the database models in the ORM. A developer says, I'm going to add a column or rename a column using a migration in the ORM. They are using your API in their CI/CD to understand, okay, this ORM model maps to this database table, which is getting consumed into this downstream report eventually.
So if you rename this column, then this is actually going to break these tables in this downstream report, and so, you know, make sure that this is communicated to all the people who need to care about it. Is that correct?
[00:44:01] Unknown:
Yeah. I believe it blocks the PR, the way they implemented it. Basically, it will look at the diffs of the code that is getting merged. Any metadata changes will be taken into Select Star and run through the lineage API to see if there is any response. And if there is, then it comes back with, you know, here's a link that you need to go check for the downstream effects, because there are more than 0 objects that are getting affected because of this.
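A hedged sketch of the CI integration described above: take the columns a migration renames or drops, ask a lineage API for their downstream objects, and fail the build when anything is affected. The endpoint, response shape, and environment variables are all hypothetical, not Select Star's documented API, and the diff parsing is assumed to have happened already.

```python
import os
import sys

import requests  # generic HTTP client; the lineage API itself is hypothetical

LINEAGE_URL = os.environ.get("LINEAGE_API_URL", "https://example.com/api/lineage")
API_TOKEN = os.environ["LINEAGE_API_TOKEN"]

def downstream_count(table: str, column: str) -> int:
    """Ask the (hypothetical) lineage API how many downstream objects
    depend on a column that this change renames or drops."""
    resp = requests.get(
        f"{LINEAGE_URL}/columns/{table}/{column}/downstream",
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        timeout=30,
    )
    resp.raise_for_status()
    return len(resp.json().get("objects", []))

def main() -> int:
    # Imagine an earlier CI step parsed the migration diff into
    # (table, column) pairs that are being renamed or deleted.
    changed = [("public.orders", "customer_id")]
    broken = {
        (t, c): n for t, c in changed
        if (n := downstream_count(t, c)) > 0
    }
    for (t, c), n in broken.items():
        print(f"{t}.{c}: {n} downstream objects affected", file=sys.stderr)
    return 1 if broken else 0  # a non-zero exit code blocks the PR

if __name__ == "__main__":
    sys.exit(main())
```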
[00:44:31] Unknown:
Definitely very cool and something that I would like to see more investment in across the board of being able to feed that information in both directions where as developers are modifying and manipulating the source systems that data is being consumed from for downstream reports, they are brought into the conversation about what impact is this change going to have rather than having that be the responsibility and burden of the data engineers who have to be in constant firefighting mode once that change propagates and they have no more control over it.
[00:45:02] Unknown:
Exactly. Yeah. I mean, production or product engineers don't know what the impact is of changing that small column or deprecating a table that nobody seems to be using. Right? And they may not all have heard of Select Star, but because the API is integrated, they can easily check out that page, because their CI will give them that information.
[00:45:29] Unknown:
Very cool. So in your own experience of going on this journey of launching your product and then going through the past year or so since we last talked, evolving the platform, and expanding into this data discovery capability and conversation that's happening across the ecosystem, what are some of the most interesting, unexpected, or challenging lessons that you have learned?
[00:45:53] Unknown:
There are interesting things I could say about all 3 of those. I would say the interesting part, and what's also starting to be a challenging part of the industry that we are in right now, is that more people are starting to wake up to the fact that they do need better data discovery, because they have just migrated to their data warehouses, or cloud data warehouses, and are realizing that it's not super easy to use when you have hundreds and thousands of tables and so many different databases that you have to sift through. So the awareness around the importance of data discovery is growing, and I think that's really awesome to see. At the same time, it's also hard for people to start thinking about, how are we actually going to utilize data discovery?
How can we communicate this to our management in order to adopt the tool or invest in a tool or capabilities? And I think this is something that we as an industry are starting to really develop, and it can be confusing for very early customers that haven't thought about this as a capability in the past. Because this is definitely 1 of the newer areas that has emerged; once you have this modern data stack running, this starts to become a much clearer issue. But how to make that part of the standard stack, I think, does take some time for everyone to get on the same page about.
[00:47:29] Unknown:
And so as you continue to build out and iterate on the platform and the product vision and direction for Select Star, and continue talking to people in the ecosystem who are getting up to speed on the capabilities and use cases for metadata and discovery capabilities, what are some of the things you have planned for the near to medium term, or problem areas that you're excited to dig into?
[00:47:53] Unknown:
I mentioned a couple things of what's coming up in our road map. I'm also really excited to continue leveraging and building on this context metadata, or active metadata, structure that we have. So a couple things that we've recently done are around automating your documentation. So you document in 1 place, and based on lineage, or if the dataset is duplicated, or any new places that we see it fit, we will start propagating the documentation or tags or ownership information throughout the platform. This, combined with allowing you to also link your business documentation, is something that we are starting to develop more, so that you can actually transmit and share the knowledge around data beyond first having your data team make it, and have it also be understandable by people that are outside of the data team. And the whole notion around allowing you to transfer this or see this within your BI tool or your SQL IDE is all part of that, so that this context metadata is more ubiquitous and versatile.
And there is always also the higher level context that you can add. These are some of the parts that I'm very excited about, to enable more people to be able to understand and use data better.
[00:49:30] Unknown:
Are there any other aspects of the work that you're doing at Select Star or this overall conversation of data discovery that we didn't discuss yet that you'd like to cover before we close out the show? I'm very excited about, like, everything that's happening in the ecosystem.
[00:49:44] Unknown:
When I first started the company, I was told, from multiple, I guess, investors and other people, that data discovery seems like a vitamin. It's not a painkiller. Like, it seems like just a nice to have tool. Why not just use a, you know, Notion doc or whatnot? And for me, over the last couple of years of us being in the market, we've seen so many amazing use cases of how this really changes the data team's work culture, how much time they've saved, and how they can really now focus more on forward looking projects and also enable the rest of the company.
So, yeah, I'm really excited for what's coming, also as we are noticing more companies moving towards self-service analytics and enabling more of their employees to leverage data better.
[00:50:42] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:50:58] Unknown:
I think that's a great question. Data management overall, it's not just a tooling perspective. There's also a big part around change management, the process perspective, and the social perspective, in order for data to be managed well. And 1 part where, regarding tooling and technology, we can do better, and this is definitely 1 of the areas we're continuing to work on, is bridging the gap between the business processes and the data models that support those business processes. Today, I think there are ways to try to document this in rich text documentation. But to really fully map it, have it automated, and have it understandable by everyone,
I think that there are just definitely more ways to go, and I'm curious to find out what other solutions are maybe out there in about a year or 2 and also kind of the road map of how we wanna tackle this problem to really bring the understanding of data in 1 place.
[00:52:10] Unknown:
Alright. Well, thank you very much for taking the time today to join me and share the work that you've been doing at Select Star and your perspectives on the different challenges and use cases for data discovery and how we can use that information to power upstream and downstream work and use cases. So I appreciate all the time and energy that you and your team are putting into contributing to this ecosystem, and I hope you enjoy the rest of your day. Thanks so much, Tobias.
[00:52:43] Unknown:
Thank you for listening. Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story.
And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Sponsor Messages
Interview with Shinji Kim: Introduction and Background
Defining Data Discovery
Data Catalog vs Data Discovery
Context and Social Information in Data Discovery
Metadata and Data Lineage
Evolution of Select Star
Innovative Uses of Data Discovery
Challenges and Future Directions
Closing Remarks