Summary
A data catalog is a critical piece of infrastructure for any organization that wants to build analytics products, whether internal or external. While there are a number of platforms available for building that catalog, many of them are either difficult to deploy and integrate, or expensive to use at scale. In this episode Grant Seward explains how he built Tree Schema to be an easy to use and cost effective option for organizations to build their data catalogs. He also shares the internal architecture, how he approached the design to make it accessible and easy to use, and how it auto-discovers the schemas and metadata for your source systems.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Follow go.datafold.com/dataengineeringpodcast to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.
- Are you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy data analytics in the cloud. Our comprehensive data-level security, auditing and de-identification features eliminate the need for time-consuming manual processes and our focus on data and compliance team collaboration empowers you to deliver quick and valuable data analytics on the most sensitive data to unlock the full potential of your cloud data platforms. Learn how we streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta.
- Your host is Tobias Macey and today I’m interviewing Grant Seward about Tree Schema, a human friendly data catalog
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by giving an overview of what you have built at Tree Schema?
- What was your motivation for creating it?
- At what stage of maturity should a team or organization consider a data catalog to be a necessary component in their data platform?
- There are a large and growing number of projects and products designed to provide a data catalog, with each of them addressing the problem in a slightly different way. What are the necessary elements for a data catalog?
- How does Tree Schema compare to the available options? (e.g. Amundsen, Company Wiki, Metacat, Metamapper, etc.)
- How is the Tree Schema system implemented?
- How has the design or direction of Tree Schema evolved since you first began working on it?
- How did you approach the schema definitions for defining entities?
- What was your guiding heuristic for determining how to design the interface and data models?
- How do you handle integrating with data sources?
- In addition to storing schema information you allow users to store information about the transformations being performed. How is that represented?
- How can users populate information about their transformations in an automated fashion?
- How do you approach evolution and versioning of schema information?
- What are the scaling limitations of Tree Schema, whether in terms of the technical or cognitive complexity that it can handle?
- What are some of the most interesting, innovative, or unexpected ways that you have seen Tree Schema being used?
- What have you found to be the most interesting, unexpected, or challenging lessons learned in the process of building and promoting Tree Schema?
- When is Tree Schema the wrong choice?
- What do you have planned for the future of the product?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- Tree Schema
- Tree Schema – Data Lineage as Code
- Capital One
- Walmart Labs
- Data Catalog
- Data Discovery
- Amundsen
- Metacat
- Marquez
- Metamapper
- Infoworks
- Collibra
- Faust
- Django
- PostgreSQL
- Redis
- Celery
- Amazon ECS (Elastic Container Service)
- Django Storages
- Dagster
- Airflow
- DataHub
- Avro
- Singer
- Apache Atlas
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
Are you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy analytics in the cloud. Their comprehensive data level security, auditing, and deidentification features eliminate the need for time consuming manual processes, and their focus on data and compliance team collaboration empowers you to deliver quick and valuable analytics on the most sensitive data to unlock the full potential of your cloud data platforms.
Learn how they streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta. That's I-M-M-U-T-A. Your host is Tobias Macey. And today, I'm interviewing Grant Seward about Tree Schema, a human friendly data catalog. So Grant, can you start by introducing yourself?
[00:01:54] Unknown:
Hello. My name is Grant. I come from a relatively general data background. I worked at Capital One in the early part of my career developing data products. I got into the data science space at startups as well as at Walmart Labs and eventually ended up leading a data engineering team, most recently to build out a digital bank. I don't proclaim to have an overly technical background, and I didn't study engineering in school or really even start to code until I was 23 or 24. You know, I just like building data products, and I like to create value from data and to show its efficacy. I really just enjoy using data in general.
[00:02:26] Unknown:
Do you remember how you first got involved in data management?
[00:02:29] Unknown:
So it was really out of frustration, to be honest. I think that if you had even asked me a few years ago if I would ever see myself in metadata management, I would have said absolutely not. I started off the first couple of years, again, at Capital One. And just to provide some context, Capital One does data very well, and they have well managed data pipelines. They have clean and consistent data, data stewards who are really deep technically and have good business knowledge, and they generally just have a really solid data culture. And so that data culture and expectation of how to treat data, curate it, and extract value from it was really ingrained in me from an early point in my career. It wasn't until after I left Capital One that I started to understand how bad the proliferation of poor data management was. I'm talking about the most fundamental aspects of quality data management: standardization, documentation, ownership, discovery, access management, and the like.
Then this applied to pretty much every company I went to after Capital One. To give an example, one of the companies I worked at was a startup that had maybe 50 developers and a dozen or so data scientists. There was absolutely no documentation for the data. The data scientists would send each other Jupyter Notebooks or just point to a link in GitHub and essentially say, check out how this field was used in the past. Now fast forward a couple of years, I'm working at another startup, and I'm responsible for building out the entire data ecosystem. In the back of my mind, I'm thinking to myself, in order for us to have a sustainable advantage with our data, we need to properly manage our metadata: good documentation, easily tracked data lineage, clear and direct visibility into who the owners are, etcetera.
All of this is part of, again, building a strong data culture. So one day, I'm tasked with adding a new field to a database, a pretty mundane task all in all. As many of the listeners may know, adding a new field to the database often means updating the source that captures the data or creates the data. So I started to track that down. The full lineage for that from start to finish ended up looking something like: receive a file from a vendor, convert the file from fixed width to Parquet. This particular company operated in Latin America, so we did a translation from Spanish to English, and then finally, we saved that file into the database. It took me over two hours to fully backtrack and identify all the touch points that needed to be changed here. And the worst part was that I had created that pipeline the year prior. So at this point, I knew that there had to be a better way for us to manage our metadata and to track our data lineage.
So I started to look for products on the market, and I couldn't find anything that met what I was looking for. All the SaaS data catalogs were bundled with a bunch of other features that either I didn't want or I didn't need, and those other features really drove the cost up much higher than what the startup could afford. On the flip side, even though the current open source products that we have today had not yet been released, I didn't want to maintain a data catalog, and I didn't want it to be something that my team had to spend time to set up or maintain, even with some sort of Dockerized or containerized service. I just wanted metadata management, and I wanted to use it as a service. After some time, I gave up on the search. And given how clear this problem was to me and how common it seemed to be, I decided to start Tree Schema.
[00:05:37] Unknown:
And so you've given a pretty good background as to what led you down the road of building Tree Schema, but I'm wondering if you can give a bit more detail on what it is that you've got working there and some more of the motivation behind actually turning it into a business versus just having it be an internal product that you used with your team at that startup?
[00:05:57] Unknown:
Yeah. Sure. So I think the motivation around turning it into a business instead of just an internal product is that this problem really seemed to be persistent across many of the companies that my colleagues worked at. And so a lot of the folks that I have worked with in the past had been at startups, and it's very common again for people at startups to go to other startups. And I ended up talking to maybe a dozen or so of my colleagues who worked at unique companies, and everywhere that they worked, this was a problem. In one shape or form or another, it was just difficult for companies early on to have good, solid metadata management practices. And so Tree Schema is a data catalog that makes the essential metadata management capabilities available to everyone. This includes catalog basics such as data discovery, rich text documentation, assigning owners to your datasets, being able to have conversations about your data with your data, effectively everything you would consider table stakes for a data catalog.
We have really positioned Tree Schema to be the premier data catalog for startups and small and medium sized businesses, and a lot of that comes through in the pricing. We have a freemium model with a free tier up to 5 users and then 2 other tiers that are $99 a month and $300 a month, and they support up to 50 and 300 users, respectively. So with that top tier, you can get to be as cheap as $1 a month per user, which I think is really beneficial to help the small companies get past the hurdle of paying for an additional product. Tree Schema is really heavily focused on providing a service that is simple to the end user and, to this extent, enables users to sign up for an account and fully populate their data catalog in under 5 minutes.
As far as I know, there's no other product on the market that comes close to being able to set up a data catalog this quickly or easily. In addition, we've also launched a set of APIs recently that allow teams to interact with Tree Schema programmatically. So we're not just tailoring this to the business users or the data scientists. We really think that our data engineer partners are gonna be the 1st class citizens for getting their team into the data catalog.
[00:08:02] Unknown:
In terms of the utility of a data catalog, it's definitely necessary once you get to a certain size where you have multiple different people working with the data, where it's not the same person who's generating the datasets who's also consuming them. But I'm wondering if there's a particular stage of maturity at which point the data catalog isn't a critical component of the infrastructure, where you can get by with just having a conversation across the table or on Slack real quick for being able to answer quick questions about what is this data? What is it being used for? Where does it come from? And sort of what the tipping point is where it does become an absolute necessity to have that as part of your overall data platform?
[00:08:43] Unknown:
I think this is a very good question and something that far too many organizations fail to ask themselves and certainly fail to ask early enough. My personal opinion, and this may be just a little bit biased, is that teams should start to consider a data catalog from the moment that they have data. The main reason that I say this is that I fundamentally believe that your data catalog should support your data culture. And it is really the data culture that is going to allow you to continue to drive long term value from your business by using data. And so as a data catalog is just an enabler for data culture, the sooner you can get it into your ecosystem, the sooner you can start to integrate your culture around the data catalog.
If properly capturing and documenting your metadata is something that you do from day 1, it will be deeply embedded into your data culture. Your teams will include that in their deployment checklist. Analysts will look for self-service first approaches. Sharing knowledge about your data will be a default activity that your team does, and peers will reinforce this behavior as the team grows, helping to quickly implant the shared need in new teammates. Now if you take this from the other perspective, the opportunity cost of not using a data catalog from day 1, what ends up happening almost without exception is that teams inevitably face three challenges. First is that during the time that teams do not have a data catalog, their productivity suffers from an immeasurable number of interruptions. When someone has a question, they almost always go to a trusted source and ask for a knowledge transfer. These interruptions add up over the course of days or weeks, and there are numerous studies showing the negative effects of interruptions on performance and their detrimental effect on the quality of decision making.
Second is that knowledge about data is lost. There are not many people who have the ability to remember every single piece of data lineage and every potential value for every field you have in your different sources or different schemas. Speaking for myself, there have been many times that I needed to go back and research the data lineage for pipelines that I created. And this particular issue is really exemplified when you consider attrition and reorganizations. And third, when an organization grows, it inevitably does approach metadata management, so populating and maintaining a data catalog becomes a secondary activity by the time it finally starts to be implemented.
There are immediate hurdles that teams face when trying to get their catalog up to speed with the current state of their data, the biggest one being the sheer number of data assets that need to be documented. This causes a data catalog to be sparsely populated or lacking overall depth and quality for the data that is documented. And in turn, data users do not leverage the data catalog, which prevents a community from being developed around the metadata, which makes it more difficult to build trust in the data. In the end, overall adoption and usage of data to drive the business fails to flourish.
[00:11:34] Unknown:
In terms of the available options for being able to establish a data catalog, there are a number of different products on the market or strategies involved, where some people might just use the internal company wiki and update it manually, or they might rely on a managed service or a component of a platform that they're already using, whether that's something like Infoworks or Collibra, or they might use an open source platform along the lines of an Amundsen or Metacat or Metamapper. And then there's also, you mentioned data lineage, and there's a whole different set of products that's targeted at that area of things, such as Marquez or DataHub. I'm curious if you can give a bit of an overview as to some of the relative trade offs of the different options and your overall take on the current state of the market for both metadata management and data lineage, and some of the challenges that exist particularly for the scale of company that you're targeting with Tree Schema?
[00:12:33] Unknown:
So I think that Amundsen, DataHub, Metacat, they're all really inspirational products. And you're absolutely right. We're addressing the problem from slightly different perspectives, and I'm looking forward to seeing how they mature. There's definitely places where we take inspiration from them, and we would be very honored as well if at some point they're taking some inspiration from us. And consumers should definitely be excited about the growth of metadata management and data lineage in particular over the next 2 to 3 years. I think it's gonna be just a really explosive space. So there's a couple of things that are unique to Tree Schema. The first, in a word, is simplicity.
And this is gonna be a topic that I'm sort of harping on over and over because, for us, it is the most important factor for startups and small and medium sized businesses. Tree Schema is 100% turnkey. From the moment you sign up, it only takes a few moments, again, to point to your database and extract all of your different metadata into your data catalog. Even if your data sits in a private network or is behind a firewall, we allow you to connect through jump servers to access your data safely. So, historically, for a company to use a data catalog, there has been one single entry point. It's been the data engineer or some other developer with a technical background.
If you look at the absolute best case scenario for implementing a data catalog internally, you're looking at using some form of containerized application that you can deploy, but even the best developed containers crash for some reason or another if they run long enough. So then you need to start looking at setting up your own storage or having external storage. And then you're thinking about how do you deploy that application as well to different environments to be able to test it and maintain it. I'm essentially giving the SaaS sales pitch here. But the point being that even in the best case scenario, it takes time to properly set up a data catalog. Your users will be pinging you about how to use it. Once it is set up, it will eventually fail, and you'll have to spend time to read the source code and understand, you know, why your company's unique usage patterns are causing some issues. The point of all of this, again, is that it takes more time from what's arguably one of the most important positions in the company, the engineer.
So Tree Schema's paradigm breaks this pattern completely. Our product is so simple, we often see data engineers as the first users to sign up because they're the ones who are researching. They've been given the task to bring the data catalog into the company. But then we see sort of this hand off to the data users, where you'll have the analysts or the data scientists who are going to drive the integration and the population of the data. And that ownership and that switch from data engineer to data user really frees up more time for the engineer to get back to doing what drives their business forward. The second feature that is really unique to Tree Schema is our API and, in particular, the Python client that we have developed as a wrapper to the API.
There are definitely services and APIs that other catalogs have. The most popular, I think, that comes to mind for me is Apache Atlas. But what is unique about Tree Schema's client is that you can interact with your data catalog in an object oriented approach, using native Python objects. A lot of time has been spent developing this client to make it as simple as possible, and there are really two features of the Tree Schema Python client that are by far the most popular. The first one is the ability to manage data lineage as code. So data lineage, which we've touched on a couple of times, at its core describes how data moves from one field in a given schema to another field in a different schema. I hope I can do justice to the simplicity of our Python client by describing it here. Effectively, users can create links between fields with only a handful of lines of code. We have several examples of this on our website, even one example that directly integrates with a live Faust streaming app. Our suggestion to users is to have the data lineage script in your CI/CD pipeline since it has been developed to be idempotent.
Again, we continue to hold to this basic principle of simplicity. This is really beneficial because not only do you get to manage your data lineage as code, with our Python client you don't even have to worry about checking or updating the status of your existing data lineage each time you deploy your application. When you deploy your app, you can set the state of your data lineage in Tree Schema to be whatever you want. Whether that has changed or not since your last deployment doesn't matter. The second feature that we see as wildly popular is the ability to define sample values as code. And one of the most critical aspects to a well curated data catalog is human involvement.
There is a lot that Tree Schema and other data catalogs can do to infer on their own about the shape and semantics of your data, but we cannot infer the meaning, not yet at least. So for example, you may have a field called status code in your customer table with the values 1, 2, 3, etcetera. Well, what do those mean? This is often one of the most common questions that data users have: what does this data actually mean? In Tree Schema, we call these sample values or field values. The Python client we have allows you to define your sample values and to capture their meaning in the same place that the data is actually created, the code. This, again, is really powerful because, just like the data lineage, you can capture and manage specific values and their definitions in the code, but share that knowledge more broadly in a structured way within Tree Schema.
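To make the lineage-as-code and sample-values-as-code ideas concrete, here is a minimal sketch of what that kind of workflow can look like. The entry point, method names, and schema names below are illustrative assumptions for the discussion in this episode, not the documented Tree Schema client API, so consult the official client documentation for the real interface.

```python
# Hypothetical sketch of lineage-as-code and sample-values-as-code.
# The client entry point, method names, and schema names are assumptions
# made for illustration; they are not the documented Tree Schema API.
from treeschema import TreeSchema  # assumed entry point

ts = TreeSchema("user@example.com", api_key="...")  # placeholder credentials

# Look up the source and target schemas by name (assumed helpers).
raw_events = ts.data_store("Kafka Prod").schema("user_events.v1")
warehouse = ts.data_store("Redshift").schema("analytics.user_events")

# Declare field-to-field lineage. Because the client is described as
# idempotent, running this in a CI/CD pipeline keeps the catalog's lineage
# in sync with whatever state the code declares on every deploy.
transformation = ts.transformation("Kafka -> Redshift user events")
transformation.set_links([
    (raw_events.field("user_id"), warehouse.field("user_id")),
    (raw_events.field("event_ts"), warehouse.field("event_timestamp")),
])

# Capture sample values and their business meaning next to the code that
# actually produces them, e.g. the status_code example from the episode.
warehouse.field("status_code").set_sample_values({
    "1": "Account active",
    "2": "Account suspended",
    "3": "Account closed",
})
```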
[00:17:56] Unknown:
There are a couple of points that are worth digging into from that. 1 of them, which I think we can touch on a bit later, is the idea of versioning and schema evolution so that if you have an existing set of fields for a particular database table, for instance, and then in the Python client, you push a schema that is completely orthogonal to what was there, being able to have some sort of check to make sure that it wasn't completely an error or that the schemas are at least compatible from an evolutionary sense. And then the other element is a lot of data catalogs are focused on dealing with static metadata where there's the database schema. It's relatively consistent. It might change a little bit here and there, but it's not going to be going through a constant rate of change.
And then there's the other element of streaming data, which again is probably going to have fairly consistent schemas as long as you have your pipeline structured deliberately. But I'm curious in terms of what you have seen as far as the challenges of being able to reconcile things like database metadata and metadata that is in a data lake, for instance, alongside something like data going through Kafka topic or being processed in the example you gave using FAUST, which is a library for being able to operate on Kafka streams?
[00:19:16] Unknown:
So our data catalog primarily focuses on where data sits and where it physically resides. And within Tree Schema, we call that a data store. And so a data store can be anything from Postgres. It could be S3. It could be Kafka. It is that physical underlying place where the data is sitting. So within a data store, you have a schema, and this can represent itself in many different ways. It could be a table if you have a SQL based schema. It could be JSON. It could be Parquet or Avro. It, again, is the semantics and representation of the shape of the data.
Within the schema, you have fields, and then fields have specific values to them. The way that we approach data movement between schemas is going to be what we call transformations. And a transformation is, by itself within Tree Schema, just a container or a shell, if you will. And transformations are holders for transformation links. And that link is how data moves from 1 field in 1 schema to a corresponding field in another schema. This may be done through some sort of SQL process where you have an event based trigger. It could be done through a Faust app where you have a streaming process. It could be a batch job that's doing what we call a lift and shift of moving all of your data from 1 table and dumping it into a data lake or into Redshift.
It is just a semantic representation of how data moves. So Tree Schema itself does not actually do any of the data movement. We really sit outside of the data ecosystem and extract as much metadata as possible. From the very beginning, the question was how do we integrate with different data stores and how do we have this common approach to tabular data, unstructured data, or potentially even data stores that don't really have any sort of structure, whether it's what we consider traditional unstructured JSON or Parquet, or maybe it's an email server and you just get emails and it's just free text.
The way that we wanted to handle this integration with these different data sources was to really create a unified perspective of what a schema means. And the reason we did this at the schema level is because this is generally how users are interacting with data. People refer to tables or they refer to files or locations where files are. And on occasion, they'll go into the specific fields and they'll talk about that. But for the most part, your schema is gonna be a representation of the entity, and that entity is going to drive your business in some way, shape, or form. So, traditionally, you have tables that are going to have no structure outside of what's defined.
They're gonna be flat. But what we do is we treat them under the hood as a JSON structure. We use the same open source JSON standard that is used within Kafka and Kafka Connect. And we leverage that for all of our schemas internally in order to give ourselves that unified perspective of what it means to have a data schema. And so by doing this, we can add just a little bit of additional metadata and context to our schemas and whether it's a flat structure or it's some nested structure and object that has arrays and other sorts of embedded fields, we get this really clean method for being able to do comparisons across different schemas.
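As a rough illustration of the unified representation being described, the Kafka Connect JSON schema convention expresses both flat tables and nested documents with the same vocabulary of struct, array, and primitive field descriptors. The exact structure Tree Schema stores internally isn't spelled out in the episode, so the examples below are an assumption based on the Connect convention, with made-up table and field names.

```python
# Rough illustration of the Kafka Connect-style JSON schema vocabulary
# mentioned above. The table and field names are made up; how Tree Schema
# stores these internally is an assumption based on the Connect convention.
flat_customers_table = {
    "type": "struct",
    "name": "public.customers",
    "fields": [
        {"field": "customer_id", "type": "int64", "optional": False},
        {"field": "status_code", "type": "int32", "optional": True},
        {"field": "created_at", "type": "string", "optional": False},
    ],
}

# A nested document, such as a JSON event on a Kafka topic, uses the same
# vocabulary with struct and array types embedded inside the fields, which
# is what makes comparisons across flat and nested schemas uniform.
nested_order_event = {
    "type": "struct",
    "name": "orders.order_placed",
    "fields": [
        {"field": "order_id", "type": "string", "optional": False},
        {
            "field": "line_items",
            "type": "array",
            "optional": False,
            "items": {
                "type": "struct",
                "fields": [
                    {"field": "sku", "type": "string", "optional": False},
                    {"field": "quantity", "type": "int32", "optional": False},
                ],
            },
        },
    ],
}
```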
[00:22:53] Unknown:
And I'm wondering if you can dig a bit more into how the Tree Schema system itself is actually implemented and some of the internal architecture and the ways that the system has evolved since you first began working on it and began onboarding more users?
[00:23:09] Unknown:
Sure. So I'll break this down sort of into application and databases and then the deployment side as well. On the application side, we are primarily a Python shop. And with that, we use Django to serve the app. So Django works really well for us because it enables speed to market. We can quickly test and ship new features, and it has excellent integration with Postgres and Redis, which are, you know, two of my personal favorite databases. I'm a huge Redis fan, so you'll see that pop up quite a few times in this overview. In addition to using Redis as our Django cache, we also use it in the background with our Celery app integration because there's a lot that we cannot process synchronously.
For example, when you point Tree Schema to your database, we will extract all the metadata that exists in that database. When you do this, you have the option to allow Tree Schema to capture sample values. And what this means is that for every field in your database, we'll capture somewhere between 10 and 20 unique values. And for those of you thinking about it, no, we're not gonna bog down your database with a full table scan. So even if you have a massively large table, the impact should be relatively small. Nonetheless, it's not uncommon for an organization to have thousands of tables in Redshift, as an example.
And executing these queries on a few hundred to a few thousand tables can take some time. So that's where this async process runs on Celery. There are two other major components of the app that we use Redis for. First is data discovery. Users can search their entire data catalog from a single simple but powerful search. The Redis full text search is used here. I'll quickly touch on why this decision was made versus, say, Elasticsearch, since I believe that product is being used in some of the open source competitors. I love Elasticsearch, and it's also one of my favorite databases. But for us, the queries we're submitting for this text search are rather simple.
Elasticsearch really shines if you have a complex query such as "get me all the results for data lineage where catalog is not within 10 words and it occurred within the past 2 weeks." Given that we currently have simple queries and we're already using Redis, we decided to go with the full text search built into Redis to reduce the overall infrastructure complexity. The second place that we use Redis is going to be RedisGraph, and we use this as our graph database to query for data lineage. One of the downsides to this is that querying the data analytically to understand user behavior is rather difficult. So to get around this, we actually persist all of our links within the transformations.
Those are the fields that connect from one schema to another. We persist all of them in Postgres, and we check every so often that the two are in sync. On the front end side, there's a little bit of JavaScript sprinkled in there to really make things pop. But for the most part, Django templates are used to determine the content layout. Last but not least, NGINX is used as a reverse proxy to route traffic, and that's pretty much it on the application side. For deployments, we are serverless wherever possible. So the Django app and NGINX are deployed together via ECS in a single task definition.
It essentially provides similar network routing to what Docker provides. From time to time, I talk to devs who are not as familiar with this ECS feature. In short, you can just use localhost and the corresponding port, and you can route traffic between containers that are coupled and deployed this way. It's one of my favorite ECS features, and I think it's just a fantastic way to deploy services. There are a few other long running services that we have running on ECS, one of which is our neural network that predicts whether or not a data asset is considered personally identifiable information, and it automatically tags the corresponding asset.
Our static content for the app is served via CloudFront. One of the great things, again, about Django is that it has such an incredible community built around it. And one of the packages, I'm forgetting the specific name right now, but it enables you to change the static file source host to a CDN. And by doing this, we actually offloaded so much processing from the Django app to CloudFront that it allowed us to reduce the number of ECS containers running by nearly a quarter. And so that's just been, like, a really great feature, I think, that we've been able to roll out because of the community built around Django. All of our internal microservices are deployed as Lambdas, as are the external facing REST APIs, with the exception that they also leverage the API Gateway. And, of course, we leverage RDS and ElastiCache for our databases, respectively.
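The package described here sounds like django-storages, which is listed in the links for this episode. For reference, a minimal sketch of that kind of CDN offload looks roughly like the settings below; the bucket and CloudFront domain are placeholders, and this is one common configuration rather than Tree Schema's actual settings.

```python
# settings.py -- minimal sketch of offloading Django static files to S3 +
# CloudFront with django-storages. Bucket and domain names are placeholders;
# this illustrates the general approach, not Tree Schema's actual config.
INSTALLED_APPS = [
    # ... existing apps ...
    "storages",
]

# Store collected static files in S3 and serve them through CloudFront.
# (Django 4.2+ prefers the STORAGES dict; this older setting still works there.)
STATICFILES_STORAGE = "storages.backends.s3boto3.S3Boto3Storage"
AWS_STORAGE_BUCKET_NAME = "example-static-assets"        # placeholder bucket
AWS_S3_CUSTOM_DOMAIN = "dxxxxxxxxxxxx.cloudfront.net"    # CloudFront domain
AWS_QUERYSTRING_AUTH = False  # emit plain, cacheable URLs without signatures

# Templates using {% static %} now emit CDN URLs, so the Django containers
# on ECS never have to serve static assets themselves.
STATIC_URL = f"https://{AWS_S3_CUSTOM_DOMAIN}/static/"
```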
[00:27:36] Unknown:
In terms of the overall approach or the overall goals of the product, I'm curious if there are any initial assumptions that you had going into this or ideas as to what you thought were going to be the needs of your end users that have had to be updated or changed or a new direction decided upon once you created the initial launch and started bringing people onto the platform?
[00:28:00] Unknown:
I think, by far, the most innovative way that I've seen Tree Schema used by a client has been that the client went ahead and developed this utility, effectively a learning plan, on top of Tree Schema. And so as they're onboarding new users, in particular data scientists or analysts, this can be extremely burdensome for teams because those new users need to have so much knowledge about the data before they can get up to speed and really provide value. And you see companies put together traditional learning plans that consist of a buddy or, you know, some tables that they need to interact with, right? An introductory problem to force this person to learn the relationships and cardinality of the data and so forth. And so this client was leveraging asset tagging in order to curate a step by step learning plan for each of their three data teams.
When their new users logged in to Tree Schema for the first time, they could just search for tags by their name, and these tags might be marketing learning plan or operations learning plan or whatnot. The person who had this idea, when I was speaking to them on the phone, they walked me through it on a screen share, and I nearly made them an offer to come and be a product manager at Tree Schema on the phone. That was really an interesting and intriguing way, I think, to see the product being used and something that taught me about what consumers are really looking for in this space to drive value on top of just capturing their metadata and cataloging their data. There have been some unexpected uses as well. I've seen one company use it to save a large number of emails with attachments as a way for their entire team to quickly find historical metadata that a vendor sent to them. A little bit of context: we have this catchall, quote, unquote, data store called other, and we recommend that it be used when you wanna capture metadata from a database that we haven't integrated with. So going back to this company, they have hundreds of these mainframe files that have been emailed to them by their mainframe partner. They want to process them into the data catalog and have them saved so that they can access them and search for them when they need to. On top of that, they wanted to create a very basic data lineage.
And what they did for each of these files was to just add a single field and then link the files together using this single field. A little bit different, I think, than what we were expecting, but I think for them it works, and what we're proud about is seeing people use this product however they feel best and getting the value that they need.
[00:30:23] Unknown:
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours or days. DataFold helps data teams gain visibility and confidence in the quality of their analytical data through data profiling, column level lineage, and intelligent anomaly detection. DataFold also helps automate regression testing of ETL code with its data diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. DataFold integrates with all major data warehouses as well as frameworks such as Airflow and dbt and seamlessly plugs into CI workflows.
Go to dataengineeringpodcast.com/datafold today to start a 30 day trial of DataFold. Once you sign up and create an alert in DataFold for your company data, they'll send you a cool water flask. In terms of the overall design of the product and how you approached the interfaces and the APIs and the structures that were available for being able to record metadata, I'm wondering what your guiding heuristic was for being able to figure out how it would be presented to the end user and the necessary fields and information that would need to be captured?
[00:31:46] Unknown:
Sure. And just to, again, recap what the major entities are within Tree Schema: there's data stores, schemas, fields, and transformations. Those are really the big four. And when you look at Tree Schema and you're interacting with each of these, they all follow a very similar layout. And the reason for that goes back to simplicity. We want the layout of Tree Schema to be simple. We want it to be repeatable, something that people can understand easily, and this was an area that we spent a lot of time on even before we began development: looking at what consumers need to have in a data catalog.
What are the ways that they need to extract information, the things they need to capture? And, really, we spent a lot of time thinking about that user experience, and that drove how we were able to design the underlying tables and, really, the models within Django that are gonna be supporting those tables. And so there were two things that we wanted to achieve. One was this repeatable and consistent interaction for the user. And the second one was the ability to have a flexible system that could handle any different type of data store, schema, and fields. So I touched on the common layout a little bit. In order to solve the second point, the flexibility to have different data types and data stores, what we did was we leveraged Postgres' built in JSON data type.
And to give an example of how this plays out, if we look at just two of the data stores that Tree Schema supports, Postgres and DynamoDB, they each have attributes that are unique. Postgres has a host and a password, whereas Dynamo has AWS keys and a region. Nearly everything else about that database can be summarized with a common set of attributes, and therefore we place those common values into columns. Anything that is unique to this particular database is stored within JSON. I think it just sort of played out that we don't really run any queries against those unique attributes, so the indexing and performance impacts of directly querying on the JSON fields are not really big. And in addition, we get further benefits because, since we have such a similar layout between data stores and schemas and fields, we can take a lot of those same sets of columns and apply them to the different tables. So we're talking about the description. We have tags, comments.
We have a name, a type. All of these are relatively ubiquitous across the different entities that we have. And so we're using the Django models under the hood as an abstract class to really define the implementations of the tables that we're gonna have. And so all of these major entities inherit from this same class.
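A minimal sketch of this modeling pattern, assuming Django models: shared catalog attributes live on an abstract base class, and anything unique to a particular data store sits in a Postgres JSON column. The model and field names here are illustrative guesses, not Tree Schema's actual schema.

```python
# Minimal sketch of the pattern described above: common catalog attributes on
# an abstract base model, store-specific attributes in a JSON column. The
# model and field names are illustrative guesses, not Tree Schema's models.
from django.db import models


class CatalogAsset(models.Model):
    """Columns shared by data stores, schemas, fields, and transformations."""
    name = models.CharField(max_length=255)
    type = models.CharField(max_length=64)
    description = models.TextField(blank=True)
    created_at = models.DateTimeField(auto_now_add=True)

    class Meta:
        abstract = True  # no table of its own; concrete models inherit the columns


class DataStore(CatalogAsset):
    # Attributes that differ by engine (host/password for Postgres, AWS keys
    # and region for DynamoDB, ...) live in JSON, since they are never queried
    # directly and dedicated columns would buy little.
    connection_details = models.JSONField(default=dict)


class Schema(CatalogAsset):
    data_store = models.ForeignKey(DataStore, on_delete=models.CASCADE)
    # Store-specific schema attributes (e.g. topic config, file format) can
    # follow the same JSON pattern.
    extra_attributes = models.JSONField(default=dict, blank=True)
```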
[00:34:32] Unknown:
As far as the flexibility of that structure and in particular being able to track rich information about the data lineage and the transformations that are being performed, I'm curious what the options are, particularly for being able to do things like integrate with a workflow manager along the lines of a Dagster or an Airflow.
[00:34:51] Unknown:
I think that this is an excellent idea. It's not something that we are currently doing today, and it's an area, I think, that we've talked a little bit about, but it really hasn't been prioritized quite yet. I've seen a couple of other products out there. I think this is something that Amundsen does, and they do it quite well. So I think we're certainly gonna look to take some inspiration from them and, more so, talk to the people who are using those products to see, like, what is it that they're enjoying most about it. You know, really trying to understand what the user experience is and the problem that consumers need to solve and that it solves for them.
[00:35:25] Unknown:
Continuing on with discussing some of the capabilities of things like Amundsen, another element of that is being able to track the popularity of a given set of data, where if you're searching for a particular field, you might be able to find 5 different tables that have relevant information. But based on the previous search activity or user contributed information, you can actually say that this particular table is the one that is most actively used or is most up to date. I'm curious if you have any capabilities like that built into Tree Schema.
[00:36:02] Unknown:
Yeah, we do. So our full catalog search is gonna have pretty good coverage of that. We have a ranking mechanism that I think we're looking to improve. It's not the best at the moment, but it does have some pretty naive ways to be able to rank data assets that are returned when you search, again, based off of others who have used the product as well. We have a couple of other ways that we approach this in addition. One, we call it power users. So as users are interacting with Tree Schema and they're interacting with the data, you can see who the power users are, who are using this data most frequently.
And this is a great way because you can look at any particular data asset and see who the power users are that are actually leveraging this data. And that's just, like, a really quick way to understand who you can go to if you have some elevated question that you can't get answered directly from the data catalog. But there's also this other set of people who may be hidden users as well. They may be experts, but they don't necessarily use the data. Maybe they don't look at the documentation. Maybe they created the table or the pipeline some time ago, and the relevance of their interactions has just dropped off in the background because they worked with it so long ago and it's not really fresh anymore. And so we give the ability for users to volunteer as experts.
And so, again, within any single data asset, you can volunteer yourself or remove yourself as an expert, which is just a really great way again for people to say, you know, I don't necessarily use this, but I know a lot about it. You can ask me questions if there's anything you'd like to know. We also promote this information about who uses what assets in the data catalog within the teammate section. So you can go and effectively shadow what your other teammates are doing. You can see what schemas they're looking at, what fields, what data stores. We're gonna be trying to enhance this a little bit in the next few quarters to show not only what they are using, but to give more specifics around how they are using it, in particular with some of the SQL based products that we support.
[00:38:04] Unknown:
Going back to the earlier point of versioning and evolution of schema information, I'm curious what your capabilities are as far as being able to identify if there's a potential conflict in the evolution of a set of schemas, such as you're changing a column in a database from a text field to a Boolean field or something, or maybe from a Boolean to a float. And then also just being able to track those changes so that if somebody's looking at a table that maybe they had worked with in the past and it's gone through some evolution, and now they're looking to see what its current state is, and see what it was at the time that they were using it before, what the intermediate steps were, and what it is now, are you able to surface that information?
[00:38:51] Unknown:
So currently, what we offer today is we raise what's called a governance alert when there's a breaking change that has occurred. And you essentially just, you know, hit the nail on the head: whenever you remove a field or you're going to be changing the data type of a field to one that's no longer compatible, we raise that information so that a data steward can take action and they can do something with that. And we have a whole range of different governance actions that are effectively used within Tree Schema to keep your catalog up to date and to make sure that it is remaining fresh. In terms of schema management in particular, this is an area that we're actively looking at, and the way that we're thinking about how to solve the problem really, again, needs to be comprehensive for our customers. So when I think about schema versioning, it's not just the schema that we need to version; we also need to think about what are the transformations that are impacted as well by this. Because if you're changing the schema, you also have the potential to be changing a transformation.
And so as we think about the implicit relationships that we have between schemas and fields and transformations in particular, we really need to make sure that we have a comprehensive way to provide versioning across both of those entities, schemas and transformations. There are questions that we're looking for feedback on from users, such as: if a schema changes and it impacts a transformation, should the transformation even be updated? Should schema versions automatically be updated, or should it prompt a data steward to review when the changes occur? There's another feature that we launched recently where Tree Schema can be scheduled to automatically sync itself with your database on a set cadence, effectively every day or once a week. You can have Tree Schema make sure that it has the most recent representation of your data. And so how this plays along with schema versioning is something that's gonna have a really big impact on the end solution as well. I would say be on the lookout for something in early Q1 on this. It's still in early discussions, but I think it's something that we're really excited about as well.
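To illustrate the kind of breaking-change detection that drives a governance alert, here is a small, simplified sketch that flags removed fields and incompatible type changes between two versions of a schema. The compatibility rules and data structures are assumptions made for the example, not Tree Schema's implementation.

```python
# Simplified sketch of breaking-change detection between two schema versions:
# flag removed fields and incompatible type changes. The compatibility rules
# here are illustrative assumptions, not Tree Schema's actual logic.

# Type changes treated as safe widenings in this example.
COMPATIBLE_WIDENINGS = {
    ("int", "float"),
    ("int", "string"),
    ("float", "string"),
}


def breaking_changes(old_fields: dict[str, str], new_fields: dict[str, str]) -> list[str]:
    """Return alerts for removed fields or incompatible type changes.

    Inputs map field name -> type name for the old and new schema versions.
    """
    alerts = []
    for name, old_type in old_fields.items():
        if name not in new_fields:
            alerts.append(f"Field '{name}' was removed")
            continue
        new_type = new_fields[name]
        if new_type != old_type and (old_type, new_type) not in COMPATIBLE_WIDENINGS:
            alerts.append(f"Field '{name}' changed from {old_type} to {new_type}")
    return alerts


# Example: a text column changed to boolean and a dropped column both raise
# alerts that a data steward could then review.
print(breaking_changes(
    {"status": "string", "legacy_flag": "int"},
    {"status": "bool"},
))
# -> ["Field 'status' changed from string to bool", "Field 'legacy_flag' was removed"]
```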
[00:40:48] Unknown:
And then another aspect of what you're building at Tree Schema: you mentioned that you're focused on small to medium sized organizations. And I'm wondering what you see as being the overall limitations of scale, either in terms of the technical capacity or the cognitive complexity that it can handle, and when somebody might want to go to something else like a DataHub that is more architecturally complex, but possibly more flexible?
[00:41:15] Unknown:
I don't know what the true limitations are off the top of my head. You know, given that our deployment is serverless with the exception of the databases, I think we have pretty high technical capacity on the app side. The bottleneck being, you know, with the databases; that's potentially one area where we could run into some constraints before we need to think about different data structures and, as you mentioned, have maybe a slightly different architecture in the way that we're persisting and accessing our data. I think we have a really long way to go before we're even running into that. And we have, like, a really great read to write ratio in our favor, in that we have significantly more reads, to the tune of, I think, 15 to 1 reads to writes currently. And so we can continue to add more read hosts or continue to scale in that way for a while.
If you consider every level of detail that Tree Schema captures as a data asset, the data sources, schemas, fields, especially the sample values for each field, the tags, every data lineage link, then we have some clients that have several million assets in our catalog, and we're consistently monitoring latency throughout our ecosystem to try and make sure that our users have the best possible experience. The number one reason right now that we have response times greater than 1 second is actually the Lambda cold start, and that's a problem that I'm okay with at the moment. In terms of when somebody should think about a different data catalog, or potentially having something that's architected more for their specific needs: I think once you start to need to interact with Tree Schema for support on maybe a weekly cadence, or you need to continue to suggest features because your team has to have some particular capability within the product, that's when Tree Schema may not necessarily be the right product. If you're looking at potentially having a DataOps solution where you've got this really great product that can pull data from your data sources, create virtualization layers, and run dashboards all within one place, that's also a place, I think, where Tree Schema is just not gonna provide as much value, because you generally get this really great side benefit from DataOps in that they already extract and sort of capture that metadata, both about the data itself as well as data lineage.
[00:43:29] Unknown:
And in your experience of building Tree Schema and providing it as a service and trying to grow a business around it, what have you found to be some of the most interesting or unexpected or challenging lessons that you've learned in that process?
[00:43:42] Unknown:
Okay. I love this question. It's like this brutally honest, face yourself in the mirror question. And frankly, it's something that I spend a lot of time on. Promoting a data catalog is tough. There are a lot of companies out there that are just starting. There are larger, well funded companies that are targeting enterprise customers, but their weight really draws organic traffic to them. The incumbents have completely different sales cycles than we do, and they can easily justify bidding up cost per click advertising. Not to mention, a data catalog isn't exactly the kind of new product you wake up in the morning excited to read about.
So, yeah, it's tough. In the same vein, though, it has led to new opportunities. It's caused us to think about new ways of engaging with prospective customers. And this engagement has been one of the most exciting yet unexpected lessons about building Tree Schema. The focus for us has really been on one-on-one connections with individuals. When you talk to the data engineers or to the data scientists in the company that are creating or using data on a daily basis, Tree Schema resonates with them. In addition, if you have these really raw and unfiltered dialogues with prospective customers about what exactly they would need from a product in order to purchase it, then you get this really fantastic set of inputs to your prioritization and to your backlog. And that's just, like, a really rewarding thing to see and to feel, especially when you see your product being aligned and heading in the direction that solves a problem that they have. You know, growing up, my father was a salesman, and I always told myself, no matter what, I'm not gonna go into sales. And now I've become this sort of digital salesman.
So I guess that's just karma. But, yeah, I would say support a small business, try Tree Schema. We haven't taken any sort of funding. So we would love to, again, help small businesses and support them as well.
[00:45:24] Unknown:
What are the cases where Tree Schema is the wrong choice and somebody would be better served with a different style of data catalog, or the integration requirements that they have don't match with what is possible with Tree Schema?
[00:45:38] Unknown:
I think that Tree Schema is not the right choice if you're looking for a DataOps platform. There are a lot of great products that have come out recently that enable users to pump data into a single system, again, for ELT or ETL processes, where you have your data analysis, visualization, access controls, and much more. I think that these terms, DataOps and data catalog, end up being conflated just because there are so many companies that do both. Tree Schema is a data catalog in the truest sense. It sits outside of your applications and tries to just read the metadata that is already being created in the most lightweight way that is possible.
[00:46:19] Unknown:
As far as the future of the product, what are some of the new features or capabilities or new integrations that you're looking to release in the coming months and years?
[00:46:30] Unknown:
So first, I think the biggest one that I'm looking forward to releasing is gonna be our API enhancements to analyze breaking changes. I wanna give developers tools that allow them to recursively check all downstream impacts to a field when it's updated or removed from a schema, as you mentioned earlier. There's a whole list of questions that we want to help developers answer so that they can have confidence in their changes before moving them to production. I think that everyone has run into the problem where some ELT job was updated and it impacted a dashboard two or three steps removed from the actual change. There's no reason this should ever happen, and we're gonna solve this with our API and build it in a way that data engineers can use in their pre-deployment checklist. We're taking a lot of inspiration from how Avro works with regard to schema compatibility and applying it to data lineage.
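The recursive downstream-impact check described here is essentially a traversal of the lineage graph. Below is a small sketch under the assumption that lineage links are available as an adjacency map of field to downstream fields; the actual shape of the forthcoming API isn't covered in the episode.

```python
# Sketch of a recursive downstream-impact check over lineage links, assuming
# lineage is available as an adjacency map of field -> downstream fields.
# The forthcoming Tree Schema API's actual shape is not described here.
from collections import deque

# field -> fields it feeds, e.g. assembled from transformation links
LINEAGE = {
    "raw.orders.amount": ["warehouse.orders.amount"],
    "warehouse.orders.amount": ["reporting.daily_revenue.total"],
    "reporting.daily_revenue.total": ["dashboard.revenue_widget.total"],
}


def downstream_impacts(field: str, lineage: dict[str, list[str]]) -> set[str]:
    """Breadth-first walk of everything downstream of a changed field."""
    impacted: set[str] = set()
    queue = deque([field])
    while queue:
        for child in lineage.get(queue.popleft(), []):
            if child not in impacted:
                impacted.add(child)
                queue.append(child)
    return impacted


# A pre-deployment check could fail the build if a field slated for removal
# still feeds a dashboard two or three hops away from the actual change.
print(downstream_impacts("raw.orders.amount", LINEAGE))
# -> {'warehouse.orders.amount', 'reporting.daily_revenue.total',
#     'dashboard.revenue_widget.total'}
```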
Second, we're building deeper integrations into the visualization tools and data usage in general. We're probably gonna start with some of the big ones such as Tableau and Looker, but we really want to extend this into other dashboarding products as well. And when you think about the personas that exist in the data catalog, there are effectively four. One, there's the data creator. This is your engineer or developer. Two, you have a data superuser. This person is gonna be comfortable going directly to the data sources, figuring out problems on their own, probably a data scientist or a data analyst. Three, you have a data non-superuser, and this is gonna be somebody who is really just using the dashboards, maybe doing some basic SQL, but doesn't really have the ability to figure things out on their own. And then four, you have the executive level leadership.
So the integration supports this nontechnical user and some of the more basic use cases for super users. But we're thinking about how do we bring last mile visibility, not just to the non super users, but to your data scientists and data analysts and other super users as well. You know, data scientists in particular tend to have relatively complex workflows and pipelines that exist solely within their models. And having that level of visibility all the way from the source, not just to when a data scientist picks up the data and generates features, but through the decision and the creation of that probabilistic output is a really valuable thing in particular when you have governance or regulation that needs to monitor that information.
So I don't have specifics quite yet on what that's gonna look like but it's something that we're thinking about. 3rd, as we continue to push for simpler and easier integrations with Tree Schema, especially for our data engineers, We're gonna be building now a singer. Io target integration. So, again, this quest is meant to just make Tree schema the easiest way to get your data catalog populated, and singer. Io already has a lot of great connections to existing data source. Their taps are relatively exhaustive. And we're just trying to figure out what's the right way to do this for the tree schema integration.
And, of course, 4th, we touched on this a little bit earlier, is around schema and transformation versioning and how do we enable people to get visibility into their changes and life cycles over time.
[00:49:40] Unknown:
Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I would just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:49:55] Unknown:
Yeah. I think that there needs to be standardization across the format and method for collecting data lineage metadata. I think there eventually needs to be some form of REST API or JSON object or something similar because it would need to be ubiquitous across databases, ETL or ELT tools, and dashboarding products. There's a lot of data catalog tools out there, and the lack of centralization in particular around data lineage, I think, really hurts consumers because it prevents data lineage from being able to grow more quickly as a general capability.
I think Apache Atlas probably has the closest thing to what this standard would be, and I think that the larger community as a whole should really have a conversation around how to bring a structure such as this to the problem on a broader scale.
[00:50:45] Unknown:
Well, thank you very much for taking the time today to join me and discuss the work that you've been doing with Tree Schema. It's definitely a very interesting product and an important problem area. So I appreciate all of the time and effort you've put into that, and I hope you have a good rest of your day. Pleasure was all mine. Thank you, Tobias. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction to Grant Seward and Tree Schema
Motivation Behind Tree Schema
When to Implement a Data Catalog
Overview of Metadata Management Options
Tree Schema's Approach to Data Integration
Internal Architecture of Tree Schema
Unexpected Uses and Lessons Learned
Designing User Interfaces and APIs
Integrating with Workflow Managers
Handling Schema Evolution and Versioning
Challenges and Lessons in Building Tree Schema
Future Features and Integrations
Biggest Gaps in Data Management Tooling