Summary
A data catalog is a critical piece of infrastructure for any organization that wants to build analytics products, whether internal or external. While there are a number of platforms available for building that catalog, many of them are either difficult to deploy and integrate, or expensive to use at scale. In this episode Grant Seward explains how he built Tree Schema to be an easy to use and cost effective option for organizations to build their data catalogs. He also shares the internal architecture, how he approached the design to make it accessible and easy to use, and how it auto-discovers the schemas and metadata for your source systems.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Follow go.datafold.com/dataengineeringpodcast to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.
- Are you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy data analytics in the cloud. Our comprehensive data-level security, auditing and de-identification features eliminate the need for time-consuming manual processes and our focus on data and compliance team collaboration empowers you to deliver quick and valuable data analytics on the most sensitive data to unlock the full potential of your cloud data platforms. Learn how we streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta.
- Your host is Tobias Macey and today I’m interviewing Grant Seward about Tree Schema, a human friendly data catalog
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by giving an overview of what you have built at Tree Schema?
- What was your motivation for creating it?
- At what stage of maturity should a team or organization consider a data catalog to be a necessary component in their data platform?
- There are a large and growing number of projects and products designed to provide a data catalog, with each of them addressing the problem in a slightly different way. What are the necessary elements for a data catalog?
- How does Tree Schema compare to the available options? (e.g. Amundsen, Company Wiki, Metacat, Metamapper, etc.)
- How is the Tree Schema system implemented?
- How has the design or direction of Tree Schema evolved since you first began working on it?
- How did you approach the schema definitions for defining entities?
- What was your guiding heuristic for determining how to design the interface and data models?
- How do you handle integrating with data sources?
- In addition to storing schema information you allow users to store information about the transformations being performed. How is that represented?
- How can users populate information about their transformations in an automated fashion?
- How do you approach evolution and versioning of schema information?
- What are the scaling limitations of Tree Schema, whether in terms of the technical or cognitive complexity that it can handle?
- What are some of the most interesting, innovative, or unexpected ways that you have seen Tree Schema being used?
- What have you found to be the most interesting, unexpected, or challenging lessons learned in the process of building and promoting Tree Schema?
- When is Tree Schema the wrong choice?
- What do you have planned for the future of the product?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- Tree Schema
- Tree Schema – Data Lineage as Code
- Capital One
- Walmart Labs
- Data Catalog
- Data Discovery
- Amundsen
- Metacat
- Marquez
- Metamapper
- Infoworks
- Collibra
- Faust
- Django
- PostgreSQL
- Redis
- Celery
- Amazon ECS (Elastic Container Service)
- Django Storages
- Dagster
- Airflow
- DataHub
- Avro
- Singer
- Apache Atlas
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
Are you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy analytics in the cloud. Their comprehensive data level security, auditing, and deidentification features eliminate the need for time consuming manual processes, and their focus on data and compliance team collaboration empowers you to deliver quick and valuable analytics on the most sensitive data to unlock the full potential of your cloud data platforms.
Learn how they streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta. That's I-M-M-U-T-A. Your host is Tobias Macey. And today, I'm interviewing Grant Seward about Tree Schema, a human friendly data catalog. So Grant, can you start by introducing yourself?
[00:01:54] Unknown:
Hello. My name is Grant. I come from a relatively general data background. I worked at Capital One in the early part of my career developing data products. I got into the data science space at startups as well as at Walmart Labs and eventually ended up leading a data engineering team, most recently to build out a digital bank. I don't proclaim to have an overly technical background, and I didn't study engineering in school or really even start to code until I was 23 or 24. You know, I just like building data products, and I like to create value from data and to show its efficacy. I really just enjoy using data in general.
[00:02:26] Unknown:
Do you remember how you first got involved in data management?
[00:02:29] Unknown:
So it was really out of frustration, to be honest. I think that if you had even asked me a few years ago if I would ever see myself in metadata management, I would have said absolutely not. I started off the first couple of years, again, at Capital One. And just to provide some context, Capital One does data very well, and they have well managed data pipelines. They have clean and consistent data, data stewards who are really deep technically and have good business knowledge, and they generally just have a really solid data culture. And so that data culture and expectation of how to treat data, curate it, and extract value from it was really ingrained in me from an early point in my career. It wasn't until after I left Capital One that I started to understand how bad the proliferation of poor data management was. I'm talking about the most fundamental aspects of quality data management: standardization, documentation, ownership, discovery, access management, and the like.
Then this applied to pretty much every company I went to after Capital One. To give an example, one of the companies I worked at was a startup that had maybe 50 developers and a dozen or so data scientists. There was absolutely no documentation for the data. The data scientists would send each other Jupyter Notebooks or just point to a link in GitHub and essentially say, check out how this field was used in the past. Now fast forward a couple of years, I'm working at another startup, and I'm responsible for building out the entire data ecosystem. In the back of my mind, I'm thinking to myself, in order for us to have a sustainable advantage with our data, we need to properly manage our metadata: good documentation, easily tracked data lineage, clear and direct visibility into who the owners are, etcetera.
All of this is part of, again, building a strong data culture. So one day, I'm tasked with adding a new field to a database, a pretty mundane task all in all. As many of the listeners may know, adding a new field to the database often means updating the source that captures the data or creates the data. So I started to track that down. The full lineage for that from start to finish ended up looking something like: receive a file from a vendor, convert the file from fixed width to Parquet. This particular company operated in Latin America, so we did a translation from Spanish to English, and then finally, we saved that file into the database. It took me over two hours to fully backtrack and identify all the touch points that needed to be changed here. And the worst part was that I had created that pipeline the year prior. So at this point, I knew that there had to be a better way for us to manage our metadata and to track our data lineage.
So I started to look for products on the market, and I couldn't find anything that met what I was looking for. All the SaaS data catalogs were bundled with a bunch of other features that either I didn't want or I didn't need, and those other features really drove the cost up much higher than what the startup could afford. On the flip side, even though the current open source products that we have today had not yet been released, I didn't want to maintain a data catalog, and I didn't want it to be something that my team had to spend time to set up or maintain, even with some sort of Dockerized or containerized service. I just wanted metadata management, and I wanted to use it as a service. After some time, I gave up on the search. And given how clear this problem was to me and how common it seemed to be, I decided to start Tree Schema.
[00:05:37] Unknown:
And so you've given a pretty good background as to what led you down the road of building Tree Schema, but I'm wondering if you can give a bit more detail on what it is that you've got working there and some more of the motivation behind actually turning it into a business versus just having it be an internal product that you used with your team at that startup?
[00:05:57] Unknown:
Yeah. Sure. So I think the motivation around turning it into a business instead of just an internal product is that this problem really seemed to be persistent across many of the companies that my colleagues worked at. And so a lot of the folks that I have worked with in the past had been at startups, and it's very common again for people at startups to go to other startups. And I ended up talking to maybe a dozen or so of my colleagues who worked at unique companies, and everywhere that they worked, this was a problem. In one shape or form or another, it was just difficult for companies early on to have good, solid metadata management practices. And so Tree Schema is a data catalog that makes the essential metadata management capabilities available to everyone. This includes catalog basics such as data discovery, rich text documentation, assigning owners to your datasets, being able to have conversations about your data with your data, effectively everything you would consider table stakes for a data catalog.
We have really positioned Tree Schema to be the premier data catalog for startups and small and medium sized businesses, and a lot of that comes through in the pricing. We have a freemium model with a free tier up to 5 users and then 2 other tiers that are $99 a month and $300 a month, and they support up to 50 and 300 users, respectively. So with that top tier, you can get to be as cheap as $1 a month per user, which I think is really beneficial to help the small companies get past the hurdle of paying for an additional product. Tree Schema is really heavily focused on providing a service that is simple to the end user and, to this extent, enables users to sign up for an account and fully populate their data catalog in under 5 minutes.
As far as I know, there's no other product on the market that comes close to being able to set up a data catalog this quickly or easily. In addition, we've also launched a set of APIs recently that allow teams to interact with Tree Schema programmatically. So we're not just tailoring this to the business users or the data scientists. We really think that our data engineer partners are gonna be the 1st class citizens for getting their team into the data catalog.
[00:08:02] Unknown:
In terms of the utility of a data catalog, it's definitely necessary once you get to a certain size where you have multiple different people working with the data, where it's not the same person who's generating the datasets who's also consuming them. But I'm wondering if there's a particular stage of maturity at which point the data catalog isn't a critical component of the infrastructure, where you can get by with just having a conversation across the table or on Slack real quick for being able to answer quick questions about what is this data? What is it being used for? Where does it come from? And sort of what the tipping point is where it does become an absolute necessity to have that as part of your overall data platform?
[00:08:43] Unknown:
I think this is a very good question and something that far too many organizations fail to ask themselves and certainly fail to ask early enough. My personal opinion, and this may be just a little bit biased, is that teams should start to consider a data catalog from the moment that they have data. The main reason that I say this is that I fundamentally believe that your data catalog should support your data culture. And it is really the data culture that is going to allow you to continue to drive long term value from your business by using data. And so as a data catalog is just an enabler for data culture, the sooner you can get it into your ecosystem, the sooner you can start to integrate your culture around the data catalog.
If properly capturing and documenting your metadata is something that you do from day 1, it will be deeply embedded into your data culture. Your teams will include that in their deployment checklist. Analysts will look for self-service first approaches. Sharing knowledge about your data will be a default activity that your team does, and peers will reinforce this behavior as the team grows, helping to quickly implant the shared need in new teammates. Now if you take this from the other perspective, the opportunity cost of not using a data catalog from day 1, what ends up happening almost without exception is that teams inevitably face three challenges. First is that during the time that teams do not have a data catalog, their productivity suffers from an immeasurable number of interruptions. When someone has a question, they almost always go to a trusted source and ask for a knowledge transfer. These interruptions add up over the course of days or weeks, and there are numerous studies showing the negative effects of interruptions on performance and their detrimental effect on the quality of decision making.
Second is that knowledge about data is lost. There are not many people who have the ability to remember every single piece of data lineage and every potential value for every field you have in your different sources or different schemas. Speaking for myself, there have been many times that I needed to go back and research the data lineage for pipelines that I created. And this particular issue is really exemplified when you consider attrition and reorganizations. And third, when an organization grows, it inevitably does approach metadata management, so populating and maintaining a data catalog becomes a secondary activity by the time it finally starts to be implemented.
There are immediate hurdles that teams face when trying to get their catalog up to speed with the current state of their data, the biggest one being the sheer number of data assets that need to be documented. This causes a data catalog to be sparsely populated or lacking overall depth and quality for the data that is documented. And in turn, data users do not leverage the data catalog, which prevents a community from being developed around the metadata, which makes it more difficult to build trust in the data. In the end, overall adoption and usage of data to drive the business fails to flourish.
[00:11:34] Unknown:
In terms of the available options for being able to establish a data catalog, there are a number of different products on the market or strategies involved, where some people might just use the internal company wiki and update it manually, or they might rely on a managed service or a component of a platform that they're already using, whether that's something like Infoworks or Collibra, or they might use an open source platform along the lines of an Amundsen or Metacat or Metamapper. And then there's also, you mentioned data lineage, and there's a whole different set of products that's targeted at that area of things, such as Marquez or DataHub. I'm curious if you can give a bit of an overview as to some of the relative trade offs of the different options and your overall take on the current state of the market for both metadata management and data lineage, and some of the challenges that exist particularly for the scale of company that you're targeting with Tree Schema?
[00:12:33] Unknown:
So I think that Amundsen, DataHub, Metacat, they're all really inspirational products. And you're absolutely right. We're addressing the problem from slightly different perspectives, and I'm looking forward to seeing how they mature. There's definitely places where we take inspiration from them, and we would be very honored as well if at some point they're taking some inspiration from us. And consumers should definitely be excited about the growth of metadata management and data lineage in particular over the next 2 to 3 years. I think it's gonna be just a really explosive space. So there's a couple of things that are unique to Tree Schema. The first, in a word, is simplicity.
And this is gonna be a topic that I'm sort of harping on over and over because, for us, it is the most important factor for startups and small and medium sized businesses. Tree Schema is 100% turnkey. From the moment you sign up, it only takes a few moments, again, to point to your database and extract all of your different metadata into your data catalog. Even if your data sits in a private network or is behind a firewall, we allow you to connect through jump servers to access your data safely. So, historically, for a company to use a data catalog, there has been one single entry point. It's been the data engineer or some other developer with a technical background.
If you look at the absolute best case scenario for implementing a data catalog internally, you're looking at using some form of containerized application that you can deploy, but even the best developed containers crash for some reason or another if they run long enough. So then you need to start looking at setting up your own storage or having external storage. And then you're thinking about how do you deploy that application as well to different environments to be able to test it and maintain it. I'm essentially giving the SaaS sales pitch here. But the point being that even in the best case scenario, it takes time to properly set up a data catalog. Your users will be pinging you about how to use it. Once it is set up, it will eventually fail, and you'll have to spend time to read the source code and understand, you know, why your company's unique usage patterns are causing some issues. The point of all of this, again, is that it takes more time from what's arguably one of the most important positions in the company, the engineer.
So Tree Schema's paradigm breaks this pattern completely. Our product is so simple, we often see data engineers as the first users to sign up because they're the ones who are researching. They've been given the task to bring the data catalog into the company. But then we see sort of this hand off to the data users, where you'll have the analysts or the data scientists who are going to drive the integration and the population of the data. And that ownership and that switch from data engineer to data user really frees up more time for the engineer to get back to doing what drives their business forward. The second feature that is really unique to Tree Schema is our API and, in particular, the Python client that we have developed as a wrapper to the API.
There are definitely services and APIs that other catalogs have. The most popular, I think, that comes to mind for me is Apache Atlas. But what is unique about Tree Schema's client is that you can interact with your data catalog in an object oriented approach, using native Python objects. A lot of time has been spent developing this client to make it as simple as possible, and there are really two features of the Tree Schema Python client that are by far the most popular. The first one is the ability to manage data lineage as code. So data lineage, which we've touched on a couple of times, at its core describes how data moves from one field in a given schema to another field in a different schema. I hope I can do justice to the simplicity of our Python client by describing it here. Effectively, users can create links between fields with only a handful of lines of code. We have several examples of this on our website, even one example that directly integrates with a live Faust streaming app. Our suggestion to users is to have the data lineage script in your CI/CD pipeline since it has been developed to be idempotent.
Again, we continue to hold to this basic principle of simplicity. This is really beneficial because not only do you get to manage your data lineage as code, with our Python client you don't even have to worry about checking or updating the status of your existing data lineage each time you deploy your application. When you deploy your app, you can set the state of your data lineage in Tree Schema to be whatever you want. Whether that has changed or not since your last deployment doesn't matter. The second feature that we see as wildly popular is the ability to define sample values as code. And one of the most critical aspects to a well curated data catalog is human involvement.
There is a lot that Tree Schema and other data catalogs can do to infer on their own about the shape and semantics of your data, but we cannot infer the meaning, not yet at least. So for example, you may have a field called status code in your customer table with the values 1, 2, 3, etcetera. Well, what do those mean? This is often one of the most common questions that data users have: what does this data actually mean? In Tree Schema, we call these sample values or field values. The Python client we have allows you to define your sample values and to capture their meaning in the same place that the data is actually created, the code. This, again, is really powerful because, just like the data lineage, you can capture and manage specific values and their definitions in the code, but share that knowledge more broadly in a structured way within Tree Schema.
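To make the lineage-as-code and sample-values-as-code ideas concrete, here is a minimal sketch of what that kind of workflow can look like. The entry point, method names, and schema names below are illustrative assumptions for the discussion in this episode, not the documented Tree Schema client API, so consult the official client documentation for the real interface.

```python
# Hypothetical sketch of lineage-as-code and sample-values-as-code.
# The client entry point, method names, and schema names are assumptions
# made for illustration; they are not the documented Tree Schema API.
from treeschema import TreeSchema  # assumed entry point

ts = TreeSchema("user@example.com", api_key="...")  # placeholder credentials

# Look up the source and target schemas by name (assumed helpers).
raw_events = ts.data_store("Kafka Prod").schema("user_events.v1")
warehouse = ts.data_store("Redshift").schema("analytics.user_events")

# Declare field-to-field lineage. Because the client is described as
# idempotent, running this in a CI/CD pipeline keeps the catalog's lineage
# in sync with whatever state the code declares on every deploy.
transformation = ts.transformation("Kafka -> Redshift user events")
transformation.set_links([
    (raw_events.field("user_id"), warehouse.field("user_id")),
    (raw_events.field("event_ts"), warehouse.field("event_timestamp")),
])

# Capture sample values and their business meaning next to the code that
# actually produces them, e.g. the status_code example from the episode.
warehouse.field("status_code").set_sample_values({
    "1": "Account active",
    "2": "Account suspended",
    "3": "Account closed",
})
```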
[00:17:56] Unknown:
There are a couple of points that are worth digging into from that. 1 of them, which I think we can touch on a bit later, is the idea of versioning and schema evolution so that if you have an existing set of fields for a particular database table, for instance, and then in the Python client, you push a schema that is completely orthogonal to what was there, being able to have some sort of check to make sure that it wasn't completely an error or that the schemas are at least compatible from an evolutionary sense. And then the other element is a lot of data catalogs are focused on dealing with static metadata where there's the database schema. It's relatively consistent. It might change a little bit here and there, but it's not going to be going through a constant rate of change.
And then there's the other element of streaming data, which again is probably going to have fairly consistent schemas as long as you have your pipeline structured deliberately. But I'm curious in terms of what you have seen as far as the challenges of being able to reconcile things like database metadata and metadata that is in a data lake, for instance, alongside something like data going through Kafka topic or being processed in the example you gave using FAUST, which is a library for being able to operate on Kafka streams?
[00:19:16] Unknown:
So our data catalog primarily focuses on where data sits and where it physically resides. And within Tree Schema, we call that a data store. And so a data store can be anything from Postgres. It could be S3. It could be Kafka. It is that physical underlying place where the data is sitting. So within a data store, you have a schema, and this can represent itself in many different ways. It could be a table if you have a SQL based schema. It could be JSON. It could be Parquet or Avro. It, again, is the semantics and representation of the shape of the data.
Within the schema, you have fields, and then fields have specific values to them. The way that we approach data movement between schemas is going to be what we call transformations. And a transformation is, by itself within Tree Schema, just a container or a shell, if you will. And transformations are holders for transformation links. And that link is how data moves from 1 field in 1 schema to a corresponding field in another schema. This may be done through some sort of SQL process where you have an event based trigger. It could be done through a Faust app where you have a streaming process. It could be a batch job that's doing what we call a lift and shift of moving all of your data from 1 table and dumping it into a data lake or into Redshift.
It is just a semantic representation of how data moves. So Tree Schema itself does not actually do any of the data movement. We really sit outside of the data ecosystem and extract as much metadata as possible. From the very beginning, the question was how do we integrate with different data stores and how do we have this common approach to tabular data, unstructured data, or potentially even data stores that don't really have any sort of structure, whether it's what we consider traditional unstructured JSON or Parquet, or maybe it's an email server and you just get emails and it's just free text.
The way that we wanted to handle this integration with these different data sources was to really create a unified perspective of what a schema means. And the reason we did this at the schema level is because this is generally how users are interacting with data. People refer to tables or they refer to files or locations where files are. And on occasion, they'll go into the specific fields and they'll talk about that. But for the most part, your schema is gonna be a representation of the entity, and that entity is going to drive your business in some way, shape, or form. So, traditionally, you have tables that are going to have no structure outside of what's defined.
They're gonna be flat. But what we do is we treat them under the hood as a JSON structure. We use the same open source JSON standard that is used within Kafka and Kafka Connect. And we leverage that for all of our schemas internally in order to give ourselves that unified perspective of what it means to have a data schema. And so by doing this, we can add just a little bit of additional metadata and context to our schemas and whether it's a flat structure or it's some nested structure and object that has arrays and other sorts of embedded fields, we get this really clean method for being able to do comparisons across different schemas.
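As a rough illustration of the unified representation being described, the Kafka Connect JSON schema convention expresses both flat tables and nested documents with the same vocabulary of struct, array, and primitive field descriptors. The exact structure Tree Schema stores internally isn't spelled out in the episode, so the examples below are an assumption based on the Connect convention, with made-up table and field names.

```python
# Rough illustration of the Kafka Connect-style JSON schema vocabulary
# mentioned above. The table and field names are made up; how Tree Schema
# stores these internally is an assumption based on the Connect convention.
flat_customers_table = {
    "type": "struct",
    "name": "public.customers",
    "fields": [
        {"field": "customer_id", "type": "int64", "optional": False},
        {"field": "status_code", "type": "int32", "optional": True},
        {"field": "created_at", "type": "string", "optional": False},
    ],
}

# A nested document, such as a JSON event on a Kafka topic, uses the same
# vocabulary with struct and array types embedded inside the fields, which
# is what makes comparisons across flat and nested schemas uniform.
nested_order_event = {
    "type": "struct",
    "name": "orders.order_placed",
    "fields": [
        {"field": "order_id", "type": "string", "optional": False},
        {
            "field": "line_items",
            "type": "array",
            "optional": False,
            "items": {
                "type": "struct",
                "fields": [
                    {"field": "sku", "type": "string", "optional": False},
                    {"field": "quantity", "type": "int32", "optional": False},
                ],
            },
        },
    ],
}
```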
[00:22:53] Unknown:
And I'm wondering if you can dig a bit more into how the Tree Schema system itself is actually implemented and some of the internal architecture and the ways that the system has evolved since you first began working on it and began onboarding more users?
[00:23:09] Unknown:
Sure. So I'll break this down sort of into application and databases and then the deployment side as well. On the application side, we are primarily a Python shop. And with that, we use Django to serve the app. So Django works really well for us because it enables speed to market. We can quickly test and ship new features, and it has excellent integration with Postgres and Redis, which are, you know, two of my personal favorite databases. I'm a huge Redis fan, so you'll see that pop up quite a few times in this overview. In addition to using Redis as our Django cache, we also use it in the background with our Celery app integration because there's a lot that we cannot process synchronously.
For example, when you point Tree Schema to your database, we will extract all the metadata that exists in that database. When you do this, you have the option to allow Tree Schema to capture sample values. And what this means is that for every field in your database, we'll capture somewhere between 10 and 20 unique values. And for those of you thinking about it, no, we're not gonna bog down your database with a full table scan. So even if you have a massively large table, the impact should be relatively small. Nonetheless, it's not uncommon for an organization to have thousands of tables in Redshift, as an example.
And executing these queries on a few hundred to a few thousand tables can take some time. So that's where this async process runs on Celery. There are two other major components of the app that we use Redis for. First is data discovery. Users can search their entire data catalog from a single simple but powerful search. The Redis full text search is used here. I'll quickly touch on why this decision was made versus, say, Elasticsearch, since I believe that product is being used in some of the open source competitors. I love Elasticsearch, and it's also one of my favorite databases. But for us, the queries we're submitting for this text search are rather simple.
Elasticsearch really shines if you have a complex query such as "get me all the results for data lineage where catalog is not within 10 words and it occurred within the past 2 weeks." Given that we currently have simple queries and we're already using Redis, we decided to go with the full text search built into Redis to reduce the overall infrastructure complexity. The second place that we use Redis is going to be RedisGraph, and we use this as our graph database to query for data lineage. One of the downsides to this is that querying the data analytically to understand user behavior is rather difficult. So to get around this, we actually persist all of our links within the transformations.
Those are the fields that connect from one schema to another. We persist all of them in Postgres, and we check every so often that the two are in sync. On the front end side, there's a little bit of JavaScript sprinkled in there to really make things pop. But for the most part, Django templates are used to determine the content layout. Last but not least, NGINX is used as a reverse proxy to route traffic, and that's pretty much it on the application side. For deployments, we are serverless wherever possible. So the Django app and NGINX are deployed together via ECS in a single task definition.
It essentially provides similar network routing to what Docker provides. From time to time, I talk to devs who are not as familiar with this ECS feature. In short, you can just use localhost and the corresponding port, and you can route traffic between containers that are coupled and deployed this way. It's one of my favorite ECS features, and I think it's just a fantastic way to deploy services. There are a few other long running services that we have running on ECS, one of which is our neural network that predicts whether or not a data asset is considered personally identifiable information, and it automatically tags the corresponding asset.
Our static content for the app is served via CloudFront. One of the great things, again, about Django is that it has such an incredible community built around it. And one of the packages, I'm forgetting the specific name right now, but it enables you to change the static file source host to a CDN. And by doing this, we actually offloaded so much processing from the Django app to CloudFront that it allowed us to reduce the number of ECS containers running by nearly a quarter. And so that's just been, like, a really great feature, I think, that we've been able to roll out because of the community built around Django. All of our internal microservices are deployed as Lambdas, as are the external facing REST APIs, with the exception that they also leverage the API Gateway. And, of course, we leverage RDS and ElastiCache for our databases, respectively.
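The package described here sounds like django-storages, which is listed in the links for this episode. For reference, a minimal sketch of that kind of CDN offload looks roughly like the settings below; the bucket and CloudFront domain are placeholders, and this is one common configuration rather than Tree Schema's actual settings.

```python
# settings.py -- minimal sketch of offloading Django static files to S3 +
# CloudFront with django-storages. Bucket and domain names are placeholders;
# this illustrates the general approach, not Tree Schema's actual config.
INSTALLED_APPS = [
    # ... existing apps ...
    "storages",
]

# Store collected static files in S3 and serve them through CloudFront.
# (Django 4.2+ prefers the STORAGES dict; this older setting still works there.)
STATICFILES_STORAGE = "storages.backends.s3boto3.S3Boto3Storage"
AWS_STORAGE_BUCKET_NAME = "example-static-assets"        # placeholder bucket
AWS_S3_CUSTOM_DOMAIN = "dxxxxxxxxxxxx.cloudfront.net"    # CloudFront domain
AWS_QUERYSTRING_AUTH = False  # emit plain, cacheable URLs without signatures

# Templates using {% static %} now emit CDN URLs, so the Django containers
# on ECS never have to serve static assets themselves.
STATIC_URL = f"https://{AWS_S3_CUSTOM_DOMAIN}/static/"
```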
[00:27:36] Unknown:
In terms of the overall approach or the overall goals of the product, I'm curious if there are any initial assumptions that you had going into this or ideas as to what you thought were going to be the needs of your end users that have had to be updated or changed or a new direction decided upon once you created the initial launch and started bringing people onto the platform?
[00:28:00] Unknown:
I think, by far, the most innovative way that I've seen Tree Schema used by a client has been that the client went ahead and developed this utility, effectively a learning plan, on top of Tree Schema. And so as they're onboarding new users, in particular data scientists or analysts, this can be extremely burdensome for teams because those new users need to have so much knowledge about the data before they can get up to speed and really provide value. And you see companies put together traditional learning plans that consist of a buddy or, you know, some tables that they need to interact with, right? An introductory problem to force this person to learn the relationships and cardinality of the data and so forth. And so this client was leveraging asset tagging in order to curate a step by step learning plan for each of their three data teams.
When their new users logged in to Tree Schema for the first time, they could just search for tags by their name, and these tags might be marketing learning plan or operations learning plan or whatnot. The person who had this idea, when I was speaking to them on the phone, they walked me through it on a screen share, and I nearly made them an offer to come and be a product manager at Tree Schema on the phone. That was really an interesting and intriguing way, I think, to see the product being used and something that taught me about what consumers are really looking for in this space to drive value on top of just capturing their metadata and cataloging their data. There have been some unexpected uses as well. I've seen one company use it to save a large number of emails with attachments as a way for their entire team to quickly find historical metadata that a vendor sent to them. A little bit of context: we have this catchall, quote, unquote, data store called other, and we recommend that it be used when you wanna capture metadata from a database that we haven't integrated with. So going back to this company, they have hundreds of these mainframe files that have been emailed to them by their mainframe partner. They want to process them into the data catalog and have them saved so that they can access them and search for them when they need to. On top of that, they wanted to create a very basic data lineage.
And what they did for each of these files was to just add a single field and then link the files together using this single field. A little bit different, I think, than what we were expecting, but I think for them it works, and what we're proud about is seeing people use this product however they feel best and getting the value that they need.
[00:30:23] Unknown:
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours or days. DataFold helps data teams gain visibility and confidence in the quality of their analytical data through data profiling, column level lineage, and intelligent anomaly detection. DataFold also helps automate regression testing of ETL code with its data diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. DataFold integrates with all major data warehouses as well as frameworks such as Airflow and dbt and seamlessly plugs into CI workflows.
Go to dataengineeringpodcast.com/datafold today to start a 30 day trial of DataFold. Once you sign up and create an alert in DataFold for your company data, they'll send you a cool water flask. In terms of the overall design of the product and how you approached the interfaces and the APIs and the structures that were available for being able to record metadata, I'm wondering what your guiding heuristic was for being able to figure out how it would be presented to the end user and the necessary fields and information that would need to be captured?
[00:31:46] Unknown:
Sure. And just to, again, recap what the major entities are within Tree Schema: there's data stores, schemas, fields, and transformations. Those are really the big four. And when you look at Tree Schema and you're interacting with each of these, they all follow a very similar layout. And the reason for that goes back to simplicity. We want the layout of Tree Schema to be simple. We want it to be repeatable, something that people can understand easily, and this was an area that we spent a lot of time on even before we began development: looking at what consumers need to have in a data catalog.
What are the ways that they need to extract information, the things they need to capture? And, really, we spent a lot of time thinking about that user experience, and that drove how we were able to design the underlying tables and, really, the models within Django that are gonna be supporting those tables. And so there were two things that we wanted to achieve. One was this repeatable and consistent interaction for the user. And the second one was the ability to have a flexible system that could handle any different type of data store, schema, and fields. So I touched on the common layout a little bit. In order to solve the second point, the flexibility to have different data types and data stores, what we did was we leveraged Postgres' built in JSON data type.
And to give an example of how this plays out, if we look at just two of the data stores that Tree Schema supports, Postgres and DynamoDB, they each have attributes that are unique. Postgres has a host and a password, whereas Dynamo has AWS keys and a region. Nearly everything else about that database can be summarized with a common set of attributes, and therefore we place those common values into columns. Anything that is unique to this particular database is stored within JSON. I think it just sort of played out that we don't really run any queries against those unique attributes, so the indexing and performance impacts of directly querying on the JSON fields are not really big. And in addition, we get further benefits because, since we have such a similar layout between data stores and schemas and fields, we can take a lot of those same sets of columns and apply them to the different tables. So we're talking about the description. We have tags, comments.
We have a name, a type. All of these are relatively ubiquitous across the different entities that we have. And so we're using the Django models under the hood as an abstract class to really define the implementations of the tables that we're gonna have. And so all of these major entities inherit from this same class.
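A minimal sketch of this modeling pattern, assuming Django models: shared catalog attributes live on an abstract base class, and anything unique to a particular data store sits in a Postgres JSON column. The model and field names here are illustrative guesses, not Tree Schema's actual schema.

```python
# Minimal sketch of the pattern described above: common catalog attributes on
# an abstract base model, store-specific attributes in a JSON column. The
# model and field names are illustrative guesses, not Tree Schema's models.
from django.db import models


class CatalogAsset(models.Model):
    """Columns shared by data stores, schemas, fields, and transformations."""
    name = models.CharField(max_length=255)
    type = models.CharField(max_length=64)
    description = models.TextField(blank=True)
    created_at = models.DateTimeField(auto_now_add=True)

    class Meta:
        abstract = True  # no table of its own; concrete models inherit the columns


class DataStore(CatalogAsset):
    # Attributes that differ by engine (host/password for Postgres, AWS keys
    # and region for DynamoDB, ...) live in JSON, since they are never queried
    # directly and dedicated columns would buy little.
    connection_details = models.JSONField(default=dict)


class Schema(CatalogAsset):
    data_store = models.ForeignKey(DataStore, on_delete=models.CASCADE)
    # Store-specific schema attributes (e.g. topic config, file format) can
    # follow the same JSON pattern.
    extra_attributes = models.JSONField(default=dict, blank=True)
```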
[00:34:32] Unknown:
As far as the flexibility of that structure and in particular being able to track rich information about the data lineage and the transformations that are being performed, I'm curious what the options are, particularly for being able to do things like integrate with a workflow manager along the lines of a Dagster or an Airflow.
[00:34:51] Unknown:
I think that this is an excellent idea. It's not something that we are currently doing today, and it's an area, I think, that we've talked a little bit about, but it really hasn't been prioritized quite yet. I've seen a couple of other products out there. I think this is something that Amundsen does, and they do it quite well. So I think we're certainly gonna look to take some inspiration from them and, more so, talk to the people who are using those products to see, like, what is it that they're enjoying most about it. You know, really trying to understand what the user experience is and the problem that consumers need to solve and that it solves for them.
[00:35:25] Unknown:
Continuing on with discussing some of the capabilities of things like Amundsen, another element of that is being able to track the popularity of a given set of data, where if you're searching for a particular field, you might be able to find 5 different tables that have relevant information. But based on the previous search activity or user contributed information, you can actually say that this particular table is the one that is most actively used or is most up to date. I'm curious if you have any capabilities like that built into Tree Schema.
[00:36:02] Unknown:
Yeah, we do. So our full catalog search is gonna have pretty good coverage of that. We have a ranking mechanism that I think we're looking to improve. It's not the best at the moment, but it does have some pretty naive ways to be able to rank data assets that are returned when you search, again, based off of others who have used the product as well. We have a couple of other ways that we approach this in addition. One, we call it power users. So as users are interacting with Tree Schema and they're interacting with the data, you can see who the power users are, who are using this data most frequently.
And this is a great way because you can look at any particular data asset and see who the power users are that are actually leveraging this data. And that's just, like, a really quick way to understand who you can go to if you have some elevated question that you can't get answered directly from the data catalog. But there's also this other set of people who may be hidden users as well. They may be experts, but they don't necessarily use the data. Maybe they don't look at the documentation. Maybe they created the table or the pipeline some time ago, and the relevance of their interactions has just dropped off in the background because they worked with it so long ago and it's not really fresh anymore. And so we give the ability for users to volunteer as experts.
And so, again, within any single data asset, you can volunteer yourself or remove yourself as an expert, which is just a really great way again for people to say, you know, I don't necessarily use this, but I know a lot about it. You can ask me questions if there's anything you'd like to know. We also promote this information about who uses what assets in the data catalog within the teammate section. So you can go and effectively shadow what your other teammates are doing. You can see what schemas they're looking at, what fields, what data stores. We're gonna be trying to enhance this a little bit in the next few quarters to show not only what they are using, but to give more specifics around how they are using it, in particular with some of the SQL based products that we support.
[00:38:04] Unknown:
Going back to the earlier point of versioning and evolution of schema information, I'm curious what your capabilities are as far as being able to identify if there's a potential conflict in the evolution of a set of schemas, such as you're changing a column in a database from a text field to a Boolean field or something, or maybe from a Boolean to a float. And then also just being able to track those changes so that if somebody's looking at a table that maybe they had worked with in the past and it's gone through some evolution, and now they're looking to see what its current state is, and see what it was at the time that they were using it before, what the intermediate steps were, and what it is now, are you able to surface that information?
[00:38:51] Unknown:
So currently, what we offer today is we raise what's called a governance alert when there's a breaking change that has occurred. And you essentially just, you know, hit the nail on the head: whenever you remove a field or you're going to be changing the data type of a field to one that's no longer compatible, we raise that information so that a data steward can take action and they can do something with that. And we have a whole range of different governance actions that are effectively used within Tree Schema to keep your catalog up to date and to make sure that it is remaining fresh. In terms of schema management in particular, this is an area that we're actively looking at, and the way that we're thinking about how to solve the problem really, again, needs to be comprehensive for our customers. So when I think about schema versioning, it's not just the schema that we need to version; we also need to think about what are the transformations that are impacted as well by this. Because if you're changing the schema, you also have the potential to be changing a transformation.
And so as we think about the implicit relationships that we have between schemas and fields and transformations in particular, we really need to make sure that we have a comprehensive way to provide versioning across both of those entities, schemas and transformations. There are questions that we're looking for feedback on from users, such as: if a schema changes and it impacts a transformation, should the transformation even be updated? Should schema versions automatically be updated, or should it prompt a data steward to review when the changes occur? There's another feature that we launched recently where Tree Schema can be scheduled to automatically sync itself with your database on a set cadence, effectively every day or once a week. You can have Tree Schema make sure that it has the most recent representation of your data. And so how this plays along with schema versioning is something that's gonna have a really big impact on the end solution as well. I would say be on the lookout for something in early Q1 on this. It's still in early discussions, but I think it's something that we're really excited about as well.
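To illustrate the kind of breaking-change detection that drives a governance alert, here is a small, simplified sketch that flags removed fields and incompatible type changes between two versions of a schema. The compatibility rules and data structures are assumptions made for the example, not Tree Schema's implementation.

```python
# Simplified sketch of breaking-change detection between two schema versions:
# flag removed fields and incompatible type changes. The compatibility rules
# here are illustrative assumptions, not Tree Schema's actual logic.

# Type changes treated as safe widenings in this example.
COMPATIBLE_WIDENINGS = {
    ("int", "float"),
    ("int", "string"),
    ("float", "string"),
}


def breaking_changes(old_fields: dict[str, str], new_fields: dict[str, str]) -> list[str]:
    """Return alerts for removed fields or incompatible type changes.

    Inputs map field name -> type name for the old and new schema versions.
    """
    alerts = []
    for name, old_type in old_fields.items():
        if name not in new_fields:
            alerts.append(f"Field '{name}' was removed")
            continue
        new_type = new_fields[name]
        if new_type != old_type and (old_type, new_type) not in COMPATIBLE_WIDENINGS:
            alerts.append(f"Field '{name}' changed from {old_type} to {new_type}")
    return alerts


# Example: a text column changed to boolean and a dropped column both raise
# alerts that a data steward could then review.
print(breaking_changes(
    {"status": "string", "legacy_flag": "int"},
    {"status": "bool"},
))
# -> ["Field 'status' changed from string to bool", "Field 'legacy_flag' was removed"]
```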
[00:40:48] Unknown:
And then another aspect of what you're building at Tree Schema: you mentioned that you're focused on small to medium sized organizations. And I'm wondering what you see as being the overall limitations of scale, either in terms of the technical capacity or the cognitive complexity that it can handle, and when somebody might want to go to something else like a DataHub that is more architecturally complex, but possibly more flexible?
[00:41:15] Unknown:
I don't know what the true limitations are off the top of my head. You know, given that our deployment is serverless with the exception of the databases, I think we have pretty high technical capacity on the app side. The bottleneck being, you know, with the databases; that's potentially one area where we could run into some constraints before we need to think about different data structures and, as you mentioned, have maybe a slightly different architecture in the way that we're persisting and accessing our data. I think we have a really long way to go before we're even running into that. And we have, like, a really great read to write ratio in our favor, in that we have significantly more reads, to the tune of, I think, 15 to 1 reads to writes currently. And so we can continue to add more read hosts or continue to scale in that way for a while.
If you consider every level of detail that Tree Schema captures as a data asset, the data sources, schemas, fields, especially the sample values for each field, the tags, every data lineage link, then we have some clients that have several million assets in our catalog, and we're consistently monitoring latency throughout our ecosystem to try and make sure that our users have the best possible experience. The number one reason right now that we have response times greater than 1 second is actually the Lambda cold start, and that's a problem that I'm okay with at the moment. In terms of when somebody should think about a different data catalog, or potentially having something that's architected more for their specific needs: I think once you start to need to interact with Tree Schema for support on maybe a weekly cadence, or you need to continue to suggest features because your team has to have some particular capability within the product, that's when Tree Schema may not necessarily be the right product. If you're looking at potentially having a DataOps solution where you've got this really great product that can pull data from your data sources, create virtualization layers, and run dashboards all within one place, that's also a place, I think, where Tree Schema is just not gonna provide as much value, because you generally get this really great side benefit from DataOps in that they already extract and sort of capture that metadata, both about the data itself as well as data lineage.
[00:43:29] Unknown:
And in your experience of building Tree Schema and providing it as a service and trying to grow a business around it, what have you found to be some of the most interesting or unexpected or challenging lessons that you've learned in that process?
[00:43:42] Unknown:
Okay. I love this question. It's like this brutally honest, face yourself in the mirror question. And frankly, it's something that I spend a lot of time on. Promoting a data catalog is tough. There are a lot of companies out there that are just starting. There are larger, well funded companies that are targeting enterprise customers, but their weight really draws organic traffic to them. The incumbents have completely different sales cycles than we do, and they can easily justify bidding up cost per click advertising. Not to mention, a data catalog isn't exactly the kind of new product you wake up in the morning excited to read about.
So, yeah, it's tough. In the same vein, though, it has led to new opportunities. It's caused us to think about new ways of engaging with prospective customers. And this engagement has been one of the most exciting yet unexpected lessons about building Tree Schema. The focus for us has really been on one-on-one connections with individuals. When you talk to the data engineers or to the data scientists in the company that are creating or using data on a daily basis, Tree Schema resonates with them. In addition, if you have these really raw and unfiltered dialogues with prospective customers about what exactly they would need from a product in order to purchase it, then you get this really fantastic set of inputs to your prioritization and to your backlog. And that's just, like, a really rewarding thing to see and to feel, especially when you see your product being aligned and heading in the direction that solves a problem that they have. You know, growing up, my father was a salesman, and I always told myself, no matter what, I'm not gonna go into sales. And now I've become this sort of digital salesman.
So I guess that's just karma. But, yeah, I would say support a small business, try Tree Schema. We haven't taken any sort of funding. So we would love to, again, help small businesses and support them as well.
[00:45:24] Unknown:
What are the cases where Tree Schema is the wrong choice and somebody would be better served with a different style of data catalog, or the integration requirements that they have don't match with what is possible with Tree Schema?
[00:45:38] Unknown:
I think that Tree Schema is not the right choice if you're looking for a DataOps platform. There are a lot of great products that have come out recently that enable users to pump data into a single system, again, for ELT or ETL processes, where you have your data analysis, visualization, access controls, and much more. I think that these terms, DataOps and data catalog, end up being conflated just because there are so many companies that do both. Tree Schema is a data catalog in the truest sense. It sits outside of your applications and tries to just read the metadata that is already being created in the most lightweight way that is possible.
[00:46:19] Unknown:
As far as the future of the product, what are some of the new features or capabilities or new integrations that you're looking to release in the coming months and years?
[00:46:30] Unknown:
So first, I think the biggest one that I'm looking forward to releasing is gonna be our API enhancements to analyze breaking changes. I wanna give developers tools that allow them to recursively check all downstream impacts to a field when it's updated or removed from a schema, as you mentioned earlier. There's a whole list of questions that we want to help developers answer so that they can have confidence in their changes before moving them to production. I think that everyone has run into the problem where some ELT job was updated and it impacted a dashboard two or three steps removed from the actual change. There's no reason this should ever happen, and we're gonna solve this with our API and build it in a way that data engineers can use in their pre-deployment checklist. We're taking a lot of inspiration from how Avro works with regard to schema compatibility and applying it to data lineage.
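The recursive downstream-impact check described here is essentially a traversal of the lineage graph. Below is a small sketch under the assumption that lineage links are available as an adjacency map of field to downstream fields; the actual shape of the forthcoming API isn't covered in the episode.

```python
# Sketch of a recursive downstream-impact check over lineage links, assuming
# lineage is available as an adjacency map of field -> downstream fields.
# The forthcoming Tree Schema API's actual shape is not described here.
from collections import deque

# field -> fields it feeds, e.g. assembled from transformation links
LINEAGE = {
    "raw.orders.amount": ["warehouse.orders.amount"],
    "warehouse.orders.amount": ["reporting.daily_revenue.total"],
    "reporting.daily_revenue.total": ["dashboard.revenue_widget.total"],
}


def downstream_impacts(field: str, lineage: dict[str, list[str]]) -> set[str]:
    """Breadth-first walk of everything downstream of a changed field."""
    impacted: set[str] = set()
    queue = deque([field])
    while queue:
        for child in lineage.get(queue.popleft(), []):
            if child not in impacted:
                impacted.add(child)
                queue.append(child)
    return impacted


# A pre-deployment check could fail the build if a field slated for removal
# still feeds a dashboard two or three hops away from the actual change.
print(downstream_impacts("raw.orders.amount", LINEAGE))
# -> {'warehouse.orders.amount', 'reporting.daily_revenue.total',
#     'dashboard.revenue_widget.total'}
```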
Second, we're building deeper integrations into the visualization tools and data usage in general. We're probably gonna start with some of the big ones such as Tableau and Looker, but we really want to extend this into other dashboarding products as well. And when you think about the personas that exist in the data catalog, there are effectively four. One, there's the data creator. This is your engineer or developer. Two, you have a data superuser. This person is gonna be comfortable going directly to the data sources, figuring out problems on their own, probably a data scientist or a data analyst. Three, you have a data non-superuser, and this is gonna be somebody who is really just using the dashboards, maybe doing some basic SQL, but doesn't really have the ability to figure things out on their own. And then four, you have the executive level leadership.
So the integration supports this nontechnical user and some of the more basic use cases for super users. But we're thinking about how do we bring last mile visibility, not just to the non super users, but to your data scientists and data analysts and other super users as well. You know, data scientists in particular tend to have relatively complex workflows and pipelines that exist solely within their models. And having that level of visibility all the way from the source, not just to when a data scientist picks up the data and generates features, but through the decision and the creation of that probabilistic output is a really valuable thing in particular when you have governance or regulation that needs to monitor that information.
So I don't have specifics quite yet on what that's gonna look like but it's something that we're thinking about. 3rd, as we continue to push for simpler and easier integrations with Tree Schema, especially for our data engineers, We're gonna be building now a singer. Io target integration. So, again, this quest is meant to just make Tree schema the easiest way to get your data catalog populated, and singer. Io already has a lot of great connections to existing data source. Their taps are relatively exhaustive. And we're just trying to figure out what's the right way to do this for the tree schema integration.
And, of course, 4th, we touched on this a little bit earlier, is around schema and transformation versioning and how do we enable people to get visibility into their changes and life cycles over time.
[00:49:40] Unknown:
Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I would just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:49:55] Unknown:
Yeah. I think that there needs to be standardization across the format and method for collecting data lineage metadata. I think there eventually needs to be some form of REST API or JSON object or something similar because it would need to be ubiquitous across databases, ETL or ELT tools, and dashboarding products. There's a lot of data catalog tools out there, and the lack of centralization in particular around data lineage, I think, really hurts consumers because it prevents data lineage from being able to grow more quickly as a general capability.
I think Apache Atlas probably has the closest thing to what this standard would be, and I think that the larger community as a whole should really have a conversation around how to bring a structure such as this to the problem on a broader scale.
[00:50:45] Unknown:
Well, thank you very much for taking the time today to join me and discuss the work that you've been doing with Tree Schema. It's definitely a very interesting product and an important problem area. So I appreciate all of the time and effort you've put into that, and I hope you have a good rest of your day. Pleasure was all mine. Thank you, Tobias. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction to Grant Seward and Tree Schema
Motivation Behind Tree Schema
When to Implement a Data Catalog
Overview of Metadata Management Options
Tree Schema's Approach to Data Integration
Internal Architecture of Tree Schema
Unexpected Uses and Lessons Learned
Designing User Interfaces and APIs
Integrating with Workflow Managers
Handling Schema Evolution and Versioning
Challenges and Lessons in Building Tree Schema
Future Features and Integrations
Biggest Gaps in Data Management Tooling