Summary
Architectural decisions are all based on constraints and a desire to optimize for different outcomes. In data systems, one of the core architectural exercises is data modeling, which can have significant impacts on what is and is not possible for downstream use cases. Incorporating column-level lineage into the data modeling process encourages a more robust and well-informed design. In this episode Satish Jayanthi explores the benefits of incorporating column-aware tooling in the data modeling process.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enables you to automatically send data to hundreds of downstream tools. Sign up for free at dataengineeringpodcast.com/rudderstack
- Your host is Tobias Macey and today I'm interviewing Satish Jayanthi about the practice and promise of building a column-aware data architecture through intentional modeling
Interview
- Introduction
- How did you get involved in the area of data management?
- How has the move to the cloud for data warehousing/data platforms influenced the practice of data modeling?
- There are ongoing conversations about the continued merits of dimensional modeling techniques in modern warehouses. What are the modeling practices that you have found to be most useful in large and complex data environments?
- Can you describe what you mean by the term column-aware in the context of data modeling/data architecture?
- What are the capabilities that need to be built into a tool for it to be effectively column-aware?
- What are some of the ways that tools like dbt miss the mark in managing large/complex transformation workloads?
- Column-awareness is obviously critical in the context of the warehouse. What are some of the ways that that information can be fed into other contexts? (e.g. ML, reverse ETL, etc.)
- What is the importance of embedding column-level lineage awareness into transformation tool vs. layering on top w/ dedicated lineage/metadata tooling?
- What are the most interesting, innovative, or unexpected ways that you have seen column-aware data modeling used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on building column-aware tooling?
- When is column-aware modeling the wrong choice?
- What are some additional resources that you recommend for individuals/teams who want to learn more about data modeling/column aware principles?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- RudderStack: ![RudderStack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png) RudderStack provides all your customer data pipelines in one platform. You can collect, transform, and route data across your entire stack with its event streaming, ETL, and reverse ETL pipelines. RudderStack’s warehouse-first approach means it does not store sensitive information, and it allows you to leverage your existing data warehouse/data lake infrastructure to build a single source of truth for every team. RudderStack also supports real-time use cases. You can implement RudderStack SDKs once, then automatically send events to your warehouse and 150+ business tools, and you’ll never have to worry about API changes again. Visit [dataengineeringpodcast.com/rudderstack](https://www.dataengineeringpodcast.com/rudderstack) to sign up for free today, and snag a free T-Shirt just for being a Data Engineering Podcast listener.
Hello, and welcome to the Data Engineering podcast, the show about modern data management. Your host is Tobias Macey, and today I'm interviewing Satish Jayanthi about the practice and promise of building a column-aware data architecture through intentional modeling. So, Satish, for folks who haven't listened to your previous appearance on the show, if you can just start by introducing yourself.
[00:00:30] Unknown:
Yes. Thanks, Tobias, for having me. My name is Satish Jayanthi. I'm a cofounder and CTO at Coalesce.
[00:00:39] Unknown:
And do you remember how you first got started working in data?
[00:00:42] Unknown:
Oh, yeah. Absolutely. I started my career in software engineering and application development. I dabbled in that for a few years and was working for a startup in the year 2000 in Los Angeles. We were selling online courses for large financial firms. There was a lot of boom at that time, as you know; the Internet was taking off and everything was getting very hot. While I was working there, a lot of times people would come to me and say, hey, can you give me this report?
Can I have that report? Can I have this report? I started taking those requests as they came in, building things and handing them over. But at some point, I realized it was overwhelming; I was not able to keep up. That's when I started looking more and more into how this is usually done, and that's where I hit the topic of data warehousing: how to build it right, dimensional modeling, and all of that. So I started like that, accidentally getting into it, and never looked back.
[00:02:15] Unknown:
And so in terms of the overall space of data warehousing and data platforms, obviously, the cloud has been the most impactful change in the ecosystem for probably the last decade, if not longer. In the process of moving some of those data warehousing and data platform workloads, large organizations in particular have taken a bit longer because they have more existing infrastructure, they have bigger risk involved in actually making that migration, and there are more moving pieces to coordinate around. And so a lot of the patterns and practices around how to do data platforms and data warehousing in the cloud have been developed by smaller and faster moving organizations.
And I'm curious what you have seen as the overall impact of the cloud and some of that early adoption on the overall practice of data modeling and the data architectures that manifest in cloud environments as compared to some of the on prem environments that these larger organizations are still managing?
[00:03:22] Unknown:
Yeah. Absolutely. You said it right. Smaller companies moved to the cloud much quicker, obviously, because they didn't perceive the risk; they didn't have that much to begin with. They were just getting started, they didn't have large volumes of data, so they quickly took advantage of the cloud platforms. Compared to that, in larger firms there are so many legacy systems running, and security was a big thing. When cloud started, every CTO, every executive was questioning the security. Like, how secure is it going to be once you put it on the cloud? So it took a while for a lot of companies to even think about how to address that. As far as data modeling goes, when these smaller companies and startup environments moved to the cloud, they moved pretty quickly.
And data modeling was seen as somewhat of a hurdle. In some cases, they may not even have been aware of data modeling because they hadn't done it in the past, so they may not have known it was an important piece. For that reason, people started building really fast, and data modeling took a back seat. For the last few years, it just disappeared from the analytics space. It's coming back now, but that's what happened as these things moved to the cloud.
[00:05:04] Unknown:
And in particular, with the data warehousing and data modeling approach of dimensional modeling that was developed in the nineties, in the era of these large and expensive data warehouse appliances that were typically resource constrained, there were a lot of conflicting priorities around what you needed to be able to answer and when. I'm wondering how you have seen the ongoing conversation around the relative merits of dimensional modeling versus some of the newer paradigms of wide tables, etcetera, and how that manifests in what the actual requirements are around modeling priorities in cloud data warehouse environments?
[00:05:49] Unknown:
Yeah. So if you take the data modeling practice, we should probably look at why people used to model in the first place. What were the benefits that people were getting out of modeling? Generally speaking, there were three stages of modeling, and a lot of people agree with this: conceptual modeling, logical modeling, and physical modeling. The conceptual modeling is primarily where you're saying, hey, I need to understand how the business works, or what the business needs. What is it that they want? What are the requirements?
So in a conceptual model, you start putting in all the business entities, and you start drawing lines between these entities to show the relationships between them. Then you use this conceptual model as a way to gather requirements, as a communication tool to work with the business. It's much easier when you show something visually. You can sit in a room, talk to the business people and say, hey, is this what you're thinking? Is this how the relationships are? That's a thinking exercise, and it's very important for the business and the people who are building the solutions to get on the same page. Then as you move towards the logical model, it's more like the solution. You're saying, okay, this is how I'm going to do it. It's still abstracted away from the physical implementation, but it's a way to draw out the solution.
And then finally, the physical model. Physical models are closer to the implementation, and they could be different depending on the target platform. If it's a cloud platform, you may make some decisions that you would not have made on an on-prem database platform. Because cloud platforms perform well, you can make decisions like, hey, I don't care about the storage; the system can handle the compute, so this is fine. You're making those kinds of decisions as you move towards the physical model. Now coming to the dimensional models.
Dimensional models are mostly for analytics; they are report or analytics oriented models. That makes it much easier for the business to understand a model without getting overwhelmed by something like an operational model. If you have an operational data model, there are a lot of tables and a lot of relationships, and it becomes very hard to even write queries. Whereas if you show a dimensional model, and if you do it right, it becomes much easier not only to start with requirements and understand them, but also, once you build it, for business consumption: when people look at these types of models, they can understand them relatively easily.
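To make that idea concrete, here is a minimal star-schema sketch. The tables and columns are hypothetical examples, not something discussed in the episode, and SQLite stands in for a real warehouse: a fact table of measures surrounded by descriptive dimensions, so analytic queries stay simple.

```python
import sqlite3

# Hypothetical star schema: one fact table of sales measures surrounded by
# descriptive dimension tables, using SQLite as a stand-in warehouse.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (
    customer_key  INTEGER PRIMARY KEY,
    customer_name TEXT,
    region        TEXT
);
CREATE TABLE dim_date (
    date_key      INTEGER PRIMARY KEY,  -- e.g. 20240115
    calendar_date TEXT,
    month         TEXT
);
CREATE TABLE fact_sales (
    customer_key  INTEGER REFERENCES dim_customer (customer_key),
    date_key      INTEGER REFERENCES dim_date (date_key),
    sale_amount   REAL                  -- the measure being analyzed
);
""")

# The analytic query stays simple: join the fact to a dimension and group
# by whatever attribute the business asks about.
rows = conn.execute("""
    SELECT d.region, SUM(f.sale_amount)
    FROM fact_sales f
    JOIN dim_customer d ON d.customer_key = f.customer_key
    GROUP BY d.region
""").fetchall()
```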
[00:08:55] Unknown:
And as far as that dimensional modeling practice, just to expand on that a bit more for folks who aren't familiar, this largely refers to the star and snowflake schemas, versus the Data Vault approach, for how you structure the different tables and relationships within the warehouse, as opposed to what's typically third normal form in transactional databases. The actual exercise of defining these dimensional models and figuring out what the actual attributes and dimensions are can be a very complex and time consuming process. I'm wondering what are some of the ways that teams have worked through that to be able to manage the very rapid pace of requirements, and the changing requirements, against the upfront investment needed to actually build out those models and cement them, and some of the ways teams should be thinking about that balance of, I need to get some answers right now versus this is going to help me in the long run. Yeah. That's a very good point, and
[00:10:02] Unknown:
that's the challenge. Right? People fall into these traps. One trap is, hey, don't do modeling because I can move fast. The other is, do modeling and perfect the model so I get it right. Both are wrong, in my opinion, because you've got to strike the right balance. If you skip the modeling exercise, regardless of whether you're in the cloud or not, then you're not taking the time to think. You're not taking the time to build that foundation correctly. Whereas if you say, okay, I'm going to spend all my time on modeling, then you're ignoring the business. Your end goal is to provide high quality data that can be expanded upon as more requirements come in. As you get more requirements, you need to have a way to scale this.
To do that, you need balance. What I have done in the past is to always take the data and, I mean, you don't have to model everything you don't yet understand. Do that first iteration just to get a feel for it. Do the conceptual model or logical model, whatever that is; don't try to perfect it. Just get the first iteration of it and build something. It doesn't have to be a purely dimensional model yet; it could be just views. Then put it out there as soon as possible, and set the expectation with the business that, hey, this is the first iteration. I want you to take a look at the data first and tell me: I have these questions, and do you see anything wrong?
Then they'll look at the data. It could be dashboards or whatever way you want to give them the data; it could even be Excel they can download it into. They can look at the data quickly and give you that feedback. Then you start thinking, oh, now I understand the relationships. Now I can improve my model and say, hey, this is how this needs to be. And then follow those iterations as you go. It will never be perfect, and that's okay, because everything is changing. Even though it may seem perfect today, it's not going to be as soon as the company goes and buys another company. Then the rules have changed; everything has changed. And that's okay. But understanding that balance is important, and as long as you keep trying to improve it continuously, you'll be successful.
[00:12:26] Unknown:
Another outgrowth of the cloud, cloud warehouses, decreasing storage costs, and the decoupling of compute and storage is the shift in the integration path from ETL, which required a lot of upfront investment. You had to make sure that you got things right because, otherwise, you were going to lose information, or you were going to have the wrong assumptions and do the transformations wrong, and you had no way of going back and doing it right afterwards. Or if you did, it required a lot of extra engineering to give you that fallback. Versus the ELT paradigm that has become prevalent of just loading all of the data and then transforming it, so that you don't have to think about modeling up front. I think that has had the downstream impact of saying, well, if I don't have to do all the modeling on load, I don't really have to do any modeling at all, because if I do it wrong, I can just rewrite it and rebuild everything, and it's fine. I'm curious what you have seen as the relative merits of the upfront investment requirements of ETL versus the very fast moving, iterative capability of ELT, and some of the ways that influences the overall thinking and patterns within a team or an organization as far as that data modeling, data integration, and data analytics workflow?
[00:13:44] Unknown:
Yeah. Absolutely. ELT has definitely changed the game in terms of how you build these things. In the old days, when you had ETL systems, they became the bottleneck. They were very rigid tools, GUI tools mostly. They brought efficiencies, but they were rigid in the sense that when you hit a corner case, you didn't have much control over it. And you were not leveraging the investment that you made in the target platforms, the database platforms. The ELT paradigm has changed that. Now you have the full power of the database platform that you can utilize to do the transformations.
What that means is you can iterate much more rapidly, as opposed to the ETL paradigm. So, yeah, as I said before, I think it's much easier to do those kinds of iterations doing ELT on a platform like, for example, Snowflake or other cloud platforms.
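As a rough illustration of that ELT flow, with hypothetical table and column names and SQLite standing in for a cloud warehouse: the raw data lands untouched, and the transformation is just SQL that can be rewritten and re-run when requirements change.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# "Extract and Load": land the source rows untouched, no upfront modeling.
conn.execute("CREATE TABLE raw_orders (order_ref TEXT, amount_cents TEXT, placed_at TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [("o-1", "1250", "2024-01-15"), ("o-2", "980", "2024-01-16")],
)

# "Transform": push the work down to the platform as plain SQL. Because the
# raw table is preserved, a wrong transform can be rewritten and re-run.
conn.execute("""
CREATE TABLE orders AS
SELECT order_ref                             AS order_id,
       CAST(amount_cents AS INTEGER) / 100.0 AS amount_usd,
       placed_at                             AS order_date
FROM raw_orders
""")
```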
[00:14:54] Unknown:
Another aspect of this question of data modeling, particularly with dimensional models, is that they make the understandability and discoverability of the core domain objects clearer than if you have these one-off models that answer a specific question, where, if you want to do any further exploration, you have to go back to the source and build back up to it. There is this question of data self-service for people in the organization, people in the business suite, who don't necessarily have the capacity or time to dig deep into the data and understand it semantically. I'm curious what you have seen for the teams who fall into that pit of, I just need to build this one model to answer this one specific question, versus investing in that upfront data modeling or dimensional modeling. Who is driving the actual effort of doing the analysis, and how does that manifest in terms of the ways that teams think about their priorities of doing the upfront modeling versus just answering questions really quickly? Classic challenge. Right? I mean, you know, the business units are saying, I know my data. I understand my data.
[00:16:07] Unknown:
Even if you give me a model, I'm going to look at this model for this particular process, build it for this one case, and I'm happy. But the problem with that is there's not just one business process in a company. There are multiple business processes, and these business processes share a lot of data. So you have to have a way to plan this from that higher level: to understand, hey, what are my business processes? What is the shared data? What becomes a dimension? What becomes a fact? And how are these dimensions linked? This is all very important to make sure that you're not building silos. Just because you build a dimensional model doesn't mean that you're doing it right. You need to make sure that you are building with the intention to share data across these business processes and understand them. And there's a term Kimball mentions called conformed dimensions.
That was the idea. Now, we don't want to get into how we do that and what the pros and cons are and all of that. But generally speaking, the idea is, hey, there's a shared dataset. You build that dataset for one process, for one unit, and then make sure that you leverage and reuse that same thing to connect the other datasets. That way you build in this integration.
[00:17:29] Unknown:
And as far as the overall ecosystem of data modeling practices, obviously, star and snowflake schemas and Data Vault have been around possibly the longest, or they've been the longest lived. But I'm wondering if there are any other patterns or data modeling approaches that have been developed in recent years that you have found to be useful or beneficial, or things that are maybe built off of some of the star and snowflake or Data Vault approaches that make it easier to iterate into those dimensional approaches?
[00:18:03] Unknown:
Yeah. So, to be honest, I haven't seen anything big like that since dimensional models came out and Data Vault came out. Of course, there was the enterprise data model from Inmon. These were the major data modeling methodologies. There's time series modeling, and there are all these subsets of modeling that you can talk about, but nothing major in terms of revamping the whole thing. So if you're talking about data warehouses, you have a few options. You might say, okay, I want to use Data Vault. If you're seeing that there's a need to build a Data Vault for your business, then that would be a good approach to capture that raw data and organize it. And then on top of that, you build your star schemas.
You're still building the dimensional models, but you're building them on top of the raw Data Vault models. And there are situations where you may say, we don't need Data Vault; we can just go straight into dimensional models. There are pros and cons in each one of those. But on top of this, there are things that you can do that I would call custom modeling. You can create aggregate tables if you think there's a performance concern, your data is huge, and you want to aggregate it. But make sure you aggregate it after you bring the granular data into your warehouse. Because if you capture the data from the operational systems and aggregate it right away, then if somebody asks a deeper question, like, hey, show me how you got this aggregated number, you don't have a way to drill into it, because now you have to go back to the operational system, which has probably already been updated. So, yeah, I have not seen a big new modeling methodology since these two.
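A small sketch of that ordering, with made-up tables: the granular fact stays in the warehouse as the source of truth, the aggregate is derived from it, and drill-down questions remain answerable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Granular fact stays in the warehouse as the source of truth.
conn.execute("CREATE TABLE fact_sales (product_id TEXT, sale_date TEXT, amount REAL)")
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                 [("p1", "2024-01-01", 10.0), ("p1", "2024-01-02", 15.0)])

# The aggregate is derived, not primary: it can be rebuilt at any time.
conn.execute("""
CREATE TABLE agg_sales_by_product AS
SELECT product_id, SUM(amount) AS total_amount
FROM fact_sales
GROUP BY product_id
""")

# Drill-down stays possible because the grain was preserved.
detail = conn.execute(
    "SELECT * FROM fact_sales WHERE product_id = 'p1'").fetchall()
```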
[00:19:58] Unknown:
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up for free today at dataengineeringpodcast.com/rudderstack. Another element of the conversation today is the question of column awareness in the space of data modeling and data transformation. And I'm curious if you can just start by describing what you mean by that term and some of the ways that that will impact the approach of data modeling or the ways that data transformations and table structures are built up? Yeah. So
[00:20:51] Unknown:
if you look at the data, I think columns are the building blocks, pretty much. That's the lowest grain that you can go to in a dataset. If you look at an Excel sheet, each cell is the data value in a particular column. When we say column aware, it's basically column metadata that we're talking about here. But how we can leverage that, that's the distinction: how we can use it to its maximum potential to do automation. By storing the column metadata, by being column aware and leveraging that from the ground up, whether it's modeling, whether you're generating code, whether you're doing column lineage, whether you're doing impact analysis, whether you're determining the state of your target database platform, you get all of those benefits just by leveraging that column awareness.
And that's what we did at Coalesce. If you look at anything that you're sharing with the business, whether it's a KPI or a metric: a KPI, a key performance indicator, is made out of metrics, and a metric is usually made out of one or more columns, and these columns are coming from tables. So it always comes down to columns. You're applying transformations on columns. You're adding new columns. You're deleting columns. You're working with columns. Data quality is about columns. So it becomes crucial to understand the relationships and the dependencies between these columns. And, of course, columns live in tables, and tables are collections of columns. So what we think is that that's the best way to build an analytics system so you can scale.
Because when you're small, it doesn't matter as much. When you're building a small system, a small database, a small model, it doesn't matter much, because you're thinking, I can write a query, I can get this done pretty fast. Fine. But usually, that's not the intention. Whenever you're building for a business, it may be small at that time, but the whole intention is to grow. They'll go and get more datasets. You're selling more. You're acquiring other companies. Whatever it is, you have to have a way to scale. That's where this comes into play. The automation comes into play, and to do proper automation, you need to be column aware. So many different things to dig into there.
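As a toy model of what column-aware metadata might look like (an illustration of the concept only, not how Coalesce actually stores it): tags such as business key or PII live on the column itself, and a table is just a named collection of columns that tooling can reason about.

```python
from dataclasses import dataclass, field

@dataclass
class Column:
    name: str
    dtype: str
    tags: set[str] = field(default_factory=set)  # e.g. {"business_key", "pii"}

@dataclass
class Table:
    name: str
    columns: list[Column]

# Invented example: the tags live on the column, and the table is just a
# named collection of columns.
customers = Table("dim_customer", [
    Column("customer_id", "INTEGER", {"business_key"}),
    Column("email", "TEXT", {"pii", "scd_type_2"}),
    Column("region", "TEXT", {"scd_type_2"}),
])
```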
[00:23:30] Unknown:
So from the perspective of a table being just a collection of columns, and the columns being the thing that you actually care about, I'm wondering if you can talk to some of the more nuanced aspects of that, where, yes, a table is a collection of columns, but the columns within a table typically all relate to each other in some fashion. So there is some contextual awareness that you need to have to be able to operate properly on that column. What are some of the pitfalls that somebody might fall into if they say, I only really care about the columns, so this is the thing I'm going to track? What are the pieces of metadata that they might drop on the floor by maybe ignoring some of the other contextual and semantic aspects of the table that column is contained within, and the ways that column is propagated to other differently named tables or collections of columns, etcetera?
[00:24:17] Unknown:
Very good question. So when I say tables are a collection of columns, it's true. However, there are dependencies; there are relationships between those columns. There's a primary key. There's a foreign key. There are all these other relationships that are very important, and that's where, I guess, the modeling comes into play. But there is additional metadata we store on these columns. It's not just the name and the data type; you can attach or tag things as foreign keys, as business keys, as change type 2 columns in a dimension.
By doing that, we can leverage those additional metadata tags to do even more automation. We can generate scripts that say, hey, if I know the business key, if I know the change tracking column, if I know the foreign key, I can just generate this code automatically. Same thing with data quality rules. I can take a domain, like the financial domain, and tag these columns with certain attributes. For example, a CUSIP in the financial world, which identifies a security such as a stock. You can do those things and apply automation to data quality rules. So it's not just the columns, but also taking into consideration those relationships between the columns.
What does this column do in a particular table? Is it the business key? Is it part of a composite key of more than one column? All of those are very important, and I want to make sure that when people do modeling, those are some of the key things they start considering.
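To illustrate the kind of automation that tagging enables, here is a hedged sketch; the metadata shape and naming convention are invented, and this is not Coalesce's actual code generation. Given the business-key and change-tracking tags, the change-detection SQL for a type 2 dimension writes itself:

```python
# Invented metadata shape: which columns form the business key and which
# are tracked for changes.
table = {
    "name": "dim_customer",
    "columns": [
        {"name": "customer_id", "tags": {"business_key"}},
        {"name": "email",       "tags": {"scd_type_2"}},
        {"name": "region",      "tags": {"scd_type_2"}},
    ],
}

def generate_change_check(table: dict) -> str:
    """Emit SQL that finds staged rows whose tracked columns changed."""
    keys    = [c["name"] for c in table["columns"] if "business_key" in c["tags"]]
    tracked = [c["name"] for c in table["columns"] if "scd_type_2" in c["tags"]]
    on      = " AND ".join(f"tgt.{k} = src.{k}" for k in keys)
    changed = " OR ".join(f"tgt.{c} <> src.{c}" for c in tracked)
    return (f"SELECT src.* FROM staging_{table['name']} src\n"
            f"JOIN {table['name']} tgt ON {on}\n"
            f"WHERE {changed}")

print(generate_change_check(table))
```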
[00:25:56] Unknown:
And for people who are building their transformations and doing their modeling, what are some of the elements of a given tool or workflow that are necessary for column awareness to be an effective component of that overall effort? So, what are the capabilities in a tool that need to be present for it to be column aware? What are some of the ways that column awareness will be incorporated into the transformation workflow of saying, okay, I've got this information in this column, and I need to combine it with this column from this other table to merge it into this new attribute or this new metric, which is then being propagated into this new table, which is a collection of columns? And some of the ways that lineage at the column level needs to be present to inform the transformations and the modeling efforts, etcetera?
[00:26:47] Unknown:
Yeah. So being column aware, as I said before, is not just storing column metadata or creating a lineage diagram. Those are just the byproducts of being column aware. You have to leverage it from the ground up in everything that you do in the tool. Whether you're generating code, generating impact analysis, generating lineage, or even when you're modeling: you're selecting business keys, you're drawing foreign key relationships, you're applying data quality rules. All of these should be part of the tool in a way that you can easily interact with the metadata and enhance and enrich it.
So you have these columns, and as a user, you understand what these columns are, what they are supposed to do, what their function is, and what they mean. So you can enrich that metadata repository by adding additional metadata. And as you do that, the tool takes that column metadata plus the enriched metadata that the user input and generates these other artifacts that I was talking about, like scripts or models or
[00:28:09] Unknown:
lineage diagrams and so on. And so from that, another question is, how much of the column awareness is intended for the actual consumer of the tool, somebody who's building the transformations, versus the column awareness being a function of the tool itself, something that is used to drive that automation without the need for as much direct human involvement?
[00:28:36] Unknown:
Yeah. Very good question. So we don't want to burden the user by forcing them to use the tool in a certain way just because we are column aware. It should be an easy to use tool, whether you're a GUI person or you're writing some code or a template. It should be very easy to use, and you shouldn't have to think about the column awareness; it's just a given. But behind the scenes, because we are column aware, we can do a lot of automation for you. So it's mostly how the tool functions, but there are some things that manifest in the UI. For example, because we know every column, we can show you a list of columns so you can select which columns are business keys. That just becomes easy for you. Because we are column aware, we can show, hey, here are the columns.
Pick the business key, one or two columns if it's a composite key, and also pick which columns are type 2 columns, for example. And the user can just go and pick, rather than remembering and typing and putting it in a YAML file or whatever. So it's a combination, but mostly it's a function of the tool behind the scenes. In terms of the overall tooling ecosystem,
[00:29:50] Unknown:
the default mode for a while, at least, has been to operate at the table level, or even just at the task level, where there is potentially no awareness at all of whether there are tables or columns or files being manipulated. And as we have gone through successive generations of transformation tools and underlying compute platforms, we have gotten to a more granular and nuanced level of the tools having awareness of the underlying objects or resources that are being operated on. I'm wondering what are some of the complexities of managing this column awareness that have taken so long for us to get here, where maybe the table was the previous level of understanding within the tooling, and some of the ways that poses complexity in the tool development to be able to expose these capabilities and this semantic understanding of what is being done? So we,
[00:30:51] Unknown:
we extract this metadata upon discovery. As soon as you connect to something, we try to extract this metadata and save it. And then you're basing off of that metadata: you're going to use that as a starting point and build additional objects and additional workflows. Down the road, there could be things where we may not be able to extract this metadata out of some source, I don't know, like a video message. Maybe we don't need to. Or maybe it becomes possible because AI is coming on. I don't want to take this discussion in that direction, but I'm just saying things are changing. We understand.
But regardless, the importance of having this column awareness is not going to go away. In the near future, I can't see us saying, hey, it's too complex, so let's not do column level; just go back to the table level. That's like summarizing the data: once you summarize, you lose the granularity. That's exactly what would happen. Is it difficult for the tools to build this? I think we should do as much as we can as vendors. One key thing, though, that I've seen in how this was done in the past versus how we're doing it now is that column metadata was an afterthought.
And lineage diagrams were an afterthought. It was like, okay, after you do everything, whatever you're doing, then we're going to run something and scan it. Well, that's not how we think it should be done. We think the column awareness should be the building block for everything. It should be the starting point for everything, and that's the fundamental difference. Yeah, it's a challenging problem, but we have taken it upon ourselves
[00:32:48] Unknown:
to solve this. Yeah. That was another aspect that I wanted to discuss: this question of column awareness and lineage tracking within the transformation tool, within the workflow of actually manipulating the data, versus bolting it on or discovering it after the fact with some of these metadata platforms and the lineage analysis from the SQL logs in the data warehouse. And some of the ways that pushing that information into the transformation and data manipulation layer impacts the effectiveness and utility of these metadata platforms, whether it serves as a means of improving them or just serves to obviate them entirely.
[00:33:29] Unknown:
So you're basically asking whether the way the current tools extract this after you build something is useful?
[00:33:40] Unknown:
Yeah. So, basically, given the fact that metadata platforms have become popular because of the need to expose this lineage information, etcetera, does having that lineage information and column level awareness within the transformation tool serve to improve those metadata platforms, or does it serve to make them obsolete?
[00:33:59] Unknown:
Yeah. I think it's the former. It's going to help improve them, definitely. The reason is that when you do it as an exercise at the end, using a metadata tool to extract everything, they do a pretty solid job; however, they won't be 100% accurate. They're going to have some accuracy issues. But when you do it from the ground up like us, because we are using this as the basis to generate the code, our lineage, for example, is going to be accurate, since we're talking about lineage. And there's more than just lineage; I want to make sure it's clear that column awareness is not lineage, lineage is just a byproduct. But to answer your question, we would actually improve the metadata extraction that happens with these tools, so it's going to be more accurate. And another point here is, of course, the whole point of this is that it improves the productivity of building this in the first place, and the metadata tools would get more accurate metadata.
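As a toy illustration of the difference, with made-up column names and no claim about any product's implementation: when each derived column registers its inputs at build time, lineage and impact analysis become a simple graph walk instead of an after-the-fact parsing job.

```python
from collections import defaultdict

# Each derived column registers its inputs at build time (names invented).
lineage: dict[str, set[str]] = defaultdict(set)

def derive(target: str, *sources: str) -> None:
    """Record that `target` is computed from `sources`."""
    lineage[target].update(sources)

derive("orders.amount_usd", "raw_orders.amount_cents")
derive("revenue_report.total", "orders.amount_usd", "orders.order_date")

def upstream(column: str) -> set[str]:
    """Every column that feeds `column`, directly or transitively."""
    seen: set[str] = set()
    stack = [column]
    while stack:
        for src in lineage[stack.pop()]:
            if src not in seen:
                seen.add(src)
                stack.append(src)
    return seen

# Impact analysis is a graph walk, not a reverse-engineering job:
print(upstream("revenue_report.total"))
# e.g. {'raw_orders.amount_cents', 'orders.amount_usd', 'orders.order_date'}
```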
[00:35:05] Unknown:
Yeah. It's interesting too because of the recent efforts of some of the metadata platforms to use their contextual awareness to feed back into the transformation process and some of the data platform automation, figuring out when to schedule your different transformations so that more of them can happen within a given time frame for cost management, etcetera. So definitely interesting to see how that plays out. Another avenue that I like to explore in this space of column awareness in the transformation layer is some of the ways that it manifests outside of the context of the data warehouse. Within the warehouse, it's very clear what the benefits are, because everything happens within that bounded context: everything is a column, and it gets moved into another column, so there is this similarity in terms of the operations being performed. But as you start to leave the context of the warehouse into something like a machine learning workflow or a streaming system, or as the data leaves the warehouse and feeds back into maybe a SaaS platform through things such as reverse ETL, how does the column awareness help in those external contexts?
[00:36:14] Unknown:
Yeah. So I don't have a particular use case in mind, but I'm pretty sure that, just like how we are saying we are going to improve the metadata extraction process to improve the accuracy, the same thing can be applied to other outside processes. Because of this level of information and the quality of the output that we are giving, it basically helps improve any external process that depends on this. Whether it's reverse ETL or some visualization tool pulling data, it's all interconnected.
I just don't have a use case off the top of my mind, but I definitely see that it can only benefit these systems.
[00:36:58] Unknown:
And in terms of your work at Coalesce and helping to think through the ways that column awareness manifests within your tool, and just as a general practice of column awareness being this core capability of the transformation layer, what are some of the most interesting or innovative or unexpected ways that you have seen it used, either in the context of data modeling or in the overall impact that it has had on the engineering workflow to be able to build out these different analytical datasets?
[00:37:29] Unknown:
Yeah. So when we came up with this and our clients started using our product, what I have seen is that they started building what we call nodes in the platform. A node type is basically a pattern, and behind the pattern, you can write a template. The template is actually leveraging the metadata behind the scenes, this column level metadata. What I'm super excited about is these customers coming up with innovative templates. They are coming up with their own ideas of how they can leverage this column level metadata and start creating templates to meet their needs. Every time you write a new template, you're basically creating a new node type that can be leveraged by anybody.
If they share it in the future with other clients, I don't have to rebuild anything; I just use the node type. But the way that they're doing it is very exciting: how they are able to come up with innovative solutions to meet their needs writing these node types.
[00:38:36] Unknown:
And in your own work of building Coalesce and building this column aware tooling, what are some of the most interesting or unexpected or challenging lessons you've learned in the process?
[00:38:47] Unknown:
It was definitely a journey for us. We went back and forth on a lot of deep technical architectural things. What have I learned? Let's see. I think the number one thing is the potential. I did not realize how much we can do with this, but the more we do, the more we think we can do with it. For example, when we came out with it, we had some things in mind. Okay, we'll get lineage. Lineage is going to be a great thing. But then we said, not only that, we can start building these custom node types. That's another thing. And now we're saying, how can we apply data quality, for example, with the same column level metadata?
And now we're entering this AI world, and I can already see how we can do some of that automation using AI with the column level metadata. So my point is, as we do more, we see that this architecture that we chose still has a lot of potential for future use cases.
[00:40:02] Unknown:
And for people who are building out their data transformation workflows and deciding on how best to approach their modeling, what are some of the cases where a column aware approach is the wrong choice? I think there's no such thing, in my opinion, as
[00:40:19] Unknown:
column awareness being incorrect, especially because we're not asking them to spend a lot of time on this. Of course, if you have a tool that is column aware, they don't have to do it. But if they have to build something from the ground up in their environment and make it column aware, then it's a different story, because then there might be situations where you might say, I don't need the column awareness for this. It's possible; I can see some use cases where it might be overkill for them. But if you adopt a tool that already has it, then it's not a big deal, because you're only leveraging it; you're not trying to build everything from the ground up. And for teams who are
[00:41:10] Unknown:
looking to improve their data modeling approaches and to simplify their overall workflow, what are some of the resources that you recommend they look to for learning more about data modeling principles and some of the ways that column awareness can be added to or integrated into their overall workflow or tool chains, etcetera?
[00:41:32] Unknown:
Yeah. You know, I've been in this space for a long time, so my recommendations would be the classic books, The Data Warehouse Toolkit from Ralph Kimball and things like that. I don't expect everybody to go through the whole Ralph Kimball book, which is, first of all, a very lengthy book, and some people might feel like some of the concepts are outdated. But as I said before, just focus on the 101 of modeling first. Start there. I can't name one resource; you can find so much on the Internet. But make sure that you understand the why: what's the benefit of modeling?
Why do we need to model? And don't fall into the trap of perfecting the model. That's the other piece of my advice. Always keep iterating, because it's going to change regardless; there's no perfect model. And,
[00:42:25] Unknown:
I would say there are a lot of resources, but start small and play with them. Alright. Are there any other aspects of the overall space of data modeling and data architecture in the context of cloud warehousing in particular, and the capabilities and impact of column aware tooling, that we didn't discuss yet that you'd like to cover before we close out the show?
[00:42:48] Unknown:
One thing that I didn't talk about with the column awareness is the documentation that you get out of this, the governance aspect. We have seen a lot of times where people take pride in coding, which is great. But you code all these things, and as your environments get bigger and bigger, you have a lot more code, and then the person leaves. And there you go: it becomes very difficult for anybody to go in and understand the code. This is a classic thing. We all know this. But with column awareness, another byproduct is the documentation that you can generate out of it.
It captures how this was built, and you can also add context to show why it was built the way it was built. By just pushing a button, it can read this column metadata and generate this documentation for you. And it's always going to be accurate. Whether you agree with it or not, that's the truth: this is how it was built, and this is what we're showing. So that's another big value that you get from that.
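A minimal sketch of that byproduct, with invented column names: because names, types, and tags already live in the metadata, a data dictionary can be rendered on demand and never drifts from what was actually built.

```python
# Invented column metadata; in a column-aware tool this would already exist.
columns = [
    {"name": "customer_id", "dtype": "INTEGER", "tags": ["business key"]},
    {"name": "email",       "dtype": "TEXT",    "tags": ["PII", "SCD type 2"]},
]

def to_markdown(table_name: str, columns: list) -> str:
    """Render a simple Markdown data dictionary from column metadata."""
    lines = [f"## {table_name}", "",
             "| Column | Type | Tags |",
             "| --- | --- | --- |"]
    for c in columns:
        lines.append(f"| {c['name']} | {c['dtype']} | {', '.join(c['tags'])} |")
    return "\n".join(lines)

print(to_markdown("dim_customer", columns))
```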
[00:43:55] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you and your team are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. Yeah. I think,
[00:44:12] Unknown:
a tool or a platform, I would say. We like to call ourselves a platform because we think big: a platform where business and IT, power business users and IT architects, data architects, data engineers, can all work together in one single place to build their data foundation and data assets. And that's the gap right now. Either the tools skew too far toward the business needs, so only the business users are using those tools, or it's the other way around: IT oriented tools where the business doesn't have any visibility or the training to understand how these tools work. What Coalesce, for example, does is bring these two types of personas, and many other personas, into the same platform. So everything is being built there,
[00:45:08] Unknown:
and it's all governed in one place. Alright. Well, thank you very much for taking the time today to join me and share your thoughts on the benefits and challenges of data modeling and some of the ways that column aware tooling can help with that overall effort. I definitely appreciate the time and energy that you and your team are putting into making that available. So, thank you for that. I hope you enjoy the rest of your day. Thanks, Tobias. Thanks
[00:45:31] Unknown:
for having me.
[00:45:38] Unknown:
Thank you for listening. Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Guest Introduction
Satish Jayanthi's Journey into Data
Impact of Cloud on Data Warehousing
Stages and Benefits of Data Modeling
Challenges in Dimensional Modeling
ETL vs ELT Paradigms
Data Self-Service and Modeling Priorities
Evolution of Data Modeling Approaches
Column Awareness in Data Modeling
Tooling and Workflow for Column Awareness
Complexities of Column Awareness
Metadata Platforms and Column Awareness
Column Awareness Beyond Data Warehousing
Innovative Uses of Column Awareness
Documentation and Governance with Column Awareness
Biggest Gap in Data Management Tooling