Summary
Architectural decisions are all based on constraints and a desire to optimize for different outcomes. In data systems, one of the core architectural exercises is data modeling, which can have significant impacts on what is and is not possible for downstream use cases. Incorporating column-level lineage into the data modeling process encourages a more robust and well-informed design. In this episode Satish Jayanthi explores the benefits of incorporating column-aware tooling in the data modeling process.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enables you to automatically send data to hundreds of downstream tools. Sign up for free at dataengineeringpodcast.com/rudderstack
- Your host is Tobias Macey and today I'm interviewing Satish Jayanthi about the practice and promise of building a column-aware data architecture through intentional modeling
Interview
- Introduction
- How did you get involved in the area of data management?
- How has the move to the cloud for data warehousing/data platforms influenced the practice of data modeling?
- There are ongoing conversations about the continued merits of dimensional modeling techniques in modern warehouses. What are the modeling practices that you have found to be most useful in large and complex data environments?
- Can you describe what you mean by the term column-aware in the context of data modeling/data architecture?
- What are the capabilities that need to be built into a tool for it to be effectively column-aware?
- What are some of the ways that tools like dbt miss the mark in managing large/complex transformation workloads?
- Column-awareness is obviously critical in the context of the warehouse. What are some of the ways that that information can be fed into other contexts? (e.g. ML, reverse ETL, etc.)
- What is the importance of embedding column-level lineage awareness into transformation tool vs. layering on top w/ dedicated lineage/metadata tooling?
- What are the most interesting, innovative, or unexpected ways that you have seen column-aware data modeling used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on building column-aware tooling?
- When is column-aware modeling the wrong choice?
- What are some additional resources that you recommend for individuals/teams who want to learn more about data modeling/column aware principles?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- RudderStack: ![RudderStack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png) RudderStack provides all your customer data pipelines in one platform. You can collect, transform, and route data across your entire stack with its event streaming, ETL, and reverse ETL pipelines. RudderStack’s warehouse-first approach means it does not store sensitive information, and it allows you to leverage your existing data warehouse/data lake infrastructure to build a single source of truth for every team. RudderStack also supports real-time use cases. You can implement RudderStack SDKs once, then automatically send events to your warehouse and 150+ business tools, and you’ll never have to worry about API changes again. Visit [dataengineeringpodcast.com/rudderstack](https://www.dataengineeringpodcast.com/rudderstack) to sign up for free today, and snag a free T-Shirt just for being a Data Engineering Podcast listener.
Hello, and welcome to the Data Engineering podcast, the show about modern data management. Your host is Tobias Macey, and today I'm interviewing Satish Jayanthi about the practice and promise of building a column-aware data architecture through intentional modeling. So, Satish, for folks who haven't listened to your previous appearance on the show, if you can just start by introducing yourself.
[00:00:30] Unknown:
Yes. Thanks, Tobias, for having me. My name is Satish Jayanthi. I'm a cofounder and CTO at Coalesce.
[00:00:39] Unknown:
And do you remember how you first got started working in data?
[00:00:42] Unknown:
Oh, yeah. Absolutely. I started my career in software engineering and application development. I dabbled in that for a few years and was working for a startup in the year 2000 in Los Angeles. We were selling online courses for large financial firms. There was a lot of boom at that time, as you know; the Internet was taking off and everything was getting very hot. While I was working there, a lot of times people would come to me and say, hey, can you give me this report?
Can I have that report? Can I have this report? I started taking those requests as they came in, building things and handing them over. But at some point, I realized it was overwhelming; I was not able to keep up. That's when I started looking more and more into how this is usually done, and that's where I hit the topic of data warehousing: how to build it right, dimensional modeling, and all of that. So I started like that, accidentally getting into it, and never looked back.
[00:02:15] Unknown:
And so in terms of the overall space of data warehousing and data platforms, obviously, the cloud has been the most impactful change in the ecosystem for probably the last decade, if not longer. In the process of moving some of those data warehousing and data platform workloads, large organizations in particular have taken a bit longer because they have more existing infrastructure, they have bigger risk involved in actually making that migration, and there are more moving pieces to coordinate around. And so a lot of the patterns and practices around how to do data platforms and data warehousing in the cloud have been developed by smaller and faster moving organizations.
And I'm curious what you have seen as the overall impact of the cloud and some of that early adoption on the overall practice of data modeling and the data architectures that manifest in cloud environments as compared to some of the on prem environments that these larger organizations are still managing?
[00:03:22] Unknown:
Yeah. Absolutely. You said it right. Smaller companies moved to the cloud much quicker, obviously, because they didn't perceive the risk; they didn't have that much to begin with. They were just getting started, they didn't have large volumes of data, so they quickly took advantage of the cloud platforms. Compared to that, in larger firms there are so many legacy systems running, and security was a big thing. When cloud started, every CTO, every executive was questioning the security. Like, how secure is it going to be once you put it on the cloud? So it took a while for a lot of companies to even think about how to address that. As far as data modeling goes, when these smaller companies and startup environments moved to the cloud, they moved pretty quickly.
And data modeling was seen as somewhat of a hurdle. In some cases, they may not even have been aware of data modeling because they hadn't done it in the past, so they may not have known it was an important piece. For that reason, people started building really fast, and data modeling took a back seat. For the last few years, it just disappeared from the analytics space. It's coming back now, but that's what happened as these things moved to the cloud.
[00:05:04] Unknown:
And in particular, with the data warehousing and data modeling approach of dimensional modeling that was developed in the nineties, in the era of these large and expensive data warehouse appliances that were typically resource constrained, there were a lot of conflicting priorities around what you needed to be able to answer and when. I'm wondering how you have seen the ongoing conversation around the relative merits of dimensional modeling versus some of the newer paradigms of wide tables, etcetera, and how that manifests in what the actual requirements are around modeling priorities in cloud data warehouse environments?
[00:05:49] Unknown:
Yeah. So if you take the data modeling practice, we should probably look at why people used to model in the first place. What were the benefits that people were getting out of modeling? Generally speaking, there were three stages of modeling, and a lot of people agree with this: conceptual modeling, logical modeling, and physical modeling. The conceptual modeling is primarily where you're saying, hey, I need to understand how the business works, or what the business needs. What is it that they want? What are the requirements?
So in a conceptual model, you start putting in all the business entities, and you start drawing lines between these entities to show the relationships between them. Then you use this conceptual model as a way to gather requirements, as a communication tool to work with the business. It's much easier when you show something visually. You can sit in a room, talk to the business people and say, hey, is this what you're thinking? Is this how the relationships are? That's a thinking exercise, and it's very important for the business and the people who are building the solutions to get on the same page. Then as you move towards the logical model, it's more like the solution. You're saying, okay, this is how I'm going to do it. It's still abstracted away from the physical implementation, but it's a way to draw out the solution.
And then finally, the physical model. Physical models are closer to the implementation, and they could be different depending on the target platform. If it's a cloud platform, you may make some decisions that you would not have made on an on-prem database platform. Because cloud platforms perform well, you can make decisions like, hey, I don't care about the storage; the system can handle the compute, so this is fine. You're making those kinds of decisions as you move towards the physical model. Now coming to the dimensional models.
Dimensional models are mostly for analytics; they are report or analytics oriented models. That makes it much easier for the business to understand a model without getting overwhelmed by something like an operational model. If you have an operational data model, there are a lot of tables and a lot of relationships, and it becomes very hard to even write queries. Whereas if you show a dimensional model, and if you do it right, it becomes much easier not only to start with requirements and understand them, but also, once you build it, for business consumption: when people look at these types of models, they can understand them relatively easily.
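To make that idea concrete, here is a minimal star-schema sketch. The tables and columns are hypothetical examples, not something discussed in the episode, and SQLite stands in for a real warehouse: a fact table of measures surrounded by descriptive dimensions, so analytic queries stay simple.

```python
import sqlite3

# Hypothetical star schema: one fact table of sales measures surrounded by
# descriptive dimension tables, using SQLite as a stand-in warehouse.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (
    customer_key  INTEGER PRIMARY KEY,
    customer_name TEXT,
    region        TEXT
);
CREATE TABLE dim_date (
    date_key      INTEGER PRIMARY KEY,  -- e.g. 20240115
    calendar_date TEXT,
    month         TEXT
);
CREATE TABLE fact_sales (
    customer_key  INTEGER REFERENCES dim_customer (customer_key),
    date_key      INTEGER REFERENCES dim_date (date_key),
    sale_amount   REAL                  -- the measure being analyzed
);
""")

# The analytic query stays simple: join the fact to a dimension and group
# by whatever attribute the business asks about.
rows = conn.execute("""
    SELECT d.region, SUM(f.sale_amount)
    FROM fact_sales f
    JOIN dim_customer d ON d.customer_key = f.customer_key
    GROUP BY d.region
""").fetchall()
```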
[00:08:55] Unknown:
And as far as that dimensional modeling practice, just to expand on that a bit more for folks who aren't familiar, this largely refers to the star and snowflake schemas, versus the Data Vault approach, for how you structure the different tables and relationships within the warehouse, as opposed to what's typically third normal form in transactional databases. The actual exercise of defining these dimensional models and figuring out what the actual attributes and dimensions are can be a very complex and time consuming process. I'm wondering what are some of the ways that teams have worked through that to be able to manage the very rapid pace of requirements, and the changing requirements, against the upfront investment needed to actually build out those models and cement them, and some of the ways teams should be thinking about that balance of, I need to get some answers right now versus this is going to help me in the long run. Yeah. That's a very good point, and
[00:10:02] Unknown:
that's the challenge. Right? People fall into these traps. One trap is, hey, don't do modeling because I can move fast. The other is, do modeling and perfect the model so I get it right. Both are wrong, in my opinion, because you've got to strike the right balance. If you skip the modeling exercise, regardless of whether you're in the cloud or not, then you're not taking the time to think. You're not taking the time to build that foundation correctly. Whereas if you say, okay, I'm going to spend all my time on modeling, then you're ignoring the business. Your end goal is to provide high quality data that can be expanded upon as more requirements come in. As you get more requirements, you need to have a way to scale this.
To do that, you need balance. What I have done in the past is to always take the data and, I mean, you don't have to model everything you don't yet understand. Do that first iteration just to get a feel for it. Do the conceptual model or logical model, whatever that is; don't try to perfect it. Just get the first iteration of it and build something. It doesn't have to be a purely dimensional model yet; it could be just views. Then put it out there as soon as possible, and set the expectation with the business that, hey, this is the first iteration. I want you to take a look at the data first and tell me: I have these questions, and do you see anything wrong?
Then they'll look at the data. It could be dashboards or whatever way you want to give them the data; it could even be Excel they can download it into. They can look at the data quickly and give you that feedback. Then you start thinking, oh, now I understand the relationships. Now I can improve my model and say, hey, this is how this needs to be. And then follow those iterations as you go. It will never be perfect, and that's okay, because everything is changing. Even though it may seem perfect today, it's not going to be as soon as the company goes and buys another company. Then the rules have changed; everything has changed. And that's okay. But understanding that balance is important, and as long as you keep trying to improve it continuously, you'll be successful.
[00:12:26] Unknown:
Another outgrowth of the cloud, cloud warehouses, decreasing storage costs, and the decoupling of compute and storage is the shift in the integration path from ETL, which required a lot of upfront investment. You had to make sure that you got things right because, otherwise, you were going to lose information, or you were going to have the wrong assumptions and do the transformations wrong, and you had no way of going back and doing it right afterwards. Or if you did, it required a lot of extra engineering to give you that fallback. Versus the ELT paradigm that has become prevalent of just loading all of the data and then transforming it, so that you don't have to think about modeling up front. I think that has had the downstream impact of saying, well, if I don't have to do all the modeling on load, I don't really have to do any modeling at all, because if I do it wrong, I can just rewrite it and rebuild everything, and it's fine. I'm curious what you have seen as the relative merits of the upfront investment requirements of ETL versus the very fast moving, iterative capability of ELT, and some of the ways that influences the overall thinking and patterns within a team or an organization as far as that data modeling, data integration, and data analytics workflow?
[00:13:44] Unknown:
Yeah. Absolutely. ELT has definitely changed the game in terms of how you build these things. In the old days, when you had ETL systems, they became the bottleneck. They were very rigid tools, GUI tools mostly. They brought efficiencies, but they were rigid in the sense that when you hit a corner case, you didn't have much control over it. And you were not leveraging the investment that you made in the target platforms, the database platforms. The ELT paradigm has changed that. Now you have the full power of the database platform that you can utilize to do the transformations.
What that means is you can iterate much more rapidly, as opposed to the ETL paradigm. So, yeah, as I said before, I think it's much easier to do those kinds of iterations doing ELT on a platform like, for example, Snowflake or other cloud platforms.
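As a rough illustration of that ELT flow, with hypothetical table and column names and SQLite standing in for a cloud warehouse: the raw data lands untouched, and the transformation is just SQL that can be rewritten and re-run when requirements change.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# "Extract and Load": land the source rows untouched, no upfront modeling.
conn.execute("CREATE TABLE raw_orders (order_ref TEXT, amount_cents TEXT, placed_at TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [("o-1", "1250", "2024-01-15"), ("o-2", "980", "2024-01-16")],
)

# "Transform": push the work down to the platform as plain SQL. Because the
# raw table is preserved, a wrong transform can be rewritten and re-run.
conn.execute("""
CREATE TABLE orders AS
SELECT order_ref                             AS order_id,
       CAST(amount_cents AS INTEGER) / 100.0 AS amount_usd,
       placed_at                             AS order_date
FROM raw_orders
""")
```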
[00:14:54] Unknown:
Another aspect of this question of data modeling, particularly with dimensional models, is that they make the understandability and discoverability of the core domain objects clearer than if you have these one-off models that answer a specific question, where, if you want to do any further exploration, you have to go back to the source and build back up to it. There is this question of data self-service for people in the organization, people in the business suite, who don't necessarily have the capacity or time to dig deep into the data and understand it semantically. I'm curious what you have seen for the teams who fall into that pit of, I just need to build this one model to answer this one specific question, versus investing in that upfront data modeling or dimensional modeling. Who is driving the actual effort of doing the analysis, and how does that manifest in terms of the ways that teams think about their priorities of doing the upfront modeling versus just answering questions really quickly? Classic challenge. Right? I mean, you know, the business units are saying, I know my data. I understand my data.
[00:16:07] Unknown:
Even if you give me a model, I'm going to look at this model for this particular process, build it for this one case, and I'm happy. But the problem with that is there's not just one business process in a company. There are multiple business processes, and these business processes share a lot of data. So you have to have a way to plan this from that higher level: to understand, hey, what are my business processes? What is the shared data? What becomes a dimension? What becomes a fact? And how are these dimensions linked? This is all very important to make sure that you're not building silos. Just because you build a dimensional model doesn't mean that you're doing it right. You need to make sure that you are building with the intention to share data across these business processes and understand them. And there's a term Kimball mentions called conformed dimensions.
That was the idea. Now, we don't want to get into how we do that and what the pros and cons are and all of that. But generally speaking, the idea is, hey, there's a shared dataset. You build that dataset for one process, for one unit, and then make sure that you leverage and reuse that same thing to connect the other datasets. That way you build in this integration.
[00:17:29] Unknown:
And as far as the overall ecosystem of data modeling practices, obviously, star and snowflake schemas and Data Vault have been around possibly the longest, or they've been the longest lived. But I'm wondering if there are any other patterns or data modeling approaches that have been developed in recent years that you have found to be useful or beneficial, or things that are maybe built off of some of the star and snowflake or Data Vault approaches that make it easier to iterate into those dimensional approaches?
[00:18:03] Unknown:
Yeah. So, to be honest, I haven't seen anything big like that since dimensional models came out and Data Vault came out. Of course, there was the enterprise data model from Inmon. These were the major data modeling methodologies. There's time series modeling, and there are all these subsets of modeling that you can talk about, but nothing major in terms of revamping the whole thing. So if you're talking about data warehouses, you have a few options. You might say, okay, I want to use Data Vault. If you're seeing that there's a need to build a Data Vault for your business, then that would be a good approach to capture that raw data and organize it. And then on top of that, you build your star schemas.
You're still building the dimensional models, but you're building them on top of the raw Data Vault models. And there are situations where you may say, we don't need Data Vault; we can just go straight into dimensional models. There are pros and cons in each one of those. But on top of this, there are things that you can do that I would call custom modeling. You can create aggregate tables if you think there's a performance concern, your data is huge, and you want to aggregate it. But make sure you aggregate it after you bring the granular data into your warehouse. Because if you capture the data from the operational systems and aggregate it right away, then if somebody asks a deeper question, like, hey, show me how you got this aggregated number, you don't have a way to drill into it, because now you have to go back to the operational system, which has probably already been updated. So, yeah, I have not seen a big new modeling methodology since these two.
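A small sketch of that ordering, with made-up tables: the granular fact stays in the warehouse as the source of truth, the aggregate is derived from it, and drill-down questions remain answerable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Granular fact stays in the warehouse as the source of truth.
conn.execute("CREATE TABLE fact_sales (product_id TEXT, sale_date TEXT, amount REAL)")
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                 [("p1", "2024-01-01", 10.0), ("p1", "2024-01-02", 15.0)])

# The aggregate is derived, not primary: it can be rebuilt at any time.
conn.execute("""
CREATE TABLE agg_sales_by_product AS
SELECT product_id, SUM(amount) AS total_amount
FROM fact_sales
GROUP BY product_id
""")

# Drill-down stays possible because the grain was preserved.
detail = conn.execute(
    "SELECT * FROM fact_sales WHERE product_id = 'p1'").fetchall()
```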
[00:19:58] Unknown:
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up for free today at dataengineeringpodcast.com/rudderstack. Another element of the conversation today is the question of column awareness in the space of data modeling and data transformation. And I'm curious if you can just start by describing what you mean by that term and some of the ways that that will impact the approach of data modeling or the ways that data transformations and table structures are built up? Yeah. So
[00:20:51] Unknown:
if you look at the data, I think columns are the building blocks, pretty much. That's the lowest grain that you can go to in a dataset. If you look at an Excel sheet, each cell is the data value in a particular column. When we say column aware, it's basically column metadata that we're talking about here. But how we can leverage that, that's the distinction: how we can use it to its maximum potential to do automation. By storing the column metadata, by being column aware and leveraging that from the ground up, whether it's modeling, whether you're generating code, whether you're doing column lineage, whether you're doing impact analysis, whether you're determining the state of your target database platform, you get all of those benefits just by leveraging that column awareness.
And that's what we did at Coalesce. If you look at anything that you're sharing with the business, whether it's a KPI or a metric: a KPI, a key performance indicator, is made out of metrics, and a metric is usually made out of one or more columns, and these columns are coming from tables. So it always comes down to columns. You're applying transformations on columns. You're adding new columns. You're deleting columns. You're working with columns. Data quality is about columns. So it becomes crucial to understand the relationships and the dependencies between these columns. And, of course, columns live in tables, and tables are collections of columns. So what we think is that that's the best way to build an analytics system so you can scale.
Because when you're small, it doesn't matter as much. When you're building a small system, a small database, a small model, it doesn't matter much, because you're thinking, I can write a query, I can get this done pretty fast. Fine. But usually, that's not the intention. Whenever you're building for a business, it may be small at that time, but the whole intention is to grow. They'll go and get more datasets. You're selling more. You're acquiring other companies. Whatever it is, you have to have a way to scale. That's where this comes into play. The automation comes into play, and to do proper automation, you need to be column aware. So many different things to dig into there.
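As a toy model of what column-aware metadata might look like (an illustration of the concept only, not how Coalesce actually stores it): tags such as business key or PII live on the column itself, and a table is just a named collection of columns that tooling can reason about.

```python
from dataclasses import dataclass, field

@dataclass
class Column:
    name: str
    dtype: str
    tags: set[str] = field(default_factory=set)  # e.g. {"business_key", "pii"}

@dataclass
class Table:
    name: str
    columns: list[Column]

# Invented example: the tags live on the column, and the table is just a
# named collection of columns.
customers = Table("dim_customer", [
    Column("customer_id", "INTEGER", {"business_key"}),
    Column("email", "TEXT", {"pii", "scd_type_2"}),
    Column("region", "TEXT", {"scd_type_2"}),
])
```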
[00:23:30] Unknown:
So from the perspective of a table being just a collection of columns, and the columns being the thing that you actually care about, I'm wondering if you can talk to some of the more nuanced aspects of that, where, yes, a table is a collection of columns, but the columns within a table typically all relate to each other in some fashion. So there is some contextual awareness that you need to have to be able to operate properly on that column. What are some of the pitfalls that somebody might fall into if they say, I only really care about the columns, so this is the thing I'm going to track? What are the pieces of metadata that they might drop on the floor by maybe ignoring some of the other contextual and semantic aspects of the table that column is contained within, and the ways that column is propagated to other differently named tables or collections of columns, etcetera?
[00:24:17] Unknown:
Very good question. So when I say tables are a collection of columns, it's true. However, there are dependencies; there are relationships between those columns. There's a primary key. There's a foreign key. There are all these other relationships that are very important, and that's where, I guess, the modeling comes into play. But there is additional metadata we store on these columns. It's not just the name and the data type; you can attach or tag things as foreign keys, as business keys, as change type 2 columns in a dimension.
By doing that, we can leverage those additional metadata tags to do even more automation. We can generate scripts that say, hey, if I know the business key, if I know the change tracking column, if I know the foreign key, I can just generate this code automatically. Same thing with data quality rules. I can take a domain, like the financial domain, and tag these columns with certain attributes. For example, a CUSIP in the financial world, which identifies a security such as a stock. You can do those things and apply automation to data quality rules. So it's not just the columns, but also taking into consideration those relationships between the columns.
What does this column do in a particular table? Is it the business key? Is it part of a composite key of more than one column? All of those are very important, and I want to make sure that when people do modeling, those are some of the key things they start considering.
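To illustrate the kind of automation that tagging enables, here is a hedged sketch; the metadata shape and naming convention are invented, and this is not Coalesce's actual code generation. Given the business-key and change-tracking tags, the change-detection SQL for a type 2 dimension writes itself:

```python
# Invented metadata shape: which columns form the business key and which
# are tracked for changes.
table = {
    "name": "dim_customer",
    "columns": [
        {"name": "customer_id", "tags": {"business_key"}},
        {"name": "email",       "tags": {"scd_type_2"}},
        {"name": "region",      "tags": {"scd_type_2"}},
    ],
}

def generate_change_check(table: dict) -> str:
    """Emit SQL that finds staged rows whose tracked columns changed."""
    keys    = [c["name"] for c in table["columns"] if "business_key" in c["tags"]]
    tracked = [c["name"] for c in table["columns"] if "scd_type_2" in c["tags"]]
    on      = " AND ".join(f"tgt.{k} = src.{k}" for k in keys)
    changed = " OR ".join(f"tgt.{c} <> src.{c}" for c in tracked)
    return (f"SELECT src.* FROM staging_{table['name']} src\n"
            f"JOIN {table['name']} tgt ON {on}\n"
            f"WHERE {changed}")

print(generate_change_check(table))
```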
[00:25:56] Unknown:
And for people who are building their transformations and doing their modeling, what are some of the elements of a given tool or workflow that are necessary for column awareness to be an effective component of that overall effort? So, what are the capabilities in a tool that need to be present for it to be column aware? What are some of the ways that column awareness will be incorporated into the transformation workflow of saying, okay, I've got this information in this column, and I need to combine it with this column from this other table to merge it into this new attribute or this new metric, which is then being propagated into this new table, which is a collection of columns? And some of the ways that lineage at the column level needs to be present to inform the transformations and the modeling efforts, etcetera?
[00:26:47] Unknown:
Yeah. So being column aware, as I said before, is not just storing column metadata or creating a lineage diagram. Those are just the byproducts of being column aware. You have to leverage it from the ground up in everything that you do in the tool. Whether you're generating code, generating impact analysis, generating lineage, or even when you're modeling: you're selecting business keys, you're drawing foreign key relationships, you're applying data quality rules. All of these should be part of the tool in a way that you can easily interact with the metadata and enhance and enrich it.
So you have these columns, and as a user, you understand what these columns are, what they are supposed to do, what their function is, and what they mean. So you can enrich that metadata repository by adding additional metadata. And as you do that, the tool takes that column metadata plus the enriched metadata that the user input and generates these other artifacts that I was talking about, like scripts or models or
[00:28:09] Unknown:
lineage diagrams and so on. And so from that, another question is, how much of the column awareness is intended for the actual consumer of the tool, somebody who's building the transformations, versus the column awareness being a function of the tool itself, something that is used to drive that automation without the need for as much direct human involvement?
[00:28:36] Unknown:
Yeah. Very good question. So we don't want to burden the user by forcing them to use the tool in a certain way just because we are column aware. It should be an easy to use tool, whether you're a GUI person or you're writing some code or a template. It should be very easy to use, and you shouldn't have to think about the column awareness; it's just a given. But behind the scenes, because we are column aware, we can do a lot of automation for you. So it's mostly how the tool functions, but there are some things that manifest in the UI. For example, because we know every column, we can show you a list of columns so you can select which columns are business keys. That just becomes easy for you. Because we are column aware, we can show, hey, here are the columns.
Pick the business key, one or two columns if it's a composite key, and also pick which columns are type 2 columns, for example. And the user can just go and pick, rather than remembering and typing and putting it in a YAML file or whatever. So it's a combination, but mostly it's a function of the tool behind the scenes. In terms of the overall tooling ecosystem,
[00:29:50] Unknown:
the default mode for a while, at least, has been to operate at the table level, or even just at the task level, where there is potentially no awareness at all of whether there are tables or columns or files being manipulated. And as we have gone through successive generations of transformation tools and underlying compute platforms, we have gotten to a more granular and nuanced level of the tools having awareness of the underlying objects or resources that are being operated on. I'm wondering what are some of the complexities of managing this column awareness that have taken so long for us to get here, where maybe the table was the previous level of understanding within the tooling, and some of the ways that poses complexity in the tool development to be able to expose these capabilities and this semantic understanding of what is being done? So we,
[00:30:51] Unknown:
we extract this metadata upon discovery. As soon as you connect to something, we try to extract this metadata and save it. And then you're basing off of that metadata: you're going to use that as a starting point and build additional objects and additional workflows. Down the road, there could be things where we may not be able to extract this metadata out of some source, I don't know, like a video message. Maybe we don't need to. Or maybe it becomes possible because AI is coming on. I don't want to take this discussion in that direction, but I'm just saying things are changing. We understand.
But regardless, the importance of having this column awareness is not going to go away. In the near future, I can't see us saying, hey, it's too complex, so let's not do column level; just go back to the table level. That's like summarizing the data: once you summarize, you lose the granularity. That's exactly what would happen. Is it difficult for the tools to build this? I think we should do as much as we can as vendors. One key thing, though, that I've seen in how this was done in the past versus how we're doing it now is that column metadata was an afterthought.
And lineage diagrams were an afterthought. It was like, okay, after you do everything, whatever you're doing, then we're going to run something and scan it. Well, that's not how we think it should be done. We think the column awareness should be the building block for everything. It should be the starting point for everything, and that's the fundamental difference. Yeah, it's a challenging problem, but we have taken it upon ourselves
[00:32:48] Unknown:
to solve this. Yeah. That was another aspect that I wanted to discuss: this question of column awareness and lineage tracking within the transformation tool, within the workflow of actually manipulating the data, versus bolting it on or discovering it after the fact with some of these metadata platforms and the lineage analysis from the SQL logs in the data warehouse. And some of the ways that pushing that information into the transformation and data manipulation layer impacts the effectiveness and utility of these metadata platforms, whether it serves as a means of improving them or just serves to obviate them entirely.
[00:33:29] Unknown:
So you're basically asking whether the way the current tools extract this after you build something is useful?
[00:33:40] Unknown:
Yeah. So, basically, given the fact that metadata platforms have become popular because of the need to expose this lineage information, etcetera, does having that lineage information and column level awareness within the transformation tool serve to improve those metadata platforms, or does it serve to make them obsolete?
[00:33:59] Unknown:
Yeah. I think it's the former. It's going to help improve them, definitely. The reason is that when you do it as an exercise at the end, using a metadata tool to extract everything, they do a pretty solid job; however, they won't be 100% accurate. They're going to have some accuracy issues. But when you do it from the ground up like us, because we are using this as the basis to generate the code, our lineage, for example, is going to be accurate, since we're talking about lineage. And there's more than just lineage; I want to make sure it's clear that column awareness is not lineage, lineage is just a byproduct. But to answer your question, we would actually improve the metadata extraction that happens with these tools, so it's going to be more accurate. And another point here is, of course, the whole point of this is that it improves the productivity of building this in the first place, and the metadata tools would get more accurate metadata.
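As a toy illustration of the difference, with made-up column names and no claim about any product's implementation: when each derived column registers its inputs at build time, lineage and impact analysis become a simple graph walk instead of an after-the-fact parsing job.

```python
from collections import defaultdict

# Each derived column registers its inputs at build time (names invented).
lineage: dict[str, set[str]] = defaultdict(set)

def derive(target: str, *sources: str) -> None:
    """Record that `target` is computed from `sources`."""
    lineage[target].update(sources)

derive("orders.amount_usd", "raw_orders.amount_cents")
derive("revenue_report.total", "orders.amount_usd", "orders.order_date")

def upstream(column: str) -> set[str]:
    """Every column that feeds `column`, directly or transitively."""
    seen: set[str] = set()
    stack = [column]
    while stack:
        for src in lineage[stack.pop()]:
            if src not in seen:
                seen.add(src)
                stack.append(src)
    return seen

# Impact analysis is a graph walk, not a reverse-engineering job:
print(upstream("revenue_report.total"))
# e.g. {'raw_orders.amount_cents', 'orders.amount_usd', 'orders.order_date'}
```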
[00:35:05] Unknown:
Yeah. It's interesting too because of the recent efforts of some of the metadata platforms to use their contextual awareness to feed back into the transformation process and some of the data platform automation, figuring out when to schedule your different transformations so that more of them can happen within a given time frame for cost management, etcetera. So definitely interesting to see how that plays out. Another avenue that I like to explore in this space of column awareness in the transformation layer is some of the ways that it manifests outside of the context of the data warehouse. Within the warehouse, it's very clear what the benefits are, because everything happens within that bounded context: everything is a column, and it gets moved into another column, so there is this similarity in terms of the operations being performed. But as you start to leave the context of the warehouse into something like a machine learning workflow or a streaming system, or as the data leaves the warehouse and feeds back into maybe a SaaS platform through things such as reverse ETL, how does the column awareness help in those external contexts?
[00:36:14] Unknown:
Yeah. So I don't have a particular use case in mind, but I'm pretty sure that, just like how we are saying we are going to improve the metadata extraction process to improve the accuracy, the same thing can be applied to other outside processes. Because of this level of information and the quality of the output that we are giving, it basically helps improve any external process that depends on this. Whether it's reverse ETL or some visualization tool pulling data, it's all interconnected.
I just don't have a use case off the top of my mind, but I definitely see that it can only benefit these systems.
[00:36:58] Unknown:
And in terms of your work at Coalesce and helping to think through the ways that column awareness manifests within your tool, and just as a general practice of column awareness being this core capability of the transformation layer, what are some of the most interesting or innovative or unexpected ways that you have seen it used, either in the context of data modeling or in the overall impact that it has had on the engineering workflow to be able to build out these different analytical datasets?
[00:37:29] Unknown:
Yeah. So when we came up with this and our clients started using our product, what I have seen is that they started building what we call nodes in the platform. A node type is basically a pattern, and behind the pattern, you can write a template. The template is actually leveraging the metadata behind the scenes, this column level metadata. What I'm super excited about is these customers coming up with innovative templates. They are coming up with their own ideas of how they can leverage this column level metadata and start creating templates to meet their needs. Every time you write a new template, you're basically creating a new node type that can be leveraged by anybody.
If they share it in the future with other clients, I don't have to rebuild anything; I just use the node type. But the way that they're doing it is very exciting: how they are able to come up with innovative solutions to meet their needs writing these node types.
[00:38:36] Unknown:
And in your own work of building Coalesce and building this column aware tooling, what are some of the most interesting or unexpected or challenging lessons you've learned in the process?
[00:38:47] Unknown:
It was definitely a journey for us. We went back and forth on a lot of deep technical architectural things. What have I learned? Let's see. I think the number one thing is the potential. I did not realize how much we can do with this, but the more we do, the more we think we can do with it. For example, when we came out with it, we had some things in mind. Okay, we'll get lineage. Lineage is going to be a great thing. But then we said, not only that, we can start building these custom node types. That's another thing. And now we're saying, how can we apply data quality, for example, with the same column level metadata?
And now we're entering this AI world, and I can already see how we can do some of that automation using AI with the column level metadata. So my point is, as we do more, we see that this architecture that we chose still has a lot of potential for future use cases.
[00:40:02] Unknown:
And for people who are building out their data transformation workflows and deciding on how best to approach their modeling, what are some of the cases where a column aware approach is the wrong choice? I think there's no such thing, in my opinion, as
[00:40:19] Unknown:
column awareness being incorrect, especially because we're not asking them to spend a lot of time on this. Of course, if you have a tool that is column aware, they don't have to do it. But if they have to build something from the ground up in their environment and make it column aware, then it's a different story, because then there might be situations where you might say, I don't need the column awareness for this. It's possible; I can see some use cases where it might be overkill for them. But if you adopt a tool that already has it, then it's not a big deal, because you're only leveraging it; you're not trying to build everything from the ground up. And for teams who are
[00:41:10] Unknown:
looking to improve their data modeling approaches and to simplify their overall workflow, what are some of the resources that you recommend they look to for learning more about data modeling principles and some of the ways that column awareness can be added to or integrated into their overall workflow or tool chains, etcetera?
[00:41:32] Unknown:
Yeah. You know, I've been in this space for a long time, so my recommendations would be the classic books, The Data Warehouse Toolkit from Ralph Kimball and things like that. I don't expect everybody to go through the whole Ralph Kimball book, which is, first of all, a very lengthy book, and some people might feel like some of the concepts are outdated. But as I said before, just focus on the 101 of modeling first. Start there. I can't name one resource; you can find so much on the Internet. But make sure that you understand the why: what's the benefit of modeling?
Why do we need to model? And don't fall into the trap of perfecting the model. That's the other piece of my advice. Always keep iterating, because it's going to change regardless; there's no perfect model. And,
[00:42:25] Unknown:
I would say there are a lot of resources, but start small and play with them. Alright. Are there any other aspects of the overall space of data modeling and data architecture in the context of cloud warehousing in particular, and the capabilities and impact of column aware tooling, that we didn't discuss yet that you'd like to cover before we close out the show?
[00:42:48] Unknown:
One thing that I didn't talk about with the column awareness is the documentation that you get out of this, the governance aspect. We have seen a lot of times where people take pride in coding, which is great. But you code all these things, and as your environments get bigger and bigger, you have a lot more code, and then the person leaves. And there you go: it becomes very difficult for anybody to go in and understand the code. This is a classic thing. We all know this. But with column awareness, another byproduct is the documentation that you can generate out of it.
It captures how this was built, and you can also add context to show why it was built the way it was built. By just pushing a button, it can read this column metadata and generate this documentation for you. And it's always going to be accurate. Whether you agree with it or not, that's the truth: this is how it was built, and this is what we're showing. So that's another big value that you get from that.
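A minimal sketch of that byproduct, with invented column names: because names, types, and tags already live in the metadata, a data dictionary can be rendered on demand and never drifts from what was actually built.

```python
# Invented column metadata; in a column-aware tool this would already exist.
columns = [
    {"name": "customer_id", "dtype": "INTEGER", "tags": ["business key"]},
    {"name": "email",       "dtype": "TEXT",    "tags": ["PII", "SCD type 2"]},
]

def to_markdown(table_name: str, columns: list) -> str:
    """Render a simple Markdown data dictionary from column metadata."""
    lines = [f"## {table_name}", "",
             "| Column | Type | Tags |",
             "| --- | --- | --- |"]
    for c in columns:
        lines.append(f"| {c['name']} | {c['dtype']} | {', '.join(c['tags'])} |")
    return "\n".join(lines)

print(to_markdown("dim_customer", columns))
```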
[00:43:55] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you and your team are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. Yeah. I think,
[00:44:12] Unknown:
a tool or a platform, I would say. We like to call ourselves a platform because we think big: a platform where business and IT, power business users and IT architects, data architects, data engineers, can all work together in one single place to build their data foundation and data assets. And that's the gap right now. Either the tools skew too far toward the business needs, so only the business users are using those tools, or it's the other way around: IT oriented tools where the business doesn't have any visibility or the training to understand how these tools work. What Coalesce, for example, does is bring these two types of personas, and many other personas, into the same platform. So everything is being built there,
[00:45:08] Unknown:
and it's all governed in one place. Alright. Well, thank you very much for taking the time today to join me and share your thoughts on the benefits and challenges of data modeling and some of the ways that column aware tooling can help with that overall effort. I definitely appreciate the time and energy that you and your team are putting into making that available. So, thank you for that. I hope you enjoy the rest of your day. Thanks, Tobias. Thanks
[00:45:31] Unknown:
for having me.
[00:45:38] Unknown:
Thank you for listening. Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Guest Introduction
Satish Jayanthi's Journey into Data
Impact of Cloud on Data Warehousing
Stages and Benefits of Data Modeling
Challenges in Dimensional Modeling
ETL vs ELT Paradigms
Data Self-Service and Modeling Priorities
Evolution of Data Modeling Approaches
Column Awareness in Data Modeling
Tooling and Workflow for Column Awareness
Complexities of Column Awareness
Metadata Platforms and Column Awareness
Column Awareness Beyond Data Warehousing
Innovative Uses of Column Awareness
Documentation and Governance with Column Awareness
Biggest Gap in Data Management Tooling