Summary
For business analytics, the way that you model the data in your warehouse has a lasting impact on what types of questions can be answered quickly and easily. The major strategies in use today were created decades ago, when the software and hardware for warehouse databases were far more constrained. In this episode, Maxime Beauchemin of Airflow and Superset fame shares his vision for the entity-centric data model and how you can incorporate it into your own warehouse design.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack
- Your host is Tobias Macey and today I'm interviewing Max Beauchemin about the concept of entity-centric data modeling for analytical use cases
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what entity-centric modeling (ECM) is and the story behind it?
    - How does it compare to dimensional modeling strategies?
    - What are some of the other competing methods?
    - Comparison to activity schema
- What impact does this have on ML teams? (e.g. feature engineering)
- What role does the tooling of a team have in the ways that they end up thinking about modeling? (e.g. dbt vs. Informatica vs. ETL scripts, etc.)
    - What is the impact of the underlying compute engine on the modeling strategies used?
- What are some examples of data sources or problem domains for which this approach is well suited?
    - What are some cases where entity-centric modeling techniques might be counterproductive?
- What are the ways that the benefits of ECM manifest in use cases that are downstream from the warehouse?
- What are some concrete tactical steps that teams should be thinking about to implement a workable domain model using entity-centric principles?
    - How does this work across business domains within a given organization (especially at "enterprise" scale)?
- What are the most interesting, innovative, or unexpected ways that you have seen ECM used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on ECM?
- When is ECM the wrong choice?
- What are your predictions for the future direction/adoption of ECM or other modeling techniques?
Contact Info
- mistercrunch on GitHub
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
- Entity Centric Modeling Blog Post
- Max's Previous Appearances
- Apache Airflow
- Apache Superset
- Preset
- Ubisoft
- Ralph Kimball
- The Rise Of The Data Engineer
- The Downfall Of The Data Engineer
- The Rise Of The Data Scientist
- Dimensional Data Modeling
- Star Schema
- Database Normalization
- Feature Engineering
- DRY == Don't Repeat Yourself
- Activity Schema
- Corporate Information Factory (affiliate link)
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- Rudderstack: ![Rudderstack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png) Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at [dataengineeringpodcast.com/rudderstack](https://www.dataengineeringpodcast.com/rudderstack)
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack.
Your host is Tobias Macey, and today I'm interviewing Max Beauchemin about the concept of entity-centric data modeling for analytical use cases. So, Max, for people who haven't heard any of your past appearances on the show, can you just give a bit of an introduction?
[00:00:55] Unknown:
Yeah, for sure. I was trying to think how many times I've been on the show before, and how many times we're planning to get me back on, so this is part of a series of what will be multiple dozens of episodes that we'll do together. A little bit of background on me: to this community and audience, I'm probably best known for being the creator of Apache Airflow. I started that project while I was at Airbnb in 2014. A little bit after that, also while at Airbnb, I started Apache Superset. Apache Superset, for those not familiar with it, is an open source business intelligence, data visualization, and dashboarding web application. You can download it, install it, and run it in your company just like you would run Tableau or Looker. People in the audience are probably familiar with Superset, but it has gotten really, really good over the past year or two. So if you haven't checked it out since I first started talking about it — around 2017 or 2018 is when we launched it — it is good now. It's really solid, really competitive, better than some of the vendor tools in many, many regards, especially for a technical audience. For people who are more technical, who know SQL, the tool is designed more for the modern data practitioner, so I invite people to try it. I won't do too much of a commercial bit, but if you want to try Apache Superset, the best way is on Preset. I am a startup founder now: I started a company called Preset around Apache Superset, and the goal there is to offer a really awesome managed service around Apache Superset, with bells and whistles.
It's all managed, and there's a freemium component, so you can try it for free today. You can also go use the open source version if you want to, but the easiest way to try the latest and greatest version of Superset is on preset.io — check it out. So, founder now — I guess I'm more of a founder — but very much an open source enthusiast. I love writing code, building communities, and, more recently, building a company that can fuel open source. The idea is much more about how we create something sustainable around the open source project, so we can be a little bit the parent, or the foster parent, for the open source project, do all the difficult things, and really help the community. For me, it was about how business can help open source disrupt, in this symbiotic kind of way, much more than "let's extract the money out of this community." That's really not the goal. The goal is to have this symbiotic capital infusion into open source, helping open source to disrupt and eventually dominate in business intelligence.
[00:03:59] Unknown:
Yeah, I definitely see Superset all over the place these days. Pretty much everywhere I look, if there's any sort of open source aspect to a stack, Superset is usually involved, so definitely great work on that. And, just as a brief recap for people who haven't listened to any of your past appearances, how did you first get involved with working in data?
[00:04:20] Unknown:
Yeah. You know, we could look back at the topics we've talked about together as a series of my shifting and ever-evolving interests over time. But I first started a while ago — you could tell, if we were on video, how much gray hair I have at this point. My career started in the early 2000s. I did one year of web development, which is a great thing: regardless of what your engineering career is going to look like, being able to build little websites and little tools is super helpful. So I did a year of that early on, and then I got into data warehousing, pretty much. I was at Ubisoft, a company that probably doesn't need to be introduced anymore at this point. There I started doing traditional data warehousing. I had taken some classes in school — I'm a college dropout, but I had an IT program, the equivalent of an associate degree, where I learned some SQL, some data modeling, a little bit of everything in terms of building apps and coding (I guess we didn't call them apps at the time). So I had done some SQL and some data modeling, then got into data warehousing, picked up the Ralph Kimball books at the time around dimensional modeling, and started building a data warehouse for Ubisoft: all of their financial data, retail data, pretty much covering the realm of the data they had at the time.
And that was the start of a career in everything data, pretty much. I was the data warehouse architect, I was the business intelligence engineer. And then, later on in my career, I moved to Silicon Valley and joined companies like Facebook, Airbnb, and Lyft, and that's where I started working on open source and doing what we call data engineering today, in a more modern way.
[00:06:35] Unknown:
Yeah. And a lot of people have credited you as being, to some extent, the godfather of data engineering as it's defined today, given your well-timed blog posts on the rise and the downfall of the data engineer, which roughly coincided with the start of this podcast.
[00:06:52] Unknown:
Yeah, I remember — that was one of our first conversations together. And, you know, it's not like I invented data engineering, but I wrote the blog post that tried to define what it was and what it was not at the time. For me, my experience was that I joined Facebook as a data warehouse architect, and by the time I left a few years later, I left as a data engineer. We realized while we were there that what we were doing was not data warehouse architecture or business intelligence. It was a whole new discipline, where different skills were leveraged in different ways, and the approach was completely different.
So at some point, internally at Facebook, we decided to call ourselves data engineers. And I think I was inspired by an influential blog post called The Rise of the Data Scientist, or something like that — I dug that out — which was pretty critical to cementing and defining that term and creating that new role and persona. I thought, hey, I'd like to do that for data engineering. I wrote that blog post, and, yeah, the rest is a little bit of history.
[00:08:12] Unknown:
Bringing us now to the topic at hand for today brings us around to another one of your blog posts — writing is definitely one of your strong suits, and I appreciate all of the effort you put into that. You wrote a post about this concept of entity-centric data modeling. For people who haven't already read the post, we'll have the link in the show notes, so you can either pause now, read it, and come back, or just read it afterwards. But for people who don't want to take the time, can you give a bit of an overview of what you mean when you describe entity-centric data modeling and some of your process of coming to this formulation of the problem?
[00:08:52] Unknown:
Yeah. First, a bit of a disclaimer: talking about data modeling without schematics and images is a little difficult — or not difficult for me to talk about, but it might be difficult for people to make sense of whatever we're going to say here. So I encourage people to read the blog post too, and we'll try to stay in language territory, as opposed to visuals, to talk about it. The core idea behind entity-centric data modeling — well, maybe I'll first talk about the driver of why I wrote the blog post. I've been a dimensional modeling aficionado for twenty years. I read the Kimball books a while ago; facts and dimensions make sense to me. For those not familiar with dimensional modeling, go pick up that book. It is fading in relevance in some ways, but if you're a data engineer, it's probably good to go and read it — maybe it's the original, the Old Testament.
I don't know what the New Testament is — it might still have to be written — but it's not a bad thing to go and read. The general idea behind dimensional modeling is star schemas, where some of the core principles are that you have fact tables and dimension tables. Facts mostly contain foreign keys to dimensions, plus metrics, and then dimensions have the primary key of the entity and all sorts of denormalized attributes at the entity level. I'm not going to get into snowflake schemas, because I feel like that starts to get into the area where visuals are helpful — if you want to learn more about dimensional modeling, check out the books. So I've been practicing dimensional modeling, and I think, over the past decade or so, we started to be much more pro-denormalization in the Hadoop era.
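For readers who want the shape of that on the page, here is a rough sketch of the star-schema layout Max describes, using a hypothetical retail example (all table and column names are invented for illustration):

```sql
-- Dimensions: one row per entity, with denormalized descriptive attributes.
CREATE TABLE dim_customer (
  customer_key  BIGINT PRIMARY KEY,
  customer_name VARCHAR(255),
  country       VARCHAR(64),
  signup_date   DATE
);

CREATE TABLE dim_product (
  product_key   BIGINT PRIMARY KEY,
  product_name  VARCHAR(255),
  category      VARCHAR(64)
);

-- Fact: mostly foreign keys to dimensions, plus metrics and a time dimension.
CREATE TABLE fct_sales (
  customer_key  BIGINT REFERENCES dim_customer (customer_key),
  product_key   BIGINT REFERENCES dim_product (product_key),
  order_date    DATE,            -- time dimension
  quantity      INT,             -- metric
  revenue       NUMERIC(18, 2)   -- metric
);
```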
If you're not familiar with normalization and denormalization, you're probably in trouble for this blog post — you probably need to go pursue your career in data engineering for a little while, read some articles, and come back to it. But we've been pro-denormalization: the idea of having more flat tables, as opposed to a bunch of small tables with a lot of foreign keys. Another concept that comes up in the blog post is normal forms — first normal form, second normal form, third normal form. But going back to dimensional modeling, one thing that I've been doing, and that I've seen a lot of folks do over the past five to ten years, is starting to bring metrics into their dimensions.
That can be a really, really powerful pattern, in a lot of ways, and we can get into why. It's a little bit of a digression in the sense that, if Ralph Kimball were on the show, he'd be like, "Wait, you did what? You brought metrics into your dimension? Is that okay? Is that not okay? Why are you doing that?" So that's one of the ideas: yes, in entity-centric data modeling, it is virtuous and encouraged to bring metrics into dimensions, and we have ways to talk about how to do that properly — we'll get into that. Another idea is legitimizing bringing nested data structures inside your dimension tables.
We now have databases like BigQuery and Snowflake — really, all databases have learned JSON and complex data structures over the past decade, which was not the case when those early books were written. So it's a new set of tools that we have, and, by the book, putting an array of objects into a table was probably discouraged. Now I'm saying, hey, sometimes it actually does make a lot of sense to do that. So those are some of the core ideas that I've seen done; I wanted to legitimize them and offer a little bit of a framework for how to do that properly. You bring metrics into your dimension.
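As a sketch of that nested-structure idea, here is what a dimension table carrying an inline array of objects might look like in BigQuery-flavored SQL (the table and columns are hypothetical; Snowflake and other engines express the same thing with their own VARIANT/OBJECT types):

```sql
-- A dimension row that carries a metric and a nested array of recent events
-- inline, instead of forcing a join out to a separate fact table.
CREATE TABLE analytics.dim_customer (
  customer_id    INT64,
  customer_name  STRING,
  lifetime_value NUMERIC,   -- a metric living in the dimension
  recent_orders  ARRAY<STRUCT<
    order_id   INT64,
    ordered_at TIMESTAMP,
    amount     NUMERIC
  >>
);
```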
What do you do with time? Say you have a customer table and you're going to bring in some metrics. You might have cost of acquisition or lifetime value of the customer, but you might also want to pivot time: how many visits has this user made to the site over the past 28 days, or 7 days? So I wanted to legitimize a new set of practices and offer a bit of a framework, or hints, as to how to do that properly. Here's another thought that's pretty foundational: some of these ideas come from the field of feature engineering. If you're a data engineer today, you may or may not know that field, depending on how close you sit to your fellow data scientists on your team.
Whether they have a feature store and do feature engineering might differ depending on your organization. But I also wanted to take some of the concepts and practices that I've seen used in the field of feature stores and feature engineering and mash them into data modeling for analytics, because I've seen a lot of benefits in doing that. Feature engineering is extremely entity-centric too, so it all fits together. In the blog post, I talk about the parallels between feature engineering and entity-centric data modeling. So that's a lot of stuff, a lot of information — I'm speaking fast, and hopefully people are still with us; I've probably lost 80% of the audience here. But we can unroll it and go deeper into these different aspects of it.
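Putting those pieces together, a hypothetical entity-centric customer table might mix classic descriptive attributes, life-to-date metrics, and time-pivoted metrics side by side (all names here are invented for illustration):

```sql
CREATE TABLE analytics.dim_customer_ec (
  customer_id         BIGINT,
  -- classic dimensional attributes
  country             VARCHAR(64),
  acquisition_channel VARCHAR(64),
  -- life-to-date metrics brought into the dimension
  cost_of_acquisition NUMERIC(18, 2),
  lifetime_value      NUMERIC(18, 2),
  -- time-bound metrics, pivoted into columns
  l7_visits           INT,
  l28_visits          INT,
  l90_visits          INT
);
```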
[00:15:10] Unknown:
Yeah, there's any number of directions that I would love to go with this. But before we get too far down any of those paths, I think it's worth giving a concrete sense of what we mean when we're talking about an entity. Obviously, this is going to be different depending on your business, but in a relatable sense, what are some of the ways that you think about how to go about defining an entity? What are the heuristics that you use to say, this is a business object or a domain object within the context of my organization?
[00:15:42] Unknown:
Yeah. So clearly, this notion of an entity — we've talked about it in the abstract here. But the reality is that, in a lot of cases, analytics are applied to businesses, and your core entities as a company should be really clear to you. While they may vary from one business to another, or from one business vertical to another, there are some that are very universal and that most companies have. The customer, for instance, is a very clear business entity. So when you think of entity-centric data modeling, for a lot of people that might mean your core business entities: your products, your customers, your users, perhaps your departments.
If you're more in an HR or finance world, it might be your business units. To me, if you look at the definition of a dimension in dimensional modeling, it's fairly aligned with that. One thing that's a little bit of a digression here, and that might be interesting yet a little confusing: when you think about facts, is an event an entity? When you look at your sales fact table, usually, behind the scenes, it's not that far from an invoice or invoice-line fact table. So I would say that, in data modeling, most things are entities, whether you want it or not.
And when I named this approach, I wondered: do I go with "entity"? Because even if you look at third normal form — your OLTP-type data modeling, data modeling for apps — that is also very, very entity-centric. So "entity-centric" sounds like it calls for high normal forms, third normal form, but in this case it's really entity-centric for analytics. So, yeah, the question is: what are the big dimensions and entities for your business?
[00:17:57] Unknown:
One of the things that I want to focus on is what you were talking about with this juxtaposition of analytical use cases for data versus machine learning use cases for data, where machine learning has largely been dominated by this concept of feature engineering — unless you happen to be doing deep learning, in which case you just throw a whole bunch of data in and hope something useful comes out the other side.
[00:18:16] Unknown:
Yeah. In deep learning too, I think feature stores can still be used. It's really LLMs now — this whole new field of unsupervised learning — where you can throw a bunch of text at the model, so you don't need data that's as structured if you have a language model. But if you want to train a neural net for special-purpose use cases, you still need a feature store, or you still need things to be organized in a similar way. For the data engineers listening, think about what's so different about feature engineering and ML use cases, and what the needs are there. Typically, in machine learning, you want something fairly entity-centric: very flat tables with a pile of columns. Say you're trying to predict whether a user is likely to churn. What you're going to need, for the user entity, is a pile of metrics that might be related or correlated to whether they churn or not. So you train a model by giving it data for a bunch of users, and for each user you give it a large number of attributes — most likely at least a hundred, maybe a thousand — that may or may not be related to their likelihood to churn. It might be things like time since last visit; that's a metric.
How many visits did they have over the past year? Which parts of the product do they engage with, and which do they not? Usually that ends up as a column: did they engage with feature A in the product, did they engage with feature B, all pivoted. Typically, the models like to receive data that is all pivoted, and then what the machine learning model does is try to find correlations: which columns or metrics, or combinations of columns and metrics, predict the target. So in machine learning there's a really clear entity — in this case, the user — a metric that you're trying to predict, like whether they will churn or not, and then a pile of attributes that may or may not be related to that. You let the machine learning model figure out what correlates with it.
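A minimal sketch of that pivoted, one-row-per-entity shape for a churn model, assuming a wide user table with hypothetical column names:

```sql
-- Hypothetical churn training set: one row per user, one label column,
-- and many pivoted attributes for the model to sift through.
SELECT
  user_id,
  churned_in_next_28d      AS label,  -- assumed pre-computed label column
  days_since_last_visit,
  l7_visits,
  l28_visits,
  l90_visits,
  used_feature_a_l28d,
  used_feature_b_l28d,
  avg_session_minutes_l28d
FROM analytics.dim_user;
```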
[00:20:57] Unknown:
And given those two different use cases, one of the things that I've been curious about lately is whether both of those teams — the data analyst and the ML engineer — are going to the same place to get their data. Does the ML engineer need their own feature store, feature repository, and feature pipelines while the data analyst uses the data warehouse, each off in their own world? Or are we getting to a point where the data warehouse is powerful enough and flexible enough that everybody is actually working on top of that same substrate, maybe with a different set of tables derived from those core sources?
[00:21:35] Unknown:
Yeah. I think it's very intuitive to think: hey, we should share. We should be DRY. We should have only one definition of these metrics and one way to model these things, all consistent, managed in the same repo, shared between data engineers and data scientists. But I would say the needs on both sides are intricately similar and different, and the constraints on both sides are intricately different too. So, should they share, should they talk, should they align?
Yes. Can they use the same technology and processes and live in the same repo? In my experience, that doesn't tend to work so well. I can try to explain some of the bigger divergences. On the feature side of things, there's often a need for online prediction — live, streaming datasets. So streaming versus batch: that need may or may not exist on the analytics side. Then there's the kind of metrics they care about versus the ones we care about — call it the ninety-day analytics versus the machine-learning-type use cases. There are a lot of common metrics, but also a lot of metrics they care about that you might not. And the practices, the mindset, the personas, the approaches tend to be differentiated enough that it's hard to agree and settle on one place to work together. So I'd say give it a shot, but your likelihood of success is — I don't know — not guaranteed. I'd be really curious whether there are people doing that successfully to some extent; that would be a really interesting thing for the show notes. But in my experience, these different personas tend to live in slightly different worlds. They do share some source tables at some point: you have your staging area, your work tables, your more curated schemas.
They might tap into those, or you might have a shared layer somewhere — you're working with the same data. The question is how far along you go together before you diverge into different systems and different use cases.
[00:24:00] Unknown:
And so, bringing us back now to the space of data modeling: we've talked a little bit already about the juxtaposition of entity-centric modeling with some of the more so-called traditional dimensional modeling strategies of star and snowflake schemas, and we can maybe even lump Data Vault in there as well, if you're feeling adventurous. Beyond those established practices, what are some of the other competing modeling approaches that you've seen cropping up? One in particular that came to mind when I was preparing for this episode was the activity schema, popularized by a company whose name is escaping me at the moment, where every event just goes into one table, you have an event type, and you query based on event type. I'm wondering if you can talk through some of the other ways that you've seen people try to tackle this problem and reconcile the increased compute power we have now with the structural benefits offered by these dimensional strategies.
[00:25:01] Unknown:
Yeah. So first: what is the inventory of the different data modeling techniques that you might want to study or use as a data engineer today? What are the relevant ones? You mentioned the activity schema, which I would probably best describe as cramming everything into one table with more complex data structures, which has some virtues — and, frankly, it's a pattern I've used for some parts of the data warehouse, for frameworks that pivot time metrics and things like that. So we could talk about how an activity-schema-type approach can be leveraged to produce entity-centric practices in some ways. Just trying to enumerate: I know there's Data Vault, which has some traction and some good literature. There's data mesh, which I would dismiss as not data modeling — it's more organizational; it's about how you should structure your organization.
It's not really prescriptive in terms of data modeling technique. So there's activity schema, Data Vault, data mesh. More traditionally, there's dimensional modeling — that's the Kimball one — and there's the Corporate Information Factory (it sounds like an awful term), which is the Bill Inmon approach; that one, I believe, suggests a normalized data warehouse and then building data marts off of it. Kimball's dimensional modeling talks about conformed dimensions. But, yeah, I think it's a bit of a messy world, and there's nothing that says you have to pick only one of these religions. And there's been very little innovation in that field. If you ask, what are the best data-modeling-for-analytics blog posts or books that I should read from the past ten years?
I don't think you're going to end up with a lot of reading material. Maybe because the topic is — I don't know — partially solved, or partially unsolvable, or not that interesting to people. And yet, with the rise of the analytics engineer, you have people basically doing whatever the heck they want in terms of data modeling: I'll just write a bunch of SQL, mountains of SQL that derive datasets from other datasets, and I don't know what normalization or denormalization means — I just run queries and produce datasets, and we try to make sense of all of this. So given that, I saw a bit of an opportunity to write some content on data modeling. There's probably an audience out there.
There are people wondering: is it okay to put metrics in my dimension? Should I take a page from the feature engineering book and apply some of those techniques on my side? So, yeah, it's a bit of a jungle out there. The most comprehensive, complete, self-standing thing, to me, is still the Kimball books — and they're what, twenty years old now? They predate Parquet; they predate column stores in many ways. So some of the stuff in there needs to be freshened up.
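For reference, here is a loose sketch of the activity-schema idea mentioned above — one long event table keyed by entity and activity type, from which entity-centric metrics can be pivoted out. The column names are an approximation for illustration, not the exact published spec:

```sql
CREATE TABLE activity_stream (
  activity_id    VARCHAR(64),
  ts             TIMESTAMP,
  customer       VARCHAR(64),     -- the entity the event belongs to
  activity       VARCHAR(64),     -- e.g. 'visited_site', 'completed_order'
  feature_json   VARCHAR(4000),   -- activity-specific attributes
  revenue_impact NUMERIC(18, 2)
);

-- Entity-centric metrics can then be derived by pivoting on activity type:
SELECT
  customer,
  COUNT(CASE WHEN activity = 'visited_site'    THEN 1 END) AS ltd_visits,
  COUNT(CASE WHEN activity = 'completed_order' THEN 1 END) AS ltd_orders
FROM activity_stream
GROUP BY customer;
```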
[00:28:24] Unknown:
Another interesting thing that I see in the community, and have experienced firsthand a little bit, is the propensity for teams working with data to not even go through the modeling process — particularly with ELT and tools like dbt — and to just say: I've already got my data, I'm just going to write some SQL to pull a few things together into a table, and there we go. That works for a little while, until it doesn't. I'm curious what you see as the overall impact of tooling on the way teams approach that thought process of data modeling and data transformation, and the downstream usage and impact of how they approach that transformation and modeling stage — and, lumping the underlying database engines and their capabilities into the tooling category as well — just the way all of that jumbles together to either encourage teams to think about modeling or to just go forth and hope for the best.
[00:29:22] Unknown:
Yeah. One thing is that there's a lot of virtue in making it easier for people to do things, and Airflow did that in many ways. Airflow made it easy-ish — not that it's super easy — for people to manage, maintain, and grow their network of data pipelines faster. With that comes a lot of enablement: more people can create more pipelines faster and manage more of them, because they have this nice UI to understand what fails, retrigger things, and do change management in a decent, or half-decent, way. I think dbt elevates that to the next level for SQL-specific use cases, making it really easy for anyone who knows a little bit of SQL and a little bit of git to start producing datasets and scheduling them to run every day. The downside of making it so accessible is that, if it's really easy to produce a dataset, people can create many datasets per day. There's a somewhat controversial person on Twitter, Lauren, who talks quite a bit about this explosion — there's a sort of conspiracy theory about Snowflake making money off of dbt empowering people to create these mountains of SQL. I certainly don't think it's a conspiracy, but there is something going on: making it easy for more people to do certain things brings a new wave of people who may not be experts in the art of, say, data modeling, and that can lead to technical debt being created faster than there's any hope of ever fixing it.
And specifically about dbt: something that's very different from Airflow is that Airflow has a very scheduler-centric approach — this job runs every day — so it becomes natural to say, I'm going to process a day's worth of data every day, and if I need to reprocess a day, I can go reprocess just that day. dbt has a simpler approach: write code that either rebuilds the whole table every run, or set up incremental loads where the approach is catch-up — write code such that, if you're in incremental mode, it figures out what range to run. It looks like a compiler; it is a build tool. That makes it maybe a little bit easier to author these jobs, but it also makes it extremely easy to be lazy and just refresh the whole table every day. So when you run `dbt run` in a lot of places, it rebuilds the entire data warehouse every day.
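A minimal sketch of the incremental pattern in dbt, assuming a hypothetical visits model; the `config`, `is_incremental()`, `source()`, and `{{ this }}` constructs are standard dbt, while the model, source, and column names are invented for illustration:

```sql
-- models/fct_visits.sql  (hypothetical model and source names)
{{ config(materialized='incremental', unique_key='visit_id') }}

select
    visit_id,
    user_id,
    event_ts
from {{ source('app', 'raw_visits') }}

{% if is_incremental() %}
  -- on incremental runs, only pick up rows newer than what is already loaded
  where event_ts > (select max(event_ts) from {{ this }})
{% endif %}
```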
That works for a startup. It probably doesn't work once you start having a little bit more data; then you have to go back and retrofit incremental loads into that world. So with this new wave of analytics engineers — people who know SQL and want to visualize data — how much do they know and care about data modeling? A little bit less than I would like, from what I've observed personally. It would be better if everyone knew a little bit more about data modeling and had a little more rigor around how they think about creating, managing, and maintaining these datasets.
[00:32:58] Unknown:
Another interesting aspect of the modeling question is: who are you doing it for? In the Kimball era, it was very much for the DBA and the business intelligence engineer, maybe working together, to say: okay, this is the report I'm trying to build, so these are the facts and dimensions that I need, and this is the way I need the data transformed to be able to write this report. Six months later: okay, I've got a new report, so we're going to go back and update the set of tables. Whereas now there isn't that clear connection between the way the data is coming in, the way it's being transformed, and the way it's being used. There are sort of interfaces between each of those stages of: okay, I've done my job, I don't know what you're going to do with it, but good luck.
[00:33:36] Unknown:
But there's no contract, right? Is this private? Is it public? Who uses it? It's like: I have baked a thing, and you can build on top of it — you don't have to let me know if you do. And change management is just hard in data, too. I don't think we respect the problem of change management well in data engineering. We just put things out there, people latch something else on top of them, and it's SQL all the way down.
[00:34:04] Unknown:
And to that point as well: dimensional modeling and star and snowflake schemas are very technically oriented. There's a lot of detail to them, and once they're in place they're very valuable, if you know how to use them and what you're trying to get out of them. Going back to the entity-centric modeling that you're proposing, what is the play in saying: okay, I'm going to build this entity-centric model so that now it's easier for you to self-discover what you need to know and the ways you're going to use it? What are some of the downstream benefits of the entity-centric model in terms of building that interface between the data engineer or analytics engineer who constructs it and the downstream consumers of that data?
[00:34:49] Unknown:
First, going back to who the star schema was serving — who did we data model for in the past? We certainly weren't catering to a SQL-savvy organization where people would connect to the database and run SQL. We did it in service of the BI tool. That is really clear to me: star schemas, dimensional modeling, and snowflake models were there to power a BI tool that had the semantic layer. And I guess now we're getting into the semantic layer, which I think we have to. Historically, tools like BusinessObjects, MicroStrategy, SQL Server Analysis Services, even a tool like Looker today, expect to work with fairly complex schemas or datasets. They typically won't do super well with a third-normal-form-type model, but they do very well with these star schemas. Now, do humans do well with star schemas? That's a question too.
Let's stay on the semantic layer for a moment. If you build the schema or data model in service of the BI tool, that assumes you have to put this semantic layer on top, and the promise of the semantic layer is self-service. In tools like BusinessObjects, MicroStrategy, or Looker, you can drag and drop these business objects — your dimensional attributes and your metrics — and the tool will figure out how to query your underlying complex star schema to serve that result set, or to eventually power your visualization or dashboard.
Now, there are some big problems with the semantic layer. First — and I wrote a different blog post on this one, which I encourage people to read if they're interested in the semantic layer — I wrote something about dataset-centric data modeling. Entity-centric builds on top of dataset-centric. Dataset-centric says that, in general, the interface for consumption — whether it's a BI tool, a data scientist, or someone writing SQL — is a dataset, or at least it's really great when it's a dataset. It's easy for people to reason about tabular, flat tables. It's nice not to have to figure out which joins to make when you don't have to. That's some of the idea behind entity-centric data modeling too: if you have everything about a customer in a customer table — including metrics, activity, engagement, likelihood to churn — then you're in a good place, because you only have to understand the table you're working with, the columns it has, and their definitions.
Otherwise, you have to go to a map to figure out what you need to join to augment this data. So, getting a bit deeper into serving the BI tools: most organizations have multiple BI tools, for a variety of reasons we could or could not get into — there are different generations of BI tools, different personas in your organization who prefer different things, new BI tools coming on the market (there's a great open source one called Superset; you can start using it today).
So people accumulate multiple BI tools, sometimes through acquisitions, or because different business units want different BI tools for whatever reason. And the semantic layers are not universal and open source — they are proprietary to the BI tool you use. They are fairly complex and hard to manage. I talked a little bit about change management before: change management in data engineering is really hard in the data warehouse and the transform layer, and it's also hard in the semantic layer. Keeping multiple BI vendor tools in sync with your evolving schema is quite a challenge, and it's duplicative work. So if you put a lot of the "how do you query this star schema" logic in the semantic layer, and you have multiple semantic layers to manage, you're a bit in trouble. So I argue that, in this era where we don't really have a good, universal, reusable semantic layer across tools,
it's better to put more of that logic in the transform layer and to denormalize further. Then, in terms of who you are serving: in a perfect entity-centric-modeling world, you have these very rich, very comprehensive entity-centric datasets that can be used and reused across BI tools, by different personas, by people who write SQL, without them having to really understand complex underlying data models. So, say I want to do a customer-bound analysis. I pull the customer table, and great: it has 200 well-documented columns, in different groups, with really clear naming conventions and descriptions.
It has everything I need to do really deep analytics on my customers: give me the customers who are likely to churn, who haven't been active in the past 90 days, who acted in response to this ad campaign — whatever it may be — and have that at your fingertips, or SQL-tips. That can then be used as a source for your Superset dashboard, or as a Tableau extract, or by a data scientist in a notebook, and it can be much more comprehensive, because tabular datasets are just easy and natural to reason about.
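That kind of targeting query ends up looking something like the sketch below — written against a hypothetical wide customer table, with every column name assumed for illustration:

```sql
-- Segmentation/targeting with no joins: just filters on well-named columns
-- of a wide, entity-centric customer table.
SELECT customer_id, email
FROM analytics.dim_customer_ec
WHERE churn_risk_score > 0.7        -- assumed ML-derived column
  AND l90_visits = 0                -- inactive over the past 90 days
  AND campaign_x_responder          -- engaged with a previous ad campaign
ORDER BY lifetime_value DESC;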
[00:40:59] Unknown:
And so, for people who are trying to figure out which modeling strategy to use, or who are in that situation of "I've got a whole bunch of source tables and a whole bunch of derived tables, but there's not really any rhyme or reason to it": what are some of the tactical aspects of iterating toward a workable entity-centric domain model that they can build, maintain, and evolve for their organization?
[00:41:25] Unknown:
Yes. I would say your warehouse today probably already has your core business entities very well represented in somewhat flat tables. You probably have some sort of central, or master, customer table. Or, if your business is in supply chain, you have a probably very rich warehouse table somewhere that has the bulk of everything you should know about a warehouse. If you don't have one, then I would argue you should build one. As for how you build it — whether you use a staging area and work tables in a back-room-and-front-room-type approach, whether you do it in Airflow or dbt or ELT or Spark — I would say stick to your internal practices; that's very cultural and unique to your organization.
If you have a blank slate — and I don't think many people do; who on the call here has a blank slate and gets to define the nomenclature and data modeling practices for their organization? That happens very few times, and it doesn't last very long. But that would be an interesting question too: if you have nothing today, where do you start? We could do an episode on that. But assuming you already have practices in place, tooling in place, and a reflection of your core entities somewhere in a queryable place, the question is: how do you augment it to be even more useful?
Some of the ideas behind entity-centric modeling are that, yes, you can and should bring metrics — numerical metrics — into your dimension tables, and we can talk about how to do that; I touch on it in the blog post, but it's fairly complicated and abstract. The big question is how you bring time into that. Fact tables are usually a combination of foreign keys, an event, some metrics, and one or more time dimensions. Here, in the entity-centric model, we're in a table that has a really clear grain: one row per entity instance. In the customer entity-centric table, there's one row per customer.
Now, there are probably dozens of metrics that are very key and relevant to qualifying that customer. How do you bring in the time dimension and columnize it? That's an interesting question that I talk about in the blog post. The first step is to bring in what I call time-bound metrics. Say, for a user table, visits or certain types of actions seem really important. The time dimension for that is: well, I can bring in the total number of visits this user has had. It could be me on your podcast website — I've visited, in total, fifty times over the past ten years, almost a decade now. That's a metric; I would call that life-to-date visits. But then it's really interesting to get something like 7-day visits, 28-day visits, 90-day visits. So then, for each user in this case, you have a new set of columns — one might be called, say, 90-day visits. You probably want something similar with time-pivoted listens: complete listens, partial listens, average completion — does this user typically listen to whole episodes or just snippets, how much do they fast forward, how many times did they fast forward on their last listen?
So you have to think about which metrics are most useful in relation to that entity — in this case, say, visits, listens, and percentage completion of a listen — and you have to think about which time frames are relevant to your business and time analysis: typically 7 days, 28 days, 90 days, 1 year, life to date. Then start computing and pivoting those columns. One thing I talk about in the blog post is that having all of these rich, useful time-bound metrics makes a whole new class of analysis very natural and simple. If you have to join to multiple fact tables and do intricate time filtering to gather the data to run an analysis, you might not bother. You might not bother doing a complex join to get the number of, say, 28-day listens for that user.
But if they are there at your fingertips, all of a sudden it becomes really easy and natural to do segmentation, engagement analysis, and targeting: give me the users who have not listened to any of my podcast episodes in the past year but have demonstrated interest in something else that's relevant, and use that for targeting — I'm going to send them this email to re-engage with this episode. So these tables have a mix of, in the case of users, user attributes — the traditional user-demographics-type things — enriched with a bunch of metrics that are very relevant to your business and timely too, right at your fingertips. And that can be extremely useful.
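Here is a sketch of how those time-bound columns might be computed from a listens fact table, in BigQuery-flavored SQL; the table, column, and metric names are hypothetical, and the exact date functions vary by engine:

```sql
-- Pivot time windows out of a fact table into one row per user.
SELECT
  user_id,
  COUNT(*)                                                          AS ltd_listens,
  COUNTIF(listen_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY))  AS l7_listens,
  COUNTIF(listen_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 28 DAY)) AS l28_listens,
  COUNTIF(listen_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)) AS l90_listens,
  AVG(pct_completed)                                                AS ltd_avg_pct_completed
FROM analytics.fct_listens
GROUP BY user_id;
```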
[00:47:33] Unknown:
And as an extension to that exercise, what are some of the specific elements that introduce added complexity — or maybe it's all the same thing — when you're working across multiple sub-organizations or business units within a larger organization? I'm thinking particularly of enterprise users, where they have multiple different entities, or categories or bucketings of entities, that they then want to stitch together across the entire organization to get a lateral view of everything, at maybe a higher level.
[00:48:09] Unknown:
Yeah. In some ways, this type of data modeling enables you to have more of a 360 view of the entity — I hate that term, it sounds so marketing-y — but okay: you take a really core entity, the user. Of course, there are some metrics that matter most to, say, the marketing department, or the sales department, or the product department — how are people engaging with the product, which features are they using? But I would argue that, in the end, they're all useful features of the user that may or may not predict behavior, and may or may not be relevant to another department. In the context of a feature store: cram more features in there; they may or may not be useful for the behavior you're trying to predict through machine learning. One clear challenge, on the consumption side, is that if you have hundreds and hundreds of columns for a user, then you have to ask: which of these columns actually make sense, are usable, are fresh?
There are some challenges around latency too, because now you have to wait for the entire data warehouse and all the facts to come together before your user table is ready. And then there's a documentation issue: what does the column called m_ltv_, you know, visit qda mean? What is qda? So there's a documentation issue around how you document these metrics so that they are actually usable by people. There are also some potential performance issues. With column stores, a super wide table doesn't matter that much, but in the Parquet-file world you can end up with Parquet chunks that don't have a lot of rows because they have a lot of columns, so there are some techniques to apply there. The latency issue — the spaghetti issue — is that now your dimension table in your DAG depends on your fact tables in your DAG, and the fact tables depend on other things, so you can get into circular-dependency-type challenges. The complexity of building your user table is no longer just "get all the demographics and be done with it"; it's "build all of your fact tables, pivot time, build a hundred metrics, and cram them into your user table." That brings an ETL-complexity set of challenges too. In the blog post, I talk about some of these things. I talk about what I call vertical partitioning.
That's one idea: different groups of attributes can actually be stored in different user tables. So you might have a table called, say, dim_user_extended with all the fields, but you might also have dim_user_marketing, or other tables with different subsets or groups of columns. There's probably a little bit more to it, but I also talk in the blog post about how to deal with the circular dependency issue, and the idea there is to have an extra layer in the warehouse. People are familiar with the idea of a staging area: that's where you receive all your data, very raw, from the different systems, mostly unprocessed — raw ingredients — the same way that a staging area in a physical warehouse is just where the trucks unload. Then there's the back room and the work tables; that's where you would do a bunch of entity-bound work and fact-bound work.
And then you have — I forget if I called it the shuffle layer in the blog post — a layer where you pick from those premade assets and merge, call it, entity attributes and facts together. There's no perfect way to do it, and it's a little bit at your discretion, given your style and practices. But, yeah, those are definitely some of the challenges that come with this approach. One thing I'll say is that progressive adoptability of any methodology is really important. You're not going to go from "we do star schemas now" to "we're moving to entity-centric modeling; we have to rewrite the whole warehouse." No. It's more like: okay,
at first, how do we sprinkle some more metrics and nested structures into our dimension tables? Now that we know it's okay to do it, where do we start? What are the top three most useful metrics for this business? Let's just start with those.
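Here is a rough sketch of the vertical-partitioning idea described above, with hypothetical table names: narrower per-group tables, plus an extended view for consumers who want everything in one place:

```sql
-- Column groups live in narrower per-entity tables...
CREATE TABLE dim_user_core       (user_id BIGINT, country VARCHAR(64), signup_date DATE);
CREATE TABLE dim_user_engagement (user_id BIGINT, l7_visits INT, l28_visits INT, l90_visits INT);
CREATE TABLE dim_user_marketing  (user_id BIGINT, acquisition_channel VARCHAR(64), ltd_ad_spend NUMERIC(18, 2));

-- ...and an extended view stitches them back together.
CREATE VIEW dim_user_extended AS
SELECT c.user_id, c.country, c.signup_date,
       e.l7_visits, e.l28_visits, e.l90_visits,
       m.acquisition_channel, m.ltd_ad_spend
FROM dim_user_core c
LEFT JOIN dim_user_engagement e USING (user_id)
LEFT JOIN dim_user_marketing  m USING (user_id);
```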
[00:53:06] Unknown:
Absolutely. And the entire concept of building the table structures around these different entities also brings up the question of entity extraction, which is a whole other episode. But in your experience of working through this problem space — documenting these concepts of entity-centric modeling, working with the customers you have at Preset who are trying to drive these different insights and visualizations of their data based on the domain structures they've built up in their warehouse — what are some of the most interesting or innovative or unexpected ways that you've seen people approach this practice of entity-centric modeling?
[00:53:45] Unknown:
Yeah. So we have limited visibility as to, like, exactly, like, what our customers and users do, you know, at preset, like, in and for good reasons. Right? Like, so we we cannot get into your, you know, your private data warehouse in any way. 1 thing I can talk about is how I've seen I would say, over the past decade, like, since the beginning of the the Hadoop era, how people who didn't come from a background of, say, star schema or dimensional modeling have been modeling their table on some some interesting use cases that we've seen emerge over the years. But I think when I first saw the what I call the time bound metrics was at Facebook in a table called dim user. And dim user was the most used table in the entire data warehouse of, of all of Facebook. And this dimension table, like, had 100 by 1,000 of columns that had to do with, with their behavior on Facebook or, like, basically, everything related to user was crammed back in to this or not not everything was crammed back into this. But, like, things like so say, if you work on the photos team, you might wanna know how many photos does this person have uploaded over the past 28 days. Right? Are they a tagger? Do they tag people in photos?
Do they use search? And on what time frame? So we started seeing a lot of columns like metric_7_day, metric_28_day, metric_90_day. I was part of the data engineering team, and someone on my team was working on this table. It's so large, so useful, and so multidimensional that it had a dependency on just about everything else in the warehouse, and everything depended on it. It became a very, very key hot spot or focal point in the warehouse, both in what it took to build it and in how much it was used downstream.
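As an illustration of those time-bound metric columns, here is a hedged sketch of how a fact table can be folded into per-user 7-, 28-, and 90-day counts; fct_photo_uploads and all column names are made up for the example.

```sql
-- Hypothetical fact table: fct_photo_uploads(user_id, uploaded_at).
SELECT
  u.user_id,
  COUNT(CASE WHEN p.uploaded_at >= CURRENT_DATE - INTERVAL '7'  DAY THEN 1 END) AS photos_uploaded_7d,
  COUNT(CASE WHEN p.uploaded_at >= CURRENT_DATE - INTERVAL '28' DAY THEN 1 END) AS photos_uploaded_28d,
  COUNT(CASE WHEN p.uploaded_at >= CURRENT_DATE - INTERVAL '90' DAY THEN 1 END) AS photos_uploaded_90d
FROM marts.dim_user AS u
LEFT JOIN fct_photo_uploads AS p
  ON p.user_id = u.user_id
 AND p.uploaded_at >= CURRENT_DATE - INTERVAL '90' DAY
GROUP BY u.user_id;
```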
But once you have that at your fingertips, the really interesting thing is that cohort analysis and segmentation become very natural. You don't need to go figure out where the data mart for photos and photo uploads is and how the heck you're going to join it. It's more like: okay, now I have a table with one row per user, and the bulk of everything I might want to know about this user is there at my fingertips. So I can start asking for the users who did this and did not do that, in a certain time frame or in a different time frame. There's a lot you can do there. One other thing that existed there and was very much a pattern — something I talked about in my functional data engineering blog post a while ago — is the idea of snapshotting dimensions. That means you don't only enrich your dimension with things like time-bound metrics and nested structure.
And we could get into nested structure a little bit more — what some examples are and why it's relevant. But not only do you have a super rich entity-centric table — dim_user at Facebook, with billions of users and a million attributes about them — we would also snapshot these dimensions. Basically, every day a new partition is added to the dim table with all of the users. So if today you have 10,000,000,000 users, yesterday's partition has slightly fewer, and so on. Then you can do time series analysis on these rich tables too. That's something I do talk about in the blog post, and in general I think it's a good practice: if you treat your datasets as immutable, something that's very natural when you build your dimension is to create a new snapshot of your dimension table every day. And when you have that table, it becomes really natural to do time series analysis on your populations.
Right? So here you could say: give me the percentage of users who visited more than 6 days over the past 7 days, and how that's evolving over time. There you would use a metric called L7 visits, which represents the number of visits over the past 7 days. You could ask: what percentage of my users visited more than 6 out of the past 7 days? Okay, I can give you a nice big percentage. But how is that evolving over time? You get that through the idea of snapshotting your dimensions — being able to do time series on your dimension table. That one is a little more controversial; you don't have to do it if you do entity-centric data modeling, but I think it's just a great practice. And once you do it, of course, there are storage and compute costs associated with it, because now you're storing a whole snapshot of your whole dimension every day. But the empowerment in terms of the kind of analysis you can do very easily on these tables is really, really powerful.
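A sketch of the snapshot pattern and the L7-style trend query described here; dim_user_snapshot, the ds partition column, and l7_visits are hypothetical names.

```sql
-- One full copy of the dimension is appended per day, partitioned by ds.
-- Hypothetical table: dim_user_snapshot(ds, user_id, l7_visits, ...).

-- How is the share of highly engaged users (>= 6 visits in the past 7 days)
-- evolving over time?
SELECT
  ds,
  AVG(CASE WHEN l7_visits >= 6 THEN 1.0 ELSE 0.0 END) AS pct_engaged_6_of_7
FROM dim_user_snapshot
GROUP BY ds
ORDER BY ds;
```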
[00:59:01] Unknown:
And in your experience of working in this space and trying to distill the different requirements and use cases and capabilities around data modeling, what are some of the most interesting or unexpected or challenging lessons that you've worked through?
[00:59:16] Unknown:
Yeah. Well, I think the thing that's really difficult with data modeling is that the best practices are not clear. They tend to emerge culturally within an organization. Now everyone's been invited to the party, and there's a wide spectrum from amateurs trying to figure it out all the way to experts who really know their art, and right now we're all working together. I've seen people make gigantic mistakes in terms of just basic SQL writing and data modeling logic — joining on a field that's not unique, doing these Cartesian joins — and they're just not aware that it's going to be a problem. So there are some very basic mistakes that get made. There are also people who don't know Git. We've seen people who kind of git push straight to master in certain areas; they use a Git GUI and don't really understand how Git and branching work. So if you look at the distribution from amateur to expert, there are a lot more amateurs now who are empowered, and that comes with empowerment, but also some risk and some issues. That's definitely an emerging pattern. And then in terms of how to teach data modeling, how to educate people on it, it's kind of unclear, because there's no bible you can point to, really.
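For readers unfamiliar with the fan-out mistake mentioned above, here is a minimal hypothetical example: joining on a key that is not unique silently duplicates rows and inflates totals.

```sql
-- Hypothetical tables: orders(order_id, user_id, amount) and
-- user_emails(user_id, email), where one user can have several emails.

-- Inflated: each order repeats once per email address the user has.
SELECT SUM(o.amount) AS total_revenue
FROM orders AS o
JOIN user_emails AS e ON e.user_id = o.user_id;

-- One fix: collapse the right side to one row per user before joining.
SELECT SUM(o.amount) AS total_revenue
FROM orders AS o
JOIN (SELECT DISTINCT user_id FROM user_emails) AS e
  ON e.user_id = o.user_id;
```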
So people have to look at how other people do it, do a little code review, and try to figure it out. It's as if we had invited a bunch of people into software engineering who barely know how to code and barely know how to use Git, and that comes with a set of challenges. The challenge is education, in a context where there's not a lot of material to train people. I guess the dbt community has empowered a lot of that — people can get a lot of resources from, say, the dbt community, the dbt Slack, and the dbt documentation.
But there's still a gap in making people experts, and there's tech debt. To me, even for a seasoned data engineer who's really good at data modeling, it still feels like data pipelines are tech debt from day 0. So now we're dealing with a much bigger issue at scale: these mountains of SQL that are growing and brittle. And I don't know if we'll ever dig ourselves out of that hole.
[01:02:17] Unknown:
And for people who are working through these exercises of building up their data architecture and working in their data warehouse, what are the cases where entity-centric modeling is the wrong choice and they're better suited with a standard star schema, activity schema, wide tables, or just no modeling at all?
[01:02:39] Unknown:
Yeah. No modeling at all — well, I think there's a bunch of practices that we used to do that don't make sense anymore. I would say don't start with entity-centric. Then again, I think anything you're going to do is going to be in some way entity-centric: you're going to look at what your business objects are, what your business entities are, and how to make sense of them. One thing that's a bit unrelated to modeling per se, but that's happening with the rise of Fivetran, Airbyte, Meltano, and other data sync tools, is that we don't write data syncs anymore. We inherit comprehensive models from these tools, so at least you don't have to data model the staging area — it just lands in your data warehouse.
I think one mistake people tend to make, and a bit of a pitfall, is to go straight from that data sync to a report: what is the SQL I can run to create the report directly from my raw tables? I think we've all figured out over the years that you need some sort of middle layer there, where you go from the staging area into pre-transformed, cleaned-up data, with a bit of a layered approach to your warehouse. So I would say don't go straight from your raw tables into entity-centric tables or into reports. Think about a first layer that's probably something like a star schema to start with. Thinking about facts and dimensions in that first, early layer is good, and then how you bring your facts and measures into your dimensions is maybe a concern for once you've solved the earlier work of modeling your facts and dimensions.
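A hedged sketch of that layering — staging on top of the raw sync, then facts and dimensions, with reports built only on the modeled layer; raw.payments_sync and every column name here are hypothetical.

```sql
-- 1) Staging: light cleanup of the raw synced table (hypothetical names).
CREATE VIEW staging.stg_payments AS
SELECT id AS payment_id,
       user_id,
       amount_cents / 100.0 AS amount,
       created_at
FROM raw.payments_sync;

-- 2) Facts and dimensions first.
CREATE TABLE marts.fct_payments AS
SELECT payment_id,
       user_id,
       amount,
       CAST(created_at AS DATE) AS payment_date
FROM staging.stg_payments;

-- 3) Only then fold measures into entity-centric tables and build reports
--    on top of the modeled layer, never directly on raw.* tables.
```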
[01:04:37] Unknown:
And as you continue to work in this space and try to distill these various ideas and workflows and use cases, what are some of the predictions that you have for the future direction and adoption of entity-centric modeling, or maybe some other modeling techniques that are waiting in the wings to gain dominance?
[01:05:00] Unknown:
Yes. I would say I like the idea of — I believe it's called — activity schema. I think there is a place for these generic metrics tables. We've talked about the metrics layer in the past, and the feature repository. As much as what we're talking about today with entity-centric modeling is an entity-bound table with a lot of columns, there's the question of how you get to a lot of columns. I think somewhere in your data warehouse you need more of a centralized metrics table, and that one would be a very thin table that is very long, very tall — by tall, I mean the shape of that table is not wide but narrow. What's typically in those tables is something like: what is the time, what is the metric name, what is the entity type, the entity ID, and what is the value for that metric. These are generic schemas that allow you to store things like events or metrics over time in a very thin way without having to change your schema. We see that quite a bit, called a metrics repository or a feature repository; it's pretty common to have that table that is metric-bound, entity-bound, and time-bound. So to repeat it, it's a very common pattern: you have a table with something like entity type "customer", an entity ID, metric "visit", a value, and a time. It reads as: on that date, this entity, for this metric, has this value. Once you have these very generic tables, you can do a lot of that pivoting I was talking about before.
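A sketch of that thin, tall, metric-bound / entity-bound / time-bound table and the one-scan pivot into entity-centric columns; metrics_repository and all names are hypothetical.

```sql
-- One row per date, entity, and metric (hypothetical schema).
CREATE TABLE metrics_repository (
  ds           DATE,
  entity_type  VARCHAR,   -- e.g. 'user'
  entity_id    VARCHAR,
  metric_name  VARCHAR,   -- e.g. 'visit'
  metric_value DOUBLE
);

-- Pivot into wide, time-bound columns for one entity type in a single scan.
SELECT
  entity_id AS user_id,
  SUM(CASE WHEN metric_name = 'visit'
            AND ds >= CURRENT_DATE - INTERVAL '7'  DAY THEN metric_value END) AS visits_7d,
  SUM(CASE WHEN metric_name = 'visit'
            AND ds >= CURRENT_DATE - INTERVAL '28' DAY THEN metric_value END) AS visits_28d
FROM metrics_repository
WHERE entity_type = 'user'
GROUP BY entity_id;
```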
That's more like: okay, now I want to compute 7-day visits for all my users and 28-day visits for all my users, and I want to do that in one big scan. Once you have this table, it becomes much more natural to pick and choose how you're going to bring metrics into your dimensions. That pattern is close to what I believe activity schema is about, and — to loop back to what we were talking about at the beginning of the episode — that may be a place where data scientists and data engineers can collaborate and share definitions upstream for their own pipelines, or agree on what all the metrics are that we track for each entity type, and on what time frames. There's probably much more to be written there. Or maybe there's a question of how you take something that's activity schema and produce entity-centric tables off of it. I want to say too, there's one thing we didn't talk about: I said earlier in the episode that I wanted to legitimize bringing metrics and time-bound metrics into dimension tables.
Another thing I think is important is allowing people to bring nested, complex data structures into dimension tables too. Going back to the podcast listener table we were talking about: maybe it does make sense to have an array of the last 5 episodes the user listened to, for whatever reason. If you wanted to do that, you're not going to create one column called "last episode they listened to" and another called "previous episode they listened to." And if you want to track what percentage of each episode they listened to — the episode ID, the episode name, and what percentage of the episode they listened to — well, now you'd have 3 columns times 10 episodes to manage dynamically. So then it's like: this is really valid, really core information about the user, the last 10 episodes, and I want to put it in the table. Well, put an array of structs in there.
And it will be easy to retrieve with a little bit of a JSON-extract-type function, or whatever your database of choice offers to query it. You can also do an unnest-type operation — you've seen those in SQL — where if you have an array of 10 things, you can pivot that into 10 rows in your query. People know how to do that now; it comes naturally to software engineers to work with data structures and pivot things. So don't necessarily abuse it — you don't want to put all the facts about all the users in as nested structures — but it's something you can do for things that are best represented that way.
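A sketch of the nested-structure idea, assuming a BigQuery-style dialect with ARRAY and STRUCT types; dim_listener and the field names are hypothetical.

```sql
-- The last episodes a listener played, kept as an array of structs
-- on the dimension row itself (hypothetical table).
CREATE TABLE dim_listener (
  user_id       STRING,
  last_episodes ARRAY<STRUCT<episode_id   STRING,
                             episode_name STRING,
                             pct_listened FLOAT64>>
);

-- Unnest the array back into rows when you need to query inside it.
SELECT d.user_id,
       e.episode_name,
       e.pct_listened
FROM dim_listener AS d,
     UNNEST(d.last_episodes) AS e
WHERE e.pct_listened > 0.9;
```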
[01:09:55] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, as usual — I don't know how many times you've answered this — from your perspective today, what do you see as being the biggest gap in the tooling or technology that's available for data management?
[01:10:17] Unknown:
Yeah. Maybe I'll stay with the theme of the episode, since we talked about this. I would say education around data modeling. It'd be great if we had a data modeling 101, and maybe that's somewhat provided in and around, say, the dbt community. Even if it was just the 30 things you should know before you go and pile your SQL onto the mountain of SQL that your organization has. I think there's much more to it than data modeling, but data modeling is a component of it. Things like: if you work with SQL and create datasets, you should know what normalization and denormalization mean. You should know what a dimension and a fact are — that vocabulary.
And there's a bit of a gap there. If you talk to 10 different dbt users, they might not use the same nomenclature, practices, and conventions. So we need to zero in on a set of knowledge that people agree on, or even just definitions and terms that everyone close to that area knows.
[01:11:38] Unknown:
Alright. Well, thank you as always for taking the time today to join me and share the work that you're doing and your thoughts on this overall space of domain modeling, the benefits of entity-centric approaches to it, and the ways that our newer compute substrates enable more detailed approaches to data exploration. So I appreciate you taking the time, and I hope you enjoy the rest of your day. Yeah, it was awesome to be on the show as always, and I'm sure I'll be back soon enough. So enjoy your day too.
[01:12:13] Unknown:
Thank you for listening. Don't forget to check out our other shows: Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it: email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction to Entity-Centric Data Modeling
Defining Entities in Data Modeling
Machine Learning vs. Analytical Use Cases
Competing Data Modeling Approaches
Impact of Tooling on Data Modeling
Building and Evolving Entity-Centric Models
Challenges in Multi-Unit Organizations
Innovative Uses of Entity-Centric Modeling
When Entity-Centric Modeling is Not the Right Choice
Future Directions in Data Modeling