Summary
For business analytics, the way that you model the data in your warehouse has a lasting impact on what types of questions can be answered quickly and easily. The major strategies in use today were created decades ago, when the software and hardware for warehouse databases were far more constrained. In this episode, Maxime Beauchemin of Airflow and Superset fame shares his vision for the entity-centric data model and how you can incorporate it into your own warehouse design.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack
- Your host is Tobias Macey and today I'm interviewing Max Beauchemin about the concept of entity-centric data modeling for analytical use cases
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what entity-centric modeling (ECM) is and the story behind it?
    - How does it compare to dimensional modeling strategies?
    - What are some of the other competing methods?
    - Comparison to activity schema
- What impact does this have on ML teams? (e.g. feature engineering)
- What role does the tooling of a team have in the ways that they end up thinking about modeling? (e.g. dbt vs. Informatica vs. ETL scripts, etc.)
    - What is the impact of the underlying compute engine on the modeling strategies used?
- What are some examples of data sources or problem domains for which this approach is well suited?
    - What are some cases where entity-centric modeling techniques might be counterproductive?
- What are the ways that the benefits of ECM manifest in use cases that are downstream from the warehouse?
- What are some concrete tactical steps that teams should be thinking about to implement a workable domain model using entity-centric principles?
    - How does this work across business domains within a given organization (especially at "enterprise" scale)?
- What are the most interesting, innovative, or unexpected ways that you have seen ECM used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on ECM?
- When is ECM the wrong choice?
- What are your predictions for the future direction/adoption of ECM or other modeling techniques?
Contact Info
- mistercrunch on GitHub
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
- Entity Centric Modeling Blog Post
- Max's Previous Appearances
- Apache Airflow
- Apache Superset
- Preset
- Ubisoft
- Ralph Kimball
- The Rise Of The Data Engineer
- The Downfall Of The Data Engineer
- The Rise Of The Data Scientist
- Dimensional Data Modeling
- Star Schema
- Database Normalization
- Feature Engineering
- DRY == Don't Repeat Yourself
- Activity Schema
- Corporate Information Factory (affiliate link)
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- Rudderstack: ![Rudderstack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png) Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at [dataengineeringpodcast.com/rudderstack](https://www.dataengineeringpodcast.com/rudderstack)
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack.
Your host is Tobias Macey, and today I'm interviewing Max Beauchemin about the concept of entity-centric data modeling for analytical use cases. So, Max, for people who haven't heard any of your past appearances on the show, can you just give a bit of an introduction?
[00:00:55] Unknown:
Yeah, for sure. I was trying to think how many times I've been on the show before, and how many times we're planning to get me back on, so this is part of a series of what will be multiple dozens of episodes that we'll do together. A little bit of background on me: to this community and audience, I'm probably best known for being the creator of Apache Airflow. I started that project while I was at Airbnb in 2014. A little bit after that, also while at Airbnb, I started Apache Superset. Apache Superset, for those not familiar with it, is an open source business intelligence, data visualization, and dashboarding web application. You can download it, install it, and run it in your company just like you would run Tableau or Looker. People in the audience are probably familiar with Superset, but it has gotten really, really good over the past year or two. So if you haven't checked it out since I first started talking about it — around 2017 or 2018 is when we launched it — it is good now. It's really solid, really competitive, better than some of the vendor tools in many, many regards, especially for a technical audience. For people who are more technical, who know SQL, the tool is designed more for the modern data practitioner, so I invite people to try it. I won't do too much of a commercial bit, but if you want to try Apache Superset, the best way is on Preset. I am a startup founder now: I started a company called Preset around Apache Superset, and the goal there is to offer a really awesome managed service around Apache Superset, with bells and whistles.
It's all managed, and there's a freemium component, so you can try it for free today. You can also go use the open source version if you want to, but the easiest way to try the latest and greatest version of Superset is on preset.io — check it out. So, founder now — I guess I'm more of a founder — but very much an open source enthusiast. I love writing code, building communities, and, more recently, building a company that can fuel open source. The idea is much more about how we create something sustainable around the open source project, so we can be a little bit the parent, or the foster parent, for the open source project, do all the difficult things, and really help the community. For me, it was about how business can help open source disrupt, in this symbiotic kind of way, much more than "let's extract the money out of this community." That's really not the goal. The goal is to have this symbiotic capital infusion into open source, helping open source to disrupt and eventually dominate in business intelligence.
[00:03:59] Unknown:
Yeah, I definitely see Superset all over the place these days. Pretty much everywhere I look, if there's any sort of open source aspect to a stack, Superset is usually involved, so definitely great work on that. And, just as a brief recap for people who haven't listened to any of your past appearances, how did you first get involved with working in data?
[00:04:20] Unknown:
Yeah. You know, we could look back at the topics we've talked about together as a series of my shifting and ever-evolving interests over time. But I first started a while ago — you could tell, if we were on video, how much gray hair I have at this point. My career started in the early 2000s. I did one year of web development, which is a great thing: regardless of what your engineering career is going to look like, being able to build little websites and little tools is super helpful. So I did a year of that early on, and then I got into data warehousing, pretty much. I was at Ubisoft, a company that probably doesn't need to be introduced anymore at this point. There I started doing traditional data warehousing. I had taken some classes in school — I'm a college dropout, but I had an IT program, the equivalent of an associate degree, where I learned some SQL, some data modeling, a little bit of everything in terms of building apps and coding (I guess we didn't call them apps at the time). So I had done some SQL and some data modeling, then got into data warehousing, picked up the Ralph Kimball books at the time around dimensional modeling, and started building a data warehouse for Ubisoft: all of their financial data, retail data, pretty much covering the realm of the data they had at the time.
And that was the start of a career in everything data, pretty much. I was the data warehouse architect, I was the business intelligence engineer. And then, later on in my career, I moved to Silicon Valley and joined companies like Facebook, Airbnb, and Lyft, and that's where I started working on open source and doing what we call data engineering today, in a more modern way.
[00:06:35] Unknown:
Yeah. And a lot of people have credited you as being, to some extent, the godfather of data engineering as it's defined today, given your well-timed blog posts on the rise and the downfall of the data engineer, which roughly coincided with the start of this podcast.
[00:06:52] Unknown:
Yeah, I remember — that was one of our first conversations together. And, you know, it's not like I invented data engineering, but I wrote the blog post that tried to define what it was and what it was not at the time. For me, my experience was that I joined Facebook as a data warehouse architect, and by the time I left a few years later, I left as a data engineer. We realized while we were there that what we were doing was not data warehouse architecture or business intelligence. It was a whole new discipline, where different skills were leveraged in different ways, and the approach was completely different.
So at some point, internally at Facebook, we decided to call ourselves data engineers. And I think I was inspired by an influential blog post called The Rise of the Data Scientist, or something like that — I dug that out — which was pretty critical to cementing and defining that term and creating that new role and persona. I thought, hey, I'd like to do that for data engineering. I wrote that blog post, and, yeah, the rest is a little bit of history.
[00:08:12] Unknown:
Bringing us now to the topic at hand for today brings us around to another one of your blog posts — writing is definitely one of your strong suits, and I appreciate all of the effort you put into that. You wrote a post about this concept of entity-centric data modeling. For people who haven't already read the post, we'll have the link in the show notes, so you can either pause now, read it, and come back, or just read it afterwards. But for people who don't want to take the time, can you give a bit of an overview of what you mean when you describe entity-centric data modeling and some of your process of coming to this formulation of the problem?
[00:08:52] Unknown:
Yeah. First, a bit of a disclaimer: talking about data modeling without schematics and images is a little difficult — or not difficult for me to talk about, but it might be difficult for people to make sense of whatever we're going to say here. So I encourage people to read the blog post too, and we'll try to stay in language territory, as opposed to visuals, to talk about it. The core idea behind entity-centric data modeling — well, maybe I'll first talk about the driver of why I wrote the blog post. I've been a dimensional modeling aficionado for twenty years. I read the Kimball books a while ago; facts and dimensions make sense to me. For those not familiar with dimensional modeling, go pick up that book. It is fading in relevance in some ways, but if you're a data engineer, it's probably good to go and read it — maybe it's the original, the Old Testament.
I don't know what the New Testament is — it might still have to be written — but it's not a bad thing to go and read. The general idea behind dimensional modeling is star schemas, where some of the core principles are that you have fact tables and dimension tables. Facts mostly contain foreign keys to dimensions, plus metrics, and then dimensions have the primary key of the entity and all sorts of denormalized attributes at the entity level. I'm not going to get into snowflake schemas, because I feel like that starts to get into the area where visuals are helpful — if you want to learn more about dimensional modeling, check out the books. So I've been practicing dimensional modeling, and I think, over the past decade or so, we started to be much more pro-denormalization in the Hadoop era.
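For readers who want the shape of that on the page, here is a rough sketch of the star-schema layout Max describes, using a hypothetical retail example (all table and column names are invented for illustration):

```sql
-- Dimensions: one row per entity, with denormalized descriptive attributes.
CREATE TABLE dim_customer (
  customer_key  BIGINT PRIMARY KEY,
  customer_name VARCHAR(255),
  country       VARCHAR(64),
  signup_date   DATE
);

CREATE TABLE dim_product (
  product_key   BIGINT PRIMARY KEY,
  product_name  VARCHAR(255),
  category      VARCHAR(64)
);

-- Fact: mostly foreign keys to dimensions, plus metrics and a time dimension.
CREATE TABLE fct_sales (
  customer_key  BIGINT REFERENCES dim_customer (customer_key),
  product_key   BIGINT REFERENCES dim_product (product_key),
  order_date    DATE,            -- time dimension
  quantity      INT,             -- metric
  revenue       NUMERIC(18, 2)   -- metric
);
```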
If you're not familiar with normalization and denormalization, you're probably in trouble for this blog post — you probably need to go pursue your career in data engineering for a little while, read some articles, and come back to it. But we've been pro-denormalization: the idea of having more flat tables, as opposed to a bunch of small tables with a lot of foreign keys. Another concept that comes up in the blog post is normal forms — first normal form, second normal form, third normal form. But going back to dimensional modeling, one thing that I've been doing, and that I've seen a lot of folks do over the past five to ten years, is starting to bring metrics into their dimensions.
That can be a really, really powerful pattern, in a lot of ways, and we can get into why. It's a little bit of a digression in the sense that, if Ralph Kimball were on the show, he'd be like, "Wait, you did what? You brought metrics into your dimension? Is that okay? Is that not okay? Why are you doing that?" So that's one of the ideas: yes, in entity-centric data modeling, it is virtuous and encouraged to bring metrics into dimensions, and we have ways to talk about how to do that properly — we'll get into that. Another idea is legitimizing bringing nested data structures inside your dimension tables.
We now have databases like BigQuery and Snowflake — really, all databases have learned JSON and complex data structures over the past decade, which was not the case when those early books were written. So it's a new set of tools that we have, and, by the book, putting an array of objects into a table was probably discouraged. Now I'm saying, hey, sometimes it actually does make a lot of sense to do that. So those are some of the core ideas that I've seen done; I wanted to legitimize them and offer a little bit of a framework for how to do that properly. You bring metrics into your dimension.
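As a sketch of that nested-structure idea, here is what a dimension table carrying an inline array of objects might look like in BigQuery-flavored SQL (the table and columns are hypothetical; Snowflake and other engines express the same thing with their own VARIANT/OBJECT types):

```sql
-- A dimension row that carries a metric and a nested array of recent events
-- inline, instead of forcing a join out to a separate fact table.
CREATE TABLE analytics.dim_customer (
  customer_id    INT64,
  customer_name  STRING,
  lifetime_value NUMERIC,   -- a metric living in the dimension
  recent_orders  ARRAY<STRUCT<
    order_id   INT64,
    ordered_at TIMESTAMP,
    amount     NUMERIC
  >>
);
```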
What do you do with time? Say you have a customer table and you're going to bring in some metrics. You might have cost of acquisition or lifetime value of the customer, but you might also want to pivot time: how many visits has this user made to the site over the past 28 days, or 7 days? So I wanted to legitimize a new set of practices and offer a bit of a framework, or hints, as to how to do that properly. Here's another thought that's pretty foundational: some of these ideas come from the field of feature engineering. If you're a data engineer today, you may or may not know that field, depending on how close you sit to your fellow data scientists on your team.
Whether they have a feature store and do feature engineering might differ depending on your organization. But I also wanted to take some of the concepts and practices that I've seen used in the field of feature stores and feature engineering and mash them into data modeling for analytics, because I've seen a lot of benefits in doing that. Feature engineering is extremely entity-centric too, so it all fits together. In the blog post, I talk about the parallels between feature engineering and entity-centric data modeling. So that's a lot of stuff, a lot of information — I'm speaking fast, and hopefully people are still with us; I've probably lost 80% of the audience here. But we can unroll it and go deeper into these different aspects of it.
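Putting those pieces together, a hypothetical entity-centric customer table might mix classic descriptive attributes, life-to-date metrics, and time-pivoted metrics side by side (all names here are invented for illustration):

```sql
CREATE TABLE analytics.dim_customer_ec (
  customer_id         BIGINT,
  -- classic dimensional attributes
  country             VARCHAR(64),
  acquisition_channel VARCHAR(64),
  -- life-to-date metrics brought into the dimension
  cost_of_acquisition NUMERIC(18, 2),
  lifetime_value      NUMERIC(18, 2),
  -- time-bound metrics, pivoted into columns
  l7_visits           INT,
  l28_visits          INT,
  l90_visits          INT
);
```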
[00:15:10] Unknown:
Yeah, there's any number of directions that I would love to go with this. But before we get too far down any of those paths, I think it's worth giving a concrete sense of what we mean when we're talking about an entity. Obviously, this is going to be different depending on your business, but in a relatable sense, what are some of the ways that you think about how to go about defining an entity? What are the heuristics that you use to say, this is a business object or a domain object within the context of my organization?
[00:15:42] Unknown:
Yeah. So clearly, this notion of an entity — we've talked about it in the abstract here. But the reality is that, in a lot of cases, analytics are applied to businesses, and your core entities as a company should be really clear to you. While they may vary from one business to another, or from one business vertical to another, there are some that are very universal and that most companies have. The customer, for instance, is a very clear business entity. So when you think of entity-centric data modeling, for a lot of people that might mean your core business entities: your products, your customers, your users, perhaps your departments.
If you're more in an HR or finance world, it might be your business units. To me, if you look at the definition of a dimension in dimensional modeling, it's fairly aligned with that. One thing that's a little bit of a digression here, and that might be interesting yet a little confusing: when you think about facts, is an event an entity? When you look at your sales fact table, usually, behind the scenes, it's not that far from an invoice or invoice-line fact table. So I would say that, in data modeling, most things are entities, whether you want it or not.
And when I named this approach, I wondered: do I go with "entity"? Because even if you look at third normal form — your OLTP-type data modeling, data modeling for apps — that is also very, very entity-centric. So "entity-centric" sounds like it calls for high normal forms, third normal form, but in this case it's really entity-centric for analytics. So, yeah, the question is: what are the big dimensions and entities for your business?
[00:17:57] Unknown:
One of the things that I want to focus on is what you were talking about with this juxtaposition of analytical use cases for data versus machine learning use cases for data, where machine learning has largely been dominated by this concept of feature engineering — unless you happen to be doing deep learning, in which case you just throw a whole bunch of data in and hope something useful comes out the other side.
[00:18:16] Unknown:
Yeah. In deep learning too, I think feature stores can still be used. It's really LLMs now — this whole new field of unsupervised learning — where you can throw a bunch of text at the model, so you don't need data that's as structured if you have a language model. But if you want to train a neural net for special-purpose use cases, you still need a feature store, or you still need things to be organized in a similar way. For the data engineers listening, think about what's so different about feature engineering and ML use cases, and what the needs are there. Typically, in machine learning, you want something fairly entity-centric: very flat tables with a pile of columns. Say you're trying to predict whether a user is likely to churn. What you're going to need, for the user entity, is a pile of metrics that might be related or correlated to whether they churn or not. So you train a model by giving it data for a bunch of users, and for each user you give it a large number of attributes — most likely at least a hundred, maybe a thousand — that may or may not be related to their likelihood to churn. It might be things like time since last visit; that's a metric.
How many visits did they have over the past year? Which parts of the product do they engage with, and which do they not? Usually that ends up as a column: did they engage with feature A in the product, did they engage with feature B, all pivoted. Typically, the models like to receive data that is all pivoted, and then what the machine learning model does is try to find correlations: which columns or metrics, or combinations of columns and metrics, predict the target. So in machine learning there's a really clear entity — in this case, the user — a metric that you're trying to predict, like whether they will churn or not, and then a pile of attributes that may or may not be related to that. You let the machine learning model figure out what correlates with it.
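A minimal sketch of that pivoted, one-row-per-entity shape for a churn model, assuming a wide user table with hypothetical column names:

```sql
-- Hypothetical churn training set: one row per user, one label column,
-- and many pivoted attributes for the model to sift through.
SELECT
  user_id,
  churned_in_next_28d      AS label,  -- assumed pre-computed label column
  days_since_last_visit,
  l7_visits,
  l28_visits,
  l90_visits,
  used_feature_a_l28d,
  used_feature_b_l28d,
  avg_session_minutes_l28d
FROM analytics.dim_user;
```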
[00:20:57] Unknown:
And given those two different use cases, one of the things that I've been curious about lately is whether both of those teams — the data analyst and the ML engineer — are going to the same place to get their data. Does the ML engineer need their own feature store, feature repository, and feature pipelines while the data analyst uses the data warehouse, each off in their own world? Or are we getting to a point where the data warehouse is powerful enough and flexible enough that everybody is actually working on top of that same substrate, maybe with a different set of tables derived from those core sources?
[00:21:35] Unknown:
Yeah. I think it's very intuitive to think: hey, we should share. We should be DRY. We should have only one definition of these metrics and one way to model these things, all consistent, managed in the same repo, shared between data engineers and data scientists. But I would say the needs on both sides are intricately similar and different, and the constraints on both sides are intricately different too. So, should they share, should they talk, should they align?
Yes. Can they use the same technology and processes and live in the same repo? In my experience, that doesn't tend to work so well. I can try to explain some of the bigger divergences. On the feature side of things, there's often a need for online prediction — live, streaming datasets. So streaming versus batch: that need may or may not exist on the analytics side. Then there's the kind of metrics they care about versus the ones we care about — call it the ninety-day analytics versus the machine-learning-type use cases. There are a lot of common metrics, but also a lot of metrics they care about that you might not. And the practices, the mindset, the personas, the approaches tend to be differentiated enough that it's hard to agree and settle on one place to work together. So I'd say give it a shot, but your likelihood of success is — I don't know — not guaranteed. I'd be really curious whether there are people doing that successfully to some extent; that would be a really interesting thing for the show notes. But in my experience, these different personas tend to live in slightly different worlds. They do share some source tables at some point: you have your staging area, your work tables, your more curated schemas.
They might tap into those, or you might have a shared layer somewhere — you're working with the same data. The question is how far along you go together before you diverge into different systems and different use cases.
[00:24:00] Unknown:
And so, bringing us back now to the space of data modeling: we've talked a little bit already about the juxtaposition of entity-centric modeling with some of the more so-called traditional dimensional modeling strategies of star and snowflake schemas, and we can maybe even lump Data Vault in there as well, if you're feeling adventurous. Beyond those established practices, what are some of the other competing modeling approaches that you've seen cropping up? One in particular that came to mind when I was preparing for this episode was the activity schema, popularized by a company whose name is escaping me at the moment, where every event just goes into one table, you have an event type, and you query based on event type. I'm wondering if you can talk through some of the other ways that you've seen people try to tackle this problem and reconcile the increased compute power we have now with the structural benefits offered by these dimensional strategies.
[00:25:01] Unknown:
Yeah. So first: what is the inventory of the different data modeling techniques that you might want to study or use as a data engineer today? What are the relevant ones? You mentioned the activity schema, which I would probably best describe as cramming everything into one table with more complex data structures, which has some virtues — and, frankly, it's a pattern I've used for some parts of the data warehouse, for frameworks that pivot time metrics and things like that. So we could talk about how an activity-schema-type approach can be leveraged to produce entity-centric practices in some ways. Just trying to enumerate: I know there's Data Vault, which has some traction and some good literature. There's data mesh, which I would dismiss as not data modeling — it's more organizational; it's about how you should structure your organization.
It's not really prescriptive in terms of data modeling technique. So there's activity schema, Data Vault, data mesh. More traditionally, there's dimensional modeling — that's the Kimball one — and there's the Corporate Information Factory (it sounds like an awful term), which is the Bill Inmon approach; that one, I believe, suggests a normalized data warehouse and then building data marts off of it. Kimball's dimensional modeling talks about conformed dimensions. But, yeah, I think it's a bit of a messy world, and there's nothing that says you have to pick only one of these religions. And there's been very little innovation in that field. If you ask, what are the best data-modeling-for-analytics blog posts or books that I should read from the past ten years?
I don't think you're going to end up with a lot of reading material. Maybe because the topic is — I don't know — partially solved, or partially unsolvable, or not that interesting to people. And yet, with the rise of the analytics engineer, you have people basically doing whatever the heck they want in terms of data modeling: I'll just write a bunch of SQL, mountains of SQL that derive datasets from other datasets, and I don't know what normalization or denormalization means — I just run queries and produce datasets, and we try to make sense of all of this. So given that, I saw a bit of an opportunity to write some content on data modeling. There's probably an audience out there.
There are people wondering: is it okay to put metrics in my dimension? Should I take a page from the feature engineering book and apply some of those techniques on my side? So, yeah, it's a bit of a jungle out there. The most comprehensive, complete, self-standing thing, to me, is still the Kimball books — and they're what, twenty years old now? They predate Parquet; they predate column stores in many ways. So some of the stuff in there needs to be freshened up.
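For reference, here is a loose sketch of the activity-schema idea mentioned above — one long event table keyed by entity and activity type, from which entity-centric metrics can be pivoted out. The column names are an approximation for illustration, not the exact published spec:

```sql
CREATE TABLE activity_stream (
  activity_id    VARCHAR(64),
  ts             TIMESTAMP,
  customer       VARCHAR(64),     -- the entity the event belongs to
  activity       VARCHAR(64),     -- e.g. 'visited_site', 'completed_order'
  feature_json   VARCHAR(4000),   -- activity-specific attributes
  revenue_impact NUMERIC(18, 2)
);

-- Entity-centric metrics can then be derived by pivoting on activity type:
SELECT
  customer,
  COUNT(CASE WHEN activity = 'visited_site'    THEN 1 END) AS ltd_visits,
  COUNT(CASE WHEN activity = 'completed_order' THEN 1 END) AS ltd_orders
FROM activity_stream
GROUP BY customer;
```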
[00:28:24] Unknown:
Another interesting thing that I see in the community, and have experienced firsthand a little bit, is the propensity for teams working with data to not even go through the modeling process — particularly with ELT and tools like dbt — and to just say: I've already got my data, I'm just going to write some SQL to pull a few things together into a table, and there we go. That works for a little while, until it doesn't. I'm curious what you see as the overall impact of tooling on the way teams approach that thought process of data modeling and data transformation, and the downstream usage and impact of how they approach that transformation and modeling stage — and, lumping the underlying database engines and their capabilities into the tooling category as well — just the way all of that jumbles together to either encourage teams to think about modeling or to just go forth and hope for the best.
[00:29:22] Unknown:
Yeah. One thing is that there's a lot of virtue in making it easier for people to do things, and Airflow did that in many ways. Airflow made it easy-ish — not that it's super easy — for people to manage, maintain, and grow their network of data pipelines faster. With that comes a lot of enablement: more people can create more pipelines faster and manage more of them, because they have this nice UI to understand what fails, retrigger things, and do change management in a decent, or half-decent, way. I think dbt elevates that to the next level for SQL-specific use cases, making it really easy for anyone who knows a little bit of SQL and a little bit of git to start producing datasets and scheduling them to run every day. The downside of making it so accessible is that, if it's really easy to produce a dataset, people can create many datasets per day. There's a somewhat controversial person on Twitter, Lauren, who talks quite a bit about this explosion — there's a sort of conspiracy theory about Snowflake making money off of dbt empowering people to create these mountains of SQL. I certainly don't think it's a conspiracy, but there is something going on: making it easy for more people to do certain things brings a new wave of people who may not be experts in the art of, say, data modeling, and that can lead to technical debt being created faster than there's any hope of ever fixing it.
And specifically about dbt: something that's very different from Airflow is that Airflow has a very scheduler-centric approach — this job runs every day — so it becomes natural to say, I'm going to process a day's worth of data every day, and if I need to reprocess a day, I can go reprocess just that day. dbt has a simpler approach: write code that either rebuilds the whole table every run, or set up incremental loads where the approach is catch-up — write code such that, if you're in incremental mode, it figures out what range to run. It looks like a compiler; it is a build tool. That makes it maybe a little bit easier to author these jobs, but it also makes it extremely easy to be lazy and just refresh the whole table every day. So when you run `dbt run` in a lot of places, it rebuilds the entire data warehouse every day.
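A minimal sketch of the incremental pattern in dbt, assuming a hypothetical visits model; the `config`, `is_incremental()`, `source()`, and `{{ this }}` constructs are standard dbt, while the model, source, and column names are invented for illustration:

```sql
-- models/fct_visits.sql  (hypothetical model and source names)
{{ config(materialized='incremental', unique_key='visit_id') }}

select
    visit_id,
    user_id,
    event_ts
from {{ source('app', 'raw_visits') }}

{% if is_incremental() %}
  -- on incremental runs, only pick up rows newer than what is already loaded
  where event_ts > (select max(event_ts) from {{ this }})
{% endif %}
```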
That works for a startup. It probably doesn't work once you start having a little bit more data; then you have to go back and retrofit incremental loads into that world. So with this new wave of analytics engineers — people who know SQL and want to visualize data — how much do they know and care about data modeling? A little bit less than I would like, from what I've observed personally. It would be better if everyone knew a little bit more about data modeling and had a little more rigor around how they think about creating, managing, and maintaining these datasets.
[00:32:58] Unknown:
Another interesting aspect of the modeling question is: who are you doing it for? In the Kimball era, it was very much for the DBA and the business intelligence engineer, maybe working together, to say: okay, this is the report I'm trying to build, so these are the facts and dimensions that I need, and this is the way I need the data transformed to be able to write this report. Six months later: okay, I've got a new report, so we're going to go back and update the set of tables. Whereas now there isn't that clear connection between the way the data is coming in, the way it's being transformed, and the way it's being used. There are sort of interfaces between each of those stages of: okay, I've done my job, I don't know what you're going to do with it, but good luck.
[00:33:36] Unknown:
But there's no contract, right? Is this private? Is it public? Who uses it? It's like: I have baked a thing, and you can build on top of it — you don't have to let me know if you do. And change management is just hard in data, too. I don't think we respect the problem of change management well in data engineering. We just put things out there, people latch something else on top of them, and it's SQL all the way down.
[00:34:04] Unknown:
And to that point as well: dimensional modeling and star and snowflake schemas are very technically oriented. There's a lot of detail to them, and once they're in place they're very valuable, if you know how to use them and what you're trying to get out of them. Going back to the entity-centric modeling that you're proposing, what is the play in saying: okay, I'm going to build this entity-centric model so that now it's easier for you to self-discover what you need to know and the ways you're going to use it? What are some of the downstream benefits of the entity-centric model in terms of building that interface between the data engineer or analytics engineer who constructs it and the downstream consumers of that data?
[00:34:49] Unknown:
First, going back to who the star schema was serving — who did we data model for in the past? We certainly weren't catering to a SQL-savvy organization where people would connect to the database and run SQL. We did it in service of the BI tool. That is really clear to me: star schemas, dimensional modeling, and snowflake models were there to power a BI tool that had the semantic layer. And I guess now we're getting into the semantic layer, which I think we have to. Historically, tools like BusinessObjects, MicroStrategy, SQL Server Analysis Services, even a tool like Looker today, expect to work with fairly complex schemas or datasets. They typically won't do super well with a third-normal-form-type model, but they do very well with these star schemas. Now, do humans do well with star schemas? That's a question too.
Let's stay on the semantic layer for a moment. If you build the schema or data model in service of the BI tool, that assumes you have to put this semantic layer on top, and the promise of the semantic layer is self-service. In tools like BusinessObjects, MicroStrategy, or Looker, you can drag and drop these business objects — your dimensional attributes and your metrics — and the tool will figure out how to query your underlying complex star schema to serve that result set, or to eventually power your visualization or dashboard.
Now, there are some big problems with the semantic layer. First — and I wrote a different blog post on this one, which I encourage people to read if they're interested in the semantic layer — I wrote something about dataset-centric data modeling. Entity-centric builds on top of dataset-centric. Dataset-centric says that, in general, the interface for consumption — whether it's a BI tool, a data scientist, or someone writing SQL — is a dataset, or at least it's really great when it's a dataset. It's easy for people to reason about tabular, flat tables. It's nice not to have to figure out which joins to make when you don't have to. That's some of the idea behind entity-centric data modeling too: if you have everything about a customer in a customer table — including metrics, activity, engagement, likelihood to churn — then you're in a good place, because you only have to understand the table you're working with, the columns it has, and their definitions.
Otherwise, you have to go to a map to figure out what you need to join to augment this data. So, getting a bit deeper into serving the BI tools: most organizations have multiple BI tools, for a variety of reasons we could or could not get into — there are different generations of BI tools, different personas in your organization who prefer different things, new BI tools coming on the market (there's a great open source one called Superset; you can start using it today).
So people accumulate multiple BI tools, sometimes through acquisitions, or because different business units want different BI tools for whatever reason. And the semantic layers are not universal and open source — they are proprietary to the BI tool you use. They are fairly complex and hard to manage. I talked a little bit about change management before: change management in data engineering is really hard in the data warehouse and the transform layer, and it's also hard in the semantic layer. Keeping multiple BI vendor tools in sync with your evolving schema is quite a challenge, and it's duplicative work. So if you put a lot of the "how do you query this star schema" logic in the semantic layer, and you have multiple semantic layers to manage, you're a bit in trouble. So I argue that, in this era where we don't really have a good, universal, reusable semantic layer across tools,
it's better to put more of that logic in the transform layer and to denormalize further. Then, in terms of who you are serving: in a perfect entity-centric-modeling world, you have these very rich, very comprehensive entity-centric datasets that can be used and reused across BI tools, by different personas, by people who write SQL, without them having to really understand complex underlying data models. So, say I want to do a customer-bound analysis. I pull the customer table, and great: it has 200 well-documented columns, in different groups, with really clear naming conventions and descriptions.
It has everything I need to do really deep analytics on my customers: give me the customers who are likely to churn, who haven't been active in the past 90 days, who acted in response to this ad campaign — whatever it may be — and have that at your fingertips, or SQL-tips. That can then be used as a source for your Superset dashboard, or as a Tableau extract, or by a data scientist in a notebook, and it can be much more comprehensive, because tabular datasets are just easy and natural to reason about.
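That kind of targeting query ends up looking something like the sketch below — written against a hypothetical wide customer table, with every column name assumed for illustration:

```sql
-- Segmentation/targeting with no joins: just filters on well-named columns
-- of a wide, entity-centric customer table.
SELECT customer_id, email
FROM analytics.dim_customer_ec
WHERE churn_risk_score > 0.7        -- assumed ML-derived column
  AND l90_visits = 0                -- inactive over the past 90 days
  AND campaign_x_responder          -- engaged with a previous ad campaign
ORDER BY lifetime_value DESC;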
[00:40:59] Unknown:
And so, for people who are trying to figure out which modeling strategy to use, or who are in that situation of "I've got a whole bunch of source tables and a whole bunch of derived tables, but there's not really any rhyme or reason to it": what are some of the tactical aspects of iterating toward a workable entity-centric domain model that they can build, maintain, and evolve for their organization?
[00:41:25] Unknown:
Yes. I would say your warehouse today probably already has your core business entities very well represented in somewhat flat tables. You probably have some sort of central, or master, customer table. Or, if your business is in supply chain, you have a probably very rich warehouse table somewhere that has the bulk of everything you should know about a warehouse. If you don't have one, then I would argue you should build one. As for how you build it — whether you use a staging area and work tables in a back-room-and-front-room-type approach, whether you do it in Airflow or dbt or ELT or Spark — I would say stick to your internal practices; that's very cultural and unique to your organization.
If you have a blank slate — and I don't think many people do; who on the call here has a blank slate and gets to define the nomenclature and data modeling practices for their organization? That happens very few times, and it doesn't last very long. But that would be an interesting question too: if you have nothing today, where do you start? We could do an episode on that. But assuming you already have practices in place, tooling in place, and a reflection of your core entities somewhere in a queryable place, the question is: how do you augment it to be even more useful?
Some of the ideas behind entity-centric modeling are that, yes, you can and should bring metrics — numerical metrics — into your dimension tables, and we can talk about how to do that; I touch on it in the blog post, but it's fairly complicated and abstract. The big question is how you bring time into that. Fact tables are usually a combination of foreign keys, an event, some metrics, and one or more time dimensions. Here, in the entity-centric model, we're in a table that has a really clear grain: one row per entity instance. In the customer entity-centric table, there's one row per customer.
Now, there are probably dozens of metrics that are very key and relevant to qualifying that customer. How do you bring in the time dimension and columnize it? That's an interesting question that I talk about in the blog post. The first step is to bring in what I call time-bound metrics. Say, for a user table, visits or certain types of actions seem really important. The time dimension for that is: well, I can bring in the total number of visits this user has had. It could be me on your podcast website — I've visited, in total, fifty times over the past ten years, almost a decade now. That's a metric; I would call that life-to-date visits. But then it's really interesting to get something like 7-day visits, 28-day visits, 90-day visits. So then, for each user in this case, you have a new set of columns — one might be called, say, 90-day visits. You probably want something similar with time-pivoted listens: complete listens, partial listens, average completion — does this user typically listen to whole episodes or just snippets, how much do they fast forward, how many times did they fast forward on their last listen?
So you have to think about which metrics are most useful in relation to that entity — in this case, say, visits, listens, and percentage completion of a listen — and you have to think about which time frames are relevant to your business and time analysis: typically 7 days, 28 days, 90 days, 1 year, life to date. Then start computing and pivoting those columns. One thing I talk about in the blog post is that having all of these rich, useful time-bound metrics makes a whole new class of analysis very natural and simple. If you have to join to multiple fact tables and do intricate time filtering to gather the data to run an analysis, you might not bother. You might not bother doing a complex join to get the number of, say, 28-day listens for that user.
But if they are there at your fingertips, all of a sudden it becomes really easy and natural to do segmentation, engagement analysis, and targeting: give me the users who have not listened to any of my podcast episodes in the past year but have demonstrated interest in something else that's relevant, and use that for targeting — I'm going to send them this email to re-engage with this episode. So these tables have a mix of, in the case of users, user attributes — the traditional user-demographics-type things — enriched with a bunch of metrics that are very relevant to your business and timely too, right at your fingertips. And that can be extremely useful.
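Here is a sketch of how those time-bound columns might be computed from a listens fact table, in BigQuery-flavored SQL; the table, column, and metric names are hypothetical, and the exact date functions vary by engine:

```sql
-- Pivot time windows out of a fact table into one row per user.
SELECT
  user_id,
  COUNT(*)                                                          AS ltd_listens,
  COUNTIF(listen_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY))  AS l7_listens,
  COUNTIF(listen_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 28 DAY)) AS l28_listens,
  COUNTIF(listen_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)) AS l90_listens,
  AVG(pct_completed)                                                AS ltd_avg_pct_completed
FROM analytics.fct_listens
GROUP BY user_id;
```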
[00:47:33] Unknown:
And as an extension to that exercise, what are some of the specific elements that introduce added complexity — or maybe it's all the same thing — when you're working across multiple sub-organizations or business units within a larger organization? I'm thinking particularly of enterprise users, where they have multiple different entities, or categories or bucketings of entities, that they then want to stitch together across the entire organization to get a lateral view of everything, at maybe a higher level.
[00:48:09] Unknown:
Yeah. In some ways, this type of data modeling enables you to have more of a 360 view of the entity — I hate that term, it sounds so marketing-y — but okay: you take a really core entity, the user. Of course, there are some metrics that matter most to, say, the marketing department, or the sales department, or the product department — how are people engaging with the product, which features are they using? But I would argue that, in the end, they're all useful features of the user that may or may not predict behavior, and may or may not be relevant to another department. In the context of a feature store: cram more features in there; they may or may not be useful for the behavior you're trying to predict through machine learning. One clear challenge, on the consumption side, is that if you have hundreds and hundreds of columns for a user, then you have to ask: which of these columns actually make sense, are usable, are fresh?
There are some challenges around latency too, because now you have to wait for the entire data warehouse and all the facts to come together before your user table is ready. And then there's a documentation issue: what does the column called m_ltv_, you know, visit qda mean? What is qda? So there's a documentation issue around how you document these metrics so that they are actually usable by people. There are also some potential performance issues. With column stores, a super wide table doesn't matter that much, but in the Parquet-file world you can end up with Parquet chunks that don't have a lot of rows because they have a lot of columns, so there are some techniques to apply there. The latency issue — the spaghetti issue — is that now your dimension table in your DAG depends on your fact tables in your DAG, and the fact tables depend on other things, so you can get into circular-dependency-type challenges. The complexity of building your user table is no longer just "get all the demographics and be done with it"; it's "build all of your fact tables, pivot time, build a hundred metrics, and cram them into your user table." That brings an ETL-complexity set of challenges too. In the blog post, I talk about some of these things. I talk about what I call vertical partitioning.
That's one idea: different groups of attributes can actually be stored in different user tables. So you might have a table called, say, dim_user_extended with all the fields, but you might also have dim_user_marketing, or other tables with different subsets or groups of columns. There's probably a little bit more to it, but I also talk in the blog post about how to deal with the circular dependency issue, and the idea there is to have an extra layer in the warehouse. People are familiar with the idea of a staging area: that's where you receive all your data, very raw, from the different systems, mostly unprocessed — raw ingredients — the same way that a staging area in a physical warehouse is just where the trucks unload. Then there's the back room and the work tables; that's where you would do a bunch of entity-bound work and fact-bound work.
And then you have — I forget if I called it the shuffle layer in the blog post — a layer where you pick from those premade assets and merge, call it, entity attributes and facts together. There's no perfect way to do it, and it's a little bit at your discretion, given your style and practices. But, yeah, those are definitely some of the challenges that come with this approach. One thing I'll say is that progressive adoptability of any methodology is really important. You're not going to go from "we do star schemas now" to "we're moving to entity-centric modeling; we have to rewrite the whole warehouse." No. It's more like: okay,
at first, how do we sprinkle some more metrics and nested structures into our dimension tables? Now that we know it's okay to do it, where do we start? What are the top three most useful metrics for this business? Let's just start with those.
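Here is a rough sketch of the vertical-partitioning idea described above, with hypothetical table names: narrower per-group tables, plus an extended view for consumers who want everything in one place:

```sql
-- Column groups live in narrower per-entity tables...
CREATE TABLE dim_user_core       (user_id BIGINT, country VARCHAR(64), signup_date DATE);
CREATE TABLE dim_user_engagement (user_id BIGINT, l7_visits INT, l28_visits INT, l90_visits INT);
CREATE TABLE dim_user_marketing  (user_id BIGINT, acquisition_channel VARCHAR(64), ltd_ad_spend NUMERIC(18, 2));

-- ...and an extended view stitches them back together.
CREATE VIEW dim_user_extended AS
SELECT c.user_id, c.country, c.signup_date,
       e.l7_visits, e.l28_visits, e.l90_visits,
       m.acquisition_channel, m.ltd_ad_spend
FROM dim_user_core c
LEFT JOIN dim_user_engagement e USING (user_id)
LEFT JOIN dim_user_marketing  m USING (user_id);
```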
[00:53:06] Unknown:
Absolutely. And the entire concept of building the table structures around these different entities also brings up the question of entity extraction, which is a whole other episode. But in your experience of working through this problem space — documenting these concepts of entity-centric modeling, working with the customers you have at Preset who are trying to drive these different insights and visualizations of their data based on the domain structures they've built up in their warehouse — what are some of the most interesting or innovative or unexpected ways that you've seen people approach this practice of entity-centric modeling?
[00:53:45] Unknown:
Yeah. So we have limited visibility as to, like, exactly, like, what our customers and users do, you know, at preset, like, in and for good reasons. Right? Like, so we we cannot get into your, you know, your private data warehouse in any way. 1 thing I can talk about is how I've seen I would say, over the past decade, like, since the beginning of the the Hadoop era, how people who didn't come from a background of, say, star schema or dimensional modeling have been modeling their table on some some interesting use cases that we've seen emerge over the years. But I think when I first saw the what I call the time bound metrics was at Facebook in a table called dim user. And dim user was the most used table in the entire data warehouse of, of all of Facebook. And this dimension table, like, had 100 by 1,000 of columns that had to do with, with their behavior on Facebook or, like, basically, everything related to user was crammed back in to this or not not everything was crammed back into this. But, like, things like so say, if you work on the photos team, you might wanna know how many photos does this person have uploaded over the past 28 days. Right? Are they a tagger? Do they tag people in photos?
Do they use search? And on what time frame? So we started seeing a lot of columns like metric_7_day, metric_28_day, metric_90_day. I was part of the data engineering team, and someone on my team was working on this table. It's so large, so useful, and so multidimensional that it had a dependency on just about everything else in the warehouse, and everything depended on it. It became a very, very key hot spot or focal point in the warehouse, both in what it took to build it and in how much it was used downstream.
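As an illustration of those time-bound metric columns, here is a hedged sketch of how a fact table can be folded into per-user 7-, 28-, and 90-day counts; fct_photo_uploads and all column names are made up for the example.

```sql
-- Hypothetical fact table: fct_photo_uploads(user_id, uploaded_at).
SELECT
  u.user_id,
  COUNT(CASE WHEN p.uploaded_at >= CURRENT_DATE - INTERVAL '7'  DAY THEN 1 END) AS photos_uploaded_7d,
  COUNT(CASE WHEN p.uploaded_at >= CURRENT_DATE - INTERVAL '28' DAY THEN 1 END) AS photos_uploaded_28d,
  COUNT(CASE WHEN p.uploaded_at >= CURRENT_DATE - INTERVAL '90' DAY THEN 1 END) AS photos_uploaded_90d
FROM marts.dim_user AS u
LEFT JOIN fct_photo_uploads AS p
  ON p.user_id = u.user_id
 AND p.uploaded_at >= CURRENT_DATE - INTERVAL '90' DAY
GROUP BY u.user_id;
```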
But once you have that at your fingertips, the really interesting thing is that cohort analysis and segmentation become very natural. You don't need to go figure out where the data mart for photos and photo uploads is and how the heck you're going to join it. It's more like: okay, now I have a table with one row per user, and the bulk of everything I might want to know about this user is there at my fingertips. So I can start asking for the users who did this and did not do that, in a certain time frame or in a different time frame. There's a lot you can do there. One other thing that existed there and was very much a pattern — something I talked about in my functional data engineering blog post a while ago — is the idea of snapshotting dimensions. That means you don't only enrich your dimension with things like time-bound metrics and nested structure.
And we could get into nested structure a little bit more — what some examples are and why it's relevant. But not only do you have a super rich entity-centric table — dim_user at Facebook, with billions of users and a million attributes about them — we would also snapshot these dimensions. Basically, every day a new partition is added to the dim table with all of the users. So if today you have 10,000,000,000 users, yesterday's partition has slightly fewer, and so on. Then you can do time series analysis on these rich tables too. That's something I do talk about in the blog post, and in general I think it's a good practice: if you treat your datasets as immutable, something that's very natural when you build your dimension is to create a new snapshot of your dimension table every day. And when you have that table, it becomes really natural to do time series analysis on your populations.
Right? So here you could say: give me the percentage of users who visited more than 6 days over the past 7 days, and how that's evolving over time. There you would use a metric called L7 visits, which represents the number of visits over the past 7 days. You could ask: what percentage of my users visited more than 6 out of the past 7 days? Okay, I can give you a nice big percentage. But how is that evolving over time? You get that through the idea of snapshotting your dimensions — being able to do time series on your dimension table. That one is a little more controversial; you don't have to do it if you do entity-centric data modeling, but I think it's just a great practice. And once you do it, of course, there are storage and compute costs associated with it, because now you're storing a whole snapshot of your whole dimension every day. But the empowerment in terms of the kind of analysis you can do very easily on these tables is really, really powerful.
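A sketch of the snapshot pattern and the L7-style trend query described here; dim_user_snapshot, the ds partition column, and l7_visits are hypothetical names.

```sql
-- One full copy of the dimension is appended per day, partitioned by ds.
-- Hypothetical table: dim_user_snapshot(ds, user_id, l7_visits, ...).

-- How is the share of highly engaged users (>= 6 visits in the past 7 days)
-- evolving over time?
SELECT
  ds,
  AVG(CASE WHEN l7_visits >= 6 THEN 1.0 ELSE 0.0 END) AS pct_engaged_6_of_7
FROM dim_user_snapshot
GROUP BY ds
ORDER BY ds;
```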
[00:59:01] Unknown:
And in your experience of working in this space and trying to distill the different requirements and use cases and capabilities around data modeling, what are some of the most interesting or unexpected or challenging lessons that you've worked through?
[00:59:16] Unknown:
Yeah. Well, I think the thing that's really difficult with data modeling is that the best practices are not clear. They tend to emerge culturally within an organization. Now everyone's been invited to the party, and there's a wide spectrum from amateurs trying to figure it out all the way to experts who really know their art, and right now we're all working together. I've seen people make gigantic mistakes in terms of just basic SQL writing and data modeling logic — joining on a field that's not unique, doing these Cartesian joins — and they're just not aware that it's going to be a problem. So there are some very basic mistakes that get made. There are also people who don't know Git. We've seen people who kind of git push straight to master in certain areas; they use a Git GUI and don't really understand how Git and branching work. So if you look at the distribution from amateur to expert, there are a lot more amateurs now who are empowered, and that comes with empowerment, but also some risk and some issues. That's definitely an emerging pattern. And then in terms of how to teach data modeling, how to educate people on it, it's kind of unclear, because there's no bible you can point to, really.
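For readers unfamiliar with the fan-out mistake mentioned above, here is a minimal hypothetical example: joining on a key that is not unique silently duplicates rows and inflates totals.

```sql
-- Hypothetical tables: orders(order_id, user_id, amount) and
-- user_emails(user_id, email), where one user can have several emails.

-- Inflated: each order repeats once per email address the user has.
SELECT SUM(o.amount) AS total_revenue
FROM orders AS o
JOIN user_emails AS e ON e.user_id = o.user_id;

-- One fix: collapse the right side to one row per user before joining.
SELECT SUM(o.amount) AS total_revenue
FROM orders AS o
JOIN (SELECT DISTINCT user_id FROM user_emails) AS e
  ON e.user_id = o.user_id;
```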
So people have to look at how other people do it, do a little code review, and try to figure it out. It's as if we had invited a bunch of people into software engineering who barely know how to code and barely know how to use Git, and that comes with a set of challenges. The challenge is education, in a context where there's not a lot of material to train people. I guess the dbt community has empowered a lot of that — people can get a lot of resources from, say, the dbt community, the dbt Slack, and the dbt documentation.
But there's still a gap in making people experts, and there's tech debt. To me, even for a seasoned data engineer who's really good at data modeling, it still feels like data pipelines are tech debt from day 0. So now we're dealing with a much bigger issue at scale: these mountains of SQL that are growing and brittle. And I don't know if we'll ever dig ourselves out of that hole.
[01:02:17] Unknown:
And for people who are working through these exercises of building up their data architecture and working in their data warehouse, what are the cases where entity-centric modeling is the wrong choice and they're better suited with a standard star schema, activity schema, wide tables, or just no modeling at all?
[01:02:39] Unknown:
Yeah. No modeling at all — well, I think there's a bunch of practices that we used to do that don't make sense anymore. I would say don't start with entity-centric. Then again, I think anything you're going to do is going to be in some way entity-centric: you're going to look at what your business objects are, what your business entities are, and how to make sense of them. One thing that's a bit unrelated to modeling per se, but that's happening with the rise of Fivetran, Airbyte, Meltano, and other data sync tools, is that we don't write data syncs anymore. We inherit comprehensive models from these tools, so at least you don't have to data model the staging area — it just lands in your data warehouse.
I think one mistake people tend to make, and a bit of a pitfall, is to go straight from that data sync to a report: what is the SQL I can run to create the report directly from my raw tables? I think we've all figured out over the years that you need some sort of middle layer there, where you go from the staging area into pre-transformed, cleaned-up data, with a bit of a layered approach to your warehouse. So I would say don't go straight from your raw tables into entity-centric tables or into reports. Think about a first layer that's probably something like a star schema to start with. Thinking about facts and dimensions in that first, early layer is good, and then how you bring your facts and measures into your dimensions is maybe a concern for once you've solved the earlier work of modeling your facts and dimensions.
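A hedged sketch of that layering — staging on top of the raw sync, then facts and dimensions, with reports built only on the modeled layer; raw.payments_sync and every column name here are hypothetical.

```sql
-- 1) Staging: light cleanup of the raw synced table (hypothetical names).
CREATE VIEW staging.stg_payments AS
SELECT id AS payment_id,
       user_id,
       amount_cents / 100.0 AS amount,
       created_at
FROM raw.payments_sync;

-- 2) Facts and dimensions first.
CREATE TABLE marts.fct_payments AS
SELECT payment_id,
       user_id,
       amount,
       CAST(created_at AS DATE) AS payment_date
FROM staging.stg_payments;

-- 3) Only then fold measures into entity-centric tables and build reports
--    on top of the modeled layer, never directly on raw.* tables.
```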
[01:04:37] Unknown:
And as you continue to work in this space and try to distill these various ideas and workflows and use cases, what are some of the predictions that you have for the future direction and adoption of entity-centric modeling, or maybe some other modeling techniques that are waiting in the wings to gain dominance?
[01:05:00] Unknown:
Yes. I would say I like the idea of — I believe it's called — activity schema. I think there is a place for these generic metrics tables. We've talked about the metrics layer in the past, and the feature repository. As much as what we're talking about today with entity-centric modeling is an entity-bound table with a lot of columns, there's the question of how you get to a lot of columns. I think somewhere in your data warehouse you need more of a centralized metrics table, and that one would be a very thin table that is very long, very tall — by tall, I mean the shape of that table is not wide but narrow. What's typically in those tables is something like: what is the time, what is the metric name, what is the entity type, the entity ID, and what is the value for that metric. These are generic schemas that allow you to store things like events or metrics over time in a very thin way without having to change your schema. We see that quite a bit, called a metrics repository or a feature repository; it's pretty common to have that table that is metric-bound, entity-bound, and time-bound. So to repeat it, it's a very common pattern: you have a table with something like entity type "customer", an entity ID, metric "visit", a value, and a time. It reads as: on that date, this entity, for this metric, has this value. Once you have these very generic tables, you can do a lot of that pivoting I was talking about before.
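A sketch of that thin, tall, metric-bound / entity-bound / time-bound table and the one-scan pivot into entity-centric columns; metrics_repository and all names are hypothetical.

```sql
-- One row per date, entity, and metric (hypothetical schema).
CREATE TABLE metrics_repository (
  ds           DATE,
  entity_type  VARCHAR,   -- e.g. 'user'
  entity_id    VARCHAR,
  metric_name  VARCHAR,   -- e.g. 'visit'
  metric_value DOUBLE
);

-- Pivot into wide, time-bound columns for one entity type in a single scan.
SELECT
  entity_id AS user_id,
  SUM(CASE WHEN metric_name = 'visit'
            AND ds >= CURRENT_DATE - INTERVAL '7'  DAY THEN metric_value END) AS visits_7d,
  SUM(CASE WHEN metric_name = 'visit'
            AND ds >= CURRENT_DATE - INTERVAL '28' DAY THEN metric_value END) AS visits_28d
FROM metrics_repository
WHERE entity_type = 'user'
GROUP BY entity_id;
```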
That's more like: okay, now I want to compute 7-day visits for all my users and 28-day visits for all my users, and I want to do that in one big scan. Once you have this table, it becomes much more natural to pick and choose how you're going to bring metrics into your dimensions. That pattern is close to what I believe activity schema is about, and — to loop back to what we were talking about at the beginning of the episode — that may be a place where data scientists and data engineers can collaborate and share definitions upstream for their own pipelines, or agree on what all the metrics are that we track for each entity type, and on what time frames. There's probably much more to be written there. Or maybe there's a question of how you take something that's activity schema and produce entity-centric tables off of it. I want to say too, there's one thing we didn't talk about: I said earlier in the episode that I wanted to legitimize bringing metrics and time-bound metrics into dimension tables.
Another thing I think is important is allowing people to bring nested, complex data structures into dimension tables too. Going back to the podcast listener table we were talking about: maybe it does make sense to have an array of the last 5 episodes the user listened to, for whatever reason. If you wanted to do that, you're not going to create one column called "last episode they listened to" and another called "previous episode they listened to." And if you want to track what percentage of each episode they listened to — the episode ID, the episode name, and what percentage of the episode they listened to — well, now you'd have 3 columns times 10 episodes to manage dynamically. So then it's like: this is really valid, really core information about the user, the last 10 episodes, and I want to put it in the table. Well, put an array of structs in there.
And it will be easy to retrieve with a little bit of a JSON-extract-type function, or whatever your database of choice offers to query it. You can also do an unnest-type operation — you've seen those in SQL — where if you have an array of 10 things, you can pivot that into 10 rows in your query. People know how to do that now; it comes naturally to software engineers to work with data structures and pivot things. So don't necessarily abuse it — you don't want to put all the facts about all the users in as nested structures — but it's something you can do for things that are best represented that way.
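A sketch of the nested-structure idea, assuming a BigQuery-style dialect with ARRAY and STRUCT types; dim_listener and the field names are hypothetical.

```sql
-- The last episodes a listener played, kept as an array of structs
-- on the dimension row itself (hypothetical table).
CREATE TABLE dim_listener (
  user_id       STRING,
  last_episodes ARRAY<STRUCT<episode_id   STRING,
                             episode_name STRING,
                             pct_listened FLOAT64>>
);

-- Unnest the array back into rows when you need to query inside it.
SELECT d.user_id,
       e.episode_name,
       e.pct_listened
FROM dim_listener AS d,
     UNNEST(d.last_episodes) AS e
WHERE e.pct_listened > 0.9;
```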
[01:09:55] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, as usual — I don't know how many times you've answered this — from your perspective today, what do you see as being the biggest gap in the tooling or technology that's available for data management?
[01:10:17] Unknown:
Yeah. Maybe I'll stay with the theme of the episode, since we talked about this. I would say education around data modeling. It'd be great if we had a data modeling 101, and maybe that's somewhat provided in and around, say, the dbt community. Even if it was just the 30 things you should know before you go and pile your SQL onto the mountain of SQL that your organization has. I think there's much more to it than data modeling, but data modeling is a component of it. Things like: if you work with SQL and create datasets, you should know what normalization and denormalization mean. You should know what a dimension and a fact are — that vocabulary.
And there's a bit of a gap there. If you talk to 10 different dbt users, they might not use the same nomenclature, practices, and conventions. So we need to zero in on a set of knowledge that people agree on, or even just definitions and terms that everyone close to that area knows.
[01:11:38] Unknown:
Alright. Well, thank you as always for taking the time today to join me and share the work that you're doing and your thoughts on this overall space of domain modeling, the benefits of entity-centric approaches to it, and the ways that our newer compute substrates enable more detailed approaches to data exploration. So I appreciate you taking the time, and I hope you enjoy the rest of your day. Yeah, it was awesome to be on the show as always, and I'm sure I'll be back soon enough. So enjoy your day too.
[01:12:13] Unknown:
Thank you for listening. Don't forget to check out our other shows: Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it: email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction to Entity-Centric Data Modeling
Defining Entities in Data Modeling
Machine Learning vs. Analytical Use Cases
Competing Data Modeling Approaches
Impact of Tooling on Data Modeling
Building and Evolving Entity-Centric Models
Challenges in Multi-Unit Organizations
Innovative Uses of Entity-Centric Modeling
When Entity-Centric Modeling is Not the Right Choice
Future Directions in Data Modeling