Summary
The perennial question of data warehousing is how to model the information that you are storing. This has given rise to methods as varied as star and snowflake schemas, data vault modeling, and wide tables. The challenge with many of those approaches is that they are optimized for answering known questions but brittle and cumbersome when exploring unknowns. In this episode Ahmed Elsamadisi shares his journey to find a more flexible and universal data model in the form of the "activity schema" that is powering the Narrator platform, and how it has allowed his customers to perform self-service exploration of their business domains without being blocked by schema evolution in the data warehouse. This is a fascinating exploration of what can be done when you challenge your assumptions about what is possible.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
- Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold.
- Your host is Tobias Macey and today I’m interviewing Ahmed Elsamadisi about Narrator, a platform to enable anyone to go from question to data-driven decision in minutes
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Narrator is and the story behind it?
- What are the challenges that you have seen organizations encounter when attempting to make analytics a self-serve capability?
- What are the use cases that you are focused on?
- How does Narrator fit within the data workflows of an organization?
- How is the Narrator platform implemented?
- How has the design and focus of the technology evolved since you first started working on Narrator?
- The core element of the analyses that you are building is the "activity schema". Can you describe the design process that led you to that format?
- What are the challenges that are posed by more widely used modeling techniques such as star/snowflake or data vault?
- How does the activity schema address those challenges?
- What are the performance characteristics of deriving models from an activity schema/timeseries table?
- For someone who wants to use Narrator, what is involved in transforming their data to map into the activity schema?
- Can you talk through the domain modeling that needs to happen when determining what entities and actions to capture?
- What are the most interesting, innovative, or unexpected ways that you have seen Narrator used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Narrator?
- When is Narrator the wrong choice?
- What do you have planned for the future of Narrator?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Narrator
- DARPA Challenge
- Fivetran
- Luigi
- Chartio
- Airflow
- Domain Driven Design
- Data Vault
- Snowflake Schema
- Event Sourcing
- Census
- Hightouch
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Have you ever woken up to a crisis because a number on a dashboard is broken and no one knows why? Or sent out frustrating Slack messages trying to find the right dataset? Or tried to understand what a column name means? Our friends at Atlan started out as a data team themselves and faced all this collaboration chaos. They started building Atlan as an internal tool for themselves. Atlan is a collaborative workspace for data driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets and code, Atlan enables teams to create a single source of truth for all of their data assets and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker, and more.
Go to dataengineeringpodcast.com/atlan today. That's A-T-L-A-N, and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $3,000 on an annual subscription. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.
Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey. And today, I'm interviewing Ahmed Elsamadisi about Narrator, a platform to enable anyone to go from question to data-driven decision in minutes. So, Ahmed, can you start by introducing yourself? Hi, everybody. I'm Ahmed.
[00:02:08] Unknown:
I started my career in robotics, autonomous cars, moved into AI for missile defense, eventually built WeWork's data team and infrastructure, and now I run Narrator. So I'm excited to share the journey and why I left WeWork to build something
[00:02:23] Unknown:
as specific and nuanced as Narrator is. And do you remember how you first got involved in data?
[00:02:28] Unknown:
Yeah. So I've always been involved in, like, the algorithm side of data. I had an autonomous car in college in 2010 that competed against a Google car, if you're familiar with the DARPA Urban Challenge. So I've always been a consumer of data, writing algorithms to make decisions with data. And when I transitioned into working at WeWork, it was an interesting place because I got to see data that wasn't prepared for my algorithms. It was raw data, and it was messy. I had just come from Raytheon, where I was, like, dealing with missiles and thousands of objects in space. And then WeWork was like, how many sales do we have? And I'm like, how is this a harder question to answer?
And, like, the nuance of what data looks like, I think, makes this entire world of data engineering fascinating for me. It has the same goal, but the challenges just shift. And that's how I dove into data.
[00:03:20] Unknown:
Now in terms of the Narrator project, you mentioned that you left WeWork to found this business. And I'm wondering if you can just describe a bit about what it is that you're building there, some of the story behind how it got started, and why you decided that this was the area where you wanted to spend your time and energy. At WeWork, we had built, like, a traditional data system. Right? We had EL with Fivetran. We had a transformation layer, and back then, it was Luigi.
[00:03:44] Unknown:
And then we had dashboards, and it was Chartio. And still, so many ad hoc questions were coming in. We couldn't answer them. This idea of, like, we'll build enough dashboards so people can self-serve their own questions never came to fruition. And numbers started not matching. We're like, well, there's data modeling in the middle. There's all these things happening. And we kept trying to explain why numbers don't match, and it's a very complex procedure, as you and everyone else here knows. So we were like, the problem is Luigi. Let's switch to Airflow. And then 6 months later, we're like, the problem is Airflow. Let's build our own system that integrates with GitHub and does all these things. 6 months later, we're like, wait. Let's switch to dbt. And I was like, okay. Something is fundamentally wrong with this approach: no matter what tool I use, I still end up in a place where numbers don't match and I can't answer questions. And every new question requires me to go all the way back to the data model, and now everyone's blocked by data engineering, which was, like, a slogan at WeWork. And we had a huge team, 45 people. So, like, how is it taking this long? How is this a problem?
I went out and I talked to data engineers at Airbnb and Netflix and Spotify and a bunch of other companies that were in this space, and asked, how do you solve this problem? And they're like, that's the job. Like, that's the nature of data: new questions come in, you have to update your model, and you do it. And it drove me crazy. And I was like, what if it wasn't? Like, we've solved this problem before. What if we can standardize data so that the way we ask and answer questions is repetitive? Which means that I don't have to constantly reinvent the wheel every time a new question comes in. So we had an idea that came from reading blogs, which was like, I can read Netflix's blog, and I can understand what they're doing without ever knowing their data model. So there's clearly a data model that is universal, and back then I used to call it the universal data model that we were gonna create. And now it's called the activity schema. And the idea was: if I can standardize all of the data for every company in the world into the same exact structure, a single table, then I can make a standard way of asking and answering questions.
And if those 2 can encompass any question that could be asked, then anyone can go from question to answer in minutes. And I left with that vision, which I'm very happy that people funded, because it was very, very, like, what? You're gonna standardize all data? Like, e-commerce? No. No. All data. So, like, just marketing? No. No. Any industry, any data source is gonna all be standardized into Narrator, and we're gonna be able to ask and answer any question. And it took about 3 years before this entire flow actually started working. So that's kind of why I said, let's solve it. I just felt like there needed to exist another way. Otherwise, we would spend our whole life playing catch-up with the business stakeholders, and it just kind of created this antagonistic relationship where everyone hated data engineering, but it was just, like, the nature of how data is. And you had to always deal with the annoying question. I don't get it. Why is this so hard? How come you can't just tell me how many people came to our website and called us? And you're like, well, that's a very complex question. Like, I can tell you why it's complicated, but you won't understand why it's complicated. And now it's, like, a whole nightmare.
So somebody had to change it. "This is the nature of data engineering" is not a good enough answer. It's been now 5 years, and I think
[00:06:54] Unknown:
there is a solution. And I remember coming across the, you know, homepage of Narrator maybe a couple of years ago, and it was very sort of enigmatic about what it is that they're doing, because the page was saying, you know, standardized narrations about your data, and, you know, you can just do whatever you want, but you hadn't yet advertised the activity schema element of it. So, you know, coming from a data engineering perspective and looking at the page, I was like, it sounds like there's just a lot of humans doing a lot of busy work. So I'm wondering if you can talk about some of the primary use cases that you focused on as you were going down this path of figuring out how to unify the data model across all these different industries and some of the lessons that you learned in the process, going from, I have this problem that I want to solve, to where you are today where you feel like you've been able to complete that loop. Yeah. So Narrator's marketing is definitely 1 of those interesting journeys. It's like, how do you communicate what it is when it's, like, a data tool that solves a problem that's not just for data people? So it's a really complex situation. And we ended up pushing the activity schema because it made a lot more sense for the world to really understand what powers it.
[00:08:01] Unknown:
So when Narrator started, we had this idea of the data model, and we said, like, how do we know this thing is gonna work? And instead of, like, selling it, because it was, like, not a good product, we decided to actually start as a consultancy. So we said, what if every single data person can handle 10 different companies in different sectors, and can field questions via Slack and answer them within 30 minutes as, like, the expectation? And I'll build a tool that will enable me to do that extremely, extremely fast. So that means, like, constant context switching with every question. And that's kind of how Narrator started evolving. The activity schema itself, we used to call it the activity stream. We used to call it the universal data model. It was this concept that was like, listen. Warehouses are really good at dealing with long data.
And as humans, we can talk about things as customers doing stuff in time, and most questions can be phrased as a customer doing something in time. How many people came to our website and called us? Like, what's the best paywall? Give me everyone who started a subscription. What was the last paywall they saw before starting a subscription? And what we ended up getting really good at is taking any question and converting it to this time question. The challenge was, how do you actually query time series data? So we ended up spending a lot of time building a tool and a little bit of a unique way of combining these activities to actually generate datasets that you can use as, like, your middle layer, these tables, these materialized views, so you can actually answer these questions and make it easier.
And that's kind of what the initial solution of Narrator was. It'll be like, we'll structure your data as these building blocks that we call activities, which are these tiny little SQL snippets that are about 25 lines on average and take about 14 minutes to write. You just define what is a page view, what is a ticket, what is a call, and then they're gonna get stitched together into a long table. And that table is kind of, like, cleverly pivoted to answer any question that you need with some additional powers and filters. And we would do the setup for companies and then put that output table in the dashboard tool and use it.
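To make that concrete, here is a minimal sketch of what one of those activity transformations could look like. The stream column names follow the publicly documented activity schema spec, but the source tables (`orders`, `customers`) and their fields are hypothetical, and the spec's occurrence-tracking columns are maintained by the platform rather than written by hand.

```sql
-- Hypothetical activity transformation: map a source `orders` table into
-- rows of the single activity stream. All source names are illustrative.
SELECT
    CAST(o.id AS varchar)  AS activity_id,     -- unique row identifier
    o.completed_at         AS ts,              -- when the activity happened
    c.email                AS customer,        -- the entity doing the thing
    'completed_order'      AS activity,        -- what they did
    o.discount_code        AS feature_1,       -- optional activity metadata
    o.shipping_method      AS feature_2,
    NULL                   AS feature_3,
    o.total_amount         AS revenue_impact,  -- optional revenue column
    'https://shop.example.com/orders/' || CAST(o.id AS varchar) AS link
FROM orders AS o
JOIN customers AS c
    ON c.id = o.customer_id
WHERE o.completed_at IS NOT NULL
```

Every other concept (tickets, calls, page views) gets the same shape, which is what makes the downstream querying uniform.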
And, eventually, we continued down the path where we realized further and further that not only can you create those tables, but you can also create multiple aggregations and visualizations in Narrator. Well, actually, now that we've standardized the inputs to the questions you're trying to ask and answer, the thought process is standardized too. So we started building analyses and going through analysis generation, like, what if I can actually generate an entire story with a recommendation, exactly the way that a data scientist would? Because I have a standard input, I can actually get this whole thing to be reusable. And we ended up getting into reusable analyses and slowly, slowly kind of shifted Narrator to really talk about the end state goal. Because early on, it was like, we're going to do this thing that is amorphous, because this middle layer was an alternative to the star schema, and we were trying to show an alternative. But instead of an approach, it was a tool, which was a terrible idea. Don't compete approaches with tools. Like, no one understands you. So, eventually, as we've grown, we've realized that the core thing that Narrator does is make data teams so much faster in asking and answering questions and allow data teams to deliver those analyses in a very clear way to their stakeholders.
But everything they do can be reused, and it takes away a lot of the common things that you are used to suffering through. Data always matches. Numbers always match. Super easy to combine data. You never have to worry about foreign keys or how do I join this. There's no more joins. All joins have become abstracted in Narrator. You never define how, like, an order ties to a web visit. All that is gone. And on top of that, not only do you get a structure, you also get this underlying framework. We decided to actually open source this approach, called the activity schema, and compared it to the star schema, which is, like, the most standard way of doing it, and it's so much better. It is significantly a game changer. The star schema was designed in a world where a lot of the current problems that we face didn't exist. An activity schema is designed to address all those unique problems that are now very, very apparent and very large. Does that make sense? Like, it's been a long journey.
[00:12:01] Unknown:
Definitely makes a lot of good sense. And I think it's worth digging more into sort of the star schema, Data Vault, you know, all these different data modeling approaches that we have taken and tried to standardize on wide tables now that we have cloud data warehouses and maybe talk through some of the inspirations that you've taken from those different approaches, both positive and negative, and how those led you toward this idea of the activity schema? Because it definitely seems very much like domain driven design, event driven architecture style, but applied to data engineering.
[00:12:34] Unknown:
A couple of things with the star schema approach and the Data Vault approach and a lot of these wide-table approaches: there's an underlying assumption there that makes these things work, which is that your data is actually connected, and the stuff that you need is predefined. So, like, in a star schema, you're building these kinds of tables and you have your dimensions and your measures, and you're assuming that you wanna slice these dimensions by these measures, or something of that nature. And you can relate data very cleanly in a star schema. So what happens in reality is that that's never the case. Somebody has, like, a sessions table with every single session, and whether they became a customer, and whether it's, like, their first session, and then, like, their total order value. And somebody goes, okay. Well, actually, I wanna add another concept. I wanna know how many calls they had. Great. I can join the total calls. Well, I wanna know how many calls they had in the 1st month of their membership. Oh, wait. Now that's a lot trickier question. That is not something a UI tool can do, but now I need to add that column into my dimension, so now you can have that. Well, it turns out that this entire flow of a new question that you can't do in the UI, that you have to actually do in SQL, and constantly adding new columns, is so standard in our practice. We're constantly trying to grow our dimensions and our measures.
And because each table now gets to be really complex and large, which Data Vault tries to solve by kinda creating smaller things, but then you have to figure out how do you tie them together, and you create these join tables to help you figure out how these things stitch. As these tables get really wide and large, there's so much logic in them, the numbers start not matching. Well, now from a sessions perspective, what if this customer is not attributed to a session? Now I have to create another perspective, and now there's, like, sales with sessions and sessions with sales and all these sorts of different things. And I'm looking at it from a customer perspective, and then I have columns like first called at, second called at, third called at, last called at. And now you have, like, a 100 columns in this, like, weird growing table. And the more people are asking questions, the more often these things end up happening. People often blame BI tools for, like, not being able to self-serve, but it's nothing to do with BI. It's this fundamental data engineering problem that you're constantly needing more columns that require nuance to combine. And it turns out the fuzzier the questions are, the more miserable it is for data engineering. Like, when someone goes, well, I wanna know if they called us, but only if they called us before they went on the website and submitted a lead, or before they opened their email. Oh my god. That's 3 different systems with 3 different identifiers. Now how do I stitch them together? How do I combine them? Well, how do I do these joins? That's, like, 300 window functions. Now what happens if I miss a small thing in a window function? I have to rewrap it and close it and make sure I don't duplicate the data, make sure I don't drop a row. And if I make any careless mistake, numbers stop being right. So now I have this, like, 1,000-line query that's representing part of this ginormous, even bigger query that is the star schema. And I could componentize it and do a lot of stuff, and we've gotten really good at componentizing these things. But it's still extremely, extremely difficult, and debugging it is, like, a nightmare. So you have to go through the whole path.
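As a hedged illustration of that "just add another column" treadmill (every table and column name here is invented), each new nuance becomes one more hand-maintained aggregate bolted onto the dimension table:

```sql
-- Hypothetical ever-widening customer dimension: each new question becomes
-- another hand-written column, with its own join and time-window logic.
SELECT
    c.customer_id,
    c.first_session_at,
    c.total_order_value,
    COUNT(ca.call_id) AS total_calls,                      -- question 1
    COUNT(CASE WHEN ca.called_at >= c.membership_started_at
                AND ca.called_at <  c.membership_started_at
                                    + INTERVAL '1 month'
               THEN ca.call_id END) AS calls_in_first_month -- question 2
    -- questions 3, 4, 5 ... each is another column and another bug surface
FROM customers AS c
LEFT JOIN calls AS ca
    ON ca.customer_id = c.customer_id
GROUP BY 1, 2, 3
```

The "calls in the first month" column is exactly the kind of time-window logic he describes: easy to state, fiddly to bolt on, and invisible to a point-and-click BI tool.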
So that's the problem. Yet in event-based stuff, this goes away. Right? It actually goes away, because when you're writing algorithms and code, you actually can look at a single customer. You can say, give me the first time, give me the second time. You have this really, really natural way of working with data like you do in code. And it turns out that in a lot of the ways we structure data, the data models, we keep timestamps in almost everything we do, and we have, like, structures of first-class objects. Usually, your production system will do a good job of capturing a lot of your first-class objects. Your Zendesk will do a good job of capturing tickets, comments, and closed tickets, but they don't do a good job of stitching that data.
So event stuff is really good, because everyone here who's debugged a star schema has, like, chosen a customer, gone and looked at all the timestamps, and figured out what happened, and then gone, okay, this is where I lost my join. So the idea of the activity schema is, what if the way you debug it was kind of how you join? What if I took individual customers and followed their journey, and based on what our logic is, like, I want the first time they called us in between, but before this, I can actually pick out the exact ways of associating each individual activity, and then I can pull the features from each activity and have that be available.
So by going through time, you have it. The big challenge is that SQL is not time-based. So how do you take this time language of in between, first, and last, and convert it to a very effective, efficient SQL query that doesn't drop rows and does everything you wanna do? And then that was, like, 2 years of writing our own abstraction query language and doing this whole thing in a lot more detail. But at its core, the activity schema allows you to kind of do that work where you can move everything in time and ask and answer those questions. And we've written a lot about how you actually query the structure to get any table you need. Does that make sense? Yeah. It definitely does. And I definitely agree about the, you know, oh, let's just add another column to track this attribute.
[00:17:28] Unknown:
Yeah, and that's another element too of where the whole idea of event sourcing in application architectures comes in: you just capture the events. And then if you decide that you want to process them differently, you don't have to try and figure out, okay, well, now how do I back out this bug that I introduced? I just say, no, I just change the code, and then I just reprocess everything. And now I'm in the state that I want to be. So applying that to sort of the data engineering flow makes logical sense. I mean, that's where the whole streaming paradigm is supposed to take us, but there's still this dichotomy between stream-oriented systems and data warehousing systems. And they're working on merging the 2, but the data modeling approaches are still completely divergent, because streams do have a data model in that there is a schema for the event, but there's no data model in terms of being able to manipulate the stream over a sequence of time within the streaming engine, unless you're, you know, working at Google scale.
[00:18:21] Unknown:
Yeah. But the good news is, like, warehouses are very, very powerful and very, very fast. So you can actually do, like, what Narrator technically is: at its core, the activity schema is a stream. You can stream your data normally into an activity schema. You can also crawl your current production databases into a stream, because a lot of the time when querying, you're just converting these tables into kind of this long stream. You're just kind of converting your database tables that are in, like, a relational database into a stream, and then Narrator is an incredible way to use the stream as your data modeling layer and have all those nuances just be done live, on the spot, when you have a question. So, like, I think anyone who works in streaming has this idea that, like, this is eventually what you're gonna do, because, theoretically, it makes sense. It's just, practically, it's extremely hard.
So Narrator has taken that theoretical idea and made it practical now so you can actually do it. Like, we find most of our customers will save a lot of money on warehousing cost and get a much, much bigger speedup by using our data model, because warehouses are so good at dealing with long tables. They're so good at it. Like, way better than, like, these complex
[00:19:29] Unknown:
50-join queries and all that. No. No. Warehouses will take that long table, scan it, process it, do the pivot that we're trying to do in the way we join our data, and spin out that data for a 100,000,000,000 rows way faster than if you were writing this SQL stuff by hand. So it's worth digging more into the specifics of Narrator as compared to this idea of the activity schema, because I know that the schema definition itself, you have released publicly so that anybody can model their data that way. And I'm wondering if you can talk through some of the other elements of the Narrator platform and some of the ways that you arrived at this architecture, coming from the beginnings of operating as a consultancy to try and flesh out the ideas.
[00:20:09] Unknown:
With the activity schema that we opened up, we also opened up the way you should query it. And what Narrator does in addition to that is provide a very, very seamless, simple UI to enable you to query the data. So, like, all you do in Narrator is you're defining these activities, which are building blocks, and you're arranging them using 1 of 11 operators, like first ever, first in between, first after, last before, stuff like that. And what it turns out in Narrator is you can actually create any table you have built in your star schema, or any table you need to answer any question, by just combining activities with these very custom operators.
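To make one of those operators concrete, here's a rough sketch of the kind of SQL a "first in between" relationship could compile down to, say, the first call a customer made between one web session and the next. This is my illustration of the pattern against a hypothetical stream table, not Narrator's actual generated query.

```sql
-- Hypothetical: for each web session, find the first call the same customer
-- made before their next session. Both concepts live in one stream table.
WITH sessions AS (
    SELECT customer, ts,
           LEAD(ts) OVER (PARTITION BY customer ORDER BY ts) AS next_ts
    FROM activity_stream
    WHERE activity = 'started_session'
),
calls AS (
    SELECT customer, ts
    FROM activity_stream
    WHERE activity = 'completed_call'
)
SELECT s.customer,
       s.ts      AS session_at,
       MIN(c.ts) AS first_call_in_between
FROM sessions AS s
LEFT JOIN calls AS c
    ON c.customer = s.customer
   AND c.ts >= s.ts
   AND (s.next_ts IS NULL OR c.ts < s.next_ts)
GROUP BY 1, 2
```

Because both activities are keyed by customer and timestamp in the same long table, the relationship never needs a foreign key; identity and time do all the joining.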
So the Narrator platform allows you to, kind of, like, quickly take your activities and convert them into tables using these operators in a seamless way, with additional layers now. So once you've created the dataset, you can create multiple aggregations off of it. You can visualize it. You can send that data to a webhook into a product. You can materialize it. You can do everything you would do in a data tool super seamlessly. So what we see is, we've seen customers migrate off a star schema completely onto Narrator, and it's, like, a couple days. We've seen customers go from having their entire system in, like, Redshift and switch to Snowflake, and it's a day to have everything downstream just magically work, because Narrator just maintains, like, that single table and just how you combine those activities. And we've seen customers go from data teams asking questions and watching it go all the way down to customer success and product people asking equally complex questions as they kind of get used to this new way of asking questions with customer and time.
So that's the first part of Narrator. It's really, really just making querying the activity schema seamless so you can create any table and any dataset you need to answer a question, instantly. And I do that as a demo. If anyone here listening has seen my demo, or if you look me up on LinkedIn, I always say, ask me any question. Like, ask me a really complex question. And if I can't answer it in 10 minutes with our demo account, like, then don't buy Narrator. And you will see, people have always said, it just doesn't matter. Any question always gets answered.
So that's the first part of Narrator. The second part of Narrator is what we call our analysis library. So, like, Salesforce standardized a lot of data: Salesforce standardized sales into opportunities, tasks, and leads. And when they standardized that data, what ended up happening is that you were able to get the AppExchange. You were able to get entire analyses and algorithms on top of sales data. So you get to open up this world of reusability. And what Narrator has done is take that standardization and enable the same thing. So you can run 1 of our templates to understand CAC. You can run 1 of our templates to understand LTV. But you can also ask very generic questions like, how does, like, the number of calls impact the likelihood to convert into a closed won?
And Narrator will tell you exactly how many calls. It will give you an entire story explaining what it did. And if that story doesn't read like a senior data scientist wrote it for you and thought through the problem in detail and made a clear recommendation, then we're full of shit. But if you read it, the biggest feedback is, like, wait. Who wrote this? It reads like it's written by hand. And I think that's the power of standardization: you can actually write these analyses by hand and get to reuse a lot of them. As you have been exploring this space of trying to unify analytics across industries
[00:23:28] Unknown:
and developing the activity schema, I'm wondering what are some of the assumptions that you had going into it or some of the ideas that you had at the outset that have been challenged or changed as you have gone through this journey? There's really 1 major assumption that I thought
[00:23:44] Unknown:
I had that, like, actually ended up really breaking. So it turns out that when you convert your world to an activity schema, all the questions sound the same. Like, there's not that many things you can do. There's combining activities with 11 operators, and an infinite number of questions can be answered. So all questions end up getting mapped really, really easily. The 1 thing that I didn't realize: we had this idea of a customer. So we had this idea of a time, customer, action, and built-in identity stitching, which is a very big part of Narrator, because you need to do that if your only way of connecting data is customer and time. You spend a lot of time on identity stitching and making sure that works really well. The 1 thing that I was very surprised to see was the different perspectives that people have. So in Narrator, when you're asking a question, you're asking it from the customer's perspective. When we first built Narrator, it was assumed that everyone was gonna have 1 customer.
And as we've seen, it's actually never 1 customer. Most companies have 2, and some companies have 3. For example, there's your customer perspective: like, a customer opens your app, and a customer, like, submits a ticket, and a customer calls, and a customer opens a chat, and a customer receives an email. But there's also a company perspective that you might wanna ask questions from. And a company might start a subscription, but a company might have an employee reach out, and the company might have an employee open a chat, and the company might have these, like, similar activities from the perspective of the company. And it allows you to ask and answer questions very differently. So when you wanna know, like, does the number of tickets an employee submits change the likelihood for someone to renew, you will do that using your company 360, which is an activity schema on top of your company. But if you wanna know, like, does this customer's likelihood to come back to the app change as they're submitting tickets, you can use the person 360 and answer that question. And that kind of, like, multiple dimensions of how people are asking questions from perspectives, I never thought people would be creating similar activities and reusing them. So that was a really core assumption of the activity schema that really, really ended up changing. And the second core assumption that I realized is that the activity schema makes way more sense if you're not in data. That was an assumption that surprised me way more. Like, initially, we were convinced we should be talking to data people. We were selling to data people. We were always like, hey, guys. Like, I'm a data person. You're a data person. Here's the star schema's problems. I think if you Google star schema problems, my blog comes up as, like, the number 1, like, Google summary.
Like, we were like, here's all the things that I can tell you about why the star schema is not good, and you've dealt with this situation. And they're like, yeah, a 100% agree with you that the star schema kind of, like, sucks in this way. What the hell are you talking about with the activity schema? It makes no sense. What do you mean you're just gonna make everything look the same? Like, no. No. All my questions are unique. Like, how is this gonna work? And I'm like, well, look at it this way. Look at it this way. Look at it this way. And it was, like, impossible, like, early on. Now we have a lot of documentation and things, but early on, data people were, like, not having it. They were literally not having it. And then you talk to non-data people. Like, you talk to growth and operations and marketing people, and they're like, isn't that how data is modeled? Like, yeah. Duh. Like, I don't know. What is innovative about Narrator? So you put all the data into the way I already ask questions? Like, you know, of course I see the customer doing stuff. Like, what else would you do? And you're like, wait. What? Like, from their perspective, they imagined in their head that the data looked that way, and the way we ask questions is the way they already do it. They're like, yeah. Of course, I want the first time after they did it, and they only wanna see it if it's within 2 weeks. And I was like, woah.
The way you've always asked questions and the way you've imagined the world of data was always in this way. And coming from a data perspective, I was always conflicted, because I was like, you have no idea what you're saying. But it was just like they saw a simpler world, and I was dealing with a different reality. And when we changed the reality to look the same way as theirs, these people were like, cool. And that blew my mind. Seeing nontechnical people explain the approach that we have back to me, and it be perfectly correct, like, well, mostly correct, I was just like, what?
The data community should listen to marketing people more often. Yeah. A blasphemous statement.
[00:27:41] Unknown:
It's the same idea as with, you know, software architecture, where making simple things possible is harder than adding complexity. You can add complexity as much as you want, and it will eventually be able to do the thing that you want to do, but it'll be horribly unmaintainable. Same thing with what we're discussing with the star schema. But then figuring out how do I actually make this complicated operation simple, from an architectural perspective, is where it really requires a lot of thought and innovation. So
[00:28:13] Unknown:
Yeah. It reminds me of FaceTime. Like, I think with FaceTime, when they first designed the app, there was, like, video calling from Nokia before that, and those had this, like, entire app with menus and da da da. I think there's, like, a blog that explains this thing. And they slowly started removing it and removing it and removing it. And, eventually, FaceTime was 1 button. Call, hang up. That was it. Like, they were like, that's all you need. And then I see, like, my parents, and they're like, yeah, what else would you put? Like, I don't know what else you could ever want other than just call, hang up. Like, what am I gonna do with this call? It's a phone. And you're like, wow. Like, it's so interesting, like, the process that we take.
[00:28:47] Unknown:
Yep. Well, you have to figure out the hard way to do it first so that you know what's possible, and then you have to figure out how to make it simple. Yeah. Trust me. I've got those scars from my old days. And so now, for somebody who has an existing data infrastructure, they have, you know, various platforms, and they want to start integrating Narrator's capabilities into their system. Can you talk through how Narrator fits into the stack, and what's involved in actually migrating their information into this activity schema format, and some of the domain modeling that goes on to be able to figure out what are the appropriate actions, what are the entities that I want to model, and all of those kinds of ideas? Yeah. Great question. So
[00:29:27] Unknown:
1 of the things about being so new and different is that every customer we have is really doing a partial adoption. I don't know if it's even considered a migration. I think that what ends up happening in practice is that Narrator becomes additive. What we're seeing right now is that people are using approaches to solve self-service that weren't designed to solve self-service. So people are using dashboards and dimensional modeling to create this hacked world of open-ended questions that you can kind of click into with some dashboarding tools. So when Narrator comes in, usually, it's like a breath of fresh air to enable people to ask and answer questions. So in terms of implementation, the implementation of Narrator is actually really, really cheap and easy. Like, we often do it in a POC in 2 sessions, like, 2 short meetings. We actually have the stakeholders come up with all the questions they wanna ask, like, a bunch of questions.
And what happens is that stakeholders already ask questions with these concepts of entities in their mind. So quickly, you just parse them. You say, okay, clearly, you have 2 perspectives, a customer and an organization. Now, here's what this question is. It's just using these words. What do these things mean? And they're like, oh, we have, like, a sale and a call and a this. And then we choose those activities, and we write quick queries to represent each concept. And because the way that we represent the concepts is just what it is, those queries are super quick and super easy, and we have a whole library of them for almost all the common tools on our doc site, and, like, nothing goes more than 20 lines. It's, like, select blah blah blah from calls. Like, it's always, like, that simple. Maybe join to get the person's email. So what ends up happening is people start migrating to that. And then what we see in our customers is 2 use cases. Whenever an ad hoc question comes in, instead of going through the update-the-data-model, update-the-dashboard flow and maintaining this new structure, they will just kinda create a dataset and shoot it back to the stakeholder. And so you get these ad hoc questions being answered, and the stakeholder can now, like, aggregate it different ways, add additional columns, do small changes themselves. And, eventually, within a couple months, we start seeing that people actually go to Narrator themselves to ask the questions without even going to the data team. And when there's a little bit more of a complex question, they go to the data team, and the data team shows them a little bit more of how to translate their question into a good data question that you can then answer quickly in Narrator. So it becomes, like, a really nice ad hoc layer, and then it becomes an analysis layer as well. So then when someone asks a question, if it's, I just need data, here it is. But, like, usually, there's an underlying hypothesis. What somebody will do is just test that hypothesis in Narrator and give someone a full analysis. And sometimes that analysis is not actionable, and Narrator will, like, rerun that analysis every single week or every day or every month and email you if that analysis changes from not actionable to actionable, so you can be informed. So you're seeing these companies build up this repository of all their questions in Narrator. They're getting these faster ad hoc questions answered, and this whole thing is happening in a maintenance-free way. So the data engineering models that they currently have don't keep growing out of proportion. And what ends up shifting is that the data models end up later on getting cleaned up, so you have those clean data models and your dashboarding for, like, executive dashboards: what you put on the TV, what your executives are seeing. And you get rid of those, like, 1,000 dashboards that were used for 1 purpose. Like, one-time dashboards go away. And all these data models that are not critical go away, because all those things end up being replaced by having Narrator ask and answer questions more fluidly.
So it really just shifts things: instead of the data team maintaining hundreds of data models, it just makes them maintain 10 data models and then the Narrator activity schema. And then it allows the consumer to actually get the data in the way that they're used to, make changes, and ask and answer more questions without these single-purpose dashboards with 0 ability to do follow-up questions. So that's been the case. And honestly, there have also been a lot of additional use cases that I never thought about that I'm seeing a lot more of our customers doing. So I'll also share some other use cases that people have done in Narrator. Yeah. Definitely. So 1 of my favorite use cases, and I'm actually writing a blog on it, was a product team. So this product team added this, like, impact section to their product feature doc.
And now in every single product feature doc, when someone submits any feature, like, for example, they wanted to redesign the users page, they have to now define, okay, what KPI are we trying to affect? And they're like, okay, number of users added to the platform. And then they go, okay, what OKRs is it gonna affect? They're like, trial-to-subscription conversion rate. And they have that, and it has a Narrator link, which is, run that exact question and answer in Narrator and see if it's actionable and see what the lift is. And what happens in this process is that people then run that question. Number of users added has no impact on the conversion rate. And they're like, wait. Maybe it's not the number of users. Maybe it's time to add the first user.
And they refine the question until they get it, or it turns out there's nothing there: the users page is fine, don't touch it. So it's, like, 0 impact. Or if it does have impact, they have a quantifiable lift number that's, like, based on our historical data. This whole thing is being done by a product manager without ever talking to data teams. So they're iterating through their question. They're defining and clarifying it, and they're defining the lift, and then they're using that lift to rank order their features and how they build them. And I was just like, that is so cool. I never thought somebody would do that in product. There's another thing that was also very popular, in customer success. So we have a customer journey view where you can, like, kind of, like, look at full customer journeys.
In customer success, people just started using it as, like, part of their operations, which is weird, to use a warehouse as your operational tool, but whatever. What they would do is, whenever they call a customer or a customer submits a ticket, they would actually look them up in Narrator, and they eventually added direct links from HelpScout to Narrator. And what it does is, I get to see everything the customer has done: the times they called before, the tickets they submitted, what they've done on the app, which emails they read. Everything that has happened across all our systems in 1 place, instead of just depending on the notes or the context they gave me. So I have a timeline history of everything about this customer, and once I help the customer with their ticket, I'm able to be like, wait, how many times does this thing happen? And I can just hop into a dataset and see, like, is this a one-time thing that a customer did, or do we see something that's, like, a very glaring problem, where everyone who views this page ends up submitting a ticket and then later on doesn't do this thing? And I'm like, that is so great. So stuff like that, I've been really, really excited to see: warehouses become operational tools and, like, have data actually be used in product, because every single product person wants to use data. But usually it's, like, a screenshot of a dashboard that's, like, total sales, and you're like, I don't know how this affects your feature, but whatever. I'll believe you.
[00:35:45] Unknown:
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it's often too late and the damage is done. Datafold's proactive approach to data quality helps data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage, and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values.
Datafold integrates with all major data warehouses as well as frameworks such as Airflow and dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold. In terms of exposing the customer journey to the, you know, call center operators or the, you know, product teams, I'm wondering if you can talk through some of the performance characteristics of that in the data warehouse, because, generally, you think of the data warehouse as for performing large aggregates, not necessarily for doing, you know, segmentation or understanding the behavior of a single entity. And then another aspect of that is the sort of governance and access control attributes of understanding what are the specific types of actions or what are the specific entities that we want to give access to for these various roles within the organization?
[00:37:23] Unknown:
So great question. So warehouses are incredible. Like, 1 of the best things that we did at Narrator is that we built Narrator on top of a warehouse. Because a lot of the streaming approaches built their own data store, like Amplitude, Heap, and it, like, just doesn't scale, and it's so slow, and it's really annoying. 1 of the benefits of building on top of warehouses is that warehouses have gotten really good and continue to get better at processing data. So, like, a select star with a where filter on, like, customer equals whatever entity you want, even at a 100,000,000,000 rows, even at 500,000,000,000 rows, even at a trillion rows, on, like, a Snowflake cluster, is relatively fast.
Like, you're talking about a couple seconds to get the data. And at larger scales, it's fine. Like, Narrator is incredibly fast as a tool, because warehouses have gotten really, really good at dealing with data, especially time series. Like, you've seen tools that do, like, time series searches and query on top of it, and, like, now we're getting to people who are talking about it in milliseconds. Even when you go down to, like, a Redshift cluster, it's still, like, 5 seconds, 10 seconds to get, like, a lot of complex data. So warehouses are scanning fast. Because, again, also keep in mind that the activity schema is, like, 11 columns. So, like, in your columnar database, you're scanning 1 column relatively fast, and with some specific sort keys and partitions, it's even faster.
And since it's, like, 11 columns, this whole thing is, like, 30 gigabytes to, like, scan a single column. Like, your warehouse will do that incredibly, incredibly fast. So we end up not really worrying. And it is a different experience when you're dealing with a warehouse, because the expectation is a little bit different too. If you're used to dashboard tools, you're like, dude, everything is so slow. Like, especially when you deal with large data. On small data, warehouses will just behave kinda like operational tools, and you can't even notice it. They'll, like, give you data right away, in a second. And when you get to larger data, like, if you have, like, a 100,000,000,000 rows in Heap and you try to do anything, like, you're gonna be sitting there for 7 hours hoping that the world changes. But if you do that in, like, Narrator on your warehouse, your warehouse will process that in, like, 30 seconds, and you're like, wow, that was way nicer. Like, 30 seconds is slow, but, like, based on your experience with your data size, it feels so much better.
So that's kind of what we've seen. And there's a lot of reasons we can dive into for why warehouses are so good at doing large, long, skinny tables. Like, these columnar data warehouses are designed for this. Like, BigQuery was derived from Bigtable, which was, again, 1 ginormous table. You saw this idea in Dynamo also. Like, people are moving toward these, like, single-table structures, because scanning 1 column to get the data you need is really cheap versus joins. Joins are expensive. Joins are really expensive. So Narrator minimizes that to, like, 2 joins maximum in, like, its processing, and it's very, very clever in how it squashes the data early on. But scanning and filtering, warehouses just, like, crush that. With some partition and sort keys, it's beautiful.
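That operational "look up one customer" pattern he describes really does boil down to a filtered scan of the stream, which is why a columnar warehouse with a sort key on the customer or timestamp handles it in seconds. A sketch, with a hypothetical customer value:

```sql
-- Hypothetical: pull one customer's full timeline across every source
-- system from the single activity stream table.
SELECT ts, activity, feature_1, feature_2, revenue_impact, link
FROM activity_stream
WHERE customer = 'jane@example.com'
ORDER BY ts
```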
That's what ends up happening when people use Narrator as an operational tool. It feels like you're using a production tool. And, like, it surpasses the bar, I think, for most people. And then, on the sort of security and governance aspect, what are some of the
[00:40:24] Unknown:
useful patterns or approaches that you've developed for being able to identify sort of what are the actions that we want to, you know, grant or restrict access on, what are the entities that we want to grant or restrict access on, and just sort of managing that sort of governance layer of this data tool? I think with different sizes of companies, these things become a very different
[00:40:44] Unknown:
thought process. I think that what ends up happening a lot is that there is usually some features or some things that you want to minimize visibility to, like the customer's email or something like that. And Narrative does a really good job in just allowing you to control whether some people access, like, activities or table or the activities schema. Like, usually, you have 1 or 2 perspectives, and you can choose the activities. But because everything is, like, first class objects, you're never really filtering, like, don't show them this column. You're just like, okay. Is the marketing team allowed to see, like, order data? Yes. Are they allowed to see session data? Yes. Are they allowed to see this? And you just kinda select the activities that, like, this team can actually see. And when you think about everything as first class objects, it ends up becoming so much easier to governance. Like, it's actually 1 of the simplest things because you're like, what concepts do you use in your job? Here they are. You have to solve them.
And instead of having to worry about all the possible questions and all the ways of combining them and all the tables and all the columns and everything that you build in your entire system, like, dependency management and lineage don't exist in Narrator, because there's only 1 dependency. Everything is on top of the activity schema, literally 1 layer. There's no additional layer. Everything is just a query on the activity schema. Nothing more than that. And there's no lineage, because you're always combining activities. Right? So it's like, what is the lineage of this table? These 2 activities that you're using. So what we've just seen is customers would do that. And whenever there are, like, nuanced things, they just create separate activities, because those nuances should be separate concepts anyway. So, like, you'll see companies do, like, completed sale and then, like, deposited transaction. And, like, a finance team might get deposited transaction, because it's the concept of when the money got deposited, and the sales team gets when the sale is made. Which is super critical also, because it allows people to understand the nuances of how long it takes from a sale to a deposit, how many sales do we have that don't have a deposit within a month, how many deposits do we have that don't have a sale before them. Like, those nuanced questions of edge cases now become part of the exploratory stuff that you can do with Narrator, and they do it. And we had this issue at WeWork. I remember, what is a sale at WeWork? Is it when the contract gets signed? Is it when someone moves in? Is it when their reservation starts? Is it when they first pay us? And we're like, oh, these are 4 different concepts. They should be 4 different activities, because there's always gonna be a case where data is not clean. And this is 1 of the other things that makes the activity schema so nice: in these nuanced, messy, messy situations, having separate activities makes a lot of sense. Same with, like, leads. Like, I have never seen a lead system, or anyone who's implemented Salesforce, that has 1 unique lead per customer. Like, everyone thinks that, but I've never seen it. There's always duplicates.
So people are like, okay, well, just give me the first lead, the first time this thing got submitted as a lead. And then you don't have to deal with that in your transformation, and, like, oh my god, there's duplicate people, and now I joined, and now there's duplicate sales, and now there's duplicate transactions. You just say give me the first lead, and it works seamlessly, and it handles those cases much more nicely.
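On a stream like the one sketched above, that deduplication really is a one-line filter, because the occurrence is precomputed per customer per activity. The column and activity names here are assumptions, not from the conversation.

```sql
-- "Give me the first lead": duplicates collapse into a simple filter
-- because activity_occurrence is precomputed per customer per activity.
SELECT customer, ts, feature_1 AS lead_source
FROM activity_stream
WHERE activity = 'submitted_lead'
  AND activity_occurrence = 1;
```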
So that's kind of the world of access control: often the concepts are different, and often you can control access just by the activities you give people. And then there's also access control on analyses. Like, if you wanna share data publicly, or you wanna give people read-only access to an analysis so they can't see the underlying data, or full access, all of that becomes part of Narrator, and you can do it seamlessly. But, yeah, what I hope everyone who hears me is kinda getting is that a lot of the tools and processes we've built are trying to manage a star schema approach. And that's a world where you have thousands of tables, thousands of columns, and, like, so many layers to it. And you need a data dictionary. You need lineage. You need heavy, heavy access control. You need row-level access control. You need column-level access control. And when you switch to an activity schema, the activity is the data dictionary. There's no data dictionary needed. There's no lineage. There's no dependency management. There is no complex row-level access control. You're just saying give me the activities that I need access to, and that's it. And it just simplifies a lot of the parts of data that we hate.
[00:44:36] Unknown:
And to your point as well about the messy data problem and the idea that 80% of a data scientist's time is spent on data cleanup: the data engineer is brought in to do all of that cleanup ahead of time, but once they've handed over this cleaned-up dataset, there are still edge cases, or you have accidentally elided some outliers that are important to the analysis. And so having all of that be part of the activity of actually doing the analysis, rather than having to have this cleaned-up dataset, is, I think, important, because it requires that you understand the domain that you're actually doing the analysis on, and not just saying, okay, well, I have this normalized schema that I can now do my exploration on.
[00:45:20] Unknown:
Yes. The one thing that I would like to clarify: people do love throwing around the statistic that 80% of data science is data cleanup. When people hear data cleanup, they're thinking, oh, remove null values or change Florida to FL. No. Nobody gives a shit about that. That's just something you handle in the algorithm and deal with. The cleanup part is figuring out how the hell do I stitch data together so I can answer the question. Like, if you're trying to do data science and you wanna run, like, Naive Bayes or any algorithm, you need to figure out how to get each row to represent the concept that you want. This session viewed this page, and then the person called us, and the person bought. And then I can run an algorithm once I have the data in this format.
Getting that format, guaranteeing that this session has the correct number of calls in it and whether the person converted, this is what we actually call data cleanup. You're writing all these queries to try to get this data into the format you want so you can do the analysis. And part of the challenge of data science, which any data scientist will tell you, is that you need consistent assumptions and guarantees. You can't write an algorithm unless you know what the assumptions and the guarantees of the data are. If the assumptions and the guarantees of the data are different than what you think, you're done. You shouldn't even start. And the standardization part of Narrator makes sure that everything you do has standard inputs, guarantees, and assumptions as part of it. So this data cleanup is really data modeling.
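A rough sketch of the shaping he's describing, one row per concept with the guarantees visible in the query itself, assuming the illustrative stream above and made-up activity names:

```sql
-- One row per customer: page views, call counts, and conversion.
-- Activity names are invented for illustration.
SELECT
    customer,
    MAX(CASE WHEN activity = 'viewed_page'     THEN 1 ELSE 0 END) AS viewed_page,
    SUM(CASE WHEN activity = 'called_us'       THEN 1 ELSE 0 END) AS n_calls,
    MAX(CASE WHEN activity = 'completed_order' THEN 1 ELSE 0 END) AS converted
FROM activity_stream
GROUP BY customer;
```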
People keep telling me there's a hundred companies trying to solve the data cleanup problem, because data scientists are spending all their time on data cleanup. And you're like, talk to data scientists. They're not dealing with turning the word Florida into FL. They're dealing with, like, how the hell do I know this person came in. They're dealing with identity stitching and combining data at large scale, and if you've ever seen their queries, they're insane SQL queries. And data engineers, what they're trying to do is create structures so data scientists' queries can be a little bit more manageable. Because data engineers are like, please stop writing. You're so bad at writing SQL, and you've done this insane thing, and I don't wanna debug it. You're gonna be like, why do the numbers of sales not add up? And I'm like, you wrote 3,000 lines. I have no idea. One of those 30,000 joins that you did is wrong. So data engineers need to, like, reel data science in, and this whole dichotomy and tension between them has been characterized as data cleanup. And it's like, no, no. This is a data modeling problem. Actually, most problems that we face in data, like 90% of the problems that I hear, come down to data modeling.
You ask people, like, why are my questions taking so long? Data modeling takes too long. Why do the numbers not match? Data modeling. Why is my Looker instance so slow? Data modeling. Why is my warehouse cost so high? Data modeling. Why is my data scientist not able to run an algorithm? Data modeling. Why is my algorithm not working? Data modeling. So many things that happen in the entire experience of data are a data modeling problem. And data modeling, to this day, is done based on the person and their opinion. Your data modeling is as good as your data engineer and how much experience they've had, and everyone redoes the data model every, like, year or two. If you did a full refactor of your data modeling in the last year or two, you hoped that the refactor was gonna solve all the problems that you had in the past. And it is gonna solve all the problems you had in the past. But next year, you're gonna create 200 new problems, and then you're gonna have to refactor again.
So this data modeling problem needs a standard solution that we know works for every company, for every person, for every question. And I think that's why we said, that's it. We're gonna do a standard thing, and it's called activity schema. Very passionate about this.
[00:48:40] Unknown:
Yeah. It definitely shows. I appreciate that. And so, starting to close this out, you've already shared some of the interesting and unexpected things that people have built with Narrator. I don't know if there are any others that you wanna share. I've seen a couple of things that I find interesting. I've found a lot of interesting questions being asked.
[00:48:58] Unknown:
Like, every now and then, I, like, see myself recommending, like, a marketing question to, like, a finance company. Like, marketing has this concept of, like, attribution and first touch, and you're like, okay, well, what is the first-touch attribution? And you wanna help them understand how that affects conversion. It's a very classic marketing question. And then you deal with, like, an insurance company, and they actually care about what was the first quote that you received and how that impacts the likelihood of you actually ending up buying. And you're like, wait. What? This is a similar thinking process. And it turns out that this thinking process, like, I'm hoping as Narrator grows, we're gonna open up an entire store. But what you'll see in the world of Narrator, when you standardize data, is that insurance companies, SaaS companies, marketing companies, ecommerce companies all have unique perspectives around similar ways of combining customer behavior to understand something. And for me, it's really, really cool and beautiful when those things overlap.
And by standardizing data, that overlap becomes so crisp and clear, because I'm seeing the exact same question asked with different building blocks and then having completely different outcomes as part of it, and I've been fascinated by that. And I just think that people asking more questions is something that's really, really beautiful. People who are using Narrator are asking an average of, like, 42 questions a month. That's a lot of questions they're putting into Narrator. And they know once they put the question into Narrator, it will continue to be asked, so they never have to worry about it. So people are kind of using Narrator as a way to capture all their hypotheses, all their hunches. They're just like, okay, I'll just put it in Narrator, and if it becomes important, it becomes important.
And I think that's just making data really, really, really part of your practice. I used to call it commoditizing data: really making it, like, a commodity to ask and answer a question without feeling like it's gonna be this humongous lift. So
[00:50:49] Unknown:
those are the things that I always find just so fascinating in how your customers are using Narrator. Another interesting element of this, the idea of the activity schema and being able to close the loop of data and make it possible for anybody in the business to ask and answer their questions and validate their hypotheses, is this other rising trend of what people are now terming operational analytics, with the likes of Census and Hightouch, and being able to propagate your data from your warehouse back into your operational systems. And I'm wondering what you've seen as the potential for the activity schema, or Narrator specifically, to work within that cycle of being able to say, okay, I have this question.
I want to be able to segment my users based on these attributes and then push that back into Salesforce or Mailchimp or whatever that might be. We actually
[00:51:38] Unknown:
did this thing at WeWork, where we actually would have the warehouse generate this table that would be a cache layer. We worked with Census a lot, and we have webhook integration in our product, so you can send your data out. And what we're seeing is, especially with Salesforce and especially with, like, email clients, you often have so many columns that you're adding there, and you're having engineering build integrations to add stuff that's already in your database. Like, your Salesforce has, like, first lead source and first-touch ad source and when they've been on the website and total pages viewed, and you've built all this engineering complexity to do that. And what we're seeing is a lot more of that pathway where you're just saying, okay, well, that data is already in Narrator. Why don't you just send it to Salesforce with webhooks? It's, like, three clicks. Boom, boom, boom. Send it to Salesforce, and now I can add data from my tickets and my calls and everything I need into Salesforce, into my opportunity or lead object, easily.
An easy way of seeing it: I love the way we use it ourselves. If you start using Narrator, you will receive emails based on your behavior. And those emails are all just generated from datasets. And the reason we do it from Narrator is that doing it with engineering would be crazy. Like, I'm like, I noticed that you started creating a dataset and never got a chance to save it, but you've recently viewed our doc site about datasets, and you haven't submitted a ticket yet. For me to ask that question in engineering, that's, like, seven different systems, from tracking our marketing site, to tracking our doc site, to tracking our tickets. But in Narrator, it's, like, six clicks, and I can create that row for every customer who's done this exact behavior. And I can just send them an email when they do it. And the best part is that, because the email is being tracked back in Narrator, I can say remove anyone who I've sent an email with this tag to in the last six months, so that you don't get the same email. You build this endless cycle, because that data is being sent out, and then the sends and the email opens and all that stuff come back to the warehouse, and I can use that back in my dataset.
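A hypothetical version of that audience as a single query over the illustrative stream. The activity names and the tag are invented, and date arithmetic syntax varies by warehouse.

```sql
-- Started a dataset, read the docs, no ticket yet, and not already nudged
-- in the last six months. All activity names and the tag are invented.
SELECT DISTINCT s.customer
FROM activity_stream s
WHERE s.activity = 'started_dataset'
  AND EXISTS (
      SELECT 1 FROM activity_stream d
      WHERE d.customer = s.customer AND d.activity = 'viewed_docs')
  AND NOT EXISTS (
      SELECT 1 FROM activity_stream t
      WHERE t.customer = s.customer AND t.activity = 'submitted_ticket')
  AND NOT EXISTS (
      SELECT 1 FROM activity_stream e
      WHERE e.customer = s.customer
        AND e.activity = 'received_email'
        AND e.feature_1 = 'dataset_nudge'
        AND e.ts > CURRENT_DATE - INTERVAL '6 months');  -- syntax varies
```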
So you end up building this, like, live dataset in, like, five minutes. We've also seen five-minute recommendation engines. Like, what's the best first product to get someone to buy to maximize retention? That's, like, three clicks in Narrator. It's a very common question people ask. Give me everyone's first product order, let me know the conversion rate to a next order, and give me the top 10 best-converting products. Webhook them to my product, have that be a cache layer. And when someone lands, they dynamically see the product that we think is best for them to buy, based on the conversion.
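A sketch of that best-first-product question, again over the illustrative stream. The precomputed occurrence and repeated-at columns do most of the work, assuming feature_1 holds the product on a 'completed_order' activity.

```sql
-- First product ordered vs. conversion to any next order, top 10 products.
-- Assumes feature_1 holds the product name on 'completed_order' rows.
SELECT
    feature_1 AS first_product,
    COUNT(*)  AS first_orders,
    AVG(CASE WHEN activity_repeated_at IS NOT NULL THEN 1.0 ELSE 0.0 END)
              AS conversion_to_next_order
FROM activity_stream
WHERE activity = 'completed_order'
  AND activity_occurrence = 1
GROUP BY feature_1
ORDER BY conversion_to_next_order DESC
LIMIT 10;
```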
Oh, I wanna add gender and see what the best products are for each specific gender. Great. Another click. Here you go. Now I have gender and the top five products with the conversion-rate expectations in my production system, which can dynamically create this experience for the customer. That's like, hey, people who recently bought this thing really like it. And because it's recursive and dynamic, it's like, oh, you launch a new product, people start buying it, people love it. And that's a very, very basic, naive approach. You can definitely do more complicated stuff and process the data, but that works better than just randomly putting something out there. So I love stuff like this that people are doing with Narrator: really leveraging your warehouse, having the fact that your warehouse is combining all your data available to you, so you don't have to do it in the product, and just sending that data out. And Census has done an incredible job, and Hightouch has done an incredible job, of making that available. And, again, those tools: most of our customers use Census and Hightouch. Actually, Census, we talked to them because one of our customers was like, hey, I need help. It was just, like, a whole thing. The customer is not technical, and he was like, I can create all the data I want. I just need you to help me get it into this system. And I was like, yeah, Narrator solves that problem. And they were like, okay, most of our customers complain because we tell them the structure they need, but they can't get their data into that structure. And I'm like, that's the data modeling problem. Back to the core problem.
Luckily, you can give them instructions, and we can do it for them in Narrator in, like, two minutes. Because whatever that structure is, we have a standard schema. So if we can build that transformation once, we can write a little doc: here's how you do it, click these six buttons, put it in there. And the customer can dynamically control many more things in there, and it's easy. So I will bet my life that in, like, five years, warehouses are gonna get even faster, and it's gonna be so seamless. More and more people are gonna keep production systems focused on production, and data will go through the warehouse, get cleaned up in a standard way like Narrator does (it has to be standardized, because we're not gonna keep doing this by hand), and then be able to be sent to any product. And that will be the flow of all data integration. No more live, direct data integration with, like, sockets and RabbitMQ and messaging and all these things to do it dynamically. You're gonna see data going into the warehouse and the warehouse sending it out. And, eventually, the warehouse will be so fast that this will actually appear as streaming, kind of like Kafka pushing your data, processing it, and bringing it out, all dynamically and instantly.
In terms of your experience of building Narrator, working with your customers, and developing the activity schema, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process? Data people are very skeptical. So that's one of the things. I tell people, I'm gonna standardize all your data into a single table, and I'll answer any question. And they'll say, there's no way you'll do it. And then I'll do it in front of them. And they'll say, I don't believe you. What's the catch? And I'll say, ask me enough questions. And they still won't believe me. And I'll say, find me one question I can't answer, and they still won't believe me, even when they can't come up with one.
And then, like, a month later, I'll get an email saying, I thought about the activity schema, and it turns out I think it works. So, data people, versus, like, software engineers, who are much, much more likely to try new tools and experiment with things, do not like change, and they're too busy to try something new. But they're so busy because the way they work is just not very effective. It's been a really big challenge to sell to a community that I love so deeply but that, like, refuses to use anything different. But we're slowly wearing them down. I feel like more and more people have fallen in love with the activity schema, and everyone who uses it is obsessed. So slowly, the data community will be won over. The second thing is, I don't know if this is good to share, but I'll share it anyway: the way you want to do data and the way you need to sell is kind of counterintuitive.
Like, every data person, when I was a data person, I used to hate that, like, the head of sales would buy data tools. Like, our head of sales bought Looker, and he was like, data team, implement it. And we're like, why'd you buy it? We have four other dashboarding tools. And it turns out that's because that's how companies sell. Data companies sell to business stakeholders, and the data people are kind of left in this, like, shit place where they just have to implement it. That's another thing. The other thing I will say that I've also learned, that's been really, really challenging, is that when you're selling a product, you're not competing against other products.
You're competing against other products' marketing. And it's such a hard problem, because I often have to answer questions like, what is the difference between Narrator and Looker? And you're like, it's so different. Most of our customers use both. We're solving a problem in self-serve analytics and data, and Looker is dashboarding. And they're like, Looker says it does self-service. And you're like, yeah, but it doesn't. And they're like, well, Looker says that it gives me analysis, and you can see its case studies, and it helped Coca-Cola make $1,000,000. And you're like, but Looker didn't do it. The person who built the analysis and then put Looker, the dashboard tool, on top of the analysis did it. And then you're in this, like, weird place. So that's always a challenge, just the nature of it. That's a good learning lesson.
Another thing: marketing and nuance are not friends. When you explain Narrator, as you've seen by the editing of our site, Narrator is a nuanced solution. It's very nuanced. To understand why it works so beautifully, you have to understand the nuance of the problem. You have to understand the nuance of it and see it in action. If you don't know the nuance, it sounds like you're writing SQL to create a table, and that table is being used to answer questions, which sounds like everything else you've been doing. So I like to say, not all SQL is the same. But there are all these layers to it, I think, that we are seeing. And, oh, the last thing that I'll say, which I also found to be soul-crushing, is that people don't like to use data. No one's job is on the line to use data correctly or make better decisions. No one is responsible for making good decisions with data. There's this whole theory that everyone has to be data-driven. But if your data team takes too long, nobody cares. If your data team gives the wrong answer, no one cares.
It's this weird dynamic where the value of answering questions correctly and faster has to be very deeply explained, why somebody would care about it, because no one's job is on the line if that's not the case. No one needs to answer questions faster. There is a clear job to build dashboards. There's a clear job to build models. But answering questions, and doing that a lot faster and better, is something that's not in our practice, even though we love to say we're data-driven. Because, like, that just means that I took a screenshot of a dashboard and put it in my PowerPoint. It doesn't mean that I've used data to answer questions. And most people actually don't wanna use data to answer questions, and it's probably because of us. As data people, we've made asking and answering questions so painful for stakeholders that we've trained them not to ask questions.
We've literally been like: you ask me a follow-up question, and I'm like, hey, give me three weeks and I'll build you another dashboard, and you'll be, like, half happy, and then you'll ask another follow-up, and eventually you'll just give up on asking questions. Product people would rather have engineers build it and A/B test it than do a data analysis before they build it, because it's actually faster to have engineers build it, test it, and throw it in the garbage than to go through data. All these little things, I think, are going to change, because as some companies get better and better at data, it creates a much, much bigger gap, and that competitive advantage just goes away. You have to get really good at answering questions, because right now the lines are getting closer and closer and closer, and mistakes are less and less forgiving.
So data will eventually become important. But right now, surprisingly, compared to what I was expecting, it's not as mission-critical as people like to advertise that it is.
[01:01:17] Unknown:
For people who are excited about the promise of the activity schema, who are interested in the capabilities that you're creating at Narrator, and who wanna be able to ask and answer these questions or understand more about their organization:
[01:01:30] Unknown:
what are the cases where either Narrator is the wrong choice or the activity schema breaks down? So, you have to have a lot of questions to make Narrator make sense. Right? Like, if you're trying to build an executive dashboard for your executive team, and it's answering, like, five very specific questions that everyone needs to see, Narrator is not a good tool for that. Don't use Narrator for those things. For that, like, dbt is incredible. Build a SQL query, have a pipeline, and put it in the dashboarding tool. So the time when you wanna use Narrator is when you have a lot of questions and your team is dynamic and you have a time crunch. Right? If you have unlimited time, like, Narrator is simpler, but really, where the value of Narrator comes in is if your data is messy, you have a lot of data, it's all over the place, and you have multiple data sources. So if you have one single data source, don't use Narrator. That doesn't make sense. You have to have multiple data sources, lots of questions, and this kind of time pressure. And when you have those three components, Narrator makes a lot of sense. Single table: like, we've seen a lot of companies that are like, oh, I have a single table. I'm a finance company. I have this structured table, and I wanna slice and dice it. Put that in Tableau. Narrator's not the tool for you there. But, yeah, multiple questions and messy data. I think you'll see that messy data works a lot better in Narrator than in anything else I've ever seen. So people often worry, I can't do data analytics, my data is messy. Use Narrator for that case. And, yeah, just ad hoc questions. The more questions you have, the more Narrator makes sense. If you don't have a lot of questions, then Narrator is just very, very much overkill for the two questions you might wanna ask. And as you continue to iterate on the product and the business, what are the things that you have planned for the near to medium term? So, we're refining our access control: a little bit more controls and giving people more ability to control access.
We are adding pipelines through GitHub, which is another big thing. We're improving our analyses. A lot of our customers are creating their own analyses, because all our templates are just analyses we write, and we make all our algorithms and everything we do public so that our customers can do the same. And they've been really liking it, liking the way of writing these stories and combining algorithms and data seamlessly inside of them. And we're trying to make that experience a little bit nicer. So, embedded code: right now it has partial Python, but we're allowing you to write full open-ended code, to comment on it, to share it, to have it emailed to you, all these different pieces. Most of our customers forget data preparation is even a thing; they just start shifting to more and more questions. We're trying to make sure that experience of creating your own analyses, editing them, commenting on them, and talking about the insight that you found is even better. So those are the big things that we're gonna see
[01:04:11] Unknown:
coming up. Are there any other aspects of the work that you're doing at Narrator, or the capabilities and use cases for the activity schema, that we didn't discuss yet that you'd like to cover before we close out the show? So the last thing I'll say is that, with privacy, identity stitching has become
[01:04:27] Unknown:
such a big deal. Like, with Apple, and the way that people block third-party cookies, and how do you stitch people across multiple different sources, and people now having many devices and shared computers and all these things. Narrator's solution for identity resolution is by far superior to anything that exists on the market, and it does it without breaching privacy. Data never leaves your system. Data is never shared. It's just a way for you to best combine the data that you have internally to create the highest-resolution version of a customer. And I think that's going to become more and more critical as we move to a more and more security-centric view. You shouldn't be using external data to, like, tie your customer together and look up their cookie in another service. You should be doing it with the data you have and using your information better. And I can't fully explain that on a call, but if you're ever interested in identity stitching or identity resolution, we have some blogs, and you should reach out to us. That part has to work incredibly well in Narrator, and people, when they see it, are usually pleasantly surprised at how much more granularity and understanding of your customer you can get when you do proper identity stitching across all your different data sources.
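A toy illustration of the first-party flavor of stitching he's pointing at: nothing external, just resolving anonymous events once some activity ties the two ids together. This is not Narrator's actual algorithm, and the names come from the illustrative stream above.

```sql
-- Toy identity stitching: any row carrying both ids (say, a login) becomes
-- the mapping, and anonymous-only events are resolved through it.
WITH id_map AS (
    SELECT DISTINCT anonymous_customer_id, customer
    FROM activity_stream
    WHERE customer IS NOT NULL
      AND anonymous_customer_id IS NOT NULL
)
SELECT
    s.activity,
    s.ts,
    COALESCE(s.customer, m.customer) AS resolved_customer
FROM activity_stream s
LEFT JOIN id_map m
  ON s.anonymous_customer_id = m.anonymous_customer_id;
```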
[01:05:30] Unknown:
Absolutely. Well, thank you very much for that. And for anybody who wants to get in touch with you, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. I have a very Narrator-centric view.
[01:05:46] Unknown:
I think people don't have the ability to actually ask good questions. I think data tooling has to start thinking about things from the perspective of how people consume information. And right now, what we have done is create so many complex visualizations that are very hard to deal with on live data, and so many questions are really, really bad questions. And I think that there is a world where there's an additional layer that helps people get the education and training to best consume that information. Because I do think that warehousing technology, visualization technology, and now, with Narrator, data preparation and answering questions, are all getting pretty damn good.
But there's still a huge gap in how we consume that information and what it means. And I think that there has to be a way to integrate that into your daily practice, just helping people ask better questions. In Narrator, we explain it, but it is very, very common for people to struggle with live dashboarding, with BI tools, with visualizations. Like, I still, to this day, see a world map, and I'm like, what am I supposed to get out of that world map with, like, 700 circles on top of it? Am I supposed to understand the big circles, or the places I care about, and keep the globe and what it looks like in my head? Just give me a list of the top five countries, top five states, top five cities. Done. All these different things lead to a lot of misinformation and cloud a lot of information, and we get into this habit of, like, live debugging. So I think the biggest gap is something to help with what you should ask, how you should ask it, and how you should read it. So
[01:07:23] Unknown:
I'm excited for those kinds of things to be helped along. Absolutely. Well, thank you very much for taking the time today to join me and share the work that you're doing at Narrator and the work that you've done to develop the activity schema. It's definitely a very interesting approach to a complicated problem that everybody feels. So I definitely enjoyed being able to learn more about your experience using it and the benefits that you've been able to see from it. And I look forward to experimenting with it on my own as well. So thank you again for all the time and energy you're putting into your work at Narrator and sharing that with us, and I hope you enjoy the rest of your day. Thank you so much, Tobias. I loved it. Thanks for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction to the Guest and Narrator
Ahmed's Journey into Data
Challenges at WeWork and the Birth of Narrator
Activity Schema vs. Traditional Data Models
Narrator's Platform and Features
Implementing Narrator in Existing Data Infrastructures
Performance and Governance in Data Warehousing
Data Cleanup and Modeling Challenges
Operational Analytics and Activity Schema
Lessons Learned and Customer Use Cases
Future Plans for Narrator
Closing Remarks