Summary
The primary application of data has moved beyond analytics. With the broader audience comes the need to present data in a more approachable format, which has led to the broad adoption of data products as the delivery mechanism for information. In this episode Ranjith Raghunath shares his thoughts on how to build a strategy for the development, delivery, and evolution of data products.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack
- You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free!
- As more people start using AI for projects, two things are clear: It’s a rapidly advancing field, but it’s tough to navigate. How can you get the best results for your use case? Instead of being subjected to a bunch of buzzword bingo, hear directly from pioneers in the developer and data science space on how they use graph tech to build AI-powered apps. Attend the dev and ML talks at NODES 2023, a free online conference on October 26 featuring some of the brightest minds in tech. Check out the agenda and register today at Neo4j.com/NODES.
- This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold
- Your host is Tobias Macey and today I'm interviewing Ranjith Raghunath about tactical elements of a data product strategy
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what is encompassed by the idea of a data product strategy?
- Which roles in an organization need to be involved in the planning and implementation of that strategy?
- order of operations:
- strategy -> platform design -> implementation/adoption
- platform implementation -> product strategy -> interface development
- managing grain of data in products
- team organization to support product development/deployment
- customer communications - what questions to ask? requirements gathering, helping to understand "the art of the possible"
- What are the most interesting, innovative, or unexpected ways that you have seen organizations approach data product strategies?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on defining and implementing data product strategies?
- When is a data product strategy overkill?
- What are some additional resources that you recommend for listeners to direct their thinking and learning about data product strategy?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- Neo4J: ![NODES Conference Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/PKCipYsh.png) NODES 2023 is a free online conference focused on graph-driven innovations with content for all skill levels. Its 24 hours are packed with 90 interactive technical sessions from top developers and data scientists across the world covering a broad range of topics and use cases. The event tracks:
  - Intelligent Applications: APIs, Libraries, and Frameworks – Tools and best practices for creating graph-powered applications and APIs with any software stack and programming language, including Java, Python, and JavaScript
  - Machine Learning and AI – How graph technology provides context for your data and enhances the accuracy of your AI and ML projects (e.g.: graph neural networks, responsible AI)
  - Visualization: Tools, Techniques, and Best Practices – Techniques and tools for exploring hidden and unknown patterns in your data and presenting complex relationships (knowledge graphs, ethical data practices, and data representation)

  Don’t miss your chance to hear about the latest graph-powered implementations and best practices for free on October 26 at NODES 2023. Go to [Neo4j.com/NODES](https://Neo4j.com/NODES) today to see the full agenda and register!
- Rudderstack: ![Rudderstack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png) Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at [dataengineeringpodcast.com/rudderstack](https://www.dataengineeringpodcast.com/rudderstack)
- Materialize: ![Materialize](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/NuMEahiy.png) You shouldn't have to throw away the database to build with fast-changing data. Keep the familiar SQL, keep the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. That is Materialize, the only true SQL streaming database built from the ground up to meet the needs of modern data products: Fresh, Correct, Scalable — all in a familiar SQL UI. Built on Timely Dataflow and Differential Dataflow, open source frameworks created by cofounder Frank McSherry at Microsoft Research, Materialize is trusted by data and engineering teams at Ramp, Pluralsight, Onward and more to build real-time data products without the cost, complexity, and development time of stream processing. Go to [materialize.com](https://materialize.com/register/?utm_source=depodcast&utm_medium=paid&utm_campaign=early-access) today and get 2 weeks free!
- Datafold: ![Datafold](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/zm6x2tFu.png) This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting [dataengineeringpodcast.com/datafold](https://www.dataengineeringpodcast.com/datafold) today!
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack. You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up to date.
With Materialize, you can. It's the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it's real-time dashboarding and analytics, personalization and segmentation, or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results, all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free. Your host is Tobias Macey, and today I'm interviewing Ranjith Raghunath about tactical elements of a data product strategy. So, Ranjith, can you start by introducing yourself?
[00:01:32] Unknown:
Absolutely. Firstly, Tobias, thanks for the opportunity to have me, as a delegate of CX Data Labs, on your podcast. Big fan, so thank you. My name is Ranjith Raghunath. I'm a managing principal over at a company called CX Data Labs. We're a data and analytics strategy and implementation services company, and we focus on optimizing customer experiences in the retail, life sciences, and financial services verticals, using data engineering and data platforms as our core set of picks and shovels, effectively, to tie these systems together so that businesses can see a holistic view of the customer and then action on it. Some of the things that they do as a result of the work that we've done are increase their ability to personalize the content that they present, better understand their marketing spend in terms of what resonates well with customer acquisition costs, or simply optimize wait times as people call into a call center. So those are some of the examples. And for me, personally, this has been a long time coming. I've been in the data and analytics field for roughly 17 years. I've done nothing but various forms of engineering, software and data, all in the vicinity of producing either data solutions or data products, and I'm just an overall geek and nerd when it comes to data. And do you remember how you first got started working in data? Yeah, I do. I was an intern over at a company called USAA, and they were working on a billback model.
And the core problem that they were trying to solve was they had a set of infrastructure that they wanted to go through and bill back all the way to the business teams utilizing those applications, so that they were getting value from it. One of my tasks was to come in and help the team go through and provide this costing model. As I came in, they were using Excel and Access to do some of these computations. I kinda looked at it and said, hey, what if we start writing data pipelines to do this? Which, I didn't know they were called data pipelines. By the way, I was an electrical engineering graduate coming in as an intern. All I knew is, well, maybe we can optimize it and do it differently. Then I soon got introduced to dimensional modeling and said facts and dimensions are how you can do that. Well, why are we sending these reports to them? Can we bring them over and have them take a look at it themselves, self-service reporting with business intelligence? So a lot of it, I didn't have the names for, per se, but that's how I started cutting my teeth into it and just started navigating it, all to optimize and lower the ratio of work done to value delivered. Yeah. It's amazing how much in the technology industry in particular, but probably any industry, really, that if you don't know the right terms that people are using, then you just end up rebuilding it yourself because you didn't know that it was already done. 100%. 100%.
And a lot of it, also, is the thing that I've always loved about data and data analytics: it's an objective way to make decisions. It also provides some eye-opening opportunities when you put things in front of people and say, hey, these are the observations. Right? I mean, we could debate how we use it and what that means in the context of the scenario, but this is what we're observing. And oftentimes in the real world, if you contrast that experience, we both could be seeing the same events, but we could be interpreting them very differently. In the data and analytics world, we have observations, and we can see that. And then we have inferences that we can draw on it, but we have that dissected framework to lean on. So that's another thing that has always motivated me to be a disciple of the field, so to speak.
[00:05:28] Unknown:
And to that point of shared definitions, shared vocabulary, before we get too far into the conversation at hand today of data product strategies, let's just start by identifying a shared understanding of what we mean when we say data products and how those might differ from data assets like a dashboard or a table or a report that gets delivered quarterly. And what is necessary?
[00:05:53] Unknown:
What are the surrounding attributes for a piece of data or a grouping of data for it to qualify as a product? Sure, sure. I mean, there are probably multiple definitions around it, so I'm gonna give you my rendition of what that means. The disclaimer here is that it is not the definition, but it's a perspective and a point of view. When I think about a product, a product services a need that a customer has, or that a certain segment of people who fit into a persona have. So you have a need, and then there is an outlay for that need to get serviced. And the way that it gets serviced is through a set of features. And then you have different sets of products that help you get to the end job that you may have for a particular experience that you wanna deliver. Okay, all of that sounds very nebulous, but what does that mean in the context of data? You used the word data asset. For me, an asset means something that you can harness value from and log transactions against. So there's a cost, and then there's revenue that comes in. A data asset, as you talked about, is a type of data product. A dashboard is a type of data product. A model is also a type of data product. These are interfaces that you have customers use to harness value and also assimilate costs across. Simply put, for me, a data product is thinking about the customer and the way that they use data and its attributes to make decisions, and cataloging that into a set of features that you can have teams put together and deliver, but that can also have long-lived roadmaps that you can nurture and grow over time. So what does that mean? Let's say that we're gonna build a Customer 360 data asset. We're gonna break that down and try to identify the different features that we would need to correctly depict what a customer would be, and we would think about it as a product. A product has a life cycle. It has release notes. It has releases. It has a team that's long-lived that goes through and produces this.
We also take care of regressions. We also take care of things that we may need to deprecate over time. We may think of features that add on to these different modules. And so: thinking about that Customer 360 data asset as a product and then putting together a release roadmap that says these are the features that are coming out in this quarter. What's gonna come first, identifying a technology enthusiast versus an adventure enthusiast? Those could be multiple variables that come in, and they could be spread across two different releases. So I look at the concept of a data product as one that builds on top of itself, and that really thinks about the customer and the way they use it and how they use it, and provides them with interfaces so that it's easier for them to use. The last thing that I'll say before I hand it back over to you is the usage modality. It could be that we would like to give a customer ID and get to know whether this person is a technology enthusiast or not, and the best way to do that may be consuming that dataset through a RESTful interface that has a certain set of specifications in the form of a contract, so that I can go through and enable real-time decision making. Great. That's an interface into an asset that we have, culminating in a product that we go through and sell. There could be other interfaces to it, which is, hey, I wanna consume all of these records in batch and then make decisions and drive from there. Well, that's another interface, again, into the same asset. So it's just breaking this concept of data usage and what it means into the set of constructs that we just talked about.
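As an illustrative aside, here is a minimal Python sketch of that "one asset, multiple interfaces" idea: the same hypothetical Customer 360 lookup exposed both as a RESTful channel for real-time decisioning and as a batch CSV export. All names and records are invented for the example, not taken from the conversation.

```python
# Hypothetical sketch only: one Customer 360 asset, two interfaces.
import csv
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# The underlying data asset: one table keyed by customer ID (invented data).
CUSTOMER_360 = {
    "cust-123": {"is_technology_enthusiast": True, "segment": "early-adopter"},
    "cust-456": {"is_technology_enthusiast": False, "segment": "value-shopper"},
}

class Customer360Handler(BaseHTTPRequestHandler):
    """Real-time channel: GET /customers/<id> answers 'is this person a
    technology enthusiast?' under a small response contract."""
    def do_GET(self):
        record = CUSTOMER_360.get(self.path.rsplit("/", 1)[-1])
        self.send_response(200 if record else 404)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(json.dumps(record or {"error": "not found"}).encode())

def export_batch(path: str) -> None:
    """Batch channel: the same asset dumped to CSV for bulk consumers."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["customer_id", "is_technology_enthusiast", "segment"])
        for cid, rec in CUSTOMER_360.items():
            writer.writerow([cid, rec["is_technology_enthusiast"], rec["segment"]])

if __name__ == "__main__":
    export_batch("customer_360.csv")                  # batch interface
    HTTPServer(("", 8080), Customer360Handler).serve_forever()  # real-time interface
```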
[00:09:45] Unknown:
And so, with that shared definition of what it means to have a data product,
[00:09:51] Unknown:
what are the pieces that we need to strategize about? Why do we need a strategy? What purpose does that strategy serve, and how does that inform the work to be done? Very good question, because this is something that I think about quite a bit. We talked about different types of data products. We talked about type model. We talked about type dashboard. We talked about type data asset. You can have further categories, like information asset, as well. And so if you were to condense these into patterns, then you start taking a look at a value chain that comes from a set of activities that, when done in unison, produce an artifact. Right?
That being a data asset. Okay. So the thinking here is you have patterns, you have a set of activities that line up to those patterns, and you have a set of artifacts that are produced as a result. Effectively, what we think of when we say a data product strategy is the formulation of that, so that we can go through and industrialize its production. So that from inception all the way to industrialization, you can utilize this op model, so to speak, to produce this artifact in a very streamlined way. Okay, so what does that mean? Let's just say it's a data asset that you're producing, and let's go back to the other example of Customer 360. In order for you to source that information, you may need to go to a CRM system, and that CRM system, let's just say, is Salesforce. The ingestion pattern associated with ingesting data from Salesforce does not need to be recreated again and again for sourcing, let's say, from one entity such as Contact or another entity such as Account. You define the pattern of the ingestion once, and then you go through and leverage that highway, so to speak, for different objects as they come along. And so you slowly start condensing those sets of patterns into broader capabilities, and then you free up the development cycle for producing these products and hone in on those capabilities. Effectively, what you end up doing is you make the marginal cost of producing the next product lower and lower, and you harden that set of capabilities. That whole piece of the puzzle is why we produce and develop a strategy, and that's why you need one. Otherwise, what ends up happening if you don't have one is that the cost of producing a product stays the same, or gets more expensive, again and again and again. And so what you wanna do is you want that cost curve to come down. So, hopefully, that answered the question.
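A rough sketch of that "define the pattern once, reuse the highway" idea follows; it assumes the simple_salesforce client as one way to query Salesforce, and `ingest_object` is an invented name for the pattern, not anything from the episode.

```python
# Sketch under stated assumptions: simple_salesforce is one possible client;
# the function and metadata fields are hypothetical.
from datetime import datetime, timezone
from simple_salesforce import Salesforce

def ingest_object(sf: Salesforce, object_name: str, fields: list[str]) -> list[dict]:
    """The ingestion 'highway': identical steps for Contact, Account, or
    any other object that comes along later."""
    soql = f"SELECT {', '.join(fields)} FROM {object_name}"
    records = sf.query_all(soql)["records"]
    for rec in records:
        rec.pop("attributes", None)                 # drop the API envelope
        rec["_ingested_at"] = datetime.now(timezone.utc).isoformat()
        rec["_source_object"] = object_name         # simple lineage marker
    return records

# Each new entity reuses the pattern instead of rebuilding the pipeline:
# sf = Salesforce(username=..., password=..., security_token=...)
# contacts = ingest_object(sf, "Contact", ["Id", "Email", "AccountId"])
# accounts = ingest_object(sf, "Account", ["Id", "Name", "Industry"])
```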
[00:12:27] Unknown:
Yeah. And for those elements of defining or establishing what the strategy should be, who is responsible for guiding that process, who are the people that need to be involved from a roles and personas perspective, and what are the things that might trigger the development of a given strategy?
[00:12:45] Unknown:
Yeah. So I think you always gotta start off with the consumers of analytics in mind. They're very important stakeholders. These are the folks who consume the analytics being produced and action on it. Think about somebody in the office of the CEO. Think about a chief credit risk officer who takes a look at the analytics being produced and says, hey, this is my overall risk for my portfolio within the sector that I manage; this is how I can curtail my bookends with respect to certain hedges that I'm performing. That's a cohort. That's a segment of the population that provides you with: here's what I'm gonna do with the analytics that you provide me, this is what I decision on and action on, and, oh, by the way, this is why I need what I need. And that, typically, for us is use cases. Right? And they come from our stakeholders.
And the stakeholders closer to the business go through and drive that out. Those needs, and that level of dialogue that goes on, that says: where in this business process do you actually embed this level of analytics? How do you use it? What is the time taken for you to go through and provide it? Is there any sensitivity to the information being provided? All of those kinds of questions and answers that need to bring the use case to life are, in our world, brought together through the lens of a data product manager. And in some organizations, that could be further bolstered with a product owner that's a little bit more tactical, taking those needs and helping them see the technical definitions around them, or it's fully owned by the product manager themselves.
And then what we have is a set of software and data engineering managers who go through and break this down in terms of, hey, here's what the needs mean in the way of thinking about nonfunctionals. How do these come into play? That's where we really see the software engineering managers and data engineering managers come in. They gotta hear the functional needs and then start saying, well, here are the nonfunctionals, and this is what we need to do. Okay, well, we're gonna produce it in this way. We should have some logging measures put into place. We need to have some telemetry. We need to have some monitoring. And then they also take all of the needs being articulated, put them into functional requirements, and start breaking them down. The breaking-down part really is where we see TPMs or scrum masters, or whatever you wanna call them, but effectively folks who can take a set of functional requirements and a set of nonfunctional requirements and devise them into a plan of action that the team can execute on. And then you have a set of developers. Right?
Now, they fit into multiple different brackets. It could be platform engineers. It could be data engineers. It could be software engineers. But they all listen to these needs that have been dissected into stories, and they start saying, okay, if we do this, then we can achieve this. Do you agree with this? And that whole negotiation back and forth happens internally to the team, and then also with the product manager, and then it's ultimately signed off by the software development manager or the data engineering manager. And then it gets formulated into a set of release artifacts that we go through and produce and provide out, that ultimately get embedded into the business workflow. Now, all of this is going to be useless if we don't have a really good business enablement, customer-success-driven viewpoint in which we're doing change management, both on the technology side and on the business side, which is: now you're gonna get this new analytical component. How are you gonna use it? So, for example, let's say there's a propensity score for failure to pay back a loan. How are you gonna use it when you make the loan origination decision? When should you pull the lever, right, to say this model doesn't make sense, these answers don't make sense?
And, oh, by the way, how do you tune it? Where do we monitor that, and how do you make decisions on it? So it's a combination, right, of different items coming together, along with different roles, and they encompass everything up through change managers. Sometimes those are played by the product manager, and sometimes by analytical translators or business analysts directly in the business. It just really depends on the company that you're in and the role that they play. But those are the left-to-right side of the equation, what it would look like to produce a data product and the different activities that would go into it.
[00:17:21] Unknown:
This episode is brought to you by Datafold, a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs into your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration.
Learn more about Datafold by visiting dataengineeringpodcast.com/datafold today. And regardless of whether you actively engage in defining and implementing a particular strategy, there's always going to be a strategy. It's just a matter of whether you are explicit and purposeful about it or whether it is something that is emergent. And I'm curious what you have seen as some of the juxtaposition of teams that are very deliberate about the definition and execution of a product strategy for their data assets versus teams that just take a "this is what we're doing, it'll just emerge, and we'll figure it out as we go" kind of approach. Yeah. I think it's a good question.
[00:18:40] Unknown:
I think when you have very product-centric teams that are exclusively focused on enabling, let's say, a product that they're releasing, and they have analytics as a tie-in to that product, I see them latching on to the product strategy itself, and analytics and data don't quite fall by the wayside, but they're secondary actors within that overall equation. Which means what? You typically have one business intelligence engineer and one data engineer within the group, and their entire purpose is to help the product manager rationalize decisions based on funding, the customer decisioning journey, churn, ARR, whatever it may be. Whatever it is that they wanna do, the flavor of the day is what they work on. So in that case, they're not really coming up for air as much and thinking, hey, here are the 14 or 15 different questions that I get asked. Here's how I can start laying out the foundation so that I don't need to do the same amount of work to answer those 14 different questions.
Let me start formulating and sourcing data that will create these core entities that I can then use to mash up. And, oh, by the way, let me build a dashboard on top of it so that the product manager can do it themselves. Right? So that side of the equation is where I see less of that, and it's more where the product is the center and the product manager is driving all of the work and centering it on the product itself. So I don't see a strategy that coherently describes anything in scenarios like that. Where I've seen companies really dive into data product strategy is where there is a focus on building data products, but there's an aspect of doing it in a centralized fashion. It's not that everything is centralized, but it could be that a core set of the infrastructure is centralized, and a core set of the assets being brought in is centralized.
So what does that mean? Let's go back to the example that I gave about Salesforce. It's a CRM system, and it's got a ton of assets within it. What does that mean? It's got Contact. It's got Account. It's got Leads, leads that are being qualified. There are sales. There's tons of information there. Do we want every single team to go through and source that information again and again and again? Probably not. I mean, if you think about it from an ingress and egress perspective, it doesn't make sense.
So you have a set of teams that go through and say, hey, we're gonna model how to use Contact and Account and the relationships between them, and we're gonna manifest it in a place that makes it easy for teams to go through and consume and source it. Well, when they do that, they're effectively centralizing that capability of data ingest and data rationalization. So when you're a consumption-driven team coming through, you have to learn that mnemonic, and you go forward. So in cases like that, where people are looking for efficiency gains through central harmonization of data, I see those kinds of companies do data product strategy more and more. And then the tier that sits in the middle is where they don't necessarily agree on the centralization or the decentralization, but they agree a ton on the ways of working and the standardization of the ways of working. So if you think about, hey, what does continuous integration look like in the context of data engineering? What does continuous deployment look like, and what does that mean? And so you have teams that are really software focused but are trying to enable data products.
They hedge on the ways of working. They say, hey, let's have a repo structure that's conducive to working on data engineering efforts. So they go through and drive out a set of standardization there, and that's hedged on what a data product is and how we enable it. So those are the three kinds of verticals that I see as I've scavenged the field. You mentioned the different roles and responsibilities
[00:22:55] Unknown:
throughout the process of designing and implementing a product strategy. But, of course, not every company is going to have the same sets of people, the same titles, or even a given title might not even exist across separate people. And I'm curious how you have seen the size and structure of different teams both within and adjacent to the data capabilities influence the ways that people approach the concept of how to strategize, what the scope of a given product looks like, etcetera?
[00:23:28] Unknown:
Yeah. And I think this really depends on the vertical that you belong to and the importance given to data within that vertical. So if you take a look at the insurance business, they've been using data to make decisions for a very, very long time. Their maturity around data management, and the need for it, is super high, because when you make rate changes on insurance policies, you need to file with the state. So, guess what, you're automatically thinking about data retention. You're automatically thinking about the governance around the data that's used to make those decisions. You're automatically thinking about the fact that you have a deadline to submit these things, and you have SLAs in place. Okay, well, they need to be done in a particular order so that it can go through and be deployed.
If you take an actuarial scientist, there's a particular way that they go through and do their business. So there's a certain hygiene around the way that they think about processing the data so that they can then answer the questions that they seek. So the industry vertical that you're in, and the importance that data has within it, will dictate how much you think about data and how much you think about all of the things that come with it. And so in the insurance example that I gave you, they inherently have a strategy, but it's embedded in the way that the vertical exists.
So you may not need a business analyst. You could have an actuarial scientist, or maybe a junior one who functions as one. And they write documents, they write requirements, in the sense of depicting the process flow of where things need to happen or not. So that's one example. Another example: you could have, and they may need to have, separation of roles, because they are in a properly regulated business where the person doing the math cannot be the one that checks the math. And, therefore, the way that they've written these rules says we need to physically have them be separate, so as to guarantee a level of quality. It's that way in the life sciences business, for example. There's a large focus on QC, because imagine getting a drug that hasn't been QC'd as much as it should have been.
So there's a certain set of operating protocols and procedures that have been grounded in mitigating risk and increasing quality, and that's led to an op model where you have different people doing different sets of roles, and that dictates the way that the industry operates and drives. So that's another vector: the industry drives the kinds of roles based on the way that they segment things. The third is where data is used as an enabler, but the cost of getting it wrong is not that high, and you need it for directional correctness rather than exactness. Right?
So in the example of life sciences, or even in cyber, in security, you cannot get things right on average. That doesn't fly in those verticals. You have to get it right every single time, versus in retail, for example. You're less likely to give your address to a person who says, hey, can I get your address, right at the checkout desk, versus when you're within a financial services institution and they ask you what your address is so they can send you statements. I guarantee you, one has a higher likelihood of you giving the most accurate information compared to the other. So you get a ton of garbage in, in some of these verticals, for example, in retail.
So then you start saying, okay, you're gonna have to formulate a ton of rules to get it right. And so you start to say, well, there isn't a lot of definition here. There aren't a lot of criteria that we put on the docket. We just need to do a ton of iterations and go through and get the answer right. So what I've seen in industries like that is you don't have a ton of roles. You have one developer, one solution, that goes deep. They may do the business analysis reporting. They may do the data engineering. They may also help the business in doing the governance itself by flagging elements that are out of sync or out of place. So, this is a very long-winded way to say, Tobias, that depending on the industry vertical that you're in, the place that data has in the relevance of the decision-making process, and the kind of inputs that they get, high quality versus not, the cost of getting it wrong versus being directionally correct, all of that defines the number of roles that are being put in place. So it's almost like a spectrum.
Right? So that's one way to take a look at it. And now, what is the commonality that you see regardless of the industry that you're in? I think it comes down to the artifacts. Regardless of how many roles you have and which industry you're in, a technique that I've seen work well, at least from my perspective, when you formulate these kinds of strategies, is a set of interviews with stakeholders. These are people, typically, who are making decisions using the analytics that you're providing. Taking that set of use cases, boiling it up into a set of capabilities that need to be invested in, dissecting that into programs, which have projects within them that then get executed on, that then tie into metrics that say: this is why we did what we did, and this is the value that we're gonna get. And, oh, by the way, through that process, these are the datasets that we're governing, and this is how we're governing them, without maybe using those words, is what I've seen work well, and it also disarms organizations.
Because many times, when you go in and you say, hey, the outcome of a data product strategy is a team of, like, 50 people that we need to build up, in an organization that doesn't have data within the decision-making nomenclature, that's gonna be a tough one to stomach. But even in one that is driven in that way, 50 is a large number. They're gonna buck anyway. And they're gonna say, well, we're getting efficiency with one person; why would we need to do it differently? So I think just focusing on the artifacts, and really thinking about how you take these use cases and hydrate them up into a set of portfolios and programs that you can execute on, wrapped around with metrics and governance, is the way to go, regardless of who does it.
[00:29:58] Unknown:
The other interesting element of data products is the audience, because data has so many different potential stakeholders and consumers, and that will drastically influence the overall user experience you're trying to optimize for, because the core element of something being a product is that it is consumable out of the box, versus just "here's some data, good luck, you can pull it from this S3 bucket if you want." As a product, you go to, like, a Netflix, that's a product. You go to Amazon, that's a product for ecommerce. If I'm an average person who is just trying to answer a question and you give me a CSV file, what is the CSV file gonna do for me? But if I have a search box where I can type a question, and you're using that underlying data to give an answer, then that's a better experience. Whereas if I'm a data engineer and you give me a CSV and a little bit of documentation, I know what to do with it. And so I'm curious, what are some of the useful questions that teams need to be asking in the development of that product strategy that will inform the implementation details and the types of technologies that they need to bring to bear on the solution?
[00:31:11] Unknown:
Yeah. Thank you for highlighting the importance that interfaces play in the role of a data product. So, one of the things you mentioned in the examples you gave about Netflix and Amazon and everything else: let's just take the example of Amazon. You come in, you search for a product, and you buy it. But why did you search for that product? You had a need. Let's say you're buying a household cleaner. You go in there and you're trying to search for something because you wanna clean your house.
Right? And you wanna do it in a self-sufficient way: you wanna buy a product, put it on the floor, and get the job done. Okay, well, there are different choices that you have. But the point I'm trying to make is that whole concept of Amazon and search is in the context of a need that the customer has, in the life cycle that they're in, that this fits into. And so what does that mean in the context of data products? As we start collecting these use cases, a big thing that we do, and emphasize, is: how are they gonna be used, how often are they gonna be used, and in what context are they gonna be used?
Right? So, for example, someone says, I need this information. Let's say that we produce a propensity score for a person's likelihood to default or not on a loan. "I need it within 5 seconds, and every 5 seconds it needs to be refreshed." My question always is: let's say that I give it to you in 4 seconds. What are you gonna do with the 1 second that you save? Let's say I do that. What are you gonna do in the next 5 seconds before the data comes in? And that leads to very interesting conversations, because what you're effectively trying to unravel is what's next. Like, what do you do next? In the context of the Amazon example that you gave: okay, I take it, the spray comes home, and then I clean my table with it.
Okay, well, that's good. Then what do you do? Well, then I store it. Right? Okay. Well, in the context of data products, in the context of the example that I gave you: I'll take that output that you provide, and then I make a decision on it. Well, what do you do with that decision? Well, very quickly, in the flow of the application, the loan origination application, the customer is gonna be able to see if they got a yes or no on the loan that they were asking for, because I take this variable and I weight it by 70%, because I heavily weight this to say, you know, this yes or no kinda determines if they get the loan or not. Oh, wow. Okay, that's interesting. So now you start walking backwards from there, and you start saying, okay, a CSV file probably won't scale for that. How are you gonna do a reach-in for this? Well, hey, within the application flow that we have, we typically use RESTful interfaces for everything that we go through and drive out. Okay, alright, so now you start saying, well, now we need to start using APIs. They need to be discoverable. But what kind of validation do you do on this to make sure that it isn't something that's so wild? What happens as a result of that? As you start having these conversations with your customer about the way they're gonna be using the analytics, you start formulating the interfaces, or channels, that they're gonna be using to consume the intelligence that you're providing, whether it's core data or insights or information, knowledge, you name it.
That's what it is. Right? And so that starts to shape the way that you provide these interfaces. And the same dataset or information asset or data asset, these different types of data products, could have multiple channels. For example, in the context of the persona that you gave, of a data engineer, they could want that dataset through an S3 interface, something that they can consume in batch and then do some reconciliation on. So the way that the consumer utilizes it in the context of their decision making will dictate the interfaces, and those interfaces are what we build. That then says: here's the product that we're building, and here's how we deliver it to you so that you can consume it. What are your consumption patterns? And you gotta keep that front and center as you're walking it through, because the last-mile optimization is driven off of those items. And then there are some interesting nuances as well. In the last-mile consumption piece, you're less worried about duplication. You're less worried about, oh my god, am I copying this data in 14 different ways? You're more worried about whether the interface is optimal for the consumption, versus optimal for storage and distribution.
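Sketching that walk-backwards exercise in code, here is a hedged example of what the resulting contract and validation might look like; the field names, bounds, and 70% weighting echo the hypothetical loan example above rather than any real implementation.

```python
# Hypothetical contract for the propensity-score example above.
from dataclasses import dataclass

@dataclass(frozen=True)
class PropensityResponse:
    """What the loan origination flow codes against."""
    customer_id: str
    default_propensity: float   # 0.0 (safe) .. 1.0 (certain default)
    model_version: str

def score(customer_id: str, raw_score: float) -> PropensityResponse:
    # Validation: reject anything "so wild" it shouldn't be decisioned on.
    if not customer_id:
        raise ValueError("customer_id is required by the contract")
    if not 0.0 <= raw_score <= 1.0:
        raise ValueError(f"propensity {raw_score} breaches the contract bounds")
    return PropensityResponse(customer_id, raw_score, model_version="v1")

# The consuming flow then weights this output, per the 70% example:
# decision = 0.7 * score("cust-123", 0.12).default_propensity + 0.3 * other_signal
```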
[00:35:57] Unknown:
As more people start using AI for projects, two things are clear. It's a rapidly advancing field, and it's tough to navigate. How can you get the best results for your use case? Instead of being subjected to a bunch of buzzword bingo, hear directly from pioneers in the developer and data science space on how they use graph tech to build AI-powered apps. Attend the dev and ML talks at NODES 2023, a free online conference on October 26th featuring some of the brightest minds in tech. Check out the agenda and register today at neo4j.com/nodes.
That's n-e-o, the number 4, j.com/nodes. I'm curious how technical debt factors into the overall process of development and consideration around what the strategy is and how to approach implementation, both in terms of: I already have this existing technical debt, and so that constrains the available set of capabilities that I have, or it will extend the delivery timeline. But also: this is the strategy that I want to implement, this is the timeline I am committing to, so now I need to consciously take on this additional technical debt. I'm just curious how that plays out in the overall process.
[00:37:15] Unknown:
It's a good question. And I say that only because we all accumulate technical debt, and I haven't quite seen it done well out in the wild to the extent that I would like, including when I used to be on the other side of the fence, leading data teams in corporations, both in tech and non-tech. So here's one of the ways that I've seen get close to doing it well. It's really negotiating a percentage of your execution backlog to be dedicated to technical debt, which the engineering team has accountability for prioritizing, so that the overall cost of delivery comes down. So what does that mean? In the backlog, you could have net new features, bug fixes, and technical debt all commingled. What we've seen work well is you take about 30% of that backlog and you say, hey, we're gonna dedicate this to technical debt, and we're gonna give the accountability to the data engineering managers or the software engineering managers to go through and drive it. They prioritize it. They put it on there so that you go through and see it. You go through and move it accordingly.
The product manager should be able to see whether they're doing a good job of it or not by tracking the overall cost of production and the operating and maintenance costs associated with the product itself. So with a lower amount of tech debt, I think you can see it in a couple of different ways. One is your O&M costs are gonna go lower. And, typically, O&M costs, operating and maintenance costs, are roughly at the 50% mark. So if you can bring that down by 20 points, even into 30, or, if you're super optimal, into the 15% range, that's a good indicator that you're resolving your tech debt as much as you can. The other nascent thing to look at is attrition. When you have a really poorly built product, it's gonna be tough for you to retain people on the operating and maintenance side of the equation to go through and drive that out. So that's one other side of the equation. The other thing that I've seen work really well on tech debt: when people provide these strategies, or even these patterns, and drive them out, at that point in time they were probably the best and the greatest. But over time, like anything else, everything deteriorates. Technology is moving at a faster clip. So having a dedicated time during your execution mechanism, like one sprint out of seven in a classic PI-type setting with agile, sorry, SAFe agile, could be one where you take a step back and you allow the practitioners on the floor, who are driving stuff forward, who are actually the ones closest to the pain, to say, hey, there are these new ways of building things. Can we go through and try and implement them and see where they go?
And that kind of raises the bar in terms of making sure that not just tech debt stays in check, but that you're innovating. In some cases, what I've seen is teams completely stop delivery of net new features and say, you know what, the way that we're gonna resolve this is we're gonna take care of all the bugs. So we're gonna have something called a bug bash and take that completely down. Maybe they do a month-long effort there, and then they go through and throttle their backlog, so to speak, to make sure that they can get back in line. So those are different ways that I've seen teams go through and manage this concept of tech debt. And then the last thing that I'll mention is the concept that I talked about: use cases being hydrated up into a set of patterns, and those patterns going into capabilities.
It's really important to go through and score those capabilities on a yearly basis to say: how well are we doing? And sanitize that. That's another way to measure architecture as well. And that, I have yet to see teams do a good job of, because they just don't think of architecture in terms of scoring the capabilities. Someone writes a blueprint, it's super high level, somebody goes through and implements it, and we never score those capabilities. Like, for example: how are our data ingestion capabilities? Is it a 9 out of 10? Why is it a 9 out of 10? Well, guess what, folks, we can't ingest TSVs. Okay, how important is that? Do we have any use cases that go through that? Well, yeah, we do. We have 15 out of 20 use cases that are doing that. Okay, well, how much time are we spending as a result of that? Well, our sprint points are x for these kinds of things. That kind of telemetry, walking that backwards and then saying, hey, this is how we score architecture.
I haven't seen that much in the wild, if at all. But I think that's another way to score your architecture, based on the capabilities that you've driven out, and to make sure that these tech debt items get brought to the surface.
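As one possible shape for that kind of capability scorecard, here is a small sketch; the capability names, scores, use-case counts, and sprint points are invented for illustration.

```python
# Invented figures throughout; the shape is the point, not the numbers.
from dataclasses import dataclass, field

@dataclass
class CapabilityScore:
    name: str
    score: int                      # 1..10, revisited yearly
    rationale: str
    affected_use_cases: int = 0
    sprint_points_lost: int = 0
    debt_items: list[str] = field(default_factory=list)

scorecard = [
    CapabilityScore(
        name="data ingestion",
        score=9,
        rationale="cannot ingest TSVs without manual preprocessing",
        affected_use_cases=15,      # e.g. 15 of 20 use cases touch TSVs
        sprint_points_lost=8,
        debt_items=["add TSV support to the ingestion highway"],
    ),
]

# Walking the telemetry backwards: debt that blocks many use cases and burns
# sprint points is a candidate for the ~30% tech-debt slice of the backlog.
for cap in sorted(scorecard, key=lambda c: c.sprint_points_lost, reverse=True):
    print(f"{cap.name}: {cap.score}/10 ({cap.rationale}; "
          f"{cap.affected_use_cases} use cases, {cap.sprint_points_lost} pts)")
```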
[00:42:02] Unknown:
And circling back on the interface of the products, there's also the question of customer education: how much context and how much familiarity do they need to have with the data, and with the statistical aspects of that data, in order to be able to use it to effectively make decisions? Or, is the understanding that they're reaching actually accurate given their background? And I'm wondering how you've seen teams try to approach that element of delivering the data product, delivering the guardrails or surrounding capabilities, so that the end user is able to actually make effective use of that product without having to have somebody sitting beside them saying: this is what you need to know, these are the steps to actually use this thing, these are the other things that you need to do after the fact, etcetera.
[00:42:55] Unknown:
Good question. I'll start off with a story, and I think all of us will be very familiar with this one. "That number is incorrect." "Why is that number incorrect?" "Because the person did the roll-up in the wrong way." "Well, it was obvious that the column was there, so I ended up rolling it up." "Well, what you didn't do is apply a filter. You had to apply a filter on this other column and then do the aggregation, and then you'd get the right number. Because, effectively, what you've done right now is made it 10x the number that it is." These kinds of stories, and I've genericized it, are pervasive. All of us have heard them. So how did that come to fruition? People think that just because you have the data, you can just give it out, without knowing the persona group that the person belongs to and how the consumption experience has been defined for that persona. You'll often hear people say, hey, just give me access to the data, I'll figure it out.
And oftentimes, you end up with stories like this. So what I've seen done well, and something that we both practice and preach, is that the interface that sits on top of the data needs to walk backwards from the set of questions that we're trying to answer. What are the kinds of roll-ups that we're trying to do? What do we need to do to make sure that we put a definition around the roll-ups so that they're relevant? What are the filter conditions that are relevant for those roll-ups versus not? And, in this particular instance, I'm talking strictly about dashboards. You have those items outlined so that when people come through and consume this, the number of toggles or inputs that you can use to get an outcome is limited. That level of metering is super, super important.
Now, on the other aspect, of educating the user about the data and what it means: what I've seen, specifically in the modeling arena, is boundary conditions and self-throttling even before you get the results out, to say, hey, this breaches our out-of-bounds conditions, and, therefore, this needs a second round of review. What I have seen work the worst is a ton of very detailed documents spanning multiple pages that explain exactly what that is, or, in fact, even a user session, where every time we onboard someone, I sit with you and walk you through what that means. That's another thing that I don't see work very well. So our preference, and what we typically like to do, is a set of tests that are run to make sure that the data that you're actually consuming is accurate, of high quality, and of high integrity. And then, on the consumption side, really limiting the inputs to the outputs. Like, if there's a country where they don't use ZIP codes, or they use another form of ZIP code, then don't show that option. Just limiting it considerably and then lining that up to the questions that you're asking.
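A minimal sketch of those two guardrails, an integrity test on the roll-up and input limiting on the consumption side; the column names and filter rules are hypothetical.

```python
# Hypothetical column names and filter rules throughout.
def test_revenue_rollup(rows: list[dict]) -> None:
    """Integrity test: the roll-up must apply the required filter, or the
    number comes out inflated, as in the 10x story above."""
    filtered = [r for r in rows if r["record_type"] == "booked"]
    total = sum(r["amount"] for r in filtered)
    unfiltered = sum(r["amount"] for r in rows)
    # Assumes non-negative amounts: a correctly filtered roll-up can never
    # exceed the raw sum.
    assert total <= unfiltered, "filtered roll-up exceeds the raw sum"

def allowed_filters(country: str) -> list[str]:
    """Limit the inputs to the questions being asked: no ZIP code filter
    where ZIP codes don't apply."""
    base = ["region", "segment", "date_range"]
    return (base + ["zip_code"]) if country == "US" else base
```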
[00:45:56] Unknown:
And in your experience of working in this space of helping data teams understand what is the customer experience that they're trying to satisfy, and how they can actually go about delivering those capabilities, what are some of the most interesting or innovative or unexpected ways that you have seen teams either go through the process of developing and executing a given strategy, or some of the most interesting formulations of that strategy that you've seen?
[00:46:24] Unknown:
Yeah. When we think about customer experience, let's just kinda ground ourselves a little bit on the definition of how we bring that to life. The input to customer experience is really taking a look at your business and saying, these are the different touch points that our customers produce as they interact with our digital as well as our analog real estate. And right there, you can take the analog real estate out, and you pretty much have the digital real estate, and you say, okay, these are the different interaction points that we have. Alright. So now that we have that, we use those as the input to drive decisions that the customer then experiences.
And that whole process could be: how do we optimize the loan origination process for the lowest number of clicks to get to a decision? It could be: how do we make sure that Tobias gets the most relevant content presented on screen so that he quickly makes a decision on buying a product that is relevant to his need? So what I'm trying to get at is that the way I've seen teams do a really good job of that is asking the question as to what is the core metric that we need to hedge on that clearly defines whether the customer experience is optimal or not.
Is it the number of clicks? Is it the time taken per page? Is it the number of items left in a basket? What is it? I haven't seen many data teams do it, but I've seen a lot of business intelligence teams do it, which is they really anchor and ask the question as to what is the metric that we need to be optimizing for, and they get that formulated, listed out accurately, and done well. The next thing from there that I have seen data teams do well is take that, think about all of the data elements that come through and formulate that answer, and start putting in early signs of failure. So, for example, in order to determine the number of clicks, we get that from five different systems. And we know that when we get it from this one system, we have to make sure that the integrity and the quality are extremely high. Okay, but we produce this on a weekly basis. Should we flag this at the end of the week, or can we flag it as and when the data is coming in, to say, this is out of bounds, and this doesn't make any sense? There's a new ordinal value that we need to flag. Oh, these two systems are no longer in sync because our join characteristics are gonna be off, and, by the way, now this is gonna lead to a massive skew.
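A minimal sketch of those as-the-data-arrives checks might look like the following; the expected value set and the join key are stand-ins for whatever the five source systems actually provide.

```python
import pandas as pd

EXPECTED_STATUSES = {"new", "active", "closed"}  # known ordinal values

def new_ordinal_values(batch: pd.DataFrame,
                       column: str = "status") -> set:
    """Flag values we have never seen before, as the data arrives,
    instead of waiting for the weekly roll up."""
    return set(batch[column].unique()) - EXPECTED_STATUSES

def share_of_unmatched_keys(left: pd.DataFrame, right: pd.DataFrame,
                            key: str = "customer_id") -> float:
    """Share of keys in `left` with no match in `right`; a rising value
    means the two systems are drifting out of sync and joins will skew."""
    return (~left[key].isin(right[key])).mean()

batch = pd.DataFrame({"status": ["new", "active", "on_hold"]})
print(new_ordinal_values(batch))  # {'on_hold'} -> raise a flag
```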
So to summarize, where I've seen data teams do really well is building those capabilities around observability and monitoring. And for me, those are two distinct things. Monitoring is the things that you actually know you can monitor for, and observability is everything else coming through that you're able to decipher and understand. And then using machine learning, almost, to help you understand the patterns and behaviors, the slow drift that's going on. And relying less on the operational systems to tell you where the problems are, because if the operational systems have issues going on, they can easily flag it, but otherwise they kinda go through and drive whatever it is that they need to do and keep producing results. So having a lot of that infrastructure built on the data engineering side to drive that out is where I've seen data engineering teams innovate and excel.
Because the alternative is to ask, why don't we see a lot of data teams innovate on the KPI side, or push the business to think more about that? They don't. It's almost like having the spark plugs define the car. It doesn't work that way. And so I think it's an unfair expectation to have of data teams. What I think they do really well is optimizing on the infrastructure piece that I mentioned.
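For the slow-drift point, a simple statistical baseline is often the first step before any machine learning; the sketch below compares the latest daily value against a trailing window, and the window size and threshold are arbitrary choices for illustration.

```python
from statistics import mean, stdev

def is_drifting(history: list, latest: float,
                window: int = 30, threshold: float = 3.0) -> bool:
    """Flag `latest` when it sits more than `threshold` standard
    deviations from the trailing window -- a crude stand-in for
    fancier drift detectors."""
    recent = history[-window:]
    mu, sigma = mean(recent), stdev(recent)
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) / sigma > threshold

daily_rows = [1000.0, 1012.0, 995.0, 1003.0, 998.0, 1005.0]
print(is_drifting(daily_rows, 1004.0))  # False -- within normal range
print(is_drifting(daily_rows, 1200.0))  # True  -- investigate upstream
```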
[00:50:06] Unknown:
And in your experience of working in this space, what are the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:50:14] Unknown:
Always question the core set of assumptions coming in. Also, people will hand you over code. I mean, oftentimes what really happens is you're trying to build an analytics product, and you're trying to walk all the way back to the source system. You're trying to analyze the data, and you've got people telling you how the data is manifested in these systems. And they will talk about it. They give you these diagrams and all those different things. I think taking a synthetic transaction all the way from left to right, in terms of, here's how the data originates, this is how it gets manifested in these systems, these are all the assumptions we're making, these are the edge cases, documenting all those items, and seeing it and living through it, is not just key, it's paramount. Because one of the things that always shocks me is you come in and people on the operational side will say, and let's just take the example of a trucking company, hey, whenever our trucks leave late, our drivers always enter the information. It's a part of our SOP, but we don't see that in the system. And why don't you see that in the system? Well, the thing is, they tried entering it in this field before, and it didn't quite work for them, so they started using the comment field afterwards. So, yes, they are doing it, and the SOP is still active and relevant.
However, that data is in the system. It's just not where they said it would be. One of the good mitigation strategies that I've discovered for this is to go out and see: take a walk with the actual executioners of the process and see what that means. And that's another piece that I kinda bring to the surface: business process, understanding business process, and walking that into which operational system the data is manifested in and how it's manifested. That top to bottom viewpoint is important so that you can tease these kinds of things out.
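To make the trucking example tangible: if the walk reveals that drivers record late departures as free text, a hypothetical recovery step could be a comment-field parse like the one below. The pattern and field are invented; the real fix still starts with going out and seeing how the data is actually entered.

```python
import re

# Drivers abandoned the dedicated field and record late departures in
# free text, e.g. "left late 45 min - dock congestion".
LATE_PATTERN = re.compile(r"left late\s+(\d+)\s*min", re.IGNORECASE)

def late_minutes(comment: str):
    """Recover the late-departure fact from the comment field;
    returns None when the comment does not mention one."""
    match = LATE_PATTERN.search(comment or "")
    return int(match.group(1)) if match else None

print(late_minutes("Left late 45 min - dock congestion"))  # 45
print(late_minutes("routine run, no issues"))              # None
```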
[00:52:09] Unknown:
And for teams who are starting down the path of trying to incorporate these strategic processes into their delivery workflow, what are the cases where going through the whole process of building a data product strategy, and using that as the means to identify and prioritize work to be done, is overkill, and you just need to focus on the technical aspects, because that is actually the core capability that you need to deliver?
[00:52:39] Unknown:
I think when you're fairly small, and when I say fairly small I mean you've got a team of, let's say, five people, and you provide analytics to the organization, and you formulate and work through it on a solution by solution basis, and that's all that you have, you can still start thinking about data and the concept of a product and defining a strategy, but your throughput, or the ammunition that you bring to the table, is gonna be far less. So you're gonna accumulate a ton of technical debt as you go through it, and, honestly, in the beginning, that's gonna be par for the course. So in that case, the team may not think it's overkill, but your stakeholders may. Because the initial cost of you building a pattern based ingestion framework that will automatically ingest data, man, the cost of that for a single use case will be extremely high. So my suggestion is that for places where you don't have a lot of executive leadership support, i.e., those leaders haven't come from a very strong data background, they can't see the need for it, and they need to see hard numbers in the context of a single, very myopic use case, this will be 100% overkill. So then the question is, well, is it still not right for the organization, and what should we do about it? And I think this is where, as you work through the use cases, you carve out a certain segment of your backlog and use that in a very nuanced way to start building some shared capabilities. And this is the point I made earlier about your acceleration being less, which means you're not gonna travel as fast as you normally would. I think those are par for the course, but that's kinda what I would do in cases like that, and those are the places where this would be overkill. In areas where you've got executive support, where you've got a set of people around you who have actually seen the need for building data products at scale, and you have multiple teams that are all producing data products of different varieties,
there may be a big aspiration to provide some of these central capabilities to lower the overall cost of production: building the use case for that, showcasing what the ROI looks like, and doing the things that product managers do day in and day out. In organizations like that, where you have fifty people all producing products, or solutions, so to speak, that get serviced by consumers, you could start seeing these kinds of concepts accepted more often than not. Just to summarize, I think it's relevant in either kind of organization, but it's more pertinent, and the investment is a lot easier to make, where you have a lot of people working through providing data solutions. And you take a look at it and say, hey, didn't we just produce that dataset last week? Yeah, that had four columns, but this has five. So why is that other team doing it? Why don't we just take this dataset, make it into an asset, and put that out there? And, by the way, why don't we put privacy treatments on it as well, because that other team did that too? How do we mix it in? Oh, you're spinning up an S3 bucket in this way, right?
Why don't we use Terraform to go through and do that? Oh, well, our standards are different, our naming conventions are different. And so these kinds of problems come at scale. Because now Tobias can't move from team A to team B: even though they use the same cloud provider, the way that they do business is different, and so the op model is different. So these are problems at scale, whereas in smaller teams it's more forgiving, because the telephone problem is not that high.
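As a sketch of the "pattern based framework that will automatically ingest data" mentioned above, under the assumption of a simple declarative config, onboarding a new source becomes one config entry instead of one bespoke pipeline, which is exactly why the up-front cost only pays off once several teams share it. All names and paths here are invented.

```python
# Hypothetical declarative ingestion config: each new source is a
# config entry, not a bespoke pipeline.
SOURCES = [
    {"name": "orders", "format": "csv",
     "path": "s3://raw/orders/", "schedule": "daily"},
    {"name": "shipments", "format": "parquet",
     "path": "s3://raw/shipments/", "schedule": "hourly"},
]

def ingest(source: dict) -> None:
    """Dispatch on the declared format; a real framework would add
    schema checks, privacy treatments, and naming conventions here."""
    print(f"ingesting {source['name']} ({source['format']}) "
          f"from {source['path']} on a {source['schedule']} schedule")

for src in SOURCES:
    ingest(src)
```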
[00:56:21] Unknown:
And for teams and individuals who are trying to upskill into the space of managing data product strategy, or understanding how best to integrate it into their work, what are some of the resources that you have found useful and that you recommend people dig into to be able to understand more of the tactical elements of how to bring data product strategy into the work that they're doing for delivering data to their various end consumers?
[00:56:47] Unknown:
Really homing in on and understanding software development practices and what they mean, I think, is a good space to start off in. So this involves everything from what CI and CD mean, what building services really looks like, what contracts mean in this space, like API contracts, and what discoverable services look like. And this is very, very software engineering oriented, and that's where I assume there's gotta be a little bit of learning coming to the table. The other part, which I think data engineering teams and practitioners currently providing data and analytics solutions will bring to the table by themselves, is the inherent way in which data is different. The data assets being produced, the information assets being produced, are different than just core services. So how do you think about the operating model there? What does that look like, and how do you take these concepts and build them into this? For us, for example, when we produce a data pipeline, do we have a baseline data set that we can test against every single time? How do we measure drift? What does that mean? Should we build leaderboards or not? And then using that kind of introspective Q&A to start building out capabilities, to say, okay, this is what it means and what it looks like, and start leveraging and deep diving on those items. That's what I would suggest. Now, tactically, there are a lot of thinkers in this space who have all provided their own perspective on what it means. I mean, Thoughtworks as a company, I think, has done a ton in the space of data products.
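Picking up the baseline-dataset question from above, one common shape for that test, sketched here with pandas and entirely invented data, is a golden-file comparison that runs on every change to the pipeline.

```python
import pandas as pd

def run_pipeline(raw: pd.DataFrame) -> pd.DataFrame:
    """Stand-in for the pipeline under test."""
    return raw.groupby("region", as_index=False)["revenue"].sum()

def test_pipeline_matches_baseline():
    """Replay a frozen input and compare against a reviewed 'golden'
    output; any unexplained diff fails the build."""
    raw = pd.DataFrame({"region": ["east", "east", "west"],
                        "revenue": [1.0, 2.0, 3.0]})
    baseline = pd.DataFrame({"region": ["east", "west"],
                             "revenue": [3.0, 3.0]})
    pd.testing.assert_frame_equal(run_pipeline(raw), baseline)

test_pipeline_matches_baseline()  # raises AssertionError on drift
```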
Sanjeev Mohan has done a lot of thinking in the data product space. You've got data contracts with Chad Sanderson, for instance. So I think staying close to all of these different vectors coming up is a big one as well. What I found exceptionally helpful is staying close to all the Slack channels where different people are really ideating and thinking about what this means. And our space is constantly evolving as well. If you think about metric stores, or the concept of fabric that has come to fruition, different people are working on different things in that arena. If you think about data observability, or data contracts, these are all relatively new concepts that have come up over the last three years. They've started to take shape, and they've started to take hold, and thinking about how this impacts our space is gonna be the biggest one. And for us, what that means is a ton of change. So when you are in these Slack channels, whether it's for data quality or data observability, provided by Bigeye or any of the other companies, you tend to start hearing people talk about these interdisciplinary concepts and bringing them together. And then, obviously, the shameless plug for your own podcast, Tobias: if you're a data engineer and you're not listening to some of these things, you're probably missing the beat on the trends going on, and on incorporating that back into your own set of practices. So, tactically, those are the places that I would look.
[00:59:46] Unknown:
Are there any other aspects of this space of data product strategy, how to think about it from a tactical perspective, or how to incorporate it into overall work processes, that we didn't discuss yet that you'd like to cover before we close out the show?
[00:59:51] Unknown:
I think we did touch on it, but let me double click on it further. This concept of metrics, and really gauging whether your strategy is headed in the direction that it needs to head, is core. When we start thinking about a data product strategy, the question that we need to ask ourselves is: what are we going to get as a result of that? Is it going to be lowering the cost of producing products? Is it going to be increasing throughput on capabilities that we already have? Hedging on that and really understanding why, and what that means, is gonna be key and core. Also, understanding if you're doing this for defense or offense purposes, whether you're doing this to optimize on cost or trying to increase top line, and answering those questions initially, grounding yourself in why you're doing what you're doing, is gonna be super important. Otherwise, this will be just like another flavor of the day. You will be producing solutions and nothing more, probably at twice the cost and for one half the value.
[01:00:49] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[01:01:21] Unknown:
Firstly, I think the biggest problems that we have are about comprehension of how we use things, more than the technologies. But one aspect that I see we completely miss is this ability to learn from the way that others are using the tooling and the data within the ecosystem that we have, and then making our systems more intelligent. One of the things that we always think about with respect to data management is that it's kinda like being a cartographer. There are many cartographers all throughout your organization who are doing these queries, or merging or culling through data, and formulating these side roads. And oftentimes, when you start looking at it, they're interpreting how this data is assimilated together and creating this map of the organization. When one person does it, how can another person not take advantage of it? And when one person does it, how do we have enough confidence that that side road can carry the right level of throughput, so that we can actually go through and use it for other purposes? And then how do we auto migrate that up? That whole building of an intelligent ecosystem, where you have data that helps you derive the way to use new data, I think is completely lacking in this business. And I don't know if we're doing as much in that arena or not. So intelligent systems, and using AI for BI, I think, is a big one where I see us having a gap.
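As a toy illustration of that cartographer idea, and purely an assumption about what such a system might do, the sketch below mines a query log for table pairs that analysts keep joining, so that one person's side road can be surfaced and promoted for the next.

```python
import re
from collections import Counter

# Hypothetical query log; in practice this would come from the
# warehouse's query history.
QUERY_LOG = [
    "SELECT * FROM orders o JOIN customers c ON o.customer_id = c.id",
    "SELECT * FROM orders o JOIN customers c ON o.customer_id = c.id",
    "SELECT * FROM shipments s JOIN orders o ON s.order_id = o.id",
]

def frequent_joins(queries: list) -> Counter:
    """Count table pairs that keep getting joined -- each popular pair
    is a 'side road' worth promoting into a shared, governed asset."""
    pairs = Counter()
    for q in queries:
        tables = re.findall(r"(?:FROM|JOIN)\s+(\w+)", q, re.IGNORECASE)
        for a, b in zip(tables, tables[1:]):
            pairs[tuple(sorted((a, b)))] += 1
    return pairs

print(frequent_joins(QUERY_LOG).most_common(1))
# [(('customers', 'orders'), 2)]
```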
[01:02:40] Unknown:
Alright. Well, thank you very much for taking the time today to join me and share the work that you are doing and your experience of building and executing on data product strategies. It's definitely a very important area, one that has been growing in visibility and adoption. So I appreciate the time you've taken to share that with us, and I hope you enjoy the rest of your day.
[01:03:05] Unknown:
Thanks, Tobias. Appreciate it.
[01:03:09] Unknown:
Thank you for listening. Don't forget to check out our other shows, podcast.init, which covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com to subscribe to the show and sign up for the mailing list. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Chapters
- Introduction and Guest Background
- Defining Data Products
- Importance of Data Product Strategy
- Roles and Responsibilities in Data Strategy
- Deliberate vs. Emergent Strategies
- Team Size and Structure Impact
- Understanding Customer Experience
- Managing Technical Debt
- Customer Education and Data Usage
- Innovative Approaches to Data Strategy
- Lessons Learned in Data Strategy
- When Data Product Strategy is Overkill
- Resources for Learning Data Product Strategy
- Importance of Metrics in Data Strategy
- Biggest Gaps in Data Management Tooling