Summary
Applications of data have grown well beyond the venerable business intelligence dashboards that organizations have relied on for decades. Now data is being used to power consumer-facing services, influence organizational behavior, and build sophisticated machine learning systems. Given this increased importance, it has become necessary for everyone in the business to treat data as a product, in the same way that treating software as a product transformed businesses in the early 2000s. In this episode Brian McMillan shares his work on the book "Building Data Products" and how he is working to educate business users and data professionals about the combination of technical, economic, and business considerations that need to be blended for these projects to succeed.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Today’s episode is Sponsored by Prophecy.io – the low-code data engineering platform for the cloud. Prophecy provides an easy-to-use visual interface to design & deploy data pipelines on Apache Spark & Apache Airflow. Now all the data users can use software engineering best practices – git, tests and continuous deployment with a simple to use visual designer. How does it work? – You visually design the pipelines, and Prophecy generates clean Spark code with tests on git; then you visually schedule these pipelines on Airflow. You can observe your pipelines with built in metadata search and column level lineage. Finally, if you have existing workflows in AbInitio, Informatica or other ETL formats that you want to move to the cloud, you can import them automatically into Prophecy making them run productively on Spark. Create your free account today at dataengineeringpodcast.com/prophecy.
- StreamSets DataOps Platform is the world’s first single platform for building smart data pipelines across hybrid and multi-cloud architectures. Build, run, monitor and manage data pipelines confidently with an end-to-end data integration platform that’s built for constant change. Amp up your productivity with an easy-to-navigate interface and 100s of pre-built connectors. And, get pipelines and new hires up and running quickly with powerful, reusable components that work across batch and streaming. Once you’re up and running, your smart data pipelines are resilient to data drift. Those ongoing and unexpected changes in schema, semantics, and infrastructure. Finally, one single pane of glass for operating and monitoring all your data pipelines. The full transparency and control you desire for your data operations. Get started building pipelines in minutes for free at dataengineeringpodcast.com/streamsets. The first 10 listeners of the podcast that subscribe to StreamSets’ Professional Tier, receive 2 months free after their first month.
- Your host is Tobias Macey and today I’m interviewing Brian McMillan about building data products and his book to introduce the work of data analysts and engineers to non-programmers
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what motivated you to write a book about the work of building data products?
- Who is your target audience?
- What are the main goals that you are trying to achieve through the book?
- What was your approach for determining the structure and contents of the book?
- What are the core principles of data engineering that have remained from the original wave of ETL tools and rigid data warehouses?
- What are some of the new foundational elements of data products that need to be codified for the next generation of organizations and data professionals?
- There is a lot of activity and conversation happening in and around data which can make it difficult to understand which parts are signal and which are noise. What, if anything, do you see as being truly new and/or innovative?
- Are there any core lessons or principles that you consider to be at risk of getting drowned out in the current frenzy of activity?
- How do the practices for building products with small teams differ from those employed by larger groups?
- What do you see as the threshold beyond which a team can no longer be considered "small"?
- What are the roles/skills/titles that you view as necessary for building data products in the current phase of maturity for the ecosystem?
- What do you see as the biggest risks to engineering and data teams?
- What are the most interesting, innovative, or unexpected ways that you have seen the principles in the book used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on the book?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
- Building Data Products: Introduction to Data and Analytics Engineering for non-programmers
- Theory of Constraints
- Throughput Economics
- "Swaptronics" – The act of swapping out electronic components until you find a combination that works.
- Informatica
- SSIS – Microsoft SQL Server Integration Services
- 3X – Kent Beck
- Wardley Maps
- Vega Lite
- Datasette
- Why Use Make – Mike Bostock
- Building Production Applications Using Go & SQLite
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
The only thing worse than having bad data is not knowing that you have it. With Bigeye's data observability platform, if there is an issue with your data or data pipelines, you'll know right away and can get it fixed before the business is impacted. Bigeye lets data teams measure, improve, and communicate the quality of your data to company stakeholders. With complete API access, a user-friendly interface, and automated yet flexible alerting, you've got everything you need to establish and maintain trust in your data. Go to dataengineeringpodcast.com/bigeye today to sign up and start trusting your analyses.
Your host is Tobias Macey, and today I'm interviewing Brian McMillan about building data products and his book to introduce the work of data analysts and engineers to non-programmers. So Brian, can you start by introducing yourself? Thanks for having me on. My name is Brian McMillan. And professionally,
[00:01:47] Unknown:
I'm a longtime enterprise architect working in large corporations, primarily focused on data and analytics problems within those big companies. Even though EDS and HP are technology companies, a lot of what they're doing is just providing enterprise services to others. And they're definitely old-school companies. More recently, I was working at a major defense contractor. I left my job in October 2020 to write a book about building data products called Building Data Products: Data and Analytics Engineering for Non-Programmers.
[00:02:26] Unknown:
And do you remember how you first got involved in the area of data management?
[00:02:29] Unknown:
Yeah. So my degree is in economics, not computer science, and I am a business guy. That's really colored a lot of my career going forward. So we have to go back to the mid-nineties. My first real job was as a data analyst working for Electronic Data Systems in a General Motors manufacturing plant. And 1 of the things that was kind of unique to the business at that time is that the IT people were actually embedded in the customer's organization. So I got to participate in the daily management stand-ups. My job was to really run their manufacturing war room.
And the biggest thing was that they introduced me to the theory of constraints. It was an engine plant. They had old technology, GM was going through a lot of turmoil at that time, and the plant was always on the verge of being shut down. They decided to go all in on theory of constraints, and that made a big impact on me, because I saw, basically, a failing company turn themselves around, and it kept that plant open for almost another 10 years. The next 15 years I basically spent architecting, product managing, people managing, and hands-on building a variety of enterprise data management systems for EDS, HP, and Raytheon. And because you've asked this in a number of other episodes, I've been working remotely for most of the last 15 years, which I think is pretty unique.
So, you know, COVID sending us all home was actually very pleasant for me.
[00:04:02] Unknown:
Yeah. Levels the playing field. I've worked remote off and on throughout my career. And so when everything went fully remote, just no ifs, ands, or buts about it. It was kinda nice for me because it gave me an excuse to go back to being full time remote instead of being primarily in the office with part time remote. Yep. And there are a lot of
[00:04:23] Unknown:
really significant downsides to that, but a lot of positives and, hey, we didn't have a choice. Exactly.
[00:04:30] Unknown:
And so as you mentioned, you recently wrote this book. You made the decision to quit your job to spend the time on it. So I'm wondering if you can talk to some of the motivation for deciding that this book was necessary, that this was the time to do it, and that you wanted to actually have that full time focus on it, and just some of the overall story behind how you came up with the idea, the motivation around it, and how it came to be?
[00:04:55] Unknown:
So when I started working for Raytheon, I didn't hire into the IT department. I hired into the quality department. And it was pretty clear when I hired on that they had a really big data problem they were trying to solve. So I came in to build a data warehouse for their quality data. And having done this a whole bunch of other times, this was the first time in a long time that I had actually been really hands-on. You know, there's a certain point in an architect's career where they're no longer allowed to touch things. That was pretty frustrating to me, feeling like I wasn't able to touch things and work on them. So this was a great opportunity.
They had had a lot of trouble with the IT department. You know, the idea of a business organization getting a database to do their own database work was not very popular. So they had managed to get themselves a database. They had some really good ideas of projects they wanted to work on and problems they wanted to solve, but they needed somebody to actually come in and do that work. So I got hired in and started to do that work for them. It was pretty clear that it was way more work than 1 person could do, which is never a good idea anyway, but that's where you always start. I got the opportunity to do a lot of training on that team with people who had never touched a database before.
And they had good domain knowledge, but they didn't have the technical skills. That started a ball rolling. And probably 1 of the biggest things that I learned was how important that domain knowledge, that domain expertise, really is. So 1 of the first things we did was, well, what's our production yield? We don't have a good way to look at our production yields, and that should be pretty simple. We'll just go in, we'll look at the data from the warehouse, and we'll just write some reports. That wasn't what it turned out to be at all. What we quickly found out was that they had hundreds of serialized parts that were being recycled through the production process. So you go to do your recursive query to figure out what your bill of materials looks like, and you can't, because you've got loops.
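The recursion problem described here can be sketched in a few lines: a naive bill-of-materials walk loops forever when a serialized part re-enters the process, so the traversal needs explicit cycle detection. This is a minimal illustrative sketch, not code from the book; the part numbers and structure are invented.

```python
# Hypothetical bill-of-materials: part -> assemblies it feeds into.
# Part "A100" re-enters the process (the "Swaptronics" loop).
bom = {
    "A100": ["SUB1"],
    "SUB1": ["TOP"],
    "TOP": ["A100"],   # loop: serialized part recycled into a new assembly
}

def where_used(part, bom):
    """Walk up the BOM, recording each parent once and flagging loops."""
    seen, stack, loops = set(), [part], []
    while stack:
        p = stack.pop()
        for parent in bom.get(p, []):
            if parent in seen or parent == part:
                # a naive recursive query would chase this edge forever
                loops.append((p, parent))
            else:
                seen.add(parent)
                stack.append(parent)
    return seen, loops

parents, loops = where_used("A100", bom)
# parents is the set of assemblies reached; loops lists the cyclic edges
```

The `seen` set is the whole trick: without it, the walk re-visits `A100` indefinitely, which is exactly why the team's recursive query "couldn't be done."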
So, you know, as a technical person, what you do is start chasing all these rabbit trails to try to figure out how this could possibly be happening, because it doesn't make sense. Instead, we were really lucky on that team. We had somebody who had been in manufacturing for, oh jeez, let's just say decades, and he had a story for everything. So I learned the story of Swaptronics. Unlike a lot of businesses, you know, Raytheon, they are rocket scientists, and all of the work that they do is just bleeding edge. The sensor systems in particular are tough to test, so parts fail test. They pull a part out. They put it back on the shelf. It eventually goes back in another device, and sometimes it matches up with the rest of the components and sometimes it doesn't.
I never would have known that that's what was going on unless I had talked to someone who had been on the plant floor and said, oh yeah, we do this all the time, it's no problem. Well, it is a big problem, and we need to stop doing it. That pivoted our whole work to trying to figure out how to solve that problem. And it became very clear that domain knowledge is absolutely the most important thing you can have in order to solve the business problems you face. There's no way around it. You can't algorithm your way out of it. The other thing that motivated me was I just couldn't believe it literally took that team 7 years to get a production SQL Server database on the network that they had access to, and it didn't come with any ETL tools.
Like, what good is that? So the big thing that prompted me is I've been in this business a long time. And as an industry, I would like to think that at some point we'd be ready to come to terms with the fact that we keep doing the same things over and over again. We keep reinventing the wheel. Most of our projects still fail, and we tend to collect a lot of data that we don't know what to do with, just because it's fairly easy to collect a lot of data. But that ends up generating a lot of technical debt. And then for enterprise companies in particular, their whole operational model is all centralized.
You know, we have centralized IT departments, and you go to the data modeler, and you go to the ETL person, and you go to the report developers, and we've gotta have project managers managing the whole thing, and it just doesn't work. Outside of big enterprise companies, we are doing things to fix that problem. But there's a huge opportunity inside big enterprise companies to solve it. It's definitely interesting how,
[00:09:56] Unknown:
you know, the current buzzword is the modern data stack of everything as a service. It's easy to just get a new database. You just throw a credit card at it, but that's only the case for a certain subset of the industry. And as you pointed out, in the enterprise, you have these procurement paths. Like, you can't just throw a credit card at something, because that credit card is being held by a gatekeeper that has a stockade of paperwork to fend you off. And so
[00:10:22] Unknown:
yeah. That's a big nut to crack. I don't know how to solve that problem. I mean, I know how to solve it in a subversive way. Quite frankly, that was 1 of the motivators of the book. There is a bit of subversion in the book. Like, here's a whole bunch of free software that you can implement to do basically everything you'd wanna do. Orchestration, serving, storage, you name it. It's all in here. You can do that if you want. You may or may not wanna do that.
[00:10:57] Unknown:
Sometimes you have no choice. Right. Yeah. I mean, there's definitely the double-edged sword of shadow IT of, okay, these people are unblocked. They're able to get their job done, but they're not necessarily doing it in the most effective way, or they're reinventing their own wheel that's already been solved by somebody else in the organization. And so there is that problem of being able to connect up all the people who have the right problems and solutions.
[00:11:20] Unknown:
Yeah. I lucked out as an architect. 1 of the side jobs I've always had is: go find the shadow IT teams. Go find them, find out what they're doing, and decide what we should do about them. And a lot of times, it's, we're gonna give you funding. We're gonna give you additional resources to scale your solution up. Yep, you've got a great start of a server monitoring platform in Australia. We're going to take that, but we're gonna have to rewrite the entire thing, and you get a central role to help build it. That isn't what a lot of IT organizations do.
The first position is almost always to shut these teams down, make it harder for them, so that they have to bring us requirements, and we'll work on those requirements and rebuild their thing for them. But that isn't what they want. They just want to meet the business problems they have. And, generally, they have a very good idea of what their business problems are. As IT folks, we tend not to know what the real business problems are that we should be focusing on, because we're focused on technology, because we like bright shiny things. Yeah. And I think that that's probably why
[00:12:29] Unknown:
the data ecosystem has been going through such a long and cyclical route of self-discovery, because it's never just the technical solution, and it's never just the business problem. And it's always hard to get both of those sides in the same room, agreeing with each other, and even speaking the same language. So I think that that's probably why we keep going through these, oh, well, we'll build this new evolution of this technology platform, and that's going to solve our problems. And, like, nope. Still have the same problem.
[00:13:02] Unknown:
Yeah. Absolutely. You know, enterprise application integration. Well, it smells a lot like Kafka.
[00:13:10] Unknown:
Absolutely. And now that we have Kafka, we're still running into the same problems of, okay. Well, we just put everything into this service bus, but now we don't actually know everything that's using it. Or, oh, we just broke the contract for this data structure because it was being used by this other service that is actually mission critical now.
[00:13:29] Unknown:
Yeah. Yeah. It always has been and probably always will be. And then there's that pendulum between centralized and decentralized. Like, right now, the big thing is, you know, there's the modern data stack, the 100 vendors that are inside of that, and all the overlap and whatnot. And then you have the decentralized pendulum swinging pretty hard, with things like data mesh. We'll probably talk later about that. But that centralized-decentralized pendulum has always been swinging around. And it feels to me like right now, that pendulum has kind of peaked or is close to peaking, and a lot of the modern data stack vendors are starting to get big and starting to be more centralized. And I think that's 1 of the things we're gonna see: some pretty aggressive consolidation in the business. Yeah. Absolutely. And another interesting
[00:14:20] Unknown:
trend is the repackaging of the modern data stack by companies such as Mozart Data: okay, so you've got 5 different tools that you need to use to do this 1 thing, so we're just going to be that 1 bill for you, and we're gonna run those 5 tools on your behalf.
[00:14:36] Unknown:
Yeah. Which I think to enterprise customers is gonna be very appealing. Absolutely. Because the people who are buying those systems are not the people who are actually going to use them. So that value proposition that we can take this complex, intertwined architecture and package it up, and we can guarantee you, knock on wood, that everything works together nicely. But, again, the problem is you've got all these tools that people need to learn. You've got business problems that you need to try to solve. And all of that's very difficult to deal with. Absolutely.
[00:15:16] Unknown:
And so in terms of the book itself, I'm wondering, as you were setting out to write it, what were you keeping in mind as your target audience, the primary goals that you were trying to achieve and help them realize through the creation of this book, and the kind of core lessons that you're trying to impart? Well, 1 is, you know, architects at traditional IT departments,
[00:15:40] Unknown:
you know, used to the typical traditional IT stack. Okay, we have Oracle databases or Microsoft databases. We're using Informatica or Integration Services. If you're a Microsoft stack, we may have stood up Analysis Services, and we've got a whole bunch of other web apps that we may have built internally. If you're fortunate enough to have people who are doing application development in your organization, they probably switched a long time ago to modern software development practice, which is 30 years old at this point. They're doing continuous integration, continuous deployment, test-driven development. All of those things that help you build a better software product, they're doing that. But on the data side, we tend not to do that. We gave up on testing data a long time ago. Just too complicated, we can't do it. You know, if the file doesn't show up and the dashboard breaks, well, then somebody will call us. The executive will call us and say the dashboard hasn't updated, where's the fix? Then we'll go fix it. And we really shouldn't be behaving that way. You know, people have figured these problems out.
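Testing data doesn't have to be complicated: the same kind of small assertions application developers use can catch a missing file or a bad batch before the executive's dashboard breaks. A minimal sketch, where the column names and rules are invented for illustration, not taken from the book:

```python
# Minimal data tests: fail the pipeline early instead of waiting for
# someone to notice a stale dashboard.
def check_sales(rows):
    """Run basic assertions on a batch of sales records (list of dicts)."""
    errors = []
    if not rows:
        errors.append("file was empty or never showed up")
    for i, r in enumerate(rows):
        if r.get("quantity", 0) <= 0:
            errors.append(f"row {i}: non-positive quantity")
        if not r.get("invoice_id"):
            errors.append(f"row {i}: missing invoice_id")
    return errors

batch = [
    {"invoice_id": "536365", "quantity": 6},
    {"invoice_id": "", "quantity": -1},   # bad row, should be caught
]
problems = check_sales(batch)
```

In a real pipeline, a non-empty `problems` list would stop the run and alert the team, which is exactly the continuous-integration discipline the speaker says data teams gave up on.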
So architects need to know that there are alternate ways of doing things. And I'm presenting in the book kind of a ridiculous minimum viable product that's got the entire stack. In the technical part of the book, we're building a little cupcake of a data product, looking at sales data. And the data is actual sales data from a company in the UK. It's very messy data. And at the end of the day, that data gets exposed as APIs and a web GUI on Google's cloud infrastructure. And you can do that. And it's not really that complicated. And you can treat everything as code, and you don't need to resort to GUI applications.
And that may not be where you're at right now, but you need to start thinking about that. The second target, I've talked briefly about this before, are those shadow IT teams. You know, the teams who are just fed up, and maybe they need to learn some better tools to really take it to the next level. And then maybe the IT departments will come along and help them out.
[00:17:58] Unknown:
In terms of the contents of the book, you mentioned that you're working with real sales data. You're iterating through building out a small scoped data product from that information. I'm wondering if you can talk to the overall approach that you took for deciding what was the structure of the book, what were the kind of main technological choices you were going to lean on for determining this is how I'm going to impart these core lessons and some of the ways that you are able to talk through these are the technologies that we're using, but these are the actual fundamental principles that we really care about here. So, you know, not to get too bogged down into the specifics of tool x or y. Well, that was the first thing. Don't
[00:18:39] Unknown:
make it too tool-centric. 1 thing I didn't wanna do was make this a dbt-and-Snowflake text. So the start of it was really getting back to those core business concepts, starting with things like product life cycle. All products go through this S-curve product life cycle. They start out in, and I'm using Kent Beck's model here, 3X, which he says he developed at Facebook, but, you know, it's in the Extreme Programming book already. Yep. Things start out in exploration. In exploration, you don't know what's going to work. You don't know what's gonna bring value. So what you should be trying to do is run lots of little experiments.
And as you get traction on real business value, do the next step, then do the next step. And, eventually, you'll make this transition where you're expanding. You're bringing on more users. You're delivering more value. The product's getting more complicated. And then you make another transition at some point where the value starts to level off. And you're in what he calls extract, where you're really in a process of making a good tool better, not necessarily making big improvements to it, but making small incremental improvements and just extracting whatever value you can out of it. You're not really investing a whole lot of new effort into the product, but you're getting the most out of it. And then the thing that isn't in the 3X cycle is the exit phase, which is critically important. At some point, you have to start winding that product down.
Hopefully, it's gonna be replaced by something else. Or your environment may say, you know, we don't need this thing anymore, and it drops off a cliff. So that's the first thing people need to understand. And with that product life cycle, you know where you prefer to sit: if you're a builder, and you like challenges, and you like fighting fires, you're an expander. If you're somebody who's more concept-based, you're probably an explorer. And there are plenty of people who just wanna make things run like clockwork. That's the first big business concept. The second 1 is that you need to understand how your company makes money.
You need to understand the value chain, and this is probably where all these projects and teams need to start. Do we know how our company makes money? A great example of that was when HP and EDS merged. I was the architect for the availability and capacity organization, and both companies were supporting about 200,000 servers apiece. So we were gonna put 400,000 servers' worth of performance data into a single data warehouse, and it wasn't working very well. We had lots of arguments about how to do that and what was important. And the key came down to understanding that value chain. As it turns out, about 90% of Electronic Data Systems' business was in hands-on capacity management. We had people looking at server performance and making recommendations for how you should manage your servers, where HP was only 10% advanced management.
Totally different business models. The solutions in those 2 cases are going to have to be different solutions. The money is being made differently, and it behooves you to figure that out and treat those 2 value chains separately. The advanced level of that is starting to look at things like Wardley maps, breaking that value chain down by the product life cycle phase each piece is in, and treating those pieces of your solution differently. Yeah, Wardley maps are fantastic. And then the 3rd business concept is that, just like products have life cycles, companies have financial life cycles. Like, in a start-up, your job is to just figure out how to get somebody to pay you something.
It may not be the product you wanna build, and probably won't be the product you intended to build, but your job is to find a way to make money. And then eventually you get customers, you're profitable, but your profits are fluctuating. And there are different metrics that you can look at. The academic name for that is throughput economics. There's some good work in there from the theory of constraints folks about what metrics you should be measuring. You know, when is it appropriate to do things like measure your throughput and when is it not? Throughput's kinda central to theory of constraints, but it's not always appropriate to measure it. So that's on the business side, and that's really the first third of the book. The second third of the book is about demonstrating some key technical skills: how to write some basic stories with little to no planning, and how to get familiar with command line tools. You know, most people, particularly in traditional IT departments, are very GUI-application focused, and we are terrified of the command line. But there are a lot of great utilities out there that just require you to write, you know, 5 words, and you can be deployed on Google's infrastructure.
It's really amazing. And once you start to see that, you go, oh, wait a second, I can work in a completely different way. SQL, of course: there's been a renewed realization that just plain SQL gets you a long way into your problems. Well, I'll say that it brings some problems with it, and it can be used inappropriately. Automation and orchestration: that's a big problem, particularly for the shadow IT teams. Yeah, great, we built this great thing. We've got a great data model. We've got beautiful reports, but I still need to come in on Monday and push buttons and babysit things. You don't need to do that. In the book, I use Make and use a Makefile for everything.
And that 1 hung around in my head for a long time, after watching a presentation on Airflow years ago. And the description was, well, it's just like Make for data. That's interesting. You know, I've compiled programs with Make, but I never really thought about using it for data. It's awesome for that. The syntax is easy. If everything you're doing is on the command line, you're just stringing a bunch of command lines together without having to write some nasty Python script or some nastier bash script. Just really clean. And then 1 thing that I don't really talk about in the book, but is critically important, is version control.
Get your code in version control. Stop putting it on file servers. Yes, I know you are doing that. Stop it. Don't do that. Even if it's just check-in and sync up, start doing it, and do trunk-based development, because when you get to CI/CD, it'll make that a lot easier. Those kinds of things are new concepts to most, you know, enterprise data people. How do you treat a Tableau report as text or as code? You can't. It's hard.
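The "Make for data" idea described above boils down to one rule: rebuild a target only when it's missing or a source it depends on is newer. A real Makefile expresses this declaratively; the sketch below just illustrates the freshness check itself, with invented file names:

```python
import os

def needs_rebuild(target, sources):
    """Make's core rule: rebuild if target is missing or older than any source."""
    if not os.path.exists(target):
        return True
    t = os.path.getmtime(target)
    return any(os.path.getmtime(s) > t for s in sources)

# Toy two-step pipeline: raw.csv -> clean.csv (names are illustrative).
with open("raw.csv", "w") as f:
    f.write("invoice_id,quantity\n536365,6\n")

if needs_rebuild("clean.csv", ["raw.csv"]):
    with open("clean.csv", "w") as f:
        f.write("invoice_id,quantity\n536365,6\n")   # stand-in for the real transform
```

Run it twice and the second run does nothing, which is exactly the "no Monday-morning button-pushing" property: each step fires only when its inputs have actually changed.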
[00:26:04] Unknown:
Today's episode is sponsored by Prophecy.io, the low-code data engineering platform for the cloud. Prophecy provides an easy-to-use visual interface to design and deploy data pipelines on Apache Spark and Apache Airflow. Now all the data users can use software engineering best practices: Git, tests, and continuous deployment with a simple-to-use visual designer. How does it work? You visually design the pipelines, and Prophecy generates clean Spark code with tests and stores it in version control. Then you visually schedule these pipelines on Airflow. You can observe your pipelines with built-in metadata search and column-level lineage.
Finally, if you have existing workflows in Ab Initio, Informatica, or other ETL formats that you want to move to the cloud, you can import them automatically into Prophecy, making them run productively on Spark. Learn more at dataengineeringpodcast.com/prophecy. Given the fact that you are trying to cover these technical practices and keep them rooted in the business requirements, I'm wondering: as technologists, we are often tainted by the hubris of, oh, well, this is technology and it's hard, so I'm not going to expose that level of detail to the business users. I'm just going to make it as pretty as possible, give them some pretty pictures so that they can make their own decisions, and we'll use that as the handoff. And I'm wondering what you see as the realistic expectations for business users to actually adopt the core technological tools and approaches that we as engineers have become accustomed to, and vice versa: getting engineers to actually understand the various business concepts, the economic modeling, and all of these process and organizational concerns beyond the realm of the technologies and tools that we're using to make bits fly around the ether?
[00:28:05] Unknown:
Yeah. Well, by far the hardest thing is to get technical people to understand the business, because the business people have enough trouble with it themselves. If you're in the marketing department, you're focused on marketing things. You're not focused on, is it possible to build the thing that you're trying to sell? So business people have this trouble as well. But the more people who can understand, you know, how to think holistically about the value chains for the company you're in, the better off everyone will be. It will be much easier to have conversations about what's valuable or not, because we spend a lot of time suboptimizing things and just making busywork for ourselves because we have to be busy. Right?
So getting technical people to understand the business is the most difficult. For getting the business-focused people to understand the technical stuff, you have to keep your eyes open. There are a lot of analysts who are really good SQL developers, or could be if you taught them common table expressions and window functions. There are lots of analysts out there, you know, writing their Tableau reports, who can write subqueries, who can write the spaghetti-code nested subquery thing. If you just spend 5 minutes teaching them how to do a common table expression and pop that logic out into an independent piece that they can eyeball and maybe someday test, they will be ecstatic, because it now gives them some confidence to start doing other things.
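As a small illustration of that refactoring, here is a sketch using Python's built-in sqlite3 module; the table and column names are made up, not from the book:

```python
import sqlite3

# Toy data: a hypothetical orders table.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE orders (customer TEXT, amount REAL);
INSERT INTO orders VALUES ('a', 10), ('a', 30), ('b', 5);
""")

# The nested-subquery version an analyst might start with:
nested = """
SELECT customer, total
FROM (SELECT customer, SUM(amount) AS total
      FROM orders GROUP BY customer)
WHERE total > 20;
"""

# The same logic with a common table expression: the intermediate
# result has a name and can be eyeballed (or tested) on its own.
with_cte = """
WITH customer_totals AS (
    SELECT customer, SUM(amount) AS total
    FROM orders
    GROUP BY customer
)
SELECT customer, total FROM customer_totals WHERE total > 20;
"""

print(con.execute(nested).fetchall())    # -> [('a', 40.0)]
print(con.execute(with_cte).fetchall())  # -> [('a', 40.0)]
```

The two queries return identical results, but the CTE version gives the aggregation a name (`customer_totals`) that can be inspected on its own, which is the confidence-building step described above.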
If you can get people comfortable with SQL, and you can get people comfortable with checking it in to a centralized repository that's shared with everybody else, then try to do some kind of teaming, you know, pair programming, that kind of thing, so that the knowledge is spread around. Eventually, you're gonna find someone who really wants to know how to write Python and R, and they've got the basics down. Now you could get a case where you've got someone who knows how to get the data that they wanna use, and now they have the technical skills to start doing more data-science-y type things. Lord knows there are a lot of data scientists who can't get data out of databases, which blows my mind, but it happens a lot.
They're just not comfortable with it. So it goes both ways, which gets to a bigger question: you have to start with what you have. A big part of data literacy is just trying to figure out where you are right now. What skills do you have? What capabilities could you get the team up to quickly? That's a hard problem. That's really the tricky part of data literacy.
[00:30:42] Unknown:
Yeah. The interesting aspect of it, you know, beyond just the technical bits, is understanding: how do you create and propagate context for the information that you're actually using? How do you understand the statistical and semantic elements of manipulating the data? What are the downstream impacts of these mutations as far as what you can and can't do with the data after you've made them? So it's definitely a large and complicated sea of concepts no matter what your background is.
[00:31:17] Unknown:
Yeah. I think the big thing is there aren't enough data people to go around, and we have to make more data people. Absolutely. Yeah.
[00:31:26] Unknown:
Yeah. That's really what it boils down to. There's a presentation by Jez Humble that I referenced in another interview recently, from early on in the process of DevOps adoption, saying, you know, stop trying to hire your DevOps people and create them instead, or something to that effect. And we're definitely in that same phase with data, where we're not gonna hire our way out of these problems. We have to start educating everybody who's already working on the problem to understand it more thoroughly, so that they can do the things that need to get done, rather than trying to hire the next data scientist or data engineer who already has all the skills, you know, but in a tool that we're not actually using right now.
[00:32:09] Unknown:
Yeah. That's a great point. You know, these DevOps practices are 98% applicable to data. Like, I hate the term DataOps. It's the same thing, people. We're shooting for the same target. We want reliable, reproducible products. That's all we're looking for. Absolutely.
[00:32:28] Unknown:
And so to that point, another thing that has come up in some of my conversations recently is, you know, maybe the idea of the data engineer or the analytics engineer or what have you is starting to be on the wane. And we don't actually need these specific job titles, but we really just need our developers who understand how to work with data. And so that just needs to become the kind of baseline status quo of if you're an engineer, everybody needs to understand these concepts because it's just becoming more ubiquitous, and so we need to generalize and not specialize in these regards.
[00:33:05] Unknown:
Yeah. You know, it's that Conway's law thing. If you've got data engineers and analytics engineers and data scientists, oh, and then you've got the platform folks, you know, your site reliability engineers and whatnot, you're going to end up with siloed, centralized systems. And there's always gonna be a space for centralization and, you know, people with deep expertise, that T-shaped skills thing. There are gonna be people who absolutely have to be T-shaped, with deep, long stems, because this stuff is complicated. But we need more people to be generalists.
And if we can combine people who have more general skills with fewer requirements, you know, the big one is cutting down the amount of data you have to process. Like, if you can narrow down the data set you're working with, suddenly that makes a lot of very complicated things a lot easier to deal with. So start doing more of that. Get more general. You're going to naturally end up with a more distributed system in a more distributed environment. And hopefully, you're going to end up with teams that are really knowledgeable in the domain and the problem that's interesting to them. You know, that old adage: look for people who are concerned about a problem and let them do things. Which gets to another thing that we do a poor job of.
We've got to get more diversity in these teams. You know, we need to hire for more diversity, and we need to assemble teams with more diversity in mind. Because when you assemble a team, you're usually looking for these T-shaped skills and for people with lots of experience, and that's not all you want. What you really want is people with a wide range of skill sets and a wide range of backgrounds. They make better problem solvers. There are some cool studies about that, some of which I referenced in the book. Having a more diverse team makes it easier for you to solve problems you don't already know the answer to. And we need to get better at that. In terms
[00:35:10] Unknown:
of the kind of scoped problem of building a data product, you know, in the book you focus on the use case of a small team, working in a smaller group to conceptualize, iterate on, build, and produce these data products. And I'm wondering what you see as the core practices that are necessary in that setting, and some of the ways that those practices change or mutate, or when you start needing to bring in other concepts or specializations, as that team starts to scale and can no longer be considered small. And maybe speak to what that tipping point happens to be, whether it's actually in terms of quantity of people or complexity of problem,
[00:35:56] Unknown:
etcetera? Oh, boy. I think the answer to that depends. I mean, the short answer is pretty straightforward: two-pizza teams. Right? You know, no more than 5 to 8 people. Keep the number of people odd so that you can have a tiebreaker when you have to decide something, all of that kind of stuff. That's the good short answer. But, you know, the reality is you can only hold so much stuff in your head at once. It's difficult to think about a wide-ranging problem with any kind of complexity by yourself. So you need other people to help you do that. But the more people you have, the more the problem that people think they're working on is gonna diverge, and that's okay. Maybe you need to split that team off. And, again, this is a centralized versus decentralized thing. This is one of the big drivers behind the, I guess, philosophy of data mesh. You know, same thing with microservices. I mean, I think about data mesh as being microservices for data.
And sometimes that's probably the most appropriate way to solve the problem. Other times, it's not. One area where I think it is the most appropriate way to solve the problem is when you're in exploration. When you're just exploring the problem, you've got a couple of people who are dedicated to solving a particular problem. Give them an exploration platform that lets them do their job as efficiently as possible. And if it doesn't work, you package it up and you put it on the shelf. If it works, you probably will need to completely change the way that thing is implemented.
And at that point, you're gonna need to bring in specialists to help you do things. You're gonna need to bring in people with deeper technical experience than the team probably has. And then that's where you start to figure out what's the most appropriate team structure because we're doing something different. We're growing.
[00:37:51] Unknown:
As far as the overall state of the data ecosystem, we mentioned earlier the modern data stack, and there have been various evolutions of ETL and ELT tools. And every few months, or every couple of years, there's some new product category that people are exploring. You know, some of the recent ones are data catalogs and now data quality and data observability. With all of that activity, a lot of this is stuff that we've had in some variety or another for years; it's maybe just repackaged with a particular focus. And I'm wondering what you see as the challenges, particularly for business people who aren't steeped in this every day, in extracting useful signal from all of the noise.
And is there anything in all of this funding-laden hype around these different elements of the data ecosystem that is actually truly new and innovative, and not just revisiting the same concepts with a shinier brand?
[00:39:01] Unknown:
The first thing is, again, you've got to understand what problem you're trying to solve. Why do you bring in a data catalog? First off, you need to realize that we've always had data catalog efforts. You know, for 20 years there's been a push, every 5 or 10 years, for new data catalogs. Well, why is that? Because we don't have any visibility into the data. We start collecting the data, we don't manage the data well, and it's difficult to expose it because we lock it away in databases and we don't let people get access to those databases. So it's a legitimate business need. But the problem is, again, I'll go back to one of the central themes.
Why on earth is your data system so large and so complicated that you need to put a data catalog over everything? And how long is it going to take you to deploy that data catalog, to catalog all of your assets and make them visible? Your system's so big that it might be better to distribute that work out and have the teams that really understand it figure out: okay, this is valuable, this is not valuable. Let's figure out how to prune the system, you know, get rid of that technological debt. It's difficult for business people who are making these buying decisions to focus on that, because they get sold on, oh, we need to have a catalog, and we need to have APIs on everything, and we need to do this, and we need to hire a team of 12 data scientists to do machine learning models. And we don't stop to think about why, and do we really need to do that right now, and more importantly, are we even prepared to do that? What would we do with that? I mean, in manufacturing, you see this all the time. You show someone who's working on a production line a statistical process control chart, an X-bar and R chart, and they're gonna look at you like a deer in headlights.
You have to go back and explain variation to them. All they need to know is, am I doing okay today? The equivalent would be a dashboard. That's probably where you need to be. And leave the fancy stuff for after the training you provide them to understand why, you know, this is more valuable. Why is a scatter plot more valuable than a bar chart, let alone a pie chart? We're pretty immature. And as technologists, we need to step back and ask: are people ready for the level of maturity that I'm pitching here? Chances are, probably not. So what do you do?
[00:41:34] Unknown:
And so, in terms of the work of building data products, bringing business people into the planning and execution of that, and bringing engineering teams into the business problems, what do you see as the biggest risks to those engineering and data teams as they start to embark on these products, or start to evolve their capabilities to work more closely with the business needs?
[00:42:07] Unknown:
We don't get alignment on the problem. We don't clearly articulate what the problem is and have a shared understanding of it. Take the time to get a shared language around the problem; that's really the core. And again, we don't understand the business, and the next thing that flows from that is we don't understand the business problem. We think too big too quickly. And this goes for both sides of the fence, the business folks and the technical folks. We want to jump to a conclusion. We want to pay for some product to magically take our problems away. That's where we first jump. We need to slow down a little bit and think about what we are really trying to accomplish.
That's just always a big risk. You know, it's always people and process problems. And then I do think there's a problem with trying to collect all the data, you know, for folks in the data warehousing space. There's a famous quote from the CTO of General Motors, who had been the CTO of Hewlett-Packard, saying, I wanna know everything about everything. Like, that is the most insanely crazy thing you could ever say. You're just going to build a big pile of technical debt that no one will be able to use.
[00:43:24] Unknown:
Absolutely. And so in terms of the work that you've been doing on the book and some of the kind of core lessons that you've put in there, what are some of the most interesting or innovative or unexpected ways that you've seen those ideas applied or
[00:43:39] Unknown:
some of the useful or interesting feedback you've gotten on the book now that it's out? So my wife works for an educational publisher. You know what the educational market looks like right now; it's just exploding. And they're trying to deliver their own new products. As she was explaining what's going on at work, it was like, oh, let me tell you about Explore, Expand, and Extract. And I got invited to come and do what I affectionately call a mansplaining lunch and learn about their transition, because where they were at, they were exiting out of Explore and going into Expand.
And that piece there, you know, is the crossing-the-chasm spot from that famous book. I call it the crappy chasm of doom, because it's really horrible to make that inflection point. Nothing works. You know, everybody's working way too much overtime, and there's no sign that anything's ever going to get better. And one of the big messages was really nice to deliver, because I've been thinking about this; I've been applying it to other places where I've worked. I've been through this transition point a lot, and it's always horrible.
And it was nice to explain to non-IT people that, you know, it's going to be okay. You just need to give it a few more months, and you will be past this, and you'll be back to normal again. That was really satisfying, because these concepts are really universal. Anytime you try to build things, the worst thing that can happen to you is that people actually like it and it takes off. Nobody's prepared for that. That was a nice thing that I got out of the book.
[00:45:20] Unknown:
In terms of your experience of writing the book and sort of formulating the ideas behind it and the core kind of mission of it, what has been the most interesting or unexpected or challenging lessons that you've learned in the
[00:45:40] Unknown:
process? Trying to teach technical development and throw in a whole bunch of business theory at the same time was just ridiculous. But I couldn't think of another way to do it. You know, if you learn how to use a particular set of tools, you still don't know what to use them for; and if you know where to point a tool, you still have to know how to use it. So I tried to walk a fine line between both of those difficult situations, and it was hard. I'm not really sure I did a great job of it, but I did the best job I could do. The other thing that was really good is I learned that I had about 20 years of frustration locked in my head, and getting it out on paper was very cathartic. Like, I can sleep at night. I don't think about data problems at night now, and I was before.
They were keeping me up at night. There was a lot of imposter syndrome, which I encountered and still have, so I think everybody who puts themselves out there really suffers from that. Yeah, those were really the most valuable pieces of learning that I got out of doing this. It was a lot harder than I thought. It was one of the hardest things I've ever done, and to crank out basically a textbook in 8 months was nuts. That was ridiculous to try to do.
[00:46:57] Unknown:
Well, now that you've done it, what's next?
[00:47:00] Unknown:
You know, I don't know. When I handed in my resignation, I had been planning it for a while, you know, saving up my money for a while. So right now, I'm just looking for new opportunities. I don't know that I have another book in me right away. I'm just trying to think about how to solve these problems. So if anybody else is interested in these kinds of problems, feel free to get in touch with me. I'd love to talk to you about it. It's something I'm really passionate about, and have been for a long time, and I don't really know what's gonna come next.
[00:47:34] Unknown:
Alright. Well, for anybody who does wanna get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:47:49] Unknown:
I think there's a lot of potential in data mesh, both for good and for total disaster. We had the same thing with microservices. Go look at how microservices panned out, learn from that, then start applying it to data, and hopefully make fewer mistakes. That would be one thing. The second thing is, man, I wish there was something like SQL for visualization that didn't require you to write Python code. Because if you hand a page of Python code to someone who's not a developer, their eyes glaze over, and it's terrifying to them. We need something that's simpler. In the book, I used Vega-Lite, and that was fantastic.
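For readers unfamiliar with it, a Vega-Lite chart is a small declarative JSON spec, closer in spirit to SQL than to imperative plotting code. A minimal sketch of a scatter plot; the data values and field names here are illustrative, not from the book:

```json
{
  "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
  "description": "Defects vs. units produced, as a scatter plot.",
  "data": {
    "values": [
      {"units": 12, "defects": 1},
      {"units": 30, "defects": 4},
      {"units": 55, "defects": 3}
    ]
  },
  "mark": "point",
  "encoding": {
    "x": {"field": "units", "type": "quantitative"},
    "y": {"field": "defects", "type": "quantitative"}
  }
}
```

You say *what* to plot (mark and encodings) rather than *how* to draw it, which is the "SQL for visualization" quality being described.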
But it's pretty charts without a whole lot of interactivity. In some places, that's exactly what you want; in other places, not. So I would like to see something that's as easy for people to grok as SQL, but for visualizations. Actually, I may have a suggestion
[00:48:45] Unknown:
for something to take a look at. I haven't found one yet. Alright. Well, thank you very much for taking the time today to join me and share the work that you've been doing on writing this book and some of the thoughts and experiences that went into it. It's definitely a very interesting problem space, and it's great to see people trying to bring business people more into the fold of working with data, and helping to educate engineers on the business concepts and requirements that go into what they're actually trying to build. So I appreciate all the time and energy you have put into that, and I hope you enjoy the rest of your day. Great. Thank you. I really appreciate the opportunity. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction to Brian McMillan and His Career
Motivation Behind Writing the Book
Challenges in Enterprise Data Management
Target Audience and Goals of the Book
Bridging the Gap Between Business and Technology
Core Practices for Building Data Products
Modern Data Stack and Industry Trends
Risks and Challenges in Data Projects
Feedback and Application of Book's Concepts
Lessons Learned from Writing the Book
Future Plans and Opportunities
Biggest Gaps in Data Management Tools