Summary
Building data products is complicated by the fact that there are so many different stakeholders with competing goals and priorities. It is also challenging because of the number of roles and capabilities that are necessary to go from idea to delivery. Organizations have tried a multitude of strategies to improve the success rate of their data teams, with varying results. In this episode Jesse Anderson shares the lessons that he has learned while working with dozens of businesses across industries to determine the team structures and communication styles that have generated the best results. If you are struggling to deliver value from big data, or just starting down the path of building the organizational capacity to turn raw information into valuable products, then this is a conversation that you don’t want to miss.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Are you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy data analytics in the cloud. Our comprehensive data-level security, auditing and de-identification features eliminate the need for time-consuming manual processes and our focus on data and compliance team collaboration empowers you to deliver quick and valuable data analytics on the most sensitive data to unlock the full potential of your cloud data platforms. Learn how we streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta.
- Today’s episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. Datadog uses machine-learning based algorithms to detect errors and anomalies across your entire stack—which reduces the time it takes to detect and address outages and helps promote collaboration between Data Engineering, Operations, and the rest of the company. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial. If you start a trial and install Datadog’s agent, Datadog will send you a free T-shirt.
- Your host is Tobias Macey and today I’m interviewing Jesse Anderson about best practices for organizing and managing data teams
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by giving an overview of how you view the mission and responsibilities of a data team?
- What are the critical elements of a successful data team?
- Beyond the core pillars of data science, data engineering, and operations, what other specialized roles do you find helpful for larger or more sophisticated teams?
- For organizations that have "small data", how does that change the necessary composition of roles for successful data projects?
- What are the signs and symptoms that point to the need for a dedicated team that focuses on data?
- With data scientists and data engineers in particular being in such high demand, what are strategies that you have found effective for attracting new talent?
- In the case where you have engineers on staff, how do you identify internal talent that can be trained into these specialized roles?
- Another challenge that organizations face in dealing with data is how the team is organized. What are your thoughts on effective strategies for how to structure the communication and reporting structures of data teams? (e.g. centralized, embedded, etc.)
- How do you recommend evaluating potential candidates for each of the necessary roles?
- What are your thoughts on when to hire an outside consultant, vs building internal capacity?
- For managers who are responsible for data teams, how much understanding of data and analytics do they need to be effective?
- How do you define success or measure performance of a team focused on working with data?
- What are some of the anti-patterns that you have seen in managers who oversee data professionals?
- What are some of the most interesting, unexpected, or challenging lessons that you have learned in the process of helping organizations and individuals achieve success in data and analytics?
- What advice or additional resources do you have for anyone who is interested in learning more about how to build and grow a successful data team?
Contact Info
- Website
- @jessetanderson on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- Data Teams Book
- DBA == Database Administrator
- ML Engineer
- DataOps
- Three Vs
- The Ultimate Guide To Switching Careers To Big Data
- S-1 Report
- Jesse Anderson’s Youtube Channel
- Uber Data Infrastructure Progression Blog Post
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management.
[00:00:18] Unknown:
When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
Are you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy analytics in the cloud. Their comprehensive data level security, auditing, and deidentification features eliminate the need for time consuming manual processes, and their focus on data and compliance team collaboration empowers you to deliver quick and valuable analytics on the most sensitive data to unlock the full potential of your cloud data platforms. Learn how they streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta.
That's I-M-M-U-T-A. Your host is Tobias Macey. And today, I'm interviewing Jesse Anderson about best practices for organizing and managing data teams. So Jesse, can you start by introducing yourself?
[00:01:55] Unknown:
Hi. My name is Jesse Anderson. As you mentioned, I have spent the past several years not just on the technical side of things, but on the management side of data teams. And that's probably a large amount of what we'll be talking about today.
[00:02:09] Unknown:
Do you remember how you first got involved in working in the area of data?
[00:02:13] Unknown:
I've always had an interest in data. I've always been a curious person. And I think curious people like to read, and they like to have their decisions based on data, or actually see what's happening. So throughout my career, I've always been curious about, okay, what is that number? What does that count? What is really happening? Was what's happening really backed up, or was the CEO saying one thing and the numbers and data saying another? I've always been curious about that.
[00:02:42] Unknown:
One of the things that you've done most recently is publish a book entirely focused on building effective data teams, with a particular focus on big data. And I'm wondering if you can just start by giving an overview of how you view the mission and responsibilities of that data team.
[00:02:59] Unknown:
Sure. Just to give a one-sentence definition, data teams are responsible for creating data products. And then expanding out from that, there are the different teams. So the 3 teams are data science, data engineering, and operations. And as part of that mission of creating data products, each one of those teams does something specific. Data engineering, for example, is creating a data product that's usable, usable by the rest of the organization, that's scalable, using the right infrastructure. The operations team is giving the operational excellence that's necessary.
If we're creating data products and they aren't up and running and usable, then what good are the data products? And finally, the data science team, I see them as kind of the cherry on top. They're creating the advanced analytics that make the highest amount of value possible for us. One thing I like to tell people is that none of these teams is the most important team. They're all important to that value creation cycle. So having one of them missing actually creates various problems in creating data products. So we really have to focus, and we really have to say, I need all 3 to create the highest level of value for ourselves.
[00:04:22] Unknown:
Within the context of data teams, there are those 3 core pillars that you mentioned. But in reviewing your book a bit, you also called out the fact that there are cases where you might need other specialized roles depending on the particular focus of the data product or the level of sophistication of the organization. And I'm wondering if you can give some color as to what form those other roles might take and where they might live within that constellation of data science, data engineering, and operations?
[00:04:54] Unknown:
As far as the operations team, usually that team is mostly, if not entirely, made up of operations engineers, whatever your organization calls them, whether that's SRE or operations. The operations team is generally made up wholly of that role. Then on the data engineering team, that's where things get more cross functional, as it were, depending on the organization. So on data engineering, part of your data product creation may not just be, here's a file in S3, for example. For some organizations, the purview of the data engineering team extends to the visualization of that data, the exposing of that data.
And as a direct result, we may need other functions or titles represented there. So the data engineering team is primarily going to be made up of data engineers. And my definition of a data engineer is a software engineer, once again, a software engineer who has specialized their skills in big data. So that data engineering team will be mostly made up of people with that software engineering background and an understanding of the big data tools. However, for some parts of that, we're actually going to need other specialties. We may have to do a visualization, a real time dashboard, for example.
We may have to expose that data product to the rest of the organization. So in those senses, we may actually have front end engineers on our data engineering team. And with the companies I've mentored and consulted with, we've actually done that. We've either hired or put a front end engineer on the data engineering team because that was a key part of what that data engineering team was doing. Yes. They were creating the data products, but they were also responsible for creating the graphical representation or the representation to the business users of that data.
Data engineers, I don't expect them to be great UI people. The vast majority of software engineers aren't great user interface people. So, yes, having a front end engineer on the team will be quite helpful. Other people that we may have on that team might be a DBA. And this is an issue that you'll have to keep an eye on from a management point of view. DBAs will help out the data engineering team on things such as schema, making sure that we're laying out our schema properly and evolving it properly. And they may help out on some of the more SQL-oriented parts of the problems.
However, having a data engineering team made up entirely of DBAs is a problem unto itself. I talk more about that in the book if you want to read more about it. Then we have other people who are on the data science team. So on the data science team, for some organizations, that front end engineer that I mentioned may be on the data science team, depending on how they're trying to represent that data and who's really responsible for representing that data to the business, as it were. However, the data science team is mostly going to be data scientists.
There are 2 other kind of newer manifestations of positions, and one of those is machine learning engineer. So machine learning engineer, I like to think of it, or to tell people, is what people originally thought data scientists were. Put a different way, my definition of a data scientist is someone who has taken their mathematical or statistical background and learned how to program. Now the issue with most people's perception of what a data scientist is is that they thought a data scientist is a master of programming or software engineering as well as a master of the mathematical and statistical side.
And the vast majority of people who have the title of data scientist are strong on the math side and are comparatively weak on the software engineering side. That means that the systems that data scientists often create have large amounts of technical debt due to the missing software engineering background. So that brings up the issue of who is responsible for creating really good systems for machine learning, not just for the models, but who's supposed to go through and stand this up, make sure it runs right. And that brings up the machine learning engineer. And that machine learning engineer is what I see as the in-between for when companies really see the difference, or the lack of technical skill, on the data scientist side.
And then the data engineers have that technical skill, but the inability to either understand models or deal with models, so you may need that machine learning engineer. Coming back to that definition, or what people originally thought, this is what most business leaders originally thought they had with their data scientists. They thought, okay, they can handle all of the software engineering side. Well, the reality is that you still need data engineers and you still need machine learning engineers. Then there's a second one, and that's much more of a team structure than it is a person per se. And that's DataOps.
And DataOps is there to deal with some of the friction that we're seeing in organizations. It may be published by the time this podcast comes out, but I did a survey for data teams, and I asked questions about where people have the most problems. And by far, the biggest problem was around friction. They said that their difficulty was around friction. So how do we deal with friction organizationally? Usually, that was trying to take the resources of 2 different teams, usually data science and data engineering, and say, how do we coordinate that? How do we apply some kind of fairness algorithm between data engineering resources and data science resources?
And there may not be a way to do a true fairness algorithm or organizational structure. What we may need to do is break them up and make them cross functional teams. And that's what DataOps is. As talked about in the book, we break the teams up so that data engineers and data scientists, and potentially operations and product people, are all on the same teams. And by putting them all on the same teams, it removes some of that contention and friction so that we are consistently working with the business and we do have all the resources on that team. And then I have some interviews in the book where people are practicing DataOps on a day to day basis and describe how DataOps actually helped them remove that friction.
[00:12:07] Unknown:
One of the other things worth calling out is this overall concept of big data, where a lot of people might have their own intuitive understanding of what that means, or they might have a particular definition that they go by. And one of the most common ones that I've seen is the idea of the 3 V's. In the beginning of your book, you call out your own particular way of determining whether or not somebody has big data. Wondering if you can talk through what your definition is and how you arrived at that.
[00:12:36] Unknown:
My definition is "can't." I've always had that definition. My definition of can't means that you, as a manager, walk up to your, let's say, BI team, or you walk up to one of your analysts and say, hey, can you run a year over year calculation on gross sales? And they say, no, I can't do that. It's going to take too long. So what we have is a can't that is based on a technical limitation. The can't that the data analyst told you, it wasn't, I don't have the skills to do that. It was, I was told by the data warehouse team or the operations team that if I were to run that sort of query on the database, they would revoke my rights.
It would just take far too many resources on the database. It would bring the production database down, things like that. It's always based on a technical limitation. When I work with teams, this is the definition we look at. We look at can'ts. And for those of you who are listening, this is exactly what you should be seeking out. If you don't have a can't, you may not have big data limitations. However, if you're getting those can'ts, and this is usually what my clients are hitting, they're hitting a can't where they can't do the report or the analytic because it's going to take too long, or it's going to exhaust the memory of their in-memory database, for example, problems like that.
The reason why I prefer this can't definition over the 3 V's is that it's easier for managers to understand, in my opinion. It's more of a manifestation. It's more of, here, I can point to this and say I have can'ts or I don't. The 3 V's sort of definition lent itself to vendors saying, hey, my product is big data now. Well, you didn't change your product. Your product isn't big data. You just have better marketing saying it's big data. However, if you do have this can't and you have technologies that solve that can't, then you have both a big data problem and technologies that solve big data problems.
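To make the "can't" concrete, here is a minimal sketch of the kind of year over year gross sales calculation described above, written with PySpark, the sort of engine a data engineering team might stand up once the production database can no longer handle the query. The table layout, column names, and storage path are hypothetical and not from the episode.

```python
# A hypothetical illustration of a "can't": a year-over-year gross sales report
# that is too heavy for the production database, run instead on Spark.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("yoy-gross-sales").getOrCreate()

# Assumed layout: an 'orders' dataset with order_date and gross_amount columns.
orders = spark.read.parquet("s3://example-bucket/orders/")

# Total gross sales per calendar year.
yearly = (
    orders
    .groupBy(F.year("order_date").alias("year"))
    .agg(F.sum("gross_amount").alias("gross_sales"))
)

# Shift each year forward by one so it can be joined on as the "previous year".
previous = yearly.select(
    (F.col("year") + 1).alias("year"),
    F.col("gross_sales").alias("prev_gross_sales"),
)

# Year-over-year growth relative to the prior year's gross sales.
yoy = (
    yearly.join(previous, on="year", how="left")
    .withColumn(
        "yoy_growth",
        (F.col("gross_sales") - F.col("prev_gross_sales")) / F.col("prev_gross_sales"),
    )
)

yoy.orderBy("year").show()
```

On a small dataset this is a trivial query; the point of the can't test is that, at the volumes Jesse describes, the same calculation would be refused or would take the production system down.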
[00:14:44] Unknown:
For organizations that are maybe on the border of a can't, where they are able to fulfill the needs that they have but are looking to derive new types of products from the information that they're storing, how does that impact the overall composition of the team structure needed to realize those data products when they are working with quote unquote small data?
[00:15:09] Unknown:
I believe that in between small and big data there's a term I coined: medium data. And to be honest, I think that more organizations are in that medium data phase, where they're too big for small data, but big data sounds too big and too unwieldy. They don't have petabytes of data. They only have terabytes, let's say. And so for those sorts of organizations, yeah, if you're hitting those can'ts, or sometimes we look at it from more of a perspective of, this year you don't have can'ts, but next year, will you? And as we rewind that back, we say, okay, if next year you are going to have can'ts, well, in order for us to hit that time period, we're going to have to start now. These projects aren't, hey, let's bust this out in a month. These are 6 month projects. These are year long projects.
So if we do it right, we'll avoid those can'ts altogether. So organizationally, what needs to happen in those, let's say, small, medium, or even big data teams, I talk a little bit about this in the book. For small data teams, the same sorts of things apply. The same sorts of organizational structures apply. However, your complications will be vastly lower. Count your blessings, quite honestly, if you don't have to deal with these big data technologies. As much as you have them on your show and you talk about them, the small data technologies are far more plentiful. They're usually far better engineered, easier, and more people know them. There's a whole host of reasons why you'd want to avoid this.
But as you get into that medium and big data side, yes, it becomes even more important to have the right people. Usually, the biggest manifestation isn't so much on your data science side when going from small data to medium or big data. If the data engineers have done their job right, the manifestation to the data scientists should be, not negligible, but manageable. The biggest manifestation will be on your ops and your data engineering teams, where your ops team will have to learn these brand new technologies and how to operate them correctly. Your data engineers, if they're coming from backgrounds where they didn't have distributed systems, will have to learn these new distributed systems.
And that can be cognitively very difficult for them. I say that having taught many software engineers these skills; not everybody has the ability to do this, quite frankly. And it is difficult. It can take 3 to 6 months. So organizationally, what managers should be looking at as they make that cutover is: is the team ready? Some teams may have been able to scrape by at small data. So this is an honest look that a manager should take at the team, asking, are we just scraping by on our success with the small data stuff? If we go to big data, with its added complexity, hey, maybe the team really can't make it.
Or maybe you're thinking, yeah, the team's done really well with the small data side, and they do have some background in distributed systems. You have much better odds that way, and they can make the jump. But, really, what I encourage is an honest look from management, because it isn't fair, quite honestly. It isn't fair for the manager, the company, or the team to put them in a situation where they're set up for failure.
[00:18:49] Unknown:
For those cases where you are trying to either establish a new team with new talent or identify people who are already in your organization who can be leveled up into these types of roles, particularly for that internal use case. What are some strategies that you have found to be effective for identifying what the capabilities are for the internal talent and identifying people who might be a good fit for moving into these types of roles?
[00:19:16] Unknown:
What we've done with previous clients is we've actually leveraged one of the books I wrote, called The Ultimate Guide to Switching Careers to Big Data. As opposed to this Data Teams book that we've been talking about, which is for management, the Ultimate Guide, the switching careers book, is much more focused on individual contributors. And I wrote that book to say, here is the lowdown on making that switch. This isn't me trying to sell you something. This is really me trying to say, I would rather you spend, let's say, 2 or 3 hours reading a book and decide, no, I don't want to do this, rather than reading hype or watching a bunch of YouTube videos saying how easy it is. I'd rather the book say, hey, this is what it is. This is how difficult it is. And so what we've done is we've leveraged that book within the companies I've mentored. We've sent it out to all their software engineers and all their operations engineers to say, read this book, raise your hand if you want to be part of this, and we start forming the data engineering team from there. We get volunteers rather than conscripts.
And that isn't to say there's anything bad with conscription. It's more to say that I would rather have somebody volunteer for something knowing that it's going to be a difficult road ahead. And I will say that it is difficult. Even the people who raise their hand and say, I wanna do this, they may not all make it. They may not all be ready for it, and that's okay. But at least they've been given a good go of it, and they've been given the opportunity to say yes or no. I have seen other times when companies or teams are sent on death marches. And usually these death marches are started for a few reasons.
One is that they're started without any resources, whether the team members are conscripts or volunteers. The company says, here's a cluster, knock yourselves out. And that really isn't fair to those teams either. They should have been given some level of resources, some level of help, some level of training. Otherwise, they just won't be successful. So it's really key, and I say this when I mentor teams: hey, we're not going to just put you all on a team and then walk away. That's how you set people up for failure.
I'm going to be continuing to work with you. We'll make sure that we do the right architecture. I'm there for questions. One thing I will say is that as people first get started and get stuck, it is really difficult. They may not even know what the right question is to ask. So if you're management or a team lead on these projects, know this: getting stuck is how I've seen some teams just go nowhere. They get stuck, don't know the questions to ask, can't find the right resources to do this. And those sorts of teams, they don't right themselves.
They don't figure it out on their own. What usually happens is that they just spend months getting stuck and don't really go anywhere. So be really mindful of that. Keep an eye on the team. Keep an eye on the team's progress, their velocity, so that they're not getting stuck. But overall, we want to make sure that we have the right people. And related to the question of having volunteers, you should vet those volunteers as well. That's what we do. When I mentor a team, we make sure that the people are actually potentially viable members of the team. For example, if somebody is only doing batch scripting and wants to be part of the data engineering team, we need to validate that they could learn Java, for example, and that they can start dealing with distributed systems. Otherwise, it's a nonstarter for both sides.
[00:23:13] Unknown:
Beyond the internal use case, when you decide that you need to bring on external talent, that can be challenging because the current market for data scientists and data engineers in particular is very tight, and a lot of these types of roles require some level of seniority in software engineering before you make the transition, as you said, to being a big data engineer or being a data scientist or machine learning engineer. So I'm wondering what you have found to be useful strategies for attracting that type of talent given the tightness in the market right now.
[00:23:46] Unknown:
I think it's a mix of things that helps you do that. One is showing people that you are setting them up for success, that you are resourcing the team, that you are using not just cutting edge technologies for the sake of doing cutting edge technologies, but using cutting edge technologies the way they should be used. I would say, first and foremost, smart people who are experienced want to be around other people who are smart, and they're going to gain experience from being on that team. One thing I would encourage managers to think about is that this isn't a one way street. This isn't, if we pull one over on these people, they'll join our team.
I've seen that happen. And what usually happens is, if you hornswoggle them, they'll just leave. They'll just quit after a few months. And then you're back to square one of trying to find a data engineer again. I think it's about upfront honesty. We could also talk about alignment on core values. The companies that I've worked with, when we work together, we make sure that we align on the core values for the company. And then as we look for people to join the team from outside, we make sure that they align on core values as well. And finding that alignment on core values will mean that you find the right people who will stick around and do the right things.
So overall, you do need to make sure that you have a good fertile garden for people. They'll peer in during your interview process and see, yes, that is the place I want to
[00:25:28] Unknown:
work. And once you do have a team established and you have the necessary roles in place, how do you tend to recommend folks organize those individuals and teams in terms of the communication and reporting patterns that you have found to be most effective? I've seen conversations debating the utility of having a centralized team versus having embedded data scientists, with the data engineers and operations engineers as the platform team, or various combinations of different reporting structures to figure out how that fits within the broader organization.
[00:26:11] Unknown:
It is an important thing to think about. And I've seen this happen several different ways and with several different levels of success. To go back to that point I was making about DataOps, the issue with DataOps is that it's not an initial team structure that you might have, in my opinion. I think it's a later, more advanced team structure. And the reason I point that out for this question is it goes to asking, do you put a group of people together initially and then separate them out? Or do you separate them out and hope for the best? In my experience, the companies that separate their teams out initially never create a cohesive set of best practices.
Put a different way, team A has a data engineer, team B has a data engineer, team C has a data engineer. Well, the issue with that sort of thing is that each one of those data engineers is going to do their own thing. They're going to use different technologies. They're going to use different infrastructure. They may not even use the same cluster, for example. So you never get any kind of economy of scale, either operationally, cost wise, or usage wise. You just never get that. But I would say that the worst part is that you never get any best practices. You never get homogeneous best practices. Just to give a trivial example, one team is using spaces versus tabs. Obviously a trivial thing, but it gives you an example of how their code would be different unless there was an overall coding standard, for example.
Now as we look at the rest of the organization and what could happen, and I've seen this firsthand, on teams A and B, the data engineers meet the definition of a data engineer, where they're software engineers, and they've done that right. However, team C didn't understand that there was a difference in data engineers. If the listeners don't know, there are 2 accepted definitions of data engineer. One accepted definition is more of a SQL focused person, more of a DBA. And the other accepted definition is a software engineer who has specialized in big data.
In my book, I talk about that second definition, the software engineer with big data skills. However, maybe team C Googled around and said, oh, they have a data engineer. What is that data engineer? Let's hire them. Well, they may get that SQL focused person. And that SQL focused person won't be able to do the same level of data engineering. They may not be able to handle the same level of complexity. Or worse yet, they may create a mountain of technical debt for you. And that makes it so that team C is now this backwater bastion of technical debt, whereas teams A and B are creating some level of value.
This is coming back to that centralization. Well, if we had a centralized data engineering team, we would have had similar or sane hiring practices. And there would have been a team that would have said, no, this person doesn't meet our definition. This person can't handle the scale or doesn't know how to program, for example, for various reasons. If we don't have that centralized team, we can't set up the best practices, not just for coding and software engineering style things, but for infrastructure usage and even just for definitions of teams and who that person is.
So that brings up the question of when or how you should do that. What I've found often is that the reason the business side wants a data engineer on their team is that your data engineering team isn't giving enough love or enough attention to their organization, what have you. So that could be one manifestation of the problem. And this is a consistent issue as I've talked to the business leaders who've hired these people. Why did you hire them? It's because we couldn't get any time from the data engineering team. So look at that; that could be a whole issue that management is creating for itself.
Then there is the case where it's just the nature of the business: their domain is unique enough. So what we can do is establish that core team. And that's usually what I recommend doing. We establish that core team, whether that's data engineering or data science. And then we can look at how to start engaging with the rest of the organization. I believe that if you engage with the rest of the organization right, that core team can negate the need, or that desire from the business, to have their own data engineers. That said, it may not always be possible. So what do we do then? Well, the business units could hire onto their own teams. However, those team members have to actually be interviewed by the central data engineering team, for example.
Another route is a kind of tour of duty, where one member of the central team is on a, let's say, 6 month stint, reassigned from that central data engineering team to that business unit. So they do a tour of duty of 6 months, for example. Other possibilities that I've seen be successful are dotted line arrangements, where, for example, a data scientist is still a member of the central data science team, but their dotted line assignment is to that business unit. So they're still within that central team, perhaps even still sitting together, but their day to day assignment, their scrum, as it were, would be part of that other team.
What I think is key as you start to look at these is that there's a point where the cat gets out of the bag and there's no going back. It's a Pandora's box. As other business units start to hire those data engineers and data scientists, if those people aren't part of a centralized team or a centralized group, then they'll never really attain those best practices: infrastructure wise, code wise, team wise, organization wise. And the people in those satellite teams will always be in this backwater, where they'll never be able to get to the same level of value that the other teams do.
[00:32:51] Unknown:
And there are a couple of really interesting things to pull out there. But before we go down the road of hiring and evaluation for these types of roles, I'm interested in digging into the metrics and measurement of these types of teams, and particularly the return on investment: determining how much value they're producing versus the cost of the infrastructure, the salaries, and training, and how you have seen companies determine how to budget for and allocate spend, time, and focus for these teams that are focused on delivering data products, and identifying what those data products should even be in the first place.
[00:33:34] Unknown:
The way I recommend doing that is really involving the business. And throughout this podcast, I haven't really talked about how we involve the business. Basically, in the book, after I introduce the teams, it's all about interactions and the business after that. Because the things that we create as data products have to be business worthy. They have to be valuable for the business. As we evaluate that ROI, we look at it and we say, is this what the business wanted? And more importantly, as you're, for example, starting a data team, you should have included the business in that conversation.
But you should be asking questions like, what are your can'ts? And more importantly, what is the value of those can'ts? Put a different way: if I could take your can't and make it a can, what would that do? Would that save €1,000,000 or $1,000,000? Would that make $10,000,000 or €10,000,000, for example? By clearly establishing the business value of moving those from can't to can, we can start to look at, okay, our data team could save the company €50,000,000, for example. From there, we can make a better business case for the training costs and infrastructure for our data teams. This is a consistent thing that we as technologists don't do.
We think in terms of technologies. For example, the majority of data teams might say, well, we're using Spark. And if we just stand up Spark, and if we just do Kafka, and we do this with Kafka. Well, Kafka is great and all, but it's not going to make your S-1 report for your company. They don't care about that. They care about how much money you made and how much money you saved and how much time you saved, those sorts of things. So as we talk about the value created, we're able to point back to: we talked to the business, the business said this, and now if we can just do whatever that is, we can achieve this level of value.
And then that conversation becomes much clearer and much cleaner: oh, we're actually leaving €20,000,000 on the table. For those of you who are in management, or even those aspiring to management, or engineers, this is how you talk to business people. Talk to them about how much they're either going to make or going to lose. And then checkbooks start opening. Telling them about Spark, checkbooks don't start opening for that. So as we talk about the amounts and the budgets for this, we get a much clearer path to budgets and money.
One thing I will say is sometimes teams will get a large budget. And instead of scoping that out and scaling that out over time, what they'll do is hire 10 people. And those 10 people will be idle initially. Those could be a mix of data scientists and data engineers. So what you want to do as you start hiring people is roll out gradually, gradually increasing the size of your teams. From there, you can start to look at the value created and the actual amount of how much we need. One common issue with hiring is thinking about how much people should be paid.
And this is where, for those of you who are in larger organizations, or perhaps even cheap organizations, let's be honest, this is going to be an issue: data engineers aren't cheap. And as we just talked about, there's high demand. So we have high demand and low supply. And in those sorts of economics, salaries go up for data engineers, because there's a low supply of them and companies want them. You realize this, you as the manager, you as the team lead, and then you go to HR, who says, I looked up the salary for data engineer, and it's 40,000 US, let's say, or 50,000 US.
Then you have to explain to them, no, you looked up the salary of the data engineer who's SQL focused. This isn't the person we're looking for. We're looking for somebody who is a software engineer specialized in big data. That sub, sub, sub specialization is really what gets their salaries up much higher, where we're looking at salaries that are above an average software engineer. This talk about salaries comes back to that original question of how we get people in. Well, these data engineers aren't money hungry.
But what they are is realists, and they're seeing: somebody gave me an offer for, let's say, $100,000 or €100,000, and here you are trying to offer me 40,000. There's no way in hell that they're going to accept that. So this is where managers need to work with HR. They may need to work on their pay bands. They may even need to establish the title of data engineer. I've had companies where they don't have the title of data scientist, though that's becoming more common, or they won't have the title of data engineer. And so their HR will try to do pay bands based on that. They'll try to use a software engineer's pay band, which isn't high enough either. Or for the data scientist, they're not money hungry either, but they're still looking at significant amounts of specialization.
And the companies will try to give them a data analyst salary. That's not going to work either. This is definitely some legwork that I'd recommend management do ahead of time.
[00:39:35] Unknown:
Today's episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS based monitoring and analytics platform for cloud scale infrastructure, applications, logs, and more. Datadog uses machine learning based algorithms to detect errors and anomalies across your entire stack, which reduces the time it takes to detect and address outages and helps promote collaboration between data engineering, operations, and the rest of the company. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial. And if you start a trial and install Datadog's agent, they'll send you a free t-shirt.
And another interesting aspect of this hiring question is the order in which to bring on these new capabilities. Do you hire all of the elements of the data team at the same time, where you bring on a data scientist and a data engineer and an operations engineer? Or do you start with the data engineer to build out the capabilities for running data pipelines and having the data quality, so that when you do bring on a data scientist, they have something to work with? Or is it better to have them be part of that conversation from the beginning so that there is that interplay of, this is what I'm looking for, and this is what I can give you, and then we converge on the optimal workflow from data engineering through to data science?
[00:40:54] Unknown:
I definitely recommend a data engineer first. But definitely the most popular, or the one that I've seen in the wild most, has been the data scientist hired first. So let me talk through what happens there. Usually, a data scientist being hired first is based on that management misunderstanding of what a data scientist is. They're thinking that data scientists can do it all, that they're that person who can program as well as they can do the modeling side, the advanced analytics side. And, unfortunately, that isn't the case. So what you have is this mismatch of requirements.
And the company is saying, here, we don't have any data. Go make some data products. Go do some data science-y stuff and, question mark, question mark, question mark, profit. And what usually happens there is the data scientist sits idle for 3 to 6 months. Management gets mad and says, why aren't you doing data science-y things? And the data scientist says, where are the data products? I'm here to do data science, not create a bunch of infrastructure. So usually, in those sorts of scenarios and situations, you have about 3 to 6 months before your data scientist will start to look for a new job and leave.
And you as the manager are faced with the prospect of trying to hire another data scientist after you got virtually zero ROI out of your first one. So that is the primary reason why I recommend hiring a data engineer first. That data engineer would be there to start looking at the data products and the can'ts that are available, trying to evaluate what sorts of data products should be created. And then from there, they will start to create those data products. And then you start hiring your data scientists in. And this kind of goes back to one of the questions you were asking about how we hire data scientists, for example.
During your interview with the data scientists, you actually tell them, hey, we've already created the data products for you. Here are the technologies. And then they say, oh, I get to go do that cool data science stuff that I've always wanted to do, instead of, hey, you're going to be hired, and you'll have to do a bunch of your own data work. The data engineer has done all that. From there, we can hire that data scientist. And then, as we start to operationalize more and more, we may bring on an operations person early on. There are various ways to do that. Probably the inverse of your question, Tobias, is how we would go about dealing with, let's say, a data science team without any data engineers.
What we'd want to do there is get our ratios back in order. The ratio of data scientists to data engineers is usually 1 to 2 up to 1 to 5. So there should be 2 to 5 data engineers per data scientist. It's a lot about trying to get your ratio back, hiring the right data engineers, getting a good strong lead on the team, a project veteran is what I call them, and then starting to really build out that data engineering team and
[00:44:10] Unknown:
figuring out a way to pay down or deal with the technical debt usually created by the data science team. That was another question I was going to ask: the proper ratios of data engineers versus data scientists versus operations engineers, because it's not necessarily obvious, where some people might think, oh, well, as long as I have a data engineer, then I can hire on 3 data scientists, and then they can do all kinds of magic and, you know, it'll be puppies and rainbows all day. Whereas then you have 1 data engineer who's overtasked and not able to deliver, and so they might get frustrated and go find another job, which is the opposite situation of the data scientist who's brought on with no data engineer on staff.
[00:44:48] Unknown:
It's definitely that. You'll get frustration on both sides. And this is one piece of advice to you as management: be looking for that. If you're a team lead or a manager on the team, you should be looking for signs of frustration. It's okay for team members to be frustrated. What's not okay is for the team to perceive that the frustration is never going to go away, that the bad juju, the poor ratios, are never going to change. That's when people start looking for new jobs. They start quitting. And one thing to know about losing a data scientist or a data engineer: there's more tribal knowledge held by data scientists and data engineers than other roles, in my opinion.
And as that person quits, you lose that person's tribal knowledge. It's not like you can say, hey, spend the next month before you leave writing down all of your thoughts and all of your tribal knowledge. There's just some inherent tribal knowledge that you'll lose. And that loss will be difficult to regain unless the entire team was in on, for example, that design or that coding. So do be careful. Losing team members is going to be costly, not just for that tribal knowledge, but for the actual effort of going through and finding a replacement. In my experience hiring data engineers and data scientists in Europe, the US, and Mexico, you're looking at a good 3 to 6 month lead time from starting to advertise that job to someone sitting in a chair with that title. So really think about this as you're hiring, to make sure that people are in the right place.
[00:46:36] Unknown:
Another interesting aspect of the role of data teams within an organization is how they might interact with the other engineering groups in the business. If you have a set of software engineers who are building applications and software products, what do you see as being some of the useful relationships between those software teams and the data teams?
[00:46:58] Unknown:
It's an interesting question. One thing I haven't talked about from the book is the case studies I did. I didn't just want people to be reading a bunch of my experience and my thoughts and ideas. I wanted other people's thoughts and ideas to be part of it. So one interview I did was with Dmitriy Ryaboy, who started Twitter's data teams. And we talked a lot about this, because it was an issue early on at Twitter that there were the data teams and then there were the software engineers, and the software engineers wouldn't be on the same page.
They wouldn't be on the same page in terms of product changes, for example. So what could happen there, at least in the stories that Dmitriy told me, was Twitter would change an API, or the way data was laid out, and then the data teams wouldn't know, and they wouldn't know until something broke. So what they did was get the data engineers a seat at the table during those design discussions, so they could say, yes, I know ahead of time when something is changing. In a very similar sense, operations would want to know when there is an operational change, when software is deployed, when we are EOL-ing this or that.
Dmitriy went as far as to say that data teams, data engineers, should be part of the software engineering organization. Otherwise, they may not be perceived as real software engineers. And this can be a definite perception problem, where the software engineers, when they hear data engineer, think of that SQL focused person. They think the data engineers don't know how to code their way out of a paper bag. Well, that could have been true for other data engineers that they have dealt with over their careers. But the data engineers need to actually show, no, we really are good software engineers.
Distributed systems are even more difficult, in my opinion, than regular software engineering or small data engineering. So it's really important that they have that symbiotic relationship with the rest of the organization. We do want our data engineers to know when things are changing and what is changing. And I would say, one of the differentiations that I've seen be really apparent as the difference between a software engineer and a data engineer: it's not just that the data engineer knows the distributed systems side. It's also a love of data, or a significantly different appreciation of data than a software engineer has.
And I say this having been a software engineer for a while. Software engineers think of data as, I put data into the database and that's my interaction with data. When we get into data engineering, we actually think about data as a life cycle. We have to think about how data is dealt with over time. And so by having a seat at the table, the data engineers will be able to say, hey, that change you're making, it's breaking the data model. We're exposing this data model. Could we do it this way? Could we do it that way? Or could we not do this at all? The software engineers may not be thinking about it from that data perspective.
And by having that seat at the table, the data engineer, perhaps an architect, can actually raise their hand and say, hey, this is different. We wanna do something different.
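To illustrate the kind of breakage described above, here is a minimal, hypothetical sketch: an upstream team renames a field, and a downstream pipeline that quietly assumed the old layout only notices when something breaks. A lightweight schema check at the ingestion boundary surfaces the change immediately. The field names are invented for illustration and are not from the episode or from Twitter's systems.

```python
# A hypothetical guard against silent upstream schema changes: fail loudly at
# ingest instead of producing broken data products downstream.
EXPECTED_FIELDS = {"user_id", "event_type", "created_at"}

def validate_record(record: dict) -> None:
    """Raise early if the upstream payload no longer matches what we expect."""
    missing = EXPECTED_FIELDS - record.keys()
    if missing:
        raise ValueError(f"Upstream schema changed: missing fields {sorted(missing)}")

# Example: the upstream team renamed 'created_at' to 'event_time' without telling us.
new_payload = {"user_id": 42, "event_type": "click", "event_time": "2020-10-01T12:00:00Z"}
validate_record(new_payload)  # raises ValueError, flagging the change at ingest time
```

The check is deliberately simple; the larger point from the conversation is that having a seat at the design table means the data team hears about the change before this guard ever has to fire.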
[00:50:21] Unknown:
And so going back to the hiring question and onboarding new people, what have you found to be useful ways of evaluating their talent and their capabilities, recognizing that software engineering interviews, and engineering interviews in general, are typically poorly structured, not very well thought out, and not very consistent? I'm wondering if you can just talk through your thoughts on how that manifests for data roles in particular, and ways to avoid the pain that exists on both sides as a result of these practices.
[00:50:54] Unknown:
It is a difficult thing. In the book, I didn't specifically address how you do interviews and hiring. So what I did instead is, on my YouTube channel, I did a 10 minute video talking about interviews. And in that video, I share what you have to do. For example, for data engineers, you basically have to give them a software engineering interview and a data engineering slash big data interview as well. It's a pretty rigorous thing that has to happen. At my clients where we've done this, there are usually 4 to 5 rounds of interviews, ranging from, let's do some coding, some software coding, your regular software engineering interview, to a separate interview of, now let's talk about distributed systems.
A few other difficult parts I see, specifically with this data engineering interview, are: what are the key parts of software engineering that a data engineer should know? At one client, they were giving candidates the standard software engineering interview that they gave everybody else. In that particular case, it was very web dev heavy. And to be honest, some of the questions were just completely not relevant or were in a different language. A lot of their web dev was in Ruby. And you're just not going to throw a bunch of Ruby questions at a data engineer, for example. Or, how do you do x with something on the web, it just isn't relevant.
What we did is we took the parts of those questions that were data engineer relevant and made that the software engineering side of the interview. These were questions not so much on hardcore data structures, but touching on data structures, touching on what you know. And then for the distributed systems part, it's asking them about the technologies that they'd be expected to know. Do they understand scale? Scale questions. Do you understand what skew is, for example? Those sorts of questions are what distinguish them. Separating out to the operations side, operations people will need to know the usual operations. And in that sense, you'll have to ask them the usual Linux questions.
Do you know Linux and hardware? How do you do troubleshooting of Linux hardware, for example? Those sorts of questions, with more added on: do you know how to operate the framework that we're going to use? Let's say you're doing Spark. Let's say you're doing Kafka. Have you actually done that? And then there's the whole other aspect of scale to that question. Can you operate that at the scale that we need? For data scientists, you'll have a mix of math and modeling, as well as programming. So you aren't going to throw a full software engineering programming interview at data scientists and expect them to pass.
The general metric I usually give is that they should be an intermediate-level programmer. And I say programmer, not software engineer. That means the data scientists will understand syntax, but if you're going to ask them about the finer points of computer science, that just isn't their forte, or frankly necessary for what they're going to be doing. Then the other part of that interview is the statistics, the modeling, choosing the right models, those sorts of questions. In that sense, we have an expanded interview cycle.
Then some companies have the whole culture-fit sort of interview on top of that. It is a significant amount of interviewing.
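To make the programmer-but-not-software-engineer bar Jesse describes concrete, here is a minimal sketch of the kind of data science screening exercise an interviewer might use. It assumes Python with scikit-learn, and the synthetic dataset and model choice are purely illustrative, not anything prescribed in the conversation; the point is whether the candidate can write straightforward code and then explain the modeling and evaluation choices.

```python
# Hypothetical data scientist screening exercise (illustrative only):
# check that the candidate can write plain Python and reason about
# model choice and evaluation, not hardcore computer science.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Small synthetic binary classification dataset.
X, y = make_classification(n_samples=2_000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Fit a simple baseline; a strong candidate can explain when they would
# (or would not) reach for something more complex.
model = LogisticRegression(max_iter=1_000).fit(X_train, y_train)

# Score with area under the ROC curve and discuss whether AUC is even
# the right metric for the problem at hand.
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"Test AUC: {auc:.3f}")
```

The code itself is deliberately unremarkable; the interview signal is in the follow-up discussion about the metric and the model, which is exactly where an inexperienced interviewer can get bluffed.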
[00:54:38] Unknown:
And one of the challenges in particular for hiring for these types of roles, especially if you're starting a brand new team, is having the appropriate amount of expertise already in the company to evaluate the responses to some of these questions, particularly on subjects like data modeling and understanding, you know, how do you evaluate the area under the curve and how applicable is it to this particular problem domain? Or for a data engineer, how do you structure the execution pattern for the DAG within a Spark cluster, and then understanding whether or not the answer they gave you makes sense. I'm wondering what you've found to be the necessary level of understanding and capacity for hiring managers or team leads who are trying to run these types of teams?
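As a concrete anchor for the Spark part of that question, here is a minimal PySpark sketch; the data, column names, and app name are invented for illustration. What an interviewer would probe is whether a candidate can explain lazy evaluation, how the chain of transformations becomes a DAG of stages, and where the shuffle happens, and whether the interviewer themselves can tell if that explanation holds up.

```python
# Minimal PySpark sketch (hypothetical data): reason about the DAG that
# a chain of lazy transformations produces.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dag-interview-sketch").getOrCreate()

events = spark.createDataFrame(
    [("alice", "click", 3), ("bob", "view", 1), ("alice", "view", 7)],
    ["user", "action", "count"],
)

# Nothing executes yet: these calls only build up a logical plan.
per_user = (
    events
    .filter(F.col("action") == "click")
    .groupBy("user")                       # the aggregation forces a shuffle
    .agg(F.sum("count").alias("clicks"))
)

# A candidate should be able to read the physical plan and point out the
# exchange (shuffle) introduced by the groupBy/agg.
per_user.explain()

# Only an action such as show() or collect() actually triggers execution.
per_user.show()
```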
[00:55:28] Unknown:
It's frankly difficult if they don't have this experience already. I've seen this firsthand as well, where they've hired the wrong person simply because the person they talked to could talk the talk, as it were: the candidate knew the questions somebody would Google and had memorized those answers, and the interviewer didn't know how to push deeper, so they couldn't get a definitive pass or fail. They didn't know the follow-up questions to ask. That can definitely be an issue. What I've seen people do, and this is something I've done for my clients as well, is this:
my clients are starting out with their data teams and on their big data journey. They don't know the questions to ask. They don't know the right things to do. And so what we do is we will do that final interview, or we're part of that interview cycle, where I'm there to ask the big data questions. There's someone else, I can't remember the person off the top of my head, who was at some kind of consulting firm too. They weren't doing mentoring like my company, they were doing more of a consulting style, but they offered to do those sorts of interviews as well.
The other thing you can do is kind of draft off somebody else's choices. Let's say there's a company that you know has a good reputation, not just for its people but for engineering quality. What you could do is try to find somebody who has that pedigree. Let's say they were a data engineer at company X, and company X is known for turning out really good data engineers and having really good data engineering. Maybe what you do is some culture-fit interviews and some smoke-test questions to make sure they aren't bluffing.
And then you basically take it on faith that because they've worked at that company, they know what they're doing, and they will join your company and get things started. One thing to note that's related to this, and probably to previous questions, is that those initial engineering hires will set you up for success or failure in other ways. If you hire that first person and that first person is really good, then during your interview cycle, candidates will realize, oh wow, this person is really good, I can actually learn something from them, I'll enjoy being around them, I think very highly of working with that company. Conversely, if you're the one sitting across the table from that person and thinking, oh my god, this person does not know what they're doing, you're thinking, okay, how much of their stuff am I going to have to clean up? What kind of mess have they created?
You're thinking all kinds of negative thoughts. I've seen this firsthand as well: it will actually turn people away. The sort of people who could turn you around are going to be turned off by seeing that and will say, I do not want to be part of your turnaround effort.
[00:58:34] Unknown:
So for managers who are responsible for data teams, what are some of the anti-patterns that you have seen that might have led a team that could otherwise have succeeded down the path of failure?
[00:58:45] Unknown:
I talk about this in the book. There's a chapter that I'm really proud of called Diagnosing and Fixing Teams. It's sort of a manual for how to go through and look at a team and figure out why they're failing. Some of those failures are individuals; some are things the management team did. There are two that come to mind. The first one is looking for a silver bullet. The other one is searching for the holy grail. Silver bullet means one of the executives, someone on the management team, went to some conference, heard somebody talk, read a vendor white paper, and said, oh wow, this big data thing, this data thing, it's gonna save everybody. It's gonna save the company.
Let's do this. And so they, with their hair on fire, haul off and start trying to create that data team, except they don't go through the right process. They don't go through the right steps. They just kind of muddle their way through it as best they can. That doesn't work very well. The other one is the holy grail, and I would say the holy grail is one of the worst ones. I've seen that firsthand as well. What holy grail means is somebody went to a conference and saw some big, respectable tech company, the kind of company everybody wants to be like: Apple, Uber, those sorts of companies.
Nothing wrong with those companies, and they do awesome stuff. What happens is that management will sit in those audiences, or read those blog posts, and say, that is exactly what we're going to do. They forward it on and say, I want this. What they don't realize is that the blog post or the conference talk misses a few things that are actually really relevant. One is how long it took that company to get there. Speaking of Uber again, I think Uber has probably given the only blog post I've ever seen that talked about the progression of how they actually got to their current data infrastructure and data pipelines. They talked about how it was, I think, over the course of 10 or 15 years.
And it was the first time I'd ever seen anybody really say, it took us 10 or 15 years to get to this point. Usually what a manager, especially a layperson manager, is thinking is, oh, they did this over the course of six months to a year, rather than, no, this actually took them a significant amount of time, and it took the team building a lot of velocity to do this and do it right. The other thing I think is missed, or often not talked about, is how much help they had upfront and what the starting point of the team was.
Sometimes the team at, let's say, a San Francisco Bay Area company may have been doing distributed systems and big data for the past 10 years before joining that company, so they're leveraging 10 years of experience. In the rest of the country, or the rest of the world, they may be leveraging one year of experience, two years, maybe even zero years. That's a huge difference in experience. Or what they may not talk about is, no, there was actually a management or consulting team that helped them out significantly, where that consulting team was much more in the background, and there's the exec up on the stage taking the credit for it, which is completely fine.
But what the manager in the audience, thinking they want the holy grail, doesn't realize is, yeah, that company also spent millions of dollars or millions of euros on consultants to help them implement that. And that's okay. But what isn't really known by that manager is that there was a small army of consultants working on that project.
[01:02:34] Unknown:
In your experience of working with all these organizations and extracting these lessons and doing your case studies, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[01:02:47] Unknown:
One was how I came away from those interviews thinking about DataOps. When I was first writing the book and doing the research for it, I wasn't sure what DataOps would be and how important it would be. I definitely came away from doing those case studies thinking, once you get to that advanced state and your limiting factor is friction, DataOps could be the thing that changes your business significantly. That was a very interesting realization, honestly. Another one I was surprised about from the book is how many companies don't understand how you have to work with the business.
You and I have probably both done agile, and one of the basic precepts of agile is that you work with the business. You work with whoever your product owner is, and you work with them consistently. As I started to work with teams, I saw how often software engineers didn't actually work with the product owners and weren't getting things in front of them to show, hey, this is what I'm doing, this is what the product looks like, this is what the data product looks like, this is what the dashboard looks like. And it really underscored a realization that I had well before I wrote the book, and it's basically the reason why I wrote the book: early success and failure is not the result of technology. It is the result of management.
What that means is, if you were successful early on, that is, you actually got something into production, it wasn't that Spark worked well or that Kafka worked well. It was that management found the right people, organized them the right way, and worked with the business the right way. That was really what success was. I had that realization while I was working at Cloudera. Cloudera was a vendor, so my perception was that as long as you used Hadoop the right way, you'd be successful. But as I went into companies and started working with them, I was seeing that, hey, this wasn't a technical problem, even though the company may have said it was a technical problem, that Hadoop didn't work or that Hadoop was too difficult.
Usually, it was plainly an issue of the team not being set up right initially. That's a big thing I'm hoping the people who are listening will internalize: it is management that sets you up for success initially.
[01:05:29] Unknown:
And are there any other aspects of the formation and management of data teams, the interactions of the members within those teams, or the overall space of building data products and building that capacity within an organization, or any other resources or advice, that we didn't discuss and that you'd like to cover before we close out the show?
[01:05:50] Unknown:
Yes, there is one that I didn't talk about, and that is the symbiotic relationship between all three teams, or what I've called high-bandwidth connections in previous books. When you have these three teams and they aren't working well together, that's another example of friction. There's friction between the data teams and the rest of the organization; that's one level of friction. And then there's friction between the teams themselves, where the data scientists don't work well with the data engineers. What I found super interesting in those teams is that neither is leveraging the other. The most common manifestation I've seen is that the data science team does not believe in or trust the data products coming out of the data engineering team.
So they create their own data products and their own data infrastructure, and you have a complete lack of leveraging anything and, quite frankly, a complete duplication of effort. As this happens, the teams just go nowhere. And it all starts with either a political issue or a lack of understanding between the teams. It's a big problem. What I would look at, if I were a manager of a team or an organization with existing data teams, is: do they have high-bandwidth connections? Do they have a symbiotic relationship, or do they have an adversarial relationship?
If they have an adversarial relationship, you need to start looking at why. Why do the teams not like each other? Do the teams, whatever, fill in the blank. It could be something as simple as they've just never had to work together before. And it's up to management, in my opinion, to fix this. It's not going to be the individual contributors who say, we should all hold hands and sing kumbaya. It should be the manager saying, no, this isn't right. This isn't the way we should be working together. We shouldn't be duplicating our efforts. We should have a lot less friction amongst ourselves. And how would a manager do that? By taking some real, honest looks at what is happening.
Do the data engineers just completely blow off the data scientists? Do the data scientists completely blow off the data engineers? Are the data engineers on the team actually meeting the definition of data engineers? I've seen that before, where the data scientists say, no, you guys actually can't create data products, you're SQL-focused data engineers, and we really can't do anything with you. It takes more of a 360-degree view of yourselves, some real honesty, and some deep introspection to figure out what's happened and how to fix it.
[01:08:44] Unknown:
For anybody who wants to get in touch with you or follow along or dig deeper on these topics, I'll have you add your preferred contact information to the show notes. And as a final question, I would like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[01:09:00] Unknown:
It's actually a more holistic thing, I believe: nobody's ever taken a whack at the whole ecosystem. Eric Sammer, if you know him, he and I used to work together at Cloudera. He's now at Splunk, and he's been tweeting about this, and I've been thinking about it too. Part of our issue in data is that we're trying to bring together several different technologies where how they integrate with the rest of the ecosystem was either loosely thought about or never really thought about at all. Let's say technology X was built to do this, but they never thought about integrating with technology Y. That forces the data engineers, or perhaps even a separate third-party product, to bring that integration together.
And so we spend a ton of our time trying to wire technologies together when, if we could wave a magic wand, it would be better if they had all been integrated together from the start. That was my hope for what we were doing at Cloudera. It didn't quite pan out. So then I started thinking, well, maybe the cloud providers; it's almost as if the cloud providers have the best odds of doing this. Somebody tweeted out, I can't remember who, imagine what could happen if somebody were to start acquiring some of these companies: bring in Snowflake, bring in, let's say, a Kafka company like Confluent, a database plus pub/sub plus processing engine.
If somebody were to buy all three of those and really put all of their time into making the integration dead simple, it wouldn't make the programming side dead simple, but it would make the integration side dead simple. I think it would change the landscape of what we'd have to do dramatically, and it would make our job so much easier. But, alas, it hasn't happened yet, and we can cross our fingers that maybe Amazon will start doing this better, or Microsoft, or GCP.
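As a concrete illustration of the wiring work Jesse describes, here is a minimal sketch of the kind of glue a data engineer ends up writing between two systems that were not designed together: reading events from Kafka with Spark Structured Streaming and landing them as Parquet files. It assumes PySpark with Spark's Kafka connector package available on the classpath; the broker address, topic, and paths are placeholders, and a real pipeline would also handle schemas, bad records, and the downstream warehouse load.

```python
# Sketch of integration "glue" between Kafka and file storage using
# Spark Structured Streaming. Broker, topic, and paths are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("kafka-to-parquet-glue").getOrCreate()

raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder broker
    .option("subscribe", "events")                         # placeholder topic
    .load()
)

# Kafka keys and values arrive as bytes; decode them before writing downstream.
decoded = raw.select(
    F.col("key").cast("string"),
    F.col("value").cast("string"),
    "timestamp",
)

# Continuously append decoded records as Parquet; the checkpoint lets the
# stream resume where it left off after a failure.
query = (
    decoded.writeStream
    .format("parquet")
    .option("path", "/tmp/events")                  # placeholder output path
    .option("checkpointLocation", "/tmp/events_ck")
    .start()
)

query.awaitTermination()
```

None of this is hard individually, which is the point: it is undifferentiated wiring that a better-integrated ecosystem, whether from a vendor roll-up or a cloud provider, would make unnecessary.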
[01:11:07] Unknown:
Well, thank you very much for taking the time today to join me and discuss the work that you've done on writing the book about how to manage data teams effectively, but also all of the work that you've done with these organizations to understand the space. It's definitely a very interesting and necessary topic. So I appreciate the time you've put into that, and I hope you enjoy the rest of your day.
[01:11:28] Unknown:
Thank you. I appreciate you having me. And everyone who's listening, I appreciate you listening and taking the time to think about this and see how we can improve things in the data engineering field.
[01:11:44] Unknown:
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Sponsor Messages
Interview with Jesse Anderson: Best Practices for Data Teams
Defining the Mission of Data Teams
Specialized Roles in Data Teams
Understanding Big Data: The 'Can't' Definition
Medium Data: The In-Between
Identifying and Training Internal Talent
Strategies for Attracting External Talent
Organizing Data Teams: Centralized vs. Embedded
Measuring ROI and Value Creation
Hiring Order: Data Engineer vs. Data Scientist
Interaction with Other Engineering Teams
Evaluating Talent and Interview Strategies
Anti-Patterns in Data Team Management
Lessons Learned from Case Studies
Symbiotic Relationships in Data Teams
Contact Information and Closing Remarks