Summary
Building data products is complicated by the fact that there are so many different stakeholders with competing goals and priorities. It is also challenging because of the number of roles and capabilities that are necessary to go from idea to delivery. Organizations have tried a multitude of strategies to improve the success rate of their data teams, with varying results. In this episode Jesse Anderson shares the lessons that he has learned while working with dozens of businesses across industries to determine the team structures and communication styles that have generated the best results. If you are struggling to deliver value from big data, or just starting down the path of building the organizational capacity to turn raw information into valuable products, then this is a conversation that you don’t want to miss.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Are you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy data analytics in the cloud. Our comprehensive data-level security, auditing and de-identification features eliminate the need for time-consuming manual processes and our focus on data and compliance team collaboration empowers you to deliver quick and valuable data analytics on the most sensitive data to unlock the full potential of your cloud data platforms. Learn how we streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta.
- Today’s episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. Datadog uses machine-learning based algorithms to detect errors and anomalies across your entire stack—which reduces the time it takes to detect and address outages and helps promote collaboration between Data Engineering, Operations, and the rest of the company. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial. If you start a trial and install Datadog’s agent, Datadog will send you a free T-shirt.
- Your host is Tobias Macey and today I’m interviewing Jesse Anderson about best practices for organizing and managing data teams
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by giving an overview of how you view the mission and responsibilities of a data team?
- What are the critical elements of a successful data team?
- Beyond the core pillars of data science, data engineering, and operations, what other specialized roles do you find helpful for larger or more sophisticated teams?
- For organizations that have "small data", how does that change the necessary composition of roles for successful data projects?
- What are the signs and symptoms that point to the need for a dedicated team that focuses on data?
- With data scientists and data engineers in particular being in such high demand, what are strategies that you have found effective for attracting new talent?
- In the case where you have engineers on staff, how do you identify internal talent that can be trained into these specialized roles?
- Another challenge that organizations face in dealing with data is how the team is organized. What are your thoughts on effective strategies for how to structure the communication and reporting structures of data teams? (e.g. centralized, embedded, etc.)
- How do you recommend evaluating potential candidates for each of the necessary roles?
- What are your thoughts on when to hire an outside consultant, vs building internal capacity?
- For managers who are responsible for data teams, how much understanding of data and analytics do they need to be effective?
- How do you define success or measure performance of a team focused on working with data?
- What are some of the anti-patterns that you have seen in managers who oversee data professionals?
- What are some of the most interesting, unexpected, or challenging lessons that you have learned in the process of helping organizations and individuals achieve success in data and analytics?
- What advice or additional resources do you have for anyone who is interested in learning more about how to build and grow a successful data team?
Contact Info
- Website
- @jessetanderson on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- Data Teams Book
- DBA == Database Administrator
- ML Engineer
- DataOps
- Three Vs
- The Ultimate Guide To Switching Careers To Big Data
- S-1 Report
- Jesse Anderson’s Youtube Channel
- Uber Data Infrastructure Progression Blog Post
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management.
[00:00:18] Unknown:
When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
Are you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy analytics in the cloud. Their comprehensive data level security, auditing, and deidentification features eliminate the need for time consuming manual processes, and their focus on data and compliance team collaboration empowers you to deliver quick and valuable analytics on the most sensitive data to unlock the full potential of your cloud data platforms. Learn how they streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta.
That's I-M-M-U-T-A. Your host is Tobias Macey. And today, I'm interviewing Jesse Anderson about best practices for organizing and managing data teams. So Jesse, can you start by introducing yourself?
[00:01:55] Unknown:
Hi. My name is Jesse Anderson. As you mentioned, I have spent the past several years not just on the technical side of things, but on the management side of data teams. And that's probably a large amount of what we'll be talking about today.
[00:02:09] Unknown:
Do you remember how you first got involved in working in the area of data?
[00:02:13] Unknown:
I've always had an interest in data. I've always been a curious person. And I think curious people like to read, and they like to have their decisions based on data, or actually see what's happening. So throughout my career, I've always been curious about, okay, what is that number? What does that count? What is really happening? Was what's happening really backed up, or was the CEO saying one thing and the numbers and data saying another? I've always been curious about that.
[00:02:42] Unknown:
One of the things that you've done most recently is publish a book entirely focused on building effective data teams, with a particular focus on big data. And I'm wondering if you can just start by giving an overview of how you view the mission and responsibilities of that data team.
[00:02:59] Unknown:
Sure. Just to give a one-sentence definition, data teams are responsible for creating data products. And then expanding out from that, there are the different teams. So the 3 teams are data science, data engineering, and operations. And as part of that mission of creating data products, each one of those teams does something specific. Data engineering, for example, is creating a data product that's usable, usable by the rest of the organization, that's scalable, using the right infrastructure. The operations team is giving the operational excellence that's necessary.
If we're creating data products and they aren't up and running and usable, then what good are the data products? And finally, the data science team, I see them as kind of the cherry on top. They're creating the advanced analytics that make the highest amount of value possible for us. One thing I like to tell people is that none of these teams is the most important team. They're all important to that value creation cycle. So having one of them missing actually creates various problems in creating data products. So we really have to focus, and we really have to say, I need all 3 to create the highest level of value for ourselves.
[00:04:22] Unknown:
Within the context of data teams, there are those 3 core pillars that you mentioned. But in reviewing your book a bit, you also called out the fact that there are cases where you might need other specialized roles depending on the particular focus of the data product or the level of sophistication of the organization. And I'm wondering if you can give some color as to what form those other roles might take and where they might live within that constellation of data science, data engineering, and operations?
[00:04:54] Unknown:
As far as the operations team, usually that team is mostly, if not entirely, made up of operations engineers, whatever your organization calls them, whether that's SRE or operations. The operations team is generally made up wholly of that role. Then on the data engineering team, that's where things get more cross functional, as it were, depending on the organization. So on data engineering, part of your data product creation may not just be, here's a file in S3, for example. For some organizations, the purview of the data engineering team extends to the visualization of that data, the exposing of that data.
And as a direct result, we may need other functions or titles represented there. So the data engineering team is primarily going to be made up of data engineers. And my definition of a data engineer is a software engineer, once again, a software engineer who has specialized their skills in big data. So that data engineering team will be mostly made up of people with that software engineering background and an understanding of the big data tools. However, for some parts of that, we're actually going to need other specialties. We may have to do a visualization, a real time dashboard, for example.
We may have to expose that data product to the rest of the organization. So in those senses, we may actually have front end engineers on our data engineering team. And with the companies I've mentored and consulted with, we've actually done that. We've either hired or put a front end engineer on the data engineering team because that was a key part of what that data engineering team was doing. Yes. They were creating the data products, but they were also responsible for creating the graphical representation or the representation to the business users of that data.
Data engineers, I don't expect them to be great UI people. The vast majority of software engineers aren't great user interface people. So, yes, having a front end engineer on the team will be quite helpful. Other people that we may have on that team might be a DBA. And this is an issue that you'll have to keep an eye on from a management point of view. DBAs will help out the data engineering team on things such as schema, making sure that we're laying out our schema properly and evolving it properly. And they may help out on some of the more SQL-oriented parts of the problems.
However, having a data engineering team made up entirely of DBAs is a problem unto itself. I talk more about that in the book if you want to read more about it. Then we have other people who are on the data science team. So on the data science team, for some organizations, that front end engineer that I mentioned may be on the data science team, depending on how they're trying to represent that data and who's really responsible for representing that data to the business, as it were. However, the data science team is mostly going to be data scientists.
There are 2 other kind of newer manifestations of positions, and one of those is machine learning engineer. So machine learning engineer, I like to think of it, or to tell people, is what people originally thought data scientists were. Put a different way, my definition of a data scientist is someone who has taken their mathematical or statistical background and learned how to program. Now the issue with most people's perception of what a data scientist is is that they thought a data scientist is a master of programming or software engineering as well as a master of the mathematical and statistical side.
And the vast majority of people who have the title of data scientist are strong on the math side and are comparatively weak on the software engineering side. That means that the systems that data scientists often create have large amounts of technical debt due to the missing software engineering background. So that brings up the issue of who is responsible for creating really good systems for machine learning, not just for the models, but who's supposed to go through and stand this up, make sure it runs right. And that brings up the machine learning engineer. And that machine learning engineer is what I see as the in-between for when companies really see the difference, or the lack of technical skill, on the data scientist side.
And then the data engineers have that technical skill, but the inability to either understand models or deal with models, so you may need that machine learning engineer. Coming back to that definition, or what people originally thought, this is what most business leaders originally thought they had with their data scientists. They thought, okay, they can handle all of the software engineering side. Well, the reality is that you still need data engineers and you still need machine learning engineers. Then there's a second one, and that's much more of a team structure than it is a person per se. And that's DataOps.
And DataOps is there to deal with some of the friction that we're seeing in organizations. It may be published by the time this podcast comes out, but I did a survey for data teams, and I asked questions about where people have the most problems. And by far, the biggest problem was around friction. They said that their difficulty was around friction. So how do we deal with friction organizationally? Usually, that was trying to take the resources of 2 different teams, usually data science and data engineering, and say, how do we coordinate that? How do we apply some kind of fairness algorithm between data engineering resources and data science resources?
And there may not be a way to do a true fairness algorithm or organizational structure. What we may need to do is break them up and make them cross functional teams. And that's what DataOps is. As talked about in the book, we break the teams up so that data engineers and data scientists, and potentially operations and product people, are all on the same teams. And by putting them all on the same teams, it removes some of that contention and friction so that we are consistently working with the business and we do have all the resources on that team. And then I have some interviews in the book where people are practicing DataOps on a day to day basis and describe how DataOps actually helped them remove that friction.
[00:12:07] Unknown:
One of the other things worth calling out is this overall concept of big data, where a lot of people might have their own intuitive understanding of what that means, or they might have a particular definition that they go by. And one of the most common ones that I've seen is the idea of the 3 V's. In the beginning of your book, you call out your own particular way of determining whether or not somebody has big data. Wondering if you can talk through what your definition is and how you arrived at that.
[00:12:36] Unknown:
My definition is "can't." I've always had that definition. My definition of can't means that you, as a manager, walk up to your, let's say, BI team, or you walk up to one of your analysts and say, hey, can you run a year over year calculation on gross sales? And they say, no, I can't do that. It's going to take too long. So what we have is a can't that is based on a technical limitation. The can't that the data analyst told you, it wasn't, I don't have the skills to do that. It was, I was told by the data warehouse team or the operations team that if I were to run that sort of query on the database, they would revoke my rights.
It would just take far too many resources on the database. It would bring the production database down, things like that. It's always based on a technical limitation. When I work with teams, this is the definition we look at. We look at can'ts. And for those of you who are listening, this is exactly what you should be seeking out. If you don't have a can't, you may not have big data limitations. However, if you're getting those can'ts, and this is usually what my clients are hitting, they're hitting a can't where they can't do the report or the analytic because it's going to take too long, or it's going to exhaust the memory of their in-memory database, for example, problems like that.
The reason why I prefer this can't definition over the 3 V's is that it's easier for managers to understand, in my opinion. It's more of a manifestation. It's more of, here, I can point to this and say I have can'ts or I don't. The 3 V's sort of definition lent itself to vendors saying, hey, my product is big data now. Well, you didn't change your product. Your product isn't big data. You just have better marketing saying it's big data. However, if you do have this can't and you have technologies that solve that can't, then you have both a big data problem and technologies that solve big data problems.
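To make the "can't" concrete, here is a minimal sketch of the kind of year over year gross sales calculation described above, written with PySpark, the sort of engine a data engineering team might stand up once the production database can no longer handle the query. The table layout, column names, and storage path are hypothetical and not from the episode.

```python
# A hypothetical illustration of a "can't": a year-over-year gross sales report
# that is too heavy for the production database, run instead on Spark.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("yoy-gross-sales").getOrCreate()

# Assumed layout: an 'orders' dataset with order_date and gross_amount columns.
orders = spark.read.parquet("s3://example-bucket/orders/")

# Total gross sales per calendar year.
yearly = (
    orders
    .groupBy(F.year("order_date").alias("year"))
    .agg(F.sum("gross_amount").alias("gross_sales"))
)

# Shift each year forward by one so it can be joined on as the "previous year".
previous = yearly.select(
    (F.col("year") + 1).alias("year"),
    F.col("gross_sales").alias("prev_gross_sales"),
)

# Year-over-year growth relative to the prior year's gross sales.
yoy = (
    yearly.join(previous, on="year", how="left")
    .withColumn(
        "yoy_growth",
        (F.col("gross_sales") - F.col("prev_gross_sales")) / F.col("prev_gross_sales"),
    )
)

yoy.orderBy("year").show()
```

On a small dataset this is a trivial query; the point of the can't test is that, at the volumes Jesse describes, the same calculation would be refused or would take the production system down.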
[00:14:44] Unknown:
For organizations that are maybe on the border of a can't, where they are able to fulfill the needs that they have but are looking to derive new types of products from the information that they're storing, how does that impact the overall composition of the team structure needed to realize those data products when they are working with quote unquote small data?
[00:15:09] Unknown:
I believe that in between small and big data there's a term I coined: medium data. And to be honest, I think that more organizations are in that medium data phase, where they're too big for small data, but big data sounds too big and too unwieldy. They don't have petabytes of data. They only have terabytes, let's say. And so for those sorts of organizations, yeah, if you're hitting those can'ts, or sometimes we look at it from more of a perspective of, this year you don't have can'ts, but next year, will you? And as we rewind that back, we say, okay, if next year you are going to have can'ts, well, in order for us to hit that time period, we're going to have to start now. These projects aren't, hey, let's bust this out in a month. These are 6 month projects. These are year long projects.
So if we do it right, we'll avoid those can'ts altogether. So organizationally, what needs to happen in those, let's say, small, medium, or even big data teams, I talk a little bit about this in the book. For small data teams, the same sorts of things apply. The same sorts of organizational structures apply. However, your complications will be vastly lower. Count your blessings, quite honestly, if you don't have to deal with these big data technologies. As much as you have them on your show and you talk about them, the small data technologies are far more plentiful. They're usually far better engineered, easier, and more people know them. There's a whole host of reasons why you'd want to avoid this.
But as you get into that medium and big data side, yes, it becomes even more important to have the right people. Usually, the biggest manifestation isn't so much on your data science side when going from small data to medium or big data. If the data engineers have done their job right, the manifestation to the data scientists should be, not negligible, but manageable. The biggest manifestation will be on your ops and your data engineering teams, where your ops team will have to learn these brand new technologies and how to operate them correctly. Your data engineers, if they're coming from backgrounds where they didn't have distributed systems, will have to learn these new distributed systems.
And that can be cognitively very difficult for them. I say that having taught many software engineers these skills; not everybody has the ability to do this, quite frankly. And it is difficult. It can take 3 to 6 months. So organizationally, what managers should be looking at as they make that cutover is: is the team ready? Some teams may have been able to scrape by at small data. So this is an honest look that a manager should take at the team, asking, are we just scraping by on our success with the small data stuff? If we go to big data, with its added complexity, hey, maybe the team really can't make it.
Or maybe you're thinking, yeah, the team's done really well with the small data side, and they do have some background in distributed systems. You have much better odds that way, and they can make the jump. But, really, what I encourage is an honest look from management, because it isn't fair, quite honestly. It isn't fair for the manager, the company, or the team to put them in a situation where they're set up for failure.
[00:18:49] Unknown:
For those cases where you are trying to either establish a new team with new talent or identify people who are already in your organization who can be leveled up into these types of roles, particularly for that internal use case. What are some strategies that you have found to be effective for identifying what the capabilities are for the internal talent and identifying people who might be a good fit for moving into these types of roles?
[00:19:16] Unknown:
What we've done with previous clients is we've actually leveraged one of the books I wrote, called The Ultimate Guide to Switching Careers to Big Data. As opposed to this Data Teams book that we've been talking about, which is for management, the Ultimate Guide, the switching careers book, is much more focused on individual contributors. And I wrote that book to say, here is the lowdown on making that switch. This isn't me trying to sell you something. This is really me trying to say, I would rather you spend, let's say, 2 or 3 hours reading a book and decide, no, I don't want to do this, rather than reading hype or watching a bunch of YouTube videos saying how easy it is. I'd rather the book say, hey, this is what it is. This is how difficult it is. And so what we've done is we've leveraged that book within the companies I've mentored. We've sent it out to all their software engineers and all their operations engineers to say, read this book, raise your hand if you want to be part of this, and we start forming the data engineering team from there. We get volunteers rather than conscripts.
And that isn't to say there's anything bad with conscription. It's more to say that I would rather have somebody volunteer for something knowing that it's going to be a difficult road ahead. And I will say that it is difficult. Even the people who raise their hand and say, I wanna do this, they may not all make it. They may not all be ready for it, and that's okay. But at least they've been given a good go of it, and they've been given the opportunity to say yes or no. I have seen other times when companies or teams are sent on death marches. And usually these death marches are started for a few reasons.
One is that they're started without any resources, whether the team members are conscripts or volunteers. The company says, here's a cluster, knock yourselves out. And that really isn't fair to those teams either. They should have been given some level of resources, some level of help, some level of training. Otherwise, they just won't be successful. So it's really key, and I say this when I mentor teams: hey, we're not going to just put you all on a team and then walk away. That's how you set people up for failure.
I'm going to be continuing to work with you. We'll make sure that we do the right architecture. I'm there for questions. One thing I will say is that as people first get started and get stuck, it is really difficult. They may not even know what the right question is to ask. So if you're management or a team lead on these projects, know this: getting stuck is how I've seen some teams just go nowhere. They get stuck, don't know the questions to ask, can't find the right resources to do this. And those sorts of teams, they don't right themselves.
They don't figure it out on their own. What usually happens is that they just spend months getting stuck and don't really go anywhere. So be really mindful of that. Keep an eye on the team. Keep an eye on the team's progress, their velocity, so that they're not getting stuck. But overall, we want to make sure that we have the right people. And related to the question of having volunteers, you should vet those volunteers as well. That's what we do. When I mentor a team, we make sure that the people are actually potentially viable members of the team. For example, if somebody is only doing batch scripting and wants to be part of the data engineering team, we need to validate that they could learn Java, for example, and that they can start dealing with distributed systems. Otherwise, it's a nonstarter for both sides.
[00:23:13] Unknown:
Beyond the internal use case, when you decide that you need to bring on external talent, that can be challenging because the current market for data scientists and data engineers in particular is very tight, and a lot of these types of roles require some level of seniority in software engineering before you make the transition, as you said, to being a big data engineer or being a data scientist or machine learning engineer. So I'm wondering what you have found to be useful strategies for attracting that type of talent given the tightness in the market right now.
[00:23:46] Unknown:
I think it's a mix of things that helps you do that. One is showing people that you are setting them up for success, that you are resourcing the team, that you are using not just cutting edge technologies for the sake of doing cutting edge technologies, but using cutting edge technologies the way they should be used. I would say, first and foremost, smart people who are experienced want to be around other people who are smart, and they're going to gain experience from being on that team. One thing I would encourage managers to think about is that this isn't a one way street. This isn't, if we pull one over on these people, they'll join our team.
I've seen that happen. And what usually happens is, if you hornswoggle them, they'll just leave. They'll just quit after a few months. And then you're back to square one of trying to find a data engineer again. I think it's about upfront honesty. We could also talk about alignment on core values. The companies that I've worked with, when we work together, we make sure that we align on the core values for the company. And then as we look for people to join the team from outside, we make sure that they align on core values as well. And finding that alignment on core values will mean that you find the right people who will stick around and do the right things.
So overall, you do need to make sure that you have a good fertile garden for people. They'll peer in during your interview process and see, yes, that is the place I want to
[00:25:28] Unknown:
work. And once you do have a team established and you have the necessary roles in place, how do you tend to recommend folks organize those individuals and teams in terms of the communication and reporting patterns that you have found to be most effective? I've seen conversations debating the utility of having a centralized team versus having embedded data scientists, with the data engineers and operations engineers as the platform team, or various combinations of different reporting structures to figure out how that fits within the broader organization.
[00:26:11] Unknown:
It is an important thing to think about. And I've seen this happen several different ways and with several different levels of success. To go back to that point I was making about DataOps, the issue with DataOps is that it's not an initial team structure that you might have, in my opinion. I think it's a later, more advanced team structure. And the reason I point that out for this question is it goes to asking, do you put a group of people together initially and then separate them out? Or do you separate them out and hope for the best? In my experience, the companies that separate their teams out initially never create a cohesive set of best practices.
Put a different way, team A has a data engineer, team B has a data engineer, team C has a data engineer. Well, the issue with that sort of thing is that each one of those data engineers is going to do their own thing. They're going to use different technologies. They're going to use different infrastructure. They may not even use the same cluster, for example. So you never get any kind of economy of scale, either operationally, cost wise, or usage wise. You just never get that. But I would say that the worst part is that you never get any best practices. You never get homogeneous best practices. Just to give a trivial example, one team is using spaces versus tabs. Obviously a trivial thing, but it gives you an example of how their code would be different unless there was an overall coding standard, for example.
Now as we look at the rest of the organization and what could happen, and I've seen this firsthand, on teams A and B, the data engineers meet the definition of a data engineer, where they're software engineers, and they've done that right. However, team C didn't understand that there was a difference in data engineers. If the listeners don't know, there are 2 accepted definitions of data engineer. One accepted definition is more of a SQL focused person, more of a DBA. And the other accepted definition is a software engineer who has specialized in big data.
In my book, I talk about that second definition, the software engineer with big data skills. However, maybe team C Googled around and said, oh, they have a data engineer. What is that data engineer? Let's hire them. Well, they may get that SQL focused person. And that SQL focused person won't be able to do the same level of data engineering. They may not be able to handle the same level of complexity. Or worse yet, they may create a mountain of technical debt for you. And that makes it so that team C is now this backwater bastion of technical debt, whereas teams A and B are creating some level of value.
This is coming back to that centralization. Well, if we had a centralized data engineering team, we would have had similar or sane hiring practices. And there would have been a team that would have said, no, this person doesn't meet our definition. This person can't handle the scale or doesn't know how to program, for example, for various reasons. If we don't have that centralized team, we can't set up the best practices, not just for coding and software engineering style things, but for infrastructure usage and even just for definitions of teams and who that person is.
So that brings up the question of when or how you should do that. What I've found often is that the reason the business side wants a data engineer on their team is that your data engineering team isn't giving enough love or enough attention to their organization, what have you. So that could be one manifestation of the problem. And this is a consistent issue as I've talked to the business leaders who've hired these people. Why did you hire them? It's because we couldn't get any time from the data engineering team. So look at that; that could be a whole issue that management is creating for itself.
Then there is the case where it's just the nature of the business: their domain is unique enough. So what we can do is establish that core team. And that's usually what I recommend doing. We establish that core team, whether that's data engineering or data science. And then we can look at how to start engaging with the rest of the organization. I believe that if you engage with the rest of the organization right, that core team can negate the need, or that desire from the business, to have their own data engineers. That said, it may not always be possible. So what do we do then? Well, the business units could hire onto their own teams. However, those team members have to actually be interviewed by the central data engineering team, for example.
Another route is a kind of tour of duty, where one member of the central team is on a, let's say, 6 month stint, reassigned from that central data engineering team to that business unit. So they do a tour of duty of 6 months, for example. Other possibilities that I've seen be successful are dotted line arrangements, where, for example, a data scientist is still a member of the central data science team, but their dotted line assignment is to that business unit. So they're still within that central team, perhaps even still sitting together, but their day to day assignment, their scrum, as it were, would be part of that other team.
What I think is key as you start to look at these is that there's a point where the cat gets out of the bag and there's no going back. It's a Pandora's box. As other business units start to hire those data engineers and data scientists, if those people aren't part of a centralized team or a centralized group, then they'll never really attain those best practices: infrastructure wise, code wise, team wise, organization wise. And the people in those satellite teams will always be in this backwater, where they'll never be able to get to the same level of value that the other teams do.
[00:32:51] Unknown:
And there are a couple of really interesting things to pull out there. But before we go down the road of hiring and evaluation for these types of roles, I'm interested in digging into the metrics and measurement of these types of teams, and particularly the return on investment: determining how much value they're producing versus the cost of the infrastructure, the salaries, and training, and how you have seen companies determine how to budget for and allocate spend, time, and focus for these teams that are focused on delivering data products, and identifying what those data products should even be in the first place.
[00:33:34] Unknown:
The way I recommend doing that is really involving the business. And throughout this podcast, I haven't really talked about how we involve the business. Basically, in the book, after I introduce the teams, it's all about interactions and the business after that. Because the things that we create as data products have to be business worthy. They have to be valuable for the business. As we evaluate that ROI, we look at it and we say, is this what the business wanted? And more importantly, as you're, for example, starting a data team, you should have included the business in that conversation.
But you should be asking questions like, what are your can'ts? And more importantly, what is the value of those can'ts? Put a different way: if I could take your can't and make it a can, what would that do? Would that save €1,000,000 or $1,000,000? Would that make $10,000,000 or €10,000,000, for example? By clearly establishing the business value of moving those from can't to can, we can start to look at, okay, our data team could save the company €50,000,000, for example. From there, we can make a better business case for the training costs and infrastructure for our data teams. This is a consistent thing that we as technologists don't do.
We think in terms of technologies. For example, the majority of data teams might say, well, we're using Spark. And if we just stand up Spark, and if we just do Kafka, and we do this with Kafka. Well, Kafka is great and all, but it's not going to make your S-1 report for your company. They don't care about that. They care about how much money you made and how much money you saved and how much time you saved, those sorts of things. So as we talk about the value created, we're able to point back to: we talked to the business, the business said this, and now if we can just do whatever that is, we can achieve this level of value.
And then that conversation becomes much clearer and much cleaner: oh, we're actually leaving €20,000,000 on the table. For those of you who are in management, or even those aspiring to management, or engineers, this is how you talk to business people. Talk to them about how much they're either going to make or going to lose. And then checkbooks start opening. Telling them about Spark, checkbooks don't start opening for that. So as we talk about the amounts and the budgets for this, we get a much clearer path to budgets and money.
One thing I will say is sometimes teams will get a large budget. And instead of scoping that out and scaling that out over time, what they'll do is hire 10 people. And those 10 people will be idle initially. Those could be a mix of data scientists and data engineers. So what you want to do as you start hiring people is roll out gradually, gradually increasing the size of your teams. From there, you can start to look at the value created and the actual amount of how much we need. One common issue with hiring is thinking about how much people should be paid.
And this is where, for those of you who are in larger organizations, or perhaps even cheap organizations, let's be honest, this is going to be an issue: data engineers aren't cheap. And as we just talked about, there's high demand. So we have high demand and low supply. And in those sorts of economics, salaries go up for data engineers, because there's a low supply of them and companies want them. You realize this, you as the manager, you as the team lead, and then you go to HR, who says, I looked up the salary for data engineer, and it's 40,000 US, let's say, or 50,000 US.
Then you have to explain to them, no, you looked up the salary of the data engineer who's SQL focused. This isn't the person we're looking for. We're looking for somebody who is a software engineer specialized in big data. That sub, sub, sub specialization is really what gets their salaries up much higher, where we're looking at salaries that are above an average software engineer. This talk about salaries comes back to that original question of how we get people in. Well, these data engineers aren't money hungry.
But what they are is realists, and they're seeing: somebody gave me an offer for, let's say, $100,000 or €100,000, and here you are trying to offer me 40,000. There's no way in hell that they're going to accept that. So this is where managers need to work with HR. They may need to work on their pay bands. They may even need to establish the title of data engineer. I've had companies where they don't have the title of data scientist, though that's becoming more common, or they won't have the title of data engineer. And so their HR will try to do pay bands based on that. They'll try to use a software engineer's pay band, which isn't high enough either. Or for the data scientist, they're not money hungry either, but they're still looking at significant amounts of specialization.
And the companies will try to give them a data analyst salary. That's not going to work either. This is definitely some legwork that I'd recommend management do ahead of time.
[00:39:35] Unknown:
Today's episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS based monitoring and analytics platform for cloud scale infrastructure, applications, logs, and more. Datadog uses machine learning based algorithms to detect errors and anomalies across your entire stack, which reduces the time it takes to detect and address outages and helps promote collaboration between data engineering, operations, and the rest of the company. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial. And if you start a trial and install Datadog's agent, they'll send you a free t-shirt.
And another interesting aspect of this hiring question is the order in which to bring on these new capabilities. Do you hire all of the elements of the data team at the same time, where you bring on a data scientist and a data engineer and an operations engineer? Or do you start with the data engineer to build out the capabilities for running data pipelines and having the data quality, so that when you do bring on a data scientist, they have something to work with? Or is it better to have them be part of that conversation from the beginning so that there is that interplay of, this is what I'm looking for, and this is what I can give you, and then we converge on the optimal workflow from data engineering through to data science?
[00:40:54] Unknown:
I definitely recommend a data engineer first. But definitely the most popular, or the one that I've seen in the wild most, has been the data scientist hired first. So let me talk through what happens there. Usually, a data scientist being hired first is based on that management misunderstanding of what a data scientist is. They're thinking that data scientists can do it all, that they're that person who can program as well as they can do the modeling side, the advanced analytics side. And, unfortunately, that isn't the case. So what you have is this mismatch of requirements.
And the company is saying, here, we don't have any data. Go make some data products. Go do some data science-y stuff and, question mark, question mark, question mark, profit. And what usually happens there is the data scientist sits idle for 3 to 6 months. Management gets mad and says, why aren't you doing data science-y things? And the data scientist says, where are the data products? I'm here to do data science, not create a bunch of infrastructure. So usually, in those sorts of scenarios and situations, you have about 3 to 6 months before your data scientist will start to look for a new job and leave.
And you as the manager are faced with the prospect of trying to hire another data scientist after you got virtually zero ROI out of your first one. So that is the primary reason why I recommend hiring a data engineer first. That data engineer would be there to start looking at the data products and the can'ts that are available, trying to evaluate what sorts of data products should be created. And then from there, they will start to create those data products. And then you start hiring your data scientists in. And this kind of goes back to one of the questions you were asking about how we hire data scientists, for example.
During your interview with the data scientists, you actually tell them, hey, we've already created the data products for you. Here are the technologies. And then they say, oh, I get to go do that cool data science stuff that I've always wanted to do, instead of, hey, you're going to be hired, and you'll have to do a bunch of your own data work. The data engineer has done all that. From there, we can hire that data scientist. And then, as we start to operationalize more and more, we may bring on an operations person early on. There are various ways to do that. Probably the inverse of your question, Tobias, is how we would go about dealing with, let's say, a data science team without any data engineers.
What we'd want to do there is get our ratios back in order. The ratio of data scientists to data engineers is usually 1 to 2 up to 1 to 5. So there should be 2 to 5 data engineers per data scientist. It's a lot about trying to get your ratio back, hiring the right data engineers, getting a good strong lead on the team, a project veteran is what I call them, and then starting to really build out that data engineering team and
[00:44:10] Unknown:
figuring out a way to pay down or deal with the technical debt usually created by the data science team. That was another question I was going to ask: the proper ratios of data engineers versus data scientists versus operations engineers, because it's not necessarily obvious, where some people might think, oh, well, as long as I have a data engineer, then I can hire on 3 data scientists, and then they can do all kinds of magic and, you know, it'll be puppies and rainbows all day. Whereas then you have 1 data engineer who's overtasked and not able to deliver, and so they might get frustrated and go find another job, which is the opposite situation of the data scientist who's brought on with no data engineer on staff.
[00:44:48] Unknown:
It's definitely that. You'll get frustration on both sides. And this is one piece of advice to you as management: be looking for that. If you're a team lead or a manager on the team, you should be looking for signs of frustration. It's okay for team members to be frustrated. What's not okay is for the team to perceive that the frustration is never going to go away, that the bad juju, the poor ratios, are never going to change. That's when people start looking for new jobs. They start quitting. And one thing to know about losing a data scientist or a data engineer: there's more tribal knowledge held by data scientists and data engineers than other roles, in my opinion.
And as that person quits, you lose that person's tribal knowledge. It's not like you can say, hey, spend the next month before you leave writing down all of your thoughts and all of your tribal knowledge. There's just some inherent tribal knowledge that you'll lose. And that loss will be difficult to regain unless the entire team was in on, for example, that design or that coding. So do be careful. Losing team members is going to be costly, not just for that tribal knowledge, but for the actual effort of going through and finding a replacement. In my experience hiring data engineers and data scientists in Europe, the US, and Mexico, you're looking at a good 3 to 6 month lead time from starting to advertise that job to someone sitting in a chair with that title. So really think about this as you're hiring, to make sure that people are in the right place.
[00:46:36] Unknown:
Another interesting aspect of the role of data teams within an organization is how they might interact with the other engineering groups in the business. If you have a set of software engineers who are building applications and software products, what do you see as being some of the useful relationships between those software teams and the data teams?
[00:46:58] Unknown:
It's an interesting question. One thing I haven't talked about from the book is the case studies I did. I didn't just want people to be reading a bunch of my experience and my thoughts and ideas. I wanted other people's thoughts and ideas to be part of it. So one interview I did was with Dmitriy Ryaboy, who started Twitter's data teams. And we talked a lot about this, because it was an issue early on at Twitter that there were the data teams and then there were the software engineers, and the software engineers wouldn't be on the same page.
They wouldn't be on the same page in terms of product changes, for example. So what could happen there, at least in the stories that Dmitriy told me, was Twitter would change an API, or the way data was laid out, and then the data teams wouldn't know, and they wouldn't know until something broke. So what they did was get the data engineers a seat at the table during those design discussions, so they could say, yes, I know ahead of time when something is changing. In a very similar sense, operations would want to know when there is an operational change, when software is deployed, when we are EOL-ing this or that.
Dmitriy went as far as to say that data teams, data engineers, should be part of the software engineering organization. Otherwise, they may not be perceived as real software engineers. And this can be a definite perception problem, where the software engineers, when they hear data engineer, think of that SQL focused person. They think the data engineers don't know how to code their way out of a paper bag. Well, that could have been true for other data engineers that they have dealt with over their careers. But the data engineers need to actually show, no, we really are good software engineers.
Distributed systems are even more difficult, in my opinion, than regular software engineering or small data engineering. So it's really important that they have that symbiotic relationship with the rest of the organization. We do want our data engineers to know when things are changing and what is changing. And I would say, one of the differentiations that I've seen be really apparent as the difference between a software engineer and a data engineer: it's not just that the data engineer knows the distributed systems side. It's also a love of data, or a significantly different appreciation of data than a software engineer has.
And I say this having been a software engineer for a while. Software engineers think of data as, I put data into the database and that's my interaction with data. When we get into data engineering, we actually think about data as a life cycle. We have to think about how data is dealt with over time. And so by having a seat at the table, the data engineers will be able to say, hey, that change you're making, it's breaking the data model. We're exposing this data model. Could we do it this way? Could we do it that way? Or could we not do this at all? The software engineers may not be thinking about it from that data perspective.
And by having that seat at the table, the data engineer, perhaps an architect, can actually raise their hand and say, hey, this is different. We wanna do something different.
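To illustrate the kind of breakage described above, here is a minimal, hypothetical sketch: an upstream team renames a field, and a downstream pipeline that quietly assumed the old layout only notices when something breaks. A lightweight schema check at the ingestion boundary surfaces the change immediately. The field names are invented for illustration and are not from the episode or from Twitter's systems.

```python
# A hypothetical guard against silent upstream schema changes: fail loudly at
# ingest instead of producing broken data products downstream.
EXPECTED_FIELDS = {"user_id", "event_type", "created_at"}

def validate_record(record: dict) -> None:
    """Raise early if the upstream payload no longer matches what we expect."""
    missing = EXPECTED_FIELDS - record.keys()
    if missing:
        raise ValueError(f"Upstream schema changed: missing fields {sorted(missing)}")

# Example: the upstream team renamed 'created_at' to 'event_time' without telling us.
new_payload = {"user_id": 42, "event_type": "click", "event_time": "2020-10-01T12:00:00Z"}
validate_record(new_payload)  # raises ValueError, flagging the change at ingest time
```

The check is deliberately simple; the larger point from the conversation is that having a seat at the design table means the data team hears about the change before this guard ever has to fire.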
[00:50:21] Unknown:
And so going back to the hiring question and onboarding new people, what have you found to be useful ways of evaluating their talent and their capabilities, recognizing that software engineering interviews, and engineering interviews in general, are typically poorly structured, not very well thought out, and not very consistent? I'm wondering if you can just talk through your thoughts on how that manifests for data roles in particular, and ways to avoid the pain that exists on both sides as a result of these practices.
[00:50:54] Unknown:
It is a difficult thing. In the book, I didn't specifically address how you do interviews and hiring. So what I did instead is, on my YouTube channel, I did a 10 minute video talking about interviews. And in that video, I share what you have to do. For example, for data engineers, you basically have to give them a software engineering interview and a data engineering slash big data interview as well. It's a pretty rigorous thing that has to happen. At my clients where we've done this, there are usually 4 to 5 rounds of interviews, ranging from, let's do some coding, some software coding, your regular software engineering interview, to a separate interview of, now let's talk about distributed systems.
A few other difficult parts I see, specifically with this data engineering interview, are: what are the key parts of software engineering that a data engineer should know? At one client, they were giving candidates the standard software engineering interview that they gave everybody else. In that particular case, it was very web dev heavy. And to be honest, some of the questions were just completely not relevant or were in a different language. A lot of their web dev was in Ruby. And you're just not going to throw a bunch of Ruby questions at a data engineer, for example. Or, how do you do x with something on the web, it just isn't relevant.
What we did is we took the parts of those questions that were data engineer relevant and made that the software engineering side of the interview. These were questions not so much on hardcore data structures, but touching on data structures, touching on what you know. And then for the distributed systems part, it's asking them about the technologies that they'd be expected to know. Do they understand scale? Scale questions. Do you understand what skew is, for example? Those sorts of questions are what distinguish them. Separating out to the operations side, operations people will need to know the usual operations. And in that sense, you'll have to ask them the usual Linux questions.
Do you know Linux and hardware? How do you do troubleshooting of Linux hardware, for example? Those sorts of questions, with more added on: do you know how to operate the framework that we're going to use? Let's say you're doing Spark. Let's say you're doing Kafka. Have you actually done that? And then there's the whole other aspect of scale to that question. Can you operate that at the scale that we need? For data scientists, you'll have a mix of math and modeling, as well as programming. So you aren't going to throw a full software engineering programming interview at data scientists and expect them to pass.
The general metric I usually give is that they should be an intermediate-level programmer. And I say programmer, not software engineer. That means the data scientists will understand syntax, but if you're going to ask them about the finer points of computer science, that just isn't their forte, or frankly necessary for what they're going to be doing. Then the other part of that interview is the statistics, the modeling, choosing the right models, those sorts of questions. In that sense, we have an expanded interview cycle.
Then some companies have the whole culture-fit sort of interview on top of that. It is a significant amount of interviewing.
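To make the programmer-but-not-software-engineer bar Jesse describes concrete, here is a minimal sketch of the kind of data science screening exercise an interviewer might use. It assumes Python with scikit-learn, and the synthetic dataset and model choice are purely illustrative, not anything prescribed in the conversation; the point is whether the candidate can write straightforward code and then explain the modeling and evaluation choices.

```python
# Hypothetical data scientist screening exercise (illustrative only):
# check that the candidate can write plain Python and reason about
# model choice and evaluation, not hardcore computer science.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Small synthetic binary classification dataset.
X, y = make_classification(n_samples=2_000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Fit a simple baseline; a strong candidate can explain when they would
# (or would not) reach for something more complex.
model = LogisticRegression(max_iter=1_000).fit(X_train, y_train)

# Score with area under the ROC curve and discuss whether AUC is even
# the right metric for the problem at hand.
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"Test AUC: {auc:.3f}")
```

The code itself is deliberately unremarkable; the interview signal is in the follow-up discussion about the metric and the model, which is exactly where an inexperienced interviewer can get bluffed.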
[00:54:38] Unknown:
And one of the challenges in particular for hiring for these types of roles, especially if you're starting a brand new team, is having the appropriate amount of expertise already in the company to evaluate the responses to some of these questions, particularly on subjects like data modeling and understanding, you know, how do you evaluate the area under the curve and how applicable is it to this particular problem domain? Or for a data engineer, how do you structure the execution pattern for the DAG within a Spark cluster, and then understanding whether or not the answer they gave you makes sense. I'm wondering what you've found to be the necessary level of understanding and capacity for hiring managers or team leads who are trying to run these types of teams?
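As a concrete anchor for the Spark part of that question, here is a minimal PySpark sketch; the data, column names, and app name are invented for illustration. What an interviewer would probe is whether a candidate can explain lazy evaluation, how the chain of transformations becomes a DAG of stages, and where the shuffle happens, and whether the interviewer themselves can tell if that explanation holds up.

```python
# Minimal PySpark sketch (hypothetical data): reason about the DAG that
# a chain of lazy transformations produces.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dag-interview-sketch").getOrCreate()

events = spark.createDataFrame(
    [("alice", "click", 3), ("bob", "view", 1), ("alice", "view", 7)],
    ["user", "action", "count"],
)

# Nothing executes yet: these calls only build up a logical plan.
per_user = (
    events
    .filter(F.col("action") == "click")
    .groupBy("user")                       # the aggregation forces a shuffle
    .agg(F.sum("count").alias("clicks"))
)

# A candidate should be able to read the physical plan and point out the
# exchange (shuffle) introduced by the groupBy/agg.
per_user.explain()

# Only an action such as show() or collect() actually triggers execution.
per_user.show()
```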
[00:55:28] Unknown:
It's frankly difficult if they don't have this experience already. I've seen this firsthand as well, where they've hired the wrong person simply because the person they talked to could talk the talk, as it were: the candidate knew the questions somebody would Google and had memorized those answers, and the interviewer didn't know how to push deeper, so they couldn't get a definitive pass or fail. They didn't know the follow-up questions to ask. That can definitely be an issue. What I've seen people do, and this is something I've done for my clients as well, is this:
my clients are starting out with their data teams and on their big data journey. They don't know the questions to ask. They don't know the right things to do. And so what we do is we will do that final interview, or we're part of that interview cycle, where I'm there to ask the big data questions. There's someone else, I can't remember the person off the top of my head, who was at some kind of consulting firm too. They weren't doing mentoring like my company, they were doing more of a consulting style, but they offered to do those sorts of interviews as well.
The other thing you can do is kind of draft off somebody else's choices. Let's say there's a company that you know has a good reputation, not just for its people but for engineering quality. What you could do is try to find somebody who has that pedigree. Let's say they were a data engineer at company X, and company X is known for turning out really good data engineers and having really good data engineering. Maybe what you do is some culture-fit interviews and some smoke-test questions to make sure they aren't bluffing.
And then you basically take it on faith that because they've worked at that company, they know what they're doing, and they will join your company and get things started. One thing to note that's related to this, and probably to previous questions, is that those initial engineering hires will set you up for success or failure in other ways. If you hire that first person and that first person is really good, then during your interview cycle, candidates will realize, oh wow, this person is really good, I can actually learn something from them, I'll enjoy being around them, I think very highly of working with that company. Conversely, if you're the one sitting across the table from that person and thinking, oh my god, this person does not know what they're doing, you're thinking, okay, how much of their stuff am I going to have to clean up? What kind of mess have they created?
You're thinking all kinds of negative thoughts. I've seen this firsthand as well: it will actually turn people away. The sort of people who could turn you around are going to be turned off by seeing that and will say, I do not want to be part of your turnaround effort.
[00:58:34] Unknown:
So for managers who are responsible for data teams, what are some of the anti-patterns that you have seen that might have led a team that could otherwise have succeeded down the path of failure?
[00:58:45] Unknown:
I talk about this in the book. There's a chapter that I'm really proud of called Diagnosing and Fixing Teams. It's sort of a manual for how to go through and look at a team and figure out why they're failing. Some of those failures are individuals; some are things the management team did. There are two that come to mind. The first one is looking for a silver bullet. The other one is searching for the holy grail. Silver bullet means one of the executives, someone on the management team, went to some conference, heard somebody talk, read a vendor white paper, and said, oh wow, this big data thing, this data thing, it's gonna save everybody. It's gonna save the company.
Let's do this. And so they, with their hair on fire, haul off and start trying to create that data team, except they don't go through the right process. They don't go through the right steps. They just kind of muddle their way through it as best they can. That doesn't work very well. The other one is the holy grail, and I would say the holy grail is one of the worst ones. I've seen that firsthand as well. What holy grail means is somebody went to a conference and saw some big, respectable tech company, the kind of company everybody wants to be like: Apple, Uber, those sorts of companies.
Nothing wrong with those companies, and they do awesome stuff. What happens is that management will sit in those audiences, or read those blog posts, and say, that is exactly what we're going to do. They forward it on and say, I want this. What they don't realize is that the blog post or the conference talk misses a few things that are actually really relevant. One is how long it took that company to get there. Speaking of Uber again, I think Uber has probably given the only blog post I've ever seen that talked about the progression of how they actually got to their current data infrastructure and data pipelines. They talked about how it was, I think, over the course of 10 or 15 years.
And it was the first time I'd ever seen anybody really say, it took us 10 or 15 years to get to this point. Usually what a manager, especially a layperson manager, is thinking is, oh, they did this over the course of six months to a year, rather than, no, this actually took them a significant amount of time, and it took the team building a lot of velocity to do this and do it right. The other thing I think is missed, or often not talked about, is how much help they had upfront and what the starting point of the team was.
Sometimes the team at, let's say, a San Francisco Bay Area company may have been doing distributed systems and big data for the past 10 years before joining that company, so they're leveraging 10 years of experience. In the rest of the country, or the rest of the world, they may be leveraging one year of experience, two years, maybe even zero years. That's a huge difference in experience. Or what they may not talk about is, no, there was actually a management or consulting team that helped them out significantly, where that consulting team was much more in the background, and there's the exec up on the stage taking the credit for it, which is completely fine.
But what the manager in the audience, thinking they want the holy grail, doesn't realize is, yeah, that company also spent millions of dollars or millions of euros on consultants to help them implement that. And that's okay. But what isn't really known by that manager is that there was a small army of consultants working on that project.
[01:02:34] Unknown:
In your experience of working with all these organizations and extracting these lessons and doing your case studies, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[01:02:47] Unknown:
One was how I came away from those interviews thinking about DataOps. When I was first writing the book and doing the research for it, I wasn't sure what DataOps would be and how important it would be. I definitely came away from doing those case studies thinking, once you get to that advanced state and your limiting factor is friction, DataOps could be the thing that changes your business significantly. That was a very interesting realization, honestly. Another one I was surprised about from the book is how many companies don't understand how you have to work with the business.
You and I have probably both done agile, and one of the basic precepts of agile is that you work with the business. You work with whoever your product owner is, and you work with them consistently. As I started to work with teams, I saw how often software engineers didn't actually work with the product owners and weren't getting things in front of them to show, hey, this is what I'm doing, this is what the product looks like, this is what the data product looks like, this is what the dashboard looks like. And it really underscored a realization that I had well before I wrote the book, and it's basically the reason why I wrote the book: early success and failure is not the result of technology. It is the result of management.
What that means is, if you were successful early on, that is, you actually got something into production, it wasn't that Spark worked well or that Kafka worked well. It was that management found the right people, organized them the right way, and worked with the business the right way. That was really what success was. I had that realization while I was working at Cloudera. Cloudera was a vendor, so my perception was that as long as you used Hadoop the right way, you'd be successful. But as I went into companies and started working with them, I was seeing that, hey, this wasn't a technical problem, even though the company may have said it was a technical problem, that Hadoop didn't work or that Hadoop was too difficult.
Usually, it was plainly an issue of the team not being set up right initially. That's a big thing I'm hoping the people who are listening will internalize: it is management that sets you up for success initially.
[01:05:29] Unknown:
And are there any other aspects of the formation and management of data teams, the interactions of the members within those teams, or the overall space of building data products and building that capacity within an organization, or any other resources or advice, that we didn't discuss and that you'd like to cover before we close out the show?
[01:05:50] Unknown:
Yes, there is one that I didn't talk about, and that is the symbiotic relationship between all three teams, or what I've called high-bandwidth connections in previous books. When you have these three teams and they aren't working well together, that's another example of friction. There's friction between the data teams and the rest of the organization; that's one level of friction. And then there's friction between the teams themselves, where the data scientists don't work well with the data engineers. What I found super interesting in those teams is that neither is leveraging the other. The most common manifestation I've seen is that the data science team does not believe in or trust the data products coming out of the data engineering team.
So they create their own data products and their own data infrastructure, and you have a complete lack of leveraging anything and, quite frankly, a complete duplication of effort. As this happens, the teams just go nowhere. And it all starts with either a political issue or a lack of understanding between the teams. It's a big problem. What I would look at, if I were a manager of a team or an organization with existing data teams, is: do they have high-bandwidth connections? Do they have a symbiotic relationship, or do they have an adversarial relationship?
If they have an adversarial relationship, you need to start looking at why. Why do the teams not like each other? Do the teams, whatever, fill in the blank. It could be something as simple as they've just never had to work together before. And it's up to management, in my opinion, to fix this. It's not going to be the individual contributors who say, we should all hold hands and sing kumbaya. It should be the manager saying, no, this isn't right. This isn't the way we should be working together. We shouldn't be duplicating our efforts. We should have a lot less friction amongst ourselves. And how would a manager do that? By taking some real, honest looks at what is happening.
Do the data engineers just completely blow off the data scientists? Do the data scientists completely blow off the data engineers? Are the data engineers on the team actually meeting the definition of data engineers? I've seen that before, where the data scientists say, no, you guys actually can't create data products, you're SQL-focused data engineers, and we really can't do anything with you. It takes more of a 360-degree view of yourselves, some real honesty, and some deep introspection to figure out what's happened and how to fix it.
[01:08:44] Unknown:
For anybody who wants to get in touch with you or follow along or dig deeper on these topics, I'll have you add your preferred contact information to the show notes. And as a final question, I would like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[01:09:00] Unknown:
It's actually a more holistic thing, I believe: nobody's ever taken a whack at the whole ecosystem. Eric Sammer, if you know him, he and I used to work together at Cloudera. He's now at Splunk, and he's been tweeting about this, and I've been thinking about it too. Part of our issue in data is that we're trying to bring together several different technologies where how they integrate with the rest of the ecosystem was either loosely thought about or never really thought about at all. Let's say technology X was built to do this, but they never thought about integrating with technology Y. That forces the data engineers, or perhaps even a separate third-party product, to bring that integration together.
And so we spend a ton of our time trying to wire technologies together when, if we could wave a magic wand, it would be better if they had all been integrated together from the start. That was my hope for what we were doing at Cloudera. It didn't quite pan out. So then I started thinking, well, maybe the cloud providers; it's almost as if the cloud providers have the best odds of doing this. Somebody tweeted out, I can't remember who, imagine what could happen if somebody were to start acquiring some of these companies: bring in Snowflake, bring in, let's say, a Kafka company like Confluent, a database plus pub/sub plus processing engine.
If somebody were to buy all three of those and really put all of their time into making the integration dead simple, it wouldn't make the programming side dead simple, but it would make the integration side dead simple. I think it would change the landscape of what we'd have to do dramatically, and it would make our job so much easier. But, alas, it hasn't happened yet, and we can cross our fingers that maybe Amazon will start doing this better, or Microsoft, or GCP.
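As a concrete illustration of the wiring work Jesse describes, here is a minimal sketch of the kind of glue a data engineer ends up writing between two systems that were not designed together: reading events from Kafka with Spark Structured Streaming and landing them as Parquet files. It assumes PySpark with Spark's Kafka connector package available on the classpath; the broker address, topic, and paths are placeholders, and a real pipeline would also handle schemas, bad records, and the downstream warehouse load.

```python
# Sketch of integration "glue" between Kafka and file storage using
# Spark Structured Streaming. Broker, topic, and paths are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("kafka-to-parquet-glue").getOrCreate()

raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder broker
    .option("subscribe", "events")                         # placeholder topic
    .load()
)

# Kafka keys and values arrive as bytes; decode them before writing downstream.
decoded = raw.select(
    F.col("key").cast("string"),
    F.col("value").cast("string"),
    "timestamp",
)

# Continuously append decoded records as Parquet; the checkpoint lets the
# stream resume where it left off after a failure.
query = (
    decoded.writeStream
    .format("parquet")
    .option("path", "/tmp/events")                  # placeholder output path
    .option("checkpointLocation", "/tmp/events_ck")
    .start()
)

query.awaitTermination()
```

None of this is hard individually, which is the point: it is undifferentiated wiring that a better-integrated ecosystem, whether from a vendor roll-up or a cloud provider, would make unnecessary.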
[01:11:07] Unknown:
Well, thank you very much for taking the time today to join me and discuss the work that you've done on writing the book about how to manage data teams effectively, but also all of the work that you've done with these organizations to understand the space. It's definitely a very interesting and necessary topic. So I appreciate the time you've put into that, and I hope you enjoy the rest of your day.
[01:11:28] Unknown:
Thank you. I appreciate you having me. And everyone who's listening, I appreciate you listening and taking the time to think about this and see how we can improve things in the data engineering field.
[01:11:44] Unknown:
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Sponsor Messages
Interview with Jesse Anderson: Best Practices for Data Teams
Defining the Mission of Data Teams
Specialized Roles in Data Teams
Understanding Big Data: The 'Can't' Definition
Medium Data: The In-Between
Identifying and Training Internal Talent
Strategies for Attracting External Talent
Organizing Data Teams: Centralized vs. Embedded
Measuring ROI and Value Creation
Hiring Order: Data Engineer vs. Data Scientist
Interaction with Other Engineering Teams
Evaluating Talent and Interview Strategies
Anti-Patterns in Data Team Management
Lessons Learned from Case Studies
Symbiotic Relationships in Data Teams
Contact Information and Closing Remarks