Summary
Applications of data have grown well beyond the venerable business intelligence dashboards that organizations have relied on for decades. Now data is being used to power consumer-facing services, influence organizational behavior, and build sophisticated machine learning systems. Given this increased importance, it has become necessary for everyone in the business to treat data as a product, in the same way that treating software as a product transformed businesses in the early 2000s. In this episode Brian McMillan shares his work on the book "Building Data Products" and how he is working to educate business users and data professionals about the combination of technical, economic, and business considerations that need to be blended for these projects to succeed.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Today’s episode is Sponsored by Prophecy.io – the low-code data engineering platform for the cloud. Prophecy provides an easy-to-use visual interface to design & deploy data pipelines on Apache Spark & Apache Airflow. Now all the data users can use software engineering best practices – git, tests and continuous deployment with a simple to use visual designer. How does it work? – You visually design the pipelines, and Prophecy generates clean Spark code with tests on git; then you visually schedule these pipelines on Airflow. You can observe your pipelines with built in metadata search and column level lineage. Finally, if you have existing workflows in AbInitio, Informatica or other ETL formats that you want to move to the cloud, you can import them automatically into Prophecy making them run productively on Spark. Create your free account today at dataengineeringpodcast.com/prophecy.
- StreamSets DataOps Platform is the world’s first single platform for building smart data pipelines across hybrid and multi-cloud architectures. Build, run, monitor and manage data pipelines confidently with an end-to-end data integration platform that’s built for constant change. Amp up your productivity with an easy-to-navigate interface and 100s of pre-built connectors. And, get pipelines and new hires up and running quickly with powerful, reusable components that work across batch and streaming. Once you’re up and running, your smart data pipelines are resilient to data drift. Those ongoing and unexpected changes in schema, semantics, and infrastructure. Finally, one single pane of glass for operating and monitoring all your data pipelines. The full transparency and control you desire for your data operations. Get started building pipelines in minutes for free at dataengineeringpodcast.com/streamsets. The first 10 listeners of the podcast that subscribe to StreamSets’ Professional Tier, receive 2 months free after their first month.
- Your host is Tobias Macey and today I’m interviewing Brian McMillan about building data products and his book to introduce the work of data analysts and engineers to non-programmers
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what motivated you to write a book about the work of building data products?
- Who is your target audience?
- What are the main goals that you are trying to achieve through the book?
- What was your approach for determining the structure and contents of the book?
- What are the core principles of data engineering that have remained from the original wave of ETL tools and rigid data warehouses?
- What are some of the new foundational elements of data products that need to be codified for the next generation of organizations and data professionals?
- There is a lot of activity and conversation happening in and around data which can make it difficult to understand which parts are signal and which are noise. What, if anything, do you see as being truly new and/or innovative?
- Are there any core lessons or principles that you consider to be at risk of getting drowned out in the current frenzy of activity?
- How do the practices for building products with small teams differ from those employed by larger groups?
- What do you see as the threshold beyond which a team can no longer be considered "small"?
- What are the roles/skills/titles that you view as necessary for building data products in the current phase of maturity for the ecosystem?
- What do you see as the biggest risks to engineering and data teams?
- What are the most interesting, innovative, or unexpected ways that you have seen the principles in the book used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on the book?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
- Building Data Products: Introduction to Data and Analytics Engineering for non-programmers
- Theory of Constraints
- Throughput Economics
- "Swaptronics" – The act of swapping out electronic components until you find a combination that works.
- Informatica
- SSIS – Microsoft SQL Server Integration Services
- 3X – Kent Beck
- Wardley Maps
- Vega Lite
- Datasette
- Why Use Make – Mike Bostock
- Building Production Applications Using Go & SQLite
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
The only thing worse than having bad data is not knowing that you have it. With Bigeye's data observability platform, if there is an issue with your data or data pipelines, you'll know right away and can get it fixed before the business is impacted. Bigeye lets data teams measure, improve, and communicate the quality of your data to company stakeholders. With complete API access, a user-friendly interface, and automated yet flexible alerting, you've got everything you need to establish and maintain trust in your data. Go to dataengineeringpodcast.com/bigeye today to sign up and start trusting your analyses.
Your host is Tobias Macey, and today I'm interviewing Brian McMillan about building data products and his book to introduce the work of data analysts and engineers to non-programmers. So Brian, can you start by introducing yourself? Thanks for having me on. My name is Brian McMillan. And professionally,
[00:01:47] Unknown:
I'm a longtime enterprise architect working in large corporations, primarily focused on data and analytics problems within those big companies. Even though EDS and HP are technology companies, a lot of what they're doing is just providing enterprise services to others. And they're definitely old-school companies. More recently, I was working at a major defense contractor. I left my job in October 2020 to write a book about building data products called Building Data Products: Data and Analytics Engineering for Non-Programmers.
[00:02:26] Unknown:
And do you remember how you first got involved in the area of data management?
[00:02:29] Unknown:
Yeah. So my degree is in economics, not computer science, and I am a business guy. That's really colored a lot of my career going forward. So we have to go back to the mid-nineties. My first real job was as a data analyst working for Electronic Data Systems in a General Motors manufacturing plant. And 1 of the things that was kind of unique to the business at that time is that the IT people were actually embedded in the customer's organization. So I got to participate in the daily management stand-ups. My job was to really run their manufacturing war room.
And the biggest thing was that they introduced me to the theory of constraints. It was an engine plant. They had old technology, GM was going through a lot of turmoil at that time, and the plant was always on the verge of being shut down. They decided to go all in on theory of constraints, and that made a big impact on me, because I saw, basically, a failing company turn themselves around, and it kept that plant open for almost another 10 years. The next 15 years I basically spent architecting, product managing, people managing, and hands-on building a variety of enterprise data management systems for EDS, HP, and Raytheon. And because you've asked this in a number of other episodes, I've been working remotely for most of the last 15 years, which I think is pretty unique.
So, you know, COVID sending us all home was actually very pleasant for me.
[00:04:02] Unknown:
Yeah. Levels the playing field. I've worked remote off and on throughout my career. And so when everything went fully remote, just no ifs, ands, or buts about it. It was kinda nice for me because it gave me an excuse to go back to being full time remote instead of being primarily in the office with part time remote. Yep. And there are a lot of
[00:04:23] Unknown:
really significant downsides to that, but a lot of positives and, hey, we didn't have a choice. Exactly.
[00:04:30] Unknown:
And so as you mentioned, you recently wrote this book. You made the decision to quit your job to spend the time on it. So I'm wondering if you can talk to some of the motivation for deciding that this book was necessary, that this was the time to do it, and that you wanted to actually have that full time focus on it, and just some of the overall story behind how you came up with the idea, the motivation around it, and how it came to be?
[00:04:55] Unknown:
So when I started working for Raytheon, I didn't hire into the IT department. I hired into the quality department. And it was pretty clear when I hired on that they had a really big data problem they were trying to solve. So I came in to build a data warehouse for their quality data. And having done this a whole bunch of other times, this was the first time in a long time that I had actually been really hands-on. You know, there's a certain point in an architect's career where they're no longer allowed to touch things. That was pretty frustrating to me, feeling like I wasn't able to touch things and work on them. So this was a great opportunity.
They had had a lot of trouble with the IT department. You know, the idea of a business organization getting a database to do their own database work was not very popular. So they had managed to get themselves a database. They had some really good ideas of projects they wanted to work on and problems they wanted to solve, but they needed somebody to actually come in and do that work. So I got hired in and started to do that work for them. It was pretty clear that it was way more work than 1 person could do, which is never a good idea anyway, but that's where you always start. I got the opportunity to do a lot of training on that team with people who had never touched a database before.
And they had good domain knowledge, but they didn't have the technical skills. That started a ball rolling. And probably 1 of the biggest things that I learned was how important that domain knowledge, that domain expertise, really is. So 1 of the first things we did was, well, what's our production yield? We don't have a good way to look at our production yields, and that should be pretty simple. We'll just go in, we'll look at the data from the warehouse, and we'll just write some reports. That wasn't what it turned out to be at all. What we quickly found out was that they had hundreds of serialized parts that were being recycled through the production process. So you go to do your recursive query to figure out what your bill of materials looks like, and you can't, because you've got loops.
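The recursion problem described here can be sketched in a few lines: a naive bill-of-materials walk loops forever when a serialized part re-enters the process, so the traversal needs explicit cycle detection. This is a minimal illustrative sketch, not code from the book; the part numbers and structure are invented.

```python
# Hypothetical bill-of-materials: part -> assemblies it feeds into.
# Part "A100" re-enters the process (the "Swaptronics" loop).
bom = {
    "A100": ["SUB1"],
    "SUB1": ["TOP"],
    "TOP": ["A100"],   # loop: serialized part recycled into a new assembly
}

def where_used(part, bom):
    """Walk up the BOM, recording each parent once and flagging loops."""
    seen, stack, loops = set(), [part], []
    while stack:
        p = stack.pop()
        for parent in bom.get(p, []):
            if parent in seen or parent == part:
                # a naive recursive query would chase this edge forever
                loops.append((p, parent))
            else:
                seen.add(parent)
                stack.append(parent)
    return seen, loops

parents, loops = where_used("A100", bom)
# parents is the set of assemblies reached; loops lists the cyclic edges
```

The `seen` set is the whole trick: without it, the walk re-visits `A100` indefinitely, which is exactly why the team's recursive query "couldn't be done."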
So, you know, as a technical person, what you do is start chasing all these rabbit trails to try to figure out how this could possibly be happening, because it doesn't make sense. Instead, we were really lucky on that team. We had somebody who had been in manufacturing for, oh jeez, let's just say decades, and he had a story for everything. So I learned the story of Swaptronics. Unlike a lot of businesses, you know, Raytheon, they are rocket scientists, and all of the work that they do is just bleeding edge. The sensor systems in particular are tough to test, so parts fail test. They pull a part out. They put it back on the shelf. It eventually goes back in another device, and sometimes it matches up with the rest of the components and sometimes it doesn't.
I never would have known that that's what was going on unless I had talked to someone who had been on the plant floor and said, oh yeah, we do this all the time, it's no problem. Well, it is a big problem, and we need to stop doing it. That pivoted our whole work to trying to figure out how to solve that problem. And it became very clear that domain knowledge is absolutely the most important thing you can have in order to solve the business problems you face. There's no way around it. You can't algorithm your way out of it. The other thing that motivated me was I just couldn't believe it literally took that team 7 years to get a production SQL Server database on the network that they had access to, and it didn't come with any ETL tools.
Like, what good is that? So the big thing that prompted me is I've been in this business a long time. And as an industry, I would like to think that at some point we'd be ready to come to terms with the fact that we keep doing the same things over and over again. We keep reinventing the wheel. Most of our projects still fail, and we tend to collect a lot of data that we don't know what to do with, just because it's fairly easy to collect a lot of data. But that ends up generating a lot of technical debt. And then for enterprise companies in particular, their whole operational model is all centralized.
You know, we have centralized IT departments, and you go to the data modeler, and you go to the ETL person, and you go to the report developers, and we've gotta have project managers managing the whole thing, and it just doesn't work. Outside of big enterprise companies, we are doing things to fix that problem. But there's a huge opportunity inside big enterprise companies to solve it. It's definitely interesting how,
[00:09:56] Unknown:
you know, the current buzzword is the modern data stack of everything as a service. It's easy to just get a new database. You just throw a credit card at it, but that's only the case for a certain subset of the industry. And as you pointed out, in the enterprise, you have these procurement paths. Like, you can't just throw a credit card at something, because that credit card is being held by a gatekeeper that has a stockade of paperwork to fend you off. And so
[00:10:22] Unknown:
yeah. That's a big nut to crack. I don't know how to solve that problem. I mean, I know how to solve it in a subversive way. Quite frankly, that was 1 of the motivators of the book. There is a bit of subversion in the book. Like, here's a whole bunch of free software that you can implement to do basically everything you'd wanna do. Orchestration, serving, storage, you name it. It's all in here. You can do that if you want. You may or may not wanna do that.
[00:10:57] Unknown:
Sometimes you have no choice. Right. Yeah. I mean, there's definitely the double-edged sword of shadow IT of, okay, these people are unblocked. They're able to get their job done, but they're not necessarily doing it in the most effective way, or they're reinventing their own wheel that's already been solved by somebody else in the organization. And so there is that problem of being able to connect up all the people who have the right problems and solutions.
[00:11:20] Unknown:
Yeah. I lucked out as an architect. 1 of the side jobs I've always had is: go find the shadow IT teams. Go find them, find out what they're doing, and decide what we should do about them. And a lot of times, it's, we're gonna give you funding. We're gonna give you additional resources to scale your solution up. Yep, you've got a great start of a server monitoring platform in Australia. We're going to take that, but we're gonna have to rewrite the entire thing, and you get a central role to help build it. That isn't what a lot of IT organizations do.
The first position is almost always to shut these teams down, make it harder for them, so that they have to bring us requirements, and we'll work on those requirements and rebuild their thing for them. But that isn't what they want. They just want to meet the business problems they have. And, generally, they have a very good idea of what their business problems are. As IT folks, we tend not to know what the real business problems are that we should be focusing on, because we're focused on technology, because we like bright shiny things. Yeah. And I think that that's probably why
[00:12:29] Unknown:
the data ecosystem has been going through such a long and cyclical route of self-discovery, because it's never just the technical solution, and it's never just the business problem. And it's always hard to get both of those sides in the same room, agreeing with each other, and even speaking the same language. So I think that that's probably why we keep going through these, oh, well, we'll build this new evolution of this technology platform, and that's going to solve our problems. And, like, nope. Still have the same problem.
[00:13:02] Unknown:
Yeah. Absolutely. You know, enterprise application integration. Well, it smells a lot like Kafka.
[00:13:10] Unknown:
Absolutely. And now that we have Kafka, we're still running into the same problems of, okay. Well, we just put everything into this service bus, but now we don't actually know everything that's using it. Or, oh, we just broke the contract for this data structure because it was being used by this other service that is actually mission critical now.
[00:13:29] Unknown:
Yeah. Yeah. It always has been and probably always will be. And then there's that pendulum between centralized and decentralized. Like, right now, the big thing is, you know, there's the modern data stack, the 100 vendors that are inside of that, and all the overlap and whatnot. And then you have the decentralized pendulum swinging pretty hard, with things like data mesh. We'll probably talk later about that. But that centralized-decentralized pendulum has always been swinging around. And it feels to me like right now, that pendulum has kind of peaked or is close to peaking, and a lot of the modern data stack vendors are starting to get big and starting to be more centralized. And I think that's 1 of the things we're gonna see: some pretty aggressive consolidation in the business. Yeah. Absolutely. And another interesting
[00:14:20] Unknown:
trend is the repackaging of the modern data stack by companies such as Mozart Data: okay, so you've got 5 different tools that you need to use to do this 1 thing, so we're just going to be that 1 bill for you, and we're gonna run those 5 tools on your behalf.
[00:14:36] Unknown:
Yeah. Which I think to enterprise customers is gonna be very appealing. Absolutely. Because the people who are buying those systems are not the people who are actually going to use them. So that value proposition that we can take this complex, intertwined architecture and package it up, and we can guarantee you, knock on wood, that everything works together nicely. But, again, the problem is you've got all these tools that people need to learn. You've got business problems that you need to try to solve. And all of that's very difficult to deal with. Absolutely.
[00:15:16] Unknown:
And so in terms of the book itself, I'm wondering, as you were setting out to write it, what were you keeping in mind as your target audience, the primary goals that you were trying to achieve and help them realize through the creation of this book, and the kind of core lessons that you're trying to impart? Well, 1 is, you know, architects at traditional IT departments,
[00:15:40] Unknown:
you know, used to the typical traditional IT stack. Okay, we have Oracle databases or Microsoft databases. We're using Informatica or Integration Services. If you're a Microsoft stack, we may have stood up Analysis Services, and we've got a whole bunch of other web apps that we may have built internally. If you're fortunate enough to have people who are doing application development in your organization, they probably switched a long time ago to modern software development practice, which is 30 years old at this point. They're doing continuous integration, continuous deployment, test-driven development. All of those things that help you build a better software product, they're doing that. But on the data side, we tend not to do that. We gave up on testing data a long time ago. Just too complicated, we can't do it. You know, if the file doesn't show up and the dashboard breaks, well, then somebody will call us. The executive will call us and say the dashboard hasn't updated, where's the fix? Then we'll go fix it. And we really shouldn't be behaving that way. You know, people have figured these problems out.
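Testing data doesn't have to be complicated: the same kind of small assertions application developers use can catch a missing file or a bad batch before the executive's dashboard breaks. A minimal sketch, where the column names and rules are invented for illustration, not taken from the book:

```python
# Minimal data tests: fail the pipeline early instead of waiting for
# someone to notice a stale dashboard.
def check_sales(rows):
    """Run basic assertions on a batch of sales records (list of dicts)."""
    errors = []
    if not rows:
        errors.append("file was empty or never showed up")
    for i, r in enumerate(rows):
        if r.get("quantity", 0) <= 0:
            errors.append(f"row {i}: non-positive quantity")
        if not r.get("invoice_id"):
            errors.append(f"row {i}: missing invoice_id")
    return errors

batch = [
    {"invoice_id": "536365", "quantity": 6},
    {"invoice_id": "", "quantity": -1},   # bad row, should be caught
]
problems = check_sales(batch)
```

In a real pipeline, a non-empty `problems` list would stop the run and alert the team, which is exactly the continuous-integration discipline the speaker says data teams gave up on.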
So architects need to know that there are alternate ways of doing things. And I'm presenting in the book kind of a ridiculous minimum viable product that's got the entire stack. In the technical part of the book, we're building a little cupcake of a data product, looking at sales data. And the data is actual sales data from a company in the UK. It's very messy data. And at the end of the day, that data gets exposed as APIs and a web GUI on Google's cloud infrastructure. And you can do that. And it's not really that complicated. And you can treat everything as code, and you don't need to resort to GUI applications.
And that may not be where you're at right now, but you need to start thinking about that. The second target, I've talked briefly about this before, are those shadow IT teams. You know, the teams who are just fed up, and maybe they need to learn some better tools to really take it to the next level. And then maybe the IT departments will come along and help them out.
[00:17:58] Unknown:
In terms of the contents of the book, you mentioned that you're working with real sales data. You're iterating through building out a small scoped data product from that information. I'm wondering if you can talk to the overall approach that you took for deciding what was the structure of the book, what were the kind of main technological choices you were going to lean on for determining this is how I'm going to impart these core lessons and some of the ways that you are able to talk through these are the technologies that we're using, but these are the actual fundamental principles that we really care about here. So, you know, not to get too bogged down into the specifics of tool x or y. Well, that was the first thing. Don't
[00:18:39] Unknown:
make it too tool-centric. 1 thing I didn't wanna do was make this a dbt-and-Snowflake text. So the start of it was really getting back to those core business concepts, starting with things like product life cycle. All products go through this S-curve product life cycle. They start out in, and I'm using Kent Beck's model here, 3X, which he says he developed at Facebook, but, you know, it's in the Extreme Programming book already. Yep. Things start out in exploration. In exploration, you don't know what's going to work. You don't know what's gonna bring value. So what you should be trying to do is run lots of little experiments.
And as you get traction on real business value, do the next step, then do the next step. And, eventually, you'll make this transition where you're expanding. You're bringing on more users. You're delivering more value. The product's getting more complicated. And then you make another transition at some point where the value starts to level off. And you're in what he calls extract, where you're really in a process of making a good tool better, not necessarily making big improvements to it, but making small incremental improvements and just extracting whatever value you can out of it. You're not really investing a whole lot of new effort into the product, but you're getting the most out of it. And then the thing that isn't in the 3X cycle is the exit phase, which is critically important. At some point, you have to start winding that product down.
Hopefully, it's gonna be replaced by something else. Or your environment may say, you know, we don't need this thing anymore, and it drops off a cliff. So that's the first thing people need to understand. And with that product life cycle, you know where you prefer to sit: if you're a builder, and you like challenges, and you like fighting fires, you're an expander. If you're somebody who's more concept-based, you're probably an explorer. And there are plenty of people who just wanna make things run like clockwork. That's the first big business concept. The second 1 is that you need to understand how your company makes money.
You need to understand the value chain, and this is probably where all these projects and teams need to start. Do we know how our company makes money? A great example of that was when HP and EDS merged. I was the architect for the availability and capacity organization, and both companies were supporting about 200,000 servers apiece. So we were gonna put 400,000 servers' worth of performance data into a single data warehouse, and it wasn't working very well. We had lots of arguments about how to do that and what was important. And the key came down to understanding that value chain. As it turns out, about 90% of Electronic Data Systems' business was in hands-on capacity management. We had people looking at server performance and making recommendations for how you should manage your servers, where HP was only 10% advanced management.
Totally different business models. The solutions in those 2 cases are going to have to be different solutions. The money is being made differently, and it behooves you to figure that out and treat those 2 value chains separately. The advanced level of that is starting to look at things like Wardley maps, breaking that value chain down by the product life cycle phase each piece is in, and treating those pieces of your solution differently. Yeah, Wardley maps are fantastic. And then the 3rd business concept is that, just like products have life cycles, companies have financial life cycles. Like, in a start-up, your job is to just figure out how to get somebody to pay you something.
It may not be the product you wanna build, and probably won't be the product you intended to build, but your job is to find a way to make money. And then eventually you get customers, you're profitable, but your profits are fluctuating. And there are different metrics that you can look at. The academic name for that is throughput economics. There's some good work in there from the theory of constraints folks about what metrics you should be measuring. You know, when is it appropriate to do things like measure your throughput and when is it not? Throughput's kinda central to theory of constraints, but it's not always appropriate to measure it. So that's on the business side, and that's really the first third of the book. The second third of the book is about demonstrating some key technical skills: how to write some basic stories with little to no planning, and how to get familiar with command line tools. You know, most people, particularly in traditional IT departments, are very GUI-application focused, and we are terrified of the command line. But there are a lot of great utilities out there that just require you to write, you know, 5 words, and you can be deployed on Google's infrastructure.
It's really amazing. And once you start to see that, you go, oh, wait a second, I can work in a completely different way. SQL, of course: there's been a renewed realization that just plain SQL gets you a long way into your problems. Well, I'll say that it brings some problems with it, and it can be used inappropriately. Automation and orchestration: that's a big problem, particularly for the shadow IT teams. Yeah, great, we built this great thing. We've got a great data model. We've got beautiful reports, but I still need to come in on Monday and push buttons and babysit things. You don't need to do that. In the book, I use Make and use a Makefile for everything.
And that 1 hung around in my head for a long time, after watching a presentation on Airflow years ago. And the description was, well, it's just like Make for data. That's interesting. You know, I've compiled programs with Make, but I never really thought about using it for data. It's awesome for that. The syntax is easy. If everything you're doing is on the command line, you're just stringing a bunch of command lines together without having to write some nasty Python script or some nastier bash script. Just really clean. And then 1 thing that I don't really talk about in the book, but is critically important, is version control.
Get your code in version control. Stop putting it on file servers. Yes, I know you are doing that. Stop it. Don't do that. Even if it's just check-in and sync up, start doing it, and do trunk-based development, because when you get to CI/CD, it'll make that a lot easier. Those kinds of things are new concepts to most, you know, enterprise data people. How do you treat a Tableau report as text or as code? You can't. It's hard.
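The "Make for data" idea described above boils down to one rule: rebuild a target only when it's missing or a source it depends on is newer. A real Makefile expresses this declaratively; the sketch below just illustrates the freshness check itself, with invented file names:

```python
import os

def needs_rebuild(target, sources):
    """Make's core rule: rebuild if target is missing or older than any source."""
    if not os.path.exists(target):
        return True
    t = os.path.getmtime(target)
    return any(os.path.getmtime(s) > t for s in sources)

# Toy two-step pipeline: raw.csv -> clean.csv (names are illustrative).
with open("raw.csv", "w") as f:
    f.write("invoice_id,quantity\n536365,6\n")

if needs_rebuild("clean.csv", ["raw.csv"]):
    with open("clean.csv", "w") as f:
        f.write("invoice_id,quantity\n536365,6\n")   # stand-in for the real transform
```

Run it twice and the second run does nothing, which is exactly the "no Monday-morning button-pushing" property: each step fires only when its inputs have actually changed.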
[00:26:04] Unknown:
Today's episode is sponsored by Prophecy.io, the low-code data engineering platform for the cloud. Prophecy provides an easy-to-use visual interface to design and deploy data pipelines on Apache Spark and Apache Airflow. Now all the data users can use software engineering best practices: Git, tests, and continuous deployment with a simple-to-use visual designer. How does it work? You visually design the pipelines, and Prophecy generates clean Spark code with tests and stores it in version control. Then you visually schedule these pipelines on Airflow. You can observe your pipelines with built-in metadata search and column-level lineage.
Finally, if you have existing workflows in Ab Initio, Informatica, or other ETL formats that you want to move to the cloud, you can import them automatically into Prophecy, making them run productively on Spark. Learn more at dataengineeringpodcast.com/prophecy. Given the fact that you are trying to cover these technical practices and keep them rooted in the business requirements, I'm wondering: as technologists, we are often tainted by the hubris of, oh, well, this is technology and it's hard, so I'm not going to expose that level of detail to the business users. I'm just going to make it as pretty as possible, give them some pretty pictures so that they can make their own decisions, and we'll use that as the handoff. And I'm wondering what you see as the realistic expectations for business users to actually adopt the core technological tools and approaches that we as engineers have become accustomed to, and vice versa: getting engineers to actually understand the various business concepts, the economic modeling, and all of these process and organizational concerns beyond the realm of the technologies and tools that we're using to make bits fly around the ether?
[00:28:05] Unknown:
Yeah. Well, by far the hardest thing is to get technical people to understand the business, because the business people have enough trouble with it themselves. If you're in the marketing department, you're focused on marketing things. You're not focused on, is it possible to build the thing that you're trying to sell? So business people have this trouble as well. But the more people who can understand, you know, how to think holistically about the value chains for the company you're in, the better off everyone will be. It will be much easier to have conversations about what's valuable or not, because we spend a lot of time suboptimizing things and just making busywork for ourselves because we have to be busy. Right?
So getting technical people to understand the business is the most difficult. For getting the business-focused people to understand the technical stuff, you have to keep your eyes open. There are a lot of analysts who are really good SQL developers, or could be if you taught them common table expressions and window functions. There are lots of analysts out there, you know, writing their Tableau reports, who can write subqueries, who can write the spaghetti-code nested subquery thing. If you just spend 5 minutes teaching them how to do a common table expression and pop that logic out into an independent piece that they can eyeball and maybe someday test, they will be ecstatic, because it now gives them some confidence to start doing other things.
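As a small illustration of that refactoring, here is a sketch using Python's built-in sqlite3 module; the table and column names are made up, not from the book:

```python
import sqlite3

# Toy data: a hypothetical orders table.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE orders (customer TEXT, amount REAL);
INSERT INTO orders VALUES ('a', 10), ('a', 30), ('b', 5);
""")

# The nested-subquery version an analyst might start with:
nested = """
SELECT customer, total
FROM (SELECT customer, SUM(amount) AS total
      FROM orders GROUP BY customer)
WHERE total > 20;
"""

# The same logic with a common table expression: the intermediate
# result has a name and can be eyeballed (or tested) on its own.
with_cte = """
WITH customer_totals AS (
    SELECT customer, SUM(amount) AS total
    FROM orders
    GROUP BY customer
)
SELECT customer, total FROM customer_totals WHERE total > 20;
"""

print(con.execute(nested).fetchall())    # -> [('a', 40.0)]
print(con.execute(with_cte).fetchall())  # -> [('a', 40.0)]
```

The two queries return identical results, but the CTE version gives the aggregation a name (`customer_totals`) that can be inspected on its own, which is the confidence-building step described above.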
If you can get people comfortable with SQL, and you can get people comfortable with checking it in to a centralized repository that's shared with everybody else, then try to do some kind of teaming, you know, pair programming, that kind of thing, so that the knowledge is spread around. Eventually, you're gonna find someone who really wants to know how to write Python and R, and they've got the basics down. Now you could get a case where you've got someone who knows how to get the data that they wanna use, and now they have the technical skills to start doing more data-science-y type things. Lord knows there are a lot of data scientists who can't get data out of databases, which blows my mind, but it happens a lot.
They're just not comfortable with it. So it goes both ways, which gets to a bigger question: you have to start with what you have. A big part of data literacy is just trying to figure out where you are right now. What skills do you have? What capabilities could you get the team up to quickly? That's a hard problem. That's really the tricky part of data literacy.
[00:30:42] Unknown:
Yeah. The interesting aspect of it, you know, beyond just the technical bits, is understanding: how do you create and propagate context for the information that you're actually using? How do you understand the statistical and semantic elements of manipulating the data? What are the downstream impacts of these mutations as far as what you can and can't do with the data after you've made them? So it's definitely a large and complicated sea of concepts no matter what your background is.
[00:31:17] Unknown:
Yeah. I think the big thing is there aren't enough data people to go around, and we have to make more data people. Absolutely. Yeah.
[00:31:26] Unknown:
Yeah. That's really what it boils down to. There's a presentation by Jez Humble that I referenced in another interview recently, from early on in the process of DevOps adoption, saying, you know, stop trying to hire your DevOps people and create them instead, or something to that effect. And we're definitely in that same phase with data, where we're not gonna hire our way out of these problems. We have to start educating everybody who's already working on the problem to understand it more thoroughly, so that they can do the things that need to get done, rather than trying to hire the next data scientist or data engineer who already has all the skills, you know, but in a tool that we're not actually using right now.
[00:32:09] Unknown:
Yeah. That's a great point. You know, these DevOps practices are 98% applicable to data. Like, I hate the term DataOps. It's the same thing, people. We're shooting for the same target. We want reliable, reproducible products. That's all we're looking for. Absolutely.
[00:32:28] Unknown:
And so to that point, another thing that has come up in some of my conversations recently is, you know, maybe the idea of the data engineer or the analytics engineer or what have you is starting to be on the wane. And we don't actually need these specific job titles, but we really just need our developers who understand how to work with data. And so that just needs to become the kind of baseline status quo of if you're an engineer, everybody needs to understand these concepts because it's just becoming more ubiquitous, and so we need to generalize and not specialize in these regards.
[00:33:05] Unknown:
Yeah. You know, it's that Conway's law thing. If you've got data engineers and analytics engineers and data scientists, oh, and then you've got the platform folks, you know, your site reliability engineers and whatnot, you're going to end up with siloed, centralized systems. And there's always gonna be a space for centralization and, you know, people with deep expertise, that T-shaped skills thing. There are gonna be people who absolutely have to be T-shaped, with deep, long stems, because this stuff is complicated. But we need more people to be generalists.
And if we can combine people who have more general skills with fewer requirements, you know, the big one is cutting down the amount of data you have to process. Like, if you can narrow down the data set you're working with, suddenly that makes a lot of very complicated things a lot easier to deal with. So start doing more of that. Get more general. You're going to naturally end up with a more distributed system in a more distributed environment. And hopefully, you're going to end up with teams that are really knowledgeable in the domain and the problem that's interesting to them. You know, that old adage: look for people who are concerned about a problem and let them do things. Which gets to another thing that we do a poor job of.
We've got to get more diversity in these teams. You know, we need to hire for more diversity, and we need to assemble teams with more diversity in mind. Because when you assemble a team, you're usually looking for these T-shaped skills and for people with lots of experience, and that's not all you want. What you really want is people with a wide range of skill sets and a wide range of backgrounds. They make better problem solvers. There are some cool studies about that, some of which I referenced in the book. Having a more diverse team makes it easier for you to solve problems you don't already know the answer to. And we need to get better at that. In terms
[00:35:10] Unknown:
of the kind of scoped problem of building a data product, you know, in the book you focus on the use case of a small team, working in a smaller group to conceptualize, iterate on, build, and produce these data products. And I'm wondering what you see as the core practices that are necessary in that setting, and some of the ways that those practices change or mutate, or when you start needing to bring in other concepts or specializations, as that team starts to scale and can no longer be considered small. And maybe speak to what that tipping point happens to be, whether it's actually in terms of quantity of people or complexity of problem,
[00:35:56] Unknown:
etcetera? Oh, boy. I think the answer to that depends. I mean, the short answer is pretty straightforward: two-pizza teams. Right? You know, no more than 5 to 8 people. Keep the number of people odd so that you can have a tiebreaker when you have to decide something, all of that kind of stuff. That's the good short answer. But, you know, the reality is you can only hold so much stuff in your head at once. It's difficult to think about a wide-ranging problem with any kind of complexity by yourself. So you need other people to help you do that. But the more people you have, the more the problem that people think they're working on is gonna diverge, and that's okay. Maybe you need to split that team off. And, again, this is a centralized versus decentralized thing. This is one of the big drivers behind the, I guess, philosophy of data mesh. You know, same thing with microservices. I mean, I think about data mesh as being microservices for data.
And sometimes that's probably the most appropriate way to solve the problem. Other times, it's not. One area where I think it is the most appropriate way to solve the problem is when you're in exploration. When you're just exploring the problem, you've got a couple of people who are dedicated to solving a particular problem. Give them an exploration platform that lets them do their job as efficiently as possible. And if it doesn't work, you package it up and you put it on the shelf. If it works, you probably will need to completely change the way that thing is implemented.
And at that point, you're gonna need to bring in specialists to help you do things. You're gonna need to bring in people with deeper technical experience than the team probably has. And then that's where you start to figure out what's the most appropriate team structure because we're doing something different. We're growing.
[00:37:51] Unknown:
As far as the overall state of the data ecosystem, we mentioned earlier the modern data stack, and there have been various evolutions of ETL and ELT tools. And every few months, or every couple of years, there's some new product category that people are exploring. You know, some of the recent ones are data catalogs and now data quality and data observability. With all of that activity, a lot of this is stuff that we've had in some variety or another for years; it's maybe just repackaged with a particular focus. And I'm wondering what you see as the challenges, particularly for business people who aren't steeped in this every day, in extracting useful signal from all of the noise.
And is there anything in all of this funding-laden hype around these different elements of the data ecosystem that is actually truly new and innovative, and not just revisiting the same concepts with a shinier brand?
[00:39:01] Unknown:
The first thing is, again, you've got to understand what problem you're trying to solve. Why do you bring in a data catalog? First off, you need to realize that we've always had data catalog efforts. You know, for 20 years there's been a push, every 5 or 10 years, for new data catalogs. Well, why is that? Because we don't have any visibility into the data. We start collecting the data, we don't manage the data well, and it's difficult to expose it because we lock it away in databases and we don't let people get access to those databases. So it's a legitimate business need. But the problem is, again, I'll go back to one of the central themes.
Why on earth is your data system so large and so complicated that you need to put a data catalog over everything? And how long is it going to take you to deploy that data catalog, to catalog all of your assets and make them visible? Your system's so big that it might be better to distribute that work out and have the teams that really understand it figure out: okay, this is valuable, this is not valuable. Let's figure out how to prune the system, you know, get rid of that technological debt. It's difficult for business people who are making these buying decisions to focus on that, because they get sold on, oh, we need to have a catalog, and we need to have APIs on everything, and we need to do this, and we need to hire a team of 12 data scientists to do machine learning models. And we don't stop to think about why, and do we really need to do that right now, and more importantly, are we even prepared to do that? What would we do with that? I mean, in manufacturing, you see this all the time. You show someone who's working on a production line a statistical process control chart, an X-bar and R chart, and they're gonna look at you like a deer in headlights.
You have to go back and explain variation to them. All they need to know is, am I doing okay today? The equivalent would be a dashboard. That's probably where you need to be. And leave the fancy stuff for after the training you provide them to understand why, you know, this is more valuable. Why is a scatter plot more valuable than a bar chart, let alone a pie chart? We're pretty immature. And as technologists, we need to step back and ask: are people ready for the level of maturity that I'm pitching here? Chances are, probably not. So what do you do?
[00:41:34] Unknown:
And so, in terms of the work of building data products, bringing business people into the planning and execution of that, and bringing engineering teams into the business problems, what do you see as the biggest risks to those engineering and data teams as they start to embark on these products, or start to evolve their capabilities to work more closely with the business needs?
[00:42:07] Unknown:
We don't get alignment on the problem. We don't clearly articulate what the problem is and have a shared understanding of it. Take the time to get a shared language around the problem; that's really the core. And again, we don't understand the business, and the next thing that flows from that is we don't understand the business problem. We think too big too quickly. And this goes for both sides of the fence, the business folks and the technical folks. We want to jump to a conclusion. We want to pay for some product to magically take our problems away. That's where we first jump. We need to slow down a little bit and think about what we are really trying to accomplish.
That's just always a big risk. You know, it's always people and process problems. And then I do think there's a problem with trying to collect all the data, you know, for folks in the data warehousing space. There's a famous quote from the CTO of General Motors, who had been the CTO of Hewlett-Packard, saying, I wanna know everything about everything. Like, that is the most insanely crazy thing you could ever say. You're just going to build a big pile of technical debt that no one will be able to use.
[00:43:24] Unknown:
Absolutely. And so in terms of the work that you've been doing on the book and some of the kind of core lessons that you've put in there, what are some of the most interesting or innovative or unexpected ways that you've seen those ideas applied or
[00:43:39] Unknown:
some of the useful or interesting feedback you've gotten on the book now that it's out? So my wife works for an educational publisher. You know what the educational market looks like right now; it's just exploding. And they're trying to deliver their own new products. As she was explaining what's going on at work, it was like, oh, let me tell you about Explore, Expand, and Extract. And I got invited to come and do what I affectionately call a mansplaining lunch and learn about their transition, because where they were at, they were exiting out of Explore and going into Expand.
And that piece there, you know, is the crossing-the-chasm spot from that famous book. I call it the crappy chasm of doom, because it's really horrible to make that inflection point. Nothing works. You know, everybody's working way too much overtime, and there's no sign that anything's ever going to get better. And one of the big messages was really nice to deliver, because I've been thinking about this; I've been applying it to other places where I've worked. I've been through this transition point a lot, and it's always horrible.
And it was nice to explain to non-IT people that, you know, it's going to be okay. You just need to give it a few more months, and you will be past this, and you'll be back to normal again. That was really satisfying, because these concepts are really universal. Anytime you try to build things, the worst thing that can happen to you is that people actually like it and it takes off. Nobody's prepared for that. That was a nice thing that I got out of the book.
[00:45:20] Unknown:
In terms of your experience of writing the book and sort of formulating the ideas behind it and the core kind of mission of it, what has been the most interesting or unexpected or challenging lessons that you've learned in the
[00:45:40] Unknown:
process? Trying to teach technical development and throw in a whole bunch of business theory at the same time was just ridiculous. But I couldn't think of another way to do it. You know, if you learn how to use a particular set of tools, you still don't know what to use them for; and if you know where to point a tool, you still have to know how to use it. So I tried to walk a fine line between both of those difficult situations, and it was hard. I'm not really sure I did a great job of it, but I did the best job I could do. The other thing that was really good is I learned that I had about 20 years of frustration locked in my head, and getting it out on paper was very cathartic. Like, I can sleep at night. I don't think about data problems at night now, and I was before.
They were keeping me up at night. There was a lot of imposter syndrome, which I encountered and still have, so I think everybody who puts themselves out there really suffers from that. Yeah, those were really the most valuable pieces of learning that I got out of doing this. It was a lot harder than I thought. It was one of the hardest things I've ever done, and to crank out basically a textbook in 8 months was nuts. That was ridiculous to try to do.
[00:46:57] Unknown:
Well, now that you've done it, what's next?
[00:47:00] Unknown:
You know, I don't know. When I handed in my resignation, I had been planning it for a while, you know, saving up my money for a while. So right now, I'm just looking for new opportunities. I don't know that I have another book in me right away. I'm just trying to think about how to solve these problems. So if anybody else is interested in these kinds of problems, feel free to get in touch with me. I'd love to talk to you about it. It's something I'm really passionate about, and have been for a long time, and I don't really know what's gonna come next.
[00:47:34] Unknown:
Alright. Well, for anybody who does wanna get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:47:49] Unknown:
I think there's a lot of potential in data mesh, both for good and for total disaster. We had the same thing with microservices. Go look at how microservices panned out, learn from that, then start applying it to data, and hopefully make fewer mistakes. That would be one thing. The second thing is, man, I wish there was something like SQL for visualization that didn't require you to write Python code. Because if you hand a page of Python code to someone who's not a developer, their eyes glaze over, and it's terrifying to them. We need something that's simpler. In the book, I used Vega-Lite, and that was fantastic.
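For readers unfamiliar with it, a Vega-Lite chart is a small declarative JSON spec, closer in spirit to SQL than to imperative plotting code. A minimal sketch of a scatter plot; the data values and field names here are illustrative, not from the book:

```json
{
  "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
  "description": "Defects vs. units produced, as a scatter plot.",
  "data": {
    "values": [
      {"units": 12, "defects": 1},
      {"units": 30, "defects": 4},
      {"units": 55, "defects": 3}
    ]
  },
  "mark": "point",
  "encoding": {
    "x": {"field": "units", "type": "quantitative"},
    "y": {"field": "defects", "type": "quantitative"}
  }
}
```

You say *what* to plot (mark and encodings) rather than *how* to draw it, which is the "SQL for visualization" quality being described.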
But it's pretty charts without a whole lot of interactivity. In some places, that's exactly what you want; in other places, not. So I would like to see something that's as easy for people to grok as SQL, but for visualizations. Actually, I may have a suggestion
[00:48:45] Unknown:
for something to take a look at. I haven't found one yet. Alright. Well, thank you very much for taking the time today to join me and share the work that you've been doing on writing this book and some of the thoughts and experiences that went into it. It's definitely a very interesting problem space, and it's great to see people trying to bring business people more into the fold of working with data, and helping to educate engineers on the business concepts and requirements that go into what they're actually trying to build. So I appreciate all the time and energy you have put into that, and I hope you enjoy the rest of your day. Great. Thank you. I really appreciate the opportunity. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction to Brian McMillan and His Career
Motivation Behind Writing the Book
Challenges in Enterprise Data Management
Target Audience and Goals of the Book
Bridging the Gap Between Business and Technology
Core Practices for Building Data Products
Modern Data Stack and Industry Trends
Risks and Challenges in Data Projects
Feedback and Application of Book's Concepts
Lessons Learned from Writing the Book
Future Plans and Opportunities
Biggest Gaps in Data Management Tools