Summary
Agile methodologies have been adopted by a majority of teams for building software applications. Applying those same practices to data can prove challenging due to the number of systems that need to be included to implement a complete feature. In this episode Shane Gibson shares practical advice and insights from his years of experience as a consultant and engineer working in data about how to adopt agile principles in your data work so that you can move faster and provide more value to the business, while building systems that are maintainable and adaptable.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
- Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.
- Prefect is the modern Dataflow Automation platform for the modern data stack, empowering data practitioners to build, run and monitor robust pipelines at scale. Guided by the principle that the orchestrator shouldn’t get in your way, Prefect is the only tool of its kind to offer the flexibility to write code as workflows. Prefect specializes in glueing together the disparate pieces of a pipeline, and integrating with modern distributed compute libraries to bring power where you need it, when you need it. Trusted by thousands of organizations and supported by over 20,000 community members, Prefect powers over 100MM business critical tasks a month. For more information on Prefect, visit dataengineeringpodcast.com/prefect.
- Data engineers don’t enjoy writing, maintaining, and modifying ETL pipelines all day, every day. Especially once they realize 90% of all major data sources like Google Analytics, Salesforce, Adwords, Facebook, Spreadsheets, etc., are already available as plug-and-play connectors with reliable, intuitive SaaS solutions. Hevo Data is a highly reliable and intuitive data pipeline platform used by data engineers from 40+ countries to set up and run low-latency ELT pipelines with zero maintenance. Boasting more than 150 out-of-the-box connectors that can be set up in minutes, Hevo also allows you to monitor and control your pipelines. You get real-time data flow visibility, fail-safe mechanisms, and alerts if anything breaks; preload transformations and auto-schema mapping to precisely control how data lands in your destination; models and workflows to transform data for analytics; and reverse-ETL capability to move the transformed data back to your business software to inspire timely action. All of this, plus its transparent pricing and 24×7 live support, makes it consistently voted by users as the Leader in the Data Pipeline category on review platforms like G2. Go to dataengineeringpodcast.com/hevodata and sign up for a free 14-day trial that also comes with 24×7 support.
- Your host is Tobias Macey and today I’m interviewing Shane Gibson about how to bring Agile practices to your data management workflows
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what AgileData is and the story behind it?
- What are the main industries and/or use cases that you are focused on supporting?
- The data ecosystem has been trying on different paradigms from software development for some time now (e.g. DataOps, version control, etc.). What are the aspects of Agile that do and don’t map well to data engineering/analysis?
- One of the perennial challenges of data analysis is how to approach data modeling. How do you balance the need to provide value with the long-term impacts of incomplete or underinformed modeling decisions made in haste at the beginning of a project?
- How do you design in affordances for refactoring of the data models without breaking downstream assets?
- Another aspect of implementing data products/platforms is how to manage permissions and governance. What are the incremental ways that those principles can be incorporated early and evolved along with the overall analytical products?
- What are some of the organizational design strategies that you find most helpful when establishing or training a team who is working on data products?
- In order to have a useful target to work toward it’s necessary to understand what the data consumers are hoping to achieve. What are some of the challenges of doing requirements gathering for data products? (e.g. not knowing what information is available, consumers not understanding what’s hard vs. easy, etc.)
- How do you work with the "customers" to help them understand what a reasonable scope is and translate that to the actual project stages for the engineers?
- What are some of the perennial questions or points of confusion that you have had to address with your clients on how to design and implement analytical assets?
- What are the most interesting, innovative, or unexpected ways that you have seen agile principles used for data?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on AgileData?
- When is agile the wrong choice for a data project?
- What do you have planned for the future of AgileData?
Contact Info
- @shagility on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
- AgileData
- OptimalBI
- How To Make Toast
- Data Mesh
- Information Product Canvas
- DataKitchen
- Great Expectations
- Soda Data
- Google DataStore
- Unfix.work
- Activity Schema
- Data Vault
- Star Schema
- Lean Methodology
- Scrum
- Kanban
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- Atlan: ![Atlan](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/ys762EJx.png) Have you ever woken up to a crisis because a number on a dashboard is broken and no one knows why? Or sent out frustrating slack messages trying to find the right data set? Or tried to understand what a column name means? Our friends at Atlan started out as a data team themselves and faced all this collaboration chaos themselves, and started building Atlan as an internal tool for themselves. Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to [dataengineeringpodcast.com/atlan](https://www.dataengineeringpodcast.com/atlan) and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription.
- Prefect: ![Prefect](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/BZGGl8wE.png) Prefect is the modern Dataflow Automation platform for the modern data stack, empowering data practitioners to build, run and monitor robust pipelines at scale. Guided by the principle that the orchestrator shouldn’t get in your way, Prefect is the only tool of its kind to offer the flexibility to write code as workflows. Prefect specializes in glueing together the disparate pieces of a pipeline, and integrating with modern distributed compute libraries to bring power where you need it, when you need it. Trusted by thousands of organizations and supported by over 20,000 community members, Prefect powers over 100MM business critical tasks a month. For more information on Prefect, visit [dataengineeringpodcast.com/prefect](https://www.dataengineeringpodcast.com/prefect).
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan's active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today, that's a-t-l-a-n, to learn more about how Atlan's active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork, and Unilever achieve extraordinary things with metadata.
When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends at Linode. With their new managed database service, you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes with automated backups, 40 gigabit connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey. And today, I'm interviewing Shane Gibson about how to bring agile practices to your data management workflows. So, Shane, can you start by introducing yourself?
[00:01:37] Unknown:
Hi. Yeah. I'm Shane Gibson, currently the cofounder and chief product officer of a startup called AgileData.io. And my background has been, for a while, coaching data and analytics teams on how they bring in agile patterns and make them useful. I'd just like to say I'm, you know, a longtime listener, first time caller. So love the podcast and great to be on it. Yeah. Happy to have you here. And do you remember how you first got started working in data? Yeah. So it's kind of embarrassing. I've been in the data world for almost 32 years, I think, as I thought about it. And I started my career out working in a finance department doing accounts payable, so doing some of the invoice processing. And I didn't enjoy it. I found the work quite mundane, and I was lucky enough to fall into a role around the thing we call systems accounting.
So at the time, we had a timesheet system, and so I picked up the technology side of that. And just to date myself, I think the first server we bought was a Compaq 386SX/25. Those cost us about $50,000 New Zealand back then and had, I think, you know, 512 k of RAM. So I really started enjoying working in the technology space and was lucky enough to move to another organization where I was put in charge of replacing the financial system, moving it off the old mainframe across to client server. And as part of that, we were experimenting with what were called executive information systems back then, EIS, and playing around with tools like Forest & Trees to take their financial information and make it available to our CFO.
So after we implemented that platform, I jumped across to the BI vendor world, so for some of the large US vendors, but based out of New Zealand. My first gig was kind of doing enterprise resource planning in that financial software market. And again, it wasn't my passion, but I was lucky enough at the time that the vendor had a really good BI and data stack. So I jumped swiftly across into that team and spent, you know, 10 to 15 years working for those large vendors, but based out of New Zealand, in that data and analytics space. After that, I founded a consulting company, OptimalBI, and grew that into your usual type of, you know, data and analytics consulting company. I think when I was leading it, we got to around about 20 consultants. So we'd go into a customer and do that strategy work and figure out where they wanted to go, and then bring the rest of the team in to help deliver it. And as part of that, I got frustrated with the standard consulting methodologies.
I found that they didn't work. And so I stumbled across this thing called agile, which at the time I thought was a bit of an airy-fairy, weird religion, you know, the whole kumbaya thing. So we experimented with it internally with the team, and then I was lucky enough to experiment with customers, with their teams, to see how it worked. And I've spent the last 8 years effectively being an agile data coach, so working with data and analytics teams exclusively on how they can adopt agile practices. And then to finish it off, about 3 and a half years ago, I cofounded AgileData.io with a good colleague of mine, Nigel, and have been kinda leading the product side of that company for the last 3 and a half years. And so in terms of
[00:04:38] Unknown:
the overall practice of bringing some of these agile principles into the data domain, I'm wondering if we can start by talking about some of the, I guess, industry verticals and existing patterns that you're coming into as you start to work with some of these businesses and say, okay, well, this is what's challenging for you right now. These are some of the ways that agile can solve that problem for you. If I look at it, everybody has a data problem. Whether you're a small organization, a big organization, whether you're in a financial services vertical, whether you're in an insurance vertical, you know, a government vertical,
[00:05:13] Unknown:
health. Right? Everybody has a bunch of data, and they want to leverage that data to make decisions. So from my point of view, I don't tend to work in specific verticals when I'm coaching teams. But what I do see is basically two times that an organization will engage me as a coach. One, which is ideal, is when they're starting their journey. Right? They've got a way of working at the moment that's not working for them, and that's the key. Right? They're going, what we're doing right now is not working for us. We wanna make a change. And they've seen, you know, this agile way of working, or talked to people who've done it before, or read about it, and they wanna try it to see if it makes the world a better place in terms of their organization and the teams. So what we'll do is we'll start from scratch. Right? The team will start their journey from that point, and I will coach and help them move forward. And the other way that tends to happen is the team have been experimenting with agile for a while, normally 6 to 12 months.
They've either had no success with the goals that they want to achieve, or they've started off and it's been working for them and then they kinda hit a wall. Right? They get stuck around iterating their process. And that's the other time that I tend to get brought in to help teams change the way they work.
[00:06:23] Unknown:
As far as the kind of paradigms that have been dominant in the data ecosystem, there have particularly been in recent years a number of attempts to bring some kind of software methodologies into the data domain with varying levels of success, with some of the notable standouts being things like data ops and version control, both for the code and data versioning, you know, auditability, testing, kind of observability. And I'm wondering what you see as some of the aspects of agile that do map well into some of the existing data paradigms and some of the ways that it either falls short or introduces maybe brittleness or kind of extra work that is just not really providing value?
[00:07:10] Unknown:
Yeah. So the way I think about it now is when we talk about agile, it's a form of mindset. And one of the downsides is a lot of people think agile equals Scrum, and that's not true. There's a bunch of patterns out of lean, out of flow, out of XP, out of some of the other agile ways of working that have value to a data team. I think the second thing is people often focus on one perspective of agile. So they'll focus on something like the technical patterns that you've talked about, you know, CI/CD, version control, those kinds of things.
And the way I think about it now is I think there's four lenses we can use. And I talk about patterns. Right? And a pattern is something that has value in a certain situation. Right? It's something that's been used before that, if I look at the context of the way it was used, it fits our context. And if we apply it, we potentially will get some value out of it. And I break those patterns down into four groups. So I talk about team topology organization patterns. So this is the way the teams are structured. You know, do we have one single team? Do we have multiple teams? How do they interact? How do they fit into the organization? So that whole pattern around team topologies is important.
The second thing I talk about is process or practice patterns. That's, you know, what are the things the team are gonna do to get the data from the beginning to the end? I'm a great fan of the concept called How to Make Toast. So if you Google that and look at the YouTube video, it's a great process to work with a team to say, actually, what is the work they do every day and how does it flow? The third one I think about is technical patterns. You know, that idea of version control, that idea of managing tests before you deploy, that idea of data modeling. Right? And in the data world, we have some unique things about that, which we can talk about in a second. And then the last one I think about is way of working patterns. How do we take all those other things, put them together, and create our own way of working?
And with teams, I really encourage them not to adopt a methodology. So there's a big push in the world to adopt scaled agile, SAFe, and I am very negative in my view of SAFe. Agile is not a methodology. Right? It's a way of working, which says we're gonna iterate. We're gonna get value to the customer early. We're gonna get feedback. And then when something's not working, we're gonna change it. And so we don't wanna pick up a methodology. We wanna craft our own way of working, but by leveraging patterns that have value, so we don't have to do it from scratch.
[00:09:23] Unknown:
One of the interesting aspects of the agile principles in the data world is that one of the predominant aspects of agile is that you want to focus on a fully connected end to end flow with a very narrow scope where you say, I want to, for instance, you know, add a new input form for somebody to be able to, you know, give me their email. So that means I have to have the UI. I have to implement that. I have to create the database model so that it can store the field. I have to make sure that, you know, the controller, middleware, and the web application is able to receive that input and write it to the database. I need to have tests around all of that. And in, you know, a web application workflow that's fairly straightforward. It's well understood how to actually do that end to end flow with that narrow scope. Whereas in the data domain, it's not always clear how to manage the appropriate chunking because before you deliver all the information to the end user, you have to think about things like governance, data modeling, you know, the kind of life cycle of the data, data cleanliness.
And so I'm curious how you think about how to approach that question of what does that quote unquote narrow slice look like, and how do we reduce the scope from end to end without, you know, starting at the beginning and saying, oh, well, now I actually have to do the entire horizontal layer of staging the raw data across the board before I can even go to whatever the next stages are. And so you're able to do that kind of, like, deep integration instead of wide integration.
[00:10:56] Unknown:
I think there's two questions in there if I unpack it. There's the question of why do these software engineering practices that are well established and really successful seem to struggle when we apply them to the data domain? And then the second question is how do we thin slice? Right? How do we take this big behemoth that, you know, we typically used to spend 3 years building out, and how do we bring it down into weeks? Right? And do that in a successful way. So if I go back to that first one, I've really struggled to understand why software engineering practices, and agile from a software engineering point of view, are difficult in the data domain. Because in theory, it's very similar. The best I've got at the moment is, as you said, when you're building a web application, you're in control of the data. You control how that data is created, how it's entered, how it's landed, how it's stored.
In the data domain, we effectively get given that data as exhaust. So we have absolutely no control, and we get a massive amount of uncertainty. And that brings a lot of the problems to the data world that we don't have in the software engineering world. I think the second part is the tools that we have in the data world for adopting agile ways of working. We're in the stone age. Right? The tools we have are not fit for purpose. They're based around big chunks of work happening. And so we've got to find ways of fixing those two problems. And we've started to see that. We've started to see our tooling get more agile in terms of the way we work. And we've found techniques to break down and solve that problem around the uncertainty. We'll never solve the uncertainty unless, you know, we do go full data mesh, where the software engineers are actually producing fit for purpose data. And that's a dream we've had for 30 years. I don't think we're gonna achieve it, but we might. And why don't I think we're gonna achieve it? Because if I'm the product owner for an organization and I need a new field on my form to go out to engage with my customer, I have to make a trade off decision between that field turning up this week,
or doing the data work so that field turns up next week. Unless I really am a data driven organization, I'm gonna make a trade off decision, which is push that field out and give me that customer value, and then come and do the data. Right? And then I'll potentially reprioritize some other work. So I think it's an organization priority problem more than anything else. So if we go back to: okay, we get given this data we can't control, so it has a massive amount of uncertainty, and our tools aren't fit for purpose. What techniques do we have to fix that? And so what we wanna look at is thin slicing. So teams will take one of two approaches. They'll thin slice, which says they'll try and break the work down into a small enough chunk that they can do end to end in 3 weeks, or they'll pipeline it. Right? They will break up their work to match the technology stack, and they'll pipeline the work. So you'll see them grab the data, collect it, land it in staging, or whatever you wanna call it. Then you'll see them pick it up and move it into some form of other data repository and ideally model it. You know, then they'll actually go and create some metrics, and then they'll create some visualization or last mile delivery. And each one of those will be a set of work with handoffs and milestones in between.
I encourage teams to thin slice. Our goal is to get a group of people to be able to go end to end with that data and add the value to that customer, the consumer, at the end of what they do. And, ideally, to start, we try and get the teams to do that within a 3 week iteration, which is hard. Right? It's hard to go end to end in 3 weeks as a small team.
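To make the thin slice idea concrete, here is a minimal sketch of one slice that takes a single business event ("customer orders product") all the way from collection to the consumer-facing metric, rather than landing every source table first. All function and field names here are hypothetical illustrations, not anything from AgileData's actual platform:

```python
# Hypothetical thin slice: one narrow piece of data taken end to end
# (collect -> model -> last mile), instead of staging everything first.

def collect_orders(raw_rows):
    # "Collect/land" step: keep only the columns this slice needs.
    return [{"customer_id": r["cust"], "amount": r["amt"]} for r in raw_rows]

def model_orders(landed):
    # "Model" step: a minimal model, total order value per customer.
    totals = {}
    for row in landed:
        totals[row["customer_id"]] = totals.get(row["customer_id"], 0.0) + row["amount"]
    return totals

def present(totals):
    # "Last mile" step: just the two numbers the consumer asked for.
    return {"customers": len(totals), "revenue": sum(totals.values())}

# One slice runs end to end over a tiny made-up extract.
raw = [{"cust": 1, "amt": 20.0}, {"cust": 2, "amt": 15.0}, {"cust": 1, "amt": 5.0}]
print(present(model_orders(collect_orders(raw))))  # → {'customers': 2, 'revenue': 40.0}
```

The pipelining alternative described above would instead land *all* source tables, then model them all, then build metrics, with handoffs between each stage; the slice keeps every stage but shrinks the scope of each one.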
[00:14:21] Unknown:
Another aspect of that kind of end to end is understanding what is the other end, where in the web application, it's very clear that the other end is, you know, some UI functionality or some way that the user is able to interact with the product, whereas with data, there isn't necessarily a kind of cohesive end step for any given piece of effort, where in some cases it might be, I need to be able to add a new filter to this visualization in the BI dashboard, or it might be I need to be able to populate this, you know, data source for an API so that I can then consume that data product in some other web application or data pipeline. And so what do you see as some of the, I guess, most common or most achievable kind of terminal nodes in that end to end graph for particularly for teams that are first starting on this effort of being able to kind of transform into that agile workflow?
[00:15:22] Unknown:
Yeah. So this comes back to one of those process patterns, and it's around understanding requirements, understanding the value we need to deliver. So I've been working on a thing called the information product, and the information product canvas, with customers for the last 8 to 10 years, and, you know, open sourced the canvas, so we can have a link in the show notes if anybody wants to download it. And what it is, is a way of defining a boundary of the work to be done. Right? So we go away and we talk to the stakeholders, and we wanna understand the actions and the outcomes that'll be taken with this piece of effort. So, you know, if you get this information at the end, what are you gonna do with it? What action are you gonna take, and what's the outcome of that? So, you know, are we gonna go and get a flag for customers that are about to churn? What are you gonna do with it? Right? Okay. Well, we're gonna have an outbound call center go and talk to them and give them an offer. Cool. And if that's successful, what happens? Okay. We reduce our churn, which has some financial benefit. So what we're doing in there is we're getting both the action and an understanding of the value of that piece of work.
And the reason we want the action is often we get told the solution. Right? Not the problem we wanna solve. And as data specialists, we should be trying to understand the problem and ways we can solve it. So by asking for the action, we can sometimes come up with a different solution. So, well, actually, we don't need the churn flag. You know, we've already got this thing over here. If we just leverage that, we can get something to you much quicker that will solve your problem. We'll give you that action. The other thing we do is we often ask for the business questions.
And the reason we do that is we find when we ask for the action, the outcome, some people struggle. They haven't thought about it that way. But if we ask for the business questions they wanna answer, the how many, how much, how long, yeah, it comes straight off. They've always got 3 to 5 straight out of their head. Right? And from there, we can help them further and talk to them about it. As part of the canvas, we also wanna understand the data that we need. Right? So we use a methodology from Lawrence Corr called BEAM: who does what? And we say, okay, so what core business processes are involved? So, yeah, customer orders product, customer returns product to store. And that enables us as a team to go, damn, actually, you know, customer orders product, we've already moved that data into the platform. We can get something out pretty quick. Customer returns product to store, actually, that's a whole new system. Right? We haven't collected that data. And that allows us to size the data. We go, okay, now we've got a data collection task, and we know those are hard. So the canvas allows us to have a bunch of boxes, gather that information, understand how big it is, and then go back to the stakeholders and have a trade off decision conversation. Say to them, well, look, we could do everything that you want, and we're estimating 3 months. Or we could do this bit first, and that'll be 2 weeks, and then we'll do the next 2 weeks. Right? And we'll just incrementally build it up. And that's what we want. We wanna be able to break those requirements down into smaller and smaller chunks to get it out, show value, and get feedback.
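As an illustration only, the "bunch of boxes" described above could be captured in a small structure like this. The field names are my own guesses at the canvas sections based on the conversation, not the published Information Product Canvas template:

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the canvas boxes discussed above; field names
# are illustrative, not the actual published Information Product Canvas.

@dataclass
class InformationProductCanvas:
    name: str
    actions: list[str]             # what the stakeholder will DO with the information
    outcomes: list[str]            # the value if those actions succeed
    business_questions: list[str]  # the "how many, how much, how long" questions
    core_processes: list[str]      # BEAM-style "who does what" business events
    estimate_weeks: int = 0        # sizing used in the trade-off conversation

canvas = InformationProductCanvas(
    name="Customer churn flag",
    actions=["Outbound call center contacts at-risk customers with an offer"],
    outcomes=["Reduced churn and its financial benefit"],
    business_questions=["How many customers are about to churn?"],
    core_processes=["customer orders product", "customer returns product to store"],
    estimate_weeks=2,
)
print(f"{canvas.name}: first slice in {canvas.estimate_weeks} weeks")
```

Having the core processes written down is what lets the team spot, per process, whether the data is already collected (quick) or needs a whole new collection task (hard), and take that trade-off back to the stakeholders.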
[00:18:02] Unknown:
Particularly on that requirements gathering piece and helping to educate the stakeholders on, you know, what is easy, what is hard, what is impossible. With a web application, that can be challenging enough, with somebody saying, oh, well, I want it to automatically know whether this is somebody that we've interacted with before. And it's like, well, we can do that, but that's gonna take about 2 years. And then in the data domain, it's, oh, well, I wanna be able to report on x, y, and z. And you say, well, actually, we're gonna have to incorporate information from 5 different sources and do entity resolution, and, you know, now we're talking another 2 years for this kind of thing. Versus, oh, I just wanna be able to know, you know, did they purchase or did they have something in their cart? And you say, okay, well, we can do that pretty quickly. And it's just understanding, both at the stakeholder level, but also sometimes, you know, within the data team, what are those kind of relative levels of effort, and what are the things that are easy versus hard versus impossible?
[00:18:54] Unknown:
Yeah. And so in the past, I've made many mistakes. You know, we fail often in agile. And one of the ones was trying to explain the complexity to our stakeholders. They don't care. Yeah. They just want the job done. They're not specialists in the data world. They have something they want. They don't understand why it takes so long. And that's reasonable. Right? It's not their domain. And often we have complexity that doesn't make sense to us. So how the hell do we justify why it takes so long to them? So this is where the role of product manager or product owner comes in, in my view. That is a role that sits between the team and those stakeholders.
And that role is around facilitation and communication with the stakeholders around trade offs that need to be made. And then it's also about making the trade off decisions. So, you know, when there is something that is complex and is gonna take a while and there is no choice, then the conversation from the product owner to the stakeholders is: we can do it, and this is how long it's gonna take. Do you wanna wait? Or here's some alternatives that will get us there in incremental ways, but you won't get everything you want. And what we see is when we have a good data team and a good product owner, there's a massive trust there. Right? When the team are estimating, or guesstimating as I call it. Right? Because as humans, we are crap at estimating.
When they're guesstimating the number, you know, there's a trust thing where the product owner's been working with them for a while and goes, yep, that's in the right ballpark for the little I know. And then they articulate to the stakeholders that there really is no alternative. That's just what it's gonna take to get their job done.
[00:20:24] Unknown:
The other interesting element of kind of doing that end to end flow is the question of data modeling, where how much of the modeling and kind of entity design and schema design do you do upfront versus how much of it is emergent? And when you do go down the path of saying, you know, we'll just do it, and then we'll let the patterns become emergent, similar to how you might do in, you know, a regular software project of, you know, iterate it a couple of times, figure out what are the right abstractions, refactor. Doing that in the data world isn't always easy or possible, or it just becomes expensive, because you're duplicating data or, you know, the operations to refactor the data and kind of rebuild it from scratch can take quite a bit of time.
I'm wondering what you see as some of the signals for when you need to bias towards 1 direction or the other of, no, if we don't do the data modeling right now, we're going to regret it because it's going to take, you know, weeks or months of effort to do it right afterwards, or, you know, this is a small enough change set or a small enough problem space that we can just do whatever makes sense right now and then refactor it later?
[00:21:33] Unknown:
So I'm highly opinionated on this 1, and we're highly opinionated in our AgileData product as well. I don't believe there is ever a 1-off question, ad hoc piece of work. If you look at what happens, you get asked by a stakeholder for a piece of data or a piece of information to answer a business question. That's their first business question. How many customers have we got? Soon as you answer that, you're gonna get the next business question. Yeah. Where are they based? What are they buying? How much are they worth? How many are we losing? Soon as you answer those, you're gonna get the next level of complex questions.
Why are they leaving? What can we sell them that's gonna give us more revenue? I want more customers that look like them. What do they look like? Then we start getting to the really complex optimization and flow ones, you know, over time, how many have left from this region versus that region? Why did they leave, and what could we have done to change it? Yeah. So the first question we get asked is just the first question, and we know that the next ones are coming. The other thing we know is once we give that piece of information to a stakeholder, they're gonna ask for it again on a regular basis. It has value for them. So this idea that we just go create a piece of code ad hoc as a 1-off, give them the answer, and we'll never have to use it again? I think it's been proven to be wrong time and time again. You know, we've seen that in the market. We saw it with the self-service BI wave, you know, the Tableaus, the Qliks.
We gave self-service out. Lots of people built really cool stuff, but we lost the practice of data. And then we kinda go through a wave where we move back, and we're in that wave again with dbt. Right? And, you know, again, I'm opinionated. I don't like it when words get used out of context. So I really don't like the fact that dbt calls a chunk of code a model, because from my past, from the last 30 years, that's not a model. It's a chunk of code. So how do we solve that problem? Well, what we do is we focus on how we model lightly. Right? And how we have modeling that enables change as much as it can.
You know? And if you look at people like Scott Ambler, he's been talking about agile data modeling for a long, long time. And so with the teams I've worked with, there are techniques you can use to model early, model lightly, and enable change as much as possible. But the thing that's true with every agile way of working: change has a consequence. Yeah. So we just try and reduce the consequence of that change as much as we can, but change is always gonna need effort. So there's a bunch of patterns and techniques that I've seen teams use to model early, model lightly, and enable the change to the model, but they always model. And so the other lens that I use is what we call definition of done. So definition of done is a set of statements from the team about what the professional practice is. How do they know they've done a job for themselves? Right? How do they know if somebody else in their team is doing a piece of work that is done to the level they've agreed?
And, you know, as an example, I would expect the definition of done for a data team to include: the code has been tested and the data has been validated. Why would I expect that? Well, you go talk to a stakeholder, and you say to them, I'm gonna give you a count of customers, but, actually, I'm never gonna check that it's right. Right? It's just a number. I can't prove that's how many customers we've got. I'm gonna leave that to you. Would that stakeholder believe that we've done our job? No. But we do it all the time, and it's just wrong. So there is a level of professional practice as data people that we need to do, and data modeling is 1 of those, and testing is 1 of those, and validating our data is 1 of those, and we should just do it right. It shouldn't be optional. The trick is how do we make it iterative? How do we make it small? How do we not, as they say on Catalog and Cocktails, boil the ocean? You know, how do we not go into that scenario where we have a single data modeler who sits in a small office scratching their chin for 9 months to come out with this beautiful enterprise data model we'll never implement? We've got to balance that out with light modeling, but we still have to model.
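That "code is tested and data is validated" bar from the definition of done can be made concrete with even a tiny reconciliation check. A minimal sketch in Python, where the function name, source systems, and counts are all hypothetical illustrations rather than anything from AgileData's product:

```python
def validate_customer_count(warehouse_count: int,
                            source_counts: dict,
                            tolerance: float = 0.01) -> bool:
    """Crude reconciliation: the customer count we hand to a stakeholder
    should be within `tolerance` of the combined source-system total."""
    expected = sum(source_counts.values())
    if expected == 0:
        return warehouse_count == 0
    drift = abs(warehouse_count - expected) / expected
    return drift <= tolerance

# Hypothetical example: CRM and billing both feed the customer table.
sources = {"crm": 9_950, "billing": 100}
assert validate_customer_count(10_000, sources)        # within 1% drift
assert not validate_customer_count(5_000, sources)     # clearly broken
```

The point is not the arithmetic but that the check runs on every build, so "the data has been validated" is something the pipeline proves rather than something a person asserts.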
[00:25:35] Unknown:
Prefect is the data flow automation platform for the modern data stack, empowering data practitioners to build, run, and monitor robust pipelines at scale. Guided by the principle that the orchestrator shouldn't get in your way, Prefect is the only tool of its kind to offer the flexibility to write workflows as code. Prefect specializes in gluing together the disparate pieces of a pipeline and integrating with modern distributed compute libraries to bring power where you need it, when you need it. Trusted by thousands of organizations and supported by over 20,000 community members, Prefect powers over 100,000,000 business critical tasks a month. For more information on Prefect, go to dataengineeringpodcast.com/prefect today.
That's Prefect. So in terms of the definition of done and some of the challenges or anti-patterns that are maybe becoming emergent with this newer set of tooling, what are some of the ways that you see teams, as they go through this exercise of, we're going to embrace some of these agile practices, we're going to do this iteratively, you know, how do they start to think about the tools that they use and the ways that they're using those tools to be able to incorporate all of these concerns into a holistic development flow without losing their minds?
[00:27:02] Unknown:
Yeah. So again, you know, opinionated. All they should do is buy an off-the-shelf platform like agiledata.io. Right? And not try and cobble their own. However, that's not where the market's at. Right? We're at the stage where we have what we call the modern data stack, or I call it the Jenga stack. You know, these 25 different tools, 1 for each category, and you're coupling them together. I kinda liken it back to the old ERP days where we used to have to cobble together the payables module and the receivables module and the GL module from 3 different vendors. And we learned from that. But given where we're at, right, the majority of the teams I see are building out their own technology platform using a combination of open source and closed source.
We start focusing on only building what's valuable for now, but making sure that we understand the technical debt and what the cost of change will be in the future. So, you know, we look at creating a blueprint. And what I mean by blueprint is we draw on a piece of paper, on a whiteboard, in a Miro board, a bunch of very large boxes that have the chunks of stuff that we need. Yeah. And then we figure out what we're gonna build first. So we'll typically look at it and we'll go, okay, we need some way of collecting data. What are we gonna use to collect data? Are we gonna build something? Are we gonna use a software-as-a-service product? Are we gonna use an open source thing? Right? And that's driven ideally by what's our first data source.
And so we do that. I'll give you an example. Yeah. 1 team I worked with had a theory that the first piece of value we had to deliver was grabbing data out of SAP. And we know that grabbing data out of SAP is incredibly hard. There's a whole lot of reasons that collecting data from SAP is just a nightmare. So we looked at that, and the team estimated it's probably 3 to 6 months to get, you know, a fully automated collection process out of the SAP platform. But when we went and asked the stakeholders what the first information product we were gonna deliver was, it was actually based around their call center software, because that's where they had an emerging problem. So if the team had gone and built out this whole collector for 6 months out of SAP first, and then the first information product, when they were ready, was actually the contact center software, they'd have invested in the wrong place.
So that's how we understand what we're gonna build first. So that's collection, normally, and then the second thing is where we're gonna store it. You know? And there's a bunch of patterns out there now. You know? Are we data lake centric? Are we, you know, data warehouse centric? You know, are we Snowflake? Are we Databricks? Are we BigQuery? Are we Firestore? Are we Firebolt? Are we SingleStore? Right? There's a bunch of patterns that are reusable. And so we should just pick 1 and implement it as quickly as we can and then test it. Right? And then we think about what's next. And so from an agile data point of view, when teams are working, they should be constantly reviewing how they make toast, their processes. They should be figuring out where the next major problem is or the next piece of value is, and they should iterate the way they work. So we're constantly building out our technology platform, and we're constantly changing the way we work to solve problems as they arrive.
But we still need that blueprint. Right? We still need that big picture of where we think we might go, and that's important. But that blueprint's not 6 to 12 months of data architecture or, you know, a big 500 page document that pretends to know what we're gonna build over the next 12 months, because everything changes. We know that.
[00:30:16] Unknown:
The other aspect of kind of allowing for refactoring of the data model as you explore more of the problem space is how do you bring in the right abstraction layers so that you can restructure some of the foundational layers, so that you can bring in, you know, data model reuse, code reuse, concept reuse without breaking some of those existing end user assets, and doing it in a way that doesn't, you know, add a whole bunch of extra work on the behalf of the developer so that it becomes a maintenance nightmare. Where you're just, you know, sort of like the situation where, if you want to replace the foundation of a house, you first have to lift the entire house and move it, then dig out the whole foundation again, and then hope that it doesn't fall over in the process, and then relay the foundation and put the house back down. Yeah. And that's a really important concept. Right? It's this idea of an architecture of a house, or I often use food analogies, for some reason I kind of blame DataKitchen for that. Right? The idea of ingredients, recipes, and storerooms, and frontline service. So we've got to understand the blast radius. Right? We have to understand what our architecture looks like, what our layers are, the ones we think we're going to add later, the ones we think become
[00:31:27] Unknown:
semi immutable. Right? The technical debt for changing them is massive, so that we understand the consequence of that change. And we understand where we're making bets that are dangerous. Right? Where those bets are embedded as foundational pieces and to change them later is a high cost and a high consequence. Versus the ones that are relatively disposable. So what's an example? You know, if I'm gonna bring in a testing framework, I could probably throw Great Expectations on the side of what I'm doing, and it's relatively replaceable. Right? I could bring in Soda or 1 of those other tools. Right?
And effectively change that testing paradigm, because it's not embedded. It's kind of sitting to the side. Whereas if I was gonna replace my cloud data analytics database, yeah, that may or may not be completely replaceable. If I'm gonna change my modeling technique, yeah, if I'm gonna go from Data Vault to dimensional or Data Vault to Activity Schema, that's probably a breaking change. Right? That's a massive change. So we need to understand the bets we're making, the ones that we can change easily, the ones we can't, and how we know when we need to change. And I'll give you an example from our product. Our entire product is based on Google Cloud. Right? It's 1 of the bets we made really early. We're not multi cloud. We decided to go with a single cloud provider. We picked Google for a whole lot of reasons. It's actually 1 of the best bets we've made. You know, it's 1 of the things I'd probably go back on in hindsight and say, damn, that was a good guess.
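That "sitting to the side" quality can be expressed as a thin interface seam: the pipeline talks to an abstract tester, so swapping one tool for another is a low-consequence change. A hedged sketch, where `DataTester`, `SimpleTester`, and the column names are illustrative inventions, not real Great Expectations or Soda APIs:

```python
from typing import Protocol

class DataTester(Protocol):
    """Seam that keeps the testing tool 'to the side' of the pipeline."""
    def check_not_null(self, rows: list, column: str) -> bool: ...

class SimpleTester:
    """Stand-in implementation; a real one might wrap Great Expectations
    or Soda behind the same interface."""
    def check_not_null(self, rows, column):
        return all(row.get(column) is not None for row in rows)

def run_pipeline_checks(tester: DataTester, rows: list) -> bool:
    # The pipeline only knows the interface, not the tool behind it,
    # so replacing the testing framework is a small, contained change.
    return tester.check_not_null(rows, "customer_id")

assert run_pipeline_checks(SimpleTester(), [{"customer_id": 1}, {"customer_id": 2}])
assert not run_pipeline_checks(SimpleTester(), [{"customer_id": None}])
```

Contrast this with the modeling technique: Data Vault versus Activity Schema shapes every table, so there is no thin seam to hide behind, which is exactly why that change is a breaking one.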
And our product's based around what we call configuration. So we're metadata driven. So when you create a transformation, you actually create it via rules in natural language, and that's stored in a database which holds that configuration. And when the transformation runs, effectively, we have code that calls that config, says, what does this look like, compiles the transformation code, runs it, and then disposes of it. So we call that a manifest as a pattern. Now when we started out, we knew that this config is core to us, and my cofounder, Nigel, and I come from more of a relational background rather than a NoSQL background, because we're that old. So we had a relational model for that config. We had stuff we'd done before for customers when we were consulting.
So we looked at it and we went, okay, we need a relational storage mechanism. We wanna keep it as low cost, as low complexity as possible. We're using BigQuery as the way of storing our customer data. There's no reason why we can't store that config in BigQuery. But we knew that that was not immutable. We knew that at some stage, when we got big enough and we had enough customers, BigQuery wouldn't be able to handle the concurrency of hitting that config database for us, because it's not designed as a transactional system. So we knew we were going to have to change it. So every time we did the design of the config, we did it with the idea that we were going to move it. Now, interestingly enough, we decided to become 1 of the cool kids and we did a change and we moved to Google Datastore. We moved it to a NoSQL database, because when we looked at the market, everybody was doing NoSQL.
That was an epic failure. Now, yeah, we learned lots, but we lost, you know, a lot of time doing that. Why? Well, because it wasn't our natural pattern. Some of the features we got from a relational type database gave us engineering for free, whereas in NoSQL, we had to go build it. So what we ended up doing was we ended up jumping from Google Datastore to Google Spanner. Right? Which is the massively scalable, relational style database. And with that change, you know, we effectively lost all the work we did on the NoSQL database. We ended up going from the BigQuery pattern straight to Spanner. But now that's, for us, immutable. Right? The cost of change of not using Spanner is massive for us, but that's okay. We get so much benefit out of that piece of technology, that pattern, that actually we don't wanna change it. Right? And so for us, we had a plan.
You know? We knew we had to make a change. We had a guess at what we thought it was gonna be, but we always enabled ourselves to paint ourselves out of the corner. And so that's what teams should be doing. Right? They should be thinking about that, about what happens if we need to change this piece of technology. You know, what happens if our, you know, massive cloud analytics database vendor who has a massive loss every year doesn't survive or gets bought out by, I don't know, a big CRM company. What's it gonna do to us? Right? If we had to change that database, what would we move to and how much work would it be? So we have to keep those things in mind, I think.
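Designing the config "with the idea that we were going to move it" is essentially a repository seam: the rest of the product codes against an interface, and the backing store (BigQuery, Datastore, Spanner) lives behind it. A minimal sketch under that assumption; the class and method names are hypothetical, and the in-memory store stands in for a real database client:

```python
import abc

class ConfigStore(abc.ABC):
    """Interface the product codes against, so the backing database
    (BigQuery, Datastore, Spanner...) can change behind it."""
    @abc.abstractmethod
    def save_rule(self, name: str, definition: str) -> None: ...
    @abc.abstractmethod
    def load_rule(self, name: str) -> str: ...

class InMemoryConfigStore(ConfigStore):
    """Toy backend; a real one would issue BigQuery or Spanner queries."""
    def __init__(self):
        self._rules = {}
    def save_rule(self, name, definition):
        self._rules[name] = definition
    def load_rule(self, name):
        return self._rules[name]

def compile_transformation(store: ConfigStore, rule_name: str) -> str:
    # Manifest-style flow: read the config, compile the transformation
    # code just in time, run it, then dispose of it.
    definition = store.load_rule(rule_name)
    return f"CREATE OR REPLACE TABLE t AS {definition}"

store = InMemoryConfigStore()
store.save_rule("dedupe_customers", "SELECT DISTINCT * FROM raw.customers")
sql = compile_transformation(store, "dedupe_customers")
assert sql.endswith("SELECT DISTINCT * FROM raw.customers")
```

Swapping the backend means writing one new `ConfigStore` subclass, which is roughly the escape hatch described in the BigQuery-to-Datastore-to-Spanner story.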
[00:35:38] Unknown:
Another problem that manifests particularly later in the life cycle of a data product, but is something that should be considered early on, is the question of access control and governance and who should be able to view what data and do what with it and export it where. And it's another 1 of those things that's easiest to do early on or at least to start incorporating early on in the process rather than trying to retrofit it in afterwards. And in this agile way of working of saying, I want to deliver a narrow slice, you know, this can very easily be something that gets left on the cutting room floor of, we don't need that right now, so we'll do it later. And I'm wondering how you either encourage teams to kind of push back against that habit or, you know, ways to simplify the process of incorporating some of those governance concerns or understanding what are the appropriate role boundaries for data as it traverses its different stages?
[00:36:37] Unknown:
Yeah. So there's 2 really good examples in there, which is, yeah, our security policies for data and our governance way of working. I think I've got successful patterns for the security. I've got great ideas for governance, but I've never been able to experiment with a team and an organization to see if they're right. Although we're experimenting a little bit within the product we're building. So if I take that security 1, what do we see? We see a natural pattern of jumping to complexity. We see a natural pattern of, on day 1, we want to be able to secure at a cell level. We wanna be able to mask out bits of data for certain users. Right? So it's a high level of complexity. And if we look at the effort, it could take 6 to 12 months for us to build that out in a way that works.
So we have to question, you know, do we really have to build that right now? Or can we chunk it down? Can we start off with an environment where only a certain small number of trusted users can come in and use the data? And there's a whole lot of belts and braces around policies and procedures and accountability for those people, so that we don't need to build a complex security model yet. You know, then can we chunk it down by role or group? And so we're always gonna segment our finance data versus HR data to different groups and make a really simple security model. We might get to something like, no, no, we've got PII data in there. Right? And we really have to secure, you know, social security number or driver's license. But we're quite lucky in that there's some technology patterns out there now. Most of the cloud vendors and most of the good products have, you know, data loss protection stuff where it will go through and use machine learning to identify the columns that have that data and mask it for us. So we could probably put that in as a low cost, well, low effort, maybe not low cost, but a low effort, you know, component that we can replace later if we needed to.
And we just start building up that security capability over time as the need happens. Now the downside of that is we have to refactor. Right? We have to change. So, you know, if we're using Power BI and Power BI is coming into the data and it's using a service account and it's not passing through the credentials of the user that's running the visualization, and we want to bring in some fine grained security, we've got a problem now. Right? We have to actually do the work so that identity gets passed to the database, if the database is applying the security. But we can solve those problems, and we only solve them at the right time.
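The detect-and-mask step that cloud DLP services do with machine learning can be approximated far more crudely, which is often enough for a first increment. A simplified sketch using regex detectors only; the patterns, column names, and masking token are all illustrative, and a real deployment would lean on the cloud vendor's DLP tooling instead:

```python
import re

# Hypothetical detectors; cloud DLP services use ML, this is regex only.
PII_PATTERNS = {
    "ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
}

def detect_pii_columns(rows: list) -> set:
    """Flag columns where every non-null value matches a PII pattern."""
    flagged = set()
    if not rows:
        return flagged
    for column in rows[0]:
        values = [r[column] for r in rows if r[column] is not None]
        if values and any(
            all(p.match(str(v)) for v in values) for p in PII_PATTERNS.values()
        ):
            flagged.add(column)
    return flagged

def mask_rows(rows, columns):
    """Replace values in flagged columns with a masking token."""
    return [{c: ("***" if c in columns else v) for c, v in row.items()}
            for row in rows]

rows = [{"name": "Ann", "ssn": "123-45-6789"},
        {"name": "Bob", "ssn": "987-65-4321"}]
flagged = detect_pii_columns(rows)
assert flagged == {"ssn"}
assert mask_rows(rows, flagged)[0]["ssn"] == "***"
```

Because the masking sits in one replaceable function, it matches the "low effort component that we can replace later" framing: upgrading to a vendor DLP service changes the detector, not the pipeline.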
If the organization is saying, no, we need belts and braces for whatever reason, you know, maybe we're in a highly regulated industry, then, you know, we estimate it and say it is 6 to 12 months to build out that capability. Right? That's the cost. Right? So it's a trade off decision. Yeah. Would you like us to incrementally build it out over time? Would you like us to do a big investment? If we're doing a big investment, I still think the team should break it down. Right? They should figure out ways of decomposing that security work into smaller moving parts where they can test and validate at each step and just, you know, build it up like a house, you know, a room at a time, a layer at a time. So that's the security 1, and I think there's lots of ways I've seen teams do that well. The governance 1 is 1 that I still struggle with. The best I've got at the moment is we have to break the anti pattern of governance as a bunch of committees that take something that's been done and have an opinion on it, which requires more work to be done. So they're at the end of the process.
What we need is we need governance to be at the front of the process. And on the podcast I do, we had a great guest that gave me this kind of terminology that I love. They talked about exoskeletons and internal skeletons. And the way I think about it now is our governance groups. And and by governance groups, it might be data governance group. It might be an architecture group. Right? It's a bunch of people outside the team that have the right to set rules or change the work that's being done. So, you know, let's look at our architecture group or our data governance group. They should be setting principles that are immutable and those are exoskeletons. So those are the things you have to do.
If you are gonna break those, if you're gonna go outside the boundary of the exoskeleton, then you need to have a conversation with those groups before you start any work. Right? You need to trade off with them the work you're gonna do and get agreement that you can either bypass those rules, do them in part, or deliver them as a partial capability or partial compliance, or be completely out for whatever reason. There'll be other belts and braces put on. So what's an example of an exoskeleton? How about: whenever we store data, that data will be encrypted on disk and in transit?
Yeah. That's probably an exoskeleton from your architecture or security team. And if we're gonna store data and we're gonna transmit it and it's not encrypted, we need to go have a really big conversation up front. Right? And we'll probably get told no. Then we talk about internal skeletons, which are patterns that we can use if they have value. So, you know, 1 might be we have a preference to model data for analytical reasons using Data Vault. Right? It's an internal pattern. So if you can do that, do it. Right? Because it makes sense. Right? There's a bunch of expertise in the organization around it. That's a well described pattern. We know that if multiple teams create hubs, we could probably conform them together if the keys match. So there's some value in reusing those patterns.
But if for whatever reason you needed to go use Activity Schema, right, because you're using event streaming and it just fits for the use case that you want to do, then that's okay. Right? It just becomes a new internal pattern. So that's my view around governance now: we need to be able to understand rules that are immutable, which are our principles, and patterns which have value, which we should reuse. And then, ideally, we should be moving to governance as code. How do we create code that kind of defines that policy and means we can apply it against what we're doing, and it can tell us where we pass or fail? Right? That would make our life so much easier.
To be able to do that without actually having humans having to check it. So that's where I'm at with complex things like security. Build it up step by step if you can. And governance, you know, turn it into rules we can't break, things we should use, and code that will actually test it for us.
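The governance-as-code idea sketched here can be made concrete: immutable principles (the exoskeletons) become checks that run against a dataset's configuration, so compliance is tested automatically rather than reviewed by committee. A minimal illustration, where the config keys and rule names are hypothetical:

```python
# Exoskeleton principles expressed as code. Each rule inspects a
# dataset's (hypothetical) config dict and returns pass/fail.
EXOSKELETON_RULES = [
    ("encrypted_at_rest", lambda cfg: cfg.get("encrypted_at_rest") is True),
    ("encrypted_in_transit", lambda cfg: cfg.get("tls") is True),
]

def check_governance(dataset_config: dict) -> list:
    """Return the names of the immutable rules this dataset breaks;
    an empty list means the exoskeleton holds."""
    return [name for name, rule in EXOSKELETON_RULES if not rule(dataset_config)]

compliant = {"encrypted_at_rest": True, "tls": True}
leaky = {"encrypted_at_rest": True, "tls": False}
assert check_governance(compliant) == []
assert check_governance(leaky) == ["encrypted_in_transit"]
```

A failing check is then the trigger for that up-front conversation with the governance group, before any work starts, rather than a committee finding at the end.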
[00:42:41] Unknown:
Another aspect of kind of the agile data way of working is the kind of question of how do you incorporate kind of collaboration across the different kind of capabilities in the team where some aspects of data engineering and data management can require substantial knowledge and understanding of the systems that you're working with, the data that you're working with, you know, what are the acceptable mutations. And if you're trying to sort of cross train across members of the team, either who don't have expertise in the domain of the data source that you're working with or don't have expertise in kind of data engineering writ large, but are quite adept at software. Like, what are some of the ways that you think about being able to kind of bring everybody up to a kind of shared level of understanding and capability and capacity?
[00:43:42] Unknown:
Yeah. So we use a pattern called T-shaped skills, which has been around for a long, long time. So when we start off with a new team and they're experimenting with this way of working, there's a bunch of foundational patterns we wanna put in place. So things like teaming agreements, definition of done, definition of ready, definition of done done. Some things that we just know need to be in place first for the team to have a chance of success. And 1 of those is mapping out the T skills. So T skills is based on the concept of you have the letter T. So across the top of the T is effectively breadth, and the vertical bar is depth. So for breadth, what we wanna do is map out the core skills a team needs to be able to deliver data or information in an end-to-end process. So it'll be things like facilitation skills, requirements gathering, development of code, testing, documentation, release management, those kind of things, data modeling. Right? So we want to understand the core skills that the team should have.
And then we want to understand the depth of capability within the team. So I have a pattern where I talk about novice, practitioner, expert, and coach. Yeah. So novice is somebody who's, you know, done a little bit of it, but doesn't do it day to day. Practitioner is, it's my day job, I'm good at it. Expert is, hey, I could actually teach you how to do it or show you how to do it. And a coach is, actually, I could teach and coach and mentor other people to do it the way I did. And there's quite a big jump in my head between an expert and a coach, because sometimes experts don't want to coach. They're just bloody good at what they do, and that's okay. So we put that down the page and then we get everybody to map out their skills. Right? So from a documentation point of view, am I a practitioner?
Am I an expert? Am I a coach? And then once everybody's done their T skills, we overlay it as a team. Right? As a group. Right? And when we talk about a team, you know, I subscribe to the pizza philosophy. Right? So between 4 and 9 people is an optimum size for a team for a whole bunch of reasons. So we get those people, we put their skills down, we overlay them, and we see we've got gaps. Oh, okay, look, we don't have any testers. We don't outsource testing to another team. Right? We want to have testing skills in our team. So what are we gonna do? Some people are gonna upskill who are interested in that. We're gonna bring another team member in. Right? We wanna fill that gap. We also wanna look where we have duplication. Right? Which is good, now, where we've got 2 people that are strong at data modeling or 2 people strong at testing or 2 people strong at development, because now we've got redundancy in the team.
So that's what we wanna do: we wanna understand that and make sure that the team becomes self sufficient over time. And the good thing about that, actually, is it helps the team have a conversation about where they wanna go in their careers as well. 1 of the things that agile does to us, which is not good, is it gets us into this factory. You know, it's funny, when we talk about scrum, we talk about sprints. It's not a sprint. It's a marathon. Right? Once a team's rocking it, they're just in there day in, day out. It's monotonous. Right? And so we want some way for people to grow, you know, be able to get more skills. And the skills matrix or that T skills thing says, hey, look, I'm actually really interested in jumping over into that facilitation space. Hey, I'm a novice, but, you know, how do I get to practitioner?
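The overlay step of the T-skills exercise is mechanical enough to sketch: order the levels, then look for skills with no practitioner-or-better coverage (gaps) or only 1 person (no redundancy). The skill list, team members, and threshold below are hypothetical examples of the exercise, not a prescribed tool:

```python
# Ordered depth levels from the T-skills pattern.
LEVELS = {"novice": 0, "practitioner": 1, "expert": 2, "coach": 3}
CORE_SKILLS = ["facilitation", "data modeling", "development", "testing"]

def team_gaps(team: dict) -> dict:
    """Count practitioner-or-better people per core skill and return the
    skills with fewer than 2 (i.e. a gap, or no redundancy)."""
    coverage = {
        skill: sum(
            1 for skills in team.values()
            if LEVELS.get(skills.get(skill, "novice"), 0) >= LEVELS["practitioner"]
        )
        for skill in CORE_SKILLS
    }
    return {skill: n for skill, n in coverage.items() if n < 2}

# Hypothetical 2-person squad.
team = {
    "ana": {"data modeling": "coach", "development": "practitioner"},
    "raj": {"development": "expert", "testing": "novice"},
}
# Nobody covers facilitation or testing; data modeling has no redundancy.
assert team_gaps(team) == {"facilitation": 0, "data modeling": 1, "testing": 0}
```

The output is exactly the conversation starter described above: which gaps do we upskill into, and where do we hire.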
So that's what we do when we have a team of 4 to 9, and then we start to scale. Right? And that's where we hit another problem, because now, you know, we have 2 pods, squads, teams, or 4 of them. The most successful practice I've seen is we split our squad in half, create 2 squads, bring in new people, and cross skill them. Now what happens is you lose what we call velocity. Right? The 2 squads no longer deliver as much as the 1 squad used to, but that's okay. Right? Over time, they'll build up and we'll get the 1 plus 1 equals 3 behavior. As we build that up, we're now gonna see specialized skills where actually we don't have enough of those people in the organization to have 1 person per squad.
And that's where we start bringing in specialized squads and seeing how they work, where they help those other squads by either upskilling them or by doing the work for them. Now there's a bunch of patterns out there that help us do that. Spotify published 1, it got called the Spotify model. We did bad things to it as a community. They won't share what they're doing with us anymore, which is a real loss. My favorite 1 at the moment is Jurgen's 1 called unfix.work. And the reason I like his 1 is I like the way it's described. So if you go to unfix.work, he has nice pretty pictures with boxes that are colored and a really good description of what each 1 is. And he talks about, you know, teams that actually do the work and teams that coach other people to do the work.
What we'll see when we start scaling in the data world is we would typically see a platform team come out. Definitely in the data mesh world. Right? That's the new hot thing. 1 of the things I talk about with teams when we start to get to that level is the platform team is now building a platform or a product for somebody else, not for themselves. So they've got to bring in a whole lot of product thinking, a whole lot of new skills that are different to the skills when they were embedded in the squad building out their technology. But that's okay. Right? We just map out the T skills. What do they need as a platform team to build a product? Which is effectively what we're doing with AgileData. Right? We're building a product that other data teams can use to do the work. And I've seen lots of teams internally do that capability.
[00:48:55] Unknown:
As you do bring in that split between the data platform and the data engineers or data products, what are some of the useful interfaces for defining the boundaries? So that if you're somebody who's working on the platform layer and you start to try and, you know, get involved in discussions about how should you model your data or, you know, what is the right level of granularity for a data product, the team can say, actually, no, that's not your concern. You know, you don't need to worry about that. You know, this is the thing that we actually need from you.
[00:49:28] Unknown:
Yeah. So that's the key, right, is that the platform team are now building a capability for another customer, and they've got to decide where they are in their governance cycle. Are they setting the, you know, the exoskeletons, the rules that are immutable? So our platform will only accept modeling using Data Vault. Or are they building something with internal patterns which says our platform enables dimensional modeling, star schema, you know, Data Vault, Activity Schema, third normal form. Right? So they have to be really clear about how opinionated the platform they're building is. The second thing is they need to understand how they're going to innovate that platform. What's their way of working? How do they make toast?
So an example I've seen is with 1 customer, each of the squads actually did the first cut of platform capability. So what happened was the teams were out building data or information products, and they'd need a new capability on the platform. And the problem they had was they had a bottleneck. So whenever they needed a new capability, they'd have to telegraph it to the platform team way early. The platform team would have to build it in time for just when the squad needed it, for the squad then to build out what they needed for their stakeholder. And that timing became a nightmare because often the information product the squad was gonna work on got changed. And therefore, the platform team were building out features that weren't needed just yet. So 1 of the techniques they used was they had good technical people in the squads.
So they would do the first cut of the technology patterns. They would use them, and then they would basically be picked up by the platform squad, who would then harden them and make them available to the rest of the squads. Because it's not just about building the technology, it's about all the supporting things around it, documentation, training, how do you use it, testing, all those kind of things. So that was, you know, kind of a prototype iteration in the squads, and hardening and productionizing in the platform team. I still have a preference personally for the platform team to have a road map and be able to move fast enough that they can move features in and out of their road map in time for those squads. But that orchestration is often hard. So the key thing is that's your way of working. Right? You have a theory about how you're gonna make toast, how you think it's gonna work.
Give it a go. When something fails, you know, your retro is let's look at that process and say, why did it fail? What are we gonna experiment with to change the way we work? And how do we know whether it made it better or worse? If it made it better, lock it in. Right? If it made it worse, stop doing it and experiment with something else. So there is no methodology. Right? But there are a bunch of patterns that other teams have done that may be useful to them.
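The improve loop described here, try a change, check whether it made things better, then lock it in or stop doing it, can be sketched as a tiny routine. This is an illustration only; the function name, the dict-of-practices representation, and the scoring callable are all assumptions, since the episode describes a human process, not code:

```python
def run_experiment(way_of_working, change, measure):
    """Trial one change to a way of working; keep it or roll it back.

    way_of_working: dict of current practices
    change: (practice, new_value) pair to trial
    measure: callable scoring a way of working (higher is better)
    """
    baseline = measure(way_of_working)
    practice, new_value = change
    trial = {**way_of_working, practice: new_value}
    if measure(trial) > baseline:
        return trial, "locked in"      # it made things better: keep it
    return way_of_working, "reverted"  # worse or no better: stop doing it

# Hypothetical team: they score silent retros higher than verbal ones.
wow = {"retro_style": "verbal", "standup": "daily"}
score = lambda w: 2 if w["retro_style"] == "silent" else 1
print(run_experiment(wow, ("retro_style", "silent"), score))
```

In practice the measure is whatever signal the retro agrees on, cycle time, defect count, team happiness; the point is simply that each change is kept or reverted against a baseline rather than accumulated untested.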
[00:52:11] Unknown:
Data engineers don't enjoy writing, maintaining, and modifying ETL pipelines all day every day, especially once they realize that 90% of all major data sources like Google Analytics, Salesforce, AdWords, Facebook, and Spreadsheets are already available as plug and play connectors with reliable intuitive SaaS solutions. Hevo Data is a highly reliable and intuitive data pipeline platform used by data engineers from over 40 countries to set up and run low latency ELT pipelines with 0 maintenance. Hosting more than 150 out of the box connectors that can be set up in minutes, Hevo also allows you to monitor and control your pipelines. You get real time data flow visibility with fail safe mechanisms and alerts if anything breaks. Preload transformations and auto schema mapping precisely control how data lands in your destination, models and workflows transform data for analytics, and reverse ETL capability moves the transformed data back to your business software to inspire timely action.
All of this, plus its transparent pricing and 24/7 live support, makes it consistently voted by users as the leader in the data pipeline category on review platforms like G2. Go to dataengineeringpodcast.com/hevodata today and sign up for a free 14 day trial that also comes with 24/7 support.

That question of accepting failure is an interesting 1 because sometimes it's not always clear whether you have failed or how to decide when to give up, particularly when you start getting into the sunk cost fallacy of, well, we've already put so much work into it. If we just push it a little bit further, then it will work. And, you know, because of the inherent complexity in the data space, what are some of the useful heuristics that you've found for being able to help somebody understand it's not going to work, no matter how much more work you put into it, versus you actually just haven't put enough work into it yet, and that's why it's not working?
[00:54:00] Unknown:
So, you know, we'll see that sunk cost thing. That is a massive problem. And it seems to be worse in the data space than the application space, and I don't know why for sure. But I think part of it is once those numbers go out, people rely on them. And if we do anything and those numbers change, now we've got a real problem. Right? So, you know, we had a million customers, and now we've only got 500,000. Why was that? Was that because we changed the rule of what the definition of a customer was? Was that because our code was wrong? Was that because the source system did something and it affected the number, but the number is now actually correct, and we've been reporting the incorrect number for a while? So there seems to be a bigger blast impact when we fail with data, because effectively the decisions, or the consequences of the information we've used, seem to be bigger.
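The "million customers suddenly became 500,000" scenario is the kind of blast radius a simple between-run reconciliation check can catch before the numbers go out. A minimal sketch, assuming a Python pipeline; the function name and the 20% threshold are illustrative, not from the episode:

```python
def check_metric_drift(name, previous, current, max_ratio=0.2):
    """Flag a metric whose value moved more than max_ratio between runs.

    Returns a warning string instead of publishing, so a human can decide
    whether the change is a fixed bug, a redefinition, or a real problem.
    """
    if previous == 0:
        return None if current == 0 else f"{name}: went from 0 to {current}"
    drift = abs(current - previous) / previous
    if drift > max_ratio:
        return (f"{name}: {previous} -> {current} "
                f"({drift:.0%} change exceeds {max_ratio:.0%} threshold)")
    return None  # within tolerance: safe to publish

# The scenario from the conversation: the customer count halves overnight.
warning = check_metric_drift("customer_count", 1_000_000, 500_000)
print(warning)
```

The check does not tell you which of the three causes Shane lists is at play, only that someone should find out before the stakeholders do.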
But if I go back to kind of the core of the question, the team know when they need to change, when they need to iterate. They just need to be given permission to do it. Sometimes in the team, you'll see 1 person, because it's their baby, keep wanting to sunk cost it. Right? Keep going and investing because, you know, you're only 10% away from being ready. You know, they're 90% done. But sometimes that last 10% is the hard part. Right? That's where all the effort is, or you're just not gonna get there. But team behavior is really interesting. As a team, they will have a culture where they know, and the team will find ways of stopping that behavior happening over time, either in a polite way or in a not so polite way. But that's 1 of the benefits of teams working on a problem, not individuals. And again, what I think in the market at the moment is we've gone to hyper specialization.
Right? We've gone to people really, really specializing on 1 small moving part, or we've gone to a single person end to end, right, where 1 person picks it up and gets all the work done without any colleagues. And both of those patterns for me are extremes, and they can work, but you have more chance of success if you have a small team of people working together. And let's be honest, it's more fun for me anyway. Yeah. A group of people working and solving problems and leveraging each other's skills and being there for the journey. That's more fun than working on your own, but maybe that's just me.
[00:56:14] Unknown:
As a kind of higher level question from your perspective of somebody who has been embracing these agile practices and working with teams and working in the data space, do you see the overall trend of the available tooling and systems and kind of infrastructure capabilities as bending towards kind of being conducive to agile practices, or do you see that they are, in some cases, actively harmful towards those approaches?
[00:56:45] Unknown:
I wouldn't say they're harmful because I think we can make any technology work. I think they've got better. You know? If we look at the data space for transforming data via code, we've adopted good practices out of the software engineering space with things like version control and CICD. But if we look at some of the front end tools, if we look at our visualization tools, very few of them allow us to check in our code and version it. Right? We're still back in the dark ages of going into that dashboard and, you know, changing it and hoping you didn't break it. Or, you know, copy of copy of copy Tuesday just to make sure I've got 1 I can regress to.
So yeah, we're still at table stakes. What I don't see is our tooling enabling us to use the machine to do the work for us. We're still human centric. We're still bashing the problem as individuals, and we're doing it relatively manually. So I don't think we're well served by technology in the data space to adopt agile techniques. I think we in the data domain, for some reason, love complexity at the moment. We love solving that complexity problem of the technology stack, not solving the complexity problem of getting information to our customers as early as possible to add that value and get their feedback. So I think we're focused on the wrong problem. As technologists, we love to solve those technology things.
So I think it will change. I think we see waves. I think we'll see a move back to less complexity, less involvement in engineering out the technology, and more involvement around engineering out the data problems, but time will tell.
[00:58:20] Unknown:
And in your experience of working with clients, what are some of the, I guess, perennial questions or points of confusion that you've had to work through to help people understand how to think about agile approaches, how to think about proper scoping, how to think about the kind of useful integration and flow, and how to structure the work so that you can do these kind of, you know, fully vertical end to end implementations?
[00:58:49] Unknown:
So what we see is confusion of terminology. I mean, what we're doing is we're taking a whole lot of agile patterns, terms, and practices, and a whole lot of data terms and practices and patterns. And now we're bringing in a whole lot of product thinking and a whole lot of product terms and practices and patterns. And that causes confusion. You know, the difference between a product owner and a product manager. So what I see is some repeatable things that people struggle with. I see the data modeling problem, the idea that we can just go and create this information and whack it out to a consumer or a stakeholder without modeling the data.
Because we think that being ad hoc is the same as being agile, and it's not. We don't wanna do ad hoc. We just wanna chunk the work down into smaller iterations that we can do faster, but still repeatably, still safely. I see a whole problem around build versus buy. I see a problem around organizations that are focused around projects and programs, and that's how they fund work to be done, and that's how they structure everything, yet they want their teams to be agile. And so they're putting on these brakes, these constraints, that aren't making them safer. It's just making them slower. 1 of the core ones I see is this concept I have of a heat shield, right, which is a person in the organization that's sponsoring the team to be able to adopt that agile way of working and fail as they do it. And so that heat shield effectively sits above them. When something goes wrong, they take most of the heat for the team and keep the team safe. So if you don't have that role, if you don't have that person that's sponsoring it in your organization, then the team are exposed. Right? That bad behavior gets put upon them when things don't go well.
Another 1 is I see teams try to scale too fast, you know, rather than start off with 1 small pod, squad, team and prove your way of working and then try and scale. They try and do it with 4 or 10 or whatever squads at once. Right? And that's hard. So start off small, because, you know, adopting an agile way of working with data is hard anyway. Give yourself some chance of success by removing that complexity and uncertainty. Same thing with boiling the ocean. Right? The idea that we're gonna go build out a platform for 12 months before we add any value to our customers is just madness. Another pattern, an anti pattern really, is somebody other than the team makes the promises.
So we have a project manager who goes and tells the stakeholders how long it's gonna take. They're not the team. They've never done it before. The team are the only ones that can estimate, or guesstimate, with any sense of accuracy. And even then, they'll be wrong. But they should make the promises of what they're gonna deliver and when. And my last 1, which is my personal 1, is project managers who become scrum coaches. Right? As a project manager, your skills and the things you're taught to do are very different to being a servant leader as a scrum coach. So I find that people like business analysts make the best scrum coaches. So those are some of the areas of confusion or risk that I see a lot of teams run into as they start their journey.
[01:01:38] Unknown:
In your own experience of working in this space and working with teams and building a product to help people adopt and adapt to these agile principles, what are some of the most interesting or innovative or unexpected ways that you've seen people try to incorporate these ideas into the way that they work with data?
[01:01:57] Unknown:
So every team that I work with is innovative. I learned early in my journey of coaching that I made a mistake. And the mistake I made was I worked with a team, and they got to a level of maturity. And when I went in to help the next team, I just, for some reason, started off at that level of maturity, and the team weren't there. They needed to go back to the beginning and kind of grow to that level of maturity. After that, I made another mistake, which was assuming the patterns the first team applied were applicable to the 2nd team and were going to be successful. And that wasn't true. It was different context, different organization, different data, different platform. So, you know, the idea that each team can iterate and build their own way of working is amazing to watch, but was unexpected when I first started.
And the other thing is the team culture is really important, and we have to be really cognizant of that. An example is, you know, if we talk about scrum and we do retros, the standard pattern is, either virtually or physically, we're putting stickies up on the board about what went well, what didn't go well, what we should improve. And it's a conversation. It's very verbal and it's very visual, and there's this buzz in the room, and that's great. 1 of the teams I worked with were really introverted. And so they didn't enjoy that type of process. They saw the value of reviewing the work they were doing and the way they were doing it and iterating, but they just didn't enjoy that standard pattern that we would have for doing that work.
So what they did was they basically used Azure DevOps, and they would have quiet time. They would sit in the room together, because this was pre COVID, and they would bring up their laptops, bring up the board, and they would type their notes, their stickies, into the board with absolutely no conversation. And then, you know, we'd time box it, and then it was like, cool, now review it. And they would go in and drag the notes around and make comments on the notes and dot score them, again with absolutely no verbal conversation. And then we'd go, right, you know, we've scored the things we need to focus on next. What are we gonna do about it? And they'd go and create, you know, the next set of notes about the work to be done, and they'd type over the top of each other. And it was like being in a library, and it did my head in. Right? Because I was like, this is not what I'm used to. Like, where's the buzz? But for that team, it was the right fit. It achieved the goal that we wanted them to achieve, which is look at the work you're doing, figure out what's not going right.
You know, where you're gonna iterate, and iterate it. Right? That's the core principle. That's the core pattern. So for me, every time I work with a team, I see something that's amazing. And that's cool. Right? But we have to empower the teams to build their own way of working and encourage them to leverage patterns that have had success, right, or may have success, and experiment with them. Absolutely.
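The silent retro that team ran maps naturally onto data: notes go on the board, everyone dot scores them, and the highest-scored notes become the next experiments. A small sketch of just the tallying step; the names here are illustrative and are not the Azure DevOps API the team actually used:

```python
from collections import Counter

def tally_dot_votes(votes):
    """votes: list of (person, note) pairs, where each pair is one dot.

    Returns notes ordered by score, highest first, so the team can pick
    what to experiment with next without any verbal discussion.
    """
    scores = Counter(note for _person, note in votes)
    return scores.most_common()

# A made-up silent retro: three people, four dots placed on two notes.
votes = [
    ("ana", "deploys are slow"),
    ("ben", "deploys are slow"),
    ("ana", "unclear acceptance criteria"),
    ("cam", "deploys are slow"),
]
print(tally_dot_votes(votes))
```

Whether the board is stickies, Azure DevOps, or a script, the pattern is the same: score silently, then act on the top of the list.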
[01:04:32] Unknown:
And in your work of working with teams and living in this space, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[01:04:42] Unknown:
So there is no methodology. There is no out of the box way of working. Every team is different. Teams are amazing. Just empower them and enable them to get the work done. And typically, they always will. So that's the learning I constantly get.
[01:05:00] Unknown:
For people who are kind of revisiting the ways that they work, or looking at how they can incorporate new capabilities or new practices into their overall workflow for delivering analytics and data assets, what are the cases where an agile approach is the wrong choice for a data project?
[01:05:20] Unknown:
So I don't think there ever is. I'm a bit biased on that these days. I think all the alternatives are not as good as an agile way of working. However, there are some things that are big warning signs that you're gonna struggle. So 1 of the ones is if there's no uncertainty, if what you're doing is repeatable, then an agile way of working may not be the best fit for you. But I've never seen that, right? I've never seen teams working with data where there's not a high level of uncertainty. As I said, where there's no heat shield, if you're starting this journey and you don't have a senior person that can hold that heat when it hits from the rest of the organization, then you're going to struggle, right? There's going to be some real problems coming.
And the last 1 is when the organization is command and control, where it's hierarchical, right? Where there is a culture of blame and people need to hide from that blame, or serious consequences happen to them, like they get fired, those kind of things. Then your organization is not gonna support an agile way of working. It's not the culture of the organization, and your team are gonna be exposed. So have a really good think about whether you wanna start down that journey and whether you wanna expose the team to that. Often, I work with teams that are in a hierarchical organization, but we have that heat shield. Right? So the team are empowered for this new way of working. And what do we see? We see them be successful, and we see the rest of the organization go, ooh, how are you doing that? Right? And they start to watch and learn and ideally adopt it themselves.
That's what success looks like. But, yeah, those are probably the 3 warning signs for me that you're going into a high level of risk by adopting an agile way of working.
[01:06:55] Unknown:
Are there any other useful references or resources or practice exercises that you recommend for teams who want to dig in and understand more about how to apply agile practices to their data work?
[01:07:10] Unknown:
So I'm trying to spend as much time as I can publishing the ones that I've used with teams in a way that anybody else can pick them up for free. So if you go to wow.agiledata.io, that's a site we've created where I kind of brain dump those patterns and templates as much as I can. There's a bunch of books out there that I've read over the years and approaches that have value. So I'm a great fan of Lawrence Corr and his BEAM methodology. I think that's a great way of gathering data requirements. I'm a great fan of Scrum and Kanban and Lean and some of those practices, and there's lots of good books out there about how you can pick those patterns up. Personally, I'm a great fan of Data Vault as a modeling technique. I find it's the most flexible at the moment. It's got some problems, but it's certainly the 1 that can adapt to change the most in my experience. Those are probably the core ones. And unfix.work is probably the best team topology explanation that I can find out there at the moment. So those are probably my go tos when I point to patterns. And then the information product canvas, which I've published. For me, yeah, it's my go to whenever I work with a customer and try and understand what information they want delivered. So that's probably my short list of go to stuff at the moment.
[01:08:21] Unknown:
Are there any other aspects of your experience of working with data teams or how to apply agile methodologies to data work or how to think about the kind of technical and team structures to support that that we didn't discuss yet that you'd like to cover before we close out the show? I think the main thing for me is, you know, I call out to everybody that when you find something that's worked, a pattern that's worked for your team, just sharing is caring.
[01:08:45] Unknown:
You know, take the time to write it up in a simple way and publish it out so the rest of the world can ideally see it and experiment with that pattern as well. Sometimes we hold those things internally. And, yeah, if we look at it, everything we do is iterating on other people's work. That's what it's all about. We should try and pay it back and push some patterns that we've had success with back into the community so we can help our fellow data practitioners where possible.
[01:09:10] Unknown:
For anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[01:09:24] Unknown:
The obvious answer for me is reducing the complexity, and that's what we're focused on with agiledata.io. How do we remove a lot of the engineering process and practices and effort that's required and automate them? So I think that's the problem today. What I think the opportunity in the market is over the next couple of years is, right now, everything we do is based on human effort. We don't use the machines to do the effort for us. And I think, you know, we're gonna see over the next 2, 5, maybe 10 years where we start using algorithms and machines to recommend and do the work for us. And I'll give you an example.
You know, when we collect a piece of data and we look at it, you know, sometimes there's a hint. Right? If it's coming from a relational database, there's a foreign or primary key. It's flagged, and we know that's a unique key for customer. But in this whole event streaming world, 80% of them are poorly designed systems for capturing data that don't have keys on them for us. We have to go and look at that data and figure out what the key is. And so, you know, using machine learning to actually identify a candidate key by looking at the data and giving us a hint of what that key is would save us so much time. And, you know, it's not a simple process, because it might be a concatenated key. Right? It might be 3 or 4 columns that we need to say this is actually a unique record.
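Even without machine learning, the brute force version of what Shane describes, scanning column combinations for uniqueness to suggest a candidate (possibly concatenated) key, is straightforward to sketch. Assumptions here: in-memory rows as dicts and a small cap on key width; a real profiler would sample large tables and score statistically rather than demand exact uniqueness:

```python
from itertools import combinations

def candidate_keys(rows, columns, max_width=3):
    """Suggest column combinations whose values uniquely identify each row.

    rows: list of dicts, one per record; columns: column names to consider.
    A combination is a candidate key when no two rows share its value tuple.
    """
    n = len(rows)
    found = []
    for width in range(1, max_width + 1):
        for combo in combinations(columns, width):
            # Skip supersets of keys already found; they add nothing.
            if any(set(key) <= set(combo) for key in found):
                continue
            seen = {tuple(row[c] for c in combo) for row in rows}
            if len(seen) == n:  # all value tuples distinct: candidate key
                found.append(combo)
    return found

# Hypothetical keyless event feed: no single column is unique,
# but till plus sequence number together identify each event.
events = [
    {"store": 1, "till": 1, "seq": 1},
    {"store": 1, "till": 1, "seq": 2},
    {"store": 1, "till": 2, "seq": 1},
]
print(candidate_keys(events, ["store", "till", "seq"]))
```

The hint Shane wants from the machine is exactly this output, proposed to a human for confirmation rather than trusted blindly, since a column can be unique in today's data and duplicated in tomorrow's.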
But those are the areas where I think we can actually use the machine to reduce the work we do. And at the moment, we're not focused on that. At the moment, we're focused on grabbing all the parts to actually build a machine that can run. So we gotta get over that and move on to the next phase. So that's where I think the major opportunity is: using the machines to automate as much of our work as we can. That's not easy.
[01:11:06] Unknown:
Alright. Well, thank you very much for taking the time today to join me and share your experiences and thoughts on how we, as data practitioners, can adopt and embrace some of these agile ways of working. So I definitely appreciate all of the time and energy that you've put into the work you've been doing with data teams that you work with and encapsulating that into your product. So thank you again for all of that, and I hope you enjoy the rest of your day. Yeah. Thank you for having me on the show. It's been great.
[01:11:36] Unknown:
Thank you for listening. Don't forget to check out our other shows, Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com. Subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a product from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Guest Introduction
Shane Gibson's Career Journey
Challenges in Data Management and Agile Solutions
Agile Principles in Data Management
Thin Slicing and Iterative Processes
Defining Requirements and Understanding Value
Data Modeling and Iteration
Definition of Done and Tooling Challenges
Architectural Decisions and Technical Debt
Security and Governance in Agile Data Management
Collaboration and Skill Development in Teams
Platform Teams and Innovation
Accepting Failure and Iteration
Tooling and Technology Trends
Common Challenges and Solutions
Innovative Approaches and Lessons Learned
When Agile is the Wrong Choice
Resources and Recommendations
Final Thoughts and Call to Action