Summary
Misaligned priorities across business units can lead to tensions that drive members of the organization to build data and analytics projects without the guidance or support of engineering or IT staff. The availability of cloud platforms and managed services makes this a viable option, but can lead to downstream challenges. In this episode Sean Knapp and Charlie Crocker share their experiences of working in and with companies that have dealt with shadow IT projects and the importance of enabling and empowering the use and exploration of data and analytics. If you have ever been frustrated by seemingly draconian policies or struggled to align everyone on your supported platform, then this episode will help you gain some perspective and set you on a path to productive collaboration.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- Are you spending too much time maintaining your data pipeline? Snowplow empowers your business with a real-time event data pipeline running in your own cloud account without the hassle of maintenance. Snowplow takes care of everything from installing your pipeline in a couple of hours to upgrading and autoscaling so you can focus on your exciting data projects. Your team will get the most complete, accurate and ready-to-use behavioral web and mobile data, delivered into your data warehouse, data lake and real-time streams. Go to dataengineeringpodcast.com/snowplow today to find out why more than 600,000 websites run Snowplow. Set up a demo and mention you’re a listener for a special offer!
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
- Your host is Tobias Macey and today I’m interviewing Sean Knapp and Charlie Crocker about shadow IT in data and analytics
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by sharing your definition of shadow IT?
- What are some of the reasons that members of an organization might start building their own solutions outside of what is supported by the engineering teams?
- What are some of the roles in an organization that you have seen involved in these shadow IT projects?
- What kinds of tools or platforms are well suited for being provisioned and managed without involvement from the platform team?
- What are some of the pitfalls that these solutions present as a result of their initial ease of use?
- What are the benefits to the organization of individuals or teams building and managing their own solutions?
- What are some of the risks associated with these implementations of data collection, storage, management, or analysis that have no oversight from the teams typically tasked with managing those systems?
- What are some of the ways that compliance or data quality issues can arise from these projects?
- Once a project has been started outside of the approved channels it can quickly take on a life of its own. What are some of the ways you have identified the presence of "unauthorized" data projects?
- Once you have identified the existence of such a project how can you revise their implementation to integrate them with the "approved" platform that the organization supports?
- What are some strategies for removing the friction in the collection, access, or availability of data in an organization that can eliminate the need for shadow IT implementations?
- What are some of the inherent complexities in data management which you would like to see resolved in order to reduce the tensions that lead to these bespoke solutions?
Contact Info
- Sean
- @seanknapp on Twitter
- Charlie
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Shadow IT
- Ascend
- ZoneHaven
- Google Sawzall
- M&A == Mergers and Acquisitions
- DevOps
- Waterfall Development
- Data Governance
- Data Lineage
- Pioneers, Settlers, and Town Planners
- PowerBI
- Tableau
- Excel
- Amundsen
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, a 40 gigabit public network, fast object storage, and a brand new managed Kubernetes platform, you've got everything you need to run a fast, reliable, and bulletproof data platform. And for your machine learning workloads, they've got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show.
Are you spending too much time maintaining your data pipeline? Snowplow empowers your business with a real-time event data pipeline running in your own cloud account without the hassle of maintenance. Snowplow takes care of everything from installing your pipeline in a couple of hours to upgrading and autoscaling so you can focus on your exciting data projects. Your team will get the most complete, accurate, and ready-to-use behavioral web and mobile data delivered into your data warehouse, data lake, and real-time data streams. Go to dataengineeringpodcast.com/snowplow today to find out why more than 600,000 websites run Snowplow. Set up a demo and mention you're a listener for a special offer.
And you listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include Strata Data in San Jose and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events and to take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey. And today, I'm interviewing Sean Knapp and Charlie Crocker about shadow IT in the data and analytics space. So, Sean, can you start by introducing yourself?
[00:02:12] Unknown:
Yeah. Absolutely. I'm Sean Knapp. I am the founder and CEO of Ascend.io.
[00:02:19] Unknown:
And Charlie, can you introduce yourself as well?
[00:02:22] Unknown:
Yes. My name is Charlie Crocker, and I am the founder and the CEO of Zonehaven.
[00:02:28] Unknown:
And going back to you, Sean, do you remember how you first got involved in the area of data management?
[00:02:32] Unknown:
I do. It was 15 years ago now. I had just joined Google as a front-end software engineer on web search, and we were obviously known for doing a lot of user experience work, from the background color of ads to the layout of the page. And I quickly found that even the smallest experiment, where you move around just a few pixels on the front end, oftentimes brought with it hours and hours of data analysis, writing MapReduce jobs in the internal language there called Sawzall to analyze the usage of hundreds of millions, if not billions, of users. So really quickly, and early on in my career, fresh out of college, I found myself doing a lot of pretty complex big data jobs just to answer questions around consumer experience on web search.
[00:03:27] Unknown:
And, Charlie, do you remember how you first got involved in data management?
[00:03:32] Unknown:
I was an environmental consultant many, many years ago, 25 years or so ago. And in the organizations that we worked in, everybody was using paper and sometimes, maybe if we were lucky, spreadsheets. And so we were asked to make charts, graphs, maps, all sorts of information. And as a junior staff member, you were expected to just sit in your cube and draw these things out. So I found a way to work directly with the analytics labs, get them to start sending us the data in an electronic format, and I brought this interesting concept into the environmental consulting world called a database. And that led to building out databases, maps, and then moving into larger and larger scale datasets as my career progressed.
[00:04:19] Unknown:
And so as I mentioned at the open, we're talking about the idea of shadow IT and how it manifests in the space of data management and data analytics. So before we get too far along, can each of you start by sharing your definition and how you think about the term shadow IT?
[00:04:37] Unknown:
Sure. Happy to go first. Oftentimes when I think about the notion of shadow IT, it's really what we see happening from consumers and customers of IT. Oftentimes they're looking to test out, experiment with, or try new technologies to meet some of their needs. The mandate behind IT is usually to look for technologies and capabilities that are of use to the broader organization. And oftentimes the different customers of IT are looking to go faster, get more experimental, or simply self-serve and offload that burden from IT.
And so you'll start to see those business units take on more of that responsibility on their own and bring more of it in house, if you will, into their own organization.
[00:05:30] Unknown:
Yeah. And I think the term shadow IT is interesting. It kind of has, I think, a little bit of a negative connotation. I really see shadow IT being the environment where people choose to work in silos, or sometimes in business units. This has really accelerated, I think, a lot in the last probably 5 to 10 years with the advent of all of the cloud service providers, because it's easier and easier for people to do their own IT without having a full IT organization. So you're starting to see a lot of that experimentation and a lot of that acceleration without necessarily having a central organization to drive standards, etcetera.
[00:06:17] Unknown:
One of the things that you were both pointing out is this pressure to be able to deliver, and the availability of some of these different services that are easier to access and easier to provision without necessarily a strategy. And so I'm wondering, what are some of the main drivers of that tension and of those types of projects that are leaning on these different self-provisioned services, which help to contribute to these projects that are maybe not driven by the engineering or IT organizations within a business?
[00:07:07] Unknown:
To me, in most cases, it's about acceleration or it's about fiefdoms. It's about trust. So you see in organizations that have a lot of M&A, people coming in with their own way of doing business and finding it hard to adapt and move into an existing organization. You find existing organizations where there's, in some cases, a lack of trust between the engineers that build the products and the IT organization that may be tasked with managing the services and the contracts with the big vendors. And then you've got the conflict between the goal of getting a new product out really fast and leveraging standards and using standard operating procedures.
[00:07:55] Unknown:
Yeah. I would double down on what Charlie's saying there, which is that the drivers are conflicting goals. Oftentimes IT exists to help centralize, standardize, and get economies of scale and leverage, which by design means they should move slower and more thoughtfully and more carefully. At the same time, that is in conflict with the needs of the business, which is to move very quickly and at times even to perhaps break some glass in the pursuit of moving quickly in response to business demands. And I think the reason this exists, if we pop back up even higher, is that we're seeing across industries that the wave of digital transformation is pushing businesses to move faster, and they have to respond more quickly. For example, we've seen in software development that the introduction of DevOps was, in my view, all about how do we enable more people to build more software applications faster yet more safely. And we do that through a variety of different constructs.
Yet that same level of agility that we're now getting in software, which many companies have appreciated for a while, we haven't yet received in the data world. We're still doing much more waterfall-style, traditional approaches. And as a result, we see those pressures, and that's what's causing and igniting a lot of this behavior.
[00:09:29] Unknown:
And there's a little bit of danger in that behavior too with the data. With the software piece, I'm not as deeply engaged with that. But with the data piece, you end up with siloed datasets, siloed data pipelines, repeated data pipelines. You increase expense, but you also have whole different privacy and management schemes that need to be dealt with. So with data, it gets really hard, in a large organization with a lot of different silos and a lot of different processes, to really even understand your compliance regime, for example.
[00:10:01] Unknown:
Totally agree. Totally agree. One of the other things that's interesting about that difference in terms of software versus data projects is the level of impact that can be had throughout the business of a project being delivered in an accelerated fashion, as well as some of the issues around things like compliance or data quality that arise and are somewhat unique to this analytics space. And I'm wondering what you have seen as some of the challenges posed and some of the driving forces towards building these projects that are maybe outside of the supported platforms within an organization?
[00:10:38] Unknown:
Yeah. I think it's interesting when we think about how we were trying to centralize and standardize in many ways, and I would make a potentially inflammatory comment, which is, I'm not sure it really matters as much these days if we standardize on whether you're going to process your data with Hadoop or Spark, for example. I'm not really sure that those levels of standardization matter so much as, do we have standardization at all? And this is where I see a lot of IT needs going today, which is, do we standardize on how we articulate what data exists, how we know that it got there, why it got there, and then where it goes? And so it's much more the notion of governance, lineage, tracking, a lot more at the metadata layer, because I think, as Charlie is highlighting, that's the stuff that gets scary. It's like, do you know if your stuff is actually legit and valid? And if it is broken or it does have a bug... and legal, yes.
In the software world, you would make API calls. Right? And so if somebody fixed a bug, the next time you made that API call, you'd probably get the right answer. Yet in the data world, you're making copies of data and you're moving it all over. And so you may have produced a new dataset from a buggy piece of code, but how do you know that it even went to the right place? Or, as Charlie highlights, should that data have gone there? And gosh, if it went somewhere it wasn't supposed to, do you at least know that it went there, and do you know how to retract it, and so on? And so the problem is more, I would argue, at the metadata layer these days, of just knowing what is happening with your data.
[00:12:30] Unknown:
As you mentioned, the tracking of the data is definitely one of the key problems that exists in this particular realm. Because as you said, with software, if there's a bug, you fix it, and it gets redeployed. But with data, if it gets copied to 5, 10, 15, a thousand different places, and then you realize, oh, there was one different way that we were tracking it, or there's a particular field that needs to be masked, how do you then go and apply that transformation or apply that constraint on all of the...
[00:13:00] Unknown:
One of the big issues is which metric is the right metric. I mean, 10 people can run the same pipeline and call the outcome the same number, and you could have 10 different numbers. Right? And so, at least early in the transition for a lot of companies, he who owns the metric owns the story. And so every individual would want to come into a C-staff meeting with their own set of metrics, for example. So how do you, from the top down, start saying, look, how do we drive standardization without squelching innovation?
And so the stuff that Sean's talking about around metadata, around being able to have visibility into the pipelines, being able to rank and canonize certain datasets and certain metrics, those are the key things that allow success in a data product or in a data pipeline within an organization.
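Editor's sketch: the lineage problem discussed above (knowing where copies of a dataset went so that a bug fix or a masking rule can be applied everywhere) can be illustrated with a small dependency graph. This is a minimal, hypothetical example and not a description of any tool mentioned in the episode; the `LineageGraph` class and the dataset names are assumptions made up for illustration.

```python
from collections import defaultdict, deque


class LineageGraph:
    """Toy lineage store: records which datasets were derived from which."""

    def __init__(self):
        # Maps a source dataset name to the set of datasets derived from it.
        self._children = defaultdict(set)

    def record_derivation(self, source, derived):
        """Record that `derived` was produced from `source`."""
        self._children[source].add(derived)

    def downstream(self, dataset):
        """Return every dataset that directly or transitively depends on `dataset`."""
        seen = set()
        queue = deque([dataset])
        while queue:
            current = queue.popleft()
            # .get avoids creating empty entries for leaf datasets.
            for child in self._children.get(current, ()):
                if child not in seen:
                    seen.add(child)
                    queue.append(child)
        return seen


# Hypothetical dataset names, for illustration only.
lineage = LineageGraph()
lineage.record_derivation("raw_events", "sessions")
lineage.record_derivation("sessions", "weekly_active_users")
lineage.record_derivation("raw_events", "marketing_export")
lineage.record_derivation("billing", "revenue_report")

# If a bug or an unmasked field is found in raw_events, these need attention.
affected = lineage.downstream("raw_events")
print(sorted(affected))
```

With a record like this, finding every copy that needs to be rebuilt, masked, or retracted becomes a graph traversal rather than a search through each team's silo. A real deployment would also need to capture the derivations automatically, which this sketch does not attempt.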
[00:14:01] Unknown:
Yeah. It's really interesting, building on top of what Charlie was saying. We've watched a number of companies go down the super cool path, which is they've said, look, we hate all these little fiefdoms that are holding tight on to their numbers or their datasets and literally won't share them with other people. You have large Fortune 500 companies where each BU has some data, but they don't want to share it with others. And we've seen really cool executive-level mandates that say, you know what? We're going to expose all of our data, all of our derived datasets, our published datasets, and if we have disagreements around how to calculate something or what the definition is, we're going to have that conversation, but we're going to create a higher level of transparency.
And the classic objection to that has always been, but now we're giving more people access to more data. Isn't that scary? And the response that I think is really helpful is, well, not if we have actually invested in that metadata layer and have enough intelligence to say there are some things you shouldn't have access to and some things you can't take to other places. If you can automate more of that, now you can safely enable this level of dialogue and collaboration across teams that you just otherwise couldn't have.
[00:15:17] Unknown:
Yeah. And governance, governance, governance. Right? And how do you do that without, once again, restricting innovation? I've been in many organizations, and I've seen this with many that I've worked with: governance is the thing you do at the very end. When you finally are done with everything you've been working on and you've worked through all the data, then you look at it and say, wow, am I in compliance? Did we do this the right way? That kind of thing. And shoehorning things into governance may get you there faster, but it slows you down when you start to really try to scale.
[00:15:49] Unknown:
And another thing that I think is worth calling out, to what you were saying, Sean, about the fear of giving access to data to all the different people in an organization, is that it is somewhat related to the fact that they might not necessarily have the appropriate background or understanding of how to interpret that data or how to use it for making effective decisions. And so I think in addition to the governance and metadata aspect, there's also the education component of making sure that everyone within the organization is able to actually gain value from the data that they have access to. And so in that scope as well, I'm wondering what you have both seen in terms of the types of roles or responsibilities that often are the drivers of these shadow IT projects, and some of the reasons that those are the types of roles that might be more likely to build out some new platform or new transformation on the data that they have access to, or maybe collect new sources of data from other systems that aren't already incorporated into the underlying platform that they have access to?
[00:17:00] Unknown:
Yeah. It's a really good question. What we see is usually a few things, and it's dependent oftentimes on how big the company is and where they are in that data journey, if you will. One of the more obvious ones we'll see coming from the engineering and product teams is the data engineering role, who is constructing a lot of data pipelines and trying to source new datasets. Oftentimes they're part of a data analytics or a data science team, and they're connecting essentially all these various data systems. They're trying to get access to new pieces of data. They're trying to work with new technologies to empower the broader group. And oftentimes they just can't get those capabilities or can't get those datasets onboarded as part of the standard corporate platform.
And we see them emerge a lot as some of the early drivers. We also, interestingly, see product managers and even data scientists themselves saying, look, I know we have all this really big infrastructure and we have these really cool capabilities in a central platform, but I'm not a Spark expert. I just want the power of Spark to run a big job, or I don't want to deal with all of these other complex technologies around it. I just want to point it at some data, do some really cool data logic, create something that's more automated, and move on. And so oftentimes they start to become the seed for some of these shadow IT efforts.
And it sort of starts to trigger some of this behavior of, hey, we can really move a lot faster as a result if we can properly free ourselves and get moving quicker.
[00:18:49] Unknown:
Yeah. We saw a lot of... you can kind of go from this sort of federated to centralized to federated thinking. And I've seen in several organizations where they had a very fractured, fragmented data structure, data silos, etcetera. And then they work very hard to say, look, we can come up with a consistent methodology for how we store the data, for where it's located. Maybe we have flexibility on the tools. Maybe we have flexibility on some of the compute layers, but we're going to only have a single data store or a single place to put that, and we're going to have a supported stack and this kind of thing. And that's all well and good, but shadow IT, in some cases, becomes the residual, the people that are really uncomfortable making the change. Right? So making the change from, I like command line Hadoop, and now we're gonna start using managed services like Glue in the cloud.
That's a whole different skill set. And so in some cases, people are sort of clinging to the old tech because that's what they know. And the switching costs for the engineers, the data engineers, is almost too high. And so they'll look to try to keep a fiefdom or a silo that continues to work the way they understand it.
[00:20:15] Unknown:
And another interesting aspect of this is that, as we mentioned before, the term shadow IT can have this negative connotation, and it can lead to people trying to hide their activities from the central IT or just from the organization at large so that they don't get called out on embarking on some, maybe, unapproved project or incorporating some technology that hasn't been vetted by the powers that be. And so I'm wondering, what are some of the ways that we can try and either eliminate that stigma so that people are more willing to be upfront about the fact that, hey, I tried this thing. It's having this useful outcome, and then being able to then incorporate that into the rest of the organization, or popularize it, or add a way for them to integrate the work that they've been doing into the data sources or data processing systems that are being used throughout the organization?
[00:21:13] Unknown:
I'll just make a very short statement on that. The service organization, whatever that central organization is, really needs to think of the users as customers. And if you can't provide them with something of value that allows them to innovate and scale, they're going to go somewhere else. Right? So you find a lot of organizations where the central unit, whether you call it IT or whether you call it the central analytics team, becomes more of a policing organization than an innovation organization. So how do you take that customer-first attitude and bring that to your community from that central location?
[00:21:57] Unknown:
Yeah. I would double down on that with Charlie. We see this happen pretty frequently, where that starts to be a behavior in an unhealthy organizational dynamic. And to put it really directly, it's a pretty terminal strategy, because at some point customers will go elsewhere, even if they're internal. We're in this stage right now where the central teams and the service provider teams have a lot of leverage because of concern around data privacy, governance, data leakage, and so on. But the encouragement I would generally provide is to not misuse and abuse that, because at some point markets, even the small internal markets, will correct themselves inside of the organization.
And so I think the way to think about this, especially for those who are testing the waters in shadow IT and trying new technologies and so on, one of the pieces of guidance we always provide is, make sure you don't paint yourself into a corner. For any technology that you are trying, is it still enterprise grade enough that it could actually be adopted by the broader organization? Does it have the right security and governance capabilities? Does it have the ability to integrate into your broader ecosystem?
In some way, right, you want to make sure that you're not trying to introduce a technology that fundamentally traps you, because that's a surefire way of getting a lot of resistance from IT. And this is certainly what we've found with a lot of our customers: as they experiment and explore with technologies, whether it's ours or others, they find some really cool use cases that prove out a lot of the value, but we still help them come back in and talk to the central teams and say, hey, look at all these other security safeguards, and how we can do this, as Charlie was describing, hub and spoke model of data sharing and data governance, where we each have our little pods of datasets but can publish back to central teams and have proper governance on this. That way they can actually become a really cool advocate for how to introduce new technologies back into the broader organization that everybody benefits from. I think if you think through it with that mindset, it's a really collaborative approach for how both the business units and the central teams can work together well.
[00:24:32] Unknown:
Yeah. There's an analogy that I've heard in a couple of different contexts of the pioneers, settlers, and town planners in terms of the life cycle of innovation from a technology perspective, where the people who are embarking on these different projects of bringing in different stacks and different tools are the pioneers, going out and exploring what's available. And then if they find something that's useful, the settlers on the team will be the ones who adapt it to say, okay, we've got this maybe bleeding edge tool or technology. How do we actually make use of it in a somewhat more stable fashion? And then, eventually, that gets handed off to the town planners, who are maybe the central service organization, who determine how to incorporate that into the rest of the organization and the rest of their technology stack. How do they make it scale and make it available?
[00:25:21] Unknown:
Yeah. I really like that. I like it too. You end up, though, in some cases where you've got a really nice town that's been built out, but there's somebody who's not going to the planning department and is building out a little hazardous waste dump somewhere. And at some point, somebody's gonna find that hazardous waste dump and they're gonna have to deal with it. Or you end up with Boston, where it's a nice city, but you can't find your way anywhere. Or you can't afford to live anywhere there. Right? So yes. Yeah.
[00:25:53] Unknown:
I think there's something really cool about that approach, though, which is, okay, it may be problematic, but if you can actually keep it well contained, then, to continue your analogy, Tobias, it's kind of out on the frontier as opposed to actually in the town. And a good process of collaboration between central and the business teams is to be really careful what you allow to be introduced back in. You don't want to introduce technologies that could infect the broader town architecture, if you will, while still giving those teams some degree of freedom and flexibility.
[00:26:31] Unknown:
And so one of the interesting things to explore as well is that there are these tensions that exist between the priorities of the different groups within the business and the different projects that get spawned as a result. But once you have identified, or somebody has introduced, some new tool and presented it to the rest of the organization, what are some of the useful strategies for removing the friction that exists in the organization that causes them to go out and build those new tools in the first place, or maybe try to hide them? And how do you incorporate those new platforms into the organization and make it easy to integrate or extend the services that are available, so that you maybe use a different compute framework, but you're not trying to reinvent the definition of a particular metric, and you're able to rely on some of the master data management or compliance and governance strategies that exist without them being too rigid?
[00:27:33] Unknown:
Yeah. You know, I think, playing off of Charlie's comment earlier, if you're a central enablement team, really at the end of the day your customers are the other lines of business. And there are really great lessons to be learned from software vendors and enterprise SaaS vendors in general, which is that it's not just about creating a technology and then writing a white paper on it and emailing it out to the rest of the engineering org. It's really about how we actually help onboard different teams. And so we've seen, and even helped our customers do, a variety of different things, from how do you create centralized teams of excellence but drive distributed innovation.
So create a team that is literally the SWAT team that orbits and goes from org to org to org and essentially drops in and says, what is your biggest, hardest, most painful, in our world, data pipeline, and how do we help you migrate this to a really cool new tech stack, fast and efficiently, and we'll train you in the process. So there are approaches like that, which are great for the central team too, because you get to drop into a lot of really cool businesses, see the end use cases you may not otherwise get to see, and help them build stuff out fast, which is super fun. The other thing we've seen is companies literally do the big once-a-quarter or once-a-year hackathons, and they're weaving in themes. And so that becomes a way for them to say, hey, we have just brought in three new really powerful technologies.
We're gonna focus the hackathon on use of these new technologies, and we're gonna bring in those vendors or our centralized SWAT teams, and they're gonna be these orbiting teams to really help everybody massively ramp up within 24 hours on how to build something really cool and magical. And both of these strategies we've seen be really effective.
[00:29:22] Unknown:
And then shifting gears a bit, we mentioned at the outset that some of the reason that shadow IT projects, particularly in the data and analytics space, are starting to become a bit more prevalent is because of the availability of these different cloud tools or, you know, one-click provisioned applications or easy-to-use databases. So I'm wondering what types of tools or platforms in particular are well suited for being provisioned by people who don't necessarily work in a primarily engineering role, or for people who are not necessarily looking for an end-to-end integrated solution and just want something that they can start using in conjunction with existing tools. And what are some of the potential pitfalls that exist as a result of these tools being so easy to use, and maybe the people who are initially setting them up not having the context or training necessary to foresee some of those potential problems?
[00:30:22] Unknown:
Well, I'll start with what we found, which was that in many cases you have a set of master data. So people have access to business systems data or some set of product usage data. And then it's very easy for people to go out and get analytics tools. Right? Go out and get your Power BI. Go out and get your Tableau. Go out and download some of that data into Excel. And then they start to do really great work. They think about it, and they generate a lot of really cool metrics, great reports, etcetera. But once again, because those tools are easy to get to and very well marketed into organizations, you start getting lots and lots of little silos. Everybody's desktop becomes a silo. And so that in itself, just that thin BI layer on top, for example, because of the ease of access to the data and the ease of sharing it via Excel files being moved around, those things become huge governance headaches and also make it really difficult to do any provenance and understanding of the value of the actual information that's coming out of that effort.
[00:31:31] Unknown:
Yeah. I'd agree. And I think, you know, even building on top of that, as Charlie's describing, a lot of SaaS vendors and BI vendors do a really good job of nailing that: how do you make it pay as you go, how do you make things that are super simple to connect into your existing ecosystem, and so on, which is great to start. You know, I think part of the balance is finding those tools and technologies that, of course, have that easy capability to get up and going, but, similar to what I was highlighting before, are still enterprise grade enough that they're extensible, can be hooked into the rest of your ecosystem, and are sufficient for far more advanced use cases.
And I think that's the nuance of figuring out where you find those and how well they work, so you don't end up with this massive proliferation of silos. And I think SaaS at least does a pretty good job of helping to solve parts of that problem, because it's much easier to unify access and take what started as just a couple of use cases and expand it to a bigger organization without everybody sitting in their own little pockets.
[00:32:51] Unknown:
Yeah. You brought up earlier too, Tobias, the education piece. Right? And so a really good support team, a very customer-focused central organization, well-supported BI tools and data engineering tools and data science tools. If you can get that ecosystem and that support network built out, and you can help the people within that organization feel like they are the champions and that they have some flexibility in that, then you can start to drive a lot of those standards. Right? A lot of the shadow IT, like I said earlier, is trust and lack of communication.
And so how do you deliver not just a tool that is easy to use, but a support team and an organization that helps people feel like they're always two steps ahead, not that they're being told no at every turn?
[00:33:49] Unknown:
And so that communication and support piece, do you think it's just a matter of saying, hey, we're here and we're available, ask us questions? Is it a matter of having substantial documentation for people at different levels to be able to access and follow? Is it a matter of publishing the availability of different datasets and making them discoverable using maybe something like the Amundsen tool from Lyft? Are there any other elements of that sort of support strategy that you have seen as being effective and productive?
[00:34:20] Unknown:
The big thing is you need top-down support too. If you don't have top-down support for driving some of those standards, then that behavior will never change, no matter how well you do in that. So you get top-down support. You have leadership at the next tier down that is held accountable for that. Then you can start to drive that. There are lots of different tools. Data discovery is one of the hardest. You know, there are a lot of tools out there. I don't even know what the newest ones are right now. But data cataloging, finding what is pertinent, what is the most valuable data, how do we easily do that in a big organization that's even got standards, right? That's a huge thing. There's a lot of documentation. There's a lot of learning. There are great tools for doing data discovery, but you really, really need the top-down support for driving that consistency, and clear reasons and goals for why we're doing this. We're doing this because we wanna be a compliant company. We're doing this because we are stewards of our customers' data.
We're doing this because we feel like we can drive innovation and scale better as a company. We're doing this so we don't have redundant people doing redundant jobs all over the place. There are a lot of reasons why you drive people towards that. And if you're not clear on why, people are not really gonna understand.
[00:35:53] Unknown:
Yeah. I totally double down on that. And, honestly, top-down support is really where you need the executive or at least very senior level sponsorship, because a lot of the technical decisions will be made much deeper in the organization. But at a more senior level, what you're going to get is the amplification of this importance, and those are also the sponsors you can get internally to then let you do these really cool things like create these rotational teams to go spend time with the internal customers, or do these hackathons. These are the things that allow you to make these outsized impact moves in really short time frames that, as you decide on some of these new strategies, frankly help them see the light of day at much larger scale much faster, which everybody's super interested in doing these days.
[00:36:46] Unknown:
And are there any other just inherent complexities in the overall aspect of data management and the available technologies that you think we need to see resolved and addressed more effectively in order to reduce the tensions that exist between the organizations and the different business units that lead to these bespoke solutions?
[00:37:06] Unknown:
You know, my take is, and Tobias, we've talked about this a little bit before, I'm a really big believer in moving a lot of the technology systems from imperative to declarative. And the primary reason is, you know, we have pockets of technology in the data space right now. We have storage engines. We have processing engines. We have data catalogs and data warehouses and governance systems, but we don't have systems that actually connect all of these together. And so, when we talked earlier about the importance of metadata, most of the metadata management in most companies is being done pretty manually today.
And even if I have automated systems, like I'm doing, for example, automated data replication or pipeline orchestration and so on, I'm usually automating something that says do a job at a particular point in time, but we don't really have a tight linking between this piece of code ran on this piece of data that produced this other piece of data, and here's why it did it. And as a result, because we don't have this tightly bound notion that's highly automated at the metadata layer, we push all of this burden onto people, and that's why we do it more manually. Very few companies, if any at all these days, can actually say whether the data in their warehouse, their database, their lake is actually reflective of the code in their repository.
It was probably reflective of some version of your code at some point that ran, and some version of the data, but you actually don't know how and why. You'd have to throw an engineer at the problem, like, go audit all of this stuff and figure out why. And so as a result, I think this is one of the things that's just missing today: these smarter systems that are looking more holistically across the data and code landscape and can actually track and trace. Because in doing so, we can actually automate more and get more of that burden out of the hands of engineers who are trying to do it manually, which accelerates those cycles and makes it easier for people to move a little bit more freely.
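The link Sean describes here, between a piece of code, the data it ran on, and the data it produced, can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not how Ascend or any particular product works: `LineageRecord`, `run_step`, and `is_current` are hypothetical names, and hashing the `repr()` of in-memory rows stands in for real dataset versioning.

```python
import hashlib
from dataclasses import dataclass


def fingerprint(payload: bytes) -> str:
    """Return a short content hash identifying one version of code or data."""
    return hashlib.sha256(payload).hexdigest()[:12]


@dataclass(frozen=True)
class LineageRecord:
    """Hypothetical record: which code ran on which input to produce which output."""
    code_version: str
    input_version: str
    output_version: str


def run_step(transform_source, transform, rows):
    """Run a transform and capture code and data fingerprints alongside the result."""
    result = transform(rows)
    record = LineageRecord(
        code_version=fingerprint(transform_source.encode()),
        input_version=fingerprint(repr(rows).encode()),
        output_version=fingerprint(repr(result).encode()),
    )
    return result, record


def is_current(record, transform_source, rows):
    """True only if the stored output still reflects the current code and input."""
    return (
        record.code_version == fingerprint(transform_source.encode())
        and record.input_version == fingerprint(repr(rows).encode())
    )


if __name__ == "__main__":
    # The source text is passed in explicitly; a real system might read it
    # from a repository commit instead (an assumption of this sketch).
    source = "double every value"
    rows = [1, 2, 3]
    output, record = run_step(source, lambda rs: [r * 2 for r in rs], rows)
    print(output, is_current(record, source, rows))
    # After the code changes, the stored output no longer reflects it.
    print(is_current(record, "triple every value", rows))
```

With a record like this stored next to each dataset, the question of whether the data in the warehouse reflects the code in the repository becomes a lookup instead of a manual audit.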
[00:39:16] Unknown:
Yeah. I mean, I'm gonna double down on what you're saying there. And I know, Sean, you and I have had this conversation many times. But visibility into all of the work that is happening, all of the loads, all of the processing, the cost of the processing, which datasets are being used by hundreds of pipelines, which datasets we don't need anymore, streaming information, what data is scheduled to be retired and deleted. You know, for a nontechnical person who may be making some of these decisions, like product leaders working with their legal counterparts, it's really hard to get that view. So there are a lot of good tools out there that are starting to sniff and consume logs and place, you know, points and information along various parts of the pathway, but it is not a solved problem yet.
[00:40:21] Unknown:
Are there any other aspects of this topic of shadow IT, some of the motivating factors and possible solutions, the reasons for tension and ways to try to overcome it, that we haven't discussed yet that you think we should cover before we close out the show?
[00:40:38] Unknown:
I think it's kinda going back to core first principles. And, you know, as I mentioned a little bit before, DevOps is ultimately, at a very high level, all about how we enable more people to do more with software faster and safely. Fighting against anything like that is like fighting against gravity. And the same thing happens when it comes to data. Even with a lot of the conversations around DataOps today, it's still forming, there are a lot of opinions, and people are trying to push it in a bunch of different directions. But really, at the end of the day, it again comes down to how we enable more people to do more things with data faster and safely. And it's gonna be the same sort of core business drivers.
And I think for a lot of teams, the specific nuances of how you accomplish that inside of your organization may be very contextual to your world. But trying to actually reroute that momentum is really hard, because it is very much like fighting against gravity. So it comes down to figuring out how you can enable that. We do the same sort of exercise in how we build here at Ascend, which is, gosh, if you think it through: we want to enable this, and that would make the organization much better. There's always another hard technical problem at the other end of that to solve, so we go solve that problem, knowing that if we can solve it, that will enable us to be a much more efficient and effective organization.
And so we just adhere to those core principles and keep trying to knock down the next challenge to really help the team go faster, and faster safely.
[00:42:26] Unknown:
Yeah. And I'll come back to it. I think that in many cases the hardest problem is not a technical problem. It's a people problem. So you've gotta figure out how to motivate. You've gotta figure out how to get them to put the customer first. You've gotta help people understand why there's value in working together, and you've gotta support them.
[00:42:46] Unknown:
Well, for anybody who wants to follow along with the work that you're both doing or get in touch, I'll have you each add your preferred contact information to the show notes. And, we addressed this a little bit at the end here, but if you've got anything to add on your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today, I'm happy to hear it.
[00:43:07] Unknown:
I'll repeat what Sean said around visibility into the overall pipelines. There's still not the perfect tool for understanding what data is the right data to use, where it came from, whether you can trust it, and being able to get a view into that overall picture. And I'm a visual guy, so I'm not talking about, you know, code and databases. I'm talking about a way to really see this so that you can have an intuitive understanding of what's working and not working.
[00:43:43] Unknown:
Yeah. And I'd say, clearly, at Ascend we're really big fans of highly automated systems that track tons of metadata. It's what we do for data orchestration and autonomous pipelines. We also have a pretty UI, Charlie, so I know you like our UI. But, you know, as data engineers, gosh, we've been throwing our weight and intellectual horsepower at solving so many other problems for so long. We have self-driving cars. We have incredible machine learning algorithms. We can store more data and process more data and move more data faster than ever before. And so my big take, and Ascend's big take, here is that we can now apply that same intellectual horsepower to how we make it easier and faster and more automated to build data pipelines and maintain them and manage them at scale, and that in doing so helps us solve a lot of these other problems. I think that's much more a question of how we get highly intelligent data orchestration technologies out there today. And so I think that's the next big frontier for data engineering.
[00:44:49] Unknown:
Alright. Well, thank you both for taking the time today to join me and explore the space of shadow IT in data and analytics. It's definitely something that I'm sure a number of people have either engaged with or had to deal with at some level. So it's definitely an interesting topic and one that's valuable to address. So thank you both for your time and efforts on that. And I hope you enjoy the rest of your day.
[00:45:15] Unknown:
Alright. Thanks, Tobias. Thanks so much for having us.
[00:45:22] Unknown:
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction to Shadow IT in Data and Analytics
Defining Shadow IT
Drivers and Tensions in Shadow IT
Challenges and Compliance in Data Management
Roles and Responsibilities in Shadow IT
Eliminating Stigma and Encouraging Collaboration
Strategies for Integrating New Tools
Pitfalls of Easy-to-Use Tools
Support Strategies for Data Management
Complexities in Data Management
Core Principles and Future Directions