Shining A Light on Shadow IT In Data And Analytics

Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode.

With 200 gigabit private networking, scalable shared block storage, a 40 gigabit public network, fast object storage, and a brand new managed Kubernetes platform,

you've got everything you need to run a fast, reliable, and bulletproof data platform. And for your machine learning workloads, they've got dedicated CPU and GPU instances.

Go to dataengineeringpodcast.com/linode

today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show.

Are you spending too much time maintaining your data pipeline?

Snowplow empowers your business with a real time event data pipeline running in your own cloud account without the hassle of maintenance.

Snowplow takes care of everything from installing your pipeline in a couple of hours to upgrading and auto scaling so you can focus on your exciting data projects.

Your team will get the most complete, accurate, and ready to use behavioral web and mobile data delivered into your data warehouse, data lake, and real time data streams.

Go to dataengineeringpodcast.com/snowplow

today to find out why more than 600,000 websites run Snowplow.

Set up a demo and mention you're a listener for a special offer.

And you listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management.

For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season.

We have partnered with organizations such as O'Reilly Media, Corinium Global Intelligence, ODSC, and Data Council.

Upcoming events include Strata Data in San Jose and PyCon US in Pittsburgh.

Go to dataengineeringpodcast.com/conferences

to learn more about these and other events and to take advantage of our partner discounts to save money when you register today.

Your host is Tobias Macey, and today I'm interviewing Sean Knapp and Charlie Crocker about shadow IT in the data and analytics space. So, Sean, can you start by introducing yourself?

Yeah. Absolutely. I'm Sean Knapp. I am the founder and CEO of Ascend.io.

And Charlie, can you introduce yourself as well?

Yes. My name is Charlie Crocker, and I am the founder and the CEO of Zonehaven.

And going back to you, Sean, do you remember how you first got involved in the area of data management?

I do. It was 15 years ago now. I had just joined Google as a front end software engineer on web search, and we were obviously known for doing a lot of experimentation on the user experience, from the background color of ads to the layout of the page. And I quickly found that even the smallest experiment, where you move around just a few pixels on the front end, oftentimes brought with it hours and hours of data analysis, writing MapReduce jobs in the internal language there called Sawzall to analyze the usage of hundreds of millions, if not billions, of users. So really quickly and early on in my career, fresh out of college, I found myself doing a lot of pretty complex big data jobs just to answer questions around consumer experience on web search.

And, Charlie, do you remember how you first got involved in data management?

I was an environmental consultant many, many years ago, 25 years or so ago. In the organizations that we worked in, everybody was using paper and sometimes, if we were lucky, spreadsheets. And so we were asked to make charts, graphs, maps, all sorts of information. And as a junior staff member, you were expected to just sit in your cube and draw these things out. So I found a way to work directly with the analytics labs, get them to start sending us the data in an electronic format, and I brought this interesting concept into the environmental consulting world called a database. And that led to building out databases and maps, and then moving into larger and larger scale datasets as my career progressed.

And so as I mentioned at the open, we're talking about the idea of shadow IT and how it manifests in the space of data management and data analytics. So before we get too far along, can each of you start by sharing your definition and how you think about the term shadow IT?

Sure. Happy to go first. When I think about the notion of shadow IT, it's really what we see happening from consumers and customers of IT. Oftentimes they're looking to test out, experiment with, or try new technologies to meet some of their needs. The mandate behind IT is usually to look for technologies and capabilities that are of use to the broader organization. And oftentimes the different customers of IT are looking to go faster, get more experimental, or simply self-serve and offload that burden from IT. And so you'll start to see those business units take on more of that responsibility on their own and bring more of it in house, if you will, into their own organization.

Yeah. And I think the term shadow IT is interesting. It kind of has, I think, a little bit of a negative connotation. I really see shadow IT as the environment where people choose to work in silos, or sometimes in business units. This has really accelerated, I think, in the last 5 to 10 years with the advent of all of the cloud service providers, because it's easier and easier for people to do their own IT without having a full IT organization. So you're starting to see a lot of that experimentation and a lot of that acceleration without necessarily having a central organization to drive standards, etcetera.

One of the things that you were both pointing out is this pressure to be able to deliver, and the availability of some of these different services that are easier to access and easier to provision without necessarily having a strategy. And so I'm wondering, what are some of the main drivers of that tension, and of those types of projects that lean on these different self-provisioned services and are maybe not driven by the engineering or IT organizations within a business?

To me, in most cases, it's about acceleration, or it's about fiefdoms. It's about trust. So you see, in organizations that have a lot of M&A, people coming in with their own way of doing business and finding it hard to adapt and move into an existing organization. You find existing organizations where there's, in some cases, a lack of trust between the engineers that build the products and the IT organization that may be tasked with managing the services and the contracts with the big vendors. And then you've got the conflict between the goal of getting a new product out really fast and leveraging standards and using standard operating procedures.

Yeah. I would double down on what Charlie's saying there, which is that the drivers are conflicting goals. Oftentimes IT exists to help centralize, standardize, and get economies of scale and leverage, which by design means they should move slower, more thoughtfully, and more carefully. At the same time, that is oftentimes in conflict with the needs of the business, which is to move very quickly, and at times even to perhaps break some glass in the pursuit of moving quickly in response to business demands. And I think the reason this exists, if we pop back up even higher, is that we're seeing across industries that the wave of digital transformation is pushing businesses to move faster, and they have to respond more quickly. For example, in software development, the introduction of DevOps was, in my view, all about how we enable more people to build more software applications faster yet more safely. And we do that through a variety of different constructs. Yet that same level of agility that we're now getting in software, which many companies have appreciated for a while, we haven't yet received in the data world. We're still doing much more waterfall-style, traditional approaches. And as a result, we see those pressures, and that's what's causing and igniting a lot of this behavior.

And there's a little bit of danger in that behavior too with the data. With the software piece, I'm not as deeply engaged with that. But with the data piece, you end up with siloed datasets, siloed data pipelines, repeated data pipelines. You increase expense, but you also have whole different privacy and management schemes that need to be dealt with. So with data, it gets really hard, in a large organization with a lot of different silos and a lot of different processes, to really even understand your compliance regime, for example.

Totally agree. Totally agree.

One of the other things that's interesting about that difference between software and data projects is the level of impact that can be had throughout the business when a project is delivered in an accelerated fashion, as well as some of the issues around things like compliance or data quality that arise and are somewhat unique to this analytics space. And I'm wondering what you have seen as some of the challenges posed and some of the driving forces towards building these projects that are maybe outside of the supported platforms within an organization?

Yeah. I think it's interesting when we think about how we were trying to centralize and standardize in many ways. I would make a potentially inflammatory comment, which is that I'm not sure it really matters as much these days if we standardize on, say, whether you're going to process your data with Hadoop or Spark. I'm not really sure that those levels of standardization matter as much as the question of whether we have standardization at all. And this is where I see a lot of IT needs going today, which is: do we standardize on how we articulate what data exists, how we unify how we know that it got there, why it got there, and where it goes? And so it's much more the notion of governance, lineage, tracking, a lot more at the metadata layer, because, as I think Charlie is highlighting, that's the stuff that gets scary. It's like, do you know if your stuff is actually legit and valid? And if it is broken or it does have a bug...

And legal, and even, you know...

And legal. Yes. In the software world, you would make API calls, right? And so if somebody fixed a bug, the next time you made that API call, you'd probably get the right answer. Yet in the data world, you're making copies of data and you're moving it all over. So you may have produced a new dataset from a buggy piece of code, but how do you know that it even went to the right place? Or, as Charlie highlights, should that data have gone there? And if it went somewhere it wasn't supposed to, do you at least know that it went there, and do you know how to retract it, and so on? So that is more of the problem, I would argue: a metadata layer these days, just knowing what is happening with your data.

As you mentioned, the tracking of the data is definitely one of the key problems that exists in this particular realm. Because as you said, with software, if there's a bug, you fix it, and it gets redeployed. But with data, if it gets copied to 5, 10, 15, a thousand different places, and then you realize, oh, there was one different way that we were tracking it, or there's a particular field that needs to be masked, how do you then go and apply that transformation or apply that constraint on all of those copies?

One of the big issues is which metric is the right metric. I mean, 10 people can run the same pipeline and call the outcome the same number, and you could have 10 different numbers. Right? And so, at least early in the transition for a lot of companies, he who owns the metric owns the story. And so every individual would want to come into a C-staff meeting with their own set of metrics, for example. So how do you, from the top down, start saying, look, how do we drive standardization without squelching innovation? And so the stuff that Sean's talking about around metadata, around being able to have visibility into the pipelines, being able to rank and canonize certain datasets and certain metrics, those are the key things that allow success in a data product or in a data pipeline within an organization.

Yeah. It's really interesting. Building on top of what Charlie was saying, we've watched a number of companies go down the super cool path where they've said, look, we hate all these little fiefdoms that are holding tight onto their numbers and literally won't share them, or their datasets, with other people. You have large Fortune 500 companies where each BU has some data, but they don't want to share it with others. And we've seen really cool executive-level mandates that say, you know what? We're going to expose all of our data, all of our derived datasets, our published datasets, and if we have disagreements around how to calculate something or what the definition is, we're going to have that conversation, but we're going to create a higher level of transparency. And the classic objection to that has always been: but now we're giving more people access to more data. Isn't that scary? And the response that I think is really helpful is: well, not if we have actually invested in that metadata layer and have enough intelligence to say there are some things you shouldn't have access to and there are some things you can't take to other places. If you can automate more of that, now you can safely enable this level of dialogue and collaboration across teams that you just otherwise couldn't have.

Yeah. And governance, governance, governance. Right? And how do you do that without, once again, restricting innovation? I've been in many organizations, and I've seen this with many that I've worked with: governance is the thing you do at the very end. When you're finally done with everything you've been working on and you've worked through all the data, then you look at it and say, wow, am I in compliance? Did we do this the right way? That kind of thing. And shoehorning things into governance may get you there faster, but it slows you down when you start to really try to scale.

And another thing that I think is worth calling out, to what you were saying, Sean, is that the fear of giving access to data to all the different people in an organization is somewhat related to the fact that they might not necessarily have the appropriate background or understanding of how to interpret that data or how to use it for making effective decisions. And so I think in addition to the governance and metadata aspect, there's also the education component of making sure that everyone within the organization is able to actually gain value from the data that they have access to. And so in that scope as well, I'm wondering what you have both seen in terms of the types of roles or responsibilities that often are the drivers of these shadow IT projects, and some of the reasons that those are the types of roles that might be more likely to build out some new platform or new transformation on the data that they have access to, or maybe collect new sources of data from other systems that aren't already incorporated into the underlying platform that they have access to?

Yeah. It's a really good question. What we see is usually a few-fold, and it's dependent oftentimes on how big the company is and where they are in that data journey, if you will. One of the more obvious ones we'll see, stemming from the engineering and product teams, is the data engineering role: people who are constructing a lot of data pipelines and trying to source new datasets. Oftentimes they're part of a data analytics or a data science team, and they're connecting essentially all these various data systems. They're trying to get access to new pieces of data. They're trying to work with new technologies to empower the broader group. And oftentimes they just can't get those capabilities or those datasets onboarded as part of the standard corporate platform. And we see them emerge a lot as some of the early drivers. We also, interestingly, see product managers and even data scientists themselves saying, look, I know we have all this really big infrastructure and we have these really cool capabilities in a central platform, but I'm not a Spark expert. I just want the power of Spark to run a big job, and I don't want to deal with all of these other complex technologies around it. I just want to point it at some data, do some really cool data logic, create something that's more automated, and move on. And so oftentimes they start to become the seed for some of these shadow IT efforts, and it starts to trigger some of this behavior of, hey, we can really move a lot faster as a result if we can properly free ourselves and get moving quicker.

Yeah. You can kind of go from this sort of federated to centralized to federated thinking. I've seen several organizations where they had a very fractured, fragmented data structure, data silos, etcetera. And then they worked very hard to say, look, we can come up with a consistent methodology for how we store the data, for where it's located. Maybe we have flexibility on the tools. Maybe we have flexibility on some of the compute layers, but we're going to only have a single data store or a single place to put that, and we're going to have a supported stack and this kind of thing. And that's all well and good, but shadow IT, in some cases, becomes the residual: the people that are really uncomfortable making the change. Right? So making the change from, I like command line Hadoop, to, now we're gonna start using managed services like Glue in the cloud. That's a whole different skill set. And so in some cases, people are sort of clinging to the old tech because that's what they know. And the switching cost for the engineers, the data engineers, is almost too high. And so they'll look to try to keep a fiefdom or a silo that continues to work the way they understand it.

And another interesting aspect of this is that, as we mentioned before, the term shadow IT can have this negative connotation, and it can lead to people trying to hide their activities from central IT, or just from the organization at large, so that they don't get called out for embarking on some maybe unapproved project or incorporating some technology that hasn't been vetted by the powers that be. And so I'm wondering, what are some of the ways that we can try to eliminate that stigma so that people are more willing to be upfront about the fact that, hey, I tried this thing, it's having this useful outcome, and then be able to incorporate that into the rest of the organization, or popularize it, or add a way for them to integrate the work that they've been doing into the data sources or data processing systems that are being used throughout the organization?

I have just a very short statement on that. The service organization, whatever that central organization is, needs to really think of the users as customers. And if you can't provide them with something of value that allows them to innovate and scale, they're going to go somewhere else. Right? So you find a lot of organizations where the central unit, whether you call it IT or whether you call it the central analytics team, becomes more of a policing organization than an innovation organization. So how do you take that customer-first attitude and bring that to your community from that central location?

Yeah. I would double down on that with Charlie. We see this happen pretty frequently, where that starts to be a behavior in an unhealthy organizational dynamic. And, to put it really directly, it's a pretty terminal strategy, because at some point customers will go elsewhere, even if they're internal. We're in this stage right now where the central teams and the service provider teams have a lot of leverage because of concern around data privacy and governance and data leakage and so on. But the encouragement I would generally provide is to not misuse and abuse that, because at some point markets, even the small internal markets, will correct themselves inside of the organization.

And so, in terms of the way to think about this, especially for those who are testing the waters in shadow IT and trying new technologies and so on, one of the pieces of guidance we always provide is: make sure you don't paint yourself into a corner. For any technology that you are trying, is it still enterprise grade enough that it could actually be adopted by the broader organization? Does it have the right security and governance capabilities? Does it have the ability to integrate into your broader ecosystem in some way? You want to make sure that you're not trying to introduce a technology that fundamentally traps you, because that's a surefire way of getting a lot of resistance from IT. And this is certainly what we've found with a lot of our customers as they experiment and explore with technologies, whether it's ours or others: finding some really cool use cases that prove out a lot of the value, but then still helping them come back in and talk to the central teams and say, hey, look at all these other security safeguards, and how we can do this, as Charlie was describing, hub and spoke model of data sharing and data governance, where we each have our little pods of datasets but we can publish back to central teams and have proper governance on this, so that they can actually become a really cool advocate for how to introduce new technologies back into the broader organization that everybody benefits from. I think if you think through it with that mindset, it's a really collaborative approach for how both the business units and the central teams can work together well.

Yeah. There's an analogy that I've heard in a couple of different contexts, of the pioneers, settlers, and town planners, in terms of the life cycle of innovation from a technology perspective. The people who are embarking on these different projects of bringing in different stacks and different tools are the pioneers: they're going out and exploring what's available. And then if they find something that's useful, the settlers on the team will be the ones who adapt it and say, okay, we've got this maybe bleeding edge tool or technology. How do we actually make use of it in a somewhat more stable fashion? And then, eventually, that gets handed off to the town planners, who are maybe the central service organization, who determine how to incorporate that into the rest of the organization and the rest of their technology stack. How do they make it scale and make it available?

Yeah. I really like that. That's a great...

I like it too. You end up, though, in some cases where you've got a really nice town that's been built out, but there's somebody who's not going to the planning department and is building out a little hazardous waste dump somewhere. And at some point, somebody's gonna find that hazardous waste dump, and they're gonna have to deal with it. Or you end up with Boston, where it's a nice city, but you can't find your way anywhere.

Or you can't afford to live anywhere there. Right? So yes. Yeah.

I think there's something really cool about that approach, though, which is, okay, it may be problematic, but if you can actually keep it well contained, to continue your analogy, Tobias, it's kind of out on the frontier as opposed to actually in the town. And a good process of collaboration between central and the business teams is to be really careful about what you allow to be introduced back in. You don't want to introduce technologies that could infect the broader town architecture, if you will, while still giving those teams some degree of freedom and flexibility.

And so one of the interesting things to explore as well is that there are these tensions that exist between the priorities of the different groups within the business and the different projects that get spawned as a result. But once somebody has introduced some new tool and presented it to the rest of the organization, what are some of the useful strategies for removing the friction in the organization that causes them to go out and build those new tools in the first place, or maybe try to hide them? And how do you incorporate those new platforms into the organization and make it easy to integrate or extend the services that are available, so that you maybe use a different compute framework, but you're not trying to reinvent the definition of a particular metric, and you're able to rely on some of the master data management or compliance and governance strategies that exist without them being too rigid?

Yeah. Playing off of Charlie's comment earlier: if you're a central enablement team, really, at the end of the day, your customers are the other lines of business. And there are really great lessons to be learned from software vendors and enterprise SaaS vendors in general, which is that it's not just about creating a technology and then writing a white paper on it and emailing it out to the rest of the engineering org. It's really about how we actually help onboard different teams. And so we've seen, and even helped our customers do, a variety of different things, such as creating centralized teams of excellence that drive distributed innovation. So create a team that is literally the SWAT team that orbits and goes from org to org to org and essentially drops in and says, what is your biggest, hardest, most painful (in our world) data pipeline, and how do we help you migrate this to a really cool new tech stack, fast and efficiently? And we'll train you in the process. Approaches like that are great for the central team too, because you get to drop into a lot of really cool businesses, see the end use cases you may not otherwise get to see, and help them build stuff out fast, which is super fun.

The other thing we've seen is companies literally doing the big once-a-quarter or once-a-year hackathons, and they're weaving in themes. That becomes a way for them to say, hey, we've just brought in 3 new really powerful technologies. We're gonna focus the hackathon on the use of these new technologies, and we're gonna bring in those vendors or our centralized SWAT teams, and they're gonna be these orbiting teams that really help everybody massively ramp up within 24 hours on how to build something really cool and magical. And both of these strategies we've seen be really effective.

And then shifting gears a bit, we mentioned at the outset that some of the reason that shadow IT projects, particularly in the data and analytics space, are starting to become a bit more prevalent is the availability of these different cloud tools, or one-click provisioned applications, or easy to use databases. So I'm wondering what types of tools or platforms in particular are well suited for being provisioned by people who don't necessarily work in a primarily engineering role, or for people who are not necessarily looking for an end-to-end integrated solution and just want something that they can start using in conjunction with existing tools, and some of the potential pitfalls that exist as a result of these tools being so easy to use, and maybe the people who are initially setting them up not having the context or training necessary to be able to foresee some of those potential problems?

Well, I'll start. What we found was that, in many cases, you have a set of master data. So people have access to business systems data or some set of product usage data. And then it's very easy for people to go out and get analytics tools. Right? Go out and get your Power BI. Go out and get your Tableau. Go out and download some of that data into Excel. And then they start to do really great work. They think about it, and they generate a lot of really cool metrics, great reports, etcetera. But once again, because those tools are easy to get to and very well marketed into organizations, you start getting lots and lots of little silos. Everybody's desktop becomes a silo. And so just that thin BI layer on top, for example, because of the ease of access to the data and the ease of sharing it via Excel files being moved around, becomes a huge governance headache and also makes it really difficult to do any provenance and to understand the value of the actual information that's coming out of that effort.

Yeah. I'd agree. And I think, even building on top of that, as Charlie's describing, a lot of SaaS vendors and BI vendors do a really good job of nailing that: how do you make it pay as you go, how do you make things that are super simple to connect into your existing ecosystem, and so on, which is great to start. I think part of the balance is finding those tools and technologies that, of course, have that easy capability to get up and going but, similar to what I was highlighting before, are still enterprise grade enough that they're extensible, can be hooked into the rest of your ecosystem, and are sufficient for far more advanced use cases. And I think that's the nuance: figuring out where you find those and how well they work, so you don't end up with this massive proliferation of silos. And I think at least SaaS does a pretty good job of helping to solve parts of that problem, because it's much easier to unify access and take what started as just a couple of use cases and expand it to a bigger organization without everybody sitting in their own little pockets.

Yeah. You brought up earlier too, Tobias, the education piece. Right? And so a really good support team, a very customer-focused central organization, well-supported BI tools and data engineering tools and data science tools: if you can get that ecosystem and that support network built out, and you can help the people within that organization feel like they are the champions and that they have some flexibility in that, then you can start to drive a lot of those standards. Right? A lot of the shadow IT is, like I said earlier, about trust and lack of communication.

And so how do you deliver not just a tool that is easy to use, but a support team in an organization that helps people feel like they're always two steps ahead, not that they're being told no at every turn?

And so for that communication and support piece, do you think it's just a matter of saying, hey, we're here and we're available, ask us questions? Is it a matter of having substantial documentation for people at different levels to be able to access and follow? Is it a matter of publishing the availability of different datasets and making them discoverable using maybe something like the Amundsen tool from Lyft? Are there any other elements of that sort of support strategy that you have seen as being effective and productive?

The big thing is you need top-down support too. If you don't have top-down support for driving some of those standards, then that behavior will never change, no matter how well you do in that. So you get top-down support, you have leadership at the next tier down that is held accountable for that, and you can start to drive that. There are lots of different tools. Data discovery is one of the hardest. There are a lot of tools out there; I don't even know what the newest ones are right now. But data cataloging, finding what is pertinent, what is the most valuable data: how do we easily do that in a big organization that's even got standards? Right? So that's a huge thing. There's a lot of documentation, there's a lot of learning, and there are great tools for doing data discovery,

but you really, really need the top-down support for driving that consistency, and clear reasons and goals for why we're doing this. We're doing this because we wanna be a compliant company. We're doing this because we are stewards of our customers' data. We're doing this because we feel like we can drive innovation and scale better as a company. We're doing this so we don't have redundant people doing redundant jobs all over the place. There are a lot of reasons why you drive people towards that. And if you're not clear on why, people are not really gonna understand.

Yeah. I totally double down

on that. And, honestly, the top-down support is really where you need the executive or at least very senior-level sponsorship. A lot of the technical decisions will be made much deeper in the organization, but at a more senior level, what you're going to get is the amplification of this importance, and those are also the sponsors you can get internally to then let you do these really cool things, like create these rotational teams to go spend time with the internal customers, or do these hackathons.

These are the things that allow you to make these outsized-impact moves in really short time frames that, as you decide on some of these new strategies, frankly help them see the light of day at much larger scale much faster, which everybody's super interested in doing these days.

And are there any other just inherent complexities

in the overall

aspect of data management and the available technologies

that you think we need to see resolved and addressed more effectively in order to reduce the tensions that exist between the organizations

and the different business units that lead to these bespoke solutions?

You know, my take is, and Tobias, we've talked about this a little bit before, I'm a really big believer in moving a lot of the technology systems from imperative to declarative. And the primary reason is, you know, we have pockets of technology in the data space right now. We have storage engines. We have processing engines. We have data catalogs and data warehouses and governance systems, but we don't have systems that actually connect all of these together. And so, you know, when we talked earlier about the importance of metadata, most of the metadata management in most companies is really being done pretty manually today.

And even if I have automated systems, like I'm doing, for example, automated data replication or pipeline orchestration and so on, I'm usually automating something that says do a job at a particular point in time, but we don't really have a tight linking between this piece of code ran on this piece of data that produced this other piece of data, and here's why it did it. And as a result, because we don't have this tightly bound notion that's highly automated at the metadata layer, we push all of this burden to people, and that's why we do it more manually. Like, very few companies, if any at all these days, can actually say whether the data in your warehouse, your database, your lake is actually reflective of the code in your repository.

It was probably reflective of some version of your code at some point that ran and some version of the data, but you actually don't know how and why. Like, you'd have to throw an engineer at the problem to go audit all of this stuff and figure out why.

And so as a result, I think this is one of the things that's just missing today: these smarter systems that are looking more holistically across the data and code landscape, so they can actually track and trace. Because in doing so, now we actually can automate more and get more of that burden out of the hands of engineers who are trying to manually do it, which accelerates those cycles and makes it easier for people to move a little bit more freely.
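
To make the "this piece of code ran on this piece of data that produced this other piece of data" idea concrete, here is a minimal sketch in Python. It is not how Ascend or any particular product does it; the names `LineageRecord`, `record_run`, and `is_current` are hypothetical, and it assumes the code and the input data are available as bytes so they can be fingerprinted.

```python
import hashlib
from dataclasses import dataclass


def fingerprint(content: bytes) -> str:
    """Return a stable SHA-256 fingerprint for a piece of code or data."""
    return hashlib.sha256(content).hexdigest()


@dataclass(frozen=True)
class LineageRecord:
    """Hypothetical metadata record: which code and inputs produced an output."""
    output_name: str
    code_hash: str
    input_hashes: tuple


def record_run(output_name: str, code: bytes, inputs: list) -> LineageRecord:
    """Capture lineage metadata at the moment a transformation runs."""
    return LineageRecord(
        output_name=output_name,
        code_hash=fingerprint(code),
        # Sort so the record does not depend on the order inputs were listed.
        input_hashes=tuple(sorted(fingerprint(i) for i in inputs)),
    )


def is_current(record: LineageRecord, code: bytes, inputs: list) -> bool:
    """True if the stored output still reflects the given code and inputs."""
    return record == record_run(record.output_name, code, inputs)
```

With a record like this stored next to each output, the question "is the data in the warehouse reflective of the code in the repository?" becomes a comparison a system can run automatically, rather than an audit an engineer has to do by hand.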

Yeah. I mean, I'm gonna double down on what you're saying there. And I know, Sean, you and I have had this conversation many times. But visibility into all of the work that is happening, all of the loads, all of the processing, the cost for the processing, which datasets are being used by hundreds of pipelines, which datasets we don't need anymore, streaming information, what data is scheduled to be retired and deleted: it's really difficult for a nontechnical person who may be making some of these decisions, like product leaders working with their legal counterparts, to get that view. So there are a lot of good tools out there that are starting to sniff and consume logs and place, you know, points and information along various parts of the pathway, but it is not a solved problem yet.
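
As a rough illustration of the kind of view described here, the sketch below derives "which datasets are used by which pipelines" and "which datasets nobody reads anymore" from read events. It assumes the events have already been parsed out of logs into `(pipeline, dataset)` pairs; the function names and the event shape are hypothetical, not taken from any specific tool.

```python
from collections import defaultdict


def dataset_usage(read_events):
    """Map each dataset to the set of pipelines that read it.

    read_events is an iterable of (pipeline_name, dataset_name) pairs,
    assumed to have been extracted from pipeline logs beforehand.
    """
    usage = defaultdict(set)
    for pipeline, dataset in read_events:
        usage[dataset].add(pipeline)
    return dict(usage)


def unused_datasets(known_datasets, read_events):
    """Return known datasets that no pipeline read, as retirement candidates."""
    usage = dataset_usage(read_events)
    return sorted(d for d in known_datasets if d not in usage)


if __name__ == "__main__":
    events = [
        ("billing_etl", "orders"),
        ("churn_model", "orders"),
        ("churn_model", "sessions"),
    ]
    for dataset, pipelines in sorted(dataset_usage(events).items()):
        print(dataset, "is read by", len(pipelines), "pipeline(s)")
    print("never read:", unused_datasets(["orders", "sessions", "legacy_export"], events))
```

The hard part in practice is collecting reliable events across every engine and tool, which is why this remains an unsolved problem rather than a few lines of code.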

Are there any other aspects

of this topic in shadow IT and some of the motivating

factors and possible solutions and the

reasons for tension and ways to try and overcome that that we haven't discussed yet that you think we should cover before we close out the show?

I think kinda going back to core first principles, and, you know, as I mentioned a little bit before, DevOps is ultimately, at a very high level, all about how do we enable more people to do more with software faster and safely? And fighting against anything like that is like fighting against gravity. And the same thing happens when it comes to data.

And a lot of the conversations around DataOps today, it's still forming, and there are a lot of opinions, and people are trying to kinda push it in a bunch of different directions. But, really, at the end of the day, again they come down to how do we enable more people to do more things with data faster and safely. And it's gonna be the same sort of core business drivers.

And I think for a lot of teams, the specific nuances of how you accomplish that inside of your organization may be very contextual to your world. But trying to actually reroute that momentum is really hard, because it is very much like fighting against gravity.

And figuring out how you can enable that, I would say, you know, we do the same sort of exercise and process in how we build here at Ascend, which is, gosh, if you can think through, well, we want to enable this and that would make the organization much better, and there's always another hard technical problem at the other end of that to solve, then we go solve that problem and figure out, well, if we can solve that, that will enable us to be a much more efficient and effective organization.

And so we just adhere to those core principles and kinda keep trying to knock down the next challenge to really help the team go faster and faster, safely.

Yeah. And I'll come back to it. I think that in many cases, the hardest problem is not a technical problem. It's a people problem. So you've gotta figure out how to motivate. You've gotta figure out how to put them customer first.

You've gotta help people understand

why there's value in working together, and you've gotta support them.

Well, for anybody who wants to follow along with the work that you're both doing or get in touch, I'll have you each add your preferred contact information to the show notes. And,

we addressed this a little bit at the end here, but if you've got anything to add on your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today, I'm happy to hear it.

I'll repeat what Sean said around visibility into the overall pipelines. There's still not the perfect tool for understanding what data is the right data to use, where it came from, whether you can trust it, and being able to get a view into that overall picture. And I'm a visual guy, so I'm not talking about, you know, code and databases. I'm talking about a way to really see this so that you can have an intuitive understanding of what's working and not working.

Yeah. And I'd say, clearly, at Ascend, we're really big fans of highly automated systems that track tons of metadata. It's what we do for data orchestration and autonomous pipelines. We also have a pretty UI, Charlie, so I know you like our UI. But, you know, as data engineers, gosh, we've been throwing our weight and intellectual horsepower at solving so many other problems for so long. We have self-driving cars. We have incredible machine learning algorithms. We can store more data and process more data and move more data faster than ever before.

And so my big take, and Ascend's big take, here is we can now apply that same intellectual horsepower to how we make it easier and faster and more automated to build data pipelines and maintain them and manage them at scale, and doing so helps us solve a lot of these other problems. I think that's much more a question of how do we have highly intelligent data orchestration technologies out there today. And so I think that's the next big frontier for data engineering.

Alright. Well, thank you both for taking the time today to join me and explore shadow IT in the data and analytics space.

It's definitely something that I'm sure a number of people have either engaged with or had to deal with at some level. So it's definitely an interesting topic and one that's valuable to address. So thank you both for your time and efforts on that. And I hope you enjoy the rest of your day.

Alright. Thanks, Tobias.

Thanks so much for having us.

Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.

And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
