Summary
The practice of data management requires technical acumen, but there are also many policy and regulatory issues that inform and influence the design of our systems. With the introduction of legal frameworks such as the EU GDPR and California’s CCPA it is necessary to consider how to implement data protection and data privacy principles in the technical and policy controls that govern our data platforms. In this episode Karen Heaton and Mark Sherwood-Edwards share their experience and expertise in helping organizations achieve compliance. Even if you aren’t subject to specific rules regarding data protection, it is definitely worth listening to get an overview of what you should be thinking about while building and running data pipelines.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- This week’s episode is also sponsored by Datacoral, an AWS-native, serverless, data infrastructure that installs in your VPC. Datacoral helps data engineers build and manage the flow of data pipelines without having to manage any infrastructure, meaning you can spend your time invested in data transformations and business needs, rather than pipeline maintenance. Raghu Murthy, founder and CEO of Datacoral built data infrastructures at Yahoo! and Facebook, scaling from terabytes to petabytes of analytic data. He started Datacoral with the goal to make SQL the universal data programming language. Visit dataengineeringpodcast.com/datacoral today to find out more.
- Having all of your logs and event data in one place makes your life easier when something breaks, unless that something is your Elasticsearch cluster because it’s storing too much data. CHAOSSEARCH frees you from having to worry about data retention, unexpected failures, and expanding operating costs. They give you a fully managed service to search and analyze all of your logs in S3, entirely under your control, all for half the cost of running your own Elasticsearch cluster or using a hosted platform. Try it out for yourself at dataengineeringpodcast.com/chaossearch and don’t forget to thank them for supporting the show!
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, Alluxio, and Data Council. Upcoming events include the combined events of the Data Architecture Summit and Graphorum, the Data Orchestration Summit, and Data Council in NYC. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
- Your host is Tobias Macey and today I’m interviewing Karen Heaton and Mark Sherwood-Edwards about the idea of data protection, why you might need it, and how to include the principles in your data pipelines.
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by explaining what is encompassed by the idea of data protection?
- What regulations control the enforcement of data protection requirements, and how can we determine whether we are subject to their rules?
- What are some of the conflicts and constraints that act against our efforts to implement data protection?
- How much of data protection is handled through technical implementation as compared to organizational policies and reporting requirements?
- Can you give some examples of the types of information that are subject to data protection?
- One of the challenges in data management generally is tracking the presence and usage of any given information. What are some strategies that you have found effective for auditing the usage of protected information?
- A corollary to tracking and auditing of protected data in the GDPR is the need to allow for deletion of an individual’s information. How can we ensure effective deletion of these records when dealing with multiple storage systems?
- What are some of the system components that are most helpful in implementing and maintaining technical and policy controls for data protection?
- How do data protection regulations impact or restrict the technology choices that are viable for the data preparation layer?
- Who in the organization is responsible for the proper compliance to GDPR and other data protection regimes?
- Downstream from the storage and management platforms that we build as data engineers are data scientists and analysts who might request access to protected information. How do the regulations impact the types of analytics that they can use?
Contact Info
- Karen
- Mark
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- Data Protection
- GDPR
- This Is DPO
- Intellectual Property
- European Convention on Human Rights
- CCPA == California Consumer Privacy Act
- PII == Personally Identifiable Information
- Privacy By Design
- US Privacy Shield
- Principle of Least Privilege
- International Association of Privacy Professionals
- Data Provenance
- Chief Data Officer
- UK ICO (Information Commissioner’s Office)
- Data Council
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform. And if you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances.
Go to dataengineeringpodcast.com/linode, that's L I N O D E, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. This week's episode is also sponsored by Datacoral, an AWS-native, serverless data infrastructure that installs in your VPC. Datacoral helps data engineers build and manage the flow of data pipelines without having to manage any infrastructure, meaning you can spend your time invested in data transformations and business needs rather than pipeline maintenance. Raghu Murthy, founder and CEO of Datacoral, built data infrastructures at Yahoo and Facebook, scaling from terabytes to petabytes of analytic data.
He started Datacoral with the goal to make SQL the universal data programming language. Visit dataengineeringpodcast.com/datacoral today to find out more. And having all of your logs and event data in one place makes your life easier when something breaks, unless that something is your Elasticsearch cluster because it's storing too much data. ChaosSearch frees you from having to worry about data retention, unexpected failures, and expanding operating costs. They give you a fully managed service to search and analyze all of your logs in S3, entirely under your control, all for half the cost of running your own Elasticsearch cluster or using a hosted platform. Try it out for yourself at dataengineeringpodcast.com/chaossearch, and don't forget to thank them for supporting the show.
You listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Corinium Global Intelligence, Alluxio, and Data Council. Upcoming events include the Data Orchestration Summit and Data Council in New York City. Go to dataengineeringpodcast.com/conferences to learn more about these and other events and take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey, and today I'm interviewing Karen Heaton and Mark Sherwood-Edwards about the idea of data protection, why you might need it, and how to include the principles in your data pipelines. So, Karen, can you start by introducing yourself?
[00:03:03] Unknown:
Yes. Good afternoon. My name is Karen Heaton. I run my own data protection consultancy. I've been working in data and systems implementations in financial services for over 20 years, and I'm a real fan of technology. My consultancy now specializes in data protection compliance, mainly for small and medium sized organizations, including tech startups.
[00:03:27] Unknown:
And, Mark, can you introduce yourself as well? Sure. My name is Mark Sherwood-Edwards. I'm a tech lawyer by background, and I now do a lot of work in the privacy space through a privacy business called This Is DPO (thisisdpo.co.uk). Like Karen, I'm based in London, UK, and we do a lot of work, sometimes together, around the GDPR. My background, as I said, is as a lawyer, primarily in technology,
[00:03:55] Unknown:
outsourcing, intellectual property, those kind of areas. And going back to you, Karen, do you remember how you first got involved in the area of data management?
[00:04:03] Unknown:
Well, yes. Ever since I first started working, which was quite a long time ago, I've always been involved in systems and data, whether I've been working in operations for banks and other types of companies or working for software companies, helping them implement their software solutions in their clients' organizations. So data and systems have been a very large part of my life. And now, with the data protection regulations coming in, these elements are marrying together into a very interesting
[00:04:41] Unknown:
topic area. So that's many years of experience in this area. And, Mark, do you remember how you got involved in data management? Yeah. Well, probably not quite as directly as Karen's,
[00:04:51] Unknown:
but I've been in-house counsel for a number of outsourcing companies over the years. Some have been doing a lot of HR, some a lot of procurement, all of it involving a fair amount of data. So I've always been quite interested in it, and in the interrelationship between data and intellectual property, whether you can or can't own data. It's been an interest of mine for quite a number of years now. It's been interesting because
[00:05:22] Unknown:
Mark and I often talk about how, given our ages, we've enjoyed watching technology grow from the days of very large mainframe computers and punch cards all the way up to what we have today, with the plethora of apps and now, you know, artificial intelligence and machine learning. So we both feel very privileged, actually, that we've been able to watch the growth of systems and data over the last 30 plus years.
[00:05:52] Unknown:
And as you mentioned, in recent years there have been some regulations coming out that cover this idea of data protection. Can you start by explaining what is encompassed by data protection, particularly from the perspective of
[00:06:07] Unknown:
data storage systems and data management? Okay. Well, I'll kick off on that one. So when people talk about data protection, they're talking essentially about personal data: the protection of personal data and the regulation of personal data. Now, most people nowadays are aware of the GDPR, which came out in the European Union last year but which has impact outside the European Union as well; we'll talk about that a bit later. But the GDPR wasn't particularly new, at least in Europe. Data protection has been around for 20 or 30 years, and it really starts with human rights. In fact, Article 8 of the European Convention on Human Rights says everyone has the right to respect for his private and family life, his home, and his correspondence.
And that, the respect for your private life, with the data you use and create being seen as part of your private life, is where it all starts from. Interestingly, I took the opportunity to look at some corresponding US documents, and the US constitution, the one that starts off "We the people", as you know, Tobias, goes on: "We the people of the United States, in order to form a more perfect union, establish justice, insure domestic tranquility", and that "insure domestic tranquility" is possibly, and I'm not a US constitutional lawyer, you'll appreciate, the same kind of concept of how one protects private life. So the general concept has been around for a long time, and it is this: it's fine for companies to take and use other people's, or their customers', personal data, which they may collect because people open an account, or it's a bank, or you may be shopping on Amazon or whatever, provided they do that lawfully, transparently, and in a fair way. And, essentially, that means if you're disclosing your personal data to someone else or to a company, they've got to be very explicit about what they're gonna do with it and not do things which they haven't disclosed ahead of time.
And if that's the case, broadly, then what they do is gonna be lawful and everybody's gonna be happy with it. Another way of looking at it is as a trust-based thing. We, consumers, are trusting companies with our personal data, and having entrusted them with that data, we are expecting them to respect that trust by acting lawfully, transparently, and so on. So, essentially, data protection is a codification of a trust-based principle. That's the high level view, and then you can dig down to various low level views as we progress.
[00:08:58] Unknown:
And in terms of the actual specific regulations, as you said, it's a concept that's been around for quite a while. But in recent years, we've been more explicitly codifying it in the GDPR and then recently in California with the CCPA. I'm wondering if you can just discuss a bit of the scope of those regulations and how an organization or an individual can best determine whether or not they're subject to those rules and what particular information is encompassed by that. Okay. Well,
[00:09:30] Unknown:
both the GDPR and the CCPA are essentially rules about personal data and the use of people's personal data. So to that extent they're very similar, and they have a lot of concepts in common. Some of the concepts apply in both the GDPR and the CCPA. If you're gonna use other people's personal data, your consumers' personal data, you've got to tell them that's what you're doing. The generic privacy notice says, yeah, we're using your personal data, these are the categories of data we're gonna use, and these are the purposes we're gonna use them for. And that applies across both the CCPA and the GDPR. Both equally have a right for the relevant consumer to get access to their data, i.e. a copy of the data that's been held about them, and both have the right for consumers to get their data deleted. Now, then the differences start to come. The GDPR is the product of longer thinking: there was a preceding regulation from 1998 and a longer gestation period, whereas the CCPA was put together in a bit of a rush to meet a deadline.
So some of the fundamental differences: you get concepts in the GDPR like lawful basis. There's a finite list; there are, I think, six lawful bases under the GDPR. And if you're processing personal data (personal data means more or less the same thing in both regimes; the US tends to call it PII, personally identifiable information, where Europe calls it personal data), they're very broad categories that include your name, your email address, your phone number, but also online identifiers, things like that. Anything which can be traced back to you on a kind of one-to-one basis. So both have that in common. The GDPR has this notion of the lawful basis: you have to have a lawful basis to handle personal data, and you have to explain which lawful basis you're using. For example, have you got consent?
If you don't have consent, is it pursuant to contract? If it's not that, is it pursuant to legitimate interests, that kind of stuff? The GDPR also has some more fundamental concepts kicking around in it, things like privacy by design and data minimization. Privacy by design means kind of what it says. The general thinking is, if you've got any system, by which I mean not just hardware and software but the people around the system, and you'll be handling personal data, processing personal data, then you need to have thought out how you're gonna build in privacy requirements from the start. It's no longer okay to just try to retrofit it afterwards. Data minimization means a couple of things.
What's the minimum amount of data that you need to accomplish the job, not what's the maximum amount of data? And then once you have that data, when you use it in your business, use the minimum amount for each particular job, not sharing data around like confetti. So those are some structural, fundamental things which are explicitly called out in the GDPR but not explicitly called out in the CCPA. The CCPA's approach is much more an opt-out approach. Most things are permitted, but the primary thing is, and it applies to Californian residents and most companies dealing with Californian residents, if it applies to you, you have the right to opt out of what they call selling of data.
So you can object, you can say you can't sell my data, and selling is effectively a broad term for any kind of sharing, licensing, and so on of data. So they're similar in some ways. They've got some fundamental underpinnings, tectonic plates, which are different, but that end up driving the same kind of behavior for businesses over time, which is
[00:13:51] Unknown:
be more careful with personal data, essentially. And I think, Mark, Tobias also asked about the territorial reach of the GDPR.
[00:14:01] Unknown:
Yeah. Some of the ways that you can determine whether or not the data that you're dealing with is actually subject to these regulations. And I think that the blanket approach that a lot of companies are taking is that it's too hard to identify at a granular level whether or not somebody is a European citizen or isn't or is in some way related to the European Union or California, and so they just apply the same sets of principles in a blanket sense. And I'm wondering what your thoughts are on some of the sort of best strategies to approach the regulatory environment that we're in now.
[00:14:35] Unknown:
So yeah. I'm sure most people have seen evidence of big American companies in particular undertaking compliance programs, signing up to the US Privacy Shield to show that they have adequate data protection standards, and even doing things like updating terms and conditions and privacy notices. Mailchimp, for example, have introduced a double opt-in function into their platform so that consent is up to the GDPR standard for any of their clients in the EU. So if you don't want to adopt a blanket approach, it comes down to truly being able to understand in a granular way what data you have in your datasets, how you acquired it, what you're gonna use it for, whether your organization actually does sell products and services into the EU, in which case it will definitely be subject to the regulations, or even whether it has websites which people based in the EU can access. And if those websites are running lots of cookies, plugins, pixels, etcetera, and those cookie items are collecting data of individuals based in the EU, then they are also gonna be subject to the rules of the GDPR. So I understand the approach of just assuming everyone is subject to the regulations and then applying the blanket approach. But there are definitely ways of properly analyzing your business, asking more and better questions about the datasets that you have and how you acquired them, and then taking a view based on that. But I agree it's difficult if you've got very large sets of data and perhaps not a lot of information around the background to those datasets.
[00:16:45] Unknown:
And then from the organizational and technical perspective, what are some of the conflicts or constraints that act against some of the efforts that they might try to put in place to implement data protection, whether it's because the technical systems design that they have doesn't really allow for, proper segregation or tracking, or whether it's a matter of policy as far as helping the different people within the organization understand the importance of these different regulations and their enforcement?
[00:17:16] Unknown:
Yeah, it's a really good question, actually, Tobias, because there are a number of reasons, I think, that we see conflicts and constraints. The biggest one, which Mark and I both think is first and foremost, is that you need management buy-in on the need to understand the regulations and implement appropriate standards in your organization in order to be compliant with them. This goes back to the codification of trust that Mark talked about earlier on. It really is a trust journey, not only with your customers but also with your employees. I think lots of data scientists and data analysts want to do the right thing. But if organizations don't train them in what the right thing looks like, or if organizations don't give them the tools to do their job in a way that is compliant and meets the standards, it's hard for them to do the right thing in the way that they would like to. So management buy-in is really important.
And also, you know, it's an investment in your business. Sometimes, going through a compliance program, you can identify cost saving measures in your business. If you do a proper data audit and data discovery and you map out your records of processing activities, you might find activities going on in your business that are unnecessary. You might have systems that don't need to be used or paid for. So there can be benefits gained from these compliance programs aside from the obvious one of being compliant. And also, sales and marketing departments are often in conflict with some operational departments or compliance departments because they've got different goals.
Obviously, the sales and marketing teams want to generate leads and get money in the door, but it has to be in a lawful and compliant way. I think one of the areas where you kind of see differences is if you think
[00:19:19] Unknown:
about how most IT departments are concerned with the security of the data, all the usual things on security, which I think of as protecting the perimeter. Then you can think about the kind of compliance that sits within the perimeter. So, for example, if too many people have access to personal data, if no one is monitoring who has access, if people aren't applying least privilege within the organization, then that starts giving you a data protection compliance problem. In fact, there was a good example in Portugal the other day, which is obviously in the European Union. There was a Portuguese hospital which had something like 50 doctors, and there were 200 people in the organization who had doctor-level access to patients' records.
So that gives you a good example. That's not a hardening issue; it's not that the server hasn't been correctly hardened. Everything's working like it should, everything's been patched correctly. It's just that there's a laxity at the business level. No one's really thinking it through: we've been trusted to hold on to this data and protect it properly, but actually that would mean revising
[00:20:34] Unknown:
or checking, each time, who has access to what data. And, you know, I know we were talking about conflicts and constraints around the efforts to implement data protection in organizations, but it's also important for organizations to realize the role of supplier due diligence. I'm sure many of your listeners will have gone through an RFI process or had to fill in a due diligence questionnaire for a large client. GDPR and data protection compliance is often a big part of that due diligence process, and you may not actually get the business unless you have certain standards and procedures in place.
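As an editorial aside for listeners who want to act on the least privilege point above, a periodic access review can start as a very small report. Below is a minimal sketch in Python; the grants, role names, and approved-holder list are entirely hypothetical and would come from your own systems.

```python
from collections import defaultdict

# Hypothetical export of access grants: (user, system, role)
grants = [
    ("alice", "patient_records", "doctor"),
    ("bob", "patient_records", "doctor"),
    ("carol", "patient_records", "doctor"),
    ("dave", "patient_records", "billing"),
]

# Roles that confer access to personal data, and who legitimately needs them.
SENSITIVE_ROLES = {"doctor"}
APPROVED_HOLDERS = {"alice", "bob"}  # e.g. the clinicians actually on staff

def review_access(grants):
    """Flag users holding a sensitive role without a recorded legitimate need."""
    holders = defaultdict(set)
    for user, system, role in grants:
        if role in SENSITIVE_ROLES:
            holders[system].add(user)
    return {
        system: sorted(users - APPROVED_HOLDERS)
        for system, users in holders.items()
        if users - APPROVED_HOLDERS
    }

if __name__ == "__main__":
    for system, users in review_access(grants).items():
        print(f"{system}: review access for {', '.join(users)}")
```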
[00:21:14] Unknown:
Another thing too is that the initial era of big data was just capture everything because you never know when you might need it. And a lot of the current trends are pushing in the opposite direction of don't collect it unless you know that you need it. So it's interesting to see how that has been manifesting in the industry in terms of the technology choices that people make, as well as the conversations that people are having as far as how to approach analytics
[00:21:40] Unknown:
and data collection and data management. I think it's great that the conversations are being had and that the trends are changing. I think that shows a big awareness now of data protection in and of itself.
[00:21:52] Unknown:
And then one of the big challenges in these technical implementations and system designs is understanding what data you have, where you have it, and who's accessing it. So I'm wondering if you can talk through some of the challenges that you've seen people go through and some of the solutions that they've come up with to approach that idea of just understanding what data exists and how to properly
[00:22:20] Unknown:
maintain and secure and protect it? Yeah. I mean, that is a great point, actually, because understanding what data you've got, in which systems, where it came from, who has access to it, and where it's stored is a foundational step on your data protection compliance journey. Until you've done that exercise, it's actually quite difficult to do things like build out accurate privacy notices or put decent processes and procedures in place, including things like data handling. My experience with my clients is that it's one of the hardest exercises for them to do if they haven't previously done it. If you're a smallish organization, you can take a super simple approach like capturing it in an Excel spreadsheet, but even that is a very time consuming exercise.
Luckily, there are a number of different tools on the market that can be used. There are data mapping tools, things that can map out your data flows; you can enter which systems your data is stored in, who the processors are, and where they're located. So there are tools out there that allow you to capture all that data audit or data inventory information, and that is something I would fully recommend an organization invest in, because once you've done it, it's just a case of maintaining it. So that's the first thing that I would suggest on that one.
[00:23:50] Unknown:
Yeah, it's interesting. I've also done it in some low tech ways. I had a company ask me, where I act as the DPO, to help them get from where they were, which wasn't very good, frankly, to a good data protection regime. And I bought a large roll of wallpaper lining paper (I know, it sounds very cheap, but there's a lot of it), sliced it up, and stuck it up on the wall where we had a kind of war room. And we worked it up. We knew roughly what kinds of data were coming in, so you could do a kind of data in: okay, that's that, data comes in.
Data's held: okay, that's that section. Data out: so you have a kind of life cycle of the data, and you can work through all of that. And then you could get scribbled bits on your post-it notes and stick them up. It's very analog, very old school, but you could get a lot of people involved. It can't just be the IT guys, it can't just be the data guys, it can't just be the compliance people. It has to be everybody in there, talking about it, sticking their bits of information up. So that worked quite well. One of the interesting outcomes is that we went into it with the company thinking it held 2,000,000 records covering 2,000,000 people. By the time we finished, we realized they held records covering 30,000,000 people, which is a bit of a discrepancy, but no one had really thought about it, right? Everyone was just doing their normal job, working quite hard. But in their silos. In their silos. Yeah.
And one of the things you come to realize is that this is not a silo exercise.
[00:25:29] Unknown:
You have to remove the silos to get it to work well. It needs involvement across every department within the organization. It's not just an IT project, which is perhaps how data protection was seen before. I think most people would have linked data protection to data security, but it's more than that; data privacy is the big other side of data protection. On the topic of tools that are available, your listeners might be interested in the International Association of Privacy Professionals. Every year they produce a privacy technology report, and they list out all the vendors in the privacy tech space.
And there's some really interesting information in that report, and I can certainly send you a link to that afterwards, Tobias.
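For anyone starting with the simple spreadsheet approach Karen mentions before buying a dedicated data mapping tool, a record of processing activities can begin as a small structured file. A minimal sketch, where the systems, fields, and vendor names are purely illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class ProcessingActivity:
    """One row of a simple data inventory / record of processing activities."""
    name: str                                        # what the processing is
    system: str                                      # where the data lives
    categories: list = field(default_factory=list)   # kinds of personal data held
    purpose: str = ""                                # why it is collected
    lawful_basis: str = ""                           # e.g. consent, contract
    processors: list = field(default_factory=list)   # third parties involved
    retention: str = ""                              # how long it is kept

inventory = [
    ProcessingActivity(
        name="newsletter signups",
        system="mailing_list_saas",
        categories=["name", "email"],
        purpose="sending the weekly newsletter",
        lawful_basis="consent",
        processors=["Example Email Vendor"],
        retention="until unsubscribe",
    ),
]

# Even a flat export like this answers the foundational questions:
# what do we hold, where, why, on what basis, and who else touches it.
for activity in inventory:
    print(f"{activity.system}: {activity.categories} for {activity.purpose}")
```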
[00:26:23] Unknown:
Yeah. I'll definitely be interested to take a look at that and see what types of systems they've got in their purview. And another thing that plays in to the idea of data protection and identifying and auditing the data flows is another big challenge in the data management space of data provenance, which covers everything from what you were describing of data in up through data out and into the point of doing analytics or machine learning to figure out what records are actually being used, what attributes of those records are being used within those machine learning workflows to be able to make sure that you're not inadvertently exposing information or inadvertently including information that's not actually necessary for the conclusions that you're trying to derive.
[00:27:09] Unknown:
Exactly. Data provenance is hugely important. One example: if you're part of a project team and you're given a large dataset and you have to go and tidy it up, do the data preparation on it, for example, who is going to be asking the questions around why we've got this data and what we're allowed to do with it? At what point in the life cycle of obtaining, preparing, storing, securing, and then using that data are those important questions gonna be asked?
[00:27:44] Unknown:
It is interesting. One of the things I've seen done, which I think works well for companies that are taking big datasets in and out, is applying something a bit like customs: when you arrive at the airport, you've got to go through the customs check, someone checks your passport before you go through. One of the things I've done in some of these engagements is put a similar kind of thing in place for data. So you can't import any data into the system without some kind of analysis of what the data is, where it's coming from, what rights are attached, and so on, and then there is some mechanism within the company which allows that data to come in. It doesn't come in automatically. And the same thing applies when you send data out.
And in fact, that involved big movements of data. One thing we were discussing (this was a company doing a lot of data processing, so it might hold financial data, fairly sensitive data on a number of people) is that you might send data out to a third party to take a look at, and it would be an attachment in an email. Of course, you don't quite know where the attachment's gonna end up, so we started to change that: you never send an attachment, you send a link to a data room. Then, when the person you sent it to comes and looks at the data, it doesn't leave the location, in theory, and at least there's an audit trail of how the data is getting accessed.
It's that kind of thinking.
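Here is a sketch of that customs-style gate and audit trail, assuming hypothetical field names, an in-memory log, and an invented internal data room URL; a real implementation would sit in front of your ingestion and sharing workflows.

```python
import datetime
import uuid

AUDIT_LOG = []  # stand-in for an append-only audit store

def log_event(action, **details):
    """Record every data movement so there is always an audit trail."""
    AUDIT_LOG.append({
        "id": str(uuid.uuid4()),
        "at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "action": action,
        **details,
    })

def import_dataset(name, source, rights, contains_personal_data, approved_by=None):
    """Gate inbound data: nothing comes in without provenance and sign-off."""
    if contains_personal_data and approved_by is None:
        log_event("import_rejected", dataset=name, source=source)
        raise PermissionError(f"{name}: personal data requires a named approver")
    log_event("import_approved", dataset=name, source=source,
              rights=rights, approved_by=approved_by)
    return True

def share_dataset(name, recipient):
    """Outbound sharing: issue a link to a controlled data room, never a copy."""
    link = f"https://dataroom.example.internal/{uuid.uuid4()}"  # hypothetical location
    log_event("share_link_issued", dataset=name, recipient=recipient, link=link)
    return link

if __name__ == "__main__":
    import_dataset("q3_transactions", source="partner_sftp", rights="contractual",
                   contains_personal_data=True, approved_by="data.protection.lead")
    print(share_dataset("q3_transactions", recipient="external_auditor"))
    print(f"{len(AUDIT_LOG)} audit events recorded")
```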
[00:29:20] Unknown:
And then another question that came up when you were discussing bringing everybody into the same room to map out all the different ways that the data is being used and discovering that they actually had, you know, several multiples more records that they needed to be concerned about than what they had originally thought is the idea of who's actually responsible for making sure that an organization is considering all of the different implications and ensuring that the company is appropriately in compliance with the different regulations and even identifying what regulations they're subject to. And that's
[00:29:57] Unknown:
that's an interesting one to talk about because, obviously, ultimately, the board is responsible in a large organization, but responsibilities need to be shared and appropriately assigned. So I've seen setups where you start from a basis of having a system owner for each of the systems that you've got, and they would then be responsible for the data within those systems. And then, if an organization is large enough, they might have a privacy team, for example, and the system slash data owners would have a dialogue with the privacy team. So it is really important to get back to the foundational step, which is your data audit and your data inventory, understanding what you've got, in order then to be able to appropriately assign the responsibility for certain aspects of data protection within the organization.
Larger companies could also have chief data officers; I've seen that used quite a lot in some of the big banks. And even in tech startups, for example: just because they're a startup and they're small, they might still have hundreds of millions of records. They still need somebody in their organization who's responsible for data protection, and if they don't have the skills, they should go externally and source those skills.
[00:31:34] Unknown:
Yeah, you should have some kind of data operating model, so you know how it works, and you should have some kind of monthly governance. Every month the different disciplines get together, and you have a standing agenda to work through. And things like vendors, right, suppliers: if you outsource some of your processing to someone else, you're still responsible for that data even though someone else is processing it for you. So who's checking up on the vendors to make sure they're doing what they should be doing, that they've got good security and all that kind of stuff? So although one person in the end will probably be responsible in the hierarchy, as an operational matter the execution tends to be distributed, with different people bringing their different angles in. And if they don't have a regular spot at which to meet and discuss issues, it's more likely that issues will get missed.
[00:32:34] Unknown:
And it all starts from training and awareness. As I think I mentioned earlier, you've gotta give your employees the chance to be able to do the right thing, and it's only fair to them that they get the training to understand what it is they need to be doing. And then, as a corollary to the idea that we were discussing earlier of
[00:32:53] Unknown:
tracking and auditing the information that we're storing and using, in the GDPR at least there is the right to be forgotten clause, where a company needs to be able to thoroughly delete information pertaining to a given individual. That can be quite complicated, especially when we're dealing with complex systems with multiple different storage layers or multiple different pipelines that are replicating bits and pieces of information throughout. So I'm curious what you have seen as far as challenges at the technical and organizational level, and some of the strategies and technologies that have been found useful for being able to follow that regulation.
[00:33:34] Unknown:
Yeah. Good question. Interestingly, in the GDPR we've got, I think, six or seven data subject rights; consumers are technically known as data subjects. The right to erasure, the right to be forgotten, the right to get access to your data, the right of correction... I can't remember them all, but it's either six or seven. Now, although the right to be forgotten grabs the headlines, the one that's mainly exercised is the data subject access right, the right to get a copy of the data held about you. There's access in the GDPR, there's access in the CCPA, and you get the same issue, right, multiple systems and so on. Talking about the subject access right to begin with, it's not that you have to go to the ends of the earth to produce everything, but you've got to make a reasonable effort.
Clearly, if you've got a coherent system with everything in the right place, it's easy to do. If you're straddling six or seven legacy systems, it gets much more complicated, which is a good reason to get rid of data you don't need, to be frank, because then you don't have to report on it. In terms of the right to be forgotten, the right of deletion, it's not an unfettered right; it's not unqualified. So if I've got a contract with you and you're still a customer, you can't just say, delete my data. Well, I can't do that; I've got a contract with you. Even if the contract's over, there are reasons you can refuse. For example, you're allowed to hold on to the data if you think that you might need it in contemplation of a legal defense at some point down the line. So it's not as unlimited as you might think.
Now, the UK has always been a bit more business friendly about this kind of stuff than other bits of Europe, and we're still in Europe for the time being. It's always been accepted that you could hash some of the information and that would basically count as deletion. The UK used to have this thing called putting data beyond use, where for some reason you might not be able to delete all the data, but you could park it somewhere where it was not accessible, or not easily accessible, to the business, you know, requiring two or three sign-offs and being difficult to access, and you can take those kinds of protections.
Despite its exciting name, the right to be forgotten doesn't come up that often in most businesses as a practical matter. It's not an unqualified right, and it causes fewer issues than you might expect.
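One way to read the hashing approach Mark mentions is as keyed pseudonymization: direct identifiers are replaced with one-way hashes, and the key is held separately so that it can itself be destroyed. A minimal sketch with hypothetical field names; whether this counts as erasure or putting beyond use in your situation is a legal judgment, not a technical one.

```python
import hashlib
import hmac

# The key would live outside the dataset (e.g. a secrets manager);
# destroying it makes the pseudonyms irreversible in practice.
PEPPER = b"replace-with-a-secret-held-outside-the-dataset"

def pseudonymize(value: str) -> str:
    """Replace a direct identifier with a keyed, one-way hash."""
    return hmac.new(PEPPER, value.encode("utf-8"), hashlib.sha256).hexdigest()

def put_beyond_use(record: dict, identifier_fields=("name", "email", "phone")) -> dict:
    """Hash the fields that tie a record back to an individual, keep the rest."""
    cleaned = dict(record)
    for key in identifier_fields:
        if key in cleaned:
            cleaned[key] = pseudonymize(str(cleaned[key]))
    return cleaned

record = {"name": "Jane Doe", "email": "jane@example.com", "order_total": 42.50}
print(put_beyond_use(record))  # order_total survives; identifiers do not
```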
[00:36:24] Unknown:
So, just to follow on from what Mark was saying there: the challenge with those requests, whether they're deletion requests or subject access requests, comes in a complex ecosystem where you have a number of potentially interconnected systems. The first step in overcoming some of those challenges is, again, the foundational step we've already talked about, which is understanding what data you've got, where, why you've got it, etcetera. Once you've done that, there are a number of tools on the market. Coming back to the privacy tech sector, there are data discovery tools that can assist with working out what data you've got where.
I've also seen some tools that can bring you a single view of an individual and allow you to perform deletion requests in a much quicker and more automated way. So, yes, there are solutions on the market to assist with some of the complex and time consuming requests that you may get. But it's always important to do two things. One is to understand whether the request is one that is valid under the regulations. And two, to have done your homework on your systems and your processes, so you understand what's where and how much of a task this is that you're gonna have to undertake.
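A single view of the individual, as Karen describes it, usually sits on top of the data inventory: each registered system exposes a way to find and erase a subject's records, and a coordinator fans the request out and keeps a report. A rough sketch with hypothetical in-memory adapters standing in for real datastores:

```python
class SystemAdapter:
    """One adapter per storage system listed in the data inventory (hypothetical)."""

    def __init__(self, name, records):
        self.name = name
        self.records = records  # stand-in for a real datastore

    def find(self, subject_id):
        return [r for r in self.records if r.get("subject_id") == subject_id]

    def erase(self, subject_id):
        before = len(self.records)
        self.records = [r for r in self.records if r.get("subject_id") != subject_id]
        return before - len(self.records)

def handle_erasure_request(subject_id, systems, request_is_valid=True):
    """Fan a validated erasure request out to every registered system."""
    if not request_is_valid:  # e.g. an active contract may be grounds to refuse
        return {"status": "refused"}
    return {
        "status": "done",
        "deleted_per_system": {s.name: s.erase(subject_id) for s in systems},
    }

systems = [
    SystemAdapter("crm", [{"subject_id": "u1", "email": "jane@example.com"}]),
    SystemAdapter("analytics", [{"subject_id": "u1", "event": "login"}]),
]
print(handle_erasure_request("u1", systems))
```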
[00:37:53] Unknown:
And then another layer where this manifests, particularly in terms of updating data or having a customer elide bits of information from their records, is how the data is being used in downstream use cases, whether that's business analytics or some sort of machine learning on aggregate data. How does that play into the need to regenerate a model after it's gone through a training regimen once the data is updated, or into which particular attributes of a record are being used within those analytics? And what technologies and techniques are viable for remaining within compliance of these regulations, especially as far as some of the explainability
[00:38:39] Unknown:
requirements that come up? So, yeah, these are the million dollar questions, really, that we're getting into now. Definitely, once we start to go downstream from the data collection and the scientists and the analysts are starting to run searches across the data, etcetera, it still comes back to the organization's responsibility and requirement to provide the scientists and analysts with a decent, clean set of properly obtained, lawful data. Then, if data scientists or analysts want to access data within that dataset that could be protected, there should be appropriate controls or tags or logging or audits that the scientists and the analysts can see and be aware of when they come to do the projects that they've been assigned.
It's also possible to embed the management of that at the beginning of a project. If scientists or analysts, or even the project managers running a project that involves them, look at what the project aims to achieve, and the intended outcomes are ones that might result in the creation of profiled datasets upon which decisions will be made, then at the beginning of the project they should do some sort of privacy impact assessment of what data they're going to use, whether they're using it in a lawful way, and what safeguards they'll put in place for the results that are generated from that particular project.
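The controls and tags Karen refers to can start as simple column-level metadata that the analytics tooling consults before handing data to a project. A small sketch; the tags, column names, and purpose flag are made up for illustration.

```python
import pandas as pd

# Hypothetical column-level tags maintained alongside the dataset's schema.
COLUMN_TAGS = {
    "email": {"personal_data"},
    "date_of_birth": {"personal_data", "special_care"},
    "purchase_total": set(),
}

def columns_allowed(purpose_allows_personal_data: bool):
    """Return the columns a project may see, given its assessed purpose."""
    return [
        col for col, tags in COLUMN_TAGS.items()
        if purpose_allows_personal_data or "personal_data" not in tags
    ]

df = pd.DataFrame([
    {"email": "jane@example.com", "date_of_birth": "1990-01-01", "purchase_total": 42.5},
    {"email": "joe@example.com", "date_of_birth": "1985-06-15", "purchase_total": 17.0},
])

# A project whose impact assessment did not justify personal data
# only ever sees the untagged columns.
print(df[columns_allowed(purpose_allows_personal_data=False)])
```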
[00:40:39] Unknown:
So, yeah, looked at another way: go back to the beginning, to the point we were making about this being about trust. You've got your dataset, so the question is, what have you disclosed and consented to, and what are the reasonable expectations of the consumers? Are you acting within that? Now, if you want to go beyond what was disclosed, under the GDPR you can do that provided it's kind of akin, in the same ballpark, but you typically need to inform the data subjects that that's what you're gonna do.
So that's an initial constraint. Now, of course, the GDPR applies to personal data. If you anonymize personal data, and anonymization is a sliding scale as we all know, then if you sufficiently anonymize the data it stops being personal data and you're not regulated by the GDPR anymore. You can do what you want with it, provided someone can't come back and reverse it back to the original people. But it comes back to thinking about what you need to begin with, thinking it through, planning it through. And then the GDPR has a thing about automated decision making and profiling. It's not forbidden. You're allowed to have profiling, you're allowed to have automated decision making, you're allowed to have automated processing, provided you've made the correct disclosures.
The issue is where you have automated decisions taken without any human involvement. If you're doing that, you have to disclose it, and the person about whom the decision is taken has the right to object and the right to an explanation of how the decision was taken. And then that brings us into the things you were talking about, interpretability. So if you've got some machine learning which is pumping out decisions and you can't explain how a decision was reached, that starts to give you a bit of a problem, both in the GDPR sense, but also in the more practical sense: is there some bias against a particular kind of person?
That's happened in the past, even without machine learning. One of the concerns is that machine learning will have inbuilt biases, which is odd, because we know that humans definitely have inbuilt biases. So those are the kind of new, cutting edge issues that people are wrestling with, and there's a lot of thinking and discussion and guidance about it. Part of what I'm hoping to see is AI which validates AI: you have a test dataset, you run it through, and if it produces the right output each time, you know it's working, something like that. Does that help? Yeah. That's definitely useful. And
[00:43:51] Unknown:
some of the topic of bias also comes back to the data engineering layer and data collection, as far as trying to identify potential biases that exist in the datasets and then either seeking additional or alternative sources of information to complement them, or at least annotating the dataset to say this is a potential source of bias, so that the people performing the analysis are aware of it and can try to counteract it in the algorithms that they apply to the data. But as you said, we're all human.
[00:44:29] Unknown:
All the computer algorithms that we use are written by humans, and so there's no way to completely divorce ourselves from bias, but we can at least try to identify and account for it. Agreed. And there are some examples where it's actually quite easy to use a machine to remove bias. So if you've got a lot of CVs coming in, you can take out the references to sex, you can take out all the age references, you might take out all the unusual foreign-sounding names and replace them all with Smith, you know, if that's the kind of background you're from. And so there are ways in which machines will actually help counter human bias as well.
In fact, we'll send you the link: there's some very interesting work done by the UK ICO on this kind of stuff. They've got, I think it's called, I'm looking at Karen, an AI audit framework, and they've done some think-tank work on what strategies and what steps you should go through to make your development of machine learning successful, in all the senses of successful. I saw a job ad the other day about who needs to be involved in machine learning, and one of the roles was a new job title I'd never heard of before: an ethicist, which is like a practical Thomas Aquinas, a bit like the man married to Madam Secretary, if you ever watched that, a professor of ethics, but with a practical application.
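The CV example Mark gives, stripping references to sex, age, and name before a person or a model sees the record, can be implemented as a pre-processing step. A small illustration with hypothetical field names; real de-biasing of free text is considerably harder than this.

```python
# Fields that could introduce bias into a screening decision (hypothetical list).
PROTECTED_FIELDS = {"name", "sex", "age", "date_of_birth"}

def blind_record(record: dict, placeholder: str = "REDACTED") -> dict:
    """Mask the protected fields so only job-relevant attributes remain visible."""
    return {
        key: (placeholder if key in PROTECTED_FIELDS else value)
        for key, value in record.items()
    }

cv = {
    "name": "Example Applicant",
    "sex": "F",
    "age": 52,
    "years_experience": 12,
    "skills": ["python", "sql", "airflow"],
}

# Only experience and skills remain visible to the screening step.
print(blind_record(cv))
```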
[00:45:59] Unknown:
Are there any other aspects of data protection and the regulatory frameworks that we should be considering or ways to keep up to date with the regulations that are present and new ones that might come out that we didn't discuss yet that you think we should cover before we close out the show?
[00:46:17] Unknown:
Well, it's definitely a fast moving world. Data protection's been around for a while, but the data is moving faster and the issues are moving quicker. You can listen to podcasts on data protection if that's your thing, you can subscribe to feeds and so on, and you can attend conferences. I would say artificial intelligence is gonna be a big one going forward, and if you're interested in programmatic advertising, that's another big one; there's a lot of movement on that coming up soon. I think those are probably the main ones I would personally call out at the moment. I don't know if you've got any other views, Karen.
[00:47:02] Unknown:
Well, what I think would be useful would be the opportunity for data protection to be talked about perhaps more regularly at some of the engineering or technology conferences that happen. I remember listening to one of your other episodes where they were talking about Data Council, which is a conference for engineers and developers, quite cutting edge. I had a look at the conference online and I didn't see any topic that covered data protection. So, you know, back to my point about giving employees and engineers and analysts and scientists the knowledge to allow them to do the right thing.
If data protection could be brought into more of the syllabus, perhaps, for those technical conferences, I think that would be really helpful. That's part of what privacy by design is about: bringing in privacy and helping them understand privacy by design, getting it right at the beginning. I think that would be really helpful. It's hard to keep up to date with everything, I have to say.
[00:48:12] Unknown:
Well, for anybody who wants to follow along with the work that you each are doing and get in touch, I'll have you add your preferred contact information to the show notes. And as a final question, I would like to get each of your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:48:29] Unknown:
Well, I will answer with a non-tool thing. My view is that it's not a tooling issue; it's a cultural and leadership issue. Once the culture and the leadership in the organization are aligned, then the right things start to happen and actually flow out of it. And I'll hand over to Karen for a more tool-based answer. I do have a tool-based
[00:48:57] Unknown:
gap, actually. Subject access requests, which Mark discussed and we talked about earlier, are very time consuming for almost every organization. Larger companies can invest in some of the sophisticated tools. What I'm not seeing at this point is a subject access automation tool that's accessible to organizations other than just the big ones, something in a lower price bracket compared to, say, a big enterprise-wide privacy management system. So that would be the gap from my perspective. But certainly the privacy tech market is a really interesting market. There are lots of solutions and tools out there.
There are over 275 vendors in the market now, and there's $500,000,000 a year, and growing, being invested in tech startups in the privacy space. So there's a lot of technology out there in the market to help different organizations.
[00:50:11] Unknown:
Well, thank you both for taking the time today to join me and share your expertise and understanding of the data protection space. It's definitely something that, as you said, needs to be discussed more broadly and more widely understood. So thank you for all the efforts on that front, and I hope you each enjoy the rest of your day. Thank you very much, Tobias. It's been a pleasure. Thank you. Thanks,
[00:50:32] Unknown:
Tobias.
[00:50:37] Unknown:
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it: email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction to Data Protection with Karen Heaton and Mark Sherwood Edwards
Understanding Data Protection Regulations
Organizational Challenges in Implementing Data Protection
Data Provenance and Compliance
Right to Be Forgotten and Data Deletion
Downstream Data Use and Machine Learning
Future Trends and Keeping Up with Data Protection