Summary
The practice of data management requires technical acumen, but there are also many policy and regulatory issues that inform and influence the design of our systems. With the introduction of legal frameworks such as the EU GDPR and California’s CCPA it is necessary to consider how to implement data protection and data privacy principles in the technical and policy controls that govern our data platforms. In this episode Karen Heaton and Mark Sherwood-Edwards share their experience and expertise in helping organizations achieve compliance. Even if you aren’t subject to specific rules regarding data protection, it is definitely worth listening to get an overview of what you should be thinking about while building and running data pipelines.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- This week’s episode is also sponsored by Datacoral, an AWS-native, serverless, data infrastructure that installs in your VPC. Datacoral helps data engineers build and manage the flow of data pipelines without having to manage any infrastructure, meaning you can spend your time invested in data transformations and business needs, rather than pipeline maintenance. Raghu Murthy, founder and CEO of Datacoral built data infrastructures at Yahoo! and Facebook, scaling from terabytes to petabytes of analytic data. He started Datacoral with the goal to make SQL the universal data programming language. Visit dataengineeringpodcast.com/datacoral today to find out more.
- Having all of your logs and event data in one place makes your life easier when something breaks, unless that something is your Elasticsearch cluster because it’s storing too much data. CHAOSSEARCH frees you from having to worry about data retention, unexpected failures, and expanding operating costs. They give you a fully managed service to search and analyze all of your logs in S3, entirely under your control, all for half the cost of running your own Elasticsearch cluster or using a hosted platform. Try it out for yourself at dataengineeringpodcast.com/chaossearch and don’t forget to thank them for supporting the show!
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, Alluxio, and Data Council. Upcoming events include the combined events of the Data Architecture Summit and Graphorum, the Data Orchestration Summit, and Data Council in NYC. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
- Your host is Tobias Macey and today I’m interviewing Karen Heaton and Mark Sherwood-Edwards about the idea of data protection, why you might need it, and how to include the principles in your data pipelines.
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by explaining what is encompassed by the idea of data protection?
- What regulations control the enforcement of data protection requirements, and how can we determine whether we are subject to their rules?
- What are some of the conflicts and constraints that act against our efforts to implement data protection?
- How much of data protection is handled through technical implementation as compared to organizational policies and reporting requirements?
- Can you give some examples of the types of information that are subject to data protection?
- One of the challenges in data management generally is tracking the presence and usage of any given information. What are some strategies that you have found effective for auditing the usage of protected information?
- A corollary to tracking and auditing of protected data in the GDPR is the need to allow for deletion of an individual’s information. How can we ensure effective deletion of these records when dealing with multiple storage systems?
- What are some of the system components that are most helpful in implementing and maintaining technical and policy controls for data protection?
- How do data protection regulations impact or restrict the technology choices that are viable for the data preparation layer?
- Who in the organization is responsible for the proper compliance to GDPR and other data protection regimes?
- Downstream from the storage and management platforms that we build as data engineers are data scientists and analysts who might request access to protected information. How do the regulations impact the types of analytics that they can use?
Contact Info
- Karen
- Mark
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- Data Protection
- GDPR
- This Is DPO
- Intellectual Property
- European Convention on Human Rights
- CCPA == California Consumer Privacy Act
- PII == Personally Identifiable Information
- Privacy By Design
- US Privacy Shield
- Principle of Least Privilege
- International Association of Privacy Professionals
- Data Provenance
- Chief Data Officer
- UK ICO (Information Commissioner’s Office)
- Data Council
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform. And if you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances.
Go to dataengineeringpodcast.com/linode, that's L I N O D E, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. This week's episode is also sponsored by Datacoral, an AWS-native, serverless data infrastructure that installs in your VPC. Datacoral helps data engineers build and manage the flow of data pipelines without having to manage any infrastructure, meaning you can spend your time invested in data transformations and business needs rather than pipeline maintenance. Raghu Murthy, founder and CEO of Datacoral, built data infrastructures at Yahoo and Facebook, scaling from terabytes to petabytes of analytic data.
He started Datacoral with the goal to make SQL the universal data programming language. Visit dataengineeringpodcast.com/datacoral today to find out more. And having all of your logs and event data in one place makes your life easier when something breaks, unless that something is your Elasticsearch cluster because it's storing too much data. ChaosSearch frees you from having to worry about data retention, unexpected failures, and expanding operating costs. They give you a fully managed service to search and analyze all of your logs in S3, entirely under your control, all for half the cost of running your own Elasticsearch cluster or using a hosted platform. Try it out for yourself at dataengineeringpodcast.com/chaossearch, and don't forget to thank them for supporting the show.
You listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Corinium Global Intelligence, Alluxio, and Data Council. Upcoming events include the Data Orchestration Summit and Data Council in New York City. Go to dataengineeringpodcast.com/conferences to learn more about these and other events and take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey, and today I'm interviewing Karen Heaton and Mark Sherwood-Edwards about the idea of data protection, why you might need it, and how to include the principles in your data pipelines. So, Karen, can you start by introducing yourself?
[00:03:03] Unknown:
Yes. Good afternoon. My name is Karen Heaton. I run my own data protection consultancy. I've been working in data and systems implementations in financial services for over 20 years, and I'm a real fan of technology. My consultancy now specializes in data protection compliance, mainly for small and medium sized organizations, including tech startups.
[00:03:27] Unknown:
And, Mark, can you introduce yourself as well? Sure. My name is Mark Sherwood-Edwards. I'm a tech lawyer by background, and I now do a lot of work in the privacy space through a privacy business called This Is DPO (thisisdpo.co.uk). Like Karen, I'm based in London, UK, and we do a lot of work, sometimes together, around the GDPR. My background, as I said, is as a lawyer, primarily in technology,
[00:03:55] Unknown:
outsourcing, intellectual property, those kind of areas. And going back to you, Karen, do you remember how you first got involved in the area of data management?
[00:04:03] Unknown:
Well, yes. Ever since I first started working, which was quite a long time ago, I've always been involved in systems and data, whether I've been working in operations for banks and other types of companies or working for software companies, helping them implement their software solutions in their clients' organizations. So data and systems have been a very large part of my life. And now, with the data protection regulations coming in, these elements are marrying together into a very interesting
[00:04:41] Unknown:
topic area. So that's many years of experience in this area. And, Mark, do you remember how you got involved in data management? Yeah. Well, probably not quite as directly as Karen's,
[00:04:51] Unknown:
but I've been in-house counsel for a number of outsourcing companies over the years. Some have been doing a lot of HR, some a lot of procurement, all of it involving a fair amount of data. So I've always been quite interested in it, and in the interrelationship between data and intellectual property, whether you can or can't own data. It's been an interest of mine for quite a number of years now. It's been interesting because
[00:05:22] Unknown:
Mark and I often talk about how, given our ages, we've enjoyed watching technology grow from the days of very large mainframe computers and punch cards all the way up to what we have today, with the plethora of apps and now, you know, artificial intelligence and machine learning. So we both feel very privileged, actually, that we've been able to watch the growth of systems and data over the last 30 plus years.
[00:05:52] Unknown:
And as you mentioned, in recent years there have been some regulations coming out that cover this idea of data protection. Can you start by explaining what is encompassed by data protection, particularly from the perspective of
[00:06:07] Unknown:
data storage systems and data management? Okay. Well, I'll kick off on that one. So when people talk about data protection, they're talking essentially about personal data: the protection of personal data and the regulation of personal data. Now, most people nowadays are aware of the GDPR, which came out in the European Union last year but which has impact outside the European Union as well; we'll talk about that a bit later. But the GDPR wasn't particularly new, at least in Europe. Data protection has been around for 20 or 30 years, and it really starts with human rights. In fact, Article 8 of the European Convention on Human Rights says everyone has the right to respect for his private and family life, his home, and his correspondence.
And that, the respect for your private life, with the data you use and create being seen as part of your private life, is where it all starts from. Interestingly, I took the opportunity to look at some corresponding US documents, and the US constitution, the one that starts off "We the people", as you know, Tobias, goes on: "We the people of the United States, in order to form a more perfect union, establish justice, insure domestic tranquility", and that "insure domestic tranquility" is possibly, and I'm not a US constitutional lawyer, you'll appreciate, the same kind of concept of how one protects private life. So the general concept has been around for a long time, and it is this: it's fine for companies to take and use other people's, or their customers', personal data, which they may collect because people open an account, or it's a bank, or you may be shopping on Amazon or whatever, provided they do that lawfully, transparently, and in a fair way. And, essentially, that means if you're disclosing your personal data to someone else or to a company, they've got to be very explicit about what they're gonna do with it and not do things which they haven't disclosed ahead of time.
And if that's the case, broadly, then what they do is gonna be lawful and everybody's gonna be happy with it. Another way of looking at it is as a trust-based thing. We, consumers, are trusting companies with our personal data, and having entrusted them with that data, we are expecting them to respect that trust by acting lawfully, transparently, and so on. So, essentially, data protection is a codification of a trust-based principle. That's the high level view, and then you can dig down to various low level views as we progress.
[00:08:58] Unknown:
And in terms of the actual specific regulations, as you said, it's a concept that's been around for quite a while. But in recent years, we've been more explicitly codifying it in the GDPR and then recently in California with the CCPA. I'm wondering if you can just discuss a bit of the scope of those regulations and how an organization or an individual can best determine whether or not they're subject to those rules and what particular information is encompassed by that. Okay. Well,
[00:09:30] Unknown:
both the GDPR and the CCPA are essentially rules about personal data and the use of people's personal data. So to that extent they're very similar, and they have a lot of concepts in common. Some of the concepts apply in both the GDPR and the CCPA. If you're gonna use other people's personal data, your consumers' personal data, you've got to tell them that's what you're doing. The generic privacy notice says, yeah, we're using your personal data, these are the categories of data we're gonna use, and these are the purposes we're gonna use them for. And that applies across both the CCPA and the GDPR. Both equally have a right for the relevant consumer to get access to their data, i.e. a copy of the data that's been held about them, and both have the right for consumers to get their data deleted. Now, then the differences start to come. The GDPR is the product of longer thinking: there was a preceding regulation from 1998 and a longer gestation period, whereas the CCPA was put together in a bit of a rush to meet a deadline.
So some of the fundamental differences: you get concepts in the GDPR like lawful basis. There's a finite list; there are, I think, six lawful bases under the GDPR. And if you're processing personal data (personal data means more or less the same thing in both regimes; the US tends to call it PII, personally identifiable information, where Europe calls it personal data), they're very broad categories that include your name, your email address, your phone number, but also online identifiers, things like that. Anything which can be traced back to you on a kind of one-to-one basis. So both have that in common. The GDPR has this notion of the lawful basis: you have to have a lawful basis to handle personal data, and you have to explain which lawful basis you're using. For example, have you got consent?
If you don't have consent, is it pursuant to contract? If it's not that, is it pursuant to legitimate interests, that kind of stuff? The GDPR also has some more fundamental concepts kicking around in it, things like privacy by design and data minimization. Privacy by design means kind of what it says. The general thinking is, if you've got any system, by which I mean not just hardware and software but the people around the system, and you'll be handling personal data, processing personal data, then you need to have thought out how you're gonna build in privacy requirements from the start. It's no longer okay to just try to retrofit it afterwards. Data minimization means a couple of things.
What's the minimum amount of data that you need to accomplish the job, not what's the maximum amount of data? And then once you have that data, when you use it in your business, use the minimum amount for each particular job, not sharing data around like confetti. So those are some structural, fundamental things which are explicitly called out in the GDPR but not explicitly called out in the CCPA. The CCPA's approach is much more an opt-out approach. Most things are permitted, but the primary thing is, and it applies to Californian residents and most companies dealing with Californian residents, if it applies to you, you have the right to opt out of what they call selling of data.
So you can object, you can say you can't sell my data, and selling is effectively a broad term for any kind of sharing, licensing, and so on of data. So they're similar in some ways. They've got some fundamental underpinnings, tectonic plates, which are different, but that end up driving the same kind of behavior for businesses over time, which is
[00:13:51] Unknown:
be more careful with personal data, essentially. And I think, Mark, Tobias also asked about the territorial reach of the GDPR.
[00:14:01] Unknown:
Yeah. Some of the ways that you can determine whether or not the data that you're dealing with is actually subject to these regulations. And I think that the blanket approach that a lot of companies are taking is that it's too hard to identify at a granular level whether or not somebody is a European citizen or isn't or is in some way related to the European Union or California, and so they just apply the same sets of principles in a blanket sense. And I'm wondering what your thoughts are on some of the sort of best strategies to approach the regulatory environment that we're in now.
[00:14:35] Unknown:
So yeah. I'm sure most people have seen evidence of big American companies in particular undertaking compliance programs, signing up to the US Privacy Shield to show that they have adequate data protection standards, and even doing things like updating terms and conditions and privacy notices. Mailchimp, for example, have introduced a double opt-in function into their platform so that consent is up to the GDPR standard for any of their clients in the EU. So if you don't want to adopt a blanket approach, it comes down to truly being able to understand in a granular way what data you have in your datasets, how you acquired it, what you're gonna use it for, whether your organization actually does sell products and services into the EU, in which case it will definitely be subject to the regulations, or even whether it has websites which people based in the EU can access. And if those websites are running lots of cookies, plugins, pixels, etcetera, and those cookie items are collecting data of individuals based in the EU, then they are also gonna be subject to the rules of the GDPR. So I understand the approach of just assuming everyone is subject to the regulations and then applying the blanket approach. But there are definitely ways of properly analyzing your business, asking more and better questions about the datasets that you have and how you acquired them, and then taking a view based on that. But I agree it's difficult if you've got very large sets of data and perhaps not a lot of information around the background to those datasets.
[00:16:45] Unknown:
And then from the organizational and technical perspective, what are some of the conflicts or constraints that act against some of the efforts that they might try to put in place to implement data protection, whether it's because the technical systems design that they have doesn't really allow for, proper segregation or tracking, or whether it's a matter of policy as far as helping the different people within the organization understand the importance of these different regulations and their enforcement?
[00:17:16] Unknown:
Yeah, it's a really good question, actually, Tobias, because there are a number of reasons, I think, that we see conflicts and constraints. The biggest one, which Mark and I both think is first and foremost, is that you need management buy-in on the need to understand the regulations and implement appropriate standards in your organization in order to be compliant with them. This goes back to the codification of trust that Mark talked about earlier on. It really is a trust journey, not only with your customers but also with your employees. I think lots of data scientists and data analysts want to do the right thing. But if organizations don't train them in what the right thing looks like, or if organizations don't give them the tools to do their job in a way that is compliant and meets the standards, it's hard for them to do the right thing in the way that they would like to. So management buy-in is really important.
And also, you know, it's an investment in your business. Sometimes, going through a compliance program, you can identify cost saving measures in your business. If you do a proper data audit and data discovery and you map out your records of processing activities, you might find activities going on in your business that are unnecessary. You might have systems that don't need to be used or paid for. So there can be benefits gained from these compliance programs aside from the obvious one of being compliant. And also, sales and marketing departments are often in conflict with some operational departments or compliance departments because they've got different goals.
Obviously, the sales and marketing teams want to generate leads and get money in the door, but it has to be in a lawful and compliant way. I think one of the areas where you kind of see differences is if you think
[00:19:19] Unknown:
about how most IT departments are concerned with the security of the data, all the usual things on security, which I think of as protecting the perimeter. Then you can think about the kind of compliance that sits within the perimeter. So, for example, if too many people have access to personal data, if no one is monitoring who has access, if people aren't applying least privilege within the organization, then that starts giving you a data protection compliance problem. In fact, there was a good example in Portugal the other day, which is obviously in the European Union. There was a Portuguese hospital which had something like 50 doctors, and there were 200 people in the organization who had doctor-level access to patients' records.
So that gives you a good example. That's not a hardening issue; it's not that the server hasn't been correctly hardened. Everything's working like it should, everything's been patched correctly. It's just that there's a laxity at the business level. No one's really thinking it through: we've been trusted to hold on to this data and protect it properly, but actually that would mean revising
[00:20:34] Unknown:
or checking, each time, who has access to what data. And, you know, I know we were talking about conflicts and constraints around the efforts to implement data protection in organizations, but it's also important for organizations to realize the role of supplier due diligence. I'm sure many of your listeners will have gone through an RFI process or had to fill in a due diligence questionnaire for a large client. GDPR and data protection compliance is often a big part of that due diligence process, and you may not actually get the business unless you have certain standards and procedures in place.
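As an editorial aside for listeners who want to act on the least privilege point above, a periodic access review can start as a very small report. Below is a minimal sketch in Python; the grants, role names, and approved-holder list are entirely hypothetical and would come from your own systems.

```python
from collections import defaultdict

# Hypothetical export of access grants: (user, system, role)
grants = [
    ("alice", "patient_records", "doctor"),
    ("bob", "patient_records", "doctor"),
    ("carol", "patient_records", "doctor"),
    ("dave", "patient_records", "billing"),
]

# Roles that confer access to personal data, and who legitimately needs them.
SENSITIVE_ROLES = {"doctor"}
APPROVED_HOLDERS = {"alice", "bob"}  # e.g. the clinicians actually on staff

def review_access(grants):
    """Flag users holding a sensitive role without a recorded legitimate need."""
    holders = defaultdict(set)
    for user, system, role in grants:
        if role in SENSITIVE_ROLES:
            holders[system].add(user)
    return {
        system: sorted(users - APPROVED_HOLDERS)
        for system, users in holders.items()
        if users - APPROVED_HOLDERS
    }

if __name__ == "__main__":
    for system, users in review_access(grants).items():
        print(f"{system}: review access for {', '.join(users)}")
```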
[00:21:14] Unknown:
Another thing too is that the initial era of big data was just capture everything because you never know when you might need it. And a lot of the current trends are pushing in the opposite direction of don't collect it unless you know that you need it. So it's interesting to see how that has been manifesting in the industry in terms of the technology choices that people make, as well as the conversations that people are having as far as how to approach analytics
[00:21:40] Unknown:
and data collection and data management. I think it's great that the conversations are being had and that the trends are changing. I think that shows a big awareness now of data protection in and of itself.
[00:21:52] Unknown:
And then one of the big challenges in these technical implementations and system designs is understanding what data you have, where you have it, and who's accessing it. So I'm wondering if you can talk through some of the challenges that you've seen people go through and some of the solutions that they've come up with to approach that idea of just understanding what data exists and how to properly
[00:22:20] Unknown:
maintain and secure and protect it? Yeah. I mean, that is a great point, actually, because understanding what data you've got, in which systems, where it came from, who has access to it, and where it's stored is a foundational step on your data protection compliance journey. Until you've done that exercise, it's actually quite difficult to do things like build out accurate privacy notices or put decent processes and procedures in place, including things like data handling. My experience with my clients is that it's one of the hardest exercises for them to do if they haven't previously done it. If you're a smallish organization, you can take a super simple approach like capturing it in an Excel spreadsheet, but even that is a very time consuming exercise.
Luckily, there are a number of different tools on the market that can be used. There are data mapping tools, things that can map out your data flows; you can enter which systems your data is stored in, who the processors are, and where they're located. So there are tools out there that allow you to capture all that data audit or data inventory information, and that is something I would fully recommend an organization invest in, because once you've done it, it's just a case of maintaining it. So that's the first thing that I would suggest on that one.
[00:23:50] Unknown:
Yeah, it's interesting. I've also done it in some low tech ways. I had a company ask me, where I act as the DPO, to help them get from where they were, which wasn't very good, frankly, to a good data protection regime. And I bought a large roll of wallpaper lining paper (I know, it sounds very cheap, but there's a lot of it), sliced it up, and stuck it up on the wall where we had a kind of war room. And we worked it up. We knew roughly what kinds of data were coming in, so you could do a kind of data in: okay, that's that, data comes in.
Data's held: okay, that's that section. Data out: so you have a kind of life cycle of the data, and you can work through all of that. And then you could get scribbled bits on your post-it notes and stick them up. It's very analog, very old school, but you could get a lot of people involved. It can't just be the IT guys, it can't just be the data guys, it can't just be the compliance people. It has to be everybody in there, talking about it, sticking their bits of information up. So that worked quite well. One of the interesting outcomes is that we went into it with the company thinking it held 2,000,000 records covering 2,000,000 people. By the time we finished, we realized they held records covering 30,000,000 people, which is a bit of a discrepancy, but no one had really thought about it, right? Everyone was just doing their normal job, working quite hard. But in their silos. In their silos. Yeah.
And one of the things you come to realize is that this is not a silo exercise.
[00:25:29] Unknown:
You have to remove the silos to get it to work well. It needs involvement across every department within the organization. It's not just an IT project, which is perhaps how data protection was seen before. I think most people would have linked data protection to data security, but it's more than that; data privacy is the big other side of data protection. On the topic of tools that are available, your listeners might be interested in the International Association of Privacy Professionals. Every year they produce a privacy technology report, and they list out all the vendors in the privacy tech space.
And there's some really interesting information in that report, and I can certainly send you a link to that afterwards, Tobias.
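For anyone starting with the simple spreadsheet approach Karen mentions before buying a dedicated data mapping tool, a record of processing activities can begin as a small structured file. A minimal sketch, where the systems, fields, and vendor names are purely illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class ProcessingActivity:
    """One row of a simple data inventory / record of processing activities."""
    name: str                                        # what the processing is
    system: str                                      # where the data lives
    categories: list = field(default_factory=list)   # kinds of personal data held
    purpose: str = ""                                # why it is collected
    lawful_basis: str = ""                           # e.g. consent, contract
    processors: list = field(default_factory=list)   # third parties involved
    retention: str = ""                              # how long it is kept

inventory = [
    ProcessingActivity(
        name="newsletter signups",
        system="mailing_list_saas",
        categories=["name", "email"],
        purpose="sending the weekly newsletter",
        lawful_basis="consent",
        processors=["Example Email Vendor"],
        retention="until unsubscribe",
    ),
]

# Even a flat export like this answers the foundational questions:
# what do we hold, where, why, on what basis, and who else touches it.
for activity in inventory:
    print(f"{activity.system}: {activity.categories} for {activity.purpose}")
```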
[00:26:23] Unknown:
Yeah. I'll definitely be interested to take a look at that and see what types of systems they've got in their purview. And another thing that plays in to the idea of data protection and identifying and auditing the data flows is another big challenge in the data management space of data provenance, which covers everything from what you were describing of data in up through data out and into the point of doing analytics or machine learning to figure out what records are actually being used, what attributes of those records are being used within those machine learning workflows to be able to make sure that you're not inadvertently exposing information or inadvertently including information that's not actually necessary for the conclusions that you're trying to derive.
[00:27:09] Unknown:
Exactly. Data provenance is hugely important. One example: if you're part of a project team and you're given a large dataset and you have to go and tidy it up, do the data preparation on it, for example, who is going to be asking the questions around why we've got this data and what we're allowed to do with it? At what point in the life cycle of obtaining, preparing, storing, securing, and then using that data are those important questions gonna be asked?
[00:27:44] Unknown:
It is interesting. One of the things I've seen done, which I think works well for companies that are taking big datasets in and out, is applying something a bit like customs: when you arrive at the airport, you've got to go through the customs check, someone checks your passport before you go through. One of the things I've done in some of these engagements is put a similar kind of thing in place for data. So you can't import any data into the system without some kind of analysis of what the data is, where it's coming from, what rights are attached, and so on, and then there is some mechanism within the company which allows that data to come in. It doesn't come in automatically. And the same thing applies when you send data out.
And in fact, that involved big movements of data. One thing we were discussing (this was a company doing a lot of data processing, so it might hold financial data, fairly sensitive data on a number of people) is that you might send data out to a third party to take a look at, and it would be an attachment in an email. Of course, you don't quite know where the attachment's gonna end up, so we started to change that: you never send an attachment, you send a link to a data room. Then, when the person you sent it to comes and looks at the data, it doesn't leave the location, in theory, and at least there's an audit trail of how the data is getting accessed.
It's that kind of thinking.
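Here is a sketch of that customs-style gate and audit trail, assuming hypothetical field names, an in-memory log, and an invented internal data room URL; a real implementation would sit in front of your ingestion and sharing workflows.

```python
import datetime
import uuid

AUDIT_LOG = []  # stand-in for an append-only audit store

def log_event(action, **details):
    """Record every data movement so there is always an audit trail."""
    AUDIT_LOG.append({
        "id": str(uuid.uuid4()),
        "at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "action": action,
        **details,
    })

def import_dataset(name, source, rights, contains_personal_data, approved_by=None):
    """Gate inbound data: nothing comes in without provenance and sign-off."""
    if contains_personal_data and approved_by is None:
        log_event("import_rejected", dataset=name, source=source)
        raise PermissionError(f"{name}: personal data requires a named approver")
    log_event("import_approved", dataset=name, source=source,
              rights=rights, approved_by=approved_by)
    return True

def share_dataset(name, recipient):
    """Outbound sharing: issue a link to a controlled data room, never a copy."""
    link = f"https://dataroom.example.internal/{uuid.uuid4()}"  # hypothetical location
    log_event("share_link_issued", dataset=name, recipient=recipient, link=link)
    return link

if __name__ == "__main__":
    import_dataset("q3_transactions", source="partner_sftp", rights="contractual",
                   contains_personal_data=True, approved_by="data.protection.lead")
    print(share_dataset("q3_transactions", recipient="external_auditor"))
    print(f"{len(AUDIT_LOG)} audit events recorded")
```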
[00:29:20] Unknown:
And then another question that came up when you were discussing bringing everybody into the same room to map out all the different ways that the data is being used and discovering that they actually had, you know, several multiples more records that they needed to be concerned about than what they had originally thought is the idea of who's actually responsible for making sure that an organization is considering all of the different implications and ensuring that the company is appropriately in compliance with the different regulations and even identifying what regulations they're subject to. And that's
[00:29:57] Unknown:
that's an interesting one to talk about because, obviously, ultimately, the board is responsible in a large organization, but responsibilities need to be shared and appropriately assigned. So I've seen setups where you start from a basis of having a system owner for each of the systems that you've got, and they would then be responsible for the data within those systems. And then, if an organization is large enough, they might have a privacy team, for example, and the system slash data owners would have a dialogue with the privacy team. So it is really important to get back to the foundational step, which is your data audit and your data inventory, understanding what you've got, in order then to be able to appropriately assign the responsibility for certain aspects of data protection within the organization.
Larger companies could also have chief data officers; I've seen that used quite a lot in some of the big banks. And even in tech startups, for example: just because they're a startup and they're small, they might still have hundreds of millions of records. They still need somebody in their organization who's responsible for data protection, and if they don't have the skills, they should go externally and source those skills.
[00:31:34] Unknown:
Yeah, you should have some kind of data operating model, so you know how it works, and you should have some kind of monthly governance. Every month the different disciplines get together, and you have a standing agenda to work through. And things like vendors, right, suppliers: if you outsource some of your processing to someone else, you're still responsible for that data even though someone else is processing it for you. So who's checking up on the vendors to make sure they're doing what they should be doing, that they've got good security and all that kind of stuff? So although one person in the end will probably be responsible in the hierarchy, as an operational matter the execution tends to be distributed, with different people bringing their different angles in. And if they don't have a regular spot at which to meet and discuss issues, it's more likely that issues will get missed.
[00:32:34] Unknown:
And it all starts from training and awareness. As I think I mentioned earlier, you've gotta give your employees the chance to be able to do the right thing, and it's only fair to them that they get the training to understand what it is they need to be doing. And then, as a corollary to the idea that we were discussing earlier of
[00:32:53] Unknown:
tracking and auditing the information that we're storing and using, in the GDPR at least there is the right to be forgotten clause, where a company needs to be able to thoroughly delete information pertaining to a given individual. That can be quite complicated, especially when we're dealing with complex systems with multiple different storage layers or multiple different pipelines that are replicating bits and pieces of information throughout. So I'm curious what you have seen as far as challenges at the technical and organizational level, and some of the strategies and technologies that have been found useful for being able to follow that regulation.
[00:33:34] Unknown:
Yeah. Good question. Interestingly, in the GDPR we've got, I think, six or seven data subject rights; consumers are technically known as data subjects. The right to erasure, the right to be forgotten, the right to get access to your data, the right of correction... I can't remember them all, but it's either six or seven. Now, although the right to be forgotten grabs the headlines, the one that's mainly exercised is the data subject access right, the right to get a copy of the data held about you. There's access in the GDPR, there's access in the CCPA, and you get the same issue, right, multiple systems and so on. Talking about the subject access right to begin with, it's not that you have to go to the ends of the earth to produce everything, but you've got to make a reasonable effort.
Clearly, if you've got a coherent system with everything in the right place, it's easy to do. If you're straddling six or seven legacy systems, it gets much more complicated, which is a good reason to get rid of data you don't need, to be frank, because then you don't have to report on it. In terms of the right to be forgotten, the right of deletion, it's not an unfettered right; it's not unqualified. So if I've got a contract with you and you're still a customer, you can't just say, delete my data. Well, I can't do that; I've got a contract with you. Even if the contract's over, there are reasons you can refuse. For example, you're allowed to hold on to the data if you think that you might need it in contemplation of a legal defense at some point down the line. So it's not as unlimited as you might think.
Now, the UK has always been a bit more business friendly about this kind of stuff than other bits of Europe, and we're still in Europe for the time being. It's always been accepted that you could hash some of the information and that would basically count as deletion. The UK used to have this thing called putting data beyond use, where for some reason you might not be able to delete all the data, but you could park it somewhere where it was not accessible, or not easily accessible, to the business, you know, requiring two or three sign-offs and being difficult to access, and you can take those kinds of protections.
Despite its exciting name, the right to be forgotten doesn't come up that often in most businesses as a practical matter. It's not an unqualified right, and it causes fewer issues than you might expect.
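One way to read the hashing approach Mark mentions is as keyed pseudonymization: direct identifiers are replaced with one-way hashes, and the key is held separately so that it can itself be destroyed. A minimal sketch with hypothetical field names; whether this counts as erasure or putting beyond use in your situation is a legal judgment, not a technical one.

```python
import hashlib
import hmac

# The key would live outside the dataset (e.g. a secrets manager);
# destroying it makes the pseudonyms irreversible in practice.
PEPPER = b"replace-with-a-secret-held-outside-the-dataset"

def pseudonymize(value: str) -> str:
    """Replace a direct identifier with a keyed, one-way hash."""
    return hmac.new(PEPPER, value.encode("utf-8"), hashlib.sha256).hexdigest()

def put_beyond_use(record: dict, identifier_fields=("name", "email", "phone")) -> dict:
    """Hash the fields that tie a record back to an individual, keep the rest."""
    cleaned = dict(record)
    for key in identifier_fields:
        if key in cleaned:
            cleaned[key] = pseudonymize(str(cleaned[key]))
    return cleaned

record = {"name": "Jane Doe", "email": "jane@example.com", "order_total": 42.50}
print(put_beyond_use(record))  # order_total survives; identifiers do not
```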
[00:36:24] Unknown:
So, just to follow on from what Mark was saying there: the challenge with those requests, whether they're deletion requests or subject access requests, comes in a complex ecosystem where you have a number of potentially interconnected systems. The first step in overcoming some of those challenges is, again, the foundational step we've already talked about, which is understanding what data you've got, where, why you've got it, etcetera. Once you've done that, there are a number of tools on the market. Coming back to the privacy tech sector, there are data discovery tools that can assist with working out what data you've got where.
I've also seen some tools that can bring you a single view of an individual and allow you to perform deletion requests in a much quicker and more automated way. So, yes, there are solutions on the market to assist with some of the complex and time consuming requests that you may get. But it's always important to do two things. One is to understand whether the request is one that is valid under the regulations. And two, to have done your homework on your systems and your processes, so you understand what's where and how much of a task this is that you're gonna have to undertake.
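A single view of the individual, as Karen describes it, usually sits on top of the data inventory: each registered system exposes a way to find and erase a subject's records, and a coordinator fans the request out and keeps a report. A rough sketch with hypothetical in-memory adapters standing in for real datastores:

```python
class SystemAdapter:
    """One adapter per storage system listed in the data inventory (hypothetical)."""

    def __init__(self, name, records):
        self.name = name
        self.records = records  # stand-in for a real datastore

    def find(self, subject_id):
        return [r for r in self.records if r.get("subject_id") == subject_id]

    def erase(self, subject_id):
        before = len(self.records)
        self.records = [r for r in self.records if r.get("subject_id") != subject_id]
        return before - len(self.records)

def handle_erasure_request(subject_id, systems, request_is_valid=True):
    """Fan a validated erasure request out to every registered system."""
    if not request_is_valid:  # e.g. an active contract may be grounds to refuse
        return {"status": "refused"}
    return {
        "status": "done",
        "deleted_per_system": {s.name: s.erase(subject_id) for s in systems},
    }

systems = [
    SystemAdapter("crm", [{"subject_id": "u1", "email": "jane@example.com"}]),
    SystemAdapter("analytics", [{"subject_id": "u1", "event": "login"}]),
]
print(handle_erasure_request("u1", systems))
```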
[00:37:53] Unknown:
And then another layer where this manifests, particularly in terms of updating data or having a customer elide bits of information from their records, is how the data is being used in downstream use cases, whether that's business analytics or some sort of machine learning on aggregate data. How does that play into the need to regenerate a model after it's gone through a training regimen once the data is updated, or into which particular attributes of a record are being used within those analytics? And what technologies and techniques are viable for remaining within compliance of these regulations, especially as far as some of the explainability
[00:38:39] Unknown:
requirements that come up? So, yeah, these are the million dollar questions, really, that we're getting into now. Definitely, once we start to go downstream from the data collection and the scientists and the analysts are starting to run searches across the data, etcetera, it still comes back to the organization's responsibility and requirement to provide the scientists and analysts with a decent, clean set of properly obtained, lawful data. Then, if data scientists or analysts want to access data within that dataset that could be protected, there should be appropriate controls or tags or logging or audits that the scientists and the analysts can see and be aware of when they come to do the projects that they've been assigned.
It's also possible to embed the management of that at the beginning of a project. If scientists or analysts, or even the project managers running a project that involves them, look at what the project aims to achieve, and the intended outcomes are ones that might result in the creation of profiled datasets upon which decisions will be made, then at the beginning of the project they should do some sort of privacy impact assessment of what data they're going to use, whether they're using it in a lawful way, and what safeguards they'll put in place for the results that are generated from that particular project.
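The controls and tags Karen refers to can start as simple column-level metadata that the analytics tooling consults before handing data to a project. A small sketch; the tags, column names, and purpose flag are made up for illustration.

```python
import pandas as pd

# Hypothetical column-level tags maintained alongside the dataset's schema.
COLUMN_TAGS = {
    "email": {"personal_data"},
    "date_of_birth": {"personal_data", "special_care"},
    "purchase_total": set(),
}

def columns_allowed(purpose_allows_personal_data: bool):
    """Return the columns a project may see, given its assessed purpose."""
    return [
        col for col, tags in COLUMN_TAGS.items()
        if purpose_allows_personal_data or "personal_data" not in tags
    ]

df = pd.DataFrame([
    {"email": "jane@example.com", "date_of_birth": "1990-01-01", "purchase_total": 42.5},
    {"email": "joe@example.com", "date_of_birth": "1985-06-15", "purchase_total": 17.0},
])

# A project whose impact assessment did not justify personal data
# only ever sees the untagged columns.
print(df[columns_allowed(purpose_allows_personal_data=False)])
```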
[00:40:39] Unknown:
So, yeah, looked at another way: go back to the beginning, to the point we were making about this being about trust. You've got your dataset, so the question is, what have you disclosed and consented to, and what are the reasonable expectations of the consumers? Are you acting within that? Now, if you want to go beyond what was disclosed, under the GDPR you can do that provided it's kind of akin, in the same ballpark, but you typically need to inform the data subjects that that's what you're gonna do.
So that's an initial constraint. Now, of course, the GDPR applies to personal data. If you anonymize personal data, and anonymization is a sliding scale as we all know, then if you sufficiently anonymize the data it stops being personal data and you're not regulated by the GDPR anymore. You can do what you want with it, provided someone can't come back and reverse it back to the original people. But it comes back to thinking about what you need to begin with, thinking it through, planning it through. And then the GDPR has a thing about automated decision making and profiling. It's not forbidden. You're allowed to have profiling, you're allowed to have automated decision making, you're allowed to have automated processing, provided you've made the correct disclosures.
The issue is where you have automated decisions taken without any human involvement. If you're doing that, you have to disclose it, and the person about whom the decision is taken has the right to object and the right to an explanation of how the decision was taken. And then that brings us into the things you were talking about, interpretability. So if you've got some machine learning which is pumping out decisions and you can't explain how a decision was reached, that starts to give you a bit of a problem, both in the GDPR sense, but also in the more practical sense: is there some bias against a particular kind of person?
That's happened in the past, even without machine learning. One of the concerns is that machine learning will have inbuilt biases, which is odd, because we know that humans definitely have inbuilt biases. So those are the kind of new, cutting edge issues that people are wrestling with, and there's a lot of thinking and discussion and guidance about it. Part of what I'm hoping to see is AI which validates AI: you have a test dataset, you run it through, and if it produces the right output each time, you know it's working, something like that. Does that help? Yeah. That's definitely useful. And
[00:43:51] Unknown:
some of the topic of bias also comes back to the data engineering layer and data collection, as far as trying to identify potential biases that exist in the datasets and then either seeking additional or alternative sources of information to complement them, or at least annotating the dataset to say this is a potential source of bias, so that the people performing the analysis are aware of it and can try to counteract it in the algorithms that they apply to the data. But as you said, we're all human.
[00:44:29] Unknown:
All the computer algorithms that we use are written by humans, and so there's no way to completely divorce ourselves from bias, but we can at least try to identify and account for it. Agreed. And there are some examples where it's actually quite easy to use a machine to remove bias. So if you've got a lot of CVs coming in, you can take out the references to sex, you can take out all the age references, you might take out all the unusual foreign-sounding names and replace them all with Smith, you know, if that's the kind of background you're from. And so there are ways in which machines will actually help counter human bias as well.
In fact, we'll send you the link: there's some very interesting work done by the UK ICO on this kind of stuff. They've got, I think it's called, I'm looking at Karen, an AI audit framework, and they've done some think-tank work on what strategies and what steps you should go through to make your development of machine learning successful, in all the senses of successful. I saw a job ad the other day about who needs to be involved in machine learning, and one of the roles was a new job title I'd never heard of before: an ethicist, which is like a practical Thomas Aquinas, a bit like the man married to Madam Secretary, if you ever watched that, a professor of ethics, but with a practical application.
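The CV example Mark gives, stripping references to sex, age, and name before a person or a model sees the record, can be implemented as a pre-processing step. A small illustration with hypothetical field names; real de-biasing of free text is considerably harder than this.

```python
# Fields that could introduce bias into a screening decision (hypothetical list).
PROTECTED_FIELDS = {"name", "sex", "age", "date_of_birth"}

def blind_record(record: dict, placeholder: str = "REDACTED") -> dict:
    """Mask the protected fields so only job-relevant attributes remain visible."""
    return {
        key: (placeholder if key in PROTECTED_FIELDS else value)
        for key, value in record.items()
    }

cv = {
    "name": "Example Applicant",
    "sex": "F",
    "age": 52,
    "years_experience": 12,
    "skills": ["python", "sql", "airflow"],
}

# Only experience and skills remain visible to the screening step.
print(blind_record(cv))
```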
[00:45:59] Unknown:
Are there any other aspects of data protection and the regulatory frameworks that we should be considering or ways to keep up to date with the regulations that are present and new ones that might come out that we didn't discuss yet that you think we should cover before we close out the show?
[00:46:17] Unknown:
Well, it's definitely a fast moving world. Data protection's been around for a while, but the data is moving faster and the issues are moving quicker. You can listen to podcasts on data protection if that's your thing, you can subscribe to feeds and so on, and you can attend conferences. I would say artificial intelligence is gonna be a big one going forward, and if you're interested in programmatic advertising, that's another big one; there's a lot of movement on that coming up soon. I think those are probably the main ones I would personally call out at the moment. I don't know if you've got any other views, Karen.
[00:47:02] Unknown:
Well, what I think would be useful would be the opportunity for data protection to be talked about perhaps more regularly at some of the engineering or technology conferences that happen. I remember listening to one of your other episodes where they were talking about Data Council, which is a conference for engineers and developers, quite cutting edge. I had a look at the conference online and I didn't see any topic that covered data protection. So, you know, back to my point about giving employees and engineers and analysts and scientists the knowledge to allow them to do the right thing.
If data protection could be brought into more of the syllabus, perhaps, for those technical conferences, I think that would be really helpful. That's part of what privacy by design is about: bringing in privacy and helping them understand privacy by design, getting it right at the beginning. I think that would be really helpful. It's hard to keep up to date with everything, I have to say.
[00:48:12] Unknown:
Well, for anybody who wants to follow along with the work that you each are doing and get in touch, I'll have you add your preferred contact information to the show notes. And as a final question, I would like to get each of your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:48:29] Unknown:
Well, I will answer with a non-tool thing. My view is that it's not a tooling issue; it's a cultural and leadership issue. Once the culture and the leadership in the organization are aligned, then the right things start to happen and actually flow out of it. And I'll hand over to Karen for a more tool-based answer. I do have a tool-based
[00:48:57] Unknown:
gap, actually. Subject access requests, which Mark discussed and we talked about earlier, are very time consuming for almost every organization. Larger companies can invest in some of the sophisticated tools. What I'm not seeing at this point is a subject access automation tool that's accessible to organizations other than just the big ones, something in a lower price bracket compared to, say, a big enterprise-wide privacy management system. So that would be the gap from my perspective. But certainly the privacy tech market is a really interesting market. There are lots of solutions and tools out there.
There are over 275 vendors in the market now, and there's $500,000,000 a year, and growing, being invested in tech startups in the privacy space. So there's a lot of technology out there in the market to help different organizations.
[00:50:11] Unknown:
Well, thank you both for taking the time today to join me and share your expertise and understanding of the data protection space. It's definitely something that, as you said, needs to be discussed more broadly and more widely understood. So thank you for all the efforts on that front, and I hope you each enjoy the rest of your day. Thank you very much, Tobias. It's been a pleasure. Thank you. Thanks,
[00:50:32] Unknown:
Tobias.
[00:50:37] Unknown:
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it: email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction to Data Protection with Karen Heaton and Mark Sherwood Edwards
Understanding Data Protection Regulations
Organizational Challenges in Implementing Data Protection
Data Provenance and Compliance
Right to Be Forgotten and Data Deletion
Downstream Data Use and Machine Learning
Future Trends and Keeping Up with Data Protection