Summary
There are many dimensions to the work of protecting the privacy of users in our data. When you need to share a data set with other teams, departments, or businesses, it is of utmost importance that you eliminate or obfuscate personal information. In this episode Will Thompson explores the many ways that sensitive data can be leaked, re-identified, or otherwise put at risk, as well as the different strategies that can be employed to mitigate those attack vectors. He also explains how he and his team at Privacy Dynamics are working to make those strategies more accessible to organizations so that you can focus on all of the other tasks required of you.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Today’s episode is Sponsored by Prophecy.io – the low-code data engineering platform for the cloud. Prophecy provides an easy-to-use visual interface to design & deploy data pipelines on Apache Spark & Apache Airflow. Now all the data users can use software engineering best practices – git, tests and continuous deployment with a simple to use visual designer. How does it work? – You visually design the pipelines, and Prophecy generates clean Spark code with tests on git; then you visually schedule these pipelines on Airflow. You can observe your pipelines with built in metadata search and column level lineage. Finally, if you have existing workflows in AbInitio, Informatica or other ETL formats that you want to move to the cloud, you can import them automatically into Prophecy making them run productively on Spark. Create your free account today at dataengineeringpodcast.com/prophecy.
- The only thing worse than having bad data is not knowing that you have it. With Bigeye’s data observability platform, if there is an issue with your data or data pipelines you’ll know right away and can get it fixed before the business is impacted. Bigeye lets data teams measure, improve, and communicate the quality of your data to company stakeholders. With complete API access, a user-friendly interface, and automated yet flexible alerting, you’ve got everything you need to establish and maintain trust in your data. Go to dataengineeringpodcast.com/bigeye today to sign up and start trusting your analyses.
- Your host is Tobias Macey and today I’m interviewing Will Thompson about managing data privacy concerns for data sets used in analytics and machine learning
Interview
- Introduction
- How did you get involved in the area of data management?
- Data privacy is a multi-faceted problem domain. Can you start by enumerating the different categories of privacy concern that are involved in analytical use cases?
- Can you describe what Privacy Dynamics is and the story behind it?
- Which category or categories are you focused on addressing?
- What are some of the best practices in the definition, protection, and enforcement of data privacy policies?
- Is there a data security/privacy equivalent to the OWASP top 10?
- What are some of the techniques that are available for anonymizing data while maintaining statistical utility/significance?
- What are some of the engineering/systems capabilities that are required for data (platform) engineers to incorporate these practices in their platforms?
- What are the tradeoffs of encryption vs. obfuscation when anonymizing data?
- What are some of the types of PII that are non-obvious?
- What are the risks associated with data re-identification, and what are some of the vectors that might be exploited to achieve that?
- How can privacy risk mitigation be maintained as new data sources are introduced that might contribute to these re-identification vectors?
- Can you describe how Privacy Dynamics is implemented?
- What are the most challenging engineering problems that you are dealing with?
- How do you approach validation of a data set’s privacy?
- What have you found to be useful heuristics for identifying private data?
- What are the risks of false positives vs. false negatives?
- Can you describe what is involved in integrating the Privacy Dynamics system into an existing data platform/warehouse?
- What would be required to integrate with systems such as Presto, Clickhouse, Druid, etc.?
- What are the most interesting, innovative, or unexpected ways that you have seen Privacy Dynamics used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Privacy Dynamics?
- When is Privacy Dynamics the wrong choice?
- What do you have planned for the future of Privacy Dynamics?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
The only thing worse than having bad data is not knowing that you have it. With Bigeye's data observability platform, if there is an issue with your data or data pipelines, you'll know right away and can get it fixed before the business is impacted. Bigeye lets data teams measure, improve, and communicate the quality of your data to company stakeholders. With complete API access, a user-friendly interface, and automated yet flexible alerting, you've got everything you need to establish and maintain trust in your data. Go to dataengineeringpodcast.com/bigeye today to sign up and start trusting your analyses.
Your host is Tobias Macey. And today, I'm interviewing Will Thompson about managing data privacy concerns for datasets used in analytics and machine learning and the work that he's doing at Privacy Dynamics. So, Will, can you start by introducing yourself?
[00:01:49] Unknown:
Yeah. My name is Will Thompson. I'm the principal software engineer at Privacy Dynamics. We're a startup focused on helping people with data privacy. I started a couple years ago. I've been there on the ground floor.
[00:02:00] Unknown:
And do you remember how you first got involved in the area of data management?
[00:02:04] Unknown:
Yeah. I've been in data management really my whole career. It's quite a bit different than what I'm doing now, but it started over 15 years ago. I was working for a company called O'Connor's, and we were building a legal research platform. And so our data was more content-type data. So we were dealing with documents, XML, JSON. And all that stuff was stored in NoSQL databases. The thrust of our work was search, discovery, a lot of reporting, management of cross references, and tracking things. So a lot of parallels. There were all these data pipelines that we did, but just a completely different stack. And then O'Connor's got bought out by Thomson Reuters in 2018.
And just prior to that, I had started playing around in Python. I wanted to build an audio search engine using Python tools. And, you know, probably not the best way to get started in Python. I had 0 experience, but it was still interesting, and it was kinda 1 of those projects where I was like, well, how hard could this be? And, you know, I'm sure as you're aware, these things can be deceivingly difficult. So, you know, I played with that for a while, and I actually got to a pretty decent prototype of what I was trying to accomplish. But at that point, my boss, who I'd worked with at O'Connor's for many years, had teamed up with Graham Thompson, no relation, from Privacy Dynamics, and they pitched me on data privacy. And, you know, I saw a huge opportunity.
It seemed like a really cool thing to work on. So I jumped on board with them in in 2019, and that's really where I kind of formally began my data science focused data management part of my career.
[00:04:02] Unknown:
And so that brings us to the core of the conversation for today, which is focused on data privacy, which is a very multifaceted problem domain with lots of different directions that you can go in. And I'm wondering if you can just start by enumerating some of the broad categories of privacy concerns that are involved in managing datasets for analytical use cases and some of the technical considerations, organizational considerations, kind of data, and sort of personal information considerations that go into that problem space?
[00:04:40] Unknown:
I'll group it into, like, 3 groups. The main 1 that probably everybody thinks about when they think about privacy is, like, data access protection. And this is really the information security problem. That dovetails with all these other types of subcategories like governance, you know, tracking usage of data, kind of provenance type problems. Basically, who has access? When do they have access? Where does the data go? And then there are other related technologies around that, like data privacy vaults. These are all designed to protect who has access to the data, making sure that no unauthorized users have access, and that we have good bookkeeping on, you know, who saw it. And then there's more about protecting what additional information could be gleaned from a dataset: you wanna release specific information, and you don't wanna allow other information to be released. Typically, this is, you know, personal information.
And so the second one is the protection of aggregate data and statistics. And this is probably the 2nd most understood. This is what, like, the real use case for differential privacy is, or what most people consider differential privacy. And that is, you know, you have some summary statistic about a group of people, and you wanna make sure that it's hard to discover anything else about it. And then the last 1 is kinda where we're more involved, and that is where you're protecting an actual raw dataset. And this kind of gets into anonymization and de-identification.
[00:06:25] Unknown:
As far as those categories, you know, there are definitely the kind of data governance considerations that factor into a number of them, but there are a number of projects and products that exist for handling sort of the access control, sort of data security, system security element of it. And as you were saying, you're focused more on the privacy considerations that exist in situ in the datasets, which is something that needs to be handled in every place that the data lives. And so I'm wondering if you can talk to what it is that the privacy dynamics product is focused on and some of the story behind that and the specific category or categories that you're focused on addressing from what you were just enumerating?
[00:07:11] Unknown:
Like I mentioned, we're focused on people who need to share data for operational use, for sales. Like, the main target at first is regulated markets. They have a much more rigid outline of what is needed in order to protect the data. But then also the broader analytics-driven, kinda direct-to-consumer retail market, companies that have a lot of customer data. They wanna figure out how to build bigger, better products, figure out how to market to their customers. Those are the targets. Our founder, Graham Thompson, no relation, he was at Microsoft, and he was working in this enterprise group, and they were helping move all these enterprise customers into Azure, into their cloud.
And this is right around the time that GDPR starts going into effect. It had just gone live. And so in addition to the, like, data sharing rules that they had internally, they had all these new data sharing rules, and it was just a nightmare getting data into the systems that they wanted to move it into. And so it's this enormous pain point. So Graham saw that as a big opportunity, and I don't remember how long he was there before he eventually said, this is too big of a thing, I need to start on my own product. But he eventually left and started focusing on how we could make sharing data, or moving data even within a single organization, easier.
And so I joined him and John Craft, and, you know, we were cycling through prototypes trying to figure out what's the best way to address the problem. And I think the original thinking was, we're gonna provide, like, deep tooling for data engineers, data scientists. We're gonna give them all these tools to address privacy. And at the time, we were looking at only going into assessments. Like, how can we evaluate risk? That seemed like a big enough problem on its own. And I still think it is. But the more we worked with potential customers, you know, the more we realized they actually didn't wanna deal with it. 1 of the engineering leaders was like, look, we care about privacy, but we don't wanna deal with it. We just wanna check a box.
That kinda shifted the focus from, you know, more data science tools to, you know, we wanna build a more automated system, something that is just easy to plug in, and then it actually frees up the engineers to work on other things rather than make them better at working on the stuff that we're doing. And so, really, the problem comes down to all these privacy policies, some internal, some external due to regulation, they're all falling on these data teams. And they're being asked to anonymize data or treat it in ways where they don't necessarily have the expertise or the bandwidth on the team to, you know, work on such a hard problem.
And so it's really to solve that that need. And so that kind of made it a harder thing for us because in addition to building the high quality tools, that meant we also needed to do the lift to, you know, make it super easy to use. So that was a lot of extra work. The goal is give them something where they can flip a switch, and then, you know, they can check on it. But something that integrates into their system, and then they don't have to be constantly devoted to the latest privacy methods and and risks.
[00:10:56] Unknown:
As far as the kind of definition and protection and enforcement of data privacy policies, what are some of the considerations that go into that and some of the useful practices that you've identified for people who are trying to be able to kind of check the box in that compliance regimen to say, yes. These datasets that I am responsible for fit all of the kind of regulatory requirements of saying that I have de identified these types of personal information?
[00:11:29] Unknown:
There's not really very good best practices. Like, they're not widespread. The main reason we're focused on health care at first is because they do have much more established best practices, but even then, they're very squishy. And so they basically have 2 sets of requirements in health care, and I think 1 of them is called safe harbor, and that's very restrictive. And it's simple, but you can't get very useful data out of it. And the other 1 is this expert assessment requirement. And that is also pretty squishy, but it requires, well, the idea was that you would hire a consultant to come in and help you anonymize a dataset, and they generate a report. And that report doesn't have a whole lot of specific requirements.
It just has to be done, and it has to be done, you know, by someone who's competent. And so our goal in health care is to focus on satisfying the expert assessment need while automating as much as possible. That was really kind of the thread we started pulling on: how are these expert assessments done? What type of analysis are they doing to evaluate risk? And what kind of things are they doing to treat it? And then how can we automate those?
[00:12:52] Unknown:
And as far as that element of risk, there are some things that are obvious as far as why it might be risky to have certain types of data. But what are some of the avenues that that risk gets introduced through and the types of information that might have particular categories of risk that they introduce?
[00:13:13] Unknown:
Risk is introduced into datasets through any type of identifiable information. And so most people think of that as direct identifiers, which are, you know, names, addresses, social security numbers, phone numbers, those types of things. And you absolutely have to hide those, conceal them, delete them, redact them from datasets because then then those people can absolutely be identified. And so once you've removed the direct identifiers, that is what people refer to as pseudonymous data. I don't like the term. I think pseudo anonymous is more obvious. But the problem is that now you have all these indirect identifiers, which we refer to as quasi identifiers in the dataset.
This is really anything that is an attribute of a person in the dataset that's essentially public. And it doesn't even really need to be public. It just needs to be, you know, available in data to an attacker. But, generally, we think about it as public. And the the obvious ones are your date of birth, your gender, your ZIP code. These are all things that are easy to find. For example, there was 1 study that showed that if you only had the date of birth, ZIP code, and gender of everyone in the country in a dataset, you could uniquely identify 87%, which is staggering.
And so what it really comes down to is, how are you able to combine these quasi-identifiers in such a way that they actually become unique and, in effect, direct identifiers? That's what puts you at risk for a linkage attack. But these are the more obvious ones. Really, it's any public attribute. There's this somewhat famous attack that happened with Netflix data. Netflix did this, it was like a movie recommendation challenge. So they published this dataset. There's basically, you know, this pseudonymous dataset of all their users, or some subset of their users, and all the movies that they had liked and how they had rated them. And then the challenge was: beat our recommendation algorithm.
And what someone did was they scraped IMDB and took all the movie ratings from that, and they were able to join a surprising number of people from the Netflix dataset, and they were able to identify people. Now, the consequences of being identified in this Netflix dataset are not that huge. Maybe someone would find out that you liked a really lame movie. But, like, the idea it made very clear is that, you know, in any public data, any public attribute can be used to reidentify someone.
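To make the linkage risk concrete, here is a minimal pandas sketch that measures how many records share each quasi-identifier combination; the DataFrame and the `dob`, `zip_code`, and `gender` column names are hypothetical, and this is only an illustration of the idea, not any particular product's risk metric.

```python
import pandas as pd

# Hypothetical pseudonymous dataset: direct identifiers already removed,
# but quasi-identifiers are still present.
df = pd.DataFrame({
    "dob": ["1985-02-14", "1985-02-14", "1990-07-01", "1972-11-30"],
    "zip_code": ["98101", "98101", "73301", "10001"],
    "gender": ["F", "F", "M", "F"],
    "purchase_total": [120.50, 87.00, 45.20, 310.90],
})

quasi_identifiers = ["dob", "zip_code", "gender"]

# Size of the group each record belongs to, keyed by its quasi-identifier tuple.
group_sizes = df.groupby(quasi_identifiers)["dob"].transform("size")

# Records whose quasi-identifier combination is unique are the easiest targets
# for a linkage attack against an external dataset (voter rolls, IMDB, etc.).
unique_share = (group_sizes == 1).mean()
print(f"{unique_share:.0%} of records are unique on (dob, zip_code, gender)")
```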
[00:16:09] Unknown:
So in terms of the types of information that are considered personally identifiable, there are some that are obvious, such as names, Social Security numbers, physical addresses, sometimes email addresses. I'm wondering what are some of the pieces of information that can be considered PII that aren't as broadly considered as such, or that might not be obvious targets for these reidentification attacks?
[00:16:42] Unknown:
I think this is where you get into what is the context of the risk that you're trying to address. And so, you know, for most of these examples, we're talking about public data. What could you do if you wanna release a dataset to somebody to do some analysis? And we're talking about, well, what if that person gets voter roll information or census data, and then they join that in and try to enrich the data? The less obvious ones are more internal, where, you know, maybe you have 1 part of the company where there's some very seriously private data. So, for example, health care. And that's being released to another part of the company that maybe doesn't deal with this data all the time. Maybe it's a large company, and this is their data team. And so this data team, they might have access to internal data that maybe is not as private as this data that has just been anonymized and handed to them, but it might have more information than the public has access to. And so the thing that's not intuitive is you really have to consider the background knowledge of someone who might have incentive to attack the data. There could be really any number of these things, but, you know, it would just depend on the organization. And, you know, this would be kind of semi-protected data, but not something that's public.
[00:18:12] Unknown:
In terms of being able to protect this information, the challenge is that you want to prevent somebody from being identified, but you still want to be able to perform analytics across the data. And so I know that there are certain statistical mutations that you can take, such as replacing certain first names with other first names, that aren't necessarily going to impact the information that you're going to get in aggregate, because the name isn't necessarily significant unless you're trying to do some sort of analysis on the kind of ethnic sources of names in your cohort. But I'm wondering what are the types of techniques that are available for being able to convert a concrete record into this pseudonymous reference while maintaining the statistical utility and significance of the information?
[00:19:07] Unknown:
Well, even in that case, like, let's say you only swap names in a dataset. Right? There are no direct identifiers, and you only swap names. If an attacker suspected that the names had been swapped (and attackers tend to have access to the same name-swapping tools that someone would use to do that, so you can generally figure out when names have been swapped), then they would still try to do a linkage attack, because they would be using the quasi-identifiers. But to your question, what we do is based on what's called statistical disclosure control.
And the way to think about that is, really, like, you're hiding people in groups. And so that's really your protection. The risk is uniqueness, and your protection is non-uniqueness. And so what you wanna do depends on how much protection you need. The technique, or the metric, that is commonly used to approach this is called k-anonymity. And the k value is the size of the smallest group of records sharing the same quasi-identifier tuple, to use the date of birth, ZIP code, and gender example. Right? If we had a dataset where k equals 5, then that means there are never fewer than 5 records with matching quasi-identifiers.
And so that way, you know, intuitively, you can see how if there's always duplicates of the quasi identifiers, then it's always going to be harder to know who you've linked when you do a linkage attack. There are a few ways to address it, and this is an evolving field. And so the, like, classical way to do this, and I think this is how it was pitched in the original paper, is through generalization strategies. And this is basically where you're like, okay. This ZIP code is too unique, but what we can do is we'll mask the last 2 digits of the ZIP code. And so we'll only have a 3 digit ZIP code for, you know, 20% of the records. We'll just have to generalize it. And so, you know, your data is now blurry in a way. Right?
But it's still potentially useful. And it all just depends on what level of protection do you need and then what is the distortion of the data. And so something that the census was doing, this was kind of 1 of our early prototypes, is they would swap values. And so, essentially, you know, if something is unique, if there's a unique row or unique record, you find a group. Let's say our target is 5, k equals 5. Got a group of 4. Okay. Well, let's copy the quasi identifier values to this other row. So that's 1 way to do it. That was kinda the basis for how we approach it. But you're always targeting the group size, and then you wanna minimize the distortion throughout that process.
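A minimal sketch of the k-anonymity mechanics described above, assuming a pandas DataFrame with hypothetical `dob`, `zip_code`, and `gender` columns: compute the smallest equivalence-class size, then walk a simple ladder of generalizations (masking ZIP digits, bucketing birth years) until the target k is reached. A real treatment system would weigh many more strategies, including the value-swapping approach Will mentions, to minimize distortion; this only illustrates the metric and one generalization path.

```python
import pandas as pd

QUASI_IDS = ["zip_code", "birth_year", "gender"]

def k_of(df: pd.DataFrame) -> int:
    """Smallest number of records sharing any one quasi-identifier tuple."""
    return int(df.groupby(QUASI_IDS).size().min())

# Increasingly aggressive generalizations, applied only as needed.
GENERALIZATION_LADDER = [
    lambda d: d.assign(zip_code=d["zip_code"].str[:3] + "**"),    # 98101 -> 981**
    lambda d: d.assign(birth_year=(d["birth_year"] // 5) * 5),    # 1987  -> 1985
    lambda d: d.assign(birth_year=(d["birth_year"] // 10) * 10),  # 1985  -> 1980
    lambda d: d.assign(zip_code=d["zip_code"].str[:1] + "****"),  # 981** -> 9****
]

def anonymize(df: pd.DataFrame, k_target: int = 5) -> pd.DataFrame:
    # Coarsen date of birth to a year up front, then generalize until k is met.
    treated = df.assign(birth_year=pd.to_datetime(df["dob"]).dt.year).drop(columns=["dob"])
    for generalize in GENERALIZATION_LADDER:
        if k_of(treated) >= k_target:
            break
        treated = generalize(treated)
    # Caller should re-check k_of(treated); the ladder may not be sufficient, and
    # real systems would also consider suppressing or merging stubborn outlier rows.
    return treated
```

The trade-off described in the answer shows up directly here: each rung of the ladder raises k but blurs the data further.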
[00:22:22] Unknown:
Today's episode is sponsored by Prophecy.io, the low-code data engineering platform for the cloud. Prophecy provides an easy-to-use visual interface to design and deploy data pipelines on Apache Spark and Apache Airflow. Now all the data users can use software engineering best practices: Git, tests, and continuous deployment with a simple-to-use visual designer. How does it work? You visually design the pipelines, and Prophecy generates clean Spark code with tests and stores it in version control; then you visually schedule these pipelines on Airflow. You can observe your pipelines with built-in metadata search and column-level lineage. Finally, if you have existing workflows in Ab Initio, Informatica, or other ETL formats that you want to move to the cloud, you can import them automatically into Prophecy, making them run productively on Spark. Learn more at dataengineeringpodcast.com/prophecy.
And as far as the kind of technical components that are required to manage these data mutations and data obfuscation routines while maintaining the utility of the underlying information, what are some of the types of systems or engineering capabilities that are necessary for data teams, data platforms, or data engineers to be able to incorporate these capabilities into their core runtime? So you have a data lake or a data warehouse, and now you want to be able to start introducing some of these kind of data obfuscation practices.
And just curious sort of what they might need to add to their existing tooling to make that possible.
[00:24:04] Unknown:
In the general case, you know, all you need is the data. Right? The problem is, you know, most people have their data in a data warehouse or in some data store. And so it really depends on where your data is. But the tools you have at your disposal to do these kinds of analysis might be pretty limited. I think it's generally pretty limited if it's just a warehouse. Even a, you know, data-science-oriented data store, it would have to be a very specific 1, maybe 1 of these newer ones with Python data stack type tooling. But you really just need to be able to run custom algorithms on your data. And so the way you would do it, typically, is, you know, you get it into some kind of data frame, either pandas or R or MATLAB. There's a bunch of stuff going on in MATLAB, you know, for ad hoc analysis and processing. And so you need to be able to cluster data, and you need to be able to manipulate it. The hard thing about it is it tends to be very slow.
I mean, that's not totally true, but in general, a lot of the research, a lot of the literature for treating data, these strategies, they tend to be computationally expensive. And that's the challenge that we've had. And so I think that would be a challenge for anybody. But, really, you just need a good data science stack, and you need it on top of your data. Now, as far as what we do, ours is essentially 2 phases. Step 1 is, let's get your data out of your warehouse and into our system. And then, you know, we use most of the Python data science stack in some way to assess and treat the data. And then we get it back into your warehouse, and get it off of our system.
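As a rough sketch of the pattern described here, pulling data into the Python data science stack and clustering records so that similar people can be grouped before treatment, the snippet below uses pandas and scikit-learn. The `customers.csv` extract, the column names, and the choice of KMeans are illustrative assumptions, not a description of Privacy Dynamics' actual pipeline.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Stand-in for a warehouse extract; in practice this might come from
# pd.read_sql(...) against a SQLAlchemy engine for your warehouse.
df = pd.read_csv("customers.csv")

# Cluster on scaled numeric quasi-identifiers so that records that are close
# together (similar age, household size, income) land in the same group.
numeric_qis = ["age", "household_size", "income"]
features = StandardScaler().fit_transform(df[numeric_qis])

n_groups = max(1, len(df) // 5)  # aim for roughly 5 records per group (k ~ 5)
df["privacy_group"] = KMeans(n_clusters=n_groups, n_init=10, random_state=0).fit_predict(features)

# Note: KMeans does not guarantee a minimum cluster size, so a real treatment
# step would still need to merge or suppress groups smaller than the k target.
```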
[00:26:05] Unknown:
In terms of the kind of attack vectors that are available for people who are trying to use these reidentification methods, I'm curious if there is an equivalent in the data space and data privacy domain to what application developers have in the form of the OWASP Top 10 of, you know, these are the main avenues of vulnerability that you need to worry about when you're building a web application. I'm just wondering if there's any sort of analog in the work of managing data privacy in your data assets.
[00:26:41] Unknown:
To my knowledge, there really isn't. Part of the problem is it's so abstract, you know, the risk. It's a little bit easier to understand in the differential privacy area because, like, they have a more formal definition of what they're protecting. Theirs is based on a definition of information leakage. The problem with differential privacy, though, is that, at least in order for it to be really useful, whoever is doing the analysis on the data needs to be doing it through a differentially private system. Essentially, you know, you run a query on some data, and then the summary statistics that you get back are protected with a set epsilon; epsilon is the variable they choose. And the advantage of that is these epsilon values are composable. So on multiple releases, you can just easily figure out what the added risk is, but it's inconvenient.
And so if you wanna actually share a dataset, you have to do this more complicated analysis. That's kinda what we pulled from the literature on these expert assessments. And, you know, it involves, like, a big simulated attack, essentially. But, you know, if I were giving advice to somebody on just, you know, what are the best practices, I think the main thing we're struggling with right now is just awareness of how risky data can be even if the names and Social Security numbers are gone. I think we're at that level now with anonymization technology, which is that people don't totally understand that there's risk. A lot of the mindshare, I feel like, has been sucked into a lot of stuff going on in ad tech right now, which is totally valid. Those are issues too. But that's more adjacent to what we're doing.
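For contrast with record-level anonymization, here is a toy illustration of the epsilon idea described above: a count query released through the Laplace mechanism, where smaller epsilon means more noise and stronger protection, and epsilons add up across repeated releases. This is a hand-rolled sketch for intuition only, not a production differential privacy system.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def dp_count(records, epsilon: float) -> float:
    """Release a count protected by the Laplace mechanism.

    A counting query has sensitivity 1 (adding or removing one person changes
    the result by at most 1), so the noise scale is 1 / epsilon.
    """
    return len(records) + rng.laplace(loc=0.0, scale=1.0 / epsilon)

cohort = list(range(432))              # stand-in for a cohort of 432 people
print(dp_count(cohort, epsilon=0.5))   # roughly 432, plus or minus a few
print(dp_count(cohort, epsilon=0.05))  # much noisier, stronger privacy
```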
[00:28:33] Unknown:
And for people who are trying to protect this sort of information, you know, there's the 1 approach of anonymizing the data, obfuscating it, adding in some of this kind of skew to the underlying information. Another option is to encrypt the data and then offer some avenue for decrypting information, or encrypting the query, so that you never actually expose the decrypted data and you can maintain the original integrity of the datasets. And I'm wondering what you see as the calculus for figuring out which avenue to go down for being able to protect a particular data set, and the risks and trade-offs that are involved in either direction?
[00:29:10] Unknown:
It all depends on, like, what you need the data for. Right? And also what you need to protect. So, you know, the thing that we focused on was people who need to share data. And if you need to share the data, then you have to protect it. You're protecting the output, you know, the thing that you hand off. And so the encryption-based systems, I think, would be more valuable when, like, modifying the data really isn't an option. Like, we really have to have the raw data for some reason. But, ultimately, the result, the output of that query, is still gonna be at risk, whatever it is. And so it's always gonna be a question of what actually gets released. What information can people get? Like, what additional background information can people get from data that you wanted to release for another reason? You know, there's, like, homomorphic encryption. I don't know if you were kind of hinting at that.
It's a really interesting field. At least right now, my understanding is it's very impractical for anything but very simple computations, and it's very computationally expensive. But even then, like, the output that you get, if it's rows or even if it's a summary statistic, you still have the issue of information leakage. It's a different problem, and it's not necessarily mutually exclusive to the anonymization problem.
[00:30:34] Unknown:
The homomorphic encryption question is definitely an interesting 1. I actually did a show about that a while ago, about a company that was working on scaling that to make it more practical in real-world use cases. But another approach that, yeah, I believe the folks at Immuta are doing is where they will actually allow for predicate matches on encrypted values in the database so that you don't ever have to actually decrypt the information and surface it back to the user. Instead, you'll actually encrypt the predicates to be able to match them against the values as they exist in the database. So lots of interesting areas in the encryption field.
Setting that aside, though, another avenue that's interesting in terms of this deidentification and obfuscation approach is how you are able to evaluate the level of risk after the data has been obfuscated, particularly as you start to add in additional sources of information that become available. Because I know that there is a case I'm forgetting the details, but I believe it's relatively well known where particular dataset had been de identified. It was considered to be safe to release to the public, and so they actually produced this. I believe it was a dataset for research purposes that included medical information.
And then once an additional dataset was made available publicly, somebody was actually able to create 1 of these linkage attacks to go back and create these links to say, okay, you know, this information that's supposedly de-identified, I can actually say exactly, you know, who this medical professional is based on this information that I have available. And so I'm curious how situations like that can be identified and mitigated, and some of the ways that you can understand how your level of risk changes over time as you start including more datasets?
[00:32:40] Unknown:
It's a super messy problem. Right? Because you can't unrelease data. Not really. Like, Netflix took down that dataset. Right? But if you really wanted to get it, you could get it, I'm sure. It's 1 of those things where data only ever gets added. I think it really depends on how you wanna think about your risk. So say you're a company that has data that's only relevant for a certain period of time. Right? Well, they could take that into account. But, you know, people live only a certain number of years. Right? So that's also something else you might wanna take into account. Now, what we do, at least out of the gate, is we take a super pessimistic approach to our risk assessments.
And so there's a bunch of different ways that you can kinda angle your, like, attack analysis. And, you know, it depends on what's the assumed background knowledge of the attacker, what is the size of the dataset that they have, what's the percentage of people that's in it. And so we generally are very conservative. We just are super pessimistic. We're just trying to not understate risk to our customers. And that's how we're approaching it. And, in fact, like, a lot of the literature that we're seeing coming out now is about how to loosen that, like, what are ways to do that, because most of the time, it's way overkill.
And we know it is, but we don't want to tell a customer that something is safer than it is. And so I think that the problem that you pose is nearly intractable. And so we just are super pessimistic, and we're gonna start from a really conservative angle. And then, as we get more comfortable, as we get a better understanding of, you know, what are the practical implications of these things, or what our customers' risk appetite is, then we're gonna try to start exploring these ways to actually kinda do the opposite and dial it back. Like, just for example, 1 of the things you can do is use a population estimator.
You know, if you can estimate the size of the population of the people in your dataset, then you can more accurately assess the risk. Typically, it's worst case. And so larger populations are, you know, harder to attack. And so we're kind of taking the opposite approach and then loosening the screws slowly in the other direction.
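To make the "pessimistic by default, then loosen" idea concrete, here is a toy comparison (not Privacy Dynamics' actual estimator) between a worst-case re-identification risk, which treats the released sample as the entire population, and a risk adjusted by an assumed sampling fraction; the quasi-identifier columns are hypothetical.

```python
import pandas as pd

def reidentification_risk(df: pd.DataFrame, quasi_ids, sampling_fraction: float = 1.0) -> pd.Series:
    """Per-record risk approximated as 1 / (estimated population group size).

    With sampling_fraction = 1.0 (the pessimistic default) the sample is treated
    as the whole population, so a record that is unique in the sample gets risk 1.0.
    A smaller sampling fraction inflates the estimated population group and lowers risk.
    """
    sample_group_size = df.groupby(quasi_ids)[quasi_ids[0]].transform("size")
    estimated_population_group = sample_group_size / sampling_fraction
    return (1.0 / estimated_population_group).clip(upper=1.0)

# Hypothetical usage:
# df["risk_worst_case"] = reidentification_risk(df, ["dob", "zip_code", "gender"])
# df["risk_adjusted"] = reidentification_risk(df, ["dob", "zip_code", "gender"], sampling_fraction=0.05)
```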
[00:35:12] Unknown:
And so now digging into the specifics of what you're building at privacy dynamics, I'm wondering if you can talk through some of the ways that it's designed and implemented and some of the architectural and scalability considerations that go into the way that you approach this problem?
[00:35:29] Unknown:
Fundamentally, architecturally, it's a Kubernetes cluster. That is because we're providing cloud and enterprise tiers. This is something where, you know, people are very concerned about their data. We knew out of the gate we wouldn't be able to just offer a cloud-only service. So that's kinda how we handle both the cloud and on-prem, which isn't really on-prem anymore, but, you know, virtual private cloud, essentially. Then the actual implementation itself is, for the most part, a Python monolith design. And then we have some of these, you could call them microservices, but they're these smaller services that serve the monolith.
And the data science core is all built in, for the most part, all Python data science stack. NumPy, pandas, we've used some clustering algorithms, some scikit learn, just like a handful of these, you know, really common data libraries. Scalability was a problem from the beginning. Right? Because the stuff that we're doing is very computationally expensive, so we're always looking for ways to optimize, and it's a long term project for us. Like, we have a long road map of scalability improvements for the data science. But 1 of the things is just simply you have to be able to load stuff in memory. It requires a lot of memory.
So in addition to all the heavy lifting you have to do to do these data transformations, data science tooling is designed to, like, load everything in memory and then operate on that and then make copies of it and do that kind of thing, which is just not scalable. And so it's a multi step process for us. And, you know, we're only part of the way there. So we're gonna be making all these improvements to the data science core. But 1 of the things we're doing is we had to build a job system to schedule jobs because we have limited resources. This isn't this isn't something that you can easily just fit into a Lambda function and just, like, you know, let the cloud absorb your computation as it goes out.
We plan to leverage those types of services, but it's very hard. And so now we're focused on scalability: how can we limit memory usage, have knowable, you know, computational requirements, build our job system, and then have that thing scale using Kubernetes to address whatever load requirements we have at a given time. To speak to 1 of the challenges, 1 of the biggest challenges that we found was making it possible, even using Kubernetes, to deploy for on-prem customers. Like, even once you have a Kubernetes system, it is nontrivial to get that system onto a customer's system in a way that is manageable.
And to use it also to serve a cloud service that has different scaling requirements, and then have all the kind of satellite microservices we have around it be able to operate with both. It was a much more complicated design and implementation problem than we ever imagined from the start.
[00:38:42] Unknown:
Going back to your point about starting from a very pessimistic position on the level of risk associated with the dataset even after obfuscating it, I'm wondering how you think about exposing that dial of how extreme you want to go in modifying the data, and how you expose the trade-offs to your end users: you can, you know, dial it up to 11 and we will replace all the values with something that we fabricate, versus we're actually going to maintain the original information but we're just gonna shuffle it around a bit. And how does that impact the security of the dataset after the fact, and being able to validate your assertions about what level of impact it's going to have?
[00:39:34] Unknown:
This really gets to the core of how our system works, which is our treatment algorithm. It is scalable in privacy terms in the sense that you give it essentially a group size requirement, a k value, and it will hit that target. And it will do everything it can to minimize distortion in the data based on the target you give it. That's really the core of the treatment system, and then that's buttressed by our risk analysis and then our distortion analysis. And so this is really kind of the complete picture, which is the privacy-utility trade-off. And so we do this risk analysis, and, you know, there's a top-line score and then other metrics, right, to kind of dig into it and understand it. And that tells you the privacy that you have. And then we have data distortion analysis, and that shows you what the utility is of your data. And we give a breakdown of that. So whoever has to use the data can go look, and they could say, you know, how has the distribution changed on the age column in this dataset? Or how many cells have actually changed?
Is it less than this many? Okay. Then we don't care. Okay. It's more than this threshold. Now we need to drill down. How meaningful has it changed? And so what we try to do, this is something that the the treatment system is really more of a platform that we we're building on. And so we try to do things where we try to group similar data together. This is part of the clustering aspect of it. And so the idea is to minimize the distance between the values that we have to swap. So we wanna change someone's age from 31 to 35, not 31 to 61.
And so we're always improving the system so that we can make better decisions and reduce distortion for any given privacy target, but also to scale it in other ways that aren't necessarily about increasing utility. Like, some customers may want a traditional generalized record. Right? I can imagine some case where you're like, well, the swaps, you could think of them as like synthetic data. It's not quite synthetic, but maybe that's a good way to think about it. But maybe you say, okay, well, that's not right, that's not true. This data needs to actually reflect exactly 1 to 1 what was here before. And so then you could use a range. But in another case, you could potentially reduce distortion depending on the kind of analysis you intend to do. If, rather than swap a value with, say, the median to get the lowest distance of change, you swap it with the average, then maybe all the values change, but you get a more accurate summary statistic for the type of analysis you're doing. And so it's these types of things we wanna build in to, you know, allow the user to experiment with their data, figure out what risks they're comfortable with, understand it, and then also see, you know, what utility is good enough for what I need.
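A minimal sketch of the kind of distortion report described here, assuming `original` and `treated` are pandas DataFrames with the same columns and row order; the specific metrics (share of cells changed, shift in mean and standard deviation) are illustrative stand-ins for whatever a real utility report would surface.

```python
import pandas as pd

def distortion_report(original: pd.DataFrame, treated: pd.DataFrame) -> pd.DataFrame:
    """Per-column summary of how much the anonymized data differs from the source."""
    rows = {}
    for col in original.columns:
        row = {"cells_changed": (original[col] != treated[col]).mean()}
        if pd.api.types.is_numeric_dtype(original[col]):
            row["mean_shift"] = treated[col].mean() - original[col].mean()
            row["std_shift"] = treated[col].std() - original[col].std()
        rows[col] = row
    return pd.DataFrame(rows).T

# Hypothetical usage: flag columns where more than 10% of cells were modified.
# report = distortion_report(original, treated)
# print(report[report["cells_changed"] > 0.10])
```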
[00:42:41] Unknown:
And this is probably a whole other episode's worth of conversation, but I don't know if it's worth touching on the approach of data obfuscation versus differential privacy, or maybe how the 2 relate to each other. I don't wanna dig in too much because I'm probably not qualified to do a deep dive on it.
[00:42:46] Unknown:
But, you know, data obfuscation, it kinda depends on what you mean by data obfuscation. Right? Our opinion on this is that differential privacy is best in a system where the person doing the analysis, writing the queries, is operating on a differentially private system. From our perspective, that's fundamentally different from the type of problem we're trying to solve, which is people need to share the data because people wanna use their tools. They wanna use their analysis tool and, you know, their BI system.
And, you know, it doesn't work if you are taking that control away. Now, that doesn't mean that it's not very valuable, and in fact it's something we might offer. Out of the gate, we wanted to do this: how can we get the most value to the most customers? And also, data obfuscation, I would consider that more of an umbrella term. There are other people who do perturbation to data, where they're not doing k targeting like we are, but they'll say, okay, I'm gonna take these dates and I'm gonna wiggle them with some random values.
Now I'm not quite sure how you would assess risk in that case because you could still end up with a lot of unique data. So we would probably need some other kind of risk analysis, but that's how I would see those kind of fitting together, just based on use case.
[00:44:16] Unknown:
In terms of the actual workflow of using Privacy Dynamics and integrating it with an existing dataset to be able to de-identify the information and generate a shareable data asset, I'm wondering what that looks like for being able to handle the kind of integration path. And then, particularly once you're connected up to the source system, how you approach the heuristics of understanding which columns or which fields or which tables might have personally identifiable information, so that you can then understand what types of mutations you need to implement on top of the source data set?
[00:45:03] Unknown:
On a high level, we connect to a data warehouse. We provide these connectors. We're trying to build as many useful connectors as we can. And so we integrate into kind of an ELT pipeline, typically at the beginning or at the end. And then we'll look at a source table or a query, and then run it through our system and then pipe it out to a target table, or the same table in a different schema, or something like that. The PII detection problem is a hard problem because it's another 1 of these messy problems where you have to use heuristics. Right? So this is another 1 of those things that I don't think is ever going to get perfect. And so it's just constant incremental improvement.
But the problem that we have to solve is, essentially, we wanna know what are direct identifiers and what are quasi-identifiers. And so the first thing you do is pattern matching. And I've seen people online, who have to solve certain types of problems like this, kind of scoffing at pattern matching. It really is, you know, as a first pass, the absolute best way to find data or to categorize data. Most people have names on columns that are meaningful. You know, we do a lot of that. But then our system tries to be a lot more clever.
Essentially, we use a lot of patterns, and patterns have weights assigned to them. We also look at the data. You know, we look at the column name and we look at the data, and we combine the pattern matches on the column names and on the data values to give us a confidence score and decide whether or not we think that that's, you know, someone's name or, you know, a Social Security number. But, like, normal values are the easier ones to do, like a credit card. Right? You know, there's certainly a lot of 16-digit numbers that would be false positives on credit cards, but credit cards have these CRC checks that you can do. And so if every value passes the check, even with no column name, you can be very confident it's a credit card column. And so we do that, and those are heuristics. And, you know, I fully expect that over time, as we encounter more and more real-world data, we will have to be making constant tweaks to make that more accurate. But 1 of the things that we do is we use another heuristic. It's a categorical heuristic. And this is, again, part of our pessimistic view. And that is, we train a model. It's not real complicated. It's a decision tree model.
And we train a model on some data to try to guess whether a column is categorical. So if we don't know what the type is, we say, is this a category? And then categories we consider quasi-identifiers. Now there are some cases where you wanna turn that off. Not that it's automatic, but you wanna say, no, something's just an attribute. Right? Something's actually a value you don't wanna protect. This is the value you're studying, not an attribute of the person. But generally, that's a little bit more permissive, and it tends to be pretty good. And it's absolutely necessary because there are a lot of cases where people have categorical data encoded as integers. And so we have to have a good way of determining, are these actually categories?
And, you know, if so, we need to protect it.
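To illustrate the kind of value-level check described above, here is a small sketch of the checksum commonly used to validate credit card numbers (the Luhn algorithm, which is presumably the "CRC check" referred to here), combined with a naive confidence score that also weighs the column name; the weights and patterns are made up for the example.

```python
import re

def luhn_valid(number: str) -> bool:
    """Checksum used by credit card numbers (the Luhn algorithm)."""
    digits = [int(d) for d in re.sub(r"\D", "", number)]
    if len(digits) < 13:
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:      # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def credit_card_confidence(column_name: str, values) -> float:
    """Naive confidence that a column contains credit card numbers."""
    name_score = 0.3 if re.search(r"card|cc|pan", column_name, re.I) else 0.0
    checks = [luhn_valid(str(v)) for v in values if str(v).strip()]
    value_score = 0.7 * (sum(checks) / len(checks)) if checks else 0.0
    return name_score + value_score

print(luhn_valid("4111 1111 1111 1111"))   # True: a well-known test number
print(credit_card_confidence("payment_card", ["4111111111111111"]))
```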
[00:48:22] Unknown:
And I imagine that beyond just these heuristic and self-discovery processes, you have an avenue for people to be able to go in and label specific fields as: this is something that needs to be processed, and these are the types of transformations that need to be executed on them, or, you know, this is the underlying data type.
[00:48:42] Unknown:
Yeah. We're trying to make it as automatic as possible. And, you know, like, the long-term goal is to kinda have, like, layers of configuration where, you know, you've got basic settings, and then if you wanna tweak it, if you have more specific tweaks you wanna do, you do that. And so we're really starting from the top. But, yes, absolutely, that's our intent. You know, we wanna get away from having people specify a specific transformation in certain ways, because we can probably do better if we just know this is a date. You know, maybe for some reason we couldn't identify this as a date, but you tell us it's a date. Now we can do more customized treatment on it. And that's the important thing, is you get the best utility and protection.
[00:49:26] Unknown:
And so in terms of the applications of privacy dynamics and the overall space of data de identification for the purposes of sharing, what are some of the most interesting or innovative or unexpected ways that you've seen either the privacy dynamics product or the principles that you're applying used in your experience?
[00:49:49] Unknown:
I'll go with the unexpected 1, just because we've just launched our product. And so what we didn't expect was that customers might use our anonymization system to not anonymize data. And so this is something that I just kind of built into the core of the system: if there's data that can't be anonymized or shouldn't be anonymized, it blows up, and it has an error, and it doesn't work. But it turns out we have customers who wanna copy everything from, you know, 1 location to another. And then some of the data either shouldn't be treated or can't be treated, and so it locked it all up. Or, you know, tables that didn't have any rows in them. Essentially, they wanted to use us as a data replication service that also anonymizes where appropriate.
It makes total sense that someone would wanna do that. Right? But, you know, we were so focused on the anonymization use case. We didn't think that, you know, obviously, someone would wanna do this.
[00:50:49] Unknown:
And in your experience of working on privacy dynamics and exploring this overall space of data privacy, What are some of the most interesting or unexpected or challenging lessons that you've learned?
[00:51:00] Unknown:
For me, there's this kinda startup element of engineering for product-market fit, because it's very unintuitive to a software engineer, simply because you have to go against your instincts a lot of the time on how to build things, because you're trying to prove a use case. You're not trying to prove that your, you know, high quality system is going to be scalable, because you don't have time. The whole point is to try to discover what someone wants. Build a stable demo that, you know, does most of the things that you need, but fully expect to have to throw this thing away in a few months if you need to pivot in some other direction. That was something that was really challenging for me. And if I had to do it again, I don't know that it would be any more comfortable. You know? I would just have to remind myself that, you know, you have to shoot for the goal you're aiming for at the moment.
[00:51:57] Unknown:
For people who are interested in being able to apply some of these anonymization techniques to their data and be able to satisfy some of these risk or regulatory requirements in order to be able to share these datasets, what are the cases where Privacy Dynamics is the wrong choice, either because it doesn't have the proper set of integrations available, or it doesn't meet a particular use case or handle particular data types?
[00:52:28] Unknown:
Right. So we're not a compliance tool. So we're not set up to specifically help you meet certain regulations. We're a privacy tool. We don't deal with data access control and reporting on provenance and usage. And especially, you know, if you're an ad tech company who needs to identify individuals for marketing purposes, we're not a good choice for that because we kinda do the opposite. Right? So there are a lot of companies who wanna satisfy GDPR or CCPA, but they also wanna still be able to uniquely identify individuals for marketing.
That's not us.
[00:53:09] Unknown:
And so as you continue to build out the privacy dynamics platform and now that you are in the early stages of launching the product, what are some of the things you have planned for the near to medium term or any particular projects that you're excited to dig into?
[00:53:24] Unknown:
I'm really excited about all the stuff we get to work on next. This is some of the most fun, which is the first is digging into the data science core and, you know, really getting to do a lot of analysis of our own to figure out, you know, ways to improve privacy and utility, kinda push the edge of that curve just to generate the most high quality data we can, and also improve automation. You mentioned you were talking about these heuristics, you know, for PII detection. That's something that is going to need a lot of data and a lot of analysis. We're all in on, you know, hitting all the major data types that we can.
And some of the most fun stuff, for me anyway, is system performance. Figuring out how to make our algorithms faster, how to handle more complicated data, streaming data, bigger datasets. You know, we'll probably have to do a lot of retooling on the algorithm to handle datasets that are just so much bigger than memory, and figuring out how to, you know, scale more granularly with user demand. You know, that's 1 of those more classical data engineering infrastructure problems. And then, kinda a little bit further down the road, like, we already have some dbt integration with Privacy Dynamics. You can plug into a dbt repo and pull down models and use them directly in Privacy Dynamics.
We wanna go all in on dbt integration. And so that's kind of 1 of the next big projects, is really integrating into the modern data stack, handling a lot of these more advanced use cases where people want to include privacy and anonymization as part of their pipelines, you know, the graph for their data processing. And so we see huge opportunities for people to get a lot more value if we can leverage that. That's something we're really excited about.
[00:55:27] Unknown:
Are there any other aspects of the work that you're doing at Privacy Dynamics or the overall space of data privacy management that we didn't discuss yet that you'd like to cover before we close out the show?
[00:55:44] Unknown:
The only thing I can think of is, you know, the way we had to address this. This might be too nuanced to dig in on, but, like, the way we approached the system was we have essentially assessment and we have treatment. And so for the assessment, that's where we were very conservative, both in how we approached implementing it to begin with, which was we tried to pull straight from the literature and not take many liberties other than to tailor it to fit the data. That was important because we didn't want to blaze any new trail for the risk assessment. But then on the treatment, we wanted to go beyond what was available from the literature and be more experimental, to try to get better utility.
And that was an interesting balance, and I thought it really paid off for us because, you know, it allowed us to kinda branch out and try new ideas. And then we had this stable kind of established technology or analysis practice that would keep us honest. And so, you know, if we ever did anything that would increase risk, we would know. That was really interesting to me about kind of 2 important pieces of the system that we approach in different ways with respect to prior art and research.
[00:57:03] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:57:19] Unknown:
Probably the thing that I find missing the most, and this is a very self-serving thing to pick on, I know there are companies who are definitely trying to address this, is moving data around. And this is kind of an old problem. Right? Like, there are more and more databases, and every database has its own little design that's specific: we do these types of queries the best, we store data this way, we handle these use cases. But you always have to move data around. So for us, what that has meant is we need to build these connectors. In order to make this stuff easy for our customers, we have to provide an out-of-the-box connector so they can just say, oh, you use Snowflake. Okay. Enter your credentials. Boom. You're connected to Snowflake.
These connectors aren't enormous lifts, but they're not trivial, especially when you need to essentially download the entire table or an entire database. So I think data movement, it's always been a problem, I guess. So I don't know if that speaks to how there will be a great solution to it, but I certainly think that more convenient tools, different layers of the stack too. Right? So not just kinda high level connectors where it's just kinda moving something from syncing databases, but then also down low for people like us where, you know, we just wanna hook in and get things into a data frame. That's a lot harder than I expected it to be.
[00:58:42] Unknown:
Alright. Well, thank you very much for taking the time today to join me and explore this broad space of data privacy and some of the risks associated with leaving data as is, as well as mutating it. It's definitely a very interesting problem space and an interesting approach that you're taking at Privacy Dynamics, and it's great to see more people contributing to solutions in this space. So I appreciate all the time and energy you're putting into that, and I hope you enjoy the rest of your day. Thanks, Tobias. I really enjoyed speaking with you. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Guest Introduction
Will Thompson's Background in Data Management
Core Discussion on Data Privacy
Privacy Dynamics' Focus and Product
Defining and Protecting Data Privacy
Techniques for Data Anonymization
Technical Components for Data Privacy
Attack Vectors and Risk Assessment
Privacy Dynamics' Architecture and Scalability
Balancing Privacy and Utility
Integration and Heuristics for PII Detection
Applications and Lessons Learned
When Privacy Dynamics is Not the Right Choice
Future Plans and Projects
Final Thoughts and Closing Remarks