Summary
There are many dimensions to the work of protecting the privacy of users in our data. When you need to share a data set with other teams, departments, or businesses, it is of utmost importance that you eliminate or obfuscate personal information. In this episode Will Thompson explores the many ways that sensitive data can be leaked, re-identified, or otherwise put at risk, as well as the different strategies that can be employed to mitigate those attack vectors. He also explains how he and his team at Privacy Dynamics are working to make those strategies more accessible to organizations so that you can focus on all of the other tasks required of you.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Today’s episode is Sponsored by Prophecy.io – the low-code data engineering platform for the cloud. Prophecy provides an easy-to-use visual interface to design & deploy data pipelines on Apache Spark & Apache Airflow. Now all the data users can use software engineering best practices – git, tests and continuous deployment with a simple to use visual designer. How does it work? – You visually design the pipelines, and Prophecy generates clean Spark code with tests on git; then you visually schedule these pipelines on Airflow. You can observe your pipelines with built in metadata search and column level lineage. Finally, if you have existing workflows in AbInitio, Informatica or other ETL formats that you want to move to the cloud, you can import them automatically into Prophecy making them run productively on Spark. Create your free account today at dataengineeringpodcast.com/prophecy.
- The only thing worse than having bad data is not knowing that you have it. With Bigeye’s data observability platform, if there is an issue with your data or data pipelines you’ll know right away and can get it fixed before the business is impacted. Bigeye lets data teams measure, improve, and communicate the quality of your data to company stakeholders. With complete API access, a user-friendly interface, and automated yet flexible alerting, you’ve got everything you need to establish and maintain trust in your data. Go to dataengineeringpodcast.com/bigeye today to sign up and start trusting your analyses.
- Your host is Tobias Macey and today I’m interviewing Will Thompson about managing data privacy concerns for data sets used in analytics and machine learning
Interview
- Introduction
- How did you get involved in the area of data management?
- Data privacy is a multi-faceted problem domain. Can you start by enumerating the different categories of privacy concern that are involved in analytical use cases?
- Can you describe what Privacy Dynamics is and the story behind it?
- Which category or categories are you focused on addressing?
- What are some of the best practices in the definition, protection, and enforcement of data privacy policies?
- Is there a data security/privacy equivalent to the OWASP top 10?
- What are some of the techniques that are available for anonymizing data while maintaining statistical utility/significance?
- What are some of the engineering/systems capabilities that are required for data (platform) engineers to incorporate these practices in their platforms?
- What are the tradeoffs of encryption vs. obfuscation when anonymizing data?
- What are some of the types of PII that are non-obvious?
- What are the risks associated with data re-identification, and what are some of the vectors that might be exploited to achieve that?
- How can privacy risk mitigation be maintained as new data sources are introduced that might contribute to these re-identification vectors?
- Can you describe how Privacy Dynamics is implemented?
- What are the most challenging engineering problems that you are dealing with?
- How do you approach validation of a data set’s privacy?
- What have you found to be useful heuristics for identifying private data?
- What are the risks of false positives vs. false negatives?
- Can you describe what is involved in integrating the Privacy Dynamics system into an existing data platform/warehouse?
- What would be required to integrate with systems such as Presto, Clickhouse, Druid, etc.?
- What are the most interesting, innovative, or unexpected ways that you have seen Privacy Dynamics used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Privacy Dynamics?
- When is Privacy Dynamics the wrong choice?
- What do you have planned for the future of Privacy Dynamics?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
The only thing worse than having bad data is not knowing that you have it. With Bigeye's data observability platform, if there is an issue with your data or data pipelines, you'll know right away and can get it fixed before the business is impacted. Bigeye lets data teams measure, improve, and communicate the quality of your data to company stakeholders. With complete API access, a user-friendly interface, and automated yet flexible alerting, you've got everything you need to establish and maintain trust in your data. Go to dataengineeringpodcast.com/bigeye today to sign up and start trusting your analyses.
Your host is Tobias Macey. And today, I'm interviewing Will Thompson about managing data privacy concerns for datasets used in analytics and machine learning and the work that he's doing at Privacy Dynamics. So, Will, can you start by introducing yourself?
[00:01:49] Unknown:
Yeah. My name is Will Thompson. I'm the principal software engineer at Privacy Dynamics. We're a startup focused on helping people with data privacy. I started a couple years ago. I've been there on the ground floor.
[00:02:00] Unknown:
And do you remember how you first got involved in the area of data management?
[00:02:04] Unknown:
Yeah. I've been in data management really my whole career. It's quite a bit different than what I'm doing now, but it started over 15 years ago. I was working for a company called O'Connor's, and we were building a legal research platform. And so our data was more content-type data. So we were dealing with documents, XML, JSON. And all that stuff was stored in NoSQL databases. The thrust of our work was search, discovery, a lot of reporting, management of cross references, and tracking things. So a lot of parallels. There were all these data pipelines that we did, but just a completely different stack. And then O'Connor's got bought out by Thomson Reuters in 2018.
And just prior to that, I had started playing around in Python. I wanted to build an audio search engine using Python tools. And, you know, probably not the best way to get started in Python. I had 0 experience, but it was still interesting, and it was kinda 1 of those projects where I was like, well, how hard could this be? And, you know, I'm sure as you're aware, these things can be deceivingly difficult. So, you know, I played with that for a while, and I actually got to a pretty decent prototype of what I was trying to accomplish. But at that point, my boss, who I'd worked with at O'Connor's for many years, had teamed up with Graham Thompson, no relation, from Privacy Dynamics, and they pitched me on data privacy. And, you know, I saw a huge opportunity.
It seemed like a really cool thing to work on. So I jumped on board with them in in 2019, and that's really where I kind of formally began my data science focused data management part of my career.
[00:04:02] Unknown:
And so that brings us to the core of the conversation for today, which is focused on data privacy, which is a very multifaceted problem domain with lots of different directions that you can go in. And I'm wondering if you can just start by enumerating some of the broad categories of privacy concerns that are involved in managing datasets for analytical use cases and some of the technical considerations, organizational considerations, kind of data, and sort of personal information considerations that go into that problem space?
[00:04:40] Unknown:
I'll group it into, like, 3 groups. The main 1 that probably everybody thinks about when they think about privacy is, like, data access protection. And this is really the information security problem. That dovetails with all these other types of subcategories like governance, you know, tracking usage of data, kind of provenance type problems. Basically, who has access? When do they have access? Where does the data go? And then there are other related technologies around that, like data privacy vaults. These are all designed to protect who has access to the data, making sure that no unauthorized users have access, and that we have good bookkeeping on, you know, who saw it. And then there's more about protecting what additional information could be gleaned from a dataset: you wanna release specific information, and you don't wanna allow other information to be released. Typically, this is, you know, personal information.
And so the second one is the protection of aggregate data and statistics. And this is probably the 2nd most understood. This is what, like, the real use case for differential privacy is, or what most people consider differential privacy. And that is, you know, you have some summary statistic about a group of people, and you wanna make sure that it's hard to discover anything else about it. And then the last 1 is kinda where we're more involved, and that is where you're protecting an actual raw dataset. And this kind of gets into anonymization and de-identification.
[00:06:25] Unknown:
As far as those categories, you know, there are definitely the kind of data governance considerations that factor into a number of them, but there are a number of projects and products that exist for handling sort of the access control, sort of data security, system security element of it. And as you were saying, you're focused more on the privacy considerations that exist in situ in the datasets, which is something that needs to be handled in every place that the data lives. And so I'm wondering if you can talk to what it is that the privacy dynamics product is focused on and some of the story behind that and the specific category or categories that you're focused on addressing from what you were just enumerating?
[00:07:11] Unknown:
Like I mentioned, we're focused on people who need to share data for operational use, for sales. Like, the main target at first is regulated markets. They have a much more rigid outline of what is needed in order to protect the data. But then also the broader analytics-driven, kinda direct-to-consumer retail market, companies that have a lot of customer data. They wanna figure out how to build bigger, better products, figure out how to market to their customers. Those are the targets. Our founder, Graham Thompson, no relation, he was at Microsoft, and he was working in this enterprise group, and they were helping move all these enterprise customers into Azure, into their cloud.
And this is right around the time that GDPR starts going into effect. It had just gone live. And so in addition to the, like, data sharing rules that they had internally, they had all these new data sharing rules, and it was just a nightmare getting data into the systems that they wanted to move it into. And so it's this enormous pain point. So Graham saw that as a big opportunity, and I don't remember how long he was there before he eventually said, this is too big of a thing, I need to start on my own product. But he eventually left and started focusing on how we could make sharing data, or moving data even within a single organization, easier.
And so I joined him and John Craft, and, you know, we were cycling through prototypes trying to figure out what's the best way to address the problem. And I think the original thinking was, we're gonna provide, like, deep tooling for data engineers, data scientists. We're gonna give them all these tools to address privacy. And at the time, we were looking at only going into assessments. Like, how can we evaluate risk? That seemed like a big enough problem on its own. And I still think it is. But the more we worked with potential customers, you know, the more we realized they actually didn't wanna deal with it. 1 of the engineering leaders was like, look, we care about privacy, but we don't wanna deal with it. We just wanna check a box.
That kinda shifted the focus from, you know, more data science tools to, you know, we wanna build a more automated system, something that is just easy to plug in, and then it actually frees up the engineers to work on other things rather than make them better at working on the stuff that we're doing. And so, really, the problem comes down to all these privacy policies, some internal, some external due to regulation, they're all falling on these data teams. And they're being asked to anonymize data or treat it in ways where they don't necessarily have the expertise or the bandwidth on the team to, you know, work on such a hard problem.
And so it's really to solve that that need. And so that kind of made it a harder thing for us because in addition to building the high quality tools, that meant we also needed to do the lift to, you know, make it super easy to use. So that was a lot of extra work. The goal is give them something where they can flip a switch, and then, you know, they can check on it. But something that integrates into their system, and then they don't have to be constantly devoted to the latest privacy methods and and risks.
[00:10:56] Unknown:
As far as the kind of definition and protection and enforcement of data privacy policies, what are some of the considerations that go into that and some of the useful practices that you've identified for people who are trying to be able to kind of check the box in that compliance regimen to say, yes. These datasets that I am responsible for fit all of the kind of regulatory requirements of saying that I have de identified these types of personal information?
[00:11:29] Unknown:
There's not really very good best practices. Like, they're not widespread. The main reason we're focused on health care at first is because they do have much more established best practices, but even then, they're very squishy. And so they basically have 2 sets of requirements in health care, and I think 1 of them is called safe harbor, and that's very restrictive. And it's simple, but you can't get very useful data out of it. And the other 1 is this expert assessment requirement. And that is also pretty squishy, but it requires, well, the idea was that you would hire a consultant to come in and help you anonymize a dataset, and they generate a report. And that report doesn't have a whole lot of specific requirements.
It just has to be done, and it has to be done, you know, by someone who's competent. And so our goal in health care is to focus on satisfying the expert assessment need while automating as much as possible. That was really kind of the thread we started pulling on: how are these expert assessments done? What type of analysis are they doing to evaluate risk? And what kind of things are they doing to treat it? And then how can we automate those?
[00:12:52] Unknown:
And as far as that element of risk, there are some things that are obvious as far as why it might be risky to have certain types of data. But what are some of the avenues that that risk gets introduced through and the types of information that might have particular categories of risk that they introduce?
[00:13:13] Unknown:
Risk is introduced into datasets through any type of identifiable information. And so most people think of that as direct identifiers, which are, you know, names, addresses, social security numbers, phone numbers, those types of things. And you absolutely have to hide those, conceal them, delete them, redact them from datasets because then then those people can absolutely be identified. And so once you've removed the direct identifiers, that is what people refer to as pseudonymous data. I don't like the term. I think pseudo anonymous is more obvious. But the problem is that now you have all these indirect identifiers, which we refer to as quasi identifiers in the dataset.
This is really anything that is an attribute of a person in the dataset that's essentially public. And it doesn't even really need to be public. It just needs to be, you know, available in data to an attacker. But, generally, we think about it as public. And the the obvious ones are your date of birth, your gender, your ZIP code. These are all things that are easy to find. For example, there was 1 study that showed that if you only had the date of birth, ZIP code, and gender of everyone in the country in a dataset, you could uniquely identify 87%, which is staggering.
And so what it really comes down to is, how are you able to combine these quasi-identifiers in such a way that they actually become unique and, in effect, direct identifiers? That's what puts you at risk for a linkage attack. But these are the more obvious ones. Really, it's any public attribute. There's this somewhat famous attack that happened with Netflix data. Netflix did this, it was like a movie recommendation challenge. So they published this dataset. There's basically, you know, this pseudonymous dataset of all their users, or some subset of their users, and all the movies that they had liked and how they had rated them. And then the challenge was: beat our recommendation algorithm.
And what someone did was they scraped IMDB and took all the movie ratings from that, and they were able to join a surprising number of people from the Netflix dataset, and they were able to identify people. Now, the consequences of being identified in this Netflix dataset are not that huge. Maybe someone would find out that you liked a really lame movie. But, like, the idea it made very clear is that, you know, in any public data, any public attribute can be used to reidentify someone.
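To make the linkage risk concrete, here is a minimal pandas sketch that measures how many records share each quasi-identifier combination; the DataFrame and the `dob`, `zip_code`, and `gender` column names are hypothetical, and this is only an illustration of the idea, not any particular product's risk metric.

```python
import pandas as pd

# Hypothetical pseudonymous dataset: direct identifiers already removed,
# but quasi-identifiers are still present.
df = pd.DataFrame({
    "dob": ["1985-02-14", "1985-02-14", "1990-07-01", "1972-11-30"],
    "zip_code": ["98101", "98101", "73301", "10001"],
    "gender": ["F", "F", "M", "F"],
    "purchase_total": [120.50, 87.00, 45.20, 310.90],
})

quasi_identifiers = ["dob", "zip_code", "gender"]

# Size of the group each record belongs to, keyed by its quasi-identifier tuple.
group_sizes = df.groupby(quasi_identifiers)["dob"].transform("size")

# Records whose quasi-identifier combination is unique are the easiest targets
# for a linkage attack against an external dataset (voter rolls, IMDB, etc.).
unique_share = (group_sizes == 1).mean()
print(f"{unique_share:.0%} of records are unique on (dob, zip_code, gender)")
```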
[00:16:09] Unknown:
So in terms of the types of information that are considered personally identifiable, there are some that are obvious, such as names, Social Security numbers, physical addresses, sometimes email addresses. I'm wondering what are some of the pieces of information that can be considered PII that aren't as broadly considered as such, or that might not be obvious targets for these reidentification attacks?
[00:16:42] Unknown:
I think this is where you get into what is the context of the risk that you're trying to address. And so, you know, for most of these examples, we're talking about public data. What could you do if you wanna release a dataset to somebody to do some analysis? And we're talking about, well, what if that person gets voter roll information or census data, and then they join that in and try to enrich the data? The less obvious ones are more internal, where, you know, maybe you have 1 part of the company where there's some very seriously private data. So, for example, health care. And that's being released to another part of the company that maybe doesn't deal with this data all the time. Maybe it's a large company, and this is their data team. And so this data team, they might have access to internal data that maybe is not as private as this data that has just been anonymized and handed to them, but it might have more information than the public has access to. And so the thing that's not intuitive is you really have to consider the background knowledge of someone who might have incentive to attack the data. There could be really any number of these things, but, you know, it would just depend on the organization. And, you know, this would be kind of semi-protected data, but not something that's public.
[00:18:12] Unknown:
In terms of being able to protect this information, the challenge is that you want to prevent somebody from being identified, but you still want to be able to perform analytics across the data. And so I know that there are certain statistical mutations that you can take, such as replacing certain first names with other first names, that aren't necessarily going to impact the information that you're going to get in aggregate, because the name isn't necessarily significant unless you're trying to do some sort of analysis on the kind of ethnic sources of names in your cohort. But I'm wondering what are the types of techniques that are available for being able to convert a concrete record into this pseudonymous reference while maintaining the statistical utility and significance of the information?
[00:19:07] Unknown:
Well, even in that case, like, let's say you only swap names in a dataset. Right? There are no direct identifiers, and you only swap names. If an attacker suspected that the names had been swapped (and attackers tend to have access to the same name-swapping tools that someone would use to do that, so you can generally figure out when names have been swapped), then they would still try to do a linkage attack, because they would be using the quasi-identifiers. But to your question, what we do is based on what's called statistical disclosure control.
And the way to think about that is, really, like, you're hiding people in groups. And so that's really your protection. The risk is uniqueness, and your protection is non-uniqueness. And so what you wanna do depends on how much protection you need. The technique, or the metric, that is commonly used to approach this is called k-anonymity. And the k value is the size of the smallest group of records sharing the same quasi-identifier tuple, to use the date of birth, ZIP code, and gender example. Right? If we had a dataset where k equals 5, then that means there are never fewer than 5 records with matching quasi-identifiers.
And so that way, you know, intuitively, you can see how if there's always duplicates of the quasi identifiers, then it's always going to be harder to know who you've linked when you do a linkage attack. There are a few ways to address it, and this is an evolving field. And so the, like, classical way to do this, and I think this is how it was pitched in the original paper, is through generalization strategies. And this is basically where you're like, okay. This ZIP code is too unique, but what we can do is we'll mask the last 2 digits of the ZIP code. And so we'll only have a 3 digit ZIP code for, you know, 20% of the records. We'll just have to generalize it. And so, you know, your data is now blurry in a way. Right?
But it's still potentially useful. And it all just depends on what level of protection do you need and then what is the distortion of the data. And so something that the census was doing, this was kind of 1 of our early prototypes, is they would swap values. And so, essentially, you know, if something is unique, if there's a unique row or unique record, you find a group. Let's say our target is 5, k equals 5. Got a group of 4. Okay. Well, let's copy the quasi identifier values to this other row. So that's 1 way to do it. That was kinda the basis for how we approach it. But you're always targeting the group size, and then you wanna minimize the distortion throughout that process.
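A minimal sketch of the k-anonymity mechanics described above, assuming a pandas DataFrame with hypothetical `dob`, `zip_code`, and `gender` columns: compute the smallest equivalence-class size, then walk a simple ladder of generalizations (masking ZIP digits, bucketing birth years) until the target k is reached. A real treatment system would weigh many more strategies, including the value-swapping approach Will mentions, to minimize distortion; this only illustrates the metric and one generalization path.

```python
import pandas as pd

QUASI_IDS = ["zip_code", "birth_year", "gender"]

def k_of(df: pd.DataFrame) -> int:
    """Smallest number of records sharing any one quasi-identifier tuple."""
    return int(df.groupby(QUASI_IDS).size().min())

# Increasingly aggressive generalizations, applied only as needed.
GENERALIZATION_LADDER = [
    lambda d: d.assign(zip_code=d["zip_code"].str[:3] + "**"),    # 98101 -> 981**
    lambda d: d.assign(birth_year=(d["birth_year"] // 5) * 5),    # 1987  -> 1985
    lambda d: d.assign(birth_year=(d["birth_year"] // 10) * 10),  # 1985  -> 1980
    lambda d: d.assign(zip_code=d["zip_code"].str[:1] + "****"),  # 981** -> 9****
]

def anonymize(df: pd.DataFrame, k_target: int = 5) -> pd.DataFrame:
    # Coarsen date of birth to a year up front, then generalize until k is met.
    treated = df.assign(birth_year=pd.to_datetime(df["dob"]).dt.year).drop(columns=["dob"])
    for generalize in GENERALIZATION_LADDER:
        if k_of(treated) >= k_target:
            break
        treated = generalize(treated)
    # Caller should re-check k_of(treated); the ladder may not be sufficient, and
    # real systems would also consider suppressing or merging stubborn outlier rows.
    return treated
```

The trade-off described in the answer shows up directly here: each rung of the ladder raises k but blurs the data further.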
[00:22:22] Unknown:
Today's episode is sponsored by Prophecy.io, the low-code data engineering platform for the cloud. Prophecy provides an easy-to-use visual interface to design and deploy data pipelines on Apache Spark and Apache Airflow. Now all the data users can use software engineering best practices: Git, tests, and continuous deployment with a simple-to-use visual designer. How does it work? You visually design the pipelines, and Prophecy generates clean Spark code with tests and stores it in version control; then you visually schedule these pipelines on Airflow. You can observe your pipelines with built-in metadata search and column-level lineage. Finally, if you have existing workflows in Ab Initio, Informatica, or other ETL formats that you want to move to the cloud, you can import them automatically into Prophecy, making them run productively on Spark. Learn more at dataengineeringpodcast.com/prophecy.
And as far as the kind of technical components that are required to manage these data mutations and data obfuscation routines while maintaining the utility of the underlying information, what are some of the types of systems or engineering capabilities that are necessary for data teams, data platforms, or data engineers to be able to incorporate these capabilities into their core runtime? So you have a data lake or a data warehouse, and now you want to be able to start introducing some of these kind of data obfuscation practices.
And just curious sort of what they might need to add to their existing tooling to make that possible.
[00:24:04] Unknown:
In the general case, you know, all you need is the data. Right? The problem is, you know, most people have their data in a data warehouse or in some data store. And so it really depends on where your data is. But the tools you have at your disposal to do these kinds of analysis might be pretty limited. I think it's generally pretty limited if it's just a warehouse. Even a, you know, data-science-oriented data store, it would have to be a very specific 1, maybe 1 of these newer ones with Python data stack type tooling. But you really just need to be able to run custom algorithms on your data. And so the way you would do it, typically, is, you know, you get it into some kind of data frame, either pandas or R or MATLAB. There's a bunch of stuff going on in MATLAB, you know, for ad hoc analysis and processing. And so you need to be able to cluster data, and you need to be able to manipulate it. The hard thing about it is it tends to be very slow.
I mean, that's not totally true, but in general, a lot of the research, a lot of the literature for treating data, these strategies, they tend to be computationally expensive. And that's the challenge that we've had. And so I think that would be a challenge for anybody. But, really, you just need a good data science stack, and you need it on top of your data. Now, as far as what we do, ours is essentially 2 phases. Step 1 is, let's get your data out of your warehouse and into our system. And then, you know, we use most of the Python data science stack in some way to assess and treat the data. And then we get it back into your warehouse, and get it off of our system.
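As a rough sketch of the pattern described here, pulling data into the Python data science stack and clustering records so that similar people can be grouped before treatment, the snippet below uses pandas and scikit-learn. The `customers.csv` extract, the column names, and the choice of KMeans are illustrative assumptions, not a description of Privacy Dynamics' actual pipeline.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Stand-in for a warehouse extract; in practice this might come from
# pd.read_sql(...) against a SQLAlchemy engine for your warehouse.
df = pd.read_csv("customers.csv")

# Cluster on scaled numeric quasi-identifiers so that records that are close
# together (similar age, household size, income) land in the same group.
numeric_qis = ["age", "household_size", "income"]
features = StandardScaler().fit_transform(df[numeric_qis])

n_groups = max(1, len(df) // 5)  # aim for roughly 5 records per group (k ~ 5)
df["privacy_group"] = KMeans(n_clusters=n_groups, n_init=10, random_state=0).fit_predict(features)

# Note: KMeans does not guarantee a minimum cluster size, so a real treatment
# step would still need to merge or suppress groups smaller than the k target.
```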
[00:26:05] Unknown:
In terms of the kind of attack vectors that are available for people who are trying to use these reidentification methods, I'm curious if there is an equivalent in the data space and data privacy domain to what application developers have in the form of the OWASP Top 10 of, you know, these are the main avenues of vulnerability that you need to worry about when you're building a web application. I'm just wondering if there's any sort of analog in the work of managing data privacy in your data assets.
[00:26:41] Unknown:
To my knowledge, there really isn't. Part of the problem is it's so abstract, you know, the risk. It's a little bit easier to understand in the differential privacy area because, like, they have a more formal definition of what they're protecting. Theirs is based on a definition of information leakage. The problem with differential privacy, though, is that, at least in order for it to be really useful, whoever is doing the analysis on the data needs to be doing it through a differentially private system. Essentially, you know, you run a query on some data, and then the summary statistics that you get back are protected with a set epsilon; epsilon is the variable they choose. And the advantage of that is these epsilon values are composable. So on multiple releases, you can just easily figure out what the added risk is, but it's inconvenient.
And so if you wanna actually share a dataset, you have to do this more complicated analysis. That's kinda what we pulled from the literature on these expert assessments. And, you know, it involves, like, a big simulated attack, essentially. But, you know, if I were giving advice to somebody on just, you know, what are the best practices, I think the main thing we're struggling with right now is just awareness of how risky data can be even if the names and Social Security numbers are gone. I think we're at that level now with anonymization technology, which is that people don't totally understand that there's risk. A lot of the mindshare, I feel like, has been sucked into a lot of stuff going on in ad tech right now, which is totally valid. Those are issues too. But that's more adjacent to what we're doing.
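For contrast with record-level anonymization, here is a toy illustration of the epsilon idea described above: a count query released through the Laplace mechanism, where smaller epsilon means more noise and stronger protection, and epsilons add up across repeated releases. This is a hand-rolled sketch for intuition only, not a production differential privacy system.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def dp_count(records, epsilon: float) -> float:
    """Release a count protected by the Laplace mechanism.

    A counting query has sensitivity 1 (adding or removing one person changes
    the result by at most 1), so the noise scale is 1 / epsilon.
    """
    return len(records) + rng.laplace(loc=0.0, scale=1.0 / epsilon)

cohort = list(range(432))              # stand-in for a cohort of 432 people
print(dp_count(cohort, epsilon=0.5))   # roughly 432, plus or minus a few
print(dp_count(cohort, epsilon=0.05))  # much noisier, stronger privacy
```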
[00:28:33] Unknown:
And for people who are trying to protect this sort of information, you know, there's the 1 approach of anonymizing the data, obfuscating it, adding in some of this kind of skew to the underlying information. Another option is to encrypt the data and then offer some avenue for decrypting information, or encrypting the query, so that you never actually expose the decrypted data and you can maintain the original integrity of the datasets. And I'm wondering what you see as the calculus for figuring out which avenue to go down for being able to protect a particular data set, and the risks and trade-offs that are involved in either direction?
[00:29:10] Unknown:
It all depends on, like, what you need the data for. Right? And also what you need to protect. So, you know, the thing that we focused on was people who need to share data. And if you need to share the data, then you have to protect it. You're protecting the output, you know, the thing that you hand off. And so the encryption-based systems, I think, would be more valuable when, like, modifying the data really isn't an option. Like, we really have to have the raw data for some reason. But, ultimately, the result, the output of that query, is still gonna be at risk, whatever it is. And so it's always gonna be a question of what actually gets released. What information can people get? Like, what additional background information can people get from data that you wanted to release for another reason? You know, there's, like, homomorphic encryption. I don't know if you were kind of hinting at that.
It's a really interesting field. At least right now, my understanding is it's very impractical for anything but very simple computations, and it's very computationally expensive. But even then, like, the output that you get, if it's rows or even if it's a summary statistic, you still have the issue of information leakage. It's a different problem, and it's not necessarily mutually exclusive to the anonymization problem.
[00:30:34] Unknown:
The homomorphic encryption question is definitely an interesting 1. I actually did a show about that a while ago, about a company that was working on scaling that to make it more practical in real-world use cases. But another approach that, yeah, I believe the folks at Immuta are doing is where they will actually allow for predicate matches on encrypted values in the database so that you don't ever have to actually decrypt the information and surface it back to the user. Instead, you'll actually encrypt the predicates to be able to match them against the values as they exist in the database. So lots of interesting areas in the encryption field.
Setting that aside, though, another avenue that's interesting in terms of this deidentification and obfuscation approach is how you are able to evaluate the level of risk after the data has been obfuscated, particularly as you start to add in additional sources of information that become available. Because I know that there is a case I'm forgetting the details, but I believe it's relatively well known where particular dataset had been de identified. It was considered to be safe to release to the public, and so they actually produced this. I believe it was a dataset for research purposes that included medical information.
And then once an additional dataset was made available publicly, somebody was actually able to create 1 of these linkage attacks to go back and create these links to say, okay, you know, this information that's supposedly de-identified, I can actually say exactly, you know, who this medical professional is based on this information that I have available. And so I'm curious how situations like that can be identified and mitigated, and some of the ways that you can understand how your level of risk changes over time as you start including more datasets?
[00:32:40] Unknown:
It's a super messy problem. Right? Because you can't unrelease data. Not really. Like, Netflix took down that dataset. Right? But if you really wanted to get it, you could get it, I'm sure. It's 1 of those things where data only ever gets added. I think it really depends on how you wanna think about your risk. So say you're a company that has data that's only relevant for a certain period of time. Right? Well, they could take that into account. But, you know, people live only a certain number of years. Right? So that's also something else you might wanna take into account. Now, what we do, at least out of the gate, is we take a super pessimistic approach to our risk assessments.
And so there's a bunch of different ways that you can kinda angle your, like, attack analysis. And, you know, it depends on what's the assumed background knowledge of the attacker, what is the size of the dataset that they have, what's the percentage of people that's in it. And so we generally are very conservative. We just are super pessimistic. We're just trying to not understate risk to our customers. And that's how we're approaching it. And, in fact, like, a lot of the literature that we're seeing coming out now is about how to loosen that, like, what are ways to do that, because most of the time, it's way overkill.
And we know it is, but we don't want to tell a customer that something is safer than it is. And so I think that the problem that you pose is nearly intractable. And so we just are super pessimistic, and we're gonna start from a really conservative angle. And then, as we get more comfortable, as we get a better understanding of, you know, what are the practical implications of these things, or what our customers' risk appetite is, then we're gonna try to start exploring these ways to actually kinda do the opposite and dial it back. Like, just for example, 1 of the things you can do is use a population estimator.
You know, if you can estimate the size of the population of the people in your dataset, then you can more accurately assess the risk. Typically, it's worst case. And so larger populations are, you know, harder to attack. And so we're kind of taking the opposite approach and then loosening the screws slowly in the other direction.
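To make the "pessimistic by default, then loosen" idea concrete, here is a toy comparison (not Privacy Dynamics' actual estimator) between a worst-case re-identification risk, which treats the released sample as the entire population, and a risk adjusted by an assumed sampling fraction; the quasi-identifier columns are hypothetical.

```python
import pandas as pd

def reidentification_risk(df: pd.DataFrame, quasi_ids, sampling_fraction: float = 1.0) -> pd.Series:
    """Per-record risk approximated as 1 / (estimated population group size).

    With sampling_fraction = 1.0 (the pessimistic default) the sample is treated
    as the whole population, so a record that is unique in the sample gets risk 1.0.
    A smaller sampling fraction inflates the estimated population group and lowers risk.
    """
    sample_group_size = df.groupby(quasi_ids)[quasi_ids[0]].transform("size")
    estimated_population_group = sample_group_size / sampling_fraction
    return (1.0 / estimated_population_group).clip(upper=1.0)

# Hypothetical usage:
# df["risk_worst_case"] = reidentification_risk(df, ["dob", "zip_code", "gender"])
# df["risk_adjusted"] = reidentification_risk(df, ["dob", "zip_code", "gender"], sampling_fraction=0.05)
```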
[00:35:12] Unknown:
And so now digging into the specifics of what you're building at privacy dynamics, I'm wondering if you can talk through some of the ways that it's designed and implemented and some of the architectural and scalability considerations that go into the way that you approach this problem?
[00:35:29] Unknown:
Fundamentally, architecturally, it's a Kubernetes cluster. That is because we're providing cloud and enterprise tiers. This is something where, you know, people are very concerned about their data. We knew out of the gate we wouldn't be able to just offer a cloud-only service. So that's kinda how we handle both the cloud and on-prem, which isn't really on-prem anymore, but, you know, virtual private cloud, essentially. Then the actual implementation itself is, for the most part, a Python monolith design. And then we have some of these, you could call them microservices, but they're these smaller services that serve the monolith.
And the data science core is all built in, for the most part, all Python data science stack. NumPy, pandas, we've used some clustering algorithms, some scikit learn, just like a handful of these, you know, really common data libraries. Scalability was a problem from the beginning. Right? Because the stuff that we're doing is very computationally expensive, so we're always looking for ways to optimize, and it's a long term project for us. Like, we have a long road map of scalability improvements for the data science. But 1 of the things is just simply you have to be able to load stuff in memory. It requires a lot of memory.
So in addition to all the heavy lifting you have to do to do these data transformations, data science tooling is designed to, like, load everything in memory and then operate on that and then make copies of it and do that kind of thing, which is just not scalable. And so it's a multi step process for us. And, you know, we're only part of the way there. So we're gonna be making all these improvements to the data science core. But 1 of the things we're doing is we had to build a job system to schedule jobs because we have limited resources. This isn't this isn't something that you can easily just fit into a Lambda function and just, like, you know, let the cloud absorb your computation as it goes out.
We plan to leverage those types of services, but it's very hard. And so now we're focused on scalability: how can we limit memory usage, have knowable, you know, computational requirements, build our job system, and then have that thing scale using Kubernetes to address whatever load requirements we have at a given time. To speak to 1 of the challenges, 1 of the biggest challenges that we found was making it possible, even using Kubernetes, to deploy for on-prem customers. Like, even once you have a Kubernetes system, it is nontrivial to get that system onto a customer's system in a way that is manageable.
And to use it also to serve a cloud service that has different scaling requirements, and then have all the kind of satellite microservices we have around it be able to operate with both. It was a much more complicated design and implementation problem than we ever imagined from the start.
[00:38:42] Unknown:
Going back to your point about starting from a very pessimistic position on the level of risk associated with the dataset even after obfuscating it, I'm wondering how you think about exposing that dial of how extreme you want to go in modifying the data, and how you expose the trade-offs to your end users: you can, you know, dial it up to 11 and we will replace all the values with something that we fabricate, versus we're actually going to maintain the original information but we're just gonna shuffle it around a bit. And how does that impact the security of the dataset after the fact, and being able to validate your assertions about what level of impact it's going to have?
[00:39:34] Unknown:
This really gets to the core of how our system works, which is our treatment algorithm. It is scalable in privacy terms in the sense that you give it essentially a group size requirement, a k value, and it will hit that target. And it will do everything it can to minimize distortion in the data based on the target you give it. That's really the core of the treatment system, and then that's buttressed by our risk analysis and then our distortion analysis. And so this is really kind of the complete picture, which is the privacy-utility trade-off. And so we do this risk analysis, and, you know, there's a top-line score and then other metrics, right, to kind of dig into it and understand it. And that tells you the privacy that you have. And then we have data distortion analysis, and that shows you what the utility is of your data. And we give a breakdown of that. So whoever has to use the data can go look, and they could say, you know, how has the distribution changed on the age column in this dataset? Or how many cells have actually changed?
Is it less than this many? Okay. Then we don't care. Okay. It's more than this threshold. Now we need to drill down. How meaningful has it changed? And so what we try to do, this is something that the the treatment system is really more of a platform that we we're building on. And so we try to do things where we try to group similar data together. This is part of the clustering aspect of it. And so the idea is to minimize the distance between the values that we have to swap. So we wanna change someone's age from 31 to 35, not 31 to 61.
And so we're always improving the system so that we can make better decisions and reduce distortion for any given privacy target, but also to scale it in other ways that aren't necessarily about increasing utility. Like, some customers may want a traditional generalized record. Right? I can imagine some case where you're like, well, the swaps, you could think of them as like synthetic data. It's not quite synthetic, but maybe that's a good way to think about it. But maybe you say, okay, well, that's not right, that's not true. This data needs to actually reflect exactly 1 to 1 what was here before. And so then you could use a range. But in another case, you could potentially reduce distortion depending on the kind of analysis you intend to do. If, rather than swap a value with, say, the median to get the lowest distance of change, you swap it with the average, then maybe all the values change, but you get a more accurate summary statistic for the type of analysis you're doing. And so it's these types of things we wanna build in to, you know, allow the user to experiment with their data, figure out what risks they're comfortable with, understand it, and then also see, you know, what utility is good enough for what I need.
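A minimal sketch of the kind of distortion report described here, assuming `original` and `treated` are pandas DataFrames with the same columns and row order; the specific metrics (share of cells changed, shift in mean and standard deviation) are illustrative stand-ins for whatever a real utility report would surface.

```python
import pandas as pd

def distortion_report(original: pd.DataFrame, treated: pd.DataFrame) -> pd.DataFrame:
    """Per-column summary of how much the anonymized data differs from the source."""
    rows = {}
    for col in original.columns:
        row = {"cells_changed": (original[col] != treated[col]).mean()}
        if pd.api.types.is_numeric_dtype(original[col]):
            row["mean_shift"] = treated[col].mean() - original[col].mean()
            row["std_shift"] = treated[col].std() - original[col].std()
        rows[col] = row
    return pd.DataFrame(rows).T

# Hypothetical usage: flag columns where more than 10% of cells were modified.
# report = distortion_report(original, treated)
# print(report[report["cells_changed"] > 0.10])
```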
[00:42:41] Unknown:
And this is probably a whole other episode's worth of conversation, but I don't know if it's worth touching on the approach of data obfuscation versus differential privacy, or maybe how the 2 relate to each other. I don't wanna dig in too much because I'm probably not qualified to do a deep dive on it.
[00:42:46] Unknown:
But, you know, data obfuscation, it kinda depends on what you mean by data obfuscation. Right? Our opinion on this is that differential privacy is best in a system where the person doing the analysis, writing the queries, is operating on a differentially private system. From our perspective, that's fundamentally different from the type of problem we're trying to solve, which is people need to share the data because people wanna use their tools. They wanna use their analysis tool and, you know, their BI system.
And, you know, it doesn't work if you are taking that control away. Now, that doesn't mean that it's not very valuable, and in fact it's something we might offer. Out of the gate, we wanted to do this: how can we get the most value to the most customers? And also, data obfuscation, I would consider that more of an umbrella term. There are other people who do perturbation to data, where they're not doing k targeting like we are, but they'll say, okay, I'm gonna take these dates and I'm gonna wiggle them with some random values.
Now I'm not quite sure how you would assess risk in that case because you could still end up with a lot of unique data. So we would probably need some other kind of risk analysis, but that's how I would see those kind of fitting together, just based on use case.
[00:44:16] Unknown:
In terms of the actual workflow of using Privacy Dynamics and integrating it with an existing dataset to be able to de-identify the information and generate a shareable data asset, I'm wondering what that looks like for being able to handle the kind of integration path. And then, particularly once you're connected up to the source system, how you approach the heuristics of understanding which columns or which fields or which tables might have personally identifiable information, so that you can then understand what types of mutations you need to implement on top of the source data set?
[00:45:03] Unknown:
On a high level, we connect to a data warehouse. We provide these connectors. We're trying to build as many useful connectors as we can. And so we integrate into kind of an ELT pipeline, typically at the beginning or at the end. And then we'll look at a source table or a query, and then run it through our system and then pipe it out to a target table, or the same table in a different schema, or something like that. The PII detection problem is a hard problem because it's another 1 of these messy problems where you have to use heuristics. Right? So this is another 1 of those things that I don't think is ever going to get perfect. And so it's just constant incremental improvement.
But the problem that we have to solve is, essentially, we wanna know what are direct identifiers and what are quasi-identifiers. And so the first thing you do is pattern matching. And I've seen people online, who have to solve certain types of problems like this, kind of scoffing at pattern matching. It really is, you know, as a first pass, the absolute best way to find data or to categorize data. Most people have names on columns that are meaningful. You know, we do a lot of that. But then our system tries to be a lot more clever.
Essentially, we use a lot of patterns, and patterns have weights assigned to them. We also look at the data. You know, we look at the column name and we look at the data, and we combine the pattern matches on the column names and on the data values to give us a confidence score and decide whether or not we think that that's, you know, someone's name or, you know, a Social Security number. But, like, normal values are the easier ones to do, like a credit card. Right? You know, there's certainly a lot of 16-digit numbers that would be false positives on credit cards, but credit cards have these CRC checks that you can do. And so if every value passes the check, even with no column name, you can be very confident it's a credit card column. And so we do that, and those are heuristics. And, you know, I fully expect that over time, as we encounter more and more real-world data, we will have to be making constant tweaks to make that more accurate. But 1 of the things that we do is we use another heuristic. It's a categorical heuristic. And this is, again, part of our pessimistic view. And that is, we train a model. It's not real complicated. It's a decision tree model.
And we train a model on some data to try to guess whether a column is categorical. So if we don't know what the type is, we say, is this a category? And then categories we consider quasi-identifiers. Now there are some cases where you wanna turn that off. Not that it's automatic, but you wanna say, no, something's just an attribute. Right? Something's actually a value you don't wanna protect. This is the value you're studying, not an attribute of the person. But generally, that's a little bit more permissive, and it tends to be pretty good. And it's absolutely necessary because there are a lot of cases where people have categorical data encoded as integers. And so we have to have a good way of determining, are these actually categories?
And, you know, if so, we need to protect it.
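To illustrate the kind of value-level check described above, here is a small sketch of the checksum commonly used to validate credit card numbers (the Luhn algorithm, which is presumably the "CRC check" referred to here), combined with a naive confidence score that also weighs the column name; the weights and patterns are made up for the example.

```python
import re

def luhn_valid(number: str) -> bool:
    """Checksum used by credit card numbers (the Luhn algorithm)."""
    digits = [int(d) for d in re.sub(r"\D", "", number)]
    if len(digits) < 13:
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:      # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def credit_card_confidence(column_name: str, values) -> float:
    """Naive confidence that a column contains credit card numbers."""
    name_score = 0.3 if re.search(r"card|cc|pan", column_name, re.I) else 0.0
    checks = [luhn_valid(str(v)) for v in values if str(v).strip()]
    value_score = 0.7 * (sum(checks) / len(checks)) if checks else 0.0
    return name_score + value_score

print(luhn_valid("4111 1111 1111 1111"))   # True: a well-known test number
print(credit_card_confidence("payment_card", ["4111111111111111"]))
```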
[00:48:22] Unknown:
And I imagine that beyond just these heuristic and self-discovery processes, you have an avenue for people to be able to go in and label specific fields as: this is something that needs to be processed, and these are the types of transformations that need to be executed on them, or, you know, this is the underlying data type.
[00:48:42] Unknown:
Yeah. We're trying to make it as automatic as possible. And, you know, like, the long-term goal is to kinda have, like, layers of configuration where, you know, you've got basic settings, and then if you wanna tweak it, if you have more specific tweaks you wanna do, you do that. And so we're really starting from the top. But, yes, absolutely, that's our intent. You know, we wanna get away from having people specify a specific transformation in certain ways, because we can probably do better if we just know this is a date. You know, maybe for some reason we couldn't identify this as a date, but you tell us it's a date. Now we can do more customized treatment on it. And that's the important thing, is you get the best utility and protection.
[00:49:26] Unknown:
And so in terms of the applications of privacy dynamics and the overall space of data de identification for the purposes of sharing, what are some of the most interesting or innovative or unexpected ways that you've seen either the privacy dynamics product or the principles that you're applying used in your experience?
[00:49:49] Unknown:
I'll go with the unexpected 1, just because we've just launched our product. And so what we didn't expect was that customers might use our anonymization system to not anonymize data. And so this is something that I just kind of built into the core of the system: if there's data that can't be anonymized or shouldn't be anonymized, it blows up, and it has an error, and it doesn't work. But it turns out we have customers who wanna copy everything from, you know, 1 location to another. And then some of the data either shouldn't be treated or can't be treated, and so it locked it all up. Or, you know, tables that didn't have any rows in them. Essentially, they wanted to use us as a data replication service that also anonymizes where appropriate.
It makes total sense that someone would wanna do that. Right? But, you know, we were so focused on the anonymization use case. We didn't think that, you know, obviously, someone would wanna do this.
[00:50:49] Unknown:
And in your experience of working on privacy dynamics and exploring this overall space of data privacy, What are some of the most interesting or unexpected or challenging lessons that you've learned?
[00:51:00] Unknown:
For me, there's this kinda startup element of engineering for product-market fit, because it's very unintuitive to a software engineer, simply because you have to go against your instincts a lot of the time on how to build things, because you're trying to prove a use case. You're not trying to prove that your, you know, high quality system is going to be scalable, because you don't have time. The whole point is to try to discover what someone wants. Build a stable demo that, you know, does most of the things that you need, but fully expect to have to throw this thing away in a few months if you need to pivot in some other direction. That was something that was really challenging for me. And if I had to do it again, I don't know that it would be any more comfortable. You know? I would just have to remind myself that, you know, you have to shoot for the goal you're aiming for at the moment.
[00:51:57] Unknown:
For people who are interested in being able to apply some of these anonymization techniques to their data and be able to satisfy some of these risk or regulatory requirements in order to be able to share these datasets, what are the cases where Privacy Dynamics is the wrong choice, either because it doesn't have the proper set of integrations available, or it doesn't meet a particular use case or handle particular data types?
[00:52:28] Unknown:
Right. So we're not a compliance tool. So we're not set up to specifically help you meet certain regulations. We're a privacy tool. We don't deal with data access control and reporting on provenance and usage. And especially, you know, if you're an ad tech company who needs to identify individuals for marketing purposes, we're not a good choice for that because we kinda do the opposite. Right? So there are a lot of companies who wanna satisfy GDPR or CCPA, but they also wanna still be able to uniquely identify individuals for marketing.
That's not us.
[00:53:09] Unknown:
And so as you continue to build out the privacy dynamics platform and now that you are in the early stages of launching the product, what are some of the things you have planned for the near to medium term or any particular projects that you're excited to dig into?
[00:53:24] Unknown:
I'm really excited about all the stuff we get to work on next. This is some of the most fun, which is the first is digging into the data science core and, you know, really getting to do a lot of analysis of our own to figure out, you know, ways to improve privacy and utility, kinda push the edge of that curve just to generate the most high quality data we can, and also improve automation. You mentioned you were talking about these heuristics, you know, for PII detection. That's something that is going to need a lot of data and a lot of analysis. We're all in on, you know, hitting all the major data types that we can.
And some of the most fun stuff, for me anyway, is system performance. Figuring out how to make our algorithms faster, how to handle more complicated data, streaming data, bigger datasets. You know, we'll probably have to do a lot of retooling on the algorithm to handle datasets that are just so much bigger than memory, and figuring out how to, you know, scale more granularly with user demand. You know, that's 1 of those more classical data engineering infrastructure problems. And then, kinda a little bit further down the road, like, we already have some dbt integration with Privacy Dynamics. You can plug into a dbt repo and pull down models and use them directly in Privacy Dynamics.
We wanna go all in on dbt integration. And so that's kind of 1 of the next big projects, is really integrating into the modern data stack, handling a lot of these more advanced use cases where people want to include privacy and anonymization as part of their pipelines, you know, the graph for their data processing. And so we see huge opportunities for people to get a lot more value if we can leverage that. That's something we're really excited about.
[00:55:27] Unknown:
Are there any other aspects of the work that you're doing at Privacy Dynamics or the overall space of data privacy management that we didn't discuss yet that you'd like to cover before we close out the show?
[00:55:44] Unknown:
The only thing I can think of is, you know, the way we had to address this. This might be too nuanced to dig in on, but, like, the way we approached the system was we have essentially assessment and we have treatment. And so for the assessment, that's where we were very conservative, both in how we approached implementing it to begin with, which was we tried to pull straight from the literature and not take many liberties other than to tailor it to fit the data. That was important because we didn't want to blaze any new trail for the risk assessment. But then on the treatment, we wanted to go beyond what was available from the literature and be more experimental, to try to get better utility.
And that was an interesting balance, and I thought it really paid off for us because, you know, it allowed us to kinda branch out and try new ideas. And then we had this stable kind of established technology or analysis practice that would keep us honest. And so, you know, if we ever did anything that would increase risk, we would know. That was really interesting to me about kind of 2 important pieces of the system that we approach in different ways with respect to prior art and research.
[00:57:03] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:57:19] Unknown:
Probably the thing that I find missing the most, and this is a very self-serving thing to pick on, I know there are companies who are definitely trying to address this, is moving data around. And this is kind of an old problem. Right? Like, there are more and more databases, and every database has its own little design that's specific: we do these types of queries the best, we store data this way, we handle these use cases. But you always have to move data around. So for us, what that has meant is we need to build these connectors. In order to make this stuff easy for our customers, we have to provide an out-of-the-box connector so they can just say, oh, you use Snowflake. Okay. Enter your credentials. Boom. You're connected to Snowflake.
These connectors aren't enormous lifts, but they're not trivial, especially when you need to essentially download the entire table or an entire database. So I think data movement, it's always been a problem, I guess. So I don't know if that speaks to how there will be a great solution to it, but I certainly think that more convenient tools, different layers of the stack too. Right? So not just kinda high level connectors where it's just kinda moving something from syncing databases, but then also down low for people like us where, you know, we just wanna hook in and get things into a data frame. That's a lot harder than I expected it to be.
[00:58:42] Unknown:
Alright. Well, thank you very much for taking the time today to join me and explore this broad space of data privacy and some of the risks associated with leaving data as is, as well as mutating it. It's definitely a very interesting problem space and an interesting approach that you're taking at Privacy Dynamics, and it's great to see more people contributing to solutions in this space. So I appreciate all the time and energy you're putting into that, and I hope you enjoy the rest of your day. Thanks, Tobias. I really enjoyed speaking with you. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Guest Introduction
Will Thompson's Background in Data Management
Core Discussion on Data Privacy
Privacy Dynamics' Focus and Product
Defining and Protecting Data Privacy
Techniques for Data Anonymization
Technical Components for Data Privacy
Attack Vectors and Risk Assessment
Privacy Dynamics' Architecture and Scalability
Balancing Privacy and Utility
Integration and Heuristics for PII Detection
Applications and Lessons Learned
When Privacy Dynamics is Not the Right Choice
Future Plans and Projects
Final Thoughts and Closing Remarks