Low Friction Data Governance With Immuta

Hello, and welcome to the Data Engineering Podcast, the show about modern data management.

When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode.

With our managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Packaderm, and Dagster.

With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.

Go to data engineering podcast.com/linode

today. That's l I n o d e, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.

Feature flagging is a simple concept that enables you to ship faster, test in production, and to do easy rollbacks without redeploying code.

Teams using feature flags release new software with less risk and release more often.

Config Cat is a feature flag service that lets you easily add flags to your Python code and to 9 other platforms.

By adopting Config Cat, you and your manager can track and toggle your feature flags from their visual dashboard without redeploying any code or configuration,

including granular targeting rules.

You can roll out new features to a subset of your users for beta testing or canary deployments.

With their simple API, clear documentation, and pricing that is independent of your team size, you you can get your first feature flags added in minutes without breaking the bank. Go to data engineering podcast.com/configcat

today to get 35% off any paid plan with code data engineering or try out their free forever plan. Your host is Tobias Macy. And today, I'm interviewing Steve Thao and Stephen Bailey about Immuta and how they work to automate data governance. So Steve, can you start by introducing yourself?

Yeah. Hi. I'm Steve Tao. I'm 1 of the co founders at Immuta and also the CTO.

And Steven, how about you? Hi, Tobias. Thanks for having us. I'm Steven Bailey. I'm the director of internal analytics at at Immuta.

And going back to you, Steve, do you remember how you first got involved in data management?

This story is actually somewhat similar to why we started Immuta too, but it dates back to originally, I was an analyst

with the some of the US intelligence community and the military.

And we

obviously did a lot of analytical work on very sensitive data,

and we found ourselves

kinda struggling with the same problem over and over again. And how do we

enable the best analytics that we could possibly do with the data that that we're collecting, but at the same time,

enforce the appropriate controls

that are necessary

to protect how that data is being collected, but also, you know, follow US guidelines on how to handle data.

So it's been a long road of dealing with that that has eventually led me to being at Immuta and starting the company.

And, Steven, how about you?

Thanks, Tobias. I've been in a number of data science positions over the past 10 years or so. And often, I was operating in a solo capacity. So I got really used to the process process of starting a new data project, finding the data, ingesting it, storing it, and then realizing value from it. And it was really in my PhD where I was looking at biomedical image analysis

that I really became enamored with the, I think, complexities of data engineering and data management.

We were analyzing

very complex

medical imaging datasets, so 3 or 4 dimensional images across multiple subjects. And it really became

clear to me during that time that if you didn't get the data management side of things right, then

you weren't able to answer the really critical, important questions that you set out to answer in the first place.

And then, you know, the other nuance to that is that if you aren't able to

manage the data efficiently

and in accordance with the ethical and legal considerations that you're sort of bound to, you know, those were especially

strict in a biomedical

research facility,

then you also couldn't get to the insights. And so when we transitioned to Immuta, I really saw

an opportunity

and a gap here in the tooling

for marrying

the

data management side of things with effective governance and then also with the human components and the decisions that ultimately are driving all of those decisions.

And so digging a bit more into Immuta, can you give a bit of an overview about what it is that you've built there and the motivation for starting the company?

I touched a little bit on the motivation. We just thought to ourselves, hey. There's gotta be a better way

to

be able to understand

where your data lives, how to essentially release that data appropriately to analysts

in a way that doesn't require,

you know, humans having to spend all their time doing that work? How do we add automation around that? How do we make that more scalable?

And that's really what the product is about. I like to think

of Immuta as a way to take your ideas on not necessarily ideas, but concrete goals on how you want to

protect your data, but also where you wanna potentially take some risks and and what your risk thresholds are and apply that to how you release data to your analysts.

And to me, it's really about

if you do this right, and Steven alluded to this, it's not about

restricting access to data. It's about getting more access

And

And before we get too much further into Immuta itself, I'd like to get your perspective on how you see the definition of data governance because it's a very broad canopy of things that all need to work together.

And I'm wondering from your perspective, particularly as a business that's focusing on

particularly

the enforcement

and security aspects of it, what do you see as being data governance? And how much of the solution is technological,

and how much of that still needs to be delegated to the human factors of process and communication?

Data governance. Like, what is data governance part? I actually really hate that term, to be honest, even if it is in our slogan.

Because as you mentioned, it is a very loaded term, and I think that's a problem

in this space, frankly, because when people say say data governance, if you ask 10 different people that question, you get 10 10 different answers.

My answer to you is

what I think it is is is how do you

understand

the data that you have, how it's being used, how you protect it, and how you enable analytics as efficiently as possible.

And, you know, again, that's what our our product strives to achieve.

But, you know, I'll defer to Steven because he's, you know, been using

our product to manage our own internal data to to talk through some of the

process first technology question.

Yeah. Totally agree with Steve that the data governance word is so overloaded. It has a lot of baggage associated with it. We've tried internally to come up with a better word for sort of the set of practices

that you need to have in place to manage your data effectively. And data governance does seem like the best word, but there's just so much baggage with it. I think what's exciting about right now is that there does seem to be

a set of tasks or capacities that people are starting to agree on that need to be in place in the data pipeline. So things like data quality control,

metadata management and capture, some data observability, and then, of course, our favorite, access control and and security and entitlements.

I think what I'm looking for in a tool, like, for our internal organization, what I'm always looking for is something that's gonna let me, as the analyst or the as the human,

do human things better and

let

our pipelines and our tooling do the automated things better. And 1 of the things I enjoy least about managing a data pipeline is setting up, for example, Snowflake roles and schema management

and

integrating my identity management system with the tool

and and thinking through permissions like that type of stuff. That's not what I'm good at, and I don't think that's what humans are good at, And it's all very automatable

if you have

a blueprint in mind, like a main policy in mind.

Yeah. And that's evidenced by the number of different breaches that have happened because of people thinking that they had the right policies in place and then realizing that it was actually ineffective for a particular edge case or access pattern that they didn't even consider.

Yeah. Exactly. No one's policy is to leave customer data publicly accessible. But in effect, a lot of people

have that policy in place on some important data.

Digging a bit more into the topic of data governance,

what do you see as being the current landscape of solutions for addressing some of these different governance problems, particularly

in terms of the access control and entitlements that you're focusing on? And what are some of the motivating factors that you see as being the reason that somebody would choose to use Immuta over any of the other either open source or proprietary tools that exist?

So in terms of the other tools in the space, you know, I think a lot of teams are looking for a data catalog solution. So there's a lot of products going on around catalog data discoverability and cataloging.

I think in a lot of cases, people are relying on the built in access controls

and systems to the applications that they're working with for enforcing entitlements and security

and policy enforcement.

But I think the challenge is that

data teams,

generally, I think, are going through a process of decentralization. So whereas

you could really just leverage your data warehouses or your data lakes access access control systems

and manage that 1 system pretty effectively.

Now it's very easy for a line of business to to spin up their own Snowflake accounts,

their own Looker or Mode application

and or have a couple of databases that they're pumping live production data into. We're seeing this proliferation of of production data assets across the organization, and that's really presenting a challenge

with the traditional approach of just using the built in access control layer in your application.

So what we're seeing across a lot of our customers is that 1 thing that is very appealing to them is having a centralized

enforcement

layer that they can define policies that then

propagate out to their database systems. And so you can define your policy once, you define it in a very expressive language, and then that policy gets enforced

down on their different databases.

Yeah. And the access control question is definitely 1 I wanna dig into in a little bit. But in terms of the overall

ecosystem of data tools and where Immuta sits within that space, I'm curious if you can dig a bit more into

how Immuta integrates with the tooling and the storage layers and the compute that people are using for trying to do analysis and just what the surface area looks like in terms of what you are trying to build to be able to fit well with people's data platforms and their, you know, the flexibility of compute that they might be trying to bring to the problem?

A good way to think about Immuta is think about it as this metadata layer that kinda sits outside from your data,

almost like a metadata aggregator. And that metadata information is not only useful for understanding,

where your data exists, what data is sensitive,

but also you can leverage that metadata to build policy.

And when I say metadata to build policy, I'm not just talking about

data metadata, but also

information about your users.

So typically, we wanna see the separation of user definition from policy definition,

which really gives us a lot of our scalability.

I think we'll talk a little bit later about ABAC and how Immuta leverages that. But without getting into all those data that detail here,

essentially,

you can pull in all that information about your users and your data assets

and build policies separately in a scalable way. And then since that all is kind of abstracted, if you will, from where your actual compute and data lives,

those policies can get pushed down into that compute layer at runtime so that the user

interacting with the data can interact it with it like they always have. Immuta is just invisibly enforcing that policy

at runtime uniquely for that user that's interacting with the data. So our goal as a company is

to allow

a customer to pick whatever compute they want, you know, BYOC, bring your own compute, and Immuta will be able to enforce your policies consistently

across any of those and multiple of those. I mean, a lot of our customers,

for example, will be using Databricks and Snowflake at the same time. They can build a single set of policies in Immuta and have them enforced

consistently in both places because we are that metadata abstraction where you build the policy.

And then for the broader set of responsibilities for data governance, where it also comes into the policy definitions and understanding

what data you have that is sensitive and understanding the lineage aspects and

for somebody

for somebody who's using Immuta and who then needs to also

cover all of the other aspects of data governance that aren't necessarily

built into the Immuta platform itself. Briefly alluded to this. So we have platform itself?

Briefly alluded to this. So we have interfaces

that organizations can implement

to suck in

things like, you know, if they've already created a business glossary in Calibra,

you know, we could pull that in, and you could continue to treat Calibra as your source of truth,

for example, but then use Immuta to build policy against that business glossary that you've established.

So we don't want to kind of interfere with the existing workflows that might exist in an organization. It's more about operationalizing

those existing workflows that you might have. I briefly mentioned ABAC or attribute based access control for those that haven't heard that acronym.

Most

data enforcement today happens through role based access control,

which

really means that you conflate who your user is with what they have access to. And when you do that conflating,

you essentially

create role explosion in your organization. So 1 of the workflows that we try to break is this role explosion problem,

where

if you're able to separate who the user is from and define who your users are based on who they actually are, not what they're supposed to have access to, then define those policies separately,

you get a lot more flexibility and scalability,

which is

something that our customers find very, very valuable.

And, you know, it's not only about policy,

but also we have a concept of purposes

where we can define those separately too. So it's not just about who the user is, but what they are doing, which is really relevant

to the existing

regulatory

controls that are out there today, such as GDPR and CZPA.

And, Steven, do you have do you have some more you wanted to add to that?

Yeah. Definitely. I think 1 of the workflows that really is a prerequisite for

getting the benefits of Immuta is the exercise of an organization going through and really defining what their policies are. We are

trying to get to a place where we have more standardized policies.

A lot of companies treat personally identifying information pretty similarly. Right? They wanna mask it or redact it. And the policy, therefore, can be written in a way that's very generic. Right? It doesn't really matter what schema or what source the data comes from.

What matters is whether an attribute is personally identifying or an indirect identifier. But an organization has to find those things upfront.

We do find that

it's a challenge to think through

a generic approach to

protecting your data in an organization.

It's much simpler to simply say, you know, this group has access to this data and this group has access to this data. And so in the moment, those decisions are very easy.

But if you don't put that upfront cost in defining your policies in a scalable way, then you have to make those ad hoc decisions every time you onboard a new data source from a new system or you add a new group in your organization or you restructure your organization.

So having that discussion upfront and coming up with a good enough

metadata vocabulary

is definitely a workflow that we see people having to adopt.

And 1 of the other interesting aspects of this problem is

the

definition of what constitutes sensitive data because that can be very different based on the industry that you're in or the regulatory regimes that you might fall under or the

ways that that data is being used or aggregated where if you're

using the information, but it feeds into a machine learning model where that information is ever actually going to be exposed, it's very different than if somebody's building a business intelligence report that might then be published as part of their quarterly earnings or something like that. And just the the responsibilities

of how that data needs to be protected and controlled and also the fact that when you collect certain set of information

at 1 point in time,

it's not necessarily classified as sensitive, but then because of either changing

environments as far as

the laws and regulations that you're subject to or

changes in terms of the nature of your industry or the business that you're in or maybe acquisitions,

what was not sensitive at 1 point may become sensitive down the road. And so I'm curious how you

try to

approach that kind of evolution of what is sensitive and how people are accessing it and using it and understanding the intent of the data and the way that it's being used beyond just the static aspect of what is contained in, you know, column a and row b.

This touches a little bit on on what Steven was just mentioning, which

you really need to pull apart the policy from the data at the physical layer and really focus on how do you build policy at the logical layer.

And this allows you to manage,

you know, just using a silly example, but relevant is instead of saying, I want to mask the address and last name column, you could say something like, I wanna mask anything that's PII.

And then you could extend that policy later

and say something like, I'm going to mask PII except when the user is acting under purpose, you know, HR or whatever. And you could define those purposes separately,

and you could add other exceptions based on, you know, potentially some training courses that the user has taken. Or if a new column is now deemed PII for some reason

where it wasn't before, you simply add that tag and the policy will automatically propagate itself

to that tag. And as I mentioned earlier,

those tags on data could come from several different places

or multiple different source systems

where Immuta is simply acting as that aggregator

to be able to take all that information you've gathered across your organization or potentially that Immuta has self discovered for you, because we have capabilities to discover sensitive data as well, and use that to drive where your policies get enforced. So it's really this idea of separating

the policy from the physical data, which is key to being able to handle

these changes in both rules and how you think of data and the different purposes that you've been processing it. The other paradigm shift I think we're going through is

from a model of release and forget

where I can prepare a dataset and then just send it out there for the business, and then I can kind of, like, just let it live out there forever

to

a more man actively managed model where I'm gonna release a set of data. I'm going to, you know, understand

what processes are feeding this particular release. I understand who's accessing this release,

and I can report on whether it's up to standard or not, and then I can terminate access to it if if I need to. You know, having that record of what's been put out there and and keeping

your compliance policies

sort of at the forefront of

how you're

managing your warehouse is a much more proactive approach to

managing risk of, you know, individuals'

data leaking or

individuals being reidentified than than a model where you kind of apply all your policies on the front end and then just, like, let whatever downstream

activities happen.

That's a really good point that I don't think we've really

fired home yet is that

all our policies are completely dynamic. It's not

as if we are

in the transformation

phase creating anonymized tables. It's that we are enforcing policies real time.

So if you change a policy, you do not need to change your data. It will act and be enforced.

And so

the

and to Steven's point, the fire and forget, I mean, that's how a lot of people think about data sharing today where I'm gonna create this anonymized set, and then I'm gonna share that copy

with somebody.

And the dumb analogy I use is that's kind of like the blockbuster

way of sharing data,

where we're more the Netflix way of sharing data. You can, you know, point people at your data and have policy be enforced and completely audited.

And then, you know, you might change a policy

5 seconds ago, and that will immediately be implemented in protecting your data the way you wanted it to for those people that you've shared it with.

Yeah. The question of data proliferation

and data copying is definitely an important 1 in the governance space. Because as you said, if you publish it and then you decide, actually, you need to retract or obfuscate some aspect of it or maybe a new technique has come out for being able to de identify information, or there's a new dataset that's been made public that will allow somebody to

combine that with the information that you've published to re identify people, that can definitely be problematic. And so I'm interested in maybe digging a bit more into some of those

controls for sharing of data and some of

the practices that you've seen be effective in encouraging people not

to make offline copies of that information and things like Excel or CSV files and improve the experience such that they actually enjoy using it in the place where it lives rather than moving it somewhere else to be able to do any further analysis.

I think the key, you said it right there, is is enjoy where it lives. So at the end of the day, if if you enforce your policy dynamically at the data layer, if you will,

then your downstream users that are benefiting from that data are simply connecting to the database like they normally would in executing queries,

which of course even becomes more viable in our world of of SaaS data warehouses,

where they're more scalable and you're not really worried about impacting other

database workloads by analytical workloads.

So if you consider that paradigm that more people can get on the database with fewer restrictions,

then you also want to adopt the paradigm of enforcing policies at the data layer and enabling people to simply connect to data and use, you know, SQL to ask whatever questions they need asked and then take that to any analytical use case that makes sense. You know, you touched a little bit on

the

re identification

piece.

The other part of that

is that,

you know, there are tricks that Amida could do as well where,

you know, you might not be masking a column just because

it's sensitive. You might be masking it because it's a foreign key that could join some other table that would cause a data leak. And we take that into consideration as well, and we can do tricks where, hey, look. We'll mask these 2 columns that are normally joinable,

but they won't be joinable when it's masked. And you can define when they you actually want tables to be joinable. So we essentially can retweak that mask on those columns so that it now becomes joinable, but you still can't see the underlying values, but only under certain circumstances or under certain purposes that a person's acting under. So So, again, that's kind of some benefits you get from live interactive, you know, Netflix mode versus,

you know, package up, ship, blockbuster mode. 1 of the features that is really useful, in my opinion, is our we have this feature called projects, and

the comparison I would draw is to creating a compliant copy in the database. So if you were, you know, managing a data warehouse and an analyst group came up and asked, well, hey, can we get a copy of this g just these 10 tables? But certain policies have to be put in place. 1 thing you might do is you just create maybe a clone of those tables in a separate schema, manage access controls to that schema such that only those people

would have access to it. And that's essentially what a project in Immuta is, is we can you know, you can kinda point and click, select some data sources,

cordon them off in a a managed schema, and then give access to other users

to that data. And then they could create derived tables or, you know, connect their data tools to that. And that project lives

on its own,

you know, kind of apart from the rest of the dataset. So it's a nice way of

creating

either short lived or just sandboxed

sets of data for users and changing policies based on that specific purpose.

Yeah. And to be clear, those are our views and not we never create data copies.

That's 1 of the goals of the of the platform is to have everything be dynamic.

You invest so much in your data infrastructure,

you simply can't afford to settle for unreliable data.

Fortunately, there's hope. In the same way that New Relic, Datadog, and other application performance management solutions ensure reliable software and keep application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines.

Monte Carlo's end to end data observability platform monitors and alerts for data issues across your data warehouses,

data lakes, ETL, and business intelligence.

The platform uses machine learning to infer and learn your data, proactively identify data issues, assess its impact through lineage, and notify those who need to know before it impacts the business.

By empowering data teams with end to end data reliability,

Monte Carlo helps organizations save time, increase revenue, and restore trust in their data.

Visitdataengineeringpodcast.com/montecarlo

today to to request a demo and see how Monte Carlo delivers data observability across your data infrastructure.

The first 25 people will receive a free limited edition Monte Carlo hat.

Can you dig a bit more now into how the Immuta platform itself is actually architected and some of the ways that it has evolved in terms of the design of the system or the particular goals that you had for it?

Originally

so I talked a little bit about how we push policy at into the data layer.

And, originally,

we did that through what we called our query engine, which essentially is a proxy

that sits in front of your database

that you connect to. And essentially, we rewrite queries to push them down into the database to be enforced,

basically through a query rewrite. So as far as the database concerned, it just looks like a client

must run-in queries, except we've rewritten the query to enforce policy.

We've since enhanced that.

I wanna use the word enhance carefully because I think our proxy still works great, and we support that for, you know, I think over 20 databases now. But in these more modern

SaaS data warehouses where it's more than just a database, for example, you actually go into

a GUI, a UI when you're using Snowflake to execute Snowflake queries in some cases. Similarly with Databricks,

you're in in a notebook and you're potentially doing things with Python or Scala.

And so

we've built what we call these native integrations where we actually live

in the database. The way we do this is different depending on the database technology.

But essentially, we're able

to propagate that policy down

into the Compute Engine. So in the case of Databricks, for example, we are rewriting Spark queries

to enforce the policy live. In the case of Snowflake, we're creating dynamic views, which are 1 to 1 mapping to the original tables that are completely dynamic. So

you, Steven, and I all query the same table.

You know, we would see different data because of the way we've constructed

that policy into the view. So got native integrations with

many of the SaaS data warehouses

to enable us to be, you know, completely invisible to that downstream user.

For somebody who is adopting Immuta,

what is the process of actually getting it integrated into their existing data platform and data systems?

And what does the workflow look like once they have adopted Immuta? And, you know, how might that change from where they were before to where they are afterwards?

Again, it's it's a little bit dependent on where we're enforcing the policy, but sticking with the native and and SaaS Warehouse story.

Immuta, again, is kind of the separate metadata layer. So it's very easy

for you to install and play with Immuta without having to, you know, change anything you're currently doing.

So

I'll just run through a real world example with Databricks, for example. So

you could

install Immuta, And Immuta can be deployed inside your VPC if you want. We also have SaaS service that you could spin up. And you deploy our software,

would typically hook us up to your whatever your identity management system is, which would probably be shared with Databricks.

We would pull all that identity management

all that identity information in,

you would point us at your meta store and we would populate Immuta with all your tables and databases,

and then you would start building policies. And then once you've

built your policies, again, leveraging things like tags and metadata and building those at the logical layer. So, you know, we've seen cases where we've been able to boil down, you know, hundreds of policies that might have existed for an organization down to 2 or 3 at Immuta because of all the scalability we provide with ABAC and and logical based policy building.

Then you would simply

configure your Databricks cluster to be Immuta aware, which involves adding some jars to it, and you spin up that cluster and policies are being enforced.

So if you wanted to avoid Immuta, you would simply spin up a cluster that's not Immuta aware. So that's we're very noninvasive

from that perspective. Similarly,

if you're using Snowflake, you would simply, you know, grant people access to

tables that aren't Immuta protected if you wanted to bypass us. So it's a very noninvasive deployment, a very easy deployment. And if a customer knows what policies they want to have enforced, it's very easy to get them up and running. I think where we spend the most time is kind of and this is something I could talk about about where we're headed, but taking the ideas or their written laws and rules for our organization and kind of, like, turning those into real

operational policies in Immuta is probably where our customers spend,

like, the most time on implementing our product.

And in terms of the complexities involved in the overall platform, there are a number of areas that I can see as being particularly challenging with things like

managing a unified access control layer across all of those different data systems that you're working with, like BigQuery,

Snowflake, Databricks, etcetera.

And then also with data masking where it can be challenging to

understand what data is actually sensitive, where I know that you can go in and do some labeling. But some of the automated aspects,

understanding what are some effective heuristics and what are the edge cases that you need to be aware of, particularly as an end user,

and then

some of the challenges of actually

masking that data

dynamically at read time versus actually having the data stored in an obfuscated format. And just curious what you see as being some of the complexities

and trade offs to some of those different challenges and what you're trying to build.

I'll start with what you just ended with on the

you know, do you mask on the way in to the cloud or do you do it dynamically?

And so there are other approaches you could take here where you might have data that's

very sensitive and you never want it to live in the cloud at all, and so you would encrypt it before it leaves the walls of your network on premise. And so there are some use cases, and we typically will talk to customers about, okay, you might wanna do that with your most sensitive data. Because as soon as you do that, you're essentially making it useless because in any way you want to ask a real question of that data, it's going to be encrypted to be able to answer that question. And we have the ability to be able to do that. And in the case

of queries where you're trying to query for a specific value in that column, You know, if you meet the policy, we have tricks we could do where we'll actually encrypt the predicate of your query so that it'll actually match the encrypted value in the database so that you can ask specific questions like that.

But as soon as you start trying to do anything fuzzy against an encrypted column, you're gonna have a hard time, like any kind of math operations against numeric

values. We try to coach our customers

through these steps of, hey. You want to potentially encrypt your most sensitive data, but then your other

indirect identifiers, for example, which are also very critical to things like linkage attacks,

where, you know, first name, last name, credit card number, obviously, direct identifiers. But you need to worry about

things like, hey. If I owned

a very unique vehicle

that existed in this table and someone knew that, like, hey. I owned a 1968

Volkswagen Rabbit.

Someone could go into this table and immediately find my record and maybe my other assets because they knew about that very rare car that I owned. And so while on the surface, you wouldn't think that, you know, the make and model of your vehicle

is sensitive, it can become an indirect identifier. And so this is the really the big challenge I think customers face with things like GDPR and CCPA. Because

at the end of the day, if you want to anonymize your data, you need to be worried about indirect identifiers. But if you're worried about indirect identifiers

and you are naive about how you mask them, you're going to make your

data completely useless. And

there's this paper by Paul Ohm.

If anyone's interested in doing some reading, it's called broken promises of privacy,

responding to the surprising failure of anonymization.

And there's a quote in there that he says, it's something along the lines of data can either be useful or perfectly anonymous, but never both. So you really need to play in this gray area between

completely cut off from the column or have access to the column. And most

built in security tools today on the market, it's basically a binary decision like that. Do you get to see the column or not? And Immuta provides these advanced

privacy and tech enhancing technologies that are called pets. Things like k anonymization

and local differential privacy,

also termed randomized response,

where we essentially give you a level of utility from the column, but also enforce a level of privacy

that allows you to, you know, meet the both the demands of your legal team, but also your analytical teams. And so combining

this concept of encrypting on ingest to the SaaS data warehouses

really, really gives you a powerful,

you know, leverage on both meeting your your legal responsibilities, but also enabling your data analytics on the cloud.

That is challenging. And and part of that is our reliance

on things like a common identity manager.

So, you know, Tobias in Immuta needs to be the same Tobias in Databricks.

And so we'll approach that by saying, hey. Look. We need to have a common identity.

We can either do that mapping in Immuta

or more typically, you know, if the customer is using something like LDAP or Active Directory, we would both

us and Databricks and Snowflake all use the same identity manager.

And similarly,

you know, when you're capturing audit logs, you want the user and those audit logs to be consistent.

And the nice thing about this is your audit logs will then be consistent. So you're not having different audit logs from Databricks as you do from Snowflake, as you do from BigQuery, as you do from Redshift. They'll be consistent now because Immuta is basically

monitoring these queries at the data plane and capturing

not only audit about

who's querying what, which, of course, is important. But, honestly, I think what's equally important is who's building what policies

and why

and when and how are they changing.

That history is also captured in our platform

so that you could understand basically your governance stance over

time. And in working mirrors and trying to help them gain better control and security around the data that they're working with, what are some of the common blind spots or areas that are completely overlooked that you have found?

I think 1 of the areas that we

often don't think about at all is the idea of purposes and, you know, for what purpose is this data being

released and how do we tie that to the individual consumers who are actually

accessing that data? If you think about

sort of historical

dimensional modeling, the data platform owners is

optimizing for the most

generally

useful model that's also performant in the database.

But that comes at a cost of losing

track of exactly why data is being accessed

for a particular reason. And

starting to get customers to think in that sort of context specific

approach

is really useful because it simplifies some of the

access control decisions.

You can exempt yourself

or a certain set of users from a certain set of policies for a certain project,

but not change the core

model or the core policies of your database. So we find that that's a really useful

abstraction to introduce.

It's sort of the way we think about data projects

already

in sort of a modular way

that have a start, a middle, and end.

But at the database level, that doesn't really exist. So I would say that's 1 area that is becoming increasingly important with new privacy laws and that we see as a new concept for a lot of customers.

1 other thing I'll add just real quick is that

the other thing we see a lot, and taking a step back, is

we call it the 3 phases of

data security and privacy.

So phase 0, I call it 0 because everyone should be doing this. This is you just don't let the bad guys in. Right? You have security

and logins, and only your company gets to your data. And then the next phase of that is privacy, where you're enforcing fine grain controls on your data to your internal employees.

And a lot of people say, okay, we're done. That's great. We've enforced our controls.

They forget about phase 2,

which is data collaboration. When your employees are creating new derived tables, how do you ensure that your policies get inherited and passed down to those derivative data products?

And that's a really, really hard problem, which, again, our projects concepts, without getting into too much detail, aims to solve.

And in terms of people who are using Immuta, what are some of the most interesting or unexpected or innovative ways that you've seen it employed?

Great question. I mean, my favorite use case, which I'm allowed to talk about,

is

is probably 1 of our smallest customers, but it's the, COVID Alliance.

So it's this group that has built a data platform. They built it on top of Snowflake,

and they are collecting COVID data

and sharing that with researchers, and they wanna share it in a way where the researchers

can do their research on this on data in a way that builds on each other because they're able to share

their models and and the data across,

efforts

through this platform.

And we're the foundational piece of all of this where

they're essentially using us to not only anonymize the data, but manage the sharing across these research teams

on top of Snowflake. And some of the

advanced anonymization techniques that we have that I mentioned earlier where we're playing in that gray area between utility and privacy

is really the power that's enabling them to do this because they obviously, this is highly sensitive information.

And, you know, they've given me quotes from researchers

along the lines of, hey. I never thought

collaboration like this on data would ever be possible, not only from a data anonymization

perspective, but from the ease of collaboration with other researchers that we've never even met before at our university, for example.

I think COVID Alliance is a great example because

they also

have a very

strong

human level of governance

on top of the the data infrastructure. So for each project or use case that they're presenting to their end users, there's a data privacy impact assessment that goes into very

rigorous detail around

what are the potential privacy impacts, you know, what's the purpose, you know, what policies are in place to

mitigate the risk of harm to the individuals who contribute to the study. And

that sort of mindset is very familiar to me from the academic world where every time you started an experiment, you would go through an institutional review board process where

you would get authorized to do certain things with a certain set of data for a certain limited purpose, and you'd have to keep track of that from start to finish. And

that's really missing from a lot of our data workflows

in industry, in my opinion. Having that sort of

check from

the time data was acquired to the time it was used and making sure that there's alignment between those 2 things is a really challenging

governance problem,

technically and

sort of organizationally.

And in terms of your own experience of building the Immuta product and working with your customers and trying to advance the case for data governance and security and access control, what have you found to be some of the most interesting or unexpected or challenging lessons in that process?

I think it's really hard as a data scientist

or data platform owner to get in the mindset

of an attacker. A lot of the folks in the security

world,

you know, might come from a background of thinking in terms of vulnerabilities

and, you know, worst case scenarios

if an attacker got a hold of the database. But, you know, I guess just speaking personally, it doesn't come natural to me as a data scientist. I'm usually

more attuned to think of the potential value

in a very noisy dataset rather than the potential risks. And so

it's been

challenging to put myself in the shoes of thinking how can something go wrong with this dataset? Why can't I release this dataset as it currently sets?

What potentially nefarious things could people do

against the data subjects in my dataset? But I think that's really a skill that we're gonna

have to become better at if we're gonna be better as a community at protecting data.

The the concept of privacy guarantees,

and we get into situations where

I think the data engineering teams

understand the basics of what they wanna do, and they've kind of kicked the can down the road on the complicated scenarios because they didn't know that there was an automated solution that could solve this. And, again, I'm referring to these privacy enhancing technologies.

For example, I can't name the customer name, but, you know, we got in there with them. They started enforcing, you know, table level and column level basic column level masking, and they discovered

that we had the k anonymization policy.

And they were like, oh, wait. This means I could open up this table

to all my analysts because you could hide,

you know, like, the CEO and and, like, you know, other highly unique

values in certain columns from someone being able to do a linkage attack. So long story short, I think people understand the the bare bone basics

of anonymizing data,

which is basically hide it or not. And once, you know, people understand kinda the art of the possible with some of these advanced privacy enhancing technologies, that just opens a whole new world of use cases that they weren't even considering because they didn't know they could consider it. For somebody who is

looking for a governance solution and considering the use of Immuta, what are the cases where it's the wrong choice?

What I usually say is you want your data to be structured to the level that you need your policies to be enforced. So we are not great for

unstructured use cases. Now that being said, we can do object level controls if you've got, like, images tagged or PDF tagged

in s 3, but you need to give us some kind of structure in order to

enforce policy. So that is a limitation of the platform.

And in terms of your goals for the near to medium term, what do you have planned for the future of both the platform and the business?

I kind of alluded to this earlier. 1 of the things that we find our customers struggle with is this idea of how do I take rules and turn them into policies?

So essentially, you know, we give them all the wood and the hammers and the nails to build whatever house they want, right, based on their policies blueprint.

We wanna actually take a step back a little bit and say, hey. Let's help you build the blueprint. You give us the risk profile that you're willing to take. We'll create the right policies for you. So adding some more automation around

the construction

or the creation of a blueprint for what you wanna construct policy wise.

Kind of related to that is

being more focused on

sharing use cases. And if you're familiar with HIPAA at all,

HIPAA is the US regulation for health care data. And there's essentially 2 ways to enforce HIPAA on your data. 1 of them is what's called HIPAA safe harbor,

where there's essentially

a defined 18 columns that you have to mask in order to have your data be compliant with HIPAA. You know, it's things like names,

addresses, social security numbers, things like that. The harder, more complex way to do it is something called HIPAA expert determination,

where you actually have to hire a statistician

to come in. And, you know, this is a slow kind of arduous process,

and it's not cheap, where they'll basically say, okay. You anonymize this well enough for your use case. And the reason you wanna do that is to set the safe harbors to

restrict it in some cases for use cases

that you want to solve. So

we actually are going to automate

HIPAA expert determination

and make it so that the platform can act as that expert and, you know, enforce the regulation in a way that does not require a human in the loop. So and we think we can extend that process to other regulatory controls like GDPR and CCPA.

And then the last thing I'll bring up is

we really see ourselves

as the last step in ELT.

So I think, you know, everyone understands

this movement of kind of putting the transform after the e and the l, and we believe the g comes after the transform.

So again, this is removing all of that policy logic out of the transform layer and let it happen dynamically

and define that in a declarative way in Immuta

so that we become a natural extension of the ELT

pipeline, becoming ELTg,

g being the governance.

And, you know, how do we better integrate with tools like a DBT.

1 of the thing that's really exciting to me as a data consumer and platform owner is

is the idea of getting to a place

where we are much more

modular and prescriptive with how

privacy is handled and policies are put in place. And I think

attribute based access control is really the only way to get to a place where we can reuse policies across

companies because it's much more expressive, it's much more generic,

it gives data teams a way to communicate with each other to say, like, hey. This is the policy that we have in place. We give this data to these users. We put these transforms in place. And it's really by abstracting that g

from the ELT that we can build a set of best practices that are easily implementable.

Are there any other aspects of the Immuta platform or the challenges of data governance and security and access control that we didn't discuss yet that you'd like to cover before we close out the show?

You know, in terms of data governance,

there's a lot of talk right now about data quality

and integrating

metadata from

jobs and from

tests and from

consumption tools into 1 place. And I think as we look at managing a data platform beyond just the access control, but to, you know, making sure that the right people get the right data at the right time and it it looks the right way.

You know, I think that

it's really an exciting time to think about merging these things into a common place. Because at the end of the day, it's it's all about

reliability of the data and reducing risk and maximizing value. And I think,

the more we can

become very concrete about the actual tasks that have to happen in the data platform,

the better off we're all gonna be long term.

Well, for anybody who wants to get in touch with either of you and follow along with the work that you're doing, I'll have you each add your preferred contact information to the show notes. And so for the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. And I'll start with you, Steve.

Yeah. I think that 1 of the biggest gaps right now

is having

1 place

for

understanding

where the data lives, what's happening to it right now, and, you know, and what I need to do as a data owner. You know, when we think about governance,

entitlement security,

specifically, 1 of the big problems

existentially

is that, you know, a lot of times, we don't have a really good sense of what's happening with my data at any given moment. There's a very real

spread problem where we've got tons of data assets. There's lots of important decisions that have to happen,

and there's lots of monitoring that has to go on as well. And I think we're I'm hoping that we're kinda moving as a community to to having some standards on

how to

manage and monitor and understand the health of the data pipeline and then being able to communicate that to end users. So it's it's not just about me as the data platform owner knowing what's happening, but also about end users who are using the data

having visibility into

into what's happening and having a central place they can go to. So I think that's a huge gap.

And, Steve.

To me, it's kind of related to what Steven said, but I think more,

tied to,

again, those 3 phases I mentioned earlier where

I don't think people are thinking enough about

how

data analysts

need to do their own transforms and kind of, you know, manage the data how they want to manage it to some degree

for their analytical use cases. And my belief is

compliance or enforcement of policy is the biggest blocker to that. And so this contributes to there only being a small set of data engineers that can really service all these

transformations that are required across the organization because they're that small trusted group that's allowed to see all the data and that you trust with your database.

And I think

to really, again, going back to data governance is really the inverse word. It really should be, you know, data acceleration. How do we make everyone more efficient? And I think if you have the right kind of controls in place

to enable everyone

to manage data while still meeting the demands of compliance,

That's really the end state that everyone wants to get to.

That's really what what we as a product are are trying to achieve. So it's more than just

how do you create policies on existing tables and forget. It's more about how do

you enable everyone to

transform data in a way that makes sense for them,

share those transformations.

And that's, I think, when governance gets really, really hard.

And there's just not much out there to help with that.

Well, thank you both for taking the time today to join me and discuss the work that you've been doing at Immuta. It's definitely a very interesting platform and solving a very real problem that we have in the data industry. So I appreciate all the time and energy that you put into that, and I hope you enjoy the rest of your day. Yeah. Thank you. It was a pleasure being here. Thanks so much, Tobias. It was great.

Listening. Don't forget to check out our other show, podcast.init@pythonpodcast.com

to learn about the Python language, its community, and the innovative ways it is being used.

And visit the site at dataengineeringpodcast.com

to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts at data engineering podcast.com

with your story. And to help other people find the show, please leave a review on Itunes and tell your friends and coworkers.

Data Engineering Podcast

Summary

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links