Summary
Data governance is a term that encompasses a wide range of responsibilities, both technical and process oriented. One of the more complex aspects is that of access control to the data assets that an organization is responsible for managing. The team at Immuta has built a platform that aims to tackle that problem in a flexible and maintainable fashion so that data teams can easily integrate authorization, data masking, and privacy enhancing technologies into their data infrastructure. In this episode Steve Touw and Stephen Bailey share what they have built at Immuta, how it is implemented, and how it streamlines the workflow for everyone involved in working with sensitive data. If you are starting down the path of implementing a data governance strategy then this episode will provide a great overview of what is involved.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Feature flagging is a simple concept that enables you to ship faster, test in production, and do easy rollbacks without redeploying code. Teams using feature flags release new software with less risk, and release more often. ConfigCat is a feature flag service that lets you easily add flags to your Python code, and 9 other platforms. By adopting ConfigCat you and your manager can track and toggle your feature flags from their visual dashboard without redeploying any code or configuration, including granular targeting rules. You can roll out new features to a subset of your users for beta testing or canary deployments. With their simple API, clear documentation, and pricing that is independent of your team size you can get your first feature flags added in minutes without breaking the bank. Go to dataengineeringpodcast.com/configcat today to get 35% off any paid plan with code DATAENGINEERING or try out their free forever plan.
- You invest so much in your data infrastructure – you simply can’t afford to settle for unreliable data. Fortunately, there’s hope: in the same way that New Relic, DataDog, and other Application Performance Management solutions ensure reliable software and keep application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo’s end-to-end Data Observability Platform monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence. The platform uses machine learning to infer and learn your data, proactively identify data issues, assess its impact through lineage, and notify those who need to know before it impacts the business. By empowering data teams with end-to-end data reliability, Monte Carlo helps organizations save time, increase revenue, and restore trust in their data. Visit dataengineeringpodcast.com/montecarlo today to request a demo and see how Monte Carlo delivers data observability across your data infrastructure. The first 25 will receive a free, limited edition Monte Carlo hat!
- Your host is Tobias Macey and today I’m interviewing Steve Touw and Stephen Bailey about Immuta and how they work to automate data governance
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by describing what you have built at Immuta and your motivation for starting the company?
- What is data governance?
- How much of data governance can be solved with technology and how much is a matter of process and communication?
- What does the current landscape of data governance solutions look like?
- What are the motivating factors that would lead someone to choose Immuta as a component of their data governance strategy?
- How does Immuta integrate with the broader ecosystem of data tools and platforms?
- What other workflows or activities are necessary outside of Immuta to ensure a comprehensive governance/compliance strategy?
- What are some of the common blind spots when it comes to data governance?
- How is the Immuta platform architected?
- How have the design and goals of the system evolved since you first started building it?
- What is involved in adopting Immuta for an existing data platform?
- Once an organization has integrated Immuta, what are the workflows for the different stakeholders of the data?
- What are the biggest challenges in automated discovery/identification of sensitive data?
- How does the evolution of what qualifies as sensitive complicate those efforts?
- How do you approach the challenge of providing a unified interface for access control and auditing across different systems (e.g. BigQuery, Snowflake, RedShift, etc.)?
- What are the complexities that creep into data masking?
- What are some alternatives for obfuscating and managing access to sensitive information?
- How do you handle managing access control/masking/tagging for derived data sets?
- What are some of the most interesting, unexpected, or challenging lessons that you have learned while building Immuta?
- When is Immuta the wrong choice?
- What do you have planned for the future of the platform and business?
Contact Info
- Steve
- @steve_touw on Twitter
- Stephen
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- Immuta
- Data Governance
- Data Catalog
- Snowflake DB
- Looker
- Collibra
- ABAC == Attribute Based Access Control
- RBAC == Role Based Access Control
- Paul Ohm: Broken Promises of Privacy
- PET == Privacy Enhancing Technologies
- K Anonymization
- Differential Privacy
- LDAP == Lightweight Directory Access Protocol
- Active Directory
- COVID Alliance
- HIPAA
- GDPR
- CCPA
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
Feature flagging is a simple concept that enables you to ship faster, test in production, and do easy rollbacks without redeploying code. Teams using feature flags release new software with less risk and release more often. ConfigCat is a feature flag service that lets you easily add flags to your Python code and to 9 other platforms. By adopting ConfigCat, you and your manager can track and toggle your feature flags from their visual dashboard without redeploying any code or configuration, including granular targeting rules. You can roll out new features to a subset of your users for beta testing or canary deployments.
With their simple API, clear documentation, and pricing that is independent of your team size, you can get your first feature flags added in minutes without breaking the bank. Go to dataengineeringpodcast.com/configcat
[00:01:43] Unknown:
today to get 35% off any paid plan with code DATAENGINEERING or try out their free forever plan. Your host is Tobias Macey. And today, I'm interviewing Steve Touw and Stephen Bailey about Immuta and how they work to automate data governance. So Steve, can you start by introducing yourself?
[00:01:59] Unknown:
Yeah. Hi. I'm Steve Touw. I'm one of the co-founders at Immuta and also the CTO.
[00:02:04] Unknown:
And Stephen, how about you? Hi, Tobias. Thanks for having us. I'm Stephen Bailey. I'm the director of internal analytics at Immuta.
[00:02:12] Unknown:
And going back to you, Steve, do you remember how you first got involved in data management?
[00:02:16] Unknown:
This story is actually somewhat similar to why we started Immuta too, but it dates back to when I was an analyst with the US intelligence community and the military. And we obviously did a lot of analytical work on very sensitive data, and we found ourselves kinda struggling with the same problem over and over again: how do we enable the best analytics that we could possibly do with the data that we're collecting, but at the same time, enforce the appropriate controls that are necessary to protect how that data is being collected, and also, you know, follow US guidelines on how to handle data.
So it's been a long road of dealing with that, which eventually led me to being at Immuta and starting the company.
[00:03:03] Unknown:
And, Steven, how about you?
[00:03:05] Unknown:
Thanks, Tobias. I've been in a number of data science positions over the past 10 years or so. And often, I was operating in a solo capacity. So I got really used to the process of starting a new data project, finding the data, ingesting it, storing it, and then realizing value from it. And it was really in my PhD, where I was looking at biomedical image analysis, that I really became enamored with the, I think, complexities of data engineering and data management. We were analyzing very complex medical imaging datasets, so three- or four-dimensional images across multiple subjects. And it really became clear to me during that time that if you didn't get the data management side of things right, then you weren't able to answer the really critical, important questions that you set out to answer in the first place.
And then, you know, the other nuance to that is that if you aren't able to manage the data efficiently and in accordance with the ethical and legal considerations that you're sort of bound to, you know, those were especially strict in a biomedical research facility, then you also couldn't get to the insights. And so when we transitioned to Immuta, I really saw an opportunity and a gap here in the tooling for marrying the data management side of things with effective governance and then also with the human components and the decisions that ultimately are driving all of those decisions.
[00:04:28] Unknown:
And so digging a bit more into Immuta, can you give a bit of an overview about what it is that you've built there and the motivation for starting the company?
[00:04:37] Unknown:
I touched a little bit on the motivation. We just thought to ourselves, hey, there's gotta be a better way to understand where your data lives and how to release that data appropriately to analysts in a way that doesn't require, you know, humans having to spend all their time doing that work. How do we add automation around that? How do we make that more scalable? And that's really what the product is about. I like to think of Immuta as a way to take your concrete goals on how you want to protect your data, but also where you wanna potentially take some risks and what your risk thresholds are, and apply that to how you release data to your analysts.
And to me, if you do this right, and Stephen alluded to this, it's not about restricting access to data. It's about getting more access to data.
[00:05:41] Unknown:
And before we get too much further into Immuta itself, I'd like to get your perspective on how you see the definition of data governance, because it's a very broad canopy of things that all need to work together. And I'm wondering from your perspective, particularly as a business that's focusing on the enforcement and security aspects of it, what do you see as being data governance? And how much of the solution is technological, and how much of that still needs to be delegated to the human factors of process and communication?
[00:06:16] Unknown:
Data governance. What is data governance? I actually really hate that term, to be honest, even if it is in our slogan. Because, as you mentioned, it is a very loaded term, and I think that's a problem in this space, frankly, because if you ask 10 different people what data governance is, you get 10 different answers. My answer to you is that I think it is how you understand the data that you have, how it's being used, how you protect it, and how you enable analytics as efficiently as possible. And, you know, again, that's what our product strives to achieve.
But, you know, I'll defer to Stephen, because he's been using our product to manage our own internal data, to talk through some of the process versus technology question.
[00:07:03] Unknown:
Yeah. Totally agree with Steve that the data governance word is so overloaded. It has a lot of baggage associated with it. We've tried internally to come up with a better word for sort of the set of practices that you need to have in place to manage your data effectively. And data governance does seem like the best word, but there's just so much baggage with it. I think what's exciting about right now is that there does seem to be a set of tasks or capacities that people are starting to agree on that need to be in place in the data pipeline. So things like data quality control, metadata management and capture, some data observability, and then, of course, our favorite, access control and security and entitlements.
I think what I'm looking for in a tool, like, for our internal organization, what I'm always looking for, is something that's gonna let me, as the analyst or as the human, do human things better and let our pipelines and our tooling do the automated things better. And one of the things I enjoy least about managing a data pipeline is setting up, for example, Snowflake roles and schema management, and integrating my identity management system with the tool, and thinking through permissions, that type of stuff. That's not what I'm good at, and I don't think that's what humans are good at. And it's all very automatable if you have a blueprint in mind, like a policy in mind.
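To make the "automatable if you have a blueprint" idea concrete, here is a minimal sketch of generating Snowflake-style role and grant statements from a small declarative blueprint. The blueprint shape, role names, and helper function are hypothetical illustrations for this episode's discussion, not Immuta's API or any specific product feature.

```python
# Hypothetical sketch: expand a small policy "blueprint" into Snowflake-style
# GRANT statements, so role and schema wiring is generated rather than hand-managed.
# The blueprint structure and names below are illustrative only.

BLUEPRINT = {
    "analyst": {"schemas": ["analytics"], "privileges": ["SELECT"]},
    "engineer": {"schemas": ["raw", "analytics"], "privileges": ["SELECT", "INSERT"]},
}

def grants_for(database: str, blueprint: dict) -> list[str]:
    """Expand a role -> schema -> privilege blueprint into DDL/GRANT statements."""
    statements = []
    for role, spec in blueprint.items():
        statements.append(f"CREATE ROLE IF NOT EXISTS {role};")
        for schema in spec["schemas"]:
            statements.append(
                f"GRANT USAGE ON SCHEMA {database}.{schema} TO ROLE {role};"
            )
            for priv in spec["privileges"]:
                statements.append(
                    f"GRANT {priv} ON ALL TABLES IN SCHEMA {database}.{schema} TO ROLE {role};"
                )
    return statements

if __name__ == "__main__":
    for stmt in grants_for("prod_db", BLUEPRINT):
        print(stmt)
```

Running it prints the role and grant statements for each entry, so the wiring described above becomes a generated artifact of a single policy document rather than something a human maintains by hand.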
[00:08:24] Unknown:
Yeah. And that's evidenced by the number of different breaches that have happened because of people thinking that they had the right policies in place and then realizing that it was actually ineffective for a particular edge case or access pattern that they didn't even consider.
[00:08:37] Unknown:
Yeah. Exactly. No one's policy is to leave customer data publicly accessible. But in effect, a lot of people have that policy in place on some important data.
[00:08:47] Unknown:
Digging a bit more into the topic of data governance, what do you see as being the current landscape of solutions for addressing some of these different governance problems, particularly in terms of the access control and entitlements that you're focusing on? And what are some of the motivating factors that you see as being the reason that somebody would choose to use Immuta over any of the other either open source or proprietary tools that exist?
[00:09:12] Unknown:
So in terms of the other tools in the space, you know, I think a lot of teams are looking for a data catalog solution, so there's a lot of activity going on around data discoverability and cataloging. I think in a lot of cases, people are relying on the built-in access controls of the applications and systems that they're working with for enforcing entitlements and security and policy enforcement. But I think the challenge is that data teams, generally, are going through a process of decentralization. Whereas before, you could really just leverage your data warehouse's or your data lake's access control system and manage that one system pretty effectively.
Now it's very easy for a line of business to spin up their own Snowflake account, their own Looker or Mode application, or have a couple of databases that they're pumping live production data into. We're seeing this proliferation of production data assets across the organization, and that's really presenting a challenge for the traditional approach of just using the built-in access control layer in your application. So what we're seeing across a lot of our customers is that one thing that is very appealing to them is having a centralized enforcement layer where they can define policies that then propagate out to their database systems. And so you define your policy once, you define it in a very expressive language, and then that policy gets enforced down on the different databases.
[00:10:43] Unknown:
Yeah. And the access control question is definitely one I wanna dig into in a little bit. But in terms of the overall ecosystem of data tools and where Immuta sits within that space, I'm curious if you can dig a bit more into how Immuta integrates with the tooling and the storage layers and the compute that people are using for trying to do analysis, and just what the surface area looks like in terms of what you are trying to build to be able to fit well with people's data platforms and, you know, the flexibility of compute that they might be trying to bring to the problem.
[00:11:19] Unknown:
A good way to think about Immuta is as this metadata layer that kinda sits outside of your data, almost like a metadata aggregator. And that metadata information is not only useful for understanding where your data exists and what data is sensitive, but you can also leverage that metadata to build policy. And when I say metadata to build policy, I'm not just talking about data metadata, but also information about your users. So typically, we wanna see the separation of user definition from policy definition, which really gives us a lot of our scalability.
I think we'll talk a little bit later about ABAC and how Immuta leverages that. But without getting into all that detail here, essentially, you can pull in all that information about your users and your data assets and build policies separately in a scalable way. And then, since that all is kind of abstracted, if you will, from where your actual compute and data lives, those policies can get pushed down into that compute layer at runtime, so that the user interacting with the data can interact with it like they always have. Immuta is just invisibly enforcing that policy at runtime, uniquely for that user that's interacting with the data. So our goal as a company is to allow a customer to pick whatever compute they want, you know, BYOC, bring your own compute, and Immuta will be able to enforce your policies consistently across any of those, and across multiple of them at once. I mean, a lot of our customers, for example, will be using Databricks and Snowflake at the same time. They can build a single set of policies in Immuta and have them enforced consistently in both places, because we are that metadata abstraction where you build the policy.
[00:13:06] Unknown:
And then for the broader set of responsibilities of data governance, where it also comes into the policy definitions and understanding what data you have that is sensitive, and understanding the lineage aspects, for somebody who's using Immuta and who then needs to also cover all of the other aspects of data governance, what other workflows or activities are necessary that aren't necessarily built into the Immuta platform itself?
[00:13:38] Unknown:
I briefly alluded to this. So we have interfaces that organizations can implement to suck in things like, you know, if they've already created a business glossary in Collibra, we could pull that in, and you could continue to treat Collibra as your source of truth, for example, but then use Immuta to build policy against that business glossary that you've established. So we don't want to kind of interfere with the existing workflows that might exist in an organization. It's more about operationalizing those existing workflows that you might have. I briefly mentioned ABAC, or attribute based access control, for those that haven't heard that acronym.
Most data enforcement today happens through role based access control, which really means that you conflate who your user is with what they have access to. And when you do that conflating, you essentially create role explosion in your organization. So one of the workflows that we try to break is this role explosion problem, where if you're able to separate who the user is from what they have access to, define who your users are based on who they actually are rather than what they're supposed to have access to, and then define those policies separately, you get a lot more flexibility and scalability, which is something that our customers find very, very valuable.
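As a rough illustration of the ABAC idea described here, the sketch below makes an access decision from user attributes and data tags rather than from a separate role for every combination of department, region, and training. The attribute names, tags, and policy logic are invented for the example; they are not Immuta's model or API.

```python
# A minimal ABAC sketch: access decisions come from user attributes plus data
# tags, not from an ever-growing list of roles. All names here are illustrative.

from dataclasses import dataclass, field

@dataclass
class User:
    name: str
    attributes: dict = field(default_factory=dict)   # who the user *is*

@dataclass
class Table:
    name: str
    tags: set = field(default_factory=set)           # what the data *is*

def can_read(user: User, table: Table) -> bool:
    """One policy, written once, instead of a role per region/training combo."""
    if "phi" in table.tags and user.attributes.get("hipaa_trained") is not True:
        return False
    if "eu_customer_data" in table.tags and user.attributes.get("region") != "EU":
        return False
    return True

alice = User("alice", {"region": "EU", "hipaa_trained": True})
claims = Table("claims", {"phi"})
print(can_read(alice, claims))   # True: the attributes satisfy the policy
```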
And, you know, it's not only about policy, but we also have a concept of purposes, where we can define those separately too. So it's not just about who the user is, but what they are doing, which is really relevant to the existing regulatory controls that are out there today, such as GDPR and CCPA. And, Stephen, do you have some more you wanted to add to that?
[00:15:15] Unknown:
Yeah. Definitely. I think one of the workflows that really is a prerequisite for getting the benefits of Immuta is the exercise of an organization going through and really defining what their policies are. We are trying to get to a place where we have more standardized policies. A lot of companies treat personally identifying information pretty similarly. Right? They wanna mask it or redact it. And the policy, therefore, can be written in a way that's very generic. Right? It doesn't really matter what schema or what source the data comes from. What matters is whether an attribute is personally identifying or an indirect identifier. But an organization has to define those things upfront.
We do find that it's a challenge to think through a generic approach to protecting your data in an organization. It's much simpler to simply say, you know, this group has access to this data and that group has access to that data. And so in the moment, those decisions are very easy. But if you don't put that upfront cost into defining your policies in a scalable way, then you have to make those ad hoc decisions every time you onboard a new data source from a new system, or you add a new group in your organization, or you restructure your organization. So having that discussion upfront and coming up with a good enough metadata vocabulary is definitely a workflow that we see people having to adopt.
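A hedged sketch of what such a generic, tag-driven policy might look like in code: the rule is written once against tags and purposes, so it applies to any column carrying the tag, regardless of schema or source. The tag names, the approved purpose, and the hashing choice are all illustrative assumptions, not a specific product's behavior.

```python
# Illustrative tag-driven masking: the policy says "mask anything tagged pii
# unless the query runs under an approved purpose", independent of schema.

import hashlib

COLUMN_TAGS = {
    "customers.email": {"pii", "direct_identifier"},
    "customers.signup_date": set(),
    "orders.ship_address": {"pii"},
}

def mask(value: str) -> str:
    """Replace a value with a stable, irreversible token."""
    return hashlib.sha256(value.encode()).hexdigest()[:12]

def apply_policy(column: str, value: str, purpose: str) -> str:
    """Mask PII unless the query is running under an approved purpose."""
    if "pii" in COLUMN_TAGS.get(column, set()) and purpose != "fraud_investigation":
        return mask(value)
    return value

print(apply_policy("customers.email", "a@example.com", purpose="marketing"))
print(apply_policy("customers.email", "a@example.com", purpose="fraud_investigation"))
```

The point of the shape is that onboarding a new table only requires tagging its columns; the policy itself never has to be rewritten.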
[00:16:37] Unknown:
And one of the other interesting aspects of this problem is the definition of what constitutes sensitive data, because that can be very different based on the industry that you're in, or the regulatory regimes that you might fall under, or the ways that the data is being used or aggregated. If you're using the information but it feeds into a machine learning model where that information is never actually going to be exposed, it's very different than if somebody's building a business intelligence report that might then be published as part of their quarterly earnings or something like that, in terms of the responsibilities of how that data needs to be protected and controlled. There's also the fact that when you collect a certain set of information at one point in time, it's not necessarily classified as sensitive, but then because of either changing environments as far as the laws and regulations that you're subject to, or changes in terms of the nature of your industry or the business that you're in, or maybe acquisitions, what was not sensitive at one point may become sensitive down the road. And so I'm curious how you try to approach that kind of evolution of what is sensitive and how people are accessing it and using it, and understanding the intent of the data and the way that it's being used beyond just the static aspect of what is contained in, you know, column A and row B.
[00:17:58] Unknown:
This touches a little bit on what Stephen was just mentioning, which is that you really need to pull apart the policy from the data at the physical layer and really focus on how you build policy at the logical layer. And this allows you to manage, you know, just using a silly but relevant example, instead of saying, I want to mask the address and last name columns, you could say something like, I wanna mask anything that's PII. And then you could extend that policy later and say something like, I'm going to mask PII except when the user is acting under purpose, you know, HR or whatever. And you could define those purposes separately, and you could add other exceptions based on, you know, potentially some training courses that the user has taken. Or if a new column is now deemed PII for some reason where it wasn't before, you simply add that tag and the policy will automatically propagate itself to that column. And as I mentioned earlier, those tags on data could come from several different places or multiple different source systems, where Immuta is simply acting as that aggregator to be able to take all that information you've gathered across your organization, or potentially that Immuta has self-discovered for you, because we have capabilities to discover sensitive data as well, and use that to drive where your policies get enforced. So it's really this idea of separating the policy from the physical data, which is key to being able to handle these changes in both the rules and how you think of data and the different purposes that you've been processing it under. The other paradigm shift I think we're going through is
[00:19:34] Unknown:
from a model of release and forget, where I can prepare a dataset and then just send it out there for the business and kind of, like, just let it live out there forever, to a more actively managed model, where I'm gonna release a set of data, I'm going to, you know, understand what processes are feeding this particular release, I understand who's accessing this release, and I can report on whether it's up to standard or not, and then I can terminate access to it if I need to. You know, having that record of what's been put out there and keeping your compliance policies sort of at the forefront of how you're managing your warehouse is a much more proactive approach to managing the risk of, you know, individuals' data leaking or individuals being reidentified than a model where you kind of apply all your policies on the front end and then just, like, let whatever downstream activities happen.
[00:20:32] Unknown:
That's a really good point that I don't think we've really driven home yet, which is that all our policies are completely dynamic. It's not as if we are, in the transformation phase, creating anonymized tables. It's that we are enforcing policies in real time. So if you change a policy, you do not need to change your data. It will take effect and be enforced. And to Stephen's point about fire and forget, I mean, that's how a lot of people think about data sharing today, where I'm gonna create this anonymized set, and then I'm gonna share that copy with somebody.
And the dumb analogy I use is that's kind of like the Blockbuster way of sharing data, where we're more the Netflix way of sharing data. You can, you know, point people at your data and have policy be enforced and completely audited. And then, you know, you might change a policy 5 seconds ago, and that will immediately be implemented in protecting your data the way you wanted it to for those people that you've shared it with.
[00:21:29] Unknown:
Yeah. The question of data proliferation and data copying is definitely an important one in the governance space. Because as you said, if you publish it and then you decide, actually, you need to retract or obfuscate some aspect of it, or maybe a new technique has come out for being able to de-identify information, or there's a new dataset that's been made public that will allow somebody to combine that with the information that you've published to re-identify people, that can definitely be problematic. And so I'm interested in maybe digging a bit more into some of those controls for sharing of data and some of the practices that you've seen be effective in encouraging people not to make offline copies of that information in things like Excel or CSV files, and improve the experience such that they actually enjoy using it in the place where it lives rather than moving it somewhere else to be able to do any further analysis.
[00:22:26] Unknown:
I think the key, you said it right there, is enjoying it where it lives. So at the end of the day, if you enforce your policy dynamically at the data layer, if you will, then your downstream users that are benefiting from that data are simply connecting to the database like they normally would and executing queries, which of course becomes even more viable in our world of SaaS data warehouses, where they're more scalable and you're not really worried about impacting other database workloads with analytical workloads. So if you consider that paradigm that more people can get on the database with fewer restrictions, then you also want to adopt the paradigm of enforcing policies at the data layer and enabling people to simply connect to data and use, you know, SQL to ask whatever questions they need answered and then take that to any analytical use case that makes sense. You know, you touched a little bit on the re-identification piece.
The other part of that is that, you know, there are tricks that Immuta can do as well, where, you know, you might not be masking a column just because it's sensitive. You might be masking it because it's a foreign key that could join to some other table and cause a data leak. And we take that into consideration as well, and we can do tricks where, hey, look, we'll mask these two columns that are normally joinable, but they won't be joinable when masked. And you can define when you actually want tables to be joinable. So we can essentially tweak that mask on those columns so that they become joinable again, but you still can't see the underlying values, and only under certain circumstances or certain purposes that a person's acting under. So, again, that's some of the benefits you get from live interactive, you know, Netflix mode versus,
[00:24:14] Unknown:
you know, package up, ship, Blockbuster mode. One of the features that is really useful, in my opinion, is that we have this feature called projects, and the comparison I would draw is to creating a compliant copy in the database. So if you were, you know, managing a data warehouse and an analyst group came up and asked, well, hey, can we get a copy of just these 10 tables? But certain policies have to be put in place. One thing you might do is create a clone of those tables in a separate schema and manage access controls to that schema such that only those people would have access to it. And that's essentially what a project in Immuta is. You can kinda point and click, select some data sources, cordon them off in a managed schema, and then give access to other users to that data. And then they could create derived tables or, you know, connect their data tools to that. And that project lives on its own, you know, kind of apart from the rest of the dataset. So it's a nice way of creating either short-lived or just sandboxed sets of data for users and changing policies based on that specific purpose.
[00:25:25] Unknown:
Yeah. And to be clear, those are views, not copies; we never create data copies. That's one of the goals of the platform, to have everything be dynamic.
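Steve's earlier point about masking foreign keys while keeping them joinable can be sketched with a keyed, deterministic hash: the same input always maps to the same token under a given key, so equality joins still line up while the raw values stay hidden, and rotating the key breaks joinability. The key handling and helper below are simplified illustrations under those assumptions, not Immuta's implementation.

```python
# Deterministic keyed hashing as a sketch of "masked but still joinable":
# the token hides the value, and joins on the masked column still match
# because the same key produces the same token across tables.

import hashlib
import hmac

PROJECT_KEY = b"rotate-me-per-project"   # hypothetical per-project secret

def mask_join_key(value: str, key: bytes = PROJECT_KEY) -> str:
    """Deterministic keyed hash: hides the value but preserves equality joins."""
    return hmac.new(key, value.encode(), hashlib.sha256).hexdigest()[:16]

# The same customer id masks to the same token in both tables, so a join on
# the masked column still works; under a different key, it would not.
orders_token = mask_join_key("cust-1042")
payments_token = mask_join_key("cust-1042")
print(orders_token == payments_token)   # True
```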
[00:25:37] Unknown:
You invest so much in your data infrastructure, you simply can't afford to settle for unreliable data. Fortunately, there's hope. In the same way that New Relic, Datadog, and other application performance management solutions ensure reliable software and keep application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo's end to end data observability platform monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence. The platform uses machine learning to infer and learn your data, proactively identify data issues, assess its impact through lineage, and notify those who need to know before it impacts the business.
By empowering data teams with end to end data reliability, Monte Carlo helps organizations save time, increase revenue, and restore trust in their data. Visit dataengineeringpodcast.com/montecarlo today to request a demo and see how Monte Carlo delivers data observability across your data infrastructure. The first 25 people will receive a free limited edition Monte Carlo hat.
[00:26:39] Unknown:
Can you dig a bit more now into how the Immuta platform itself is actually architected and some of the ways that it has evolved in terms of the design of the system or the particular goals that you had for it?
[00:26:51] Unknown:
So I talked a little bit about how we push policy into the data layer. Originally, we did that through what we called our query engine, which is essentially a proxy that sits in front of your database that you connect to. And essentially, we rewrite queries to push them down into the database to be enforced, basically through a query rewrite. So as far as the database is concerned, it just looks like a client running queries, except we've rewritten the query to enforce policy. We've since enhanced that. I wanna use the word enhance carefully, because I think our proxy still works great, and we support that for, you know, I think over 20 databases now. But in these more modern SaaS data warehouses, it's more than just a database. For example, you actually go into a GUI, a UI, when you're using Snowflake to execute Snowflake queries in some cases. Similarly with Databricks, you're in a notebook and you're potentially doing things with Python or Scala.
And so we've built what we call these native integrations, where we actually live in the database. The way we do this is different depending on the database technology. But essentially, we're able to propagate that policy down into the compute engine. So in the case of Databricks, for example, we are rewriting Spark queries to enforce the policy live. In the case of Snowflake, we're creating dynamic views, which are one-to-one mappings to the original tables and are completely dynamic. So you, Stephen, and I could all query the same table and, you know, we would see different data because of the way we've constructed that policy into the view. So we've got native integrations with many of the SaaS data warehouses to enable us to be, you know, completely invisible to that downstream user.
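As a rough sketch of the dynamic-view pattern described above, the snippet below generates a one-to-one secure view over a base table, masking selected columns with a CASE expression so different callers see different data from the same object. The role check, SHA2 masking, and view naming are illustrative stand-ins chosen for the example; they are not the views Immuta actually emits.

```python
# Generate a masking view: each column is either passed through or wrapped in a
# CASE expression that hides it unless the caller meets the (illustrative) check.

def masking_view_ddl(table: str, columns: list[str], masked: set[str]) -> str:
    select_list = []
    for col in columns:
        if col in masked:
            # CURRENT_ROLE() stands in for whatever user/attribute lookup the
            # real integration performs at query time.
            select_list.append(
                f"CASE WHEN CURRENT_ROLE() = 'PRIVACY_EXEMPT' "
                f"THEN {col} ELSE SHA2({col}) END AS {col}"
            )
        else:
            select_list.append(col)
    return (
        f"CREATE OR REPLACE SECURE VIEW {table}_protected AS\n"
        f"SELECT {', '.join(select_list)}\nFROM {table};"
    )

print(masking_view_ddl("customers", ["id", "email", "signup_date"], {"email"}))
```

Because the masking logic lives in the view rather than in a copied table, changing the policy only means regenerating the view; the underlying data never moves.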
[00:28:42] Unknown:
For somebody who is adopting Immuta, what is the process of actually getting it integrated into their existing data platform and data systems? And what does the workflow look like once they have adopted Immuta? And, you know, how might that change from where they were before to where they are afterwards?
[00:29:00] Unknown:
Again, it's a little bit dependent on where we're enforcing the policy, but sticking with the native SaaS warehouse story: Immuta, again, is kind of this separate metadata layer, so it's very easy for you to install and play with Immuta without having to, you know, change anything you're currently doing. So I'll just run through a real-world example with Databricks. You could install Immuta, and Immuta can be deployed inside your VPC if you want. We also have a SaaS service that you could spin up. You deploy our software, and you would typically hook us up to whatever your identity management system is, which would probably be shared with Databricks.
We would pull all that identity information in, you would point us at your metastore, and we would populate Immuta with all your tables and databases, and then you would start building policies, again leveraging things like tags and metadata and building those at the logical layer. You know, we've seen cases where we've been able to boil down hundreds of policies that might have existed for an organization to two or three in Immuta, because of all the scalability we provide with ABAC and logical policy building. Then you would simply configure your Databricks cluster to be Immuta-aware, which involves adding some jars to it, and you spin up that cluster and policies are being enforced.
So if you wanted to avoid Immuta, you would simply spin up a cluster that's not Immuta-aware. We're very noninvasive from that perspective. Similarly, if you're using Snowflake, you would simply, you know, grant people access to tables that aren't Immuta-protected if you wanted to bypass us. So it's a very noninvasive deployment, a very easy deployment. And if a customer knows what policies they want to have enforced, it's very easy to get them up and running. I think where we spend the most time, and this is something I could talk about when we get to where we're headed, is taking the ideas or the written laws and rules for an organization and, kind of, like, turning those into real operational policies in Immuta. That's probably where our customers spend the most time when implementing our product.
[00:31:12] Unknown:
And in terms of the complexities involved in the overall platform, there are a number of areas that I can see as being particularly challenging, with things like managing a unified access control layer across all of those different data systems that you're working with, like BigQuery, Snowflake, Databricks, etcetera. And then also with data masking, where it can be challenging to understand what data is actually sensitive, where I know that you can go in and do some labeling, but some of the automated aspects, understanding what are some effective heuristics and what are the edge cases that you need to be aware of, particularly as an end user. And then some of the challenges of actually masking that data dynamically at read time versus actually having the data stored in an obfuscated format. I'm just curious what you see as being some of the complexities and trade-offs in those different challenges and what you're trying to build.
[00:32:08] Unknown:
I'll start with what you just ended with: do you mask on the way into the cloud or do you do it dynamically? And so there are other approaches you could take here, where you might have data that's very sensitive and you never want it to live in the cloud at all, and so you would encrypt it before it leaves the walls of your network on premises. And there are some use cases where we typically will talk to customers about, okay, you might wanna do that with your most sensitive data. Because as soon as you do that, you're essentially making it useless, because any time you want to ask a real question of that data, it's going to be encrypted when you try to answer that question. And we have the ability to be able to do that. And in the case of queries where you're trying to query for a specific value in that column, you know, if you meet the policy, we have tricks we can do where we'll actually encrypt the predicate of your query so that it'll actually match the encrypted value in the database, so that you can ask specific questions like that.
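A minimal sketch of the encrypted-predicate idea: if values were tokenized with a deterministic keyed hash on ingest, an equality predicate can be rewritten to that same token so exact-match lookups still work against a column that only stores ciphertext. The key, tokenization scheme, and rewrite below are illustrative assumptions for this explanation, not Immuta's actual scheme.

```python
# Rewrite an equality predicate so it matches tokens stored at ingest time.
# Assumes the same deterministic keyed hash was applied to the column on load.

import hashlib
import hmac

INGEST_KEY = b"shared-ingest-key"   # hypothetical key applied before load

def tokenize(value: str) -> str:
    return hmac.new(INGEST_KEY, value.encode(), hashlib.sha256).hexdigest()

def rewrite_equality_predicate(column: str, literal: str) -> str:
    """Rewrite `WHERE column = 'literal'` to match the stored token instead."""
    return f"WHERE {column} = '{tokenize(literal)}'"

# On ingest the SSN was stored as tokenize("123-45-6789"); the rewritten
# predicate matches that token, so an exact lookup still succeeds.
print(rewrite_equality_predicate("ssn", "123-45-6789"))
```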
But as soon as you start trying to do anything fuzzy against an encrypted column, you're gonna have a hard time, like any kind of math operations against numeric values. We try to coach our customers through these steps of, hey, you want to potentially encrypt your most sensitive data, but then there are your other indirect identifiers, for example, which are also very critical to things like linkage attacks. You know, first name, last name, credit card number are obviously direct identifiers. But you need to worry about things like, hey, if I owned a very unique vehicle that existed in this table and someone knew that, like, hey, I owned a 1968 Volkswagen Rabbit.
Someone could go into this table and immediately find my record and maybe my other assets because they knew about that very rare car that I owned. And so while on the surface, you wouldn't think that, you know, the make and model of your vehicle is sensitive, it can become an indirect identifier. And so this is the really the big challenge I think customers face with things like GDPR and CCPA. Because at the end of the day, if you want to anonymize your data, you need to be worried about indirect identifiers. But if you're worried about indirect identifiers and you are naive about how you mask them, you're going to make your data completely useless. And there's this paper by Paul Ohm.
If anyone's interested in doing some reading, it's called Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization. And there's a quote in there where he says something along the lines of, data can either be useful or perfectly anonymous, but never both. So you really need to play in this gray area between being completely cut off from the column and having full access to the column. With most built-in security tools on the market today, it's basically a binary decision like that: do you get to see the column or not? And Immuta provides these advanced privacy enhancing technologies, called PETs, things like k-anonymization and local differential privacy, also termed randomized response, where we essentially give you a level of utility from the column but also enforce a level of privacy that allows you to meet both the demands of your legal team and your analytical teams. And combining this with the concept of encrypting on ingest to the SaaS data warehouses really gives you powerful leverage on both meeting your legal responsibilities and enabling your data analytics in the cloud.
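For readers unfamiliar with randomized response, the local differential privacy technique named above, here is its textbook form: each record reports truthfully with some probability and otherwise flips a coin, so any individual answer is deniable while aggregate rates remain estimable. The parameters and population below are illustrative, not tied to any product.

```python
# Textbook randomized response: per-record plausible deniability with
# recoverable aggregate statistics.

import random

def randomized_response(truth: bool, p_truth: float = 0.75) -> bool:
    """Report the true value with probability p_truth, otherwise a fair coin flip."""
    if random.random() < p_truth:
        return truth
    return random.random() < 0.5

def estimate_true_rate(reports: list[bool], p_truth: float = 0.75) -> float:
    """Invert the noise: observed_rate = p_truth * true_rate + (1 - p_truth) * 0.5."""
    observed = sum(reports) / len(reports)
    return (observed - (1 - p_truth) * 0.5) / p_truth

population = [True] * 300 + [False] * 700            # true rate is 30%
reports = [randomized_response(v) for v in population]
print(round(estimate_true_rate(reports), 2))          # close to 0.30
```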
That is challenging. And part of that is our reliance on things like a common identity manager. So, you know, Tobias in Immuta needs to be the same Tobias in Databricks. And so we'll approach that by saying, hey, look, we need to have a common identity. We can either do that mapping in Immuta or, more typically, if the customer is using something like LDAP or Active Directory, we, Databricks, and Snowflake would all use the same identity manager. And similarly, you know, when you're capturing audit logs, you want the user in those audit logs to be consistent. And the nice thing about this is your audit logs will then be consistent. So you're not having different audit logs from Databricks than you do from Snowflake, as you do from BigQuery, as you do from Redshift. They'll be consistent now, because Immuta is basically monitoring these queries at the data plane and capturing not only audit records about who's querying what, which, of course, is important. But, honestly, I think what's equally important is who's building what policies and why and when and how they're changing.
That history is also captured in our platform so that you could understand basically your governance stance over
[00:36:47] Unknown:
time. And in working with your customers and trying to help them gain better control and security around the data that they're working with, what are some of the common blind spots or areas that are completely overlooked that you have found?
[00:37:00] Unknown:
I think one of the areas that we often don't think about at all is the idea of purposes: you know, for what purpose is this data being released, and how do we tie that to the individual consumers who are actually accessing that data? If you think about sort of historical dimensional modeling, the data platform owner is optimizing for the most generally useful model that's also performant in the database. But that comes at the cost of losing track of exactly why data is being accessed. And starting to get customers to think in that sort of context-specific approach is really useful because it simplifies some of the access control decisions.
You can exempt yourself or a certain set of users from a certain set of policies for a certain project, but not change the core model or the core policies of your database. So we find that that's a really useful abstraction to introduce. It's sort of the way we already think about data projects, in a modular way, with a start, a middle, and an end. But at the database level, that doesn't really exist. So I would say that's one area that is becoming increasingly important with new privacy laws and that we see as a new concept for a lot of customers.
[00:38:26] Unknown:
One other thing I'll add just real quick, taking a step back, is what we call the three phases of data security and privacy. Phase zero, and I call it zero because everyone should be doing this, is you just don't let the bad guys in. Right? You have security and logins, and only your company gets to your data. And then the next phase of that is privacy, where you're enforcing fine-grained controls on your data for your internal employees. And a lot of people say, okay, we're done. That's great. We've enforced our controls. They forget about phase two, which is data collaboration. When your employees are creating new derived tables, how do you ensure that your policies get inherited and passed down to those derivative data products?
And that's a really, really hard problem, which, again, our projects concepts, without getting into too much detail, aims to solve.
[00:39:20] Unknown:
And in terms of people who are using Immuta, what are some of the most interesting or unexpected or innovative ways that you've seen it employed?
[00:39:28] Unknown:
Great question. I mean, my favorite use case, which I'm allowed to talk about, is probably one of our smallest customers, but it's the COVID Alliance. It's this group that has built a data platform. They built it on top of Snowflake, and they are collecting COVID data and sharing that with researchers, and they wanna share it in a way where the researchers can do their research on this data in a way that builds on each other's work, because they're able to share their models and the data across efforts through this platform.
And we're the foundational piece of all of this, where they're essentially using us to not only anonymize the data but manage the sharing across these research teams on top of Snowflake. And some of the advanced anonymization techniques that we have that I mentioned earlier, where we're playing in that gray area between utility and privacy, are really the power that's enabling them to do this, because, obviously, this is highly sensitive information. And, you know, they've given me quotes from researchers along the lines of, hey, I never thought collaboration like this on data would ever be possible, not only from a data anonymization perspective, but from the ease of collaboration with other researchers that we've never even met before at our university, for example.
[00:40:49] Unknown:
I think the COVID Alliance is a great example because they also have a very strong human level of governance on top of the data infrastructure. So for each project or use case that they're presenting to their end users, there's a data privacy impact assessment that goes into very rigorous detail around what are the potential privacy impacts, you know, what's the purpose, what policies are in place to mitigate the risk of harm to the individuals who contribute to the study. And that sort of mindset is very familiar to me from the academic world, where every time you started an experiment, you would go through an institutional review board process where you would get authorized to do certain things with a certain set of data for a certain limited purpose, and you'd have to keep track of that from start to finish. And that's really missing from a lot of our data workflows in industry, in my opinion. Having that sort of check from the time data was acquired to the time it was used and making sure that there's alignment between those two things is a really challenging governance problem, technically and sort of organizationally.
[00:42:01] Unknown:
And in terms of your own experience of building the Immuta product and working with your customers and trying to advance the case for data governance and security and access control, what have you found to be some of the most interesting or unexpected or challenging lessons in that process?
[00:42:18] Unknown:
I think it's really hard as a data scientist or data platform owner to get in the mindset of an attacker. A lot of the folks in the security world, you know, might come from a background of thinking in terms of vulnerabilities and, you know, worst-case scenarios if an attacker got a hold of the database. But, you know, I guess just speaking personally, it doesn't come naturally to me as a data scientist. I'm usually more attuned to think of the potential value in a very noisy dataset rather than the potential risks. And so it's been challenging to put myself in the shoes of thinking, how can something go wrong with this dataset? Why can't I release this dataset as it currently sits?
What potentially nefarious things could people do against the data subjects in my dataset? But I think that's really a skill that we're gonna have to become better at if we're gonna be better as a community at protecting data.
[00:43:15] Unknown:
There's the concept of privacy guarantees, and we get into situations where I think the data engineering teams understand the basics of what they wanna do, and they've kind of kicked the can down the road on the complicated scenarios because they didn't know that there was an automated solution that could solve this. And, again, I'm referring to these privacy enhancing technologies. For example, I can't name the customer, but, you know, we got in there with them. They started enforcing, you know, table-level and basic column-level masking, and then they discovered that we had the k-anonymization policy.
And they were like, oh, wait. This means I could open up this table to all my analysts, because you could hide, you know, like, the CEO and, like, you know, other highly unique values in certain columns from someone being able to do a linkage attack. So long story short, I think people understand the bare-bones basics of anonymizing data, which is basically hide it or not. And once, you know, people understand kinda the art of the possible with some of these advanced privacy enhancing technologies, that just opens a whole new world of use cases that they weren't even considering, because they didn't know they could consider it. For somebody who is
[00:44:30] Unknown:
looking for a governance solution and considering the use of Immuta, what are the cases where it's the wrong choice?
[00:44:37] Unknown:
What I usually say is you want your data to be structured to the level that you need your policies to be enforced. So we are not great for unstructured use cases. Now, that being said, we can do object-level controls if you've got, like, images or PDFs tagged in S3, but you need to give us some kind of structure in order to enforce policy. So that is a limitation of the platform.
[00:45:00] Unknown:
And in terms of your goals for the near to medium term, what do you have planned for the future of both the platform and the business?
[00:45:09] Unknown:
I kind of alluded to this earlier. One of the things that we find our customers struggle with is this idea of, how do I take rules and turn them into policies? So essentially, you know, we give them all the wood and the hammers and the nails to build whatever house they want, right, based on their policy blueprint. We wanna actually take a step back a little bit and say, hey, let's help you build the blueprint. You give us the risk profile that you're willing to take, and we'll create the right policies for you. So adding some more automation around the construction or the creation of a blueprint for what you wanna construct, policy-wise.
Kind of related to that is being more focused on sharing use cases. And if you're familiar with HIPAA at all, HIPAA is the US regulation for health care data. And there are essentially two ways to enforce HIPAA on your data. One of them is what's called HIPAA safe harbor, where there's essentially a defined set of 18 columns that you have to mask in order to have your data be compliant with HIPAA. You know, it's things like names, addresses, social security numbers, things like that. The harder, more complex way to do it is something called HIPAA expert determination, where you actually have to hire a statistician to come in. And, you know, this is a slow, kind of arduous process, and it's not cheap, where they'll basically say, okay, you've anonymized this well enough for your use case. And the reason you wanna do that is that safe harbor is too restrictive in some cases for the use cases that you want to solve. So we actually are going to automate HIPAA expert determination and make it so that the platform can act as that expert and, you know, enforce the regulation in a way that does not require a human in the loop. And we think we can extend that process to other regulatory controls like GDPR and CCPA.
And then the last thing I'll bring up is we really see ourselves as the last step in ELT. So I think, you know, everyone understands this movement of kind of putting the transform after the E and the L, and we believe the G comes after the transform. So again, this is removing all of that policy logic out of the transform layer, letting it happen dynamically, and defining it in a declarative way in Immuta so that we become a natural extension of the ELT pipeline, becoming ELTG, the G being governance. And, you know, how do we better integrate with tools like dbt.
[00:47:41] Unknown:
One of the things that's really exciting to me as a data consumer and platform owner is the idea of getting to a place where we are much more modular and prescriptive with how privacy is handled and policies are put in place. And I think attribute based access control is really the only way to get to a place where we can reuse policies across companies, because it's much more expressive, it's much more generic, and it gives data teams a way to communicate with each other to say, like, hey, this is the policy that we have in place. We give this data to these users. We put these transforms in place. And it's really by abstracting that G from the ELT that we can build a set of best practices that are easily implementable.
[00:48:25] Unknown:
Are there any other aspects of the Immuta platform or the challenges of data governance and security and access control that we didn't discuss yet that you'd like to cover before we close out the show?
[00:48:35] Unknown:
You know, in terms of data governance, there's a lot of talk right now about data quality and integrating metadata from jobs and from tests and from consumption tools into one place. And I think as we look at managing a data platform beyond just access control, to, you know, making sure that the right people get the right data at the right time and it looks the right way, it's really an exciting time to think about merging these things into a common place. Because at the end of the day, it's all about reliability of the data and reducing risk and maximizing value. And I think the more we can become very concrete about the actual tasks that have to happen in the data platform, the better off we're all gonna be long term.
[00:49:24] Unknown:
Well, for anybody who wants to get in touch with either of you and follow along with the work that you're doing, I'll have you each add your preferred contact information to the show notes. And so for the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. And I'll start with you, Stephen.
[00:49:42] Unknown:
Yeah. I think that one of the biggest gaps right now is having one place for understanding where the data lives, what's happening to it right now, and, you know, what I need to do as a data owner. You know, when we think about governance, entitlements, and security specifically, one of the big problems existentially is that, you know, a lot of times, we don't have a really good sense of what's happening with our data at any given moment. There's a very real spread problem where we've got tons of data assets, there's lots of important decisions that have to happen, and there's lots of monitoring that has to go on as well. And I'm hoping that we're kinda moving as a community to having some standards on how to manage and monitor and understand the health of the data pipeline, and then being able to communicate that to end users. So it's not just about me as the data platform owner knowing what's happening, but also about end users who are using the data having visibility into what's happening and having a central place they can go to. So I think that's a huge gap.
[00:50:47] Unknown:
And, Steve.
[00:50:48] Unknown:
To me, it's kind of related to what Stephen said, but I think it's more tied to, again, those three phases I mentioned earlier, where I don't think people are thinking enough about how data analysts need to do their own transforms and kind of, you know, manage the data how they want to manage it, to some degree, for their analytical use cases. And my belief is compliance, or enforcement of policy, is the biggest blocker to that. And so this contributes to there only being a small set of data engineers that can really service all these transformations that are required across the organization, because they're that small trusted group that's allowed to see all the data and that you trust with your database.
And I think, again, going back to it, data governance is really the inverse word. It really should be, you know, data acceleration. How do we make everyone more efficient? And I think if you have the right kind of controls in place to enable everyone to manage data while still meeting the demands of compliance, that's really the end state that everyone wants to get to. That's really what we as a product are trying to achieve. So it's more than just how do you create policies on existing tables and forget. It's more about how do you enable everyone to transform data in a way that makes sense for them, and share those transformations.
And that's, I think, when governance gets really, really hard. And there's just not much out there to help with that.
[00:52:22] Unknown:
Well, thank you both for taking the time today to join me and discuss the work that you've been doing at Immuta. It's definitely a very interesting platform, solving a very real problem that we have in the data industry. So I appreciate all the time and energy that you put into that, and I hope you enjoy the rest of your day. Yeah. Thank you. It was a pleasure being here. Thanks so much, Tobias. It was great.
[00:52:47] Unknown:
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Sponsor Messages
Interview with Steve Touw and Stephen Bailey
Overview of Immuta and Data Governance
Current Landscape of Data Governance Solutions
Integrating Immuta with Data Platforms
Dynamic Policy Enforcement and Data Sharing
Architectural Evolution of Immuta
Challenges in Data Masking and Access Control
Innovative Use Cases and Lessons Learned
Future Goals and Enhancements for Immuta
Final Thoughts and Closing Remarks