Summary
Data governance is a practice that requires a high degree of flexibility and collaboration at the organizational and technical levels. The growing prominence of cloud and hybrid environments in data management adds additional stress to an already complex endeavor. Privacera is an enterprise grade solution for cloud and hybrid data governance built on top of the robust and battle tested Apache Ranger project. In this episode Balaji Ganesan shares how his experiences building and maintaining Ranger in previous roles helped him understand the needs of organizations and engineers as they define and evolve their data governance policies and practices.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- This episode is brought to you by Acryl Data, the company behind DataHub, the leading developer-friendly data catalog for the modern data stack. Open Source DataHub is running in production at several companies like Peloton, Optum, Udemy, Zynga and others. Acryl Data provides DataHub as an easy to consume SaaS product which has been adopted by several companies. Sign up for the SaaS product at dataengineeringpodcast.com/acryl
- RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
- The most important piece of any data project is the data itself, which is why it is critical that your data source is high quality. PostHog is your all-in-one product analytics suite including product analysis, user funnels, feature flags, experimentation, and it’s open source so you can host it yourself or let them do it for you! You have full control over your data and their plugin system lets you integrate with all of your other data tools, including data warehouses and SaaS platforms. Give it a try today with their generous free tier at dataengineeringpodcast.com/posthog
- Your host is Tobias Macey and today I’m interviewing Balaji Ganesan about his work at Privacera and his view on the state of data governance, access control, and security in the cloud
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Privacera is and the story behind it?
- What is your working definition of "data governance" and how does that influence your product focus and priorities?
- What are some of the lessons that you learned from your work on Apache Ranger that helped with your efforts at Privacera?
- How would you characterize your position in the market for data governance/data security tools?
- What are the unique constraints and challenges that come into play when managing data in cloud platforms?
- Can you explain how the Privacera platform is architected?
- How have the design and goals of the system changed or evolved since you started working on it?
- What is the workflow for an operator integrating Privacera into a data platform?
- How do you provide feedback to users about the level of coverage for discovered data assets?
- How does Privacera fit into the workflow of the different personas working with data?
- What are some of the security and privacy controls that Privacera introduces?
- How do you mitigate the potential for anyone to bypass Privacera’s controls by interacting directly with the underlying systems?
- What are the most interesting, innovative, or unexpected ways that you have seen Privacera used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Privacera?
- When is Privacera the wrong choice?
- What do you have planned for the future of Privacera?
Contact Info
- @Balaji_Blog on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- Acryl: ![Acryl](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/2E3zCRd4.png) The modern data stack needs a reimagined metadata management platform. Acryl Data’s vision is to bring clarity to your data through its next generation multi-cloud metadata management platform. Founded by the leaders that created projects like LinkedIn DataHub and Airbnb Dataportal, Acryl Data enables delightful search and discovery, data observability, and federated governance across data ecosystems. Sign up for the SaaS product today at [dataengineeringpodcast.com/acryl](https://www.dataengineeringpodcast.com/acryl)
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
This episode is brought to you by Acryl Data, the company behind DataHub, the leading developer friendly data catalog for the modern data stack. Open source DataHub is running in production at several companies like Peloton, Optum, Udemy, Zynga, and others. Acryl Data provides DataHub as an easy to consume SaaS product, which has been adopted by several companies. Sign up for the SaaS today at dataengineeringpodcast.com/acryl. That's A-C-R-Y-L. Your host is Tobias Macey. And today, I'm interviewing Balaji Ganesan about his work at Privacera and his view on the state of data governance, access control, and security in the cloud.
[00:01:41] Unknown:
I'm Balaji Ganesan, and I'm CEO, cofounder of Privacera, and we'll talk about Privacera in a minute. We are at the intersection of data governance and data analytics. Very exciting space, and so really excited to talk about those topics today. And do you remember how you first got involved in working with data? It's a bit of a history walk here before Privacera. Just as a background for me, I grew up in India. I did my schooling there. I came to the US for work. I was working for a consulting company, and my first gig was at a company called Apple, which, you know, at that time, was inventing new iPods and iPhones. I kinda was bit by this Silicon Valley culture, this notion of, hey, you can produce really good things to change the world. So it kinda bit me, and I ended up staying in the valley to build my career around that.
Having worked with consulting companies, you come across a large variety of enterprises of all sizes. This was pre 2010, and there was starting to be a shift in the way companies were trying to run their business and get a competitive edge, which was really looking at data across the industries. People were modeling at that time around how Google ran their business, or Yahoo, which was big at the time and now it's not. How these Silicon Valley companies have used data to leverage their business, and then every organization, every Fortune 500 company, was trying to model around that. Right around that time is when I met my cofounder, Bosco. He had sold his startup to Oracle, and we were thinking about the newer set of paradigms that was happening. And data was 1 of the topics we picked because there was a shift in the way companies were looking at it, where they want to now store as much data as possible, as Yahoo or Google would do, and analyze that. And this is where they started to look into these big data technologies at that time, which was called Hadoop, and other technologies, which were open source.
You can store as much data as possible and analyze it, which was a proposition most companies were starting to like. And Bosco and I fundamentally believe every time there's an architectural change, there are certain things that come to change. And security was 1 of them. Governance was 1 of them, where we found a gap: these always lag behind a technology paradigm shift. And it was the same thing with Hadoop and big data as well, where we felt like the focus was on analyzing data. The focus was on how you can store and run fast queries, but governance was an afterthought. So when this platform needs to go into the enterprise, there is a challenge: gaps always exist, and those things are always trailing. So we saw this as an opportunity to say, hey, we have worked with enterprises. We really understand enterprise needs around governance and security. Bosco spent a lot of his career in the identity and access management space. So we took this paradigm shift and said, hey, how can we help companies doing that? And that was the start of our earliest startup, called XA Secure.
It started in 2011 around the big data area, on how we can apply a common authorization, a common way of people accessing data. That took off in the market, and eventually, we were acquired by another big data company called Hortonworks. This was before the IPO. The product was acquired and open sourced, and it's now known as Apache Ranger. I led security and governance at Hortonworks and had a ringside view of enterprises going through this journey of moving from legacy to modern big data distributed systems, but also how you think about the challenges that come with the data part of it, especially around governance. And we helped a lot of enterprise companies put together an approach as part of it. But companies would always come back and say, hey, what you're doing within your Hadoop platform is great, but our data is spread across multiple data sources. We have Oracle. We have Teradata. We're now getting into the cloud.
We really need governance across all our layers. And that set out the notion for Privacera, which was take 2, our attempt at solving this problem where the architectures have shifted again from Hadoop into the cloud, but security and governance have become an even more demanding topic. So we are excited to be in this journey of helping customers with that part of it. But this has been a culmination of a lot of years of work in this space
[00:06:04] Unknown:
that we're excited to bring and help customers with. In terms of the focus of Privacera, I know that 1 of the areas that you're investing in is enabling this data governance and access control in the cloud. But I'm wondering if you can just give a bit of an overview about what you've built there and why it is that you feel that this is a problem space that's worth spending your time and energy focusing on? Yeah. Absolutely. Let's talk about what data governance is in the first place, where,
[00:06:34] Unknown:
you know, data governance includes policies, standards, procedures for making sure data is available, it's consistent, and it's secure across the life cycle. Viewing data as a unit across the organization and making sure there are guardrails around how data can be used and how data is processed as part of it. And as this data architecture has grown, as the data use has grown across the enterprises, it's become a fundamental need in the organization. The area we are tackling, around security and privacy, has come into more prominence, I would say, with privacy laws coming in and culture changing within the organizations in the last few years.
With breaches happening and more breaches happening, there was the Capital One breach, which happened in the cloud, and there are these breaches which give organization executives pause in terms of how they're thinking about cloud. But privacy has come back, and laws have come up to say, hey, you really have to treat customer data with extra care. You have to know what data you have, but also make sure you're using this data for the right purpose at all times. So those are behavioral changes, cultural changes happening across the organization. We were talking to a Bay Area company, which has traditionally been fairly open inside the organization, which means if you are within the firewalls of the company, within the boundaries of the company, you can get access to any data. This was the culture a few years back.
And now they are moving into a culture where, you know, if you're an analyst or data scientist, before you can access data, you have to prove why you need access to it. So a fundamental shift in culture is happening today where people are asking questions and also implementing these procedures and controls around who can access what data. And it's a fairly complex problem because data is everywhere. Even in the cloud, it's spread across everywhere. It's spread across multiple databases, data stores. And if the organizations are moving towards making sure the right people have access to the right data, those controls have to be implemented at all levels where data is stored. And that's everywhere.
The data is on prem in the data center. There's data in the public cloud. It's distributed across services. So implementing controls in that paradigm is hard. It's still fairly manual. Most organizations will go and put in administrators who can go and put these rules in. But it's hard to stitch all of this together and really know you're doing the right thing. And this is where a tool like Privacera is coming in. It's helping customers in their journey where, on 1 hand, they have tremendous requests coming in from the users for access to data. As data democratization has grown, this data revolution is going on where more business teams are really looking to leverage data. These requests for access to data are increasing.
On the other hand, privacy and security teams are carrying more weight and are saying, if you're on the cloud, you know, you have this distributed data. You really have to make sure the right people have access to the right data, and it's used for the right purpose. And doing that, what we call this dual mandate of balancing frictionless use of data versus balancing security and privacy, is an incredibly hard job. It's an incredibly complex job. And it's fairly manual today, and this is where a tool like Privacera comes in. We really automate all of that by providing a single pane of glass where we have built the plumbing within our tool to take care of those controls.
We give them visibility on data. We give them visibility on who's doing what, and the ability to enforce all the policies, all the nuances of policies that come up in the enterprise around who can access what data, being able to do that in 1 place, and being able to enforce it wherever the data is stored. So in effect, we help enterprises solve that dual mandate problem. It's not a 0 sum game. You can have faster access to the data, but also put in security, privacy, and governance.
[00:10:42] Unknown:
In terms of the overall space of data governance, as you said, there are a lot of aspects to it, and there are different challenges that are introduced depending on your operating environment and your tooling ecosystem. And I'm curious, what are some of the lessons that you've learned from the work that you did on Apache Ranger during your time at Hortonworks that have helped to inform and help you prioritize your efforts around the work that you're doing at Privacera?
[00:11:11] Unknown:
As I mentioned, this is a complex endeavor, and we have been working on this for close to a decade. There's a lot of learning from working with a community. We were fortunate to have a growing community with Apache Ranger, with 3,000 plus enterprises adopting it. And there's a lot of learning you get by seeing different flavors of what the organizational challenges are, how they implemented their data ecosystems, or how they're looking at governance across industries, across geographies in some cases. So we had a ringside view of how organizations have done that over the years and what different challenges come in. And it's not an easy endeavor to do that at this level when you're going and looking at implementing controls wherever data is stored.
So those learnings have been built into Apache Ranger over the years, and we have learned in terms of performance, we have learned in terms of scalability, and we have learned in terms of what the best architectures can support, what the do's and don'ts are that come with implementing governance, and the best practices that come with that. Those have all gone into our Privacera journey. So we didn't start from scratch. We leveraged those learnings in solving a bigger challenge that is happening in the cloud, which is that data is even more distributed. There are even more choices that customers have in terms of how they can run their workloads and data, and governance and privacy have become even more real over the years. So those learnings have all gone into those best practices as part of it. We have leveraged the work from the Ranger community to jump start that part of it, and we continue to do that. Apache Ranger is still a very thriving community.
We continue to push the boundaries of which new data sources and newer platforms we can cover, and we leverage that part. We contribute back to the Ranger community to keep that community going. At the end of the day, we believe in open standards, and we believe in doing things which can integrate and talk with other tools and other platforms as well. And being built on top of Apache Ranger helps us do that as well. So in terms of learnings, in terms of best practices, in terms of the open standards, Apache Ranger helps us build even better modern data platform governance than if we had done it from scratch.
[00:13:36] Unknown:
Yeah. The open standards piece is something that has definitely been getting a lot of attention and focus in recent years with efforts in a variety of domains. And on the sort of governance side, I know that Apache Ranger has been a sort of de facto integration for a number of different products for their sort of open source option for governance. And I'm wondering if you can maybe speak to some of the sort of standardized aspects of that and some of the ways that Apache Ranger has helped to influence the rest of Privacera or some of these other systems that are focused on data governance or data privacy and data security?
[00:14:23] Unknown:
Apache Ranger fundamentally was built with the notion of how you can implement an architecture for a distributed system. How you can do governance in a distributed system, not being a data platform in itself, but being able to help customers bring together different data platforms and build governance on top of that. So it's always been built with an architecture of being a sidecar, an architecture which is not inline between users and the data. The fundamental principle built in when we built Ranger was that we are not a data processing engine. We are a governance engine or a security engine. So there was a shift from the traditional security models of running proxies, and how we built Ranger for distributed systems takes more work.
But at the end of the day, we are absorbing that work on behalf of customers and taking the complexity away from customers, because we are not impacting the user access to the data, the frictionless access to the data, by not being inline with that. That's the fundamental approach and that architectural bet that we made with Apache Ranger and that we have continued with Privacera. And the second part, which you alluded to, is the open standards side of things, where at the end of the day, what customers are really looking for is making sure this governance applies to whatever data platform they choose today and whatever platforms they have in mind for tomorrow.
And it's a constantly evolving landscape where the business needs are evolving. And so technology can't be a static landscape here. It has to evolve with the modern needs of the business, and the business needs drive everything, including the adoption of which platforms they use. So they really need to not reinvent governance every time they go and adopt a new platform, a new cloud, or a new platform within that, which is where they're really looking at how they can standardize on 1 set of tools. And that needs compatibility. That needs the ability for this governance platform to interact with any cloud data platform, any cloud database, any cloud data storage, any cloud application as part of it.
And we fundamentally believe that it has to be an ecosystem play, where it's hard for anyone, including us, to build every possible connection to the data sources. We have to build a community. We have to build an open, plug and play model where the platforms can also plug in to a standard. And we have spent quite a bit of time with Ranger on the templated way of doing that. It's really had adoption in the open source community, and, you know, we had vendors coming in who have already built integrations using the open source templates. And they are coming to us and saying, hey, we already built an integration. We would like to go and certify it as part of it. So we are really seeing inbound interest from a lot of this community adoption, where what the customers are pushing towards is making sure the governance layer is compatible with whatever technology decisions they are making and vice versa.
So governance is top of mind for many of the data platform vendors, and having an open source layer has made it easier to go and be compatible. So we are seeing adoption from the Presto community or the Trino community. Companies like Starburst or Ahana, and other companies who have built on top of the existing layers, have taken the open source Ranger template and already built their own enforcement plugins. And that makes it even easier for customers to adopt the platform. So our approach has been fundamentally around making an architectural bet, which has been about making it easy for customers, not relying on proxy or virtualization, but also relying on the open standards here.
At the end of the day, the open standards will win. These platforms and the governance tools will need to talk to each other. It's hard to solve this problem by being a single proprietary vendor. Absolutely.
[00:18:26] Unknown:
In terms of the cloud specific aspects of what you're doing at Privacera, I'm wondering if you can speak to the particular constraints and challenges that are introduced by operating in cloud environments where maybe the end user doesn't have full control over the underlying systems, and so you have to rely on the cloud vendors to expose certain interfaces or being able to create a system that is accessible and integrates natively with this variety of platforms where there are some common ideas and abstractions, but each cloud vendor is going to implement their own particular functionalities and just some of the details of having to operate in that space?
[00:19:08] Unknown:
It's not very specific to cloud in general; it's a general data challenge today. The industry has not adopted a common standard for, let's call it, authorization. Right now, for who gets access to what, every database, every data platform has implemented slight variations of how they think about it. They all follow similar models, but there's role based or there's attribute based. And in the way databases implement that part, they will follow some guidelines, but there are nuances around that. In the longer run, we believe there is a play for standardization, just like in the identity world, you know, where SAML and other identity standards took hold. These standards take time. These standards usually take close to a decade, and through a lot of work across Ranger and other parts, we were starting to see that momentum build up, where the common interest of making it easier for customers means the longer play will be standards.
But in the short run, the challenge, as you outlined, is real, where even in the cloud, the rate of innovation is incredible. Many of these customers did not have a Snowflake or a Databricks 3 years back. Right? And they're adopting at a much faster rate. There are newer tools coming up in the cloud. Cloud has accelerated that rate of innovation that can happen, where your data is in 1 place. You can bring up a new service, a new data platform, fairly easily once you have data in 1 place. So cloud is accelerating the data democratization.
Cloud is accelerating the choices customers have today. That will continue to innovate. There's incredible innovation happening. The customers have newer and newer choices. Now the challenge for any governance vendor, whether it's us or anybody else, is how do you keep pace with that innovation, where the underlying data platforms are evolving and newer platforms are coming rapidly. And you have to make sure that you're providing a consistent single pane of glass experience to the customers. And that's something we have spent quite a bit of time mastering and thinking about, how we build these connections and connectors.
It's taken us a few years of that work to iterate and master that philosophy of keeping up with, and staying on top of, a dynamic environment and these technology shifts. Cloud is making it easier for customers to switch. Cloud is making it easier for these customers to go and adopt newer platforms, which is fantastic for the industry because you have choices. Choices mean competition, and competition drives everybody into providing a better experience. So I think the rate of innovation is great. The challenge for vendors like us is making sure we can keep up with that.
[00:22:11] Unknown:
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state of the art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up for free or just get the free t shirt for being a listener of the data engineering podcast at dataengineeringpodcast.com/rudder. In terms of the actual platform that you're building at Privacera, can you give some details about the architecture and implementation of how you have approached this problem and some of the ways that you're able to scale and integrate with the various systems that customers are trying to manage and secure with your platform?
[00:23:05] Unknown:
So just taking a step back, the aspect of data governance we're trying to solve is around security and privacy. We are solving a problem of how to get visibility into the data, make sure the right people have access to the right data, and make sure all of these policies and workflows and the governance associated with them are done in a central place. And like I was talking about, the origins of this are in some of the architectural bets we made early in the Apache Ranger project of building governance for modern distributed systems, doing away with the paradigm of traditional security models, which used to be that you put a proxy between the user and the application or data, and by being the middleman, you can control who gets access to what and do a lot of these controls. We fundamentally moved away from that, saying that's not the right approach. Ranger was built in a very distributed architecture model where the governance rules, policies, and workflows are maintained in 1 place, and the policies are enforced as close to where the data is in a very distributed way, in a way that we are not coming between the users and the database or the data as part of it. So that fundamental architecture bet is what we have built Privacera on, and we have extended it even further to go beyond what Apache Ranger was built on, to cover any data source, any database, any data storage on premise or across the cloud. And that architecture is hard. It's easier to build a proxy and put it in front of everything than to build the architecture that we are building. So it has taken years of work, but that's the right thing to do for customers.
And that's what the customers and the community are saying is the right architecture for them: 1 that provides consistent coverage but does not impede the experience users have with their data. So you're not rewriting an application. You don't have to throw out anything by putting a governance layer in. So how we can fit into an existing paradigm is what we fundamentally bet on. And we're extremely proud of the work that has gone into the community, and that's a fundamental differentiator for us. And so in the architecture, it really means that it's a distributed architecture where the governance and the policies and the workflows associated with them, whether you need to request access to data or you need to get reporting, happen in a central place.
But the enforcement, the actual decision of who gets access to what, gets made as close to the data as possible, which means we are enforcing this in every data source. Either natively, in many cases using the native enforcement the databases are built with, or with our own enforcement, which just gets closer to where the data is. So instead of doing all of this in 1 place, we have made a decoupled architecture bet. And in many cases it's done natively by the data platforms, which ensures the best experience, which ensures the best performance and scalability.
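To make that idea of central administration with distributed enforcement concrete, here is a minimal Python sketch of a Ranger-style plugin. This is only an illustration, not Privacera's or Ranger's actual code; the admin endpoint, policy schema, and field names are all hypothetical.

```python
import time
import requests

POLICY_ADMIN_URL = "https://policy-admin.example.com/api/policies"  # hypothetical central admin


class EnforcementPlugin:
    """Sketch of a plugin that runs next to a data engine: policies are
    administered centrally, but each access decision is evaluated locally,
    so the plugin never sits between the user and the data as a proxy."""

    def __init__(self, service_name, refresh_seconds=30):
        self.service_name = service_name
        self.refresh_seconds = refresh_seconds
        self.policies = []       # locally cached copy of the central policies
        self.last_refresh = 0.0

    def refresh_policies(self):
        # Periodically pull the latest policies for this service from the admin.
        resp = requests.get(POLICY_ADMIN_URL, params={"service": self.service_name}, timeout=5)
        resp.raise_for_status()
        self.policies = resp.json()
        self.last_refresh = time.time()

    def is_access_allowed(self, user, groups, resource, action):
        # Decision happens locally against the cached policies.
        if time.time() - self.last_refresh > self.refresh_seconds:
            self.refresh_policies()
        for policy in self.policies:
            if resource.startswith(policy["resource"]) and action in policy["actions"]:
                if user in policy.get("users", []) or set(groups) & set(policy.get("groups", [])):
                    return True
        return False  # default deny
```

In this sketch the data engine calls is_access_allowed on each request; the central service only changes policies and is never on the query path.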
And especially in the cloud, the cloud is boundaryless in terms of adoption. You're not sitting behind a physical firewall. So in the cloud world, these architectures work even better, where the enforcement and the administration happen in 2 different places. And it can be, you know, architected in whichever way the customers have architected their cloud part of it. So it's very flexible. It's very dynamic. But it's been built on years of experience working with lots of enterprises of all sizes and types. And over and over, this architecture has been proven to be right. Given that you are building on top of the Ranger project and basing a lot of your
[00:27:04] Unknown:
sort of experience on your work from doing that at Hortonworks and carrying it forward into Privacera, I'm curious, what are some of the additional layers or sort of user experience improvements or just sort of operations and maintenance aspects that you've had to build around it to be able to scale the Privacera platform and make it user friendly and accessible and adaptable to these enterprise clients?
[00:27:30] Unknown:
So Ranger is fundamental; there was a lot of plumbing built in from an architecture point of view, the scalability and the reliability part of it. When we were looking at adopting it for the cloud, we really looked at it from how do you build a solution rather than a specific tool. Like, how do we solve the problem? Right? So you really have to think through the entire user workflow of their journey, what happens, and build a solution. So we spent quite a bit of time really thinking about the higher level layer sitting on top of Ranger. How do we build a governance layer that is as easy to use as anything else? And how is it embedded into this modern way of doing things, this modern data world where users are more and more demanding of getting frictionless access to the data, but we have to always maintain this governance and security and privacy part of it? So the world is changing, where you no longer go and say, I'll block this data, you can't access it, or, you know, you can access it without any rules. The world is changing to how do you maintain that.
But there's a lot of nuance around reducing that friction. So we had to really think about that experience again. Because in the shift to cloud, those needs have changed from IT administrators controlling a lot to this being a business driven or end user driven paradigm. They are in more control. Building the solution, we're introducing a paradigm of even business users taking control of their own data and managing it. So the shift of paradigm is from IT managing everything to a more delegated model where IT does manage and security does manage some rules, but business teams can manage their own part of it. So we introduced something called governed data sharing towards that extent, where it goes to that paradigm of how we can make it easier for business teams to go and manage their own governance for their own data. Business teams are not very sophisticated. They don't understand the innards of Snowflake or Amazon infrastructure to know, you know, which tables, which files they need to get at. They understand business taxonomy. They understand a business glossary as part of it. So we really built the layer which does that translation for them. Can we give an experience where sharing data is as easy as it would be in Box or Dropbox, where you can go and easily take a file and say, I want to share it with John Doe, and I want John Doe to read or write? It's as simple as that.
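As a rough illustration of that Box/Dropbox-style, glossary-level translation, here is a small Python sketch of turning a business-glossary share request into per-system access policies. The glossary structure, system names, and policy fields below are invented for the example, not Privacera's actual model.

```python
# Hypothetical glossary: business terms mapped to the physical assets behind them.
GLOSSARY = {
    "customer_transactions": [
        {"system": "snowflake", "resource": "analytics.public.transactions"},
        {"system": "s3", "resource": "s3://data-lake/raw/transactions/"},
    ],
}


def share_dataset(term, grantee, permission):
    """Translate 'share customer_transactions with John, read-only' into
    per-system policies, so the data steward never sees table or bucket names."""
    policies = []
    for asset in GLOSSARY.get(term, []):
        policies.append({
            "system": asset["system"],
            "resource": asset["resource"],
            "users": [grantee],
            "actions": [permission],  # e.g. "read" or "write"
        })
    return policies


# Example: a share expressed at the business glossary level.
for policy in share_dataset("customer_transactions", "john.doe", "read"):
    print(policy)
```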
And we want to have a similar experience when it comes to data. So we built in this governed data sharing where we are bringing that more simplified experience, the notion of business users having access to their own decision making. Data stewards can make those decisions, which is a more scalable model. But we're doing it at an abstracted business glossary or taxonomy level, so they don't have to live with how Snowflake is architected or how Databricks is architected. That is something we can take care of. We can take care of the last mile connection. So we are doing the translation part of it. That's the part we really focused on when we built on top of Ranger: how we bring that experience in, to enhance the experience of
[00:30:52] Unknown:
doing work in a modern cloud. As you mentioned, some of the people who are interacting with this system are more of the business user side, and you were saying that their kind of familiar interface is in this data dictionary or business glossary. I'm wondering if you can talk to the different user personas that you have had to design for as you've been building out Privacera and some of the ways that you have thought through and implemented the collaboration patterns for these different personas as the different stages of the life cycle of data progress through these various systems?
[00:31:27] Unknown:
Yeah. So if you think about any enterprise, there are a variety of stakeholders here. So that's 1 of the fundamental bets we made when we founded Privacera: how we can build this as an enterprise tool rather than a tool for a specific persona. Even though in many of the cases the starting point is an IT administrator view, we have to take care of the people in the corporate functions. And in the corporate part, that means compliance, security, privacy. How we can give a privacy team a view to get visibility on who's doing what. We can give the ability for security teams to come in and add global rules.
Security teams are really looking at the entire corporation, and they will have global rules around how certain things can happen. So if you're multi country, you can say, hey, you know, users living in Germany cannot access these segments of data in the US, or vice versa. These are global rules, corporate policies, that can apply to any data, any application. So we give the ability for these corporate teams to come in and add that. And then there's the day to day of, hey, we need access to this data for running this project, which traditionally starts from IT, and now we have moved into making sure even business data stewards can get access to it. So today, we address a gamut of personas, from the corporate teams, giving them visibility, to IT administrators, as well as business data stewards. So it's very much a multi user view of the tool. Because our fundamental belief is that governance ownership is not owned by 1 single department and 1 single persona.
It's a very distributed responsibility, and that's the way the most modern organizations are thinking about it. And we are thinking about it in a similar way: when it comes to governance, it's such a complex problem that it can't be owned by 1 single person. There are different levels of ownership, different levels of accountability that come in, and our tool is built for that. Even in the early days of building Ranger, we were always building with an architecture in mind of how we can have these different levels of ownership and how we can carve out specific areas of ownership within the tool so that we can have the collaboration together.
Ultimately, this collaboration is what drives scalability. This collaboration is what drives organizations to get to the next level in terms of their governance needs.
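To illustrate the kind of global corporate rule mentioned above (for example, users located in Germany cannot access US resident data segments), here is a tiny attribute-based sketch in Python. The attribute names and the rule itself are hypothetical, just to show how a corporate-wide deny can sit above individual grants.

```python
def blocked_by_residency_rule(user_attrs, data_attrs):
    """Hypothetical corporate rule: users located in Germany may not access
    data segments that are tagged as US resident."""
    return user_attrs.get("country") == "DE" and data_attrs.get("residency") == "US"


def authorize(user_attrs, data_attrs, granted):
    # Global deny rules set by the security team take precedence over any
    # project-level grant a data steward may have approved.
    if blocked_by_residency_rule(user_attrs, data_attrs):
        return False
    return granted


# An analyst in Germany with a valid grant is still denied the US segment.
print(authorize({"country": "DE"}, {"residency": "US"}, granted=True))  # False
print(authorize({"country": "US"}, {"residency": "US"}, granted=True))  # True
```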
[00:33:59] Unknown:
In terms of the actual workflow of getting Privacera set up in an organization and defining these policies, and just the day to day interactions that end users have, I'm wondering if you can talk through that flow and some of the context and communication that's necessary to implement it effectively as people go about their day to day tasks.
[00:34:24] Unknown:
Most enterprise customers are following a governance journey, and they call it a governance journey. The initial part of the journey is where the controls and everything are done manually through administrators in their own silos. So every administrator is managing rules within every data source and every application, and it's a fairly siloed journey. So that's the starting point of where we usually come in. And the first order is how we can centralize that into a centralized UI or centralized APIs as the way of working at it. So when we go in, the first order is how we can go and implement a more centralized version of the solution, build those connectors, put those connectors into the data sources, and start managing these rules and policies in a central place rather than individual silos, and implement workflows and systems to do that part.
That's the journey that most organizations start with, and that involves the tools, that involves changing processes, and in some cases building more automation. Earlier, maybe users were creating those requests manually and saying, hey, I need access to this data, and now you're able to provide an automation layer using our workflows or the ability to use our APIs to do that. So in most cases, centralization is step 1. And it takes a while for organizations to align to that. And then step 2 is how we can take this model and make it more delegated, which means how we can start giving ownership of certain aspects of data to the business stewards or to more distributed tenants here, being able to go and manage these day to day policies in a more scalable way.
And we have some customers who have crossed that threshold of moving from centralized governance to delegated governance. But it's hard to jump to the delegated part on day 1. Because most customers are in this day 0 stage of running things manually, we always recommend a path or journey for them, and our customers fall at different points in that journey today. But it's not a small journey. It takes some work to not only align processes to it and adopt the tool, but make sure we can put together a program for the end state where they need to be in terms of scalability and governance. So our customers view this as a critical part of their mission of democratizing data and enabling responsible use of that data as part of their journey, but we are scratching the surface on that. Many of our customers are still in their journey. They have not reached that end state, which is going to be a fully delegated model of doing governance.
[00:37:11] Unknown:
As far as the progression that these organizations make when they start adopting governance and they start to go through this process of understanding what are the policies that we need to enforce, what are the rules that we need to be aware of, access controls. I'm curious what you see as some of the either gaps in knowledge or maybe gaps in understanding of the business requirements that are common as they start to explore and codify and formalize these rules that maybe have been implicit or sort of tribal knowledge up to this point? Yeah. That's a great point because
[00:37:49] Unknown:
in most cases, when you're looking at day 0, day 0 is the very manual way of doing things, the very manual way of managing policies. And when you're managing policies in silos, there are inconsistencies. There are redundancies. And it's sort of a legacy way of working that most administrators have worked through. So when we move into a more centralized model and then eventually into a delegated model, a lot of these redundancies go away. We have even seen a reduction in the number of policies because, you know, we are not doing this in silos. We are doing it in a more optimized way. There are functionalities that the tool provides to make it easier to do more global policies as part of it. But it also changes the way people think about policies, because in many cases they were built when users were managing databases in the on premise world and have now moved to the cloud. There's a lot of legacy tribal knowledge built in on how you manage that, and when you present a newer paradigm, it takes a while for that transition to happen from a process and a training point of view. What we have seen is wholesale adoption from that part. Most customers, most administrators come to believe, hey, this is the right way of doing things. It's much easier, much more scalable.
And, you know, they can spend their energy on doing more additional work for their own internal customers rather than really going and managing policies on a day to day basis. So there is a broader reception once you start showing some of this value as part of it. But when you go and put a tool in, you have to really look at the processes as well and the people. And we spend a lot of time making sure we bring that value in by guiding in terms of best practices, guiding in terms of a newer way of doing things. So a lot of this legacy way of doing things goes away once you start centralizing it and eventually moving into delegated governance. So I'd say most of our customers are on that journey, and for the set of customers who are going through the journey, we have done this quite a bit even when we were doing the Ranger project, which was, again, a different approach than what traditionally most administrators and users have been accustomed to. So we built a lot of these cookbooks and best practices and knowledge bases that we can help customers with. But in 2022, there's an even broader reception from enterprises to do things the right way.
And so we are excited to be part of that journey. But it's not a slam dunk. It's not something where you put a tool in and walk away. We have to really work on making sure the processes and the people are aligned as well.
[00:40:34] Unknown:
The most important piece of any data project is the data itself, which is why it is critical that your data source is high quality. PostHog is your all in 1 product analytics suite including product analysis, user funnels, feature flags, experimentation, and it's open source so you can host it yourself or let them do it for you. You have full control over your data, and their plug in system lets you integrate with all of your other data tools, including data warehouses and SaaS platforms. Give it a try today with their generous free tier at dataengineeringpodcast.com/posthog. In terms of the actual data assets that you are applying these controls to, 1 of the things that I know that you have as a capability is being able to do some measure of discovery of data assets, which in the cloud, and particularly as organizations go through migration processes, can become quite messy. And maybe there was a skunkworks project that somebody spun up that you need to be able to find and add some of these controls to. I'm curious what you see as some of the challenges of being able to discover these resources and some of the ways that you have built sort of the heuristics and the discovery patterns to be able to find these assets and bring them under the management of the Privacera platform.
[00:41:54] Unknown:
Yeah, you bring up a great point, which is that as cloud has grown, the span of control of what gets implemented is also very wide right now. There's just more and more decision making that is available to the business teams and business users, and there's a lot of power in doing that. So you're spinning up, you know, new resources, new projects, and other things at a tremendous scale. And when it comes to governance, you have to first start with visibility. Right? You have to really know what is happening in the world, and then you can govern that part. So for us, data discovery has very much been 1 of the day 0 use cases of understanding data, of really understanding. In many cases, customers assume they know what is happening because they know where PII is stored, and they get surprised when they run this tool and find it in areas they had not imagined before.
And that's immediate ROI. Like, really getting complete visibility of where their data is spread out and understanding, you know, where sensitive data is in the first place. Because now you can start implementing controls to govern that or protect that part of it. And in many cases, discovery has brought a level of surprise to many, many of our customers because they assume they are on top of it, and they get surprised. It's finding the needle in the haystack, and they will find new haystacks and new needles they didn't know were buried in it. But implementing that level of observability, that level of visibility, is extremely important in the data governance world. So in many cases, customers have approached data discovery as that day 0 part of it. In many cases, they have added it on after they have added on controls, and there's no right or wrong answer. Right? It really depends on the organization's maturity and their way of doing controls. With discovery, most customers get surprised by areas where they felt they knew what was there but found out they actually did not.
[00:43:57] Unknown:
And once you have discovered a particular data asset or a data storage location, I'm curious what the workflow is, or if there are any sort of automated heuristics that you're able to apply, to understand what is the appropriate metadata to associate with these assets to be able to know which controls and which regulations they need to be governed by, and just being able to sort of manage that onboarding process of new data assets, whether it's through this automated discovery process or through actually introducing a new data asset or new storage system to the governance platform.
[00:44:34] Unknown:
We put a lot of thought into reducing friction and building automation right now. So the automation is built in, in terms of, you know, when a new data source or new resources are created, there's a lot of visibility we get from being able to go and discover what's in them and automatically apply policies on top of them. So the solution is built in that way, where we can introduce things like classification based rules. So we can build policies at the data classification level, which means, you know, policies which say PII data can be accessed by, or cannot be accessed by, a set of people.
And it doesn't matter if the PII data is in Snowflake or is in Amazon or is in Azure. The moment we can discover something or tag something as PII, or we can get that information from some other source, it automatically becomes part of the policy. So it goes to the paradigm of create a policy once and have it continuously applied over every data asset, versus the older paradigm of every time a new data resource gets created, you have to go and make sure, you know, you build a policy for that. So we've taken away a lot of these friction points and applied that to our modern world, the way we have built the structures. And so if a user creates a new table, we can automatically discover it.
We can automatically apply a policy on top of it without any manual intervention as part of it. So those are automations that give value to customers, because it's really, really hard for a central administrative team to keep up with the pace at which the data is being used. In most cases, they don't have visibility. And these resources end up being part of an ungoverned data asset, which introduces risk to companies. So by having a comprehensive tool like Privacera, you can get both visibility as well as governance. And we can do that with automation, making sure we are constantly doing that without requiring a manual intervention.
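Here is a small Python sketch of that classification based, create-once-apply-everywhere idea. The tags, group names, and asset identifiers are made up for the example; it only shows how a policy written against a PII tag covers newly discovered assets without any new policy being written.

```python
# Hypothetical tag-level policy, defined once by the governance team.
TAG_POLICIES = {
    "PII": {"allowed_groups": {"privacy-office", "fraud-analytics"}},
}

# Assets found by scanning, each with the tags a classifier assigned to it.
DISCOVERED_ASSETS = {
    "snowflake://analytics.public.customers.email": {"PII"},
    "s3://data-lake/raw/clickstream/": set(),
}


def can_access(user_groups, asset):
    """Deny if the asset carries a tag whose policy excludes all of the user's groups."""
    for tag in DISCOVERED_ASSETS.get(asset, set()):
        policy = TAG_POLICIES.get(tag)
        if policy and not (set(user_groups) & policy["allowed_groups"]):
            return False
    return True


# A newly discovered column tagged PII is governed with no extra policy work.
print(can_access(["marketing"], "snowflake://analytics.public.customers.email"))       # False
print(can_access(["privacy-office"], "snowflake://analytics.public.customers.email"))  # True
```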
[00:46:33] Unknown:
In terms of your experience of building Privacera and working with your customers and seeing some of the ways that they're applying it to their problems, I'm curious, what are some of the most interesting or innovative or unexpected ways that you've seen it employed? I would say our customers are testing the boundaries for it.
[00:46:51] Unknown:
We had initially built Ranger and Privacera to handle internal use cases, you know, internal to companies. We were really thinking of this as internal users wanting to access data and how you apply the governance as part of it. But many of our customers started expanding this to their own data marketplaces, where they have external parties coming in, and how do you apply governance to that. That was something we learned along with the customers: it doesn't matter if the stakeholders are internal or external, you can apply the same level of governance around it. So we are fascinated to see external data marketplaces where there are external parties coming in and interacting, all on the same platform, on the same data, and how you want to make sure you have segregation of duties of who's doing what so they're not touching each other's data. And being able to do that using Privacera's tool was fascinating to see. And this is the fun part of working in the enterprises: our customers are also pushing the boundaries around how the tool can be adopted, how governance can be applied. So the data marketplace, the ability to leverage governance for external parties, was something we learned through the process. It's not something, I would say, we thought through initially, but it's been fascinating to see. And it goes back to saying, hey, privacy and governance are kind of a universally applied layer, and it doesn't matter if the users are internal or external.
You'll have to apply the same standards or even more rigorous standards in some cases.
[00:48:29] Unknown:
1 of the other things that I'm interested in is because of the fact that you are adding this layer of control and visibility and sort of compliance to the underlying data sources. I'm curious what you see as the sort of potential for actors within the organization to circumvent the controls of Privacera by going directly to the source systems and maybe any of the ways that you have sort of worked in to be able to prevent that or mitigate that in these organizations? Those are
[00:49:05] Unknown:
core things that we think about when we design a tool like Privacera: the ways it can be circumvented or the ways users can circumvent it. And so we built our mechanisms in a way that we can detect any changes that happen in the data sources. And if something changes, we'll override it with the source of truth that is there in Privacera. And so Privacera will always have the source of truth. We are the single source of truth. And we are constantly updating the data sources with that source of truth if we need to. Right? So that's 1 way we always maintain consistency: even if, by any means, a user goes and updates a database, an underlying data source, with something, we will detect that and will refresh it, you know, with the source of truth that we have. So it will overwrite that part of it. And so the data source will always have a consistent view, a consistent level of policies, that is maintained in Privacera as part of it. So we maintain that. We build these connections with that paradigm in mind. And there are other redundancies that have been built into the system to make sure that it's not circumvented. Again, it goes back to the learnings that we built even in the Apache Ranger project. When we initially built that over big data systems, which had multiple ways for users to get access to the data, we had to really think about every access path and make sure those are governed, and that's been part of the promise.
But we wanna do it in a way which keeps the experience users have of leveraging data and having frictionless access to the data. So we have tried to keep that balance in the architecture.
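As a rough sketch of that drift detection and override idea, here is a small Python example of a reconciliation loop that treats the governance layer as the source of truth and reverts grants changed directly in the data source. The functions below are stand-ins; real integrations would call the governance platform's and the data source's own APIs.

```python
# Stand-in for reading the grants defined centrally in the governance layer.
def fetch_desired_grants(asset):
    return {("john.doe", "read"), ("privacy-office", "read")}

# Stand-in for reading the grants currently present in the data source itself.
def fetch_actual_grants(asset):
    return {("john.doe", "read"), ("jane.admin", "write")}  # drift: an out-of-band grant

def reconcile(asset):
    desired = fetch_desired_grants(asset)
    actual = fetch_actual_grants(asset)
    for grant in actual - desired:
        print(f"revoking out-of-band grant {grant} on {asset}")
    for grant in desired - actual:
        print(f"restoring missing grant {grant} on {asset}")

reconcile("analytics.public.customers")
```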
[00:50:52] Unknown:
In your experience of building Privacera and working in the space of data governance and access control and security, I'm curious, what are the most interesting or unexpected or challenging lessons that you've learned?
[00:51:02] Unknown:
I think the biggest challenge we see is that when we are implementing governance, there is, I would say, a lack of awareness among enterprise customers around the journey it takes to get governance in, in terms of people, processes, and tools. And that's 1 of the things we always try to bring in early on when working with a customer: how they should think about the journey. It's not just about privacy. Governance is a very broad spectrum. It includes quality and metadata, and we are solving 1 aspect of it. And governance approaches need a lot of collaboration within companies, between their enterprise teams and their data teams and business users, to put together a structure and a mode of communication to do that. Otherwise, just buying a tool in itself will not serve the purpose of the business unless they really think about the process and the team aspect of it, which I think the industry, including us, can do a better job at right now. So we are spending our time and energy in making sure we are educating the customers on the journey.
In some cases it's a multiyear journey. You have to think about day 0, day 1, and day 2 use cases. You don't boil the ocean on day 0: pick a smaller subset of use cases, roll that out, incrementally add more people, and build the program from there. What we see is that many enterprise customers buy a governance tool and believe the tool alone will solve the problem. As with any other tool, you have to align the processes and the teams and make sure all of them are thought through, and that takes time and investment as well. I won't call it a surprise, but it is something we keep seeing. Many enterprises are getting better, but there's still a lot of work to be done by the industry in educating customers that governance is a journey; it's not an overnight success.
You have to really think about day 0, day 1, and the killer use cases that you can deliver up front.
[00:53:13] Unknown:
For people who are starting down the path of introducing data governance, access control, and data security, and who are starting to explore the requirements and policies that they want to put in place around their data, what are the cases where Privacera is the wrong choice and they might be better suited to using Apache Ranger directly, a different commercial vendor, or building their own in-house systems? What are some of the limitations of Privacera that would lead to it not being the most optimal choice for a situation?
[00:53:48] Unknown:
So, as I said, there are two sides I would point out. One is that data governance is a broad spectrum. If the focus is finding data and metadata, or the focus is consistency and quality of data, those are things we don't do. If customers are looking for a data quality tool, we are not the choice. We are the choice for access governance: understanding what sensitive data you have and making sure the right people have access to the right data. So when customers think about the governance journey, they also have to think about the various aspects of governance and what those journeys look like. If the journey is primarily around quality or even metadata, Privacera is not the right choice. There are partners who can really do that job, and we'll go and recommend those partners as part of it. But if the use cases are around access governance, then there is the second side: our focus has been primarily the cloud. If customers are completely on premises and don't have anything in the cloud, we may not be a fit. That is mostly from our go-to-market side.
It's not that the solution doesn't fit; it's that our focus is on customers who are embarking on a cloud journey or are already in the public cloud, because we believe that's where the data gravity is, so we tend to focus on that part of it. In some cases, if you have only one data source and nothing else, and you're small enough that you're running everything in that one data source, you can technically get away with just maintaining policies in that system and not need a tool on top of it. Our value add comes in when you have diversity in the ecosystem. Most customers have different ingestion tools, different data platforms, and different BI tools.
But there could be certain enterprise customers who are using just one data source and nothing else, and they're small enough that they can technically manage that without needing an overlay tool like Privacera. In those cases we always recommend that customers do whatever serves them best.
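As a purely hypothetical illustration of that single-data-source case, a small team might manage access entirely with the native grant system of its one warehouse, with no governance layer on top. The database, roles, and table names below are made up for the example.

```python
# Hypothetical example: a single PostgreSQL warehouse managed with native GRANTs,
# no overlay governance tool. Roles, tables, and connection details are made up.
import psycopg2

conn = psycopg2.connect("dbname=analytics user=admin")
with conn, conn.cursor() as cur:
    # Analysts get read-only access to the curated summary table.
    cur.execute("GRANT SELECT ON sales_summary TO analyst_role")
    # Only the finance role may read the raw revenue data.
    cur.execute("REVOKE ALL ON raw_revenue FROM analyst_role")
    cur.execute("GRANT SELECT ON raw_revenue TO finance_role")
conn.close()
```

The trade-off the answer points to is that this only stays manageable while there is a single engine; once several platforms and BI tools are in play, each with its own grant model, a central layer becomes the simpler option.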
[00:55:47] Unknown:
As you continue to iterate on Privacera and build out the business and the platform, what are some of the things you have planned for the near to medium term?
[00:55:56] Unknown:
Yeah, there's a lot of exciting stuff right now. We are continuing to invest in our SaaS platform, the ability to deliver what we do as a SaaS offering where customers don't have to manage the software and its life cycle. They just get the functionality; they get governance as a service. A lot of investment is going toward that. We also continue to expand all parts of the product. Like I said, there's the governance layer of getting visibility into data, governing data sharing, and the workflows; those are heavy investment areas. And we continue to invest in coverage, making sure we can cover as much of the data estate as quickly as possible for customers, because customers are really looking for a universal single pane of glass where they can manage everything.
That means we need to cover as many data sources as possible, so there's a lot of emphasis on delivering connectors faster and on more universal coverage. There is also a big emphasis on the go-to-market side, which is continuing to work on educating customers with our message and expanding our reach across the North America and Europe regions.
[00:57:09] Unknown:
One of the other things that I'm interested in briefly visiting is, I'm curious, what are some of the ways that your work at Privacera has been able to feed back into the Ranger project?
[00:57:22] Unknown:
A lot of it. We stand behind the Apache Ranger community. As I pointed out, that is where our origin story is. We continue to invest in the community, putting back the learnings we gain when we go and test the boundaries of governance in the cloud. We continue to put some of those core learnings back into Apache Ranger itself. So we try to make the commitment to keep the community healthy and to keep the functionality usable in Apache Ranger itself. Not all of our solution in Privacera is in Ranger; we are not a 100% open source company. We have tried to keep the layers that are very much about enterprise user experience in the enterprise part of the Privacera solution, but the core engine and the libraries we have tried to keep updated within the Ranger project. That goes back to our belief in maintaining a very open core and open standards.
The other piece of it is how we continue to work with partners at the open source level. If there are ways for our partners to work in open source, we engage them at the community level, which means their contributions can also be put in. That has been true of the Presto community, for example, with both the PrestoDB and Trino communities. As those communities have grown, they've been able to contribute changes back, and those changes have come back into Apache Ranger. So there's a lot of collaboration happening at the community level, and we continue to stand behind making sure this community stays healthy and engaged.
And we continue to build those open standards that will eventually help every enterprise.
[00:58:59] Unknown:
Are there any other aspects of the work that you're doing at Privacera or the overall space of data governance and security and access control that we didn't discuss yet that you'd like to cover before we close out the show?
[00:59:15] Unknown:
So, Tobias, first of all, thank you for having me. I think we covered a lot of ground today. If I need to summarize: data governance, the rules, policies, and standards about how you handle data, is becoming real, and it covers a wide spectrum from quality and metadata to security. When it comes to security and privacy, there is a real need in the market. There's real friction between the business use of data and the corporate mandates around how the data should be used, and that's where Privacera comes in, helping customers in their journey. In many cases, customers are on a journey from manual governance to centralized governance to, eventually, delegated governance. We're excited, and we're very proud of many of our customers and of being part of the community driving those changes. So I think we covered a wide variety of things, and obviously there's a lot of detail behind that.
We're happy to help anybody listening in with any questions. So you can always reach out to us through our website
[01:00:11] Unknown:
or send us an email, and we'd be happy to respond. Alright, for anybody who does want to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[01:00:29] Unknown:
We still see a lot of siloed approaches to data management. It's a fairly noisy market, and everybody has come up with their own approach. I think the ones who will win are those who recognize that data is going to be multi-cloud and hybrid cloud tomorrow, and whose approach is agnostic to wherever the data lives or is stored, decoupled from the underlying platform. We believe data management, operations, governance, and security will be horizontal plays in the future, and the players who truly take a horizontal view across any cloud and any application will win.
[01:01:13] Unknown:
Alright. Well, thank you very much for taking the time today to join me and share the work that you've been doing at Privacera and your work on Apache Ranger, both prior to that and ongoing. I appreciate all of the time and energy that you and your team have been putting into making data governance and security more accessible, scalable, and adaptable to our evolving environments. So I appreciate all of that, and I hope you enjoy the rest of your day. Thank you so much for having me, and I appreciate everybody listening in. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Guest Introduction
Balaji Ganesan's Background and Early Career
Focus of Privacera and Data Governance in the Cloud
Lessons from Apache Ranger and Open Standards
Cloud-Specific Challenges and Innovations
Privacera's Architecture and Implementation
User Personas and Collaboration Patterns
Workflow and Implementation of Privacera
Data Discovery and Automation
Customer Use Cases and Unexpected Applications
Challenges and Lessons Learned
When Privacera is Not the Right Choice
Future Plans for Privacera
Contributions Back to Apache Ranger
Closing Remarks and Contact Information