Summary
One of the core responsibilities of data engineers is to manage the security of the information that they process. The team at Satori has a background in cybersecurity and they are using the lessons that they learned in that field to address the challenge of access control and auditing for data governance. In this episode co-founder and CTO Yoav Cohen explains how the Satori platform provides a proxy layer for your data, the challenges of managing security across disparate storage systems, and their approach to building a dynamic data catalog based on the records that your organization is actually using. This is an interesting conversation about the intersection of data and security and the lessons that can be learned in each direction.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Your host is Tobias Macey and today I’m interviewing Yoav Cohen about Satori, a data access service to monitor, classify and control access to sensitive data
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by describing what you have built at Satori?
- What is the story behind the product and company?
- How does Satori compare to other tools and products for managing access control and governance for data assets?
- What are the biggest challenges that organizations face in establishing and enforcing policies for their data?
- What are the main goals for the Satori product and what use cases does it enable?
- Can you describe how the Satori platform is architected?
- How has the design of the platform evolved since you first began working on it?
- How have your experiences working in cyber security informed your approach to data governance?
- How does the design of the Satori platform simplify technical aspects of data governance?
- What aspects of governance do you delegate to other systems or platforms?
- What elements of data infrastructure does Satori integrate with?
- For someone who is adopting Satori, what is involved in getting it deployed and set up with their existing data platforms?
- What do you see as being the most complex or underserved aspects of data governance?
- How much of that complexity is inherent to the problem vs. being a result of how the industry has evolved?
- What are some of the most interesting, innovative, or unexpected ways that you have seen the Satori platform used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while building Satori?
- When is Satori the wrong choice?
- What do you have planned for the future of the platform?
Contact Info
- @yoavcohen on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
Your host is Tobias Macey. And today, I'm interviewing Yoav Cohen about Satori, a data access service to monitor, classify, and control access to sensitive data. So, Yoav, can you start by introducing yourself?
[00:01:09] Unknown:
Sure. Thank you for having me, Tobias. So I'm the co-founder and CTO of Satori Cyber.
[00:01:16] Unknown:
And do you remember how you first got involved in data management?
[00:01:20] Unknown:
It all started when I joined a cybersecurity startup about 10 years ago. The data technologies we all have today were either less common back then, or we just didn't have the budget for anything off the shelf. So we basically built our own data lake without even calling it that. And since then, I've been involved in building and operating global-scale data processing systems.
[00:01:46] Unknown:
And so you mentioned that some of your background is in cybersecurity. I'm wondering if you can give a bit of a description about what it is that you're building at Satori and some of the story behind how the product and the company got started?
[00:02:00] Unknown:
Sure. So Satori builds a universal data access service that provides visibility and control over data access and helps data teams spend less time implementing security and privacy controls and more time working with data. It's a transparent data proxy that analyzes queries and their result sets, builds a comprehensive view of what data you have, where it's located, and how it's being used, and provides tools to enforce security and privacy policies on data access. Now, the story behind building Satori goes back, as I mentioned, about 10 years. One of the biggest challenges we were facing with our homegrown data lake was how to ensure proper use of the data in an environment where there are different stakeholders for the data. You have regulation, you have compliance requirements, and all of that at a very large scale of dozens of data centers and a ton of data. Conventional tools either weren't up to the scale we were operating at, or were too expensive for us to use as a small startup.
And so we ended up spending a lot of engineering cycles and resources to make that work. And when Eldad, my co-founder and CEO, and I started thinking about what we wanted to do next, we came back to these challenges and found that things have only gotten worse in the industry since then. Today, it's easier to operate data at scale, and for 99% of companies, you can do that pretty easily with today's tools. But governing data at scale with techniques that were invented decades ago is not a good fit anymore. And when we started talking to data-driven organizations, we realized this was no longer a problem of just the big tech giants. It was everybody's problem.
[00:03:46] Unknown:
And so in terms of the overall capabilities for handling access control and governance for data, you mentioned that a lot of the practices and platforms for doing that predate the current iteration of how we think about data and how we process it. So I'm wondering if you can give a bit of an overview of your perspective on the state of the industry for governance, and how Satori compares to some of the other platforms, tools, and practices for managing these governance requirements?
[00:04:22] Unknown:
Satori's approach is different in that we understand companies don't really want to re-architect their data environment just to get the security and privacy capabilities they need today. Given the choice, most organizations will adopt something that is easy to implement, and that became our mission. That's as opposed to other solutions out there, in which you need to register all of your users with the solution and model all of your data sources and datasets before you can actually start using the tool. With Satori, all of that upfront investment doesn't exist. Once deployed, you immediately get in-depth visibility into data access in your environment and can start enforcing security and privacy policies on your critical data flows.
To summarize, I think Satori's approach is very much pragmatic, while other solutions out there are trying to boil the data governance ocean, so to speak.
[00:05:12] Unknown:
And so for organizations that are trying to establish policies for data governance, and to define and enforce them in a repeatable way that isn't subject to varying interpretations and that is very clear as to the intent, purpose, and steps necessary to enforce those policies, what are some of the biggest challenges that organizations are facing because of the current state of the data ecosystem?
[00:05:44] Unknown:
I think the biggest challenge in enforcing policies is the overhead that the process generates for data engineering teams. The reason for that is that, in most cases, data access policies are defined using database objects like permissions, grants, or views. Modifying and managing these objects is a very technical process, and it's also very risky. A lot can go wrong when you change these things. That's why it's handled by very technical folks, the data engineers. What we're hearing from the companies we talk to is that, on average, 30% of the time spent in data engineering teams goes to managing access and permissions.
And when you talk to data engineers, they'd much rather have the business take care of that, since the business has the right context to do it. That way, the data engineers can be more focused on generating new datasets to help the business.
[00:06:37] Unknown:
And then as far as the goals of the Satori product and platform, what are the primary use cases that you're looking to enable? And what are the elements of data governance that you are currently deciding not to try and tackle as part of the platform that you're offering?
[00:06:56] Unknown:
So our main goal for Satori is to be easy to implement in a complex environment and to provide elegant solutions to problems that today are being addressed by manual work. For example, take how dynamic data masking is implemented in Satori as opposed to other solutions. Usually, what you see in dynamic masking solutions is a policy defining how to mask each column based on the role of the person viewing the data. That's pretty much the way it goes. That is really hard to maintain at scale, because as new data is introduced and existing data changes, someone has to keep that policy up to date. With Satori, because we classify data as it's being accessed, our masking policy is defined at the level of data types and not columns.
So a single masking policy can be used across different types of data stores, and it doesn't have to be updated when new data is introduced. As for use cases that we enable, it's mostly around automatically discovering and classifying sensitive data, running analytics and data science without compromising sensitive data, and mitigating security and compliance risks by auditing data access and enforcing policies on data access.
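To make the type-level masking idea concrete, here is a minimal sketch in Python of a masking policy keyed by classified data type rather than by column. This is an illustration only, not Satori's implementation or API; the data types, masking rules, and the classifier output are all hypothetical.

```python
# Illustrative sketch only: a masking policy keyed by data type, not by column.
# The data types, masking rules, and classified columns below are hypothetical.

MASKING_POLICY = {
    "email": lambda v: "***@" + v.split("@")[-1],   # keep domain, hide local part
    "ssn": lambda v: "***-**-" + v[-4:],            # keep last four digits
    "name": lambda v: v[0] + "***",                 # keep first initial
}

def mask_row(row, column_types):
    """Mask each value according to its classified data type.

    `column_types` maps column name -> classified data type; columns whose
    type has no masking rule pass through unchanged.
    """
    masked = {}
    for column, value in row.items():
        data_type = column_types.get(column)
        rule = MASKING_POLICY.get(data_type)
        masked[column] = rule(value) if rule else value
    return masked

# Because the policy is keyed by data type, a newly introduced column that is
# classified as "email" is masked without any change to the policy itself.
row = {"customer_email": "jane.doe@example.com", "signup_ts": "2021-03-01"}
types = {"customer_email": "email"}  # produced by a (hypothetical) classifier
print(mask_row(row, types))
```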
[00:08:09] Unknown:
Particularly in the scope of compliance issues and the regulatory regimes that companies might be subject to, what are the risks associated with getting data governance wrong or overlooking certain aspects or capabilities of it? And what are some of the potential ramifications of ignoring or overlooking those aspects of data governance?
[00:08:35] Unknown:
There are a few main risks organizations take on when their data governance programs are not meeting today's requirements. The first one is the risk of a data breach. If sensitive data is leaked outside of the organization, that can lead to a loss of consumer trust and to regulatory fines on the organization. The second one is data quality issues. A lot of organizations today are data driven, or striving to be data driven, and when data governance is not working correctly, that can lead to data quality issues, which in turn can lead to incorrect decisions based on incorrect data.
[00:09:16] Unknown:
And in terms of the actual Satori product and the platform that you've built, can you give an overview of how it's architected and some of the design goals that you had at the start, and how those assumptions or intentions for the platform have changed or evolved since you first began working on it and started to onboard more customers and work with them to address their design challenges?
[00:09:42] Unknown:
So the Satori platform is comprised of two main components. The first is the data access controller. This is where we have our proxy, our classification engine, and our policy engine. The data access controller can either be consumed as a service or deployed in the customer environment. The second is the management console. This is a SaaS application where you manage your data stores and policies on the Satori service. When a new data store is added to the service, Satori generates an alternative host name for accessing that data store, and that host name leads to the Satori data access controller when accessed. So from a data consumer perspective, the only thing that changes is the URL of the data store. And in many cases, that change can also be applied centrally, so it becomes invisible to data consumers.
Once data consumers start accessing their data via Satori, Satori analyzes the queries and classifies the result sets to understand what type of data is being accessed and where it's located within the data store. We aggregate all that metadata and generate a data inventory where you can see the schemas of all of your data and the tagging that we provide on top of those schemas. All of that happens without having to do any configuration or setup on the system. It's all automatic. It's almost like we crowdsource the creation of that data inventory, which is kept up to date continuously based on what we observe from data access.
All of this context that we collect and generate is provided as input to our attribute-based access control policy engine, which customers can use to implement any type of data access policy they need. You asked about things that have changed since we started building the platform. I must say, not a lot has changed. But one thing that we learned was very important, which we didn't realize early on, was analyzing the queries and not just focusing on result sets. In order to create that data inventory I was talking about, it's imperative that we understand which locations in the data store queries are accessing.
And the only way to do that in a high-quality way is by analyzing queries, and most queries are non-trivial. So that has been a challenge which we had to overcome and which we didn't anticipate at first.
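As a rough illustration of the "crowdsourced" inventory described above, here is a simplified Python sketch in which each observed query contributes the locations it touches and the result-set classification contributes tags. It is not Satori's engine: the table extraction is intentionally naive, and the classification step is assumed to have happened elsewhere.

```python
# Simplified sketch: build a data inventory from observed query traffic.
# Real query parsing is far more involved; the table extraction here is
# deliberately naive and the result-set classification is taken as given.
import re
from collections import defaultdict
from datetime import datetime, timezone

# inventory[(datastore, table)] -> {"tags": set of data types, "last_accessed": datetime}
inventory = defaultdict(lambda: {"tags": set(), "last_accessed": None})

def tables_in(query):
    """Very naive extraction of table names following FROM/JOIN keywords."""
    return re.findall(r"\b(?:from|join)\s+([\w.]+)", query, flags=re.IGNORECASE)

def observe(datastore, query, result_tags):
    """Record a query and the data types classified in its result set."""
    now = datetime.now(timezone.utc)
    for table in tables_in(query):
        entry = inventory[(datastore, table)]
        entry["tags"].update(result_tags)
        entry["last_accessed"] = now

# Hypothetical traffic flowing through the proxy:
observe("warehouse", "SELECT email, plan FROM billing.customers", {"email"})
observe("warehouse",
        "SELECT c.email, o.total FROM billing.customers c JOIN sales.orders o ON c.id = o.cid",
        {"email", "currency"})

for (ds, table), entry in inventory.items():
    print(ds, table, sorted(entry["tags"]), entry["last_accessed"].isoformat())
```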
[00:12:17] Unknown:
I'm also curious about the benefits of Satori's proxy-oriented nature as compared to other systems that might rely on a push-based approach, where the ETL pipelines publish their metadata to the data governance system, or that crawl the metadata from the data storage layer. What are some of the benefits and trade-offs of using this proxy approach versus a more active or heavyweight integration process for pulling in that information?
[00:12:49] Unknown:
The advantage of the proxy approach is, first of all, that it's focused on the data that's actually being accessed, as opposed to data that just sits there and is not accessed. Second, under the hood, we actually do both. We look at data access, but we also query the data store to get the schemas of tables, and we incorporate that information into the inventory. So you could say that we actually do both, but we do it in an ongoing way instead of once a week or once a day. It's ongoing and kept up to date, and that's one of the big advantages. Whenever new data is accessed, it's automatically classified.
So the data inventory is kept fresh and up to date all the time.
[00:13:36] Unknown:
And I'm curious if you have seen cases where, because you're only populating the catalog when people are querying for certain records, that leads to blind spots in terms of the types of assets that the company has, and whether additional work is needed to be able to identify and locate those unaccounted-for records or assets or data locations?
[00:14:08] Unknown:
So we haven't encountered that as a significant challenge. Whenever a query is sent to data that Satori might not have seen before, we automatically update our inventory on the fly, and that becomes available to the policy engine and in our visibility and analytics tools.
[00:14:30] Unknown:
And as far as the data platform components and the aspects of data infrastructure that you're working with, because of the fact that you're proxying, I'm wondering how that either simplifies or complicates the work of integrating with different types of databases or storage systems?
[00:14:49] Unknown:
So, actually, early on we thought this was going to be a main challenge, and we call that coverage: data store coverage. How are we going to support multiple types of systems? What we've learned, and we've built the system to support this, is that the only thing we need to do to support a new type of data store is understand how clients of that data store communicate with it, so that we understand the actual network protocol used. After we do that, many data stores share the same behavior.
And so for us today, adding support for a new data store is something we can do in just two or three weeks, and we're up and running. There are a lot of similarities between how different data stores behave, and modern data stores usually rely on HTTP-based protocols, which alleviates a lot of the work we have to do at the network level to parse each data store's specific protocol.
[00:15:50] Unknown:
Digging more into the actual deployment and integration piece of getting Satori set up, particularly on the point of these HTTP protocols, I'm curious if you've seen any difficulties in acting as a middleman for those connections because of things like TLS and needing to decrypt the communications to understand the intent and then proxy them on to the backend, and what some of the common patterns are for how those TLS layers are segregated across the network. Do you act as the TLS terminator? And is it common for traffic to then go unencrypted to the backend, or to be re-encrypted to the backend? What are some of the deployment considerations that play into that kind of setup?
[00:16:36] Unknown:
So it's a good question. All of the data traffic that goes through Satori is encrypted end to end. That's important to understand. We do act as a TLS terminator, because we need to analyze the queries and we need to analyze the results. The way we accomplish that is by providing a Satori-generated host name for data consumers to access the data. So traffic between data consumers and Satori is encrypted by a TLS session that is established with our proxy, and traffic from our proxy to the data store is encrypted by a TLS session that is established with the data store. So you get end-to-end encryption all the way.
Some of the older data platforms have proprietary ways of handling encryption, so we've had to build support for those protocols, and we have. But we never degrade the security or privacy of the environment.
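For readers curious what "terminate TLS and re-encrypt to the data store" looks like mechanically, here is a heavily simplified, single-connection sketch in Python. It is not Satori's proxy: it omits concurrency, protocol parsing, and error handling, and the listen address, upstream host, and certificate paths are placeholders.

```python
# Minimal single-connection TLS-terminating relay (illustrative only).
# Client -> (TLS we terminate) -> this proxy -> (new TLS session) -> data store.
import socket
import ssl

LISTEN_ADDR = ("0.0.0.0", 15432)                    # where clients connect (placeholder)
UPSTREAM = ("datastore.internal.example", 5432)     # the real data store (placeholder)

# TLS context for the client-facing side, using the proxy's own certificate.
server_ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
server_ctx.load_cert_chain("proxy-cert.pem", "proxy-key.pem")  # placeholder paths

# TLS context for the upstream side, verifying the data store's certificate.
client_ctx = ssl.create_default_context()

with socket.create_server(LISTEN_ADDR) as listener:
    raw_client, _ = listener.accept()
    with server_ctx.wrap_socket(raw_client, server_side=True) as client_conn, \
         client_ctx.wrap_socket(socket.create_connection(UPSTREAM),
                                server_hostname=UPSTREAM[0]) as upstream_conn:
        # Relay one request/response exchange; a real proxy would pump both
        # directions concurrently and inspect queries/results in between.
        request = client_conn.recv(65536)
        # ... query analysis and policy enforcement would happen here ...
        upstream_conn.sendall(request)
        response = upstream_conn.recv(65536)
        # ... result-set classification would happen here ...
        client_conn.sendall(response)
```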
[00:17:38] Unknown:
Going back to what you were saying at the beginning about a lot of your background being in the area of cybersecurity, I'm curious if there are any other ways that that background has influenced your design and thinking about data governance and how to architect the platform to be secure and scalable?
[00:17:57] Unknown:
We have a very long background in cybersecurity and in building cybersecurity solutions. We brought our experience from our previous company building global-scale proxy networks and operating a SaaS service into Satori. Our background in cybersecurity also informed how we approach data governance, and the way we see it, there are many similarities between these two fields. I think the first is the need for deep visibility into the activity of the actors in the system. In cybersecurity, these could be hackers or bots. In data governance, these could be scripts or applications, or analysts, or even malicious insiders.
By providing deep visibility into how these actors operate in the environment, we give organizations the information they need to build and apply the right controls. The second thing that we bring from our background in cybersecurity is the need to provide operational value, not just to the team that bought the solution, but to adjacent teams as well. So, for example, if a Satori sale is being championed by the chief data officer because of a need to simplify the process of creating a data inventory, the privacy team can also benefit from the fact that they can now generate data access reports for compliance purposes.
And it's important to get that buy-in from all of the teams that are involved and provide value to all of them.
[00:19:39] Unknown:
In terms of the overall life cycle of data, Satori acts as a secure means of access control and cataloging and is able to do things like dynamic masking. I'm curious how it sits in the overall life cycle of data management, some of the other ways that it integrates with the broader data platform, and how users of Satori think about the product in the overall scope of their data governance strategy?
[00:20:07] Unknown:
So in an overall data governance strategy, Satori's role is to be the enforcement point for data access policies. The way we fulfill that role is by analyzing the queries, analyzing the results, and building all of that metadata so the policies can be well informed. What we don't do, and instead delegate to the environment, is authentication, as an example. Satori does not authenticate users. We rely on the existing authentication capabilities that are deployed in the environment, whether it's usernames and passwords or things like SAML or OAuth or LDAP, to do that job for us.
Other integrations that we have, which are very important, are with the identity provider systems. As I mentioned, Satori is not an authentication or user management solution, but user context is very important in order to enforce data access policies. So we connect to your Okta or your Active Directory or your PingOne, and we pull information about users, for example, which organizational groups they are part of or which roles they are assigned to, so we can use that as context, as input into our policy engine.
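As a hedged illustration of feeding identity-provider context into a policy engine, the sketch below pulls a user's groups and roles from a generic REST endpoint. The URL, token, and response shape are placeholder assumptions and do not correspond to an actual Okta, Active Directory, or PingOne API.

```python
# Illustrative only: pull a user's groups from a generic identity-provider REST
# API and hand them to a policy engine as attributes. The URL, token, and JSON
# shape are placeholders, not a real IdP's API.
import requests

IDP_BASE_URL = "https://idp.example.com/api"   # placeholder
IDP_TOKEN = "..."                               # placeholder credential

def user_attributes(user_email):
    """Fetch group memberships and roles for a user from the IdP."""
    resp = requests.get(
        f"{IDP_BASE_URL}/users/{user_email}",
        headers={"Authorization": f"Bearer {IDP_TOKEN}"},
        timeout=5,
    )
    resp.raise_for_status()
    profile = resp.json()
    return {
        "groups": set(profile.get("groups", [])),
        "roles": set(profile.get("roles", [])),
    }

# The attributes become input to the access-control policy, rather than the
# proxy authenticating the user itself, e.g. (hypothetical policy engine):
# attrs = user_attributes("analyst@example.com")
# decision = policy_engine.evaluate(user=attrs, resource=..., action="select")
```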
[00:21:30] Unknown:
And so in terms of the overall space of data governance, you mentioned at the beginning that a lot of the current challenges exist because governance policies and capabilities haven't evolved at the same rate as the underlying data platforms and the ways that data is being used. I'm wondering if you think that is simply because of how those two pieces have evolved, or if it's inherent to the problem, and what contributes so much to the complexity and to the fact that it is such an underserved market?
[00:22:04] Unknown:
So I think it's mostly how the industry has evolved. If you think about it, each database vendor operated in a silo, building their own access control systems, which led to a lot of inconsistency in how authorization is modeled and implemented in each platform. Even simple concepts like role-based access control are not implemented the same way in all of the systems. What we at Satori are envisioning is a different model, based on attribute-based access control instead of role-based access control, in which, instead of granting people to data, you grant data to people. What we mean by that is that, given a dataset, the data access policy should specify what data consumers need to do to get access to the data and the scope of the access that they get, rather than just relying on which Active Directory group they belong to and then giving them unrestricted access to the data.
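A minimal sketch of the attribute-based idea described here, assuming entirely hypothetical attribute names and policy rules, with group membership supplied by an identity provider and tags supplied by a classifier:

```python
# Illustrative ABAC check: the policy travels with the data ("grant data to
# people"), and the decision combines user, data, and request attributes.
# All attribute names and rules here are hypothetical, not Satori's policy model.

def can_access(user, resource, request):
    """Return (allowed, obligations) for a single data access request."""
    # Rule 1: PII-tagged data requires privacy training and an approved purpose.
    if "pii" in resource["tags"]:
        if not user["attributes"].get("privacy_training"):
            return False, []
        if request.get("purpose") not in resource.get("approved_purposes", []):
            return False, []
        # Allowed, but with a masking obligation for analysts.
        if "analysts" in user["groups"]:
            return True, ["mask:email", "mask:ssn"]
        return True, []
    # Rule 2: non-sensitive data is readable by any data-consumer group.
    if user["groups"] & {"analysts", "data-science", "engineering"}:
        return True, []
    return False, []

# Example request: group membership would come from the IdP integration
# mentioned above; tags would come from the classification engine.
user = {"groups": {"analysts"}, "attributes": {"privacy_training": True}}
resource = {"tags": {"pii"}, "approved_purposes": ["fraud-review"]}
request = {"purpose": "fraud-review"}

print(can_access(user, resource, request))  # -> (True, ['mask:email', 'mask:ssn'])
```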
[00:23:04] Unknown:
And I'm wondering if you have seen any trends in the governance space along the lines of what's happening in other areas of technology, of trying to standardize around common patterns and common interfaces for a particular problem domain so that different vendors and different open source products can innovate on the specifics of their internals while still making it simpler for the overall ecosystem to interoperate, without having all of these point-to-point solutions where you have an n-times-m problem, and instead coalescing around a default set of baseline capabilities, with maybe some additional protocol enhancements for specific use cases, particularly around things like attribute- or role-based access control?
[00:23:52] Unknown:
So, unfortunately, we haven't seen that yet. I really hope we can converge as an industry on something like that. Authentication, obviously, is pretty much standardized, and you would imagine that it would be implemented the same way in every platform. But since we operate in that space and have to deal with these complexities, we see different implementations of SAML and OAuth and other protocols in every type of data source. So, unfortunately, I'm not the bearer of good news in that area. I think the complexity of data access and permissions and policies is going to be with us for the foreseeable future.
[00:24:25] Unknown:
And another area where we could potentially see some common practices and common approaches might be in the labeling and tagging of data, to be able to easily identify whether something is PII and whether it falls under something like GDPR enforcement versus CCPA versus HIPAA and some of these various regulatory regimes, to make it easier to have prebuilt policy packs so that somebody can say, okay, now I need to be PCI compliant, so I need to pull in this pack of policies and apply them to the data that has these common labels. I'm wondering if you've seen anything like that.
[00:25:12] Unknown:
This is an interesting direction. And I think, to some extent, it's possible to provide organizations with templates of policies for how they can be compliant with different types of regulations. There's still a gap between the definition of the template and how it's implemented under the hood in different platforms. But, yeah, definitely, I think that adopting best practices is something that we're seeing happening, and we'll see more of that.
[00:25:42] Unknown:
And so for people who are interested in implementing Satori, what is the actual process for getting it deployed and integrated into their existing data platform?
[00:25:53] Unknown:
The first step is to pick one of your data stores. Basically, that should be the one that you want to get visibility into. Maybe it's your Snowflake or your Redshift or BigQuery. Once you register the host name of that data store with Satori, we provide you with an alternative host name to access the data store. And as mentioned before, that's the host name that routes traffic into our data access controller. From then on, every access via the new host name goes through Satori, and all the features and capabilities kick in.
If you want to move everyone to access the data store via Satori, that usually happens by updating your single sign-on app. You change the URL, and then you have everyone going through Satori. Once this happens, that's where the magic happens. The Satori dashboard and data inventory start filling up with information about your data access, and it's like flipping on a light switch. In fact, this is the meaning of the word Satori in Japanese, which is sudden enlightenment. And that is exactly the experience that we're looking to provide our customers with.
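From the data consumer's side, the change really is just the host name. As a hedged illustration, assuming a PostgreSQL-compatible data store accessed with psycopg2, with placeholder host names and credentials rather than anything Satori-specific, the connection might change like this:

```python
# Illustrative only: the same client library and credentials, with only the
# host swapped for a (hypothetical) Satori-generated host name.
import psycopg2

# Before: connecting directly to the data store.
direct = psycopg2.connect(
    host="analytics.example-corp.internal",        # placeholder
    port=5439, dbname="analytics", user="analyst", password="...",
    sslmode="require",
)

# After: same everything, pointed at the alternative host name so that
# traffic flows through the data access controller.
via_proxy = psycopg2.connect(
    host="analytics-example-corp.satori.example",  # placeholder
    port=5439, dbname="analytics", user="analyst", password="...",
    sslmode="require",
)
```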
[00:27:01] Unknown:
And in terms of users of Satori, what are some of the most interesting or innovative or unexpected ways that you've seen it being used?
[00:27:09] Unknown:
So one of the most impactful ways we've seen Satori being used is by a B2B2C marketing platform company, where PII of their customers' customers flows into their data stores in ways and places they cannot anticipate. They're using Satori to first make sure they have a good handle on where that PII is. And when new PII is introduced into the system, they can quickly get a handle on it, deploy access policies around it, and understand what's going on.
[00:27:45] Unknown:
And as far as being able to identify those new sources of PII, I'm curious what you have found to be useful heuristics for identifying these different patterns. I know that there are certain commonly structured data types such as credit card numbers or social security numbers and addresses, but it's also difficult to just do a brute-force regular expression approach or simple pattern matching, because it's possible for the column definitions to change or for some of that information to be split across multiple columns.
I'm wondering what you have found to be some of the challenges of automatic discovery of these types of records and some of the useful strategies that companies can employ to make sure that they don't fall between the gaps?
[00:28:34] Unknown:
What we do is a combination of three types of algorithms. We have dictionary-based classifiers for things like blood type, salutation, state codes, country codes, etcetera. We have pattern-based classifiers for things like email addresses, usernames, encrypted passwords, and so on. And for more complex data types that have a more free form, we have a set of machine learning based classifiers. When you combine all three, you get a pretty good handle on PII. The next level of complexity, after you do all that, is supporting PII in different languages, and that usually involves curating datasets in different languages, training sets, so the models can learn to identify PII in those languages.
Data classification has always been a tough problem to solve. I think that the combination of automating 80% of that work, maintaining a low rate of false positives, and providing customers with an easy interface to update or complement classification decisions with their own internal knowledge of the business and the data is the right approach. There's no silver bullet for that problem.
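As a rough sketch of the layered approach described here, dictionary lookup first, then patterns, then a model for free-form types, with the machine learning stage stubbed out and all dictionaries and patterns being illustrative rather than Satori's:

```python
# Layered value classification: dictionary lookup, then regex patterns,
# then a (stubbed) ML model for free-form types. Illustrative only.
import re

DICTIONARIES = {
    "blood_type": {"A+", "A-", "B+", "B-", "AB+", "AB-", "O+", "O-"},
    "salutation": {"Mr", "Mrs", "Ms", "Dr", "Prof"},
}

PATTERNS = {
    "email": re.compile(r"^[\w.+-]+@[\w-]+\.[\w.-]+$"),
    "us_ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
}

def ml_classify(value):
    """Placeholder for a trained model handling free-form types (names,
    street addresses, and so on); it always abstains in this sketch."""
    return None

def classify(value):
    # Cheapest checks first; fall through to the model only when needed.
    for label, vocabulary in DICTIONARIES.items():
        if value in vocabulary:
            return label
    for label, pattern in PATTERNS.items():
        if pattern.match(value):
            return label
    return ml_classify(value)

for sample in ["AB+", "Dr", "jane.doe@example.com", "123-45-6789", "742 Evergreen Terrace"]:
    print(sample, "->", classify(sample))
```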
[00:30:03] Unknown:
In terms of your own experience of building and growing the Satori product and business, what are some of the most interesting or unexpected or challenging lessons that you've learned in that process?
[00:30:14] Unknown:
So, coming from a SaaS background, we have a deep appreciation for the benefits customers get from vendor-operated solutions. However, in the data domain, we expected most customers would want a self-hosted solution, so we designed our system to support both deployment options. It turns out that the factor determining the desired deployment option wasn't the fact that Satori is a data governance solution; it was the customer's propensity to deploy and operate vendor solutions or to consume the value as a service. And so today, we have about 50% of our customers on the SaaS solution and 50% hosting it themselves.
We expect to see more usage in the SaaS platform as we make more progress.
[00:30:59] Unknown:
And for people who are evaluating data governance options and understanding how they want to manage access control, what are the cases where Satori is the wrong choice?
[00:31:10] Unknown:
I think that Satori, quote unquote, would be the wrong choice if customers are only looking for data discovery and classification and are not really looking to audit data access or enforce policies. In those cases, other solutions might be a good fit.
[00:31:30] Unknown:
And as you look to the future of the product and the business, what are some of the plans that you have for the near to medium term?
[00:31:37] Unknown:
So one of the main concepts we are building right now and planning to release soon is data access workflows. For example, if a data analyst queries a dataset he or she has never queried before, Satori can block that query and instead send a Slack message to the user with a link where he or she can submit a request to access the data from the data owner. That basically takes the work that today is centralized on the data engineering team, with that 30% of time spent granting and revoking access to data, and distributes that load across the organization to the people who have the right context to approve those requests.
And we expect workflows to become a very important part of our story, and not just for us, but for the industry as a whole.
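A sketch of what such a workflow hook could look like, assuming a Slack incoming-webhook URL and an access-request portal URL that are both placeholders; this is not Satori's API, just an illustration of the block-and-notify flow:

```python
# Illustrative flow: block a first-time query against a dataset and notify the
# requester on Slack with a link to request access from the data owner.
import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"  # placeholder
ACCESS_PORTAL_URL = "https://access.example.com/requests/new"          # placeholder

def handle_query(user_email, dataset, previously_accessed):
    """Return True if the query may proceed, False if it was blocked."""
    if dataset in previously_accessed.get(user_email, set()):
        return True  # the user already has a history with this dataset
    # Block the query and start the access-request workflow instead.
    requests.post(SLACK_WEBHOOK_URL, json={
        "text": (f"Your query against `{dataset}` was blocked. "
                 f"Request access from the data owner here: "
                 f"{ACCESS_PORTAL_URL}?dataset={dataset}&user={user_email}")
    }, timeout=5)
    return False

history = {"analyst@example.com": {"sales.orders"}}
print(handle_query("analyst@example.com", "billing.customers", history))  # False -> blocked
```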
[00:32:26] Unknown:
Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I would like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:32:40] Unknown:
So I think the biggest gap used to be how to process, store, and query huge datasets. That is largely solved today. The next area of focus for the industry, and this is where Satori is focused, is how to use all that data in a way that is secure, compliant, responsible, and ethical. I believe there is still a lot of room for new and innovative ways to overcome these challenges, and we'll see a lot of these innovations coming out in the next few years.
[00:33:10] Unknown:
Well, thank you very much for taking the time today to join me and share the work that you're doing at Satori. It's definitely a very interesting approach to the problem of data governance and access control, which is, as we've discussed, a very important area and an area that still has a lot of room for innovation. So thank you for all the time and effort you're putting into that, and I hope you enjoy the rest of your day. You too. Thank you for having me, Tobias.
[00:33:38] Unknown:
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Episode Overview
Guest Introduction: Yoav Cohen and Satori
Building Satori: Challenges and Solutions
Data Governance and Industry Practices
Satori's Approach to Data Masking and Use Cases
Satori's Architecture and Deployment
Proxy Approach vs. Other Systems
Cybersecurity Influence on Satori's Design
Satori's Role in Data Governance Strategy
Challenges in Data Governance Evolution
Implementing Satori: Steps and Integration
Innovative Uses of Satori
Lessons Learned in Building Satori
Future Plans for Satori
Biggest Gaps in Data Management Tooling
Closing Remarks