Summary
Encryption and security are critical elements in data analytics and machine learning applications. We have well developed protocols and practices around data that is at rest and in motion, but security around data in use is still severely lacking. Recognizing this shortcoming, and the capabilities that could be unlocked by a robust solution, Rishabh Poddar helped to create Opaque Systems as an outgrowth of his PhD studies. In this episode he shares the work that he and his team have done to simplify integration of secure enclaves and trusted computing environments into analytical workflows and how you can start using it without re-engineering your existing systems.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show!
- Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.
- RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudder
- Build Data Pipelines. Not DAGs. That’s the spirit behind Upsolver SQLake, a new self-service data pipeline platform that lets you build batch and streaming pipelines without falling into the black hole of DAG-based orchestration. All you do is write a query in SQL to declare your transformation, and SQLake will turn it into a continuous pipeline that scales to petabytes and delivers up to the minute fresh data. SQLake supports a broad set of transformations, including high-cardinality joins, aggregations, upserts and window operations. Output data can be streamed into a data lake for query engines like Presto, Trino or Spark SQL, a data warehouse like Snowflake or Redshift, or any other destination you choose. Pricing for SQLake is simple. You pay $99 per terabyte ingested into your data lake using SQLake, and run unlimited transformation pipelines for free. That way data engineers and data users can process to their heart’s content without worrying about their cloud bill. For data engineering podcast listeners, we’re offering a 30 day trial with unlimited data, so go to dataengineeringpodcast.com/upsolver today and see for yourself how to avoid DAG hell.
- Your host is Tobias Macey and today I'm interviewing Rishabh Poddar about his work at Opaque Systems to enable secure analysis and machine learning on encrypted data
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what you are building at Opaque Systems and the story behind it?
- What are the core problems related to security/privacy in data analytics and ML that organizations are struggling with?
- What do you see as the balance of internal vs. cross-organization applications for the solutions you are creating?
- Comparison with homomorphic encryption
- Validation and ongoing testing of security/privacy guarantees
- Performance impact of encryption overhead and how to mitigate it
- UX aspects of not being able to view the underlying data
- Risks of information leakage from schema/meta information
- Can you describe how the Opaque Systems platform is implemented?
- How have the design and scope of the product changed since you started working on it?
- Can you describe a typical workflow for a team or teams building an analytical process or ML project with your platform?
- What are some of the constraints in terms of data format/volume/variety that are introduced by working with it in the Opaque platform?
- How are you approaching the balance of maintaining the MC2 project against the product needs of the Opaque platform?
- What are the most interesting, innovative, or unexpected ways that you have seen the Opaque platform used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Opaque Systems/MC2?
- When is Opaque the wrong choice?
- What do you have planned for the future of the Opaque platform?
Contact Info
- Website
- @Podcastinator on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
- Opaque Systems
- UC Berkeley RISE Lab
- TLS
- MC²
- Homomorphic Encryption
- Secure Multi-Party Computation
- Secure Enclaves
- Differential Privacy
- Data Obfuscation
- AES == Advanced Encryption Standard
- Intel SGX (Software Guard Extensions)
- Intel TDX (Trust Domain Extensions)
- TPC-H Benchmark
- Spark
- Trino
- PyTorch
- Tensorflow
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- Upsolver: ![Upsolver](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/aHJGV1kt.png) Build Real-Time Pipelines. Not Endless DAGs! Creating real-time ETL pipelines is extremely time-consuming and engineering intensive. Why? Because when we attempt to shoehorn a 30-year old batch process into a real-time pipeline, we create an orchestration hell that makes every pipeline a data engineering project. Every pipeline is composed of transformation logic (the what) and orchestration (the how). If you run daily batches, orchestration is simple and there’s plenty of time to recover from failures. However, real-time pipelines with per-hour or per-minute batches make orchestration intricate and data engineers find themselves burdened with building Directed Acyclic Graphs (DAGs), in tools like Apache Airflow, with 10s to 100s of steps intended to address all success and failure modes, task dependencies and maintain temporary data copies. Ori Rafael, CEO and co-founder of Upsolver, will unpack this problem that bottlenecks real-time analytics delivery, and describe a new approach that completely eliminates the need for orchestration, so you can remove Airflow from your development critical path and deliver reliable production pipelines quickly. Go to [dataengineeringpodcast.com/upsolver](dataengineeringpodcast.com/upsolver) to start your 30 day trial with unlimited data, and see for yourself how to avoid DAG hell.
- Rudderstack: ![Rudderstack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png) RudderStack provides all your customer data pipelines in one platform. You can collect, transform, and route data across your entire stack with its event streaming, ETL, and reverse ETL pipelines. RudderStack’s warehouse-first approach means it does not store sensitive information, and it allows you to leverage your existing data warehouse/data lake infrastructure to build a single source of truth for every team. RudderStack also supports real-time use cases. You can Implement RudderStack SDKs once, then automatically send events to your warehouse and 150+ business tools, and you’ll never have to worry about API changes again. Visit [dataengineeringpodcast.com/rudderstack](https://www.dataengineeringpodcast.com/rudderstack) to sign up for free today, and snag a free T-Shirt just for being a Data Engineering Podcast listener.
- Datafold: ![Datafold](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/zm6x2tFu.png) Datafold helps you deal with data quality in your pull request. It provides automated regression testing throughout your schema and pipelines so you can address quality issues before they affect production. No more shipping and praying, you can now know exactly what will change in your database ahead of time. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI, so in a few minutes you can get from 0 to automated testing of your analytical code. Visit our site at [dataengineeringpodcast.com/datafold](https://www.dataengineeringpodcast.com/datafold) today to book a demo with Datafold.
- Linode: ![Linode](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/W29GS9Zw.jpg) Your data platform needs to be scalable, fault tolerant, and performant, which means that you need the same from your cloud provider. Linode has been powering production systems for over 17 years, and now they’ve launched a fully managed Kubernetes platform. With the combined power of the Kubernetes engine for flexible and scalable deployments, and features like dedicated CPU instances, GPU instances, and object storage you’ve got everything you need to build a bulletproof data pipeline. If you go to: [dataengineeringpodcast.com/linode](https://www.dataengineeringpodcast.com/linode) today you’ll even get a $100 credit to use on building your own cluster, or object storage, or reliable backups, or… And while you’re there don’t forget to thank them for being a long-time supporter of the Data Engineering Podcast!
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends at Linode. With their new managed database service, you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes with automated backups, 40 gigabit connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show.
Legacy CDPs charge you a premium to keep your data in a black box. RudderStack builds your CDP on top of your data warehouse, giving you a more secure and cost effective solution. Plus, it gives you more technical controls so you can fully unlock the power of your customer data. Visit rudderstack.com/legacy to take control of your customer data today. Your host is Tobias Macey, and today I'm interviewing Rishabh Poddar about his work at Opaque Systems to enable secure analysis and machine learning on encrypted data. So, Rishabh, can you start by introducing yourself?
[00:01:21] Unknown:
Absolutely. Thanks for having me here today, Tobias. It's a pleasure. I'm Rishabh. I'm the CEO and cofounder of Opaque Systems. We are a startup born out of research and open source at UC Berkeley. And in a nutshell, we provide a confidential computing platform for collaborative analytics and AI at scale.
[00:01:40] Unknown:
And do you remember how you first got started working in data?
[00:01:43] Unknown:
Yes. Absolutely. So I have always been passionate about data privacy and security, starting from my undergrad days. Starting from research in undergrad, then coming to Berkeley Computer Science for my PhD, I always knew that cryptography and system security was something that I was interested in, and especially the key question that we were looking to answer as part of our research. At UC Berkeley, I met my PhD adviser, Raluca Ada Popa. She's a world renowned cryptographer and system security expert. And the problem that we wanted to address was: how do you enable computation on confidential data while keeping the data confidential?
Right now, if we take a step back and look at the state of data protection that exists in the world today, as an industry, we have solutions that can protect or encrypt data at rest. When it's stored on disk or in the cloud, you can encrypt it using standard mechanisms. We also know how to encrypt data in transit, when it's being sent over the network from a source to a destination. We can encrypt it and protect it, once again, using standard encryption protocols like TLS and HTTPS. What we don't have widely deployed solutions for today is encryption or protection for data in use.
Right now, when data needs to be processed by software on a machine, it needs to be unencrypted. And this makes it a point of vulnerability and failure. It makes it susceptible to attackers, bad actors, and so forth. So the larger question was: how do you enable insights and computation from confidential data? How do you activate confidential data without compromising its confidentiality? And this was the broader problem that we were looking to address. As part of my PhD at Berkeley, we were part of this lab called the RISE Lab, which has been a hotbed for a lot of successful research and open source, at the risk of sounding immodest. Our work was funded by government grants, but also tech firms like Google, Microsoft, Facebook, and Amazon, as well as more traditional Fortune 500 type companies. And the work we did was always informed by close partnership with industry, on problems that we saw would be super important 5 years down the line. In conversations with them, we realized that a key pain point here was the inability to share confidential data, even within organizations, let alone outside of organization boundaries.
So as part of that, we built the MC² open source project that we are now productizing at Opaque.
[00:04:14] Unknown:
So as far as the Opaque Systems platform, I'm wondering if you can give a bit more detail about what it is that you're building and putting on offer to address this challenge of working with confidential data in a manner that is safe and privacy respecting?
[00:04:33] Unknown:
The key problem is the inability to share confidential data, and the key requirement here is: can we keep the data protected and encrypted while it's being used? If we could do that, then we could share this confidential data with different teams within my organization, or share this data with other entities within my business ecosystem. And if I could do that, then I could collaborate with other data owners, jointly run analytics, or jointly train models on the collective combined data while keeping it protected. So no one gets to see it. Not the other data owners, not even Opaque, not even the cloud platform where Opaque might be running. Throughout the life cycle of the computation, the data remains protected and encrypted, but you can still collaborate on it. You can still run big data analytics. You can still train models on the combined data towards some mutually beneficial end. For example, we see a lot of excitement and urgency in use cases in finance and health care and advertising technologies, simply by enabling multiple data owners to collaborate on their confidential data assets.
What we provide is a platform that facilitates this data collaboration for analytics and AI while keeping the data confidential. So as a data scientist or a data analyst, you should not have to be an expert in the underlying confidential computing technology. You should still be able to run the same workflows that you are currently used to for getting insights from the confidential data while making use of that confidential data. So you can activate your confidential data as easily as if it were regular vanilla data, while still maintaining the confidentiality protections and requirements. Really making it frictionless for data scientists and data analysts to collaborate on confidential data.
[00:06:18] Unknown:
In terms of the practical implementation of this, I'm wondering if you can provide some detail and nuance about how what you're building with MC² and Opaque Systems compares with the topic of homomorphic encryption, which I know saw a lot of popularity and hype a few years ago and was always, you know, about to be put into production, but never quite made it there.
[00:06:40] Unknown:
That's a very, very good question, Tobias. In addition to homomorphic encryption, there are other protocols as well, like secure multiparty computation. These are beautiful cryptographic protocols that use purely software based, mathematical cryptographic approaches that at once allow you to keep the data encrypted while still running programs or operations on it. Part of my research was on those topics as well, to be honest. Again, beautiful technologies; my PhD thesis was on that topic. My cofounder and PhD adviser, she is an expert in the space too. But one thing that became very clear to us, and is also a reason as to why these technologies haven't seen the kind of proliferation that we had hoped for in the last decade, was that they are still far too resource intensive.
Computations that should take seconds or minutes can take hours or days, depending on the nature of the computation. They are typically orders of magnitude slower than regular computation, which makes them rather unsuitable, in my opinion, for the kinds of workloads that we're looking to address. Users of this technology, people who want to collaborate on confidential data, need their systems or solutions to be fast and performant and scalable. And we're not there yet with technologies like homomorphic encryption or secure multiparty computation. So instead, the approach we took was to base Opaque on confidential computing technology.
It's different from a purely cryptographic, software based approach like homomorphic encryption. It's rooted in secure and trusted hardware that was pioneered by Intel in the last decade and only became available in the clouds in the last couple of years. What this technology allows you to do, at a very high level, is essentially create a trusted execution environment, sort of like a secure black box within the CPU hardware itself. You can take security critical pieces of code and data and put them inside this black box, and the hardware ensures that no software outside this trusted execution environment, not even privileged software like the operating system or the hypervisor, or an attacker who breaks in and gains root access, or system administrators, no one can penetrate this black box and look inside.
The best that they can do is look at memory, but the hardware ensures that the memory is always encrypted. And the only way to get access to unencrypted data, practically, is to physically attack the CPU chip. But then you'll end up destroying the processor, and you need physical access as well. So this is a very powerful, revolutionary paradigm in my opinion. And what we do at Opaque is provide the software ecosystem that can power this hardware capability for analytics and machine learning workloads. Fun fact here: Intel actually gave us access to this hardware in the lab at Berkeley even before it was commercially available in the clouds. So that really allowed us to spearhead the development of frameworks and drive adoption and disseminate the technology as well.
[00:09:48] Unknown:
As far as the application of these secure enclaves and secure computation in the data analytics and ML ecosystem, what are some of the core problems that you see organizations struggling with where this is the solution that they're looking for?
[00:10:05] Unknown:
So there are two parts to it, really. You can use this technology, at its core, in a variety of ways. One, with the adoption of confidential computing, you can now accelerate your digital transformation. Lots of organizations have confidential data locked down on premises. Right? They can't share this data across teams even within the same organization, across lines of business within banks, for example, let alone share this data outside organization boundaries or with other entities, or even move it to the cloud. With the adoption of this technology, you can now accelerate that transformation, your migration to the cloud as well. Because now you can keep your data protected at all times on the cloud, without having to necessarily trust the platforms or software running on the cloud, by running whatever software and applications you have within confidential computing environments.
So really moving away from institutional trust to programmatic trust. Second, it enables data collaboration as well, which really unlocks many use cases. Things that organizations have struggled to achieve become possible as a result. Because multiple data owners can now each individually encrypt their data, pool it together in the cloud, combine it in encrypted form, and then jointly analyze it or jointly train models on it towards some mutually beneficial aim. And this is a particularly powerful paradigm that we see requirements for across industries.
For example, banks can collaborate towards identifying human traffickers or money launderers. Health care institutions can collaborate and share data to train better disease prediction models and run better patient profiling. In the advertising world, publishers and advertisers can combine their datasets to identify common audiences or user behavior. A rich variety of analytics and machine learning based use cases on confidential data that were not possible before now become available to organizations as well. And third, all of this happens in a way that makes it easy to comply with privacy laws and regulations as well. Until the last decade, privacy and security have kind of been an afterthought for most organizations.
But now we are seeing the emergence of GDPR, and the world is following suit with newer privacy laws and regulations as well, which are increasingly controlling how confidential data can be used by third parties and software processors. And this technology really makes it easier to comply with those laws and regulations while still enabling insights from the data. In fact, I would argue that you get better utility from your data as a result, because now you can use datasets that you weren't able to use before because of confidentiality restrictions.
[00:12:57] Unknown:
In terms of the application, some of the other techniques that come to mind as we're talking are things like differential privacy or data obfuscation. And also, it brings up the question of, if the data is in the secure enclave, what are some of the ways that you protect against data exfiltration or some of these re-identification attacks that these kinds of obfuscated datasets are subject to, and just some of the broader space beyond just encryption, of these data security questions.
[00:13:27] Unknown:
You've actually hit upon a key point here, and this problem becomes more pronounced in the context of data collaboration as well. Because if you are the only one using your dataset, sure, you can enforce controls in a more reliable, governed way. But if I am collaborating with you, you should not be allowed to do whatever you want with my data. You should only be allowed to do what I permit you to do with my data. So the ability to collaborate on data in some sense exacerbates this. It opens up new challenges around governance and policy enforcement as well.
To answer your question, in the absence of this technology, our industry has relied on approaches like obfuscation, data masking, tokenization, and anonymization to enforce controls on the data. The problem with these approaches on their own is, one, there has been a lot of research that has shown that they're not really secure, because tokenization or data masking can be reversed. If you have access to auxiliary datasets, if you have more fields available to you that are not masked or not tokenized, then you can learn information about the underlying data. Also, to be able to make use of the tokenized or masked data, what these approaches typically do is map the data, the confidential fields, to deterministic values.
And because they're deterministic, you can now maybe join two datasets on the tokenized field. But because it's deterministic, it's also insecure, because if I know what the tokens to my confidential data map to, then I also know what the tokens to your confidential data map to. So standard approaches like obfuscation or masking don't really quite work. They're not really secure. What you really need is randomized encryption. But the problem with randomized encryption has been that, well, because it's randomized, you now can't combine these two datasets together, because your fields map to some random value and my fields map to some different random value. How do I bridge that gap? And this is where confidential computing comes in and sort of allows you to get the best of both worlds. In fact, you can now also add additional data fields that you previously wouldn't have wanted to share as part of your dataset, because everything remains encrypted by default. But you can still combine datasets together. You can still run operations on them and get insights from them.
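As a rough illustration of the deterministic versus randomized distinction described here, consider this minimal Python sketch; the helper names, the use of SHA-256 for tokenization, and AES-GCM for randomized encryption are illustrative assumptions, not a description of Opaque's implementation:

```python
import hashlib
import os

from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def tokenize(value: str, salt: bytes = b"shared-salt") -> str:
    # Deterministic: the same input always yields the same token, so two
    # datasets can be joined on it, but anyone who can guess inputs (or who
    # holds their own tokens) can recover the mapping.
    return hashlib.sha256(salt + value.encode()).hexdigest()

key = AESGCM.generate_key(bit_length=256)
aesgcm = AESGCM(key)

def encrypt(value: str) -> bytes:
    # Randomized: a fresh nonce per record means equal plaintexts produce
    # different ciphertexts, so nothing leaks from the ciphertext itself,
    # but you also can't join on it outside a trusted environment.
    nonce = os.urandom(12)
    return nonce + aesgcm.encrypt(nonce, value.encode(), None)

print(tokenize("alice@example.com") == tokenize("alice@example.com"))  # True: joinable, but leaky
print(encrypt("alice@example.com") == encrypt("alice@example.com"))    # False: secure, but not joinable
```

A secure enclave resolves this tension by decrypting the randomized ciphertexts only inside the trusted hardware boundary, where the join can be computed without exposing plaintext to the host.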
Key to all of this, as far as collaboration is concerned, is the ability to enforce policies. Because again, you should not be allowed to do whatever you want with my data. If you run a SQL query, for example, that says select star, it truly gives you access to all of my data, completely violating the guarantee that you sought in the first place. So one key value that our platform also provides is the ability to enforce policies around who is allowed to do what with the data. And this goes beyond traditional mechanisms of policy control, like role based policies or data access policies. You can now also specify policies around how the data can be used and what results you're allowed to see. So that is a key part of it as well. You touched upon differential privacy as well, and differential privacy is also a very exciting technology that in some sense is complementary to confidential computing, and even to technologies like homomorphic encryption and multi-party computation.
All of these different technologies fall under the privacy enhancing technologies umbrella. And I think as an industry, we need to disentangle the properties that these different approaches provide, because some of them are alternatives to each other, but some work in a complementary fashion. Differential privacy is complementary because what differential privacy does, basically, is prevent leakage from the results of the analysis that you're doing. At a very high level, the way that works is it allows you to add some noise, some mathematically computed noise, to the results of your aggregate analysis. So for example, if you want to learn the average age of everyone in your dataset, then instead of giving you the exact average, differentially private solutions will add some noise to it so that you get a noisy average.
And the key property that this provides is that, as a result, you, the analyst, are not able to pinpoint whether a particular data item exists in the dataset or not. You won't be able to tell whether your information is in the dataset or not, or whether my information is in the dataset or not, because of the addition of this noise. What it does not provide is protection or confidentiality for the data while the analysis is being run on top of it. The computation happens on regular, unencrypted data. It's only that the recipient of the results gets noisy information.
You could run differentially private solutions within confidential computing as well, getting protection for data in use while also preventing leakage from the results of the analysis. So in that sense, these two technologies are complementary, I would say.
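To make the noisy average idea concrete, here is a minimal Laplace-mechanism sketch in Python; the epsilon value, the clipping bounds, and the simplified sensitivity calculation are illustrative assumptions rather than a recipe from any particular product:

```python
import numpy as np

def dp_average(values, lower=0.0, upper=100.0, epsilon=1.0, rng=None):
    """Differentially private mean via the Laplace mechanism (toy example)."""
    rng = rng or np.random.default_rng()
    values = np.clip(np.asarray(values, dtype=float), lower, upper)  # bound each record's influence
    sensitivity = (upper - lower) / len(values)  # max change in the mean from altering one record
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return values.mean() + noise

ages = [34, 29, 41, 52, 47, 38]
print(dp_average(ages))  # a noisy average: the analyst can't tell whether any one person is included
```

Run inside an enclave, the same computation would also keep the raw values encrypted in memory while it executes, which is the complementary pairing described above.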
[00:18:30] Unknown:
A lot of interesting stuff to dig into there. Before we get too far into the weeds on the technical implementation of it, some of the other interesting aspects of applying the solution are around the performance impact that has typically been associated with managing the encryption of data, and then also, to your point of being able to define and apply policies on what operations are allowed on a given dataset, particularly in the context of a collaborative data agreement where maybe multiple organizations have different datasets that they want to be able to combine together, the question of who is empowered to define what those policies are and how they are enforced. And what does that negotiation process look like, and what are some of the technical controls that are available for them to be able to compose together different constraints that they want to apply?
[00:19:24] Unknown:
These are real issues that we deal with, and I think more work needs to be done to provide solutions that are as user friendly as possible. I think we've come a long way, but more work needs to be done. But let's talk about performance and scalability issues first with data encryption, and what the overhead is of adopting technologies like confidential computing on your workload. This was, by the way, a key reason as to why technologies like homomorphic encryption and purely cryptographic approaches are not suitable for large datasets: the extra overhead that comes with working on encrypted data. Traditional standard encryption, like AES based encryption, for example, is actually now very fast, because modern hardware contains special modules and instructions that can execute the encryption operations in the hardware itself as opposed to implementing them purely in software.
So standard encryption as a result is very, very fast. Encryption at rest and encryption in transit both use standardized encryption protocols. The problem comes when you need software based encryption, because that is much slower, and the problem gets exacerbated or compounded with specialized cryptographic protocols, because intuitively you need to maintain some structure within the data itself while also obfuscating that structure. And this literally leads to a blow up of the ciphertext, the underlying data. And that, in some sense, is the problem with purely cryptographic approaches like homomorphic encryption or secure multiparty computation: they blow up the amount of resources that you need, and they blow up the amount of time it takes to compute on the data.
With confidential computing, the good thing is that this encryption is still happening in hardware. So when data is in memory, it's encrypted. When data is loaded inside the secure enclave in the CPU, only there is it decrypted, at that boundary. But this encryption and decryption happens in hardware itself. When the data is moved back from the CPU die to memory, it's reencrypted again. So, yes, there is overhead. The overhead is determined by how frequently data needs to be moved in and out of the environment. And solutions that power confidential computing or secure enclave analysis need to be cognizant of this overhead. They need to be architected in a way that minimizes this flow of data in and out of the enclave. They need to be designed in a way that optimizes the data movements and is aware of the architecture of confidential computing.
Also, as far as Intel's version of confidential computing is concerned, the initial versions of Intel SGX had limited memory available to the enclave. It was restricted to a little over 100 megabytes. That further compounded the problem, because the less memory you have, the more frequent this data movement in and out of that restricted memory is, which also compounded the overhead that was incurred by confidential computing based solutions. That second problem has gone away, because newer generation Intel machines have several gigabytes of memory available to the enclave to process its workloads. So that part of the overhead has gone away. We also now have confidential virtual machines, where the entire virtual machine is effectively an enclave. AMD's SEV technology provides a version of confidential virtual machines, and Intel has also announced the TDX solution, which also provides confidential virtual machines. In some sense, these confidential VM approaches provide slightly weaker security guarantees, because now the operating system and other software are also included within the trusted computing base. But on the flip side, they're much more flexible.
You can run arbitrary programs effectively inside the confidential VM while taking advantage of the entire memory at your disposal, also making these workloads much, much faster. But the upshot of all of this is, yes, with confidential computing, you do have performance overhead, and solutions built on confidential computing need to be aware of the architecture so as to minimize this overhead. In our benchmarks, we ran Opaque against the TPC-H benchmark, which is an industry standard benchmark for SQL based analytics, and we found the overhead to range from a few percentage points to a few tens of percent, making it much, much more tractable and practical today.
So performance has not been an issue for us as far as customers are concerned. Yes, there is some overhead, but the overhead is a minimal, small price to pay, unlike purely cryptographic approaches where the overhead is on the order of tens of thousands or even hundreds of thousands. So that is on the topic of scalability and performance. The second part of what you mentioned was about policy enforcement and governance and who is responsible for that. That is a very good question, and the answer to that, I suppose, depends on the specific use case and the organization itself.
Fundamentally, what we need to guarantee is that data owners permit the computation that runs on their data. Now, the policy enforcement mechanism can be very flexible because, essentially, you can run whatever you want inside the confidential computing environment. On the one hand, you can have policy whitelists, which essentially is a whitelist of the queries or scripts that you are allowed to run. So let's say you want to process my data using some script: you can submit that script for my approval. If I approve it, then you're allowed to execute that script on our collective combined data. If we are in a consortium or a collaboration with multiple data owners, then as long as everyone permits that particular script, you're allowed to execute it on that data. You can also specify more generalized policies for structured languages like SQL.
For example, as long as your query is operating on a certain number of data items, or as long as your query is a type of aggregate query that I am fine with, then any query that conforms to those specifications is allowed to run on that data as well. Policy enforcement can be granular as well as generic. You can also specify policies saying you can only run computations on certain data rows that meet a certain criterion, and what that criterion is can be determined by a UDF or an expression of some sort. You can also have policies on columns in a similar spirit. So you can define a rich variety of policies, and what policies make sense would depend on the specific use case. For example, for healthcare, the policies need to adhere to HIPAA requirements, and HIPAA has some requirements around what the results are allowed to reveal and what they're not allowed to reveal. In other domains, the policies may have different sets of requirements. So there's a rich space of policy enforcement that is possible, but the exact policies depend on the specifics of the use case or the business problem at hand and the regulatory regime under which it's covered.
Who controls or who specifies the policies? Fundamentally, it is the data owner. But the data owner can be a separate organization which may want its own controls, or maybe the governance team is responsible for vetting and approving policies, and that's all separated from the roles of the analysts or the data scientists. But that becomes more of an operational question as opposed to a technological limitation or technological question.
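As a toy illustration of the whitelist and aggregate-only policies described above, here is a short Python sketch; a real platform would parse the SQL properly, and the hash set, regexes, and function names here are illustrative assumptions:

```python
import hashlib
import re

# SHA-256 digests of scripts that every data owner has already approved.
APPROVED_SCRIPT_HASHES = {
    "3f7a1c0e...",  # placeholder digest of a previously approved training script
}

def script_is_approved(script_text: str) -> bool:
    """Whitelist check: only byte-for-byte approved scripts may run."""
    return hashlib.sha256(script_text.encode()).hexdigest() in APPROVED_SCRIPT_HASHES

def query_is_aggregate_only(sql: str) -> bool:
    """Generalized policy: reject raw row dumps, require an aggregate in the projection."""
    sql_lower = sql.lower()
    if "select *" in sql_lower:
        return False
    return bool(re.search(r"\b(count|sum|avg|min|max)\s*\(", sql_lower))

print(query_is_aggregate_only("SELECT AVG(age), region FROM patients GROUP BY region"))  # True
print(query_is_aggregate_only("SELECT * FROM patients"))                                  # False
```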
[00:26:47] Unknown:
Build data pipelines, not DAGs. That's the spirit behind Upsolver SQL Lake, a new self-service data pipeline platform that lets you build batch and streaming pipelines without falling into the black hole of DAG based orchestration. All you do is write a query in SQL to declare your transformation, and SQL Lake will turn it into a continuous pipeline that scales to petabytes and delivers up to the minute fresh data. SQL Lake supports a broad set of transformations, including high cardinality joins, aggregations, upserts, and window operations.
Output can be streamed into a data lake for query engines like Presto, Trino, or Spark SQL, a data warehouse like Snowflake or Redshift, or any other destination you choose. Pricing for SQL Lake is simple. You pay $99 per terabyte ingested into your data lake using SQL Lake and run unlimited transformation pipelines for free. That way, data engineers and data users can process to their heart's content without worrying about their cloud bill. Go to dataengineeringpodcast.com/upsolver today to get a 30 day trial with unlimited data and see for yourself how to untangle your DAGs. To your point of, when you mentioned languages like SQL, that also brings up the question of the ways that Opaque Systems integrates into the overall data platform of an organization. So is it presented as a data warehouse? Is it that you manage orchestration of these secure enclaves for applying to the actual compute operations that are being run on your Spark cluster or your Trino cluster or your Snowflake environment? And just some of the overall platform integration and data modeling questions that go along with how to incorporate your confidential computing capabilities into operations that are already present in an organization. Yes.
[00:28:40] Unknown:
The ability to integrate with existing analytics and AI workflows is key for ease of use and for making it frictionless for data scientists and data analysts. Right? They should be able to use the same scripts or the same workflows that they're currently used to without having to be an expert in a new framework or a new technology, or having to learn a new language that works with confidential computing. That is the gold standard. The way it's architected right now is we provide a data plane that allows you to run big data frameworks within confidential computing environments. So for example, one part of our solution is for Intel SGX secure enclaves.
Now, Intel SGX enclaves require the application, which is the big data framework, be it Spark or something else like PyTorch or TensorFlow, to be architected against the enclave's APIs. An analogy here is, if you want to make use of GPUs for accelerating your machine learning workloads, you need to program those applications against the GPU's interface using CUDA or something else. A similar analogy applies for Intel's enclaves for confidential computing. The application that you want to run inside confidential computing needs to be programmed against those APIs as well. So what we have is our own modified version of Apache Spark, Spark SQL, that speaks the language of the enclave.
Now, it's not that the entire framework is running inside the enclave. No, because that would be a nightmare from a maintenance perspective. It also increases the amount of code that you need to run inside the enclave, which has implications for the amount of code you need to trust and verify, but also for performance. So instead, what we did, and this is available in the MC² open source, is we identified the core operators that need to process the data, the core SQL operators, for example. And only those portions of the framework are running inside the enclave.
The rest of Spark, the part of Spark that does not need to directly see or process the data, runs outside. So for example, the cluster management, the query planning, the distribution framework, all of that can run outside the enclaves. There is a small number of operators that your query is mapped to, and only those need to run inside the enclave. In that sense, it's not like you can take your existing Spark deployment and magically make that secure. What you can do is use Opaque's platform. Opaque can run within your environment. If you have a cloud environment on Azure, for example, Opaque can run in that cloud environment. The only requirement is that whatever environment Opaque is running in, you have physical servers available that have confidential computing capabilities.
Once the platform is running within your environment, we provide a client that you can use to encrypt and upload data to a location of your choice. For example, you can encrypt your data and upload it to Azure Blob Storage. And then you can use the client to communicate with the data platform that's running on the cloud. You can submit jobs, the same Spark jobs in Scala and SQL or using PySpark, and those jobs get executed on the data. The data gets loaded inside the Opaque platform, inside confidential computing enclaves, and is then processed using the script. So from that perspective, you don't need to modify your analytic queries or scripts. You can run those same scripts as before, but now the processing happens within a fixed cluster, within a fixed platform, a fixed version of Spark or other machine learning frameworks that you want to make use of. So in that sense, we can integrate with your existing workflows. All we need to do is be able to pull the data from its source, load it inside Opaque, and then the results can be shared with the analyst directly.
Our ecosystem is growing as far as data platforms and data warehouses are concerned. So, yeah, stay tuned for more, I suppose, on that front.
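To picture the "same scripts as before" point, here is a minimal PySpark sketch of what the analyst-side job could look like; the storage paths, column names, and the assumption that encrypted datasets were already uploaded with the platform's client are all illustrative, not Opaque's actual API:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("joint-audience-overlap").getOrCreate()

# Each data owner has (hypothetically) encrypted and uploaded their dataset,
# e.g. to Azure Blob Storage; inside the platform, enclave-backed operators
# handle decryption, so the query text itself is ordinary Spark SQL.
publisher = spark.read.parquet("abfss://container@account.dfs.core.windows.net/publisher_users")
advertiser = spark.read.parquet("abfss://container@account.dfs.core.windows.net/advertiser_users")

overlap = (
    publisher.join(advertiser, on="user_id", how="inner")  # join runs on protected data
             .groupBy("region")
             .agg(F.countDistinct("user_id").alias("shared_users"))
)
overlap.show()  # only the aggregate result is revealed to the analyst
```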
[00:32:41] Unknown:
Digging a bit deeper now on the opaque systems platform itself, I'm wondering if you can talk to some of the architecture and design aspects of how you're thinking about building it so that it is a frictionless experience for teams to be able to integrate it into their existing run times. And some of the ways that the design and scope of the product have changed since you first started working on it and have been working with some of your early customers to figure out what are the actual challenges that they're trying to overcome and some of the sharp edges that they run up against from the kind of initial formulation of the solution.
[00:33:18] Unknown:
Absolutely. And, I mean, the product has undergone significant refinements compared to our open source days. When we started working on the open source at Berkeley on MC², the design of the open source and the research was informed by discussions with industry partners. For example, we were already doing POCs with many of our sponsors and partners and collaborators in the lab. And that laid the foundations for the kinds of capabilities that they wanted from the platform. What kinds of analytics did they want to run? What kinds of machine learning did they want to run? Something that surprised me back in the day was that, a lot of the time when organizations talked about AI and machine learning,
my first reaction was, okay, they want to be able to execute deep neural networks and things like that on their confidential datasets. But a lot of the time, organizations want predictability and explainability. So decision trees and regression models were the kinds of tools and capabilities that they wanted. So what we decided to support in the open source was informed by some of those discussions. As we worked more closely with partners, we realized that merely protecting data and keeping it encrypted is not enough. You now need support for policies, because without policies, everything is useless.
And so the whole policy enforcement space is something that's evolved over time as we worked with early customers and early adopters. Depending on the sector, the vertical you're in, the kinds of capabilities needed are different. So for example, in the ad tech space, simple SQL based analytics is enough. But in many financial or healthcare use cases, people want more sophisticated capabilities. People want to be able to support their own machine learning pipelines. They don't want to be restricted to a certain set of libraries. And that has a bearing on the underlying architecture and the choice of confidential computing frameworks.
What that means is perhaps Intel SGX in those cases is insufficient, because it is not as flexible as confidential VM based approaches. With SGX, the framework needs to be programmed against the enclave APIs, whereas with confidential VMs, you don't have that requirement. You can run frameworks of your choice within the environment, but you still need the rest of the ecosystem of tools. You still need to be able to work with and decrypt files or datasets encrypted under different keys. You still need to be able to integrate with key management solutions. How do you make this distributed?
How do you ensure that when it's distributed, the communication that's taking place between machines is also protected and secure? A key requirement that opens up is around attestation: how do I verify that the environment has been securely set up and that I am actually using confidential computing machines and not regular machines? So all of this tooling that's required for an enterprise ready solution comprises aspects and capabilities that evolved and grew over time in conversations and work with design partners as well. Really, that is what we provide. We provide the entire ecosystem of software that makes it possible to power confidential computing for analytics and machine learning. And our machine learning capabilities are growing and evolving, and we'll have rich support available on that front shortly.
For example, GPU enclaves are coming as well. Azure recently announced the availability of GPUs as part of the confidential computing product suite. Once those become available, then we can offer richer capabilities with higher performance guarantees, because for now, you're still restricted to processing workloads on CPUs only, since that technology doesn't exist publicly just yet. So we're looking forward to further enhancing and broadening the scope of the art of the possible as far as AI and analytics are concerned.
[00:37:13] Unknown:
And in terms of the types of data that are feasible to use in these confidential computing environments, and some of the data modeling considerations that go along with how to think about building your computation, I'm wondering what are some of the constraints that are imposed and some of the ways that you're working to either smooth the imposition of those constraints on existing datasets or open up the degree of constraints so that more types of workloads and more types of data can be processed with these
[00:37:49] Unknown:
tools. This is one key point as far as machine learning in particular is concerned. Because with machine learning, it's not like you magically have a model that can now run on the combined dataset. There is this whole data engineering and data exploration phase that is at odds with data privacy and confidentiality. If I can't see the data, how do I know how to train the model? What fields to use? How to engineer those features? So that is a constraint that any environment that enables collaborative machine learning needs to take into account. As for the best approaches, I think we're learning more, and the industry as a whole is evolving as well. But some ways in which you can do this are: one, you can have an insecure or simulation mode, for example, that allows you to share some data with me. And I can see that data and combine it with my own and do my feature engineering or whatever data exploration I need to do to develop the actual model training scripts that will get deployed in production.
And once I have done that exploration and engineering, then I can flip the switch and start the secure mode, so to speak. And from that point on, whatever data is being joined and combined and whatever models are being trained remain protected, and no one can see them. That is one way to enable constraints while still enabling data exploration and preprocessing and so forth. Another way to do this is to use synthetic data or obfuscated data as part of that insecure data exploration mode. But you're right, this part of the pipeline is something that needs to be articulated, I suppose, more clearly. What may work for one organization may not work for another; for example, they may not have a way of sharing any data at all as part of that exploration phase. So in that case, synthetic data could be a worthwhile approach, or using some sort of obfuscated data as part of that exploration phase could be a worthwhile approach. Another way to do this is to use the policy framework to impose constraints on what the data analyst or data scientist is allowed to do as part of that exploration.
In that case, as long as you approve the kinds of preprocessing operations or the kinds of feature engineering operations on that data, then as long as the scripts are compliant with that policy regime, they can be executed and enforced. But that can also come with a certain amount of friction as far as interactivity is concerned. Right now, the way we are thinking about this is having that insecure mode in which there is no confusion from a UX perspective: it's clear that whatever data I choose to share with you for your data exploration will be visible to you. I can either choose to share some subset of my data, or I can choose to share some fake information that mimics the schema of that data, which allows you to decide how your scripts might be written.
But once that exploration is done, we enable the secure mode, and everything from there on happens only on confidential, protected data.
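For the fake-information-that-mimics-the-schema idea, a toy sketch might look like the following; the schema, field names, and value ranges are invented for illustration:

```python
import random
import string
from datetime import date, timedelta

def fake_row():
    """Generate one synthetic record that matches the agreed schema but contains no real data."""
    return {
        "user_id": "".join(random.choices(string.ascii_lowercase + string.digits, k=12)),
        "age": random.randint(18, 90),
        "zip_code": f"{random.randint(0, 99999):05d}",
        "signup_date": (date(2020, 1, 1) + timedelta(days=random.randint(0, 1000))).isoformat(),
    }

# An analyst can write and debug exploration or feature engineering code against
# this sample, then run the same scripts in the secure mode on the real data.
for row in (fake_row() for _ in range(3)):
    print(row)
```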
[00:41:03] Unknown:
That brings it into the space of managing the preproduction versus production, or CI/CD, aspects of it as well, where, you know, okay, in our nonproduction environment you can do this exploration, you can iterate and build your model, or you can do some exploration of the data to build your analysis. And then once you say, okay, this is ready to go to production, that goes to a different operating environment that has the fully secured runtime enabled, so that you don't have any capability of exploring the data, and then being able to give the parties engaged the controls to say, you know, this is the data for this stage, this is the data for that stage. But then that also brings into question whether or not the data that they're using in the preproduction stage is truly representative of the actual live data, which is a broader question in machine learning and analytics in general. So
[00:41:53] Unknown:
Yes. I agree. I mean, these are all operational challenges that one does need to look at. With regard to the first part, the preproduction versus production environments, you don't necessarily need to use different environments. You could make use of the same platform and the same environment with just the enclave protections turned off, probably. You don't even need to turn the enclave protections off, to be honest, but you may want to turn them off because maybe you get better performance as a result. But you could use the same environment, while just from a user interface perspective I am saying that this is now preproduction.
And therefore, whatever data is shared, there is a way for you to visualize it or see it. I can show it to you through that same environment without you actually having to work in a separate environment. It's just that once you turn on the production switch, then that capability for me to look at your data goes away, and the confidential computing environment enforces that: if you try to look at my data, you will only see encrypted data, or you will not be able to see anything at all. So that aspect of it does not need to be a distinction between the environments for preproduction and production. What was the second part of your question?
[00:43:06] Unknown:
Challenges of the kind of testing dataset being truly representative of the live dataset.
[00:43:12] Unknown:
That is a challenge with synthetic data, or with any fake data that you create. That challenge does not entirely go away. To be honest, that challenge mainly exists if you and I have datasets with non overlapping fields. If you have certain fields that I don't have in my dataset, then for me to do any kind of training, I need to be able to know or see what those fields are, and that's when the problem manifests. But if the dataset is horizontally partitioned, for example, and the schemas are the same, then maybe I could train the model on my dataset and then refine it iteratively once it's combined with your training sets as well. So depending on the orientation of the data, depending on who holds what pieces of data, the problem may or may not manifest.
Another option is to use an orchestrator in some settings. For example, in some deployments, banks who want to work with each other find it operationally much easier to work with a single entity who acts as an orchestrator or facilitator of the consortium. That entity provides the data science and data analytics expertise, and the data owners, which are the banks in this case, can bring their own data to it. So you may have looser restrictions around what the data orchestrator can do or see as far as data is concerned, as opposed to other members of the consortium or collaboration. So there are a few ways in which this can be architected. This problem is something that customers or adopters need to be aware of as far as machine learning and AI go. It's less of a concern as far as analytics is concerned, in my experience, because as long as you know the schema of the dataset, that in and of itself is often sufficient for you to get the kinds of insights that you want. For example, you may want to do some aggregate analytics to identify how many users we have in common and what their purchase patterns are or what their demographic distribution is. And for this, you don't necessarily need to look at the data itself. You only need to be aware of the schema of the data, and at that point, it's sufficient. So it's not so much of a concern as far as analytics and query processing go. But, yes, absolutely.
Around machine learning and AI, this is a broader challenge.
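To illustrate the point that this kind of analytics can be written from the schema alone, here is a minimal sketch assuming two tables that share a user_id column; plain pandas is used purely for illustration, since in an enclave deployment the rows would stay encrypted and the analyst would never inspect them:

```python
# Minimal sketch: an aggregate written knowing only the schema
# (user_id, spend) and (user_id, age_band) of two parties' tables.
# The query itself never requires the analyst to look at raw rows.
import pandas as pd

bank_a = pd.DataFrame({"user_id": [1, 2, 3], "spend": [100, 250, 80]})
bank_b = pd.DataFrame({"user_id": [2, 3, 4], "age_band": ["25-34", "35-44", "25-34"]})

common = bank_a.merge(bank_b, on="user_id")                      # users both parties hold
summary = common.groupby("age_band")["spend"].agg(["count", "mean"])
print(summary)
```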
[00:45:32] Unknown:
Yeah. There are a number of other avenues that I think would be fun to explore more deeply, one being the validity of schemas matching but having different underlying semantics of what that data means. That's a whole other problem that is kind of outside of your control or concern, but interesting to explore nonetheless when you are engaging with multiple parties bringing their own data.
[00:45:54] Unknown:
One simple example here: in one case, one particular data owner had ZIP codes that were 5 digits, and the other data owner had ZIP codes that were 5 digits followed by a hyphen followed by something else. Exactly. Those are all problems that do crop up. So beyond just knowing the schema, you also need to know the format. You're absolutely right.
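A minimal sketch of the kind of format reconciliation being described, assuming a column named zip (the column name and values are invented for illustration):

```python
# Minimal sketch: reconciling two ZIP code formats (e.g. "94720"
# vs "94720-1234") so datasets from different owners join cleanly.
import pandas as pd

def normalize_zip(value: str) -> str:
    """Keep only the 5-digit prefix and restore any dropped leading zeros."""
    return str(value).split("-")[0].zfill(5)

df = pd.DataFrame({"zip": ["94720", "94720-1234", "2139"]})
df["zip"] = df["zip"].map(normalize_zip)
print(df)  # 94720, 94720, 02139
```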
[00:46:16] Unknown:
Yes. Or if you have an integer field, but those integers are just categorical, not actually pure integers, and, you know, what does that integer map to for a given category? So there are lots of ways it can go weird. Another interesting aspect is what you were saying earlier from a policy enforcement perspective: okay, I need you to provide me with the script or the code that you want to execute against my data. Then there's the question of what level of complexity is acceptable, because I don't necessarily have time to read through 10,000 lines of code to make sure that it's doing what I think it's going to be doing. And so then you're into static analysis and validation and whole other categories of problems. And then another aspect, from a performance and data modeling and tuning perspective: because you're operating in these secure enclaves, you want to try to optimize for chunks of data that fit within the CPU cache so they don't have to get shuffled out to memory too many times. So I'm wondering what are some of the ways that you're able to help people with managing that data segmentation question, where up to a certain point everything is able to fit into the cache on the die, but as soon as a record goes from, say, 52 to 58 characters, all of a sudden we have to swap to memory every time we want to try to match these two values together.
[00:47:39] Unknown:
That's a deeper architectural question, to be honest. The approach that we have taken on that particular front is to abstract that away from the user, so the user does not have to worry about data segmentation and data sizes. We have architected the platform, for Spark for example, in a way that intuitively streams the data structures through the CPU, and the algorithms have been developed in a way that is best suited to this model of computation. But the user should not have to worry about it. Who knows? Maybe in the future we'll need to expose this capability to the user as well. We'll see. This problem does go away with confidential VMs to some degree, because all of memory is available to you, so the model of computation you have is the same as a regular vanilla workload.
But, yeah, these are all deeper architectural questions that we did have to grapple with in order to design a platform that is as efficient and as fast as possible. Because if it's not fast, then people are not going to get value out of it.
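As a generic illustration of the streaming model of computation being described, where data is processed in bounded-size chunks so the working set stays small, here is a minimal sketch of the general pattern; it is not Opaque's internal implementation, and the chunk size is an arbitrary assumption:

```python
# Minimal sketch: compute a running aggregate one bounded-size chunk
# at a time, so no step requires the whole dataset to be resident
# (the pattern a cache- or enclave-conscious engine might follow).
from typing import Iterable, Iterator, List

CHUNK_ROWS = 1024  # assumed bound on how many rows are resident at once

def chunks(rows: Iterable[int], size: int = CHUNK_ROWS) -> Iterator[List[int]]:
    buf: List[int] = []
    for row in rows:
        buf.append(row)
        if len(buf) == size:
            yield buf
            buf = []
    if buf:
        yield buf

total = 0
for chunk in chunks(range(10_000)):
    total += sum(chunk)   # each chunk is processed and then discarded
print(total)
```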
[00:48:41] Unknown:
Another topic that's always fun to explore with companies that are building on top of an existing open source project is how you're managing the balance of effort and engineering that goes into the open source versus the commercial aspect, or whether the open source was just an initial proof of concept that was then effectively abandoned in favor of a purely commercial product based on those underlying principles, and just some of the ways that you're thinking about the governance and sustainability aspect of MC 2 compared with the work that goes into Opaque Systems.
[00:49:15] Unknown:
To be honest, since launching Opaque we have been focused on getting the product and tech off the ground and packaging MC 2 into something that's enterprise ready and has all the capabilities that enterprises need. There is a major push coming to the open source in January, and we are now at a size where we have the resources available to us to maintain and grow the community going forward. So we completely intend to keep making progress and giving back to the community around open source, because anything that drives adoption and helps the community adopt confidential computing is a win for the entire space, given that it's a young, emerging category. So open source is key for us, and that's something that we will do more and more of going forward. It does bring up the question of how the closed source is different from the open source. What do you keep in the closed source? How do you think about that? Honestly, that is an evolving question, and it is something that we have passionate debates about internally.
The fundamental principle that we try to espouse is that anything that helps with adoption should be part of the open source, while capabilities that are necessary for enterprises who want support are part of the closed source product. For example, disaster recovery, high availability, and ecosystem integrations, for example into key management solutions or data sources, things like that: our current thinking is to retain them in the closed source. Anything that is required for enterprises who want support and who don't want to deploy the open source themselves is something we would keep in the closed source. But anything that helps with adoption, or anything whose absence would hinder adoption, should not be closed source and should be part of the open source.
Operationally, it would be nice if we could draw a clean line that does not allow the two code bases to diverge too much, so that it becomes much easier to maintain the open source as well, because we do want streamlined processes for pushing from the closed source to the open source and vice versa. If the code bases diverge, that becomes an operational nightmare. How successful we'll be on that front remains to be seen, but at least as far as Opaque versus MC 2 is concerned, we do have dedicated resources to maintain that two-way communication between the open and closed source. But, yeah, that's an evolving maintenance question. One decision this impacted was that instead of having multiple repositories, we combined everything into one giant repo, because that makes it much easier to maintain, not only internally but from the perspective of the open source as well. Because if you have multiple repositories all speaking to each other, then this problem gets compounded significantly.
I think we do have a blog post about that as well: our journey to a monorepo and what impact that had.
[00:52:11] Unknown:
And so in terms of your experience of building the Opaque platform and working with your customers and early adopters, what are some of the most interesting or innovative or unexpected ways that you've seen it applied?
[00:52:23] Unknown:
My favorite, to be honest, is the human trafficking use case. We're working on human trafficking with a group of banks, and money laundering, for example, is part of the same financial crime detection space. The problem is that to identify perpetrators of financial crime right now, the best that each bank can do is look at its individual transaction data and analyze it for patterns. Once there is a suspicious transaction, it gets flagged and an analyst files a suspicious transaction report. But any analysis they're doing as a result has very high false positives, because they're limited in their view of the movement of funds; criminals hide their traces across multiple banks.
So in order to detect financial crime more effectively, banks need to be able to collaborate with each other. Banks need to be able to share data with each other. Each bank has its own models and its own analytics processes for flagging suspicious transactions, but there is no effective sharing of intelligence taking place. With Opaque, this problem becomes tractable, because each bank can keep its data encrypted, pool it together, and still jointly run analytics on the collective data to identify cycles, for example, in the movement of funds, or train more robust models on the collective data. This particular problem is something I'm deeply passionate about because, of course, organizations like ourselves exist to ultimately drive revenue and make money, but if you can help the world in any small way while doing it, then that's a good thing. And it's also part of the reason why we were researchers first and then became a company.
Another cool example, and we didn't work on this particular one, but sometimes I like to wonder if we could have made a dent in the problem, is around COVID. One thing we learned during COVID is that contact tracing could have helped solve the problem to a large extent. But the moment we tried to deploy contact tracing solutions, we realized we needed to combine data from various patient repositories and patient silos, and the moment we tried to do that, all these patient confidentiality concerns came to the forefront, thwarting many of those efforts to a large extent.
Not to get ahead of myself, but had Opaque been around, then maybe we could have helped address that problem: preserving patient confidentiality while also helping with contact tracing efforts. We are working with healthcare institutions on better patient profiling, enabling healthcare data owners to combine their traditional electronic health record data with consumer data sources to identify patient behaviors and patient profiles so that patients can be more accurately diagnosed, and those results can be used to train better models for predicting diseases, patient behavior, and so forth. These are a few examples that I find very exciting. There are other examples as well in the ad tech space. Cookies are going away; the industry is terming that the cookie apocalypse.
And cookies have been the way, so far, for publishers and advertisers to identify common audiences. For example, if I am Nike and I want to advertise on CNN, I want to know whether CNN's audience sufficiently overlaps with my target base. How do I do that? I need to be able to combine my data with yours, intersect my data with yours, to identify whether we have common audiences, what the segmentation of that audience is, and so forth. Can we do that using alternate forms of information like account IDs or email addresses and IP addresses, without actually having to divulge that information to the other stakeholders? This opens up a world of possibilities as well. So those are just a few examples of use cases that I find fairly interesting and exciting.
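As a rough sketch of the audience-overlap idea, here is a minimal example that matches on salted hashes of email addresses rather than the raw values; the salt handling and identifiers are simplified assumptions for illustration, not how Opaque actually performs the intersection:

```python
# Minimal sketch: compare audiences via salted SHA-256 of lowercased
# email addresses instead of raw PII, then count the overlap.
import hashlib

SHARED_SALT = b"agreed-between-the-parties"  # assumption for illustration

def pseudonymize(email: str) -> str:
    return hashlib.sha256(SHARED_SALT + email.strip().lower().encode()).hexdigest()

advertiser = {pseudonymize(e) for e in ["ann@example.com", "bob@example.com"]}
publisher = {pseudonymize(e) for e in ["bob@example.com", "carol@example.com"]}

overlap = advertiser & publisher
print(f"common audience size: {len(overlap)}")
```

Hashing alone is not a complete privacy solution, since common identifiers can be guessed and re-hashed; in the scenario described in the conversation, the matching step would run inside the protected environment so that neither party's identifier list is exposed.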
[00:56:16] Unknown:
Absolutely. The advertising one is interesting too, because companies have gotten used to the web-oriented world of advertising where you do have all of this rich information and customer profiles that you can build up, which obviously has some privacy issues to go along with it, but there are challenges with how that maps to other distribution mechanisms. Podcasts, in particular, come to mind because I've been running podcasts for a number of years, and the medium doesn't have those same attribution capabilities because it is effectively an anonymous distribution channel, unless you're using something like a Spotify or another platform where you own the entire experience and can start collecting some of that other information. So it'll be interesting to see how companies adapt to this world of not having that very rich and detailed visibility into individual customer profiles.
[00:57:07] Unknown:
Yes. Couldn't agree more with you.
[00:57:10] Unknown:
In your experience of doing the research and building the MC 2 project and now turning that into Opaque Systems, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:57:23] Unknown:
One thing that comes to mind is that there is a gap between what technologists see as solutions for maintaining privacy and confidentiality and what regulators see as solutions for those guarantees. For example, an open area of judicial interpretation is whether encryption is a sufficient mechanism for de-identification of data. Technologically, yes, because unless you have the key, it provides much stronger security properties than traditional anonymization or pseudonymization techniques. But in some sense, it is reversible if you have the key.
Whereas other techniques may not be reversible in that way, which matters for the regulatory interpretation. So there is a gap between the technologist's worldview and the regulator's worldview. And as an emerging category, I think as an industry we need to do more around educating the market, customers, adopters, including regulators, about the promise of privacy-enhancing technologies like confidential computing. We need to do more around education, and we need to keep doing more of it, because people often don't know what is possible as a result of this technology. It's not like we're going to companies and telling them, you've been solving this problem using solution A, come use Opaque or adopt a privacy-enhancing technology because we can solve the problem better. No, we're often telling them that things that you have not been able to do so far, things that you did not know were possible, become achievable as a result of this technology. So you not only get stronger security, you at once get higher utility from your data as well. So it's really a win-win.
But this requires more education, more dissemination of knowledge, and building more awareness in the market, and we intend to keep doing that. Forums like your podcast are one such way that we hope to educate the audience about the promise of confidential computing and the things that you can do with it. But that's a continuous work in progress and something we intend to keep evolving.
[00:59:34] Unknown:
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it's often too late and the damage is done. DataFold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. DataFold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying. You can now know exactly what will change in your database.
DataFold integrates with all major data warehouses as well as frameworks such as Airflow and dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with DataFold.
[01:00:29] Unknown:
And so for people who are looking for ways to perform computation on data that has issues of sensitivity or regulation, or for organizations that are looking for avenues to open up collaboration either across departments within the business or between businesses, what are the cases where Opaque Systems is the wrong choice?
[01:00:52] Unknown:
In some cases, I would say, you may not need something like this. In some intra-company collaboration scenarios, it may be sufficient for you to have a platform that only enforces governance, and that may suffice for your workloads. In that case, I would not say Opaque is a wrong choice, because it still significantly improves your security posture and still mitigates the threat of bad actors or insiders getting access to unauthorized data. It may just be a solution you don't strictly need in some respects. So I wouldn't say it's the wrong choice in those scenarios; it provides much stronger security guarantees, but ultimately security is often a question of economics, right? What are you giving away for higher security? Maybe you're losing something in terms of performance or the kinds of capabilities the platform affords you. Those are questions that organizations need to think about. For cross-organization scenarios, you absolutely do need something like this, because we really need to move away from institutional trust to programmatic trust. You shouldn't have to trust a third party through a piece of paper to be a good custodian of your data; you should get that assurance, that technological assurance, from the platform itself. So in multi-organization scenarios, I would say it's an absolute must, because attackers are getting more and more sophisticated, and it allows you to increase your security posture overall. But in some intra-organization scenarios you may not need as strong guarantees, especially if you have an environment that's hosted on-prem.
In which case, traditional mechanisms of governance may suffice. But if you're moving to the cloud or moving data outside your premises, then absolutely I would. It does not need to be Opaque, but you do need to think about incorporating privacy-enhancing technologies into your suite.
[01:02:48] Unknown:
And as you continue to build out and evolve the Opaque platform, what are some of the things you have planned for the near to medium term, or any particular projects or problem areas that you're excited to explore?
[01:03:00] Unknown:
You hit upon it during our conversation about AI and machine learning. There is more that we need to do around making it super frictionless. Better data exploration. How do you enable policies that are not super complex and don't require me to vet each and every line of code, and worry about what happens if I miss something there? So as our AI and ML offering evolves and grows, and there is more to come on that front, these are some capabilities that I'm excited to add to the mix as well. The ultimate aim is to make it frictionless for data scientists and data analysts. It should be as simple for you to use confidential data as it is right now to work with regular vanilla data. And, yeah, a key part of that puzzle is making it simple in the data exploration and analysis phase, and also making it simple for policies to be enforced in keeping with your existing pipelines.
[01:03:54] Unknown:
Are there any other aspects of the work that you're doing at Opaque Systems or the question of confidential computing and its application to analytics and machine learning that we didn't discuss yet that you'd like to cover before we close out the show?
[01:04:07] Unknown:
I think we touched upon a lot of it. The one key thing is that it is now possible; Opaque makes it possible for you to collaborate on confidential data. Our focus is on analytics and machine learning workloads. But ultimately, I think the vision that is shared by my colleagues in the industry is that we need to be in a world where encryption or protection of data in use is the default, the way encryption at rest and encryption in transit have been standardized. The next frontier, really, and this is the third leg of the data protection stool, the third stage of the protection life cycle to me, is enabling encryption or protection for data in use as a default. We've made a lot of progress towards achieving that vision, but I'm looking forward to the day where we actually achieve it as a whole, where everything is always protected by default and you don't have to worry about data being exposed at any point in the life cycle. But for now, analytics and AI remain the key focus for us.
[01:05:09] Unknown:
Well, for anybody who wants to get in touch with you and follow along with the work that you and your team are up to, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[01:05:24] Unknown:
The biggest gap, Tobias, has to be around the lack of protection for data in use. That is the fundamental problem. Data attacks and data breaches have been growing exponentially over the years, and attackers are getting more and more sophisticated. A lot of the attacks rely on misconfigurations or simply on data not being kept protected in a proper way when it's being stored somewhere. But as organizations get more sophisticated, the threat vector will naturally become more pronounced around stealing data or getting access to data while it's being used. So that is the key gap. And that's not only in terms of the existing security postures of organizations, but also around enabling new possibilities and new use cases, because protection for data in use is a key requirement for you to be able to collaborate on confidential data with other entities. Without protection for data in use, you can't collaborate effectively without giving some third party access to your data. Confidential computing is a technology that fulfills or bridges that gap and provides that missing level of protection. Of course, from a holistic perspective, you still need other capabilities: things like policy enforcement, remote attestation, verifiability, and auditability.
But the key to all of this is the ability to protect data in use. We will see more and more of that as the technology evolves and matures, and as the market becomes more aware and educated about the availability of these technologies.
[01:06:58] Unknown:
Alright. Well, thank you very much for taking the time today to join me and share the work that you've been doing on the MC 2 project and how you're building on top of that with Opaque Systems. It's definitely a very interesting and fascinating area to discuss and explore. It's great to see you and your team working on making this a more tractable problem so that people are able to build better and more secure data systems and analytics, and unlocking the collaborative potential for intra- and inter-organizational analysis. So thank you again for the time that you and your team are putting into that, and I hope you enjoy the rest of your day. Thank you so much, Tobias. Thanks again for having me. It's been a pleasure. Super excited about what's happening.
[01:07:44] Unknown:
Thank you for listening. Don't forget to check out our other shows, Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com. Subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a product from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction to Rishabh Poddar and Opaque Systems
Challenges in Data Protection
Opaque Systems Platform Overview
Comparison with Homomorphic Encryption
Core Problems Addressed by Secure Enclaves
Policy Enforcement and Governance
Integration with Existing Data Platforms
Design and Architecture of Opaque Systems
Challenges in Data Collaboration
Open Source vs. Commercial Product
Interesting Use Cases
Future Plans and Enhancements