Summary
Encryption and security are critical elements in data analytics and machine learning applications. We have well developed protocols and practices around data that is at rest and in motion, but security around data in use is still severely lacking. Recognizing this shortcoming, and the capabilities that could be unlocked by a robust solution, Rishabh Poddar helped to create Opaque Systems as an outgrowth of his PhD studies. In this episode he shares the work that he and his team have done to simplify integration of secure enclaves and trusted computing environments into analytical workflows and how you can start using it without re-engineering your existing systems.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show!
- Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.
- RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudder
- Build Data Pipelines. Not DAGs. That’s the spirit behind Upsolver SQLake, a new self-service data pipeline platform that lets you build batch and streaming pipelines without falling into the black hole of DAG-based orchestration. All you do is write a query in SQL to declare your transformation, and SQLake will turn it into a continuous pipeline that scales to petabytes and delivers up to the minute fresh data. SQLake supports a broad set of transformations, including high-cardinality joins, aggregations, upserts and window operations. Output data can be streamed into a data lake for query engines like Presto, Trino or Spark SQL, a data warehouse like Snowflake or Redshift, or any other destination you choose. Pricing for SQLake is simple. You pay $99 per terabyte ingested into your data lake using SQLake, and run unlimited transformation pipelines for free. That way data engineers and data users can process to their heart’s content without worrying about their cloud bill. For data engineering podcast listeners, we’re offering a 30 day trial with unlimited data, so go to dataengineeringpodcast.com/upsolver today and see for yourself how to avoid DAG hell.
- Your host is Tobias Macey and today I'm interviewing Rishabh Poddar about his work at Opaque Systems to enable secure analysis and machine learning on encrypted data
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what you are building at Opaque Systems and the story behind it?
- What are the core problems related to security/privacy in data analytics and ML that organizations are struggling with?
- What do you see as the balance of internal vs. cross-organization applications for the solutions you are creating?
- Comparison with homomorphic encryption
- Validation and ongoing testing of security/privacy guarantees
- Performance impact of encryption overhead and how to mitigate it
- UX aspects of not being able to view the underlying data
- Risks of information leakage from schema/meta information
- Can you describe how the Opaque Systems platform is implemented?
- How have the design and scope of the product changed since you started working on it?
- Can you describe a typical workflow for a team or teams building an analytical process or ML project with your platform?
- What are some of the constraints in terms of data format/volume/variety that are introduced by working with it in the Opaque platform?
- How are you approaching the balance of maintaining the MC2 project against the product needs of the Opaque platform?
- What are the most interesting, innovative, or unexpected ways that you have seen the Opaque platform used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Opaque Systems/MC2?
- When is Opaque the wrong choice?
- What do you have planned for the future of the Opaque platform?
Contact Info
- Website
- @Podcastinator on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
- Opaque Systems
- UC Berkeley RISE Lab
- TLS
- MC²
- Homomorphic Encryption
- Secure Multi-Party Computation
- Secure Enclaves
- Differential Privacy
- Data Obfuscation
- AES == Advanced Encryption Standard
- Intel SGX (Software Guard Extensions)
- Intel TDX (Trust Domain Extensions)
- TPC-H Benchmark
- Spark
- Trino
- PyTorch
- Tensorflow
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- Upsolver: ![Upsolver](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/aHJGV1kt.png) Build Real-Time Pipelines. Not Endless DAGs! Creating real-time ETL pipelines is extremely time-consuming and engineering intensive. Why? Because when we attempt to shoehorn a 30-year old batch process into a real-time pipeline, we create an orchestration hell that makes every pipeline a data engineering project. Every pipeline is composed of transformation logic (the what) and orchestration (the how). If you run daily batches, orchestration is simple and there’s plenty of time to recover from failures. However, real-time pipelines with per-hour or per-minute batches make orchestration intricate and data engineers find themselves burdened with building Directed Acyclic Graphs (DAGs), in tools like Apache Airflow, with 10s to 100s of steps intended to address all success and failure modes, task dependencies and maintain temporary data copies. Ori Rafael, CEO and co-founder of Upsolver, will unpack this problem that bottlenecks real-time analytics delivery, and describe a new approach that completely eliminates the need for orchestration, so you can remove Airflow from your development critical path and deliver reliable production pipelines quickly. Go to [dataengineeringpodcast.com/upsolver](dataengineeringpodcast.com/upsolver) to start your 30 day trial with unlimited data, and see for yourself how to avoid DAG hell.
- Rudderstack: ![Rudderstack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png) RudderStack provides all your customer data pipelines in one platform. You can collect, transform, and route data across your entire stack with its event streaming, ETL, and reverse ETL pipelines. RudderStack’s warehouse-first approach means it does not store sensitive information, and it allows you to leverage your existing data warehouse/data lake infrastructure to build a single source of truth for every team. RudderStack also supports real-time use cases. You can Implement RudderStack SDKs once, then automatically send events to your warehouse and 150+ business tools, and you’ll never have to worry about API changes again. Visit [dataengineeringpodcast.com/rudderstack](https://www.dataengineeringpodcast.com/rudderstack) to sign up for free today, and snag a free T-Shirt just for being a Data Engineering Podcast listener.
- Datafold: ![Datafold](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/zm6x2tFu.png) Datafold helps you deal with data quality in your pull request. It provides automated regression testing throughout your schema and pipelines so you can address quality issues before they affect production. No more shipping and praying, you can now know exactly what will change in your database ahead of time. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI, so in a few minutes you can get from 0 to automated testing of your analytical code. Visit our site at [dataengineeringpodcast.com/datafold](https://www.dataengineeringpodcast.com/datafold) today to book a demo with Datafold.
- Linode: ![Linode](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/W29GS9Zw.jpg) Your data platform needs to be scalable, fault tolerant, and performant, which means that you need the same from your cloud provider. Linode has been powering production systems for over 17 years, and now they’ve launched a fully managed Kubernetes platform. With the combined power of the Kubernetes engine for flexible and scalable deployments, and features like dedicated CPU instances, GPU instances, and object storage you’ve got everything you need to build a bulletproof data pipeline. If you go to: [dataengineeringpodcast.com/linode](https://www.dataengineeringpodcast.com/linode) today you’ll even get a $100 credit to use on building your own cluster, or object storage, or reliable backups, or… And while you’re there don’t forget to thank them for being a long-time supporter of the Data Engineering Podcast!
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends at Linode. With their new managed database service, you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes with automated backups, 40 gigabit connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show.
Legacy CDPs charge you a premium to keep your data in a black box. RudderStack builds your CDP on top of your data warehouse, giving you a more secure and cost effective solution. Plus, it gives you more technical controls so you can fully unlock the power of your customer data. Visit rudderstack.com/legacy to take control of your customer data today. Your host is Tobias Macey, and today I'm interviewing Rishabh Poddar about his work at Opaque Systems to enable secure analysis and machine learning on encrypted data. So, Rishabh, can you start by introducing yourself?
[00:01:21] Unknown:
Absolutely. Thanks for having me here today, Tobias. It's a pleasure. I'm Rishabh. I'm the CEO and cofounder of Opaque Systems. We are a startup born out of research and open source at UC Berkeley. And in a nutshell, we provide a confidential computing platform for collaborative analytics and AI at scale.
[00:01:40] Unknown:
And do you remember how you first got started working in data?
[00:01:43] Unknown:
Yes. Absolutely. So I have always been passionate about data privacy and security, starting from my undergrad days. Starting from research in undergrad, then coming to Berkeley Computer Science for my PhD, I always knew that cryptography and system security was something that I was interested in, and especially the key question that we were looking to answer as part of our research. At UC Berkeley, I met my PhD adviser, Raluca Ada Popa. She's a world renowned cryptographer and system security expert. And the problem that we wanted to address was: how do you enable computation on confidential data while keeping the data confidential?
Right now, if we take a step back and look at the state of data protection that exists in the world today, as an industry, we have solutions that can protect or encrypt data at rest. When it's stored on disk or in the cloud, you can encrypt it using standard mechanisms. We also know how to encrypt data in transit, when it's being sent over the network from a source to a destination. We can encrypt it and protect it, once again, using standard encryption protocols like TLS and HTTPS. What we don't have widely deployed solutions for today is encryption or protection for data in use.
Right now, when data needs to be processed by software on a machine, it needs to be unencrypted. And this makes it a point of vulnerability and failure. It makes it susceptible to attackers, bad actors, and so forth. So the larger question was: how do you enable insights and computation from confidential data? How do you activate confidential data without compromising its confidentiality? And this was the broader problem that we were looking to address. As part of my PhD at Berkeley, we were part of this lab called the RISE Lab, which has been a hotbed for a lot of successful research and open source, at the risk of sounding immodest. Our work was funded by government grants, but also tech firms like Google, Microsoft, Facebook, and Amazon, as well as more traditional Fortune 500 type companies. And the work we did was always informed by close partnership with industry, on problems that we saw would be super important 5 years down the line. In conversations with them, we realized that a key pain point here was the inability to share confidential data, even within organizations, let alone outside of organization boundaries.
So as part of that, we built the MC² open source project that we are now productizing at Opaque.
[00:04:14] Unknown:
So as far as the Opaque Systems platform, I'm wondering if you can give a bit more detail about what it is that you're building and putting on offer to address this challenge of working with confidential data in a manner that is safe and privacy respecting?
[00:04:33] Unknown:
The key problem is the inability to share confidential data, and the key requirement here is: can we keep the data protected and encrypted while it's being used? If we could do that, then we could share this confidential data with different teams within my organization, or share this data with other entities within my business ecosystem. And if I could do that, then I could collaborate with other data owners, jointly run analytics, or jointly train models on the collective combined data while keeping it protected. So no one gets to see it. Not the other data owners, not even Opaque, not even the cloud platform where Opaque might be running. Throughout the life cycle of the computation, the data remains protected and encrypted, but you can still collaborate on it. You can still run big data analytics. You can still train models on the combined data towards some mutually beneficial end. For example, we see a lot of excitement and urgency in use cases in finance and health care and advertising technologies, simply by enabling multiple data owners to collaborate on their confidential data assets.
What we provide is a platform that facilitates this data collaboration for analytics and AI while keeping the data confidential. So as a data scientist or a data analyst, you should not have to be an expert in the underlying confidential computing technology. You should still be able to run the same workflows that you are currently used to for getting insights from the confidential data while making use of that confidential data. So you can activate your confidential data as easily as if it were regular vanilla data, while still maintaining the confidentiality protections and requirements. Really making it frictionless for data scientists and data analysts to collaborate on confidential data.
[00:06:18] Unknown:
In terms of the practical implementation of this, I'm wondering if you can provide some detail and nuance about how what you're building with MC² and Opaque Systems compares with the topic of homomorphic encryption, which I know saw a lot of popularity and hype a few years ago and was always, you know, about to be put into production, but never quite made it there.
[00:06:40] Unknown:
That's a very, very good question, Tobias. In addition to homomorphic encryption, there are other protocols as well, like secure multiparty computation. These are beautiful cryptographic protocols that use purely software based, mathematical cryptographic approaches that at once allow you to keep the data encrypted while still running programs or operations on it. Part of my research was on those topics as well, to be honest. Again, beautiful technologies; my PhD thesis was on that topic. My cofounder and PhD adviser, she is an expert in the space too. But one thing that became very clear to us, and is also a reason as to why these technologies haven't seen the kind of proliferation that we had hoped for in the last decade, was that they are still far too resource intensive.
Computations that should take seconds or minutes can take hours or days, depending on the nature of the computation. They are typically orders of magnitude slower than regular computation, which makes them rather unsuitable, in my opinion, for the kinds of workloads that we're looking to address. Users of this technology, people who want to collaborate on confidential data, need their systems or solutions to be fast and performant and scalable. And we're not there yet with technologies like homomorphic encryption or secure multiparty computation. So instead, the approach we took was to base Opaque on confidential computing technology.
It's different from a purely cryptographic, software based approach like homomorphic encryption. It's rooted in secure and trusted hardware that was pioneered by Intel in the last decade and only became available in the clouds in the last couple of years. What this technology allows you to do, at a very high level, is essentially create a trusted execution environment, sort of like a secure black box within the CPU hardware itself. You can take security critical pieces of code and data and put them inside this black box, and the hardware ensures that no software outside this trusted execution environment, not even privileged software like the operating system or the hypervisor, or an attacker who breaks in and gains root access, or system administrators, no one can penetrate this black box and look inside.
The best that they can do is look at memory, but the hardware ensures that the memory is always encrypted. And the only way to get access to unencrypted data, practically, is to physically attack the CPU chip. But then you'll end up destroying the processor, and you need physical access as well. So this is a very powerful, revolutionary paradigm in my opinion. And what we do at Opaque is provide the software ecosystem that can power this hardware capability for analytics and machine learning workloads. Fun fact here: Intel actually gave us access to this hardware in the lab at Berkeley even before it was commercially available in the clouds. So that really allowed us to spearhead the development of frameworks and drive adoption and disseminate the technology as well.
[00:09:48] Unknown:
As far as the application of these secure enclaves and secure computation in the data analytics and ML ecosystem, what are some of the core problems that you see organizations struggling with where this is the solution that they're looking for?
[00:10:05] Unknown:
So there are two parts to it, really. You can use this technology, at its core, in a variety of ways. One, with the adoption of confidential computing, you can now accelerate your digital transformation. Lots of organizations have confidential data locked down on premises. Right? They can't share this data across teams even within the same organization, across lines of business within banks, for example, let alone share this data outside organization boundaries or with other entities, or even move it to the cloud. With the adoption of this technology, you can now accelerate that transformation, your migration to the cloud as well. Because now you can keep your data protected at all times on the cloud, without having to necessarily trust the platforms or software running on the cloud, by running whatever software and applications you have within confidential computing environments.
So really moving away from institutional trust to programmatic trust. Second, it enables data collaboration as well, which really unlocks many use cases. Things that organizations have struggled to achieve become possible as a result. Because multiple data owners can now each individually encrypt their data, pool it together in the cloud, combine it in encrypted form, and then jointly analyze it or jointly train models on it towards some mutually beneficial aim. And this is a particularly powerful paradigm that we see requirements for across industries.
For example, banks can collaborate towards identifying human traffickers or money launderers. Health care institutions can collaborate and share data to train better disease prediction models and run better patient profiling. In the advertising world, publishers and advertisers can combine their datasets to identify common audiences or user behavior. A rich variety of analytics and machine learning based use cases on confidential data that were not possible before now become available to organizations as well. And third, all of this happens in a way that makes it easy to comply with privacy laws and regulations as well. Until the last decade, privacy and security have kind of been an afterthought for most organizations.
But now we are seeing the emergence of GDPR, and the world is following suit with newer privacy laws and regulations as well, which are increasingly controlling how confidential data can be used by third parties and software processors. And this technology really makes it easier to comply with those laws and regulations while still enabling insights from the data. In fact, I would argue that you get better utility from your data as a result, because now you can use datasets that you weren't able to use before because of confidentiality restrictions.
[00:12:57] Unknown:
In terms of the application, some of the other techniques that come to mind as we're talking are things like differential privacy or data obfuscation. And also, it brings up the question of, if the data is in the secure enclave, what are some of the ways that you protect against data exfiltration or some of these re-identification attacks that these kinds of obfuscated datasets are subject to, and just some of the broader space beyond just encryption, of these data security questions.
[00:13:27] Unknown:
You've actually hit upon a key point here, and this problem becomes more pronounced in the context of data collaboration as well. Because if you are the only one using your dataset, sure, you can enforce controls in a more reliable, governed way. But if I am collaborating with you, you should not be allowed to do whatever you want with my data. You should only be allowed to do what I permit you to do with my data. So the ability to collaborate on data in some sense exacerbates this. It opens up new challenges around governance and policy enforcement as well.
To answer your question, in the absence of this technology, our industry has relied on approaches like obfuscation, data masking, tokenization, and anonymization to enforce controls on the data. The problem with these approaches on their own is, one, there has been a lot of research that has shown that they're not really secure, because tokenization or data masking can be reversed. If you have access to auxiliary datasets, if you have more fields available to you that are not masked or not tokenized, then you can learn information about the underlying data. Also, to be able to make use of the tokenized or masked data, what these approaches typically do is map the data, the confidential fields, to deterministic values.
And because they're deterministic, you can now maybe join two datasets on the tokenized field. But because it's deterministic, it's also insecure, because if I know what the tokens to my confidential data map to, then I also know what the tokens to your confidential data map to. So standard approaches like obfuscation or masking don't really quite work. They're not really secure. What you really need is randomized encryption. But the problem with randomized encryption has been that, well, because it's randomized, you now can't combine these two datasets together, because your fields map to some random value and my fields map to some different random value. How do I bridge that gap? And this is where confidential computing comes in and sort of allows you to get the best of both worlds. In fact, you can now also add additional data fields that you previously wouldn't have wanted to share as part of your dataset, because everything remains encrypted by default. But you can still combine datasets together. You can still run operations on them and get insights from them.
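As a rough illustration of the deterministic versus randomized distinction described here, consider this minimal Python sketch; the helper names, the use of SHA-256 for tokenization, and AES-GCM for randomized encryption are illustrative assumptions, not a description of Opaque's implementation:

```python
import hashlib
import os

from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def tokenize(value: str, salt: bytes = b"shared-salt") -> str:
    # Deterministic: the same input always yields the same token, so two
    # datasets can be joined on it, but anyone who can guess inputs (or who
    # holds their own tokens) can recover the mapping.
    return hashlib.sha256(salt + value.encode()).hexdigest()

key = AESGCM.generate_key(bit_length=256)
aesgcm = AESGCM(key)

def encrypt(value: str) -> bytes:
    # Randomized: a fresh nonce per record means equal plaintexts produce
    # different ciphertexts, so nothing leaks from the ciphertext itself,
    # but you also can't join on it outside a trusted environment.
    nonce = os.urandom(12)
    return nonce + aesgcm.encrypt(nonce, value.encode(), None)

print(tokenize("alice@example.com") == tokenize("alice@example.com"))  # True: joinable, but leaky
print(encrypt("alice@example.com") == encrypt("alice@example.com"))    # False: secure, but not joinable
```

A secure enclave resolves this tension by decrypting the randomized ciphertexts only inside the trusted hardware boundary, where the join can be computed without exposing plaintext to the host.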
Key to all of this, as far as collaboration is concerned, is the ability to enforce policies. Because again, you should not be allowed to do whatever you want with my data. If you run a SQL query, for example, that says select star, it truly gives you access to all of my data, completely violating the guarantee that you sought in the first place. So one key value that our platform also provides is the ability to enforce policies around who is allowed to do what with the data. And this goes beyond traditional mechanisms of policy control, like role based policies or data access policies. You can now also specify policies around how the data can be used and what results you're allowed to see. So that is a key part of it as well. You touched upon differential privacy as well, and differential privacy is also a very exciting technology that in some sense is complementary to confidential computing, and even to technologies like homomorphic encryption and multi-party computation.
All of these different technologies fall under the privacy enhancing technologies umbrella. And I think as an industry, we need to disentangle the properties that these different approaches provide, because some of them are alternatives to each other, but some work in a complementary fashion. Differential privacy is complementary because what differential privacy does, basically, is prevent leakage from the results of the analysis that you're doing. At a very high level, the way that works is it allows you to add some noise, some mathematically computed noise, to the results of your aggregate analysis. So for example, if you want to learn the average age of everyone in your dataset, then instead of giving you the exact average, differentially private solutions will add some noise to it so that you get a noisy average.
And the key property that this provides is that, as a result, you, the analyst, are not able to pinpoint whether a particular data item exists in the dataset or not. You won't be able to tell whether your information is in the dataset or not, or whether my information is in the dataset or not, because of the addition of this noise. What it does not provide is protection or confidentiality for the data while the analysis is being run on top of it. The computation happens on regular, unencrypted data. It's only that the recipient of the results gets noisy information.
You could run differentially private solutions within confidential computing as well, getting protection for data in use while also preventing leakage from the results of the analysis. So in that sense, these two technologies are complementary, I would say.
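To make the noisy average idea concrete, here is a minimal Laplace-mechanism sketch in Python; the epsilon value, the clipping bounds, and the simplified sensitivity calculation are illustrative assumptions rather than a recipe from any particular product:

```python
import numpy as np

def dp_average(values, lower=0.0, upper=100.0, epsilon=1.0, rng=None):
    """Differentially private mean via the Laplace mechanism (toy example)."""
    rng = rng or np.random.default_rng()
    values = np.clip(np.asarray(values, dtype=float), lower, upper)  # bound each record's influence
    sensitivity = (upper - lower) / len(values)  # max change in the mean from altering one record
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return values.mean() + noise

ages = [34, 29, 41, 52, 47, 38]
print(dp_average(ages))  # a noisy average: the analyst can't tell whether any one person is included
```

Run inside an enclave, the same computation would also keep the raw values encrypted in memory while it executes, which is the complementary pairing described above.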
[00:18:30] Unknown:
A lot of interesting stuff to dig into there. Before we get too far into the weeds on the technical implementation of it, some of the other interesting aspects of applying the solution are around the performance impact that has typically been associated with managing the encryption of data, and then also, to your point of being able to define and apply policies on what operations are allowed on a given dataset, particularly in the context of a collaborative data agreement where maybe multiple organizations have different datasets that they want to be able to combine together, the question of who is empowered to define what those policies are and how they are enforced. And what does that negotiation process look like, and what are some of the technical controls that are available for them to be able to compose together different constraints that they want to apply?
[00:19:24] Unknown:
These are real issues that we deal with, and I think more work needs to be done to provide solutions that are as user friendly as possible. I think we've come a long way, but more work needs to be done. But let's talk about performance and scalability issues first with data encryption, and what the overhead is of adopting technologies like confidential computing on your workload. This was, by the way, a key reason as to why technologies like homomorphic encryption and purely cryptographic approaches are not suitable for large datasets: the extra overhead that comes with working on encrypted data. Traditional standard encryption, like AES based encryption, for example, is actually now very fast, because modern hardware contains special modules and instructions that can execute the encryption operations in the hardware itself as opposed to implementing them purely in software.
So standard encryption as a result is very, very fast. Encryption at rest and encryption in transit both use standardized encryption protocols. The problem comes when you need software based encryption, because that is much slower, and the problem gets exacerbated or compounded with specialized cryptographic protocols, because intuitively you need to maintain some structure within the data itself while also obfuscating that structure. And this literally leads to a blow up of the ciphertext, the underlying data. And that, in some sense, is the problem with purely cryptographic approaches like homomorphic encryption or secure multiparty computation: they blow up the amount of resources that you need, and they blow up the amount of time it takes to compute on the data.
With confidential computing, the good thing is that this encryption is still happening in hardware. So when data is in memory, it's encrypted. When data is loaded inside the secure enclave in the CPU, only there is it decrypted, at that boundary. But this encryption and decryption happens in hardware itself. When the data is moved back from the CPU die to memory, it's reencrypted again. So, yes, there is overhead. The overhead is determined by how frequently data needs to be moved in and out of the environment. And solutions that power confidential computing or secure enclave analysis need to be cognizant of this overhead. They need to be architected in a way that minimizes this flow of data in and out of the enclave. They need to be designed in a way that optimizes the data movements and is aware of the architecture of confidential computing.
Also, as far as Intel's version of confidential computing is concerned, the initial versions of Intel SGX had limited memory available to the enclave. It was restricted to a little over 100 megabytes. That further compounded the problem, because the less memory you have, the more frequent this data movement in and out of that restricted memory is, which also compounded the overhead that was incurred by confidential computing based solutions. That second problem has gone away, because newer generation Intel machines have several gigabytes of memory available to the enclave to process its workloads. So that part of the overhead has gone away. We also now have confidential virtual machines, where the entire virtual machine is effectively an enclave. AMD's SEV technology provides a version of confidential virtual machines, and Intel has also announced the TDX solution, which also provides confidential virtual machines. In some sense, these confidential VM approaches provide slightly weaker security guarantees, because now the operating system and other software are also included within the trusted computing base. But on the flip side, they're much more flexible.
You can run arbitrary programs effectively inside the confidential VM while taking advantage of the entire memory at your disposal, also making these workloads much, much faster. But the upshot of all of this is, yes, with confidential computing, you do have performance overhead, and solutions built on confidential computing need to be aware of the architecture so as to minimize this overhead. In our benchmarks, we ran Opaque against the TPC-H benchmark, which is an industry standard benchmark for SQL based analytics, and we found the overhead to range from a few percentage points to a few tens of percent, making it much, much more tractable and practical today.
So performance has not been an issue for us as far as customers are concerned. Yes, there is some overhead, but the overhead is a minimal, small price to pay, unlike purely cryptographic approaches where the overhead is on the order of tens of thousands or even hundreds of thousands. So that is on the topic of scalability and performance. The second part of what you mentioned was about policy enforcement and governance and who is responsible for that. That is a very good question, and the answer to that, I suppose, depends on the specific use case and the organization itself.
Fundamentally, what we need to guarantee is that data owners permit the computation that runs on their data. Now, the policy enforcement mechanism can be very flexible because, essentially, you can run whatever you want inside the confidential computing environment. On the one hand, you can have policy whitelists, which essentially is a whitelist of the queries or scripts that you are allowed to run. So let's say you want to process my data using some script: you can submit that script for my approval. If I approve it, then you're allowed to execute that script on our collective combined data. If we are in a consortium or a collaboration with multiple data owners, then as long as everyone permits that particular script, you're allowed to execute it on that data. You can also specify more generalized policies for structured languages like SQL.
For example, as long as your query is operating on a certain number of data items, or as long as your query is a type of aggregate query that I am fine with, then any query that conforms to those specifications is allowed to run on that data as well. Policy enforcement can be granular as well as generic. You can also specify policies saying you can only run computations on certain data rows that meet a certain criterion, and what that criterion is can be determined by a UDF or an expression of some sort. You can also have policies on columns in a similar spirit. So you can define a rich variety of policies, and what policies make sense would depend on the specific use case. For example, for healthcare, the policies need to adhere to HIPAA requirements, and HIPAA has some requirements around what the results are allowed to reveal and what they're not allowed to reveal. In other domains, the policies may have different sets of requirements. So there's a rich space of policy enforcement that is possible, but the exact policies depend on the specifics of the use case or the business problem at hand and the regulatory regime under which it's covered.
Who controls or who specifies the policies? Fundamentally, it is the data owner. But the data owner can be a separate organization which may want its own controls, or maybe the governance team is responsible for vetting and approving policies, and that's all separated from the roles of the analysts or the data scientists. But that becomes more of an operational question as opposed to a technological limitation or technological question.
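As a toy illustration of the whitelist and aggregate-only policies described above, here is a short Python sketch; a real platform would parse the SQL properly, and the hash set, regexes, and function names here are illustrative assumptions:

```python
import hashlib
import re

# SHA-256 digests of scripts that every data owner has already approved.
APPROVED_SCRIPT_HASHES = {
    "3f7a1c0e...",  # placeholder digest of a previously approved training script
}

def script_is_approved(script_text: str) -> bool:
    """Whitelist check: only byte-for-byte approved scripts may run."""
    return hashlib.sha256(script_text.encode()).hexdigest() in APPROVED_SCRIPT_HASHES

def query_is_aggregate_only(sql: str) -> bool:
    """Generalized policy: reject raw row dumps, require an aggregate in the projection."""
    sql_lower = sql.lower()
    if "select *" in sql_lower:
        return False
    return bool(re.search(r"\b(count|sum|avg|min|max)\s*\(", sql_lower))

print(query_is_aggregate_only("SELECT AVG(age), region FROM patients GROUP BY region"))  # True
print(query_is_aggregate_only("SELECT * FROM patients"))                                  # False
```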
[00:26:47] Unknown:
Build data pipelines, not DAGs. That's the spirit behind Upsolver SQL Lake, a new self-service data pipeline platform that lets you build batch and streaming pipelines without falling into the black hole of DAG based orchestration. All you do is write a query in SQL to declare your transformation, and SQL Lake will turn it into a continuous pipeline that scales to petabytes and delivers up to the minute fresh data. SQL Lake supports a broad set of transformations, including high cardinality joins, aggregations, upserts, and window operations.
Output can be streamed into a data lake for query engines like Presto, Trino, or Spark SQL, a data warehouse like Snowflake or Redshift, or any other destination you choose. Pricing for SQL Lake is simple. You pay $99 per terabyte ingested into your data lake using SQL Lake and run unlimited transformation pipelines for free. That way, data engineers and data users can process to their heart's content without worrying about their cloud bill. Go to dataengineeringpodcast.com/upsolver today to get a 30 day trial with unlimited data and see for yourself how to untangle your DAGs. To your point of, when you mentioned languages like SQL, that also brings up the question of the ways that Opaque Systems integrates into the overall data platform of an organization. So is it presented as a data warehouse? Is it that you manage orchestration of these secure enclaves for applying to the actual compute operations that are being run on your Spark cluster or your Trino cluster or your Snowflake environment? And just some of the overall platform integration and data modeling questions that go along with how to incorporate your confidential computing capabilities into operations that are already present in an organization. Yes.
[00:28:40] Unknown:
The ability to integrate with existing analytics and AI workflows is key for ease of use and for making it frictionless for data scientists and data analysts. Right? They should be able to use the same scripts or the same workflows that they're currently used to without having to be an expert in a new framework or a new technology, or having to learn a new language that works with confidential computing. That is the gold standard. The way it's architected right now is we provide a data plane that allows you to run big data frameworks within confidential computing environments. So for example, one part of our solution is for Intel SGX secure enclaves.
Now, Intel SGX enclaves require the application, which is the big data framework, be it Spark or something else like PyTorch or TensorFlow, to be architected against the enclave's APIs. An analogy here is, if you want to make use of GPUs for accelerating your machine learning workloads, you need to program those applications against the GPU's interface using CUDA or something else. A similar analogy applies for Intel's enclaves for confidential computing. The application that you want to run inside confidential computing needs to be programmed against those APIs as well. So what we have is our own modified version of Apache Spark, Spark SQL, that speaks the language of the enclave.
Now, it's not that the entire framework is running inside the enclave. No, because that would be a nightmare from a maintenance perspective. It also increases the amount of code that you need to run inside the enclave, which has implications for the amount of code you need to trust and verify, but also for performance. So instead, what we did, and this is available in the MC² open source, is we identified the core operators that need to process the data, the core SQL operators, for example. And only those portions of the framework are running inside the enclave.
The rest of Spark, the part of Spark that does not need to directly see or process the data, runs outside. So for example, the cluster management, the query planning, the distribution framework, all of that can run outside the enclaves. There is a small number of operators that your query is mapped to, and only those need to run inside the enclave. In that sense, it's not like you can take your existing Spark deployment and magically make that secure. What you can do is use Opaque's platform. Opaque can run within your environment. If you have a cloud environment on Azure, for example, Opaque can run in that cloud environment. The only requirement is that whatever environment Opaque is running in, you have physical servers available that have confidential computing capabilities.
Once the platform is running within your environment, we provide a client that you can use to encrypt and upload data to a location of your choice. For example, you can encrypt your data and upload it to Azure Blob Storage. And then you can use the client to communicate with the data platform that's running on the cloud. You can submit jobs, the same Spark jobs in Scala and SQL or using PySpark, and those jobs get executed on the data. The data gets loaded inside the Opaque platform, inside confidential computing enclaves, and is then processed using the script. So from that perspective, you don't need to modify your analytic queries or scripts. You can run those same scripts as before, but now the processing happens within a fixed cluster, within a fixed platform, a fixed version of Spark or other machine learning frameworks that you want to make use of. So in that sense, we can integrate with your existing workflows. All we need to do is be able to pull the data from its source, load it inside Opaque, and then the results can be shared with the analyst directly.
Our ecosystem is growing as far as data platforms and data warehouses are concerned. So, yeah, stay tuned for more, I suppose, on that front.
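To picture the "same scripts as before" point, here is a minimal PySpark sketch of what the analyst-side job could look like; the storage paths, column names, and the assumption that encrypted datasets were already uploaded with the platform's client are all illustrative, not Opaque's actual API:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("joint-audience-overlap").getOrCreate()

# Each data owner has (hypothetically) encrypted and uploaded their dataset,
# e.g. to Azure Blob Storage; inside the platform, enclave-backed operators
# handle decryption, so the query text itself is ordinary Spark SQL.
publisher = spark.read.parquet("abfss://container@account.dfs.core.windows.net/publisher_users")
advertiser = spark.read.parquet("abfss://container@account.dfs.core.windows.net/advertiser_users")

overlap = (
    publisher.join(advertiser, on="user_id", how="inner")  # join runs on protected data
             .groupBy("region")
             .agg(F.countDistinct("user_id").alias("shared_users"))
)
overlap.show()  # only the aggregate result is revealed to the analyst
```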
[00:32:41] Unknown:
Digging a bit deeper now on the opaque systems platform itself, I'm wondering if you can talk to some of the architecture and design aspects of how you're thinking about building it so that it is a frictionless experience for teams to be able to integrate it into their existing run times. And some of the ways that the design and scope of the product have changed since you first started working on it and have been working with some of your early customers to figure out what are the actual challenges that they're trying to overcome and some of the sharp edges that they run up against from the kind of initial formulation of the solution.
[00:33:18] Unknown:
Absolutely. And, I mean, the product has undergone significant refinements compared to our open source days. When we started working on the open source at Berkeley on MC², the design of the open source and the research was informed by discussions with industry partners. For example, we were already doing POCs with many of our sponsors and partners and collaborators in the lab. And that laid the foundations for the kinds of capabilities that they wanted from the platform. What kinds of analytics did they want to run? What kinds of machine learning did they want to run? Something that surprised me back in the day was that, a lot of the time when organizations talked about AI and machine learning,
my first reaction was, okay, they want to be able to execute deep neural networks and things like that on their confidential datasets. But a lot of the time, organizations want predictability and explainability. So decision trees and regression models were the kinds of tools and capabilities that they wanted. So what we decided to support in the open source was informed by some of those discussions. As we worked more closely with partners, we realized that merely protecting data and keeping it encrypted is not enough. You now need support for policies, because without policies, everything is useless.
And so the whole policy enforcement space is something that's evolved over time as we worked with early customers and early adopters. Depending on the sector, the vertical you're in, the kinds of capabilities needed are different. So for example, in the ad tech space, simple SQL based analytics is enough. But in many financial or healthcare use cases, people want more sophisticated capabilities. People want to be able to support their own machine learning pipelines. They don't want to be restricted to a certain set of libraries. And that has a bearing on the underlying architecture and the choice of confidential computing frameworks.
What that means is perhaps Intel SGX in those cases is insufficient, because it is not as flexible as confidential VM based approaches. With SGX, the framework needs to be programmed against the enclave APIs, whereas with confidential VMs, you don't have that requirement. You can run frameworks of your choice within the environment, but you still need the rest of the ecosystem of tools. You still need to be able to work with and decrypt files or datasets encrypted under different keys. You still need to be able to integrate with key management solutions. How do you make this distributed?
How do you ensure that when it's distributed, the communication that's taking place between machines is also protected and secure? A key requirement that opens up is around attestation: how do I verify that the environment has been securely set up and that I am actually using confidential computing machines and not regular machines? So all of this tooling that's required for an enterprise ready solution comprises aspects and capabilities that evolved and grew over time in conversations and work with design partners as well. Really, that is what we provide. We provide the entire ecosystem of software that makes it possible to power confidential computing for analytics and machine learning. And our machine learning capabilities are growing and evolving, and we'll have rich support available on that front shortly.
For example, GPU enclaves are coming as well. Azure recently announced the availability of GPUs as part of the confidential computing product suite. Once those become available, then we can offer richer capabilities with higher performance guarantees, because for now, you're still restricted to processing workloads on CPUs only, since that technology doesn't exist publicly just yet. So we're looking forward to further enhancing and broadening the scope of the art of the possible as far as AI and analytics are concerned.
[00:37:13] Unknown:
And in terms of the types of data that are feasible to use in these confidential computing environments, and some of the data modeling considerations that go along with how to think about building your computation, I'm wondering what are some of the constraints that are imposed and some of the ways that you're working to either smooth the imposition of those constraints on existing datasets or open up the degree of constraints so that more types of workloads and more types of data can be processed with these
[00:37:49] Unknown:
tools. This is one key point as far as machine learning in particular is concerned. Because with machine learning, it's not like you magically have a model that can now run on the combined dataset. There is this whole data engineering and data exploration phase that is at odds with data privacy and confidentiality. If I can't see the data, how do I know how to train the model? What fields to use? How to engineer those features? So that is a constraint that any environment that enables collaborative machine learning needs to take into account. As for the best approaches, I think we're learning more, and the industry as a whole is evolving as well. But some ways in which you can do this are: one, you can have an insecure or simulation mode, for example, that allows you to share some data with me. And I can see that data and combine it with my own and do my feature engineering or whatever data exploration I need to do to develop the actual model training scripts that will get deployed in production.
And once I have done that exploration and engineering, then I can flip the switch and start the secure mode, so to speak. And from that point on, whatever data is being joined and combined and whatever models are being trained remain protected, and no one can see them. That is one way to enable constraints while still enabling data exploration and preprocessing and so forth. Another way to do this is to use synthetic data or obfuscated data as part of that insecure data exploration mode. But you're right, this part of the pipeline is something that needs to be articulated, I suppose, more clearly. What may work for one organization may not work for another; for example, they may not have a way of sharing any data at all as part of that exploration phase. So in that case, synthetic data could be a worthwhile approach, or using some sort of obfuscated data as part of that exploration phase could be a worthwhile approach. Another way to do this is to use the policy framework to impose constraints on what the data analyst or data scientist is allowed to do as part of that exploration.
In that case, as long as you approve the kinds of preprocessing operations or the kinds of feature engineering operations on that data, then as long as the scripts are compliant with that policy regime, they can be executed and enforced. But that can also come with a certain amount of friction as far as interactivity is concerned. Right now, the way we are thinking about this is having that insecure mode in which there is no confusion from a UX perspective: it's clear that whatever data I choose to share with you for your data exploration will be visible to you. I can either choose to share some subset of my data, or I can choose to share some fake information that mimics the schema of that data, which allows you to decide how your scripts might be written.
But once that exploration is done, we enable the secure mode, and everything from there on happens only on confidential, protected data.
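For the fake-information-that-mimics-the-schema idea, a toy sketch might look like the following; the schema, field names, and value ranges are invented for illustration:

```python
import random
import string
from datetime import date, timedelta

def fake_row():
    """Generate one synthetic record that matches the agreed schema but contains no real data."""
    return {
        "user_id": "".join(random.choices(string.ascii_lowercase + string.digits, k=12)),
        "age": random.randint(18, 90),
        "zip_code": f"{random.randint(0, 99999):05d}",
        "signup_date": (date(2020, 1, 1) + timedelta(days=random.randint(0, 1000))).isoformat(),
    }

# An analyst can write and debug exploration or feature engineering code against
# this sample, then run the same scripts in the secure mode on the real data.
for row in (fake_row() for _ in range(3)):
    print(row)
```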
[00:41:03] Unknown:
That brings it into the space of managing the preproduction versus production, or CI/CD, aspects of it as well, where, you know, okay, in our nonproduction environment you can do this exploration, you can iterate and build your model, or you can do some exploration of the data to build your analysis. And then once you say, okay, this is ready to go to production, that goes to a different operating environment that has the fully secured runtime enabled, so that you don't have any capability of exploring the data, and then being able to give the parties engaged the controls to say, you know, this is the data for this stage, this is the data for that stage. But then that also brings into question whether or not the data that they're using in the preproduction stage is truly representative of the actual live data, which is a broader question in machine learning and analytics in general. So
[00:41:53] Unknown:
Yes. I agree. I mean, these are all operational challenges that one does need to look at. With regard to the first part, the preproduction versus production environments, you don't necessarily need to use different environments. You could make use of the same platform and the same environment with just the enclave protections turned off, probably. You don't even need to turn the enclave protections off, to be honest, but you may want to turn them off because maybe you get better performance as a result. But you could use the same environment, while just from a user interface perspective I am saying that this is now preproduction.
And therefore, whatever data is shared, there is a way for you to visualize it or see it. I can show it to you through that same environment without you actually having to work in a separate environment. It's just that once you turn on the production switch, then that capability for me to look at your data goes away, and the confidential computing environment enforces that: if you try to look at my data, you will only see encrypted data, or you will not be able to see anything at all. So that aspect of it does not need to be a distinction between the environments for preproduction and production. What was the second part of your question?
[00:43:06] Unknown:
Challenges of the kind of testing dataset being truly representative of the live dataset.
[00:43:12] Unknown:
That is a challenge with synthetic data, or with any fake data that you create. That challenge does not entirely go away. To be honest, that challenge mainly exists if you and I have datasets with non overlapping fields. If you have certain fields that I don't have in my dataset, then for me to do any kind of training, I need to be able to know or see what those fields are, and that's when the problem manifests. But if the dataset is horizontally partitioned, for example, and the schemas are the same, then maybe I could train the model on my dataset and then refine it iteratively once it's combined with your training sets as well. So depending on the orientation of the data, depending on who holds what pieces of data, the problem may or may not manifest.
Another option is to use an orchestrator in some settings. For example, in some deployments, banks who want to work with each other find it operationally much easier to work with a single entity who acts as an orchestrator or facilitator of the consortium. That entity provides the data science and data analytics expertise, and the data owners, which are the banks in this case, can bring their own data to it. So you may have looser restrictions around what the data orchestrator can do or see as far as data is concerned, as opposed to other members of the consortium or collaboration. So there are a few ways in which this can be architected. This problem is something that customers or adopters need to be aware of as far as machine learning and AI go. It's less of a concern as far as analytics is concerned, in my experience, because as long as you know the schema of the dataset, that in and of itself is often sufficient for you to get the kinds of insights that you want. For example, you may want to do some aggregate analytics to identify how many users we have in common and what their purchase patterns are or what their demographic distribution is. And for this, you don't necessarily need to look at the data itself. You only need to be aware of the schema of the data, and at that point, it's sufficient. So it's not so much of a concern as far as analytics and query processing go. But, yes, absolutely.
Around machine learning and AI, this is a broader challenge.
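To illustrate the point that this kind of analytics can be written from the schema alone, here is a minimal sketch assuming two tables that share a user_id column; plain pandas is used purely for illustration, since in an enclave deployment the rows would stay encrypted and the analyst would never inspect them:

```python
# Minimal sketch: an aggregate written knowing only the schema
# (user_id, spend) and (user_id, age_band) of two parties' tables.
# The query itself never requires the analyst to look at raw rows.
import pandas as pd

bank_a = pd.DataFrame({"user_id": [1, 2, 3], "spend": [100, 250, 80]})
bank_b = pd.DataFrame({"user_id": [2, 3, 4], "age_band": ["25-34", "35-44", "25-34"]})

common = bank_a.merge(bank_b, on="user_id")                      # users both parties hold
summary = common.groupby("age_band")["spend"].agg(["count", "mean"])
print(summary)
```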
[00:45:32] Unknown:
Yeah. There are a number of other avenues that I think would be fun to explore more deeply, one being the validity of schemas matching but having different underlying semantics of what that data means. That's a whole other problem that is kind of outside of your control or concern, but interesting to explore nonetheless when you are engaging with multiple parties bringing their own data.
[00:45:54] Unknown:
One simple example here: in one case, one particular data owner had ZIP codes that were 5 digits, and the other data owner had ZIP codes that were 5 digits followed by a hyphen followed by something else. Exactly. Those are all problems that do crop up. So beyond just knowing the schema, you also need to know the format. You're absolutely right.
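A minimal sketch of the kind of format reconciliation being described, assuming a column named zip (the column name and values are invented for illustration):

```python
# Minimal sketch: reconciling two ZIP code formats (e.g. "94720"
# vs "94720-1234") so datasets from different owners join cleanly.
import pandas as pd

def normalize_zip(value: str) -> str:
    """Keep only the 5-digit prefix and restore any dropped leading zeros."""
    return str(value).split("-")[0].zfill(5)

df = pd.DataFrame({"zip": ["94720", "94720-1234", "2139"]})
df["zip"] = df["zip"].map(normalize_zip)
print(df)  # 94720, 94720, 02139
```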
[00:46:16] Unknown:
Yes. Or if you have an integer field, but those integers are just categorical, not actually pure integers, and, you know, what does that integer map to for a given category? So there are lots of ways it can go weird. Another interesting aspect is what you were saying earlier from a policy enforcement perspective: okay, I need you to provide me with the script or the code that you want to execute against my data. Then there's the question of what level of complexity is acceptable, because I don't necessarily have time to read through 10,000 lines of code to make sure that it's doing what I think it's going to be doing. And so then you're into static analysis and validation and whole other categories of problems. And then another aspect, from a performance and data modeling and tuning perspective: because you're operating in these secure enclaves, you want to try to optimize for chunks of data that fit within the CPU cache so they don't have to get shuffled out to memory too many times. So I'm wondering what are some of the ways that you're able to help people with managing that data segmentation question, where up to a certain point everything is able to fit into the cache on the die, but as soon as a record goes from, say, 52 to 58 characters, all of a sudden we have to swap to memory every time we want to try to match these two values together.
[00:47:39] Unknown:
That's a deeper architectural question, to be honest. The approach that we have taken on that particular front is to abstract that away from the user, so the user does not have to worry about data segmentation and data sizes. We have architected the platform, for Spark for example, in a way that intuitively streams the data structures through the CPU, and the algorithms have been developed in a way that is best suited to this model of computation. But the user should not have to worry about it. Who knows? Maybe in the future we'll need to expose this capability to the user as well. We'll see. This problem does go away with confidential VMs to some degree, because all of memory is available to you, so the model of computation you have is the same as a regular vanilla workload.
But, yeah, these are all deeper architectural questions that we did have to grapple with in order to design a platform that is as efficient and as fast as possible. Because if it's not fast, then people are not going to get value out of it.
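As a generic illustration of the streaming model of computation being described, where data is processed in bounded-size chunks so the working set stays small, here is a minimal sketch of the general pattern; it is not Opaque's internal implementation, and the chunk size is an arbitrary assumption:

```python
# Minimal sketch: compute a running aggregate one bounded-size chunk
# at a time, so no step requires the whole dataset to be resident
# (the pattern a cache- or enclave-conscious engine might follow).
from typing import Iterable, Iterator, List

CHUNK_ROWS = 1024  # assumed bound on how many rows are resident at once

def chunks(rows: Iterable[int], size: int = CHUNK_ROWS) -> Iterator[List[int]]:
    buf: List[int] = []
    for row in rows:
        buf.append(row)
        if len(buf) == size:
            yield buf
            buf = []
    if buf:
        yield buf

total = 0
for chunk in chunks(range(10_000)):
    total += sum(chunk)   # each chunk is processed and then discarded
print(total)
```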
[00:48:41] Unknown:
Another topic that's always fun to explore with companies that are building on top of an existing open source project is how you're managing the balance of effort and engineering that goes into the open source versus the commercial aspect, or whether the open source was just an initial proof of concept that was then effectively abandoned in favor of a purely commercial product based on those underlying principles, and just some of the ways that you're thinking about the governance and sustainability aspect of MC 2 compared with the work that goes into Opaque Systems.
[00:49:15] Unknown:
To be honest, since launching Opaque we have been focused on getting the product and tech off the ground and packaging MC 2 into something that's enterprise ready and has all the capabilities that enterprises need. There is a major push coming to the open source in January, and we are now at a size where we have the resources available to us to maintain and grow the community going forward. So we completely intend to keep making progress and giving back to the community around open source, because anything that drives adoption and helps the community adopt confidential computing is a win for the entire space, given that it's a young, emerging category. So open source is key for us, and that's something that we will do more and more of going forward. It does bring up the question of how the closed source is different from the open source. What do you keep in the closed source? How do you think about that? Honestly, that is an evolving question, and it is something that we have passionate debates about internally.
The fundamental principle that we try to espouse is that anything that helps with adoption should be part of the open source, while capabilities that are necessary for enterprises who want support are part of the closed source product. For example, disaster recovery, high availability, and ecosystem integrations, for example into key management solutions or data sources, things like that: our current thinking is to retain them in the closed source. Anything that is required for enterprises who want support and who don't want to deploy the open source themselves is something we would keep in the closed source. But anything that helps with adoption, or anything whose absence would hinder adoption, should not be closed source and should be part of the open source.
Operationally, it would be nice if we could draw a clean line that does not allow the two code bases to diverge too much, so that it becomes much easier to maintain the open source as well, because we do want streamlined processes for pushing from the closed source to the open source and vice versa. If the code bases diverge, that becomes an operational nightmare. How successful we'll be on that front remains to be seen, but at least as far as Opaque versus MC 2 is concerned, we do have dedicated resources to maintain that two-way communication between the open and closed source. But, yeah, that's an evolving maintenance question. One decision this impacted was that instead of having multiple repositories, we combined everything into one giant repo, because that makes it much easier to maintain, not only internally but from the perspective of the open source as well. Because if you have multiple repositories all speaking to each other, then this problem gets compounded significantly.
I think we do have a blog post about that as well: our journey to a monorepo and what impact that had.
[00:52:11] Unknown:
And so in terms of your experience of building the Opaque platform and working with your customers and early adopters, what are some of the most interesting or innovative or unexpected ways that you've seen it applied?
[00:52:23] Unknown:
My favorite, to be honest, is the human trafficking use case. We're working on human trafficking with a group of banks, and money laundering, for example, is part of the same financial crime detection space. The problem is that to identify perpetrators of financial crime right now, the best that each bank can do is look at its individual transaction data and analyze it for patterns. Once there is a suspicious transaction, it gets flagged and an analyst files a suspicious transaction report. But any analysis they're doing as a result has very high false positives, because they're limited in their view of the movement of funds; criminals hide their traces across multiple banks.
So in order to detect financial crime more effectively, banks need to be able to collaborate with each other. Banks need to be able to share data with each other. Each bank has its own models and its own analytics processes for flagging suspicious transactions, but there is no effective sharing of intelligence taking place. With Opaque, this problem becomes tractable, because each bank can keep its data encrypted, pool it together, and still jointly run analytics on the collective data to identify cycles, for example, in the movement of funds, or train more robust models on the collective data. This particular problem is something I'm deeply passionate about because, of course, organizations like ourselves exist to ultimately drive revenue and make money, but if you can help the world in any small way while doing it, then that's a good thing. And it's also part of the reason why we were researchers first and then became a company.
Another cool example, and we didn't work on this particular one, but sometimes I like to wonder if we could have made a dent in the problem, is around COVID. One thing we learned during COVID is that contact tracing could have helped solve the problem to a large extent. But the moment we tried to deploy contact tracing solutions, we realized we needed to combine data from various patient repositories and patient silos, and the moment we tried to do that, all these patient confidentiality concerns came to the forefront, thwarting many of those efforts to a large extent.
Not to get ahead of myself, but had Opaque been around, then maybe we could have helped address that problem: preserving patient confidentiality while also helping with contact tracing efforts. We are working with healthcare institutions on better patient profiling, enabling healthcare data owners to combine their traditional electronic health record data with consumer data sources to identify patient behaviors and patient profiles so that patients can be more accurately diagnosed, and those results can be used to train better models for predicting diseases, patient behavior, and so forth. These are a few examples that I find very exciting. There are other examples as well in the ad tech space. Cookies are going away; the industry is terming that the cookie apocalypse.
And cookies have been the way, so far, for publishers and advertisers to identify common audiences. For example, if I am Nike and I want to advertise on CNN, I want to know whether CNN's audience sufficiently overlaps with my target base. How do I do that? I need to be able to combine my data with yours, intersect my data with yours, to identify whether we have common audiences, what the segmentation of that audience is, and so forth. Can we do that using alternate forms of information like account IDs or email addresses and IP addresses, without actually having to divulge that information to the other stakeholders? This opens up a world of possibilities as well. So those are just a few examples of use cases that I find fairly interesting and exciting.
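As a rough sketch of the audience-overlap idea, here is a minimal example that matches on salted hashes of email addresses rather than the raw values; the salt handling and identifiers are simplified assumptions for illustration, not how Opaque actually performs the intersection:

```python
# Minimal sketch: compare audiences via salted SHA-256 of lowercased
# email addresses instead of raw PII, then count the overlap.
import hashlib

SHARED_SALT = b"agreed-between-the-parties"  # assumption for illustration

def pseudonymize(email: str) -> str:
    return hashlib.sha256(SHARED_SALT + email.strip().lower().encode()).hexdigest()

advertiser = {pseudonymize(e) for e in ["ann@example.com", "bob@example.com"]}
publisher = {pseudonymize(e) for e in ["bob@example.com", "carol@example.com"]}

overlap = advertiser & publisher
print(f"common audience size: {len(overlap)}")
```

Hashing alone is not a complete privacy solution, since common identifiers can be guessed and re-hashed; in the scenario described in the conversation, the matching step would run inside the protected environment so that neither party's identifier list is exposed.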
[00:56:16] Unknown:
Absolutely. The advertising one is interesting too, because companies have gotten used to the web-oriented world of advertising where you do have all of this rich information and customer profiles that you can build up, which obviously has some privacy issues to go along with it, but there are challenges with how that maps to other distribution mechanisms. Podcasts, in particular, come to mind because I've been running podcasts for a number of years, and the medium doesn't have those same attribution capabilities because it is effectively an anonymous distribution channel, unless you're using something like a Spotify or another platform where you own the entire experience and can start collecting some of that other information. So it'll be interesting to see how companies adapt to this world of not having that very rich and detailed visibility into individual customer profiles.
[00:57:07] Unknown:
Yes. Couldn't agree more with you.
[00:57:10] Unknown:
In your experience of doing the research and building the MC 2 project and now turning that into Opaque Systems, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:57:23] Unknown:
One thing that comes to mind is that there is a gap between what technologists see as solutions for maintaining privacy and confidentiality and what regulators see as solutions for those guarantees. For example, an open area of judicial interpretation is whether encryption is a sufficient mechanism for de-identification of data. Technologically, yes, because unless you have the key, it provides much stronger security properties than traditional anonymization or pseudonymization techniques. But in some sense, it is reversible if you have the key.
Whereas other techniques may not be reversible in that way, which matters for the regulatory interpretation. So there is a gap between the technologist's worldview and the regulator's worldview. And as an emerging category, I think as an industry we need to do more around educating the market, customers, adopters, including regulators, about the promise of privacy-enhancing technologies like confidential computing. We need to do more around education, and we need to keep doing more of it, because people often don't know what is possible as a result of this technology. It's not like we're going to companies and telling them, you've been solving this problem using solution A, come use Opaque or adopt a privacy-enhancing technology because we can solve the problem better. No, we're often telling them that things that you have not been able to do so far, things that you did not know were possible, become achievable as a result of this technology. So you not only get stronger security, you at once get higher utility from your data as well. So it's really a win-win.
But this requires more education, more dissemination of knowledge, and building more awareness in the market, and we intend to keep doing that. Forums like your podcast are one such way that we hope to educate the audience about the promise of confidential computing and the things that you can do with it. But that's a continuous work in progress and something we intend to keep evolving.
[00:59:34] Unknown:
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it's often too late and the damage is done. DataFold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. DataFold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying. You can now know exactly what will change in your database.
DataFold integrates with all major data warehouses as well as frameworks such as Airflow and dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with DataFold.
[01:00:29] Unknown:
And so for people who are looking for ways to perform computation on data that has issues of sensitivity or regulation, or for organizations that are looking for avenues to open up collaboration either across departments within the business or between businesses, what are the cases where Opaque Systems is the wrong choice?
[01:00:52] Unknown:
In some cases, I would say, you may not need something like this. In some intra-company collaboration scenarios, it may be sufficient for you to have a platform that only enforces governance, and that may suffice for your workloads. In that case, I would not say Opaque is a wrong choice, because it still significantly improves your security posture and still mitigates the threat of bad actors or insiders getting access to unauthorized data. It may just be a solution you don't strictly need in some respects. So I wouldn't say it's the wrong choice in those scenarios; it provides much stronger security guarantees, but ultimately security is often a question of economics, right? What are you giving away for higher security? Maybe you're losing something in terms of performance or the kinds of capabilities the platform affords you. Those are questions that organizations need to think about. For cross-organization scenarios, you absolutely do need something like this, because we really need to move away from institutional trust to programmatic trust. You shouldn't have to trust a third party through a piece of paper to be a good custodian of your data; you should get that assurance, that technological assurance, from the platform itself. So in multi-organization scenarios, I would say it's an absolute must, because attackers are getting more and more sophisticated, and it allows you to increase your security posture overall. But in some intra-organization scenarios you may not need as strong guarantees, especially if you have an environment that's hosted on-prem.
In which case, traditional mechanisms of governance may suffice. But if you're moving to the cloud or moving data outside your premises, then absolutely I would. It does not need to be Opaque, but you do need to think about incorporating privacy-enhancing technologies into your suite.
[01:02:48] Unknown:
And as you continue to build out and evolve the Opaque platform, what are some of the things you have planned for the near to medium term, or any particular projects or problem areas that you're excited to explore?
[01:03:00] Unknown:
You hit upon it during our conversation about AI and machine learning. There is more that we need to do around making it super frictionless. Better data exploration. How do you enable policies that are not super complex and don't require me to vet each and every line of code, and worry about what happens if I miss something there? So as our AI and ML offering evolves and grows, and there is more to come on that front, these are some capabilities that I'm excited to add to the mix as well. The ultimate aim is to make it frictionless for data scientists and data analysts. It should be as simple for you to use confidential data as it is right now to work with regular vanilla data. And, yeah, a key part of that puzzle is making it simple in the data exploration and analysis phase, and also making it simple for policies to be enforced in keeping with your existing pipelines.
[01:03:54] Unknown:
Are there any other aspects of the work that you're doing at Opaque Systems or the question of confidential computing and its application to analytics and machine learning that we didn't discuss yet that you'd like to cover before we close out the show?
[01:04:07] Unknown:
I think we touched upon a lot of it. The one key thing is that it is now possible; Opaque makes it possible for you to collaborate on confidential data. Our focus is on analytics and machine learning workloads. But ultimately, I think the vision that is shared by my colleagues in the industry is that we need to be in a world where encryption or protection of data in use is the default, the way encryption at rest and encryption in transit have been standardized. The next frontier, really, and this is the third leg of the data protection stool, the third stage of the protection life cycle to me, is enabling encryption or protection for data in use as a default. We've made a lot of progress towards achieving that vision, but I'm looking forward to the day where we actually achieve it as a whole, where everything is always protected by default and you don't have to worry about data being exposed at any point in the life cycle. But for now, analytics and AI remain the key focus for us.
[01:05:09] Unknown:
Well, for anybody who wants to get in touch with you and follow along with the work that you and your team are up to, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[01:05:24] Unknown:
The biggest gap, Tobias, has to be around the lack of protection for data in use. That is the fundamental problem. Data attacks and data breaches have been growing exponentially over the years, and attackers are getting more and more sophisticated. A lot of the attacks rely on misconfigurations or simply on data not being kept protected in a proper way when it's being stored somewhere. But as organizations get more sophisticated, the threat vector will naturally become more pronounced around stealing data or getting access to data while it's being used. So that is the key gap. And that's not only in terms of the existing security postures of organizations, but also around enabling new possibilities and new use cases, because protection for data in use is a key requirement for you to be able to collaborate on confidential data with other entities. Without protection for data in use, you can't collaborate effectively without giving some third party access to your data. Confidential computing is a technology that fulfills or bridges that gap and provides that missing level of protection. Of course, from a holistic perspective, you still need other capabilities: things like policy enforcement, remote attestation, verifiability, and auditability.
But the key to all of this is the ability to protect data in use. We will see more and more of that as the technology evolves and matures, and as the market becomes more aware and educated about the availability of these technologies.
[01:06:58] Unknown:
Alright. Well, thank you very much for taking the time today to join me and share the work that you've been doing on the MC 2 project and how you're building on top of that with Opaque Systems. It's definitely a very interesting and fascinating area to discuss and explore. It's great to see you and your team working on making this a more tractable problem so that people are able to build better and more secure data systems and analytics, and unlocking the collaborative potential for intra- and inter-organizational analysis. So thank you again for the time that you and your team are putting into that, and I hope you enjoy the rest of your day. Thank you so much, Tobias. Thanks again for having me. It's been a pleasure. Super excited about what's happening.
[01:07:44] Unknown:
Thank you for listening. Don't forget to check out our other shows, Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com. Subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a product from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction to Rishabh Poddar and Opaque Systems
Challenges in Data Protection
Opaque Systems Platform Overview
Comparison with Homomorphic Encryption
Core Problems Addressed by Secure Enclaves
Policy Enforcement and Governance
Integration with Existing Data Platforms
Design and Architecture of Opaque Systems
Challenges in Data Collaboration
Open Source vs. Commercial Product
Interesting Use Cases
Future Plans and Enhancements