Summary
There are myriad reasons why data should be protected, and just as many ways to enforce that protection in transit or at rest. Unfortunately, there is still a weak point where attackers can gain access to your unencrypted information. In this episode Ellison Anne Williams, CEO of Enveil, describes how her company uses homomorphic encryption to ensure that your analytical queries can be executed without ever having to decrypt your data.
Preamble
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
- Your host is Tobias Macey and today I’m interviewing Ellison Anne Williams about Enveil, a pioneering data security company protecting Data in Use
Interview
- Introduction
- How did you get involved in the area of data security?
- Can you start by explaining what your mission is with Enveil and how the company got started?
- One of the core aspects of your platform is the principle of homomorphic encryption. Can you explain what that is and how you are using it?
- What are some of the challenges associated with scaling homomorphic encryption?
- What are some difficulties associated with working on encrypted data sets?
- Can you describe the underlying architecture for your data platform?
- How has that architecture evolved from when you first began building it?
- What are some use cases that are unlocked by having a fully encrypted data platform?
- For someone using the Enveil platform, what does their workflow look like?
- A major reason for never decrypting data is to protect it from attackers and unauthorized access. What are some of the remaining attack vectors?
- What are some aspects of the data being protected that still require additional consideration to prevent leaking information? (e.g. identifying individuals based on geographic data, or purchase patterns)
- What do you have planned for the future of Enveil?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data security today?
Links
- Enveil
- NSA
- GDPR
- Intellectual Property
- Zero Trust
- Homomorphic Encryption
- Ciphertext
- Hadoop
- PII (Personally Identifiable Information)
- TLS (Transport Layer Security)
- Spark
- Elasticsearch
- Side-channel attacks
- Spectre and Meltdown
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline, you'll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40 gigabit network, all controlled by a brand new API, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And go to the dataengineeringpodcast.com
[00:00:41] Unknown:
website to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Your host is Tobias Macey, and today I'm interviewing Ellison Anne Williams about Enveil, a pioneering data security company protecting data in use. So could you start by introducing yourself? Yeah. So Ellison Anne Williams, I'm the CEO and founder of a company called Enveil. And do you remember how you first got involved in the area of data security? Yeah. Absolutely. So I originally started as a mathematician.
[00:01:08] Unknown:
So a pure mathematician, went through and got a PhD. At that time, when I was completing the PhD, the job possibilities were somewhat limited for a pure mathematician. So you could take a postdoc, which I had several opportunities to do. You could go into a research lab, or you could go work for the National Security Agency, which is the largest employer of mathematicians in the world. So I decided to go into NSA, and, of course, at that point, got very involved in security in general, data security in particular.
[00:01:39] Unknown:
And in your time at the NSA, you ended up doing some work on protecting data in use, which you then ended up commercializing in Enveil. So can you discuss a bit about your mission with Enveil
[00:01:54] Unknown:
and how the company itself got started? So Enveil secures data when it's being used or processed exclusively, as opposed to focusing on the other two more traditional areas of the data security triad, which are securing data at rest on the file system, so that's your standard file based encryption, or securing it in transit when it's moving through the network, so that encompasses all of your transport security. So those two areas of the data security triad are what you most often see productized or solutioned across the commercial space.
So the last gap in data security, when it's being used or processed, is where we focus completely. So if you think about it, the way people most often meaningfully use or process data is by running a search or an analytic over it. So we're concerned with the security posture of that search or that analytic as it's being performed. So what does that mean? It means that we can take a search or an analytic that folks want to perform over data and we can encrypt that. And then we can run that encrypted search or analytic, which could contain all kinds of sensitive indicators around things like PII, PHI, EU resident identifiers under GDPR, IP, competitive advantage types of information, and we can take that encrypted search or analytic and run it over massive amounts of data without ever decrypting anything.
So we never decrypt that search or analytic itself. And if that underlying data happens to be encrypted, we don't have to decrypt it either. So we're able to really achieve a never decrypt security posture for the processing of data and really close that last gap in data security, which is securing it when it's being used or processed.
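To make the idea of an encrypted search over plain text data concrete, here is a minimal Python sketch. It is not Enveil's method, which the conversation does not detail. It swaps in the textbook Paillier cryptosystem, an additively homomorphic scheme, with small fixed primes that offer no real security. The category list, the record layout, and the function names (`build_encrypted_query`, `run_query_on_server`) are hypothetical and chosen only for illustration. The client encrypts a one-hot selector, the server combines it with unencrypted records without learning which category was selected, and only the client can decrypt the total.

```python
import math
import secrets

# Toy Paillier keys built from two known Mersenne primes.
# Far too small for real use; illustration only.
P, Q = 2**31 - 1, 2**61 - 1
N = P * Q                 # public modulus
N2 = N * N
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)  # private: lcm(P-1, Q-1)
MU = pow(LAM, -1, N)      # private: inverse of LAM mod N (generator is N + 1)


def encrypt(m):
    """Encrypt an integer 0 <= m < N using only the public modulus N."""
    while True:
        r = secrets.randbelow(N - 1) + 1
        if math.gcd(r, N) == 1:
            break
    return ((1 + m * N) * pow(r, N, N2)) % N2


def decrypt(c):
    """Decrypt with the private values LAM and MU."""
    u = pow(c, LAM, N2)
    return ((u - 1) // N * MU) % N


# Hypothetical categories that both client and server know.
CATEGORIES = ["DE", "FR", "US", "JP"]


def build_encrypted_query(wanted):
    """Client side: one ciphertext per category, 1 for the wanted one, else 0."""
    return [encrypt(1 if c == wanted else 0) for c in CATEGORIES]


def run_query_on_server(enc_query, records):
    """Server side: total the amounts of matching records without decrypting.

    Multiplying ciphertexts adds their plaintexts, and raising a ciphertext
    to a power multiplies its plaintext, so the result encrypts
    sum(selector[category] * amount) over all records.
    """
    acc = 1  # a valid (unrandomized) encryption of 0
    for category, amount in records:
        selector = enc_query[CATEGORIES.index(category)]
        acc = (acc * pow(selector, amount, N2)) % N2
    return acc


# Unencrypted records held by the server: (category, amount).
RECORDS = [("DE", 120), ("US", 75), ("DE", 30), ("JP", 500), ("US", 25)]

encrypted_result = run_query_on_server(build_encrypted_query("US"), RECORDS)
print(decrypt(encrypted_result))  # prints 100
```

The server only ever handles the ciphertexts in `enc_query` and the ciphertext it returns, so in this sketch it cannot tell which category the client asked about. A production system would need properly sized keys and a vetted cryptographic library.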
[00:03:41] Unknown:
And so when the query or analytical request is sent to the back end for running against the encrypted data, you say that that process gets encrypted. So I'm assuming, thinking about it in terms of how the computation occurs, that the process that's executing that search or the analysis against the data at some level has a clear text view of what that information is, but that it never leaves the server
[00:04:14] Unknown:
or the sort of memory space where it's being accessed. Is that an accurate sort of way to think about that, or am I completely off base on that? So we are far more secure and powerful than that. So our security assumptions of that processing environment, in other words, where the data is that's being used or processed, are completely zero. So we assume that it's completely untrusted, completely had. There's an attacker or someone with malicious intent able to see every single bit of operation in memory and on disk in that environment. So because our security assumptions are zero of that data space and of that processing environment, we never decrypt anything during processing.
So when we encrypt that search and go run it over data, that data could be unencrypted data, so think in the case of utilizing a third party data service or data in a shared data lake, or it could be encrypted data. Either way, we never decrypt anything during the processing. So that math magic is powered by a special type of encryption that we leverage called homomorphic encryption, which makes all of that impossible
[00:05:24] Unknown:
really possible. And can you discuss a bit further what the principles are that are involved in homomorphic encryption that allow for being able to process the underlying encrypted information without exposing the actual individual records to any of the processes involved along the way? So homomorphic encryption
[00:05:46] Unknown:
is not a new area of encryption. It's been around 30 or so years. If people have heard of it, they've often heard a couple of things about it. So the first is that it's often considered to be the holy grail of crypto because it allows you to perform operations on encrypted data, or more importantly from a technical perspective, in ciphertext space, as if it's unencrypted or plain text. That's that holy grail aspect of it. The second thing that people have heard about this type of encryption, if they've heard anything, is that over its lifespan for the last 30 or so years, it's been very computationally intensive. In other words, slow. So possible, just not practical. That's the crux of the major breakthrough that we had in this arena, moving that kind of technology from the realm of the computationally intensive and extremely slow into the practical and commercially viable.
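The property of operating in ciphertext space can be seen in a few lines of Python. This sketch uses textbook (unpadded) RSA, which is well known to be homomorphic for multiplication only. It is an illustrative stand-in, not the scheme Enveil uses, and the tiny primes and the helper name `multiply_encrypted` are assumptions made for the example.

```python
# Textbook (unpadded) RSA with tiny primes: insecure, for illustration only.
P, Q = 61, 53
N = P * Q                           # public modulus
E = 17                              # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))   # private exponent


def encrypt(m):
    """Encrypt an integer 0 <= m < N with the public key (E, N)."""
    return pow(m, E, N)


def decrypt(c):
    """Decrypt with the private exponent D."""
    return pow(c, D, N)


def multiply_encrypted(c1, c2):
    """Multiply two plaintexts by working only on their ciphertexts."""
    return (c1 * c2) % N


a, b = 6, 7
product_ciphertext = multiply_encrypted(encrypt(a), encrypt(b))

# Whoever computed product_ciphertext never saw a, b, or their product.
print(decrypt(product_ciphertext))  # prints 42
```

The result is only correct while the true product stays below `N`. Schemes that support both addition and multiplication on ciphertexts are considerably more involved than this.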
[00:06:39] Unknown:
And for people who want to learn more about some of the principles involved in homomorphic encryption, I'll ask you to add some links to the show notes that they can follow up with. And in terms of being able to scale the capabilities of running workloads against this encrypted data using homomorphic encryption, I'm wondering what some of the challenges therein happen to be in your experience.
[00:07:05] Unknown:
So our applications of homomorphic encryption are extremely scalable. That's where our breakthrough came in. You know, we had developed the core of our technology within the US government, the US intelligence community, and then were able to bring that out, greatly expand and mature it, and commercialize it into the products that Enveil provides today. So in terms of scalability, we are extremely flexible. So we are able to process data in a completely encrypted capacity over very small form factors. We're able to do so in highly distributed environments such as Hadoop architectures.
So the crux of that is that the algorithms that really power our capabilities are linearly scalable. They can scale vertically. They can scale horizontally, which, like I mentioned, makes us extremely scalable and flexible to leverage all kinds of compute platforms.
[00:07:56] Unknown:
And for people who are interacting with your platform, is there any special tooling or capabilities necessary in the way that they're interacting with the encrypted data, or are they able to just use common off the shelf tools that they might already be using to run these analytical workloads against the data that you're protecting on your platform? Yeah. So we're designed to be minimally intrusive
[00:08:23] Unknown:
plug and play in the environment. So we really function as a proxy layer to secure the usage of that data. That means that from a user experience perspective, it's transparent to that end user that what's really happening to those operations that they're performing over the data or the usage of data is that it's occurring in a completely encrypted fashion. So part of being a proxy layer means that we're completely API based. So for example, we don't build a user interface. So we're designed to plug into the back end of the existing user interfaces or application workflows in the environment. Being API based, we implement most of the, you know, common APIs. After that, we're pretty opportunistic.
API dev is easy. We put in place whatever we need to to be as plug and play as possible into the system. We also don't require people to change the way that they store their technology, the format of it, their ingest or ETL processes in any way, shape, or form. So we sit above the data storage layers in the system. We sit above the compute layers in the system, in order to perform this encrypted processing over data. And so for somebody who's starting to use your product, is it something where they would have a license and then deploy your technology into their environment, and it serves as sort of a proxy layer over existing data storage, or would they ship data off to your platform to be encrypted and protected
[00:09:45] Unknown:
using your own infrastructure?
[00:09:47] Unknown:
So they would deploy us in their environment on top of the way that they currently store and manage their data today. So we focus exclusively on that usage piece, which means we are going to be complementary to those other two pillars that I talked about before, the at rest functionality of the data as well as the transport functionality of the data. So we are not going to ask people to change the way that they store their data, the way that they currently encrypt their data, or to trust us to productize that in any way, shape, or form. So they can continue encrypting their data the way that they do today and layer us on top to make sure that when they go use that data, it can be processed in an encrypted state. Also recall that a very powerful application for us is performing encrypted computation over unencrypted data, which has a whole different set of use cases associated with it. But in which case, that underlying data at rest is not encrypted, and we are streaming plain text into our algorithms that are processing that encrypted search over it. And so in practice,
[00:10:49] Unknown:
there is no information available from either direction, of either the person sending a request as far as what the underlying data might be or the person receiving a given request as to what the workload is actually trying to do, so that it allows for a zero trust interface between the party storing the data and the person trying to analyze it. In the process of researching for this podcast, I saw that you had mentioned that this unlocks some possibility of data brokerage where you can share data assets across organizations without having to worry about revealing things like PII or trade secrets, while still allowing other people to gain value from that data. So I'm wondering if you can talk a bit more about that
[00:11:37] Unknown:
and some of the other use cases that are unlocked by this sort of zero trust interface. Correct. So in the case of a data lake or a shared pool of data, what we enable is really securing private multi tenant usage of that data within that shared space. So for example, you could have a shared data lake within a public cloud environment. You could have multiple parties accessing it. In other words, running searches or analytics over it. And because those searches or analytics are being executed in a completely encrypted fashion, there's no risk of exposing what each individual party is doing in using that data in that environment.
So other types of use cases that open up for the Enveil capabilities are things like what we've called crown jewel data protection. So protecting the subset of most sensitive data within an organization, within their internal data lakes most often. We see cloud as a huge one, you know, beyond data lake, allowing people to really process their most sensitive workloads out in completely public cloud environments. So there's no shortage of folks that want to utilize the public cloud for the storage of their data. The problem comes in, what happens when you go process it there? So even if you do a fantastic job of encrypting the data at rest in that public cloud environment, and there are a variety of mechanisms to do that, it's not protected when you go use it. So, for example, you take your data, you encrypt it, you put it out in the cloud. Now you wanna run a search over it. So it's very easy for you to encrypt that search as it travels through the network with transport security, say, TLS, from your environment out into the public cloud space.
However, when it reaches that public cloud environment, the transport security terminates, and that TLS is gonna be stripped off. So that search that you worked so hard to protect in transit as it moved from your environment through the network out into the public cloud is now exposed. And, of course, that search itself could contain all kinds of sensitive indicators that would be risk inducing or liability inducing for that organization. In order to process that search that's now exposed in plain text in that environment, the data that's sitting in the file system, perhaps encrypted at rest, must be decrypted into memory. So at this point, everything that you've worked so hard to protect at rest on the file system and in transit as it moves through the network is now completely exposed in the memory or processing layer out in the public cloud. So that becomes the point of least resistance for an attacker. So if I'm an attacker, I'm not gonna try to steal your data at rest as it's encrypted off of the file system because then I've gotta break the crypto, and that's hard to do. I'm also not gonna try to steal the data out of your search or analytic process as it's moving through the network because it's protected under transport security encryption. I'd, again, have to break the encryption, and that's hard to do. Instead, I'm gonna sit and wait out in that public cloud platform until you go process that data, and then everything is freely available to me in memory, and all I have to do is take it from that environment. So we completely flip that on its head because we can keep both the data encrypted in that public cloud environment as it's being processed as well as the search itself. So if the attacker were to steal the contents of memory using Enveil, they only obtain encrypted information. Therefore, everything is completely protected as it's processed out in the cloud.
That opens the door for organizations to now move these very sensitive workloads out to the public cloud for processing. So cloud is a big one for us. We also see a lot of implications around compliance and regulations, specifically with GDPR, around use cases such as third party risk, as well as insider threat. And so when somebody deploys
[00:15:24] Unknown:
your data use protection layer into their environment, so say, for example, that they're working on a Hadoop system, is it an additional binary that gets installed on the same set of servers that the Hadoop processes are running on so that it can be invoked as part of the layer that's loading the data from disk, or is it deployed as a standalone server and relies on things like TLS for encrypting the information as it travels from the Hadoop cluster to the processing layer and then also receiving the data from the TLS encrypted network call to that processing layer? So in the data environment,
[00:16:03] Unknown:
we deploy our Enveil server application. It's simply a lightweight app in a containerized setting. You could think about it like a Docker container. It can deploy outside of Docker natively on the OS. So it simply functions as a brain in that space to understand how to take those encrypted searches, for example, and process them over the data to which it's been granted access without ever decrypting anything. It sits above the data storage technology layer as well as the compute, so it doesn't store or process any data itself. It's going to be configured to understand various storage technologies and mechanisms in that environment as well as to leverage environmental compute, which could be anything. So in the case that you have a Hadoop architecture and you would like to leverage that to perform the compute, then that Hadoop architecture is often configured to understand how to read various data sources out of different kinds of storage technologies in the system. So you could have, for example, Spark understanding how to read data out of Elasticsearch. So in that situation, we're going to then kick off, say, a Spark job in the system if that's what's available to us, and that Spark job will understand how to read that data that's appropriate for the search out of that storage technology, and then we can process it from there. So we're going to sit on top of the way that the compute and the storage technology currently interact with each other and leverage it as it exists in the system. And as you mentioned,
[00:17:34] Unknown:
the primary attack vector right now for people who are trying to intercept data in a system is to wait for it to be decrypted in memory. And so by virtue of not having to do that when using Enveil, it eliminates that particular vector. So I'm wondering what are some of the remaining ways that people should be looking out for data compromise when they are using Enveil? Yep. So a multitude of attack vectors available for
[00:18:05] Unknown:
the usage of data and for really compromising data when it's being used or processed. The vulnerability really lies in the processing layer or in the memory component of the environment in which it's being processed. So all different kinds of attack vectors are possible there. Anything from, you know, RAM scraping, any kind of frequency analysis, side channel attack, you know, just breaking into the external fence of the organization. So any kind of perimeter fencing attacks if we broaden it out a little bit more, and this has become far more prevalent with all of the
[00:18:37] Unknown:
vulnerability releases around Spectre and Meltdown, for example. And one of the other cases where data can be inadvertently leaked, even if it's maybe potentially never revealed at the individual record layer, is by sort of implicitly deriving information from the aggregate records. So an example might be having geolocation data, being able to then reconstruct the locations that an individual is going to, and then from that, extrapolate who that person might be, or in health records, being able to maybe see the frequency of visits and then correlate that with somebody's geolocation data from another source. So I'm curious, for people who are using Enveil to protect the actual queries and the data, what are some of the additional considerations that they should have to help mitigate information leaking?
[00:19:38] Unknown:
So for us, you know, because we are encrypting that search or that query as you described, there's a whole host of sensitive information contained in the search itself from an interest and intention perspective. For example, someone could research a potential trade before executing it without ever tipping their hand. Yeah. That's a very simple example of interest and intention that could be revealed in the content of a search, and there are a whole host of other examples along the lines of what you mentioned with different geographic identifiers, etcetera. So because that search is encrypted, that identifier is never revealed outside of the environment of the origination point. Meaning, you type in that search, it gets encrypted within your walls, and then it's sent out to the third parties to process. So because it's never decrypted in that environment, there's no risk of exposure of that identifier outside of that space.
So, really, it completely removes the attack and threat vectors around processing and utilizing third party data outside of your walls and puts that back internal to the organization with standard security postures.
[00:20:45] Unknown:
And so one of the areas that people often run into is making businesses aware of how important it is to have some of these various security postures versus the amount of effort required to enact them. So I'm wondering what you have found to be some of the common misunderstandings or issues that you've had to overcome in the process of raising awareness of the work that you're doing at Enveil or onboarding customers onto the platform? So security is not an all or nothing proposition.
[00:21:24] Unknown:
So we really make sure that people understand that the different elements of data security, at rest, in transit, and in use, should be complementary to one another, not all encompassing. So there should be a diffusion of trust there. So for example, we leverage the enterprise's trusted KMS or key management system, not only for key storage, but also for key generation purposes. We sit on top of whatever their trusted mechanisms are today for securing their data at rest, securing it in transit. So truly driving home that it should be a complementary approach to data security.
[00:22:01] Unknown:
And what are some of the plans that you have for the future of Enveil, whether in your current products or any products that you are considering developing in addition to what you already offer? Yeah. So what we've encountered,
[00:22:16] Unknown:
at Enveil is a whole range of risk tolerance within the commercial space in various verticals. So we are broadening some of our product lines and offering a multitude of ways for people to dial up and down their risk profile as it pertains to the usage of data. So you'll see new product lines coming out, leveraging things like Intel SGX and other types of capabilities where there is a little bit more risk tolerance in the environment.
[00:22:45] Unknown:
And in the process of building and growing the business and the technological capabilities of Enveil, what have you found to be some of the biggest challenges or unexpected obstacles or unexpected things that you've learned in the process of doing that? So for us,
[00:23:06] Unknown:
we've had to do a lot of education around what it means to secure data when it's being used or processed, and the fact that technology is now available, capabilities are now available, to make that practical that have never been available before. So previously, you could only really address data security from a practical perspective in two forms, at rest and in transit, as I've described. There was really no practical way to deal directly with securing the usage of data. People were working around the space.
[00:23:37] Unknown:
So really opening up the conversation and introducing new capabilities to directly address that. So for anybody who wants to follow the work that you're up to or get in touch, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective
[00:23:53] Unknown:
on what you see as being the biggest gap in the tooling or technology for data security today. Of course, from an Enveil perspective, it's the usage. So many, many solutions exist for the other two areas of the data security triad. It's become a more commodity space, but the big gap, and where all the attackers are going once you close off those other two spaces, is in that usage
[00:24:16] Unknown:
layer. So that's what we see as the largest gap right now in data security. Alright. Well, thank you very much for taking the time today to talk about your work at Enveil and some of the uses for homomorphic encryption and use cases that it enables. So I appreciate that and I hope you enjoy the rest of your day. Thanks so much.
Introduction and Guest Introduction
Ellison Anne Williams' Background in Data Security
Mission and Focus of Enveil
Understanding Homomorphic Encryption
Scalability and Practical Applications
User Interaction and Integration
Zero Trust Interface and Data Brokerage
Deployment and Configuration
Remaining Attack Vectors
Common Misunderstandings and Future Plans
Challenges and Education
Biggest Gap in Data Security