Summary
The best way to make sure that you don’t leak sensitive data is to never have it in the first place. The team at Skyflow decided that the second best way is to build a storage system dedicated to securely managing your sensitive information and making it easy to integrate with your applications and data systems. In this episode Sean Falconer explains the idea of a data privacy vault and how this new architectural element can drastically reduce the potential for making a mistake with how you manage regulated or personally identifiable information.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
- Atlan is the metadata hub for your data ecosystem. Instead of locking all of that information into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how you can take advantage of active metadata and escape the chaos.
- Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.
- Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.
- Your host is Tobias Macey and today I’m interviewing Sean Falconer about the idea of a data privacy vault and how the Skyflow team are working to make it turn-key
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Skyflow is and the story behind it?
- What is a "data privacy vault" and how does it differ from strategies such as privacy engineering or existing data governance patterns?
- What are the primary use cases and capabilities that you are focused on solving for with Skyflow?
- Who is the target customer for Skyflow (e.g. how does it enter an organization)?
- How is the Skyflow platform architected?
- How have the design and goals of the system changed or evolved over time?
- Can you describe the process of integrating with Skyflow at the application level?
- For organizations that are building analytical capabilities on top of the data managed in their applications, what are the interactions with Skyflow at each of the stages in the data lifecycle?
- One of the perennial problems with distributed systems is the challenge of joining data across machine boundaries. How do you mitigate that problem?
- On your website there are different "vaults" advertised in the form of healthcare, fintech, and PII. What are the different requirements across each of those problem domains?
- What are the commonalities?
- As a relatively new company in an emerging product category, what are some of the customer education challenges that you are facing?
- What are the most interesting, innovative, or unexpected ways that you have seen Skyflow used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Skyflow?
- When is Skyflow the wrong choice?
- What do you have planned for the future of Skyflow?
Contact Info
- @seanfalconer on Twitter
- Website
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Have you ever woken up to a crisis because a number on a dashboard is broken and no one knows why? Or sent out frustrating Slack messages trying to find the right dataset? Or tried to understand what a column name means? Our friends at Atlan started out as a data team themselves and faced all this collaboration chaos. They started building Atlan as an internal tool for themselves. Atlan is a collaborative workspace for data driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets and code, Atlan enables teams to create a single source of truth for all of their data assets and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker, and more.
Go to dataengineeringpodcast.com/atlan today. That's A-T-L-A-N, and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $3,000 on an annual subscription. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends at Linode. With their new managed database service, you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes with automated backups, 40 gigabit connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services.
And don't forget to thank them for their continued support of this show. Your host is Tobias Macey. And today, I'm interviewing Sean Falconer about the idea of a data privacy vault and how the Skyflow team are working to make it turnkey. So, Sean, can you start by introducing yourself? Hi, everyone. Thanks for having me. My name is Sean Falconer. I'm head of developer relations at Skyflow,
[00:02:05] Unknown:
and we are a data privacy vault, delivered as an API.
[00:02:08] Unknown:
And do you remember how you first got started working in data?
[00:02:11] Unknown:
Yeah. I mean, I've kinda had a long history in engineering. I spent about a decade in university, doing three different degrees in computer science, so that kinda took me all over the map. But when I was still in school, you know, I worked on a number of different data management projects involving sort of highly sensitive data. And, of course, things were a lot different back then. It was almost 20 years ago at this point. But during my master's degree, I worked at this security company where we were actually developing software for the RCMP and police departments in Canada. And the problem that they were trying to solve was that pretty much every municipal police force, as well as the RCMP, kind of had their own software for tracking things like speeding tickets and other interactions with people. Over time, you know, criminal behavior and things like that would be consolidated at a federal level, but it took time. So technically, someone could get a speeding ticket in one town and then, like, two hours later, get a speeding ticket in another town. And then two hours after that, get another speeding ticket. And none of those police officers would actually know that this person had this record of speeding all day long. And it would take months before that was consolidated. So the project I was working on was to try to fix this problem. We were building a system that could sort of fan out search across these different independent systems. And then later in my PhD work, I worked in information visualization and human interaction, specifically with large amounts of data in biomedicine.
And then after that, I started a company in the job space, so we were dealing with all kinds of sensitive data around job applications, resumes, interviews, and so on. And we had to build a lot of custom sort of data analytics pipelines and so forth. And now at Skyflow, you know, data management is, of course, a big part of what we do and a common use case that we help solve.
[00:03:47] Unknown:
And so digging into Skyflow, can you describe a bit about what it is and some of the story behind it and how you got involved with the project?
[00:03:55] Unknown:
Yeah. So in terms of our history, our founder and CEO is Anshu Sharma. He previously held executive roles at companies like Oracle and Salesforce, and then he was a multi time founder and investor. And he kept seeing this problem of data privacy, data security, and sort of every company was trying to solve this independently. And he'd been thinking about this for a long time, like, eight years before founding Skyflow, and wondering, like, how do you solve this problem? Like, why is this such an issue for these different companies? And, eventually, he realized that companies like Netflix and Apple actually solved this problem through pioneering a technology known as a zero trust PII data privacy vault. And sort of the common thread between these two companies that had developed this technology was they recognized that their customer data, which is, you know, core to their business, is something special. And because it was special, it shouldn't be managed the same way that we manage regular application data. It's similar to how if you have important physical documents like your birth certificate or Social Security card, or, for me, you know, my children's birth certificates or something like that, you might put those in a safe in your home, not in, like, the kitchen junk drawer with your batteries and flashlights.
So sensitive user data in the same spirit should be kept separate from application data. So they pulled out their customer data from their existing application infrastructure into these vaults. And the big advantage there is that besides creating, like, an isolated single source of truth, it ends up completely descoping your existing application from the responsibility of data privacy, security, and compliance, really focusing the problem to the vault. So the data privacy vault solves the problem of data privacy from first principles, and it makes data privacy essentially an architectural decision. But that being said, they're really difficult to build.
Shopify, as a sort of data point, took about three years and contributions from almost a hundred engineers to build their version of a vault. So most companies don't have the resources and expertise to put into that, just like most companies don't build their own database. So, essentially, Skyflow took inspiration from those different companies that pioneered the technology, but built the data privacy vault for everyone else, making it available as a simple API. And I think it's, you know, very much in the spirit of companies like Twilio or maybe Stripe, where sending an SMS or doing a payment is a simple REST API call. And we believe that a data privacy vault delivered as an API makes data privacy as simple as a handful of API calls. In terms of how I became involved, I had previously spent a number of years at Google building developer relations teams in a completely different area, actually, in business communications, where we worked on a suite of API based products that enhance communication between businesses and consumers.
But I knew Anshu from a long time ago, and he had reached out to me. And I started learning about the company, and I got really excited about what they were doing. And I think data privacy is something that I had touched in my career a number of times, from my own company dealing with, you know, PCI compliance, to Google dealing with things like HIPAA. And I think for me, as someone who's been a long time engineer, I'm like, I didn't get into engineering to deal with things like GDPR or PCI compliance. Like, that's not the fun stuff. I wanna, like, build games and build cool applications for people. So if we can take something as complicated as that and abstract it away through an API, that's a really exciting product to be part of.
[00:07:09] Unknown:
In terms of the idea of a data privacy vault, I'm wondering if you can compare it to some of the other strategies that teams have settled on for being able to address this problem of data privacy, such as privacy engineering or data governance where you're just restricting access to some of those sensitive fields? I see the data privacy vault as really complementary
[00:07:29] Unknown:
to existing programs like privacy engineering or data governance. You know, data governance is a part of our vault architecture. So it's a component or feature of our vault. So it's certainly baked into that. And I think it significantly simplifies governance. You know, governance is about collection and use and deletion and restricting access. And by placing everything in the vault, you know what you're storing, you know when it's stored, who has access, and it minimizes the overall risk. And the other thing, I think, when it comes to, you know, traditional approaches to data privacy, whether that's through privacy engineering or some other means, is that the spirit of a lot of the privacy regulations was to get companies to really shift their thinking about data privacy, to shift it left, which is one of the core components of privacy by design.
But I think what ended up happening for certain regulations is that it became more of, like, a checklist that companies were sort of self auditing and checking and saying, like, yeah, we're, you know, compliant with whatever. And with regulations like GDPR, where they've actually started fining companies for not being compliant, I think that's created pressure on the industry to take it more seriously and start to rethink how they actually approach data privacy. And since a data privacy vault is really an architectural approach to solving data privacy, based around the first principles of solving this, you know, larger issue of data security, compliance, residency, and so on, it shifts the entire conversation for how you do data privacy to really the beginning of a design cycle for a piece of software.
[00:09:03] Unknown:
In terms of the core use cases that you're focused on with Skyflow, I'm wondering if you can talk to the different, maybe industry verticals or problem domains and the capabilities that you're aiming at providing to those different use cases.
[00:09:19] Unknown:
I think in the broadest sense, like, our perspective is that data privacy is part of architectural design. So the ideal situation is that at the same time that someone's thinking about, you know, what kind of database am I going to use? Do I need a caching system? Do I need a data warehouse, and so on? A data privacy vault should be part of that conversation. Basically, anything that's sensitive customer data that you wouldn't want to show up on the front page of the New York Times should probably go in the vault. And the users table that's traditionally been in your database just doesn't belong there. It belongs in the vault. Now that being said, there's, of course, companies that come to us with more specific problems that they wanna solve. Certainly a very common situation is with, you know, fintech companies or in health tech. So a common use case that we might see is something called PCI lock-in. So perhaps a company that's starting out, they need to process payments. They'll typically go to a company like Stripe or, you know, Braintree or one of the other ones.
And the advantage there is those companies, they provide a great service, simple API, great SDKs to integrate with, and you are offloading PCI compliance to those companies. So they give you PCI compliance out of the box. But over time, you might reach a point with your company, maybe you move internationally or you're doing a large amount of transactions, where it might not make sense for you to only work with one provider. You might be able to get better deals in certain countries or certain regions of the world if you have the flexibility to work with those different companies. But at this point, all your customer credit card information is essentially locked into one of those providers.
So the way that we can solve that is you can move essentially your banking information or credit card information for customers into the vault, and then you're offloading PCI compliance to us. And then through a feature that we provide called Skyflow Connections, you can call Skyflow Connections rather than calling Stripe or Braintree or whichever provider you're using directly, and you pass a token that represents the sensitive data, and then we will automatically detokenize that and proxy that call to the third party. And that gives you the flexibility to essentially work with any provider that you want, not being locked in. You know, other common use cases are, you know, HIPAA compliance. With HIPAA, you need to de-identify 18 different forms of PII, things like name, address, and so forth. We can provide all that utility out of the box. We even provide a vault schema specifically for health care.
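To make that tokenize-then-proxy flow concrete, here is a minimal in-memory sketch. Every name in it (`TinyVault`, `proxy_payment`, `fake_provider`) is invented for illustration; this is not Skyflow's actual API, just the shape of the pattern: the application holds only a token, and the raw card number is resolved in transit on the way to the provider.

```python
import secrets

class TinyVault:
    """Toy stand-in for a data privacy vault: token in, plaintext out."""

    def __init__(self):
        self._store = {}  # token -> plaintext value

    def tokenize(self, value: str) -> str:
        token = "tok_" + secrets.token_hex(8)  # opaque, non-sensitive token
        self._store[token] = value
        return token

    def detokenize(self, token: str) -> str:
        return self._store[token]

def proxy_payment(vault: TinyVault, token: str, call_provider) -> dict:
    """Detokenize in transit and forward the real value to the provider,
    so the application itself never handles the raw card number."""
    card_number = vault.detokenize(token)
    return call_provider(card_number)

# Stub standing in for any payment provider (Stripe, Braintree, ...).
def fake_provider(card_number: str) -> dict:
    return {"status": "ok", "last4": card_number[-4:]}

vault = TinyVault()
token = vault.tokenize("4242424242424242")
result = proxy_payment(vault, token, fake_provider)
print(token.startswith("tok_"), result["last4"])  # True 4242
```

Because the provider is passed in as a callable, swapping Stripe for another processor is a one-line change at the call site, which is the "not locked in" property described above.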
Another common use case is around analytics or the data warehouse. So a lot of times with an ETL pipeline, you end up with a situation where you're sort of pushing PII down into your data warehouse, and then it ends up in your metrics dashboards and your logs and so forth, and people wanna solve that problem because that causes all kinds of issues around, you know, potential for data leaks and data breaches. Or if you need to comply with localization regulations, it becomes kind of a nightmare to try to disentangle that data from your application data. We can solve that problem essentially by introducing de-identification at the head of the ETL pipeline, so that rather than processing PII and pushing it downstream, you're essentially putting the PII in the vault and sending tokenized versions of that data, essentially de-identified data that's nonsensitive, downstream into your data warehouse, and still being able to do all the analytics that you would normally do. You know, finally, I think another common use case is around data residency.
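The head-of-pipeline de-identification just described can be sketched in a few lines. The vault here is just a dictionary and the field names are made up; the point is only that PII fields are swapped for opaque tokens before any record reaches the warehouse, while non-sensitive fields pass through untouched.

```python
import secrets

PII_FIELDS = {"name", "email"}
vault = {}  # token -> plaintext, standing in for the vault

def tokenize(value: str) -> str:
    token = "tok_" + secrets.token_hex(8)
    vault[token] = value
    return token

def deidentify(record: dict) -> dict:
    """Replace PII fields with tokens; leave everything else alone."""
    return {k: tokenize(v) if k in PII_FIELDS else v
            for k, v in record.items()}

event = {"name": "Ada Lovelace", "email": "ada@example.com", "amount": 42}
warehouse_row = deidentify(event)

# The warehouse row carries no plaintext PII; the amount is still usable
# for dashboards, logs, and downstream analytics.
print(warehouse_row["amount"], warehouse_row["name"].startswith("tok_"))  # 42 True
```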
Data residency is a hard problem to solve for a lot of companies. But by moving your data into the vault, it gets a lot easier, where, essentially, you can deploy vaults to different regions of the world. So you could have a vault that's for your European customers, a vault for, you know, your US customers, Brazil, and so forth. Because you're creating the single source of truth, you're really localizing the problem of residency to spinning up the vault in these different places.
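In miniature, the residency pattern is one vault per region plus a small router that sends each customer's PII to the right one, with the application keeping only a reference. The region codes, router, and reference format below are all illustrative assumptions, not a real deployment topology.

```python
# One (toy) vault per region; in practice these would be separate deployments.
REGION_VAULTS = {"EU": {}, "US": {}, "BR": {}}

def store_pii(customer_id: str, region: str, record: dict) -> str:
    """Write PII into the vault for the customer's region and return a
    non-sensitive reference the application can store anywhere."""
    if region not in REGION_VAULTS:
        raise ValueError(f"no vault deployed for region {region}")
    REGION_VAULTS[region][customer_id] = record
    return f"{region}:{customer_id}"

ref = store_pii("c1", "EU", {"name": "Álvaro"})
# EU data never leaves the EU vault; the app stores only the reference.
print(ref, "c1" in REGION_VAULTS["EU"], "c1" in REGION_VAULTS["US"])  # EU:c1 True False
```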
[00:12:52] Unknown:
As far as the adoption path for something like Skyflow, I'm wondering who you see as the initial customer and then what the decision making process is around how Skyflow ends up as part of the application or data architecture.
[00:13:10] Unknown:
You know, I think naturally, you know, fintech and health tech make up a lot of our customer base. And they typically understand that they're storing sensitive data, especially if you're doing something like credit card transactions or banking, then, you know, you need to be PCI compliant just to sort of get off the ground and launch, and you need a solution. So you're looking for those types of solutions. In terms of how the decision making process works or how customers come to us, I kind of see the community of engineers across three sort of levels. There's sort of the CTO or the CISO, you know, at a larger company, that might be the ultimate decision maker about how they're gonna solve problems about data privacy, and maybe they're the ones that are, kinda, like, signing the contract or signing the check. Then there's the solution architect or the architect at a company that's thinking about, okay, well, how do we bring the vault into our existing architecture? Where does it fit, and how is it going to work with these other systems? And then there's the actual application engineer that's going to be doing a lot of the integration with our APIs or with our SDKs.
And in terms of, you know, where that happens, it depends a lot on the problem that people are solving. You know, ideally, when it comes to data privacy, companies are essentially de-identifying or tokenizing data as early as possible in whatever the life cycle of that data is, and then detokenizing it or re-identifying it as late as possible. So the ideal situation is essentially whenever you're collecting information, say, in a form, that's sensitive, rather than sending that data to your back end and downstream, you send that data directly into your vault, and then anything that's going downstream is tokens. And then when you need to use that data to call a third party service, for example, then you're detokenizing in transit securely within the vault to the third party.
[00:14:54] Unknown:
As far as the performance implications of moving your private data into this external storage layer, I'm wondering how you mitigate some of those potential performance impacts when you need to actually retrieve that information and maybe join across it into the application database or in analytical workflows where you do actually need to access, you know, particular fields in the process of getting statistical distribution across your aggregates or something like that, how you see either teams approaching that architecture or some of the utilities that you provide to mitigate some of those performance challenges of issuing these joins across the system boundaries?
[00:15:35] Unknown:
Yeah. So I think in part, it's, you know, when it comes to certain problems, as, like, a new way of thinking about architecture, you also need to rethink the way that you might approach your normal analytics. It's not necessarily about doing a join between the vault and your regular database. A lot of the same types of analytics or metrics that you might be doing today, you can solve through using essentially deterministic tokens or consistent tokenization. So this means essentially the same data value coming from different systems will always generate the same token. And then in that case, you know, within your application infrastructure, whether it's, like, your data warehouse or whatever you're driving analytics from, you're using those stored tokens. And you really only need a representation of the data, not necessarily the PII. So, for example, if you wanted to know spend of your customers by country, you don't need to actually run the operation over data that has the plaintext value of the country connected to the spend of the customer. You can do that using a deterministic token that represents each country. So the representation of the country is essentially a nonsensitive tokenized form. So you're not actually needing to do a join against the vault to run an operation like that. You're just running that against the consistent token. And then if you need to display that country, you can essentially detokenize it during the display of that value. You know, I did something similar for a talk I have coming up next month at MongoDB World, where the use case is customer support agents that cover certain territories, and the territory essentially maps to a state. But I never store the plaintext value of the state within any of the application infrastructure. I'm only using essentially tokenized forms of the data.
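A small sketch of that spend-by-country example. The HMAC construction, key, and reverse-lookup dictionary below are illustrative assumptions, not Skyflow's actual tokenization scheme; they just demonstrate the property that matters: the same plaintext always maps to the same token, so the group-by runs entirely over tokens, and plaintext is recovered only at display time.

```python
import hashlib
import hmac
from collections import defaultdict

KEY = b"demo-key"  # assumption: a vault-held secret
reverse = {}       # token -> plaintext, i.e. the vault's detokenize lookup

def det_token(value: str) -> str:
    """Deterministic (consistent) token: same input, same token."""
    tok = "tok_" + hmac.new(KEY, value.encode(), hashlib.sha256).hexdigest()[:12]
    reverse[tok] = value
    return tok

orders = [("US", 30), ("DE", 15), ("US", 12)]
rows = [(det_token(country), amount) for country, amount in orders]

spend = defaultdict(int)
for tok, amount in rows:
    spend[tok] += amount  # aggregation over tokens alone, no vault join

# Detokenize only when displaying the result.
display = {reverse[tok]: total for tok, total in spend.items()}
print(display)  # {'US': 42, 'DE': 15}
```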
And we also have record based vaults where you can reference a set of values through a single UUID, as you would between two databases, or we support unique constraints to persist, you know, external identifiers and use those as references.
So it depends a little bit on what you're doing. Now in terms of, if you do, you know, need to retrieve data from the vault, which, you know, can be a situation where you need to display the last four digits of someone's Social Security number, then you're gonna need to retrieve that from the vault. We've done a lot to really make the vault, you know, enterprise grade in performance, and we have very, very high, you know, like, QPS performance. And that is really a core feature of the way we think about the vault architecture. You know, it's not just essentially storage for your PII. Like, ultimately, you're storing that data for a reason, so that you can use it. So you need high utility. That's really a feature of the vault. And part of high utility is having high throughput, essentially enterprise grade performance from an API.
[00:18:09] Unknown:
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it's often too late and the damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values, before it gets merged to production. No more shipping and praying. You can now know exactly what will change in your database.
Datafold integrates with all major data warehouses as well as frameworks such as Airflow and dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold. Digging into the Skyflow platform itself, I'm wondering if you can talk to some of the ways that it's architected and some of the capabilities that are necessary to be able to provide this as a utility, beyond just having an API and a storage location?
[00:19:17] Unknown:
If you kind of look at the way the data privacy vault and the APIs evolved at Skyflow, you know, the first part of the development process was really storage and protection. So we needed to create a rich storage layer for where people were gonna store their PII. And that meant that the team that built that had a lot of expertise in encryption. We developed a lot of technology during that point to make sure that things were secure. And one of those types of technologies that we developed was something called polymorphic encryption. Have you ever heard of homomorphic encryption? So homomorphic encryption, I think it was originally talked about in the seventies, but essentially homomorphic encryption in many ways is, like, the holy grail of encryption, where you can actually perform any type of operation over encrypted data. But the problem with homomorphic encryption is it's just not production grade. You know, IBM did create a homomorphic database, you know, five or six years ago. It took, like, three minutes to query 50 records. So it's not something that you can actually use in production.
And the key insight that we had was that the type of data that we're storing is highly specialized. Things like a Social Security number, a phone number, an email, and so forth, they're not really, you know, numbers or just strings. They're like a data structure. And the data structure actually predetermines the types of use cases that you're going to do with it. So if you take the case of a phone number, you know, with homomorphic encryption, you could do something where you pluck out, you know, randomly the two digits in the middle and multiply them with two digits of another phone number, but that's not something that you actually do with a phone number. The things that you do with a phone number are you call a phone number, or maybe send a text message to a phone number, or you need the last four digits of a phone number, or perhaps you need the area code in order to determine, you know, where customers are, what regions they're in, or even the country code. So what we developed with polymorphic encryption was this understanding that the type of data that you're storing is this data structure that has these predefined use cases, so we can apply different types of encryption to each part of the data structure, allowing us to actually perform operations over encrypted data. So we support not only encryption at rest and in motion, but also execution of queries. So you can perform full operations against the vault using this technology, polymorphic encryption.
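The structural idea is easier to see in code: treat the phone number as a structure (country code, area code, last four, and so on) and encrypt each part under its own key, so a query like "what's the area code?" decrypts only that one field. The XOR cipher below is a deliberately weak toy standing in for real encryption, and the field split is an assumption for illustration; it is not how Skyflow's polymorphic encryption actually works internally.

```python
import hashlib

def _xor(data: bytes, key: bytes) -> bytes:
    # Toy keystream cipher (NOT real encryption): XOR against a hash of the key.
    stream = hashlib.sha256(key).digest()
    return bytes(b ^ stream[i % len(stream)] for i, b in enumerate(data))

def encrypt(part: str, key: bytes) -> bytes:
    return _xor(part.encode(), key)

def decrypt(blob: bytes, key: bytes) -> str:
    return _xor(blob, key).decode()  # XOR is its own inverse

def store_phone(number: str) -> dict:
    """Split the phone number 'data structure' and encrypt each part
    under its own key, so parts can be used independently."""
    parts = {
        "country": number[:2],
        "area": number[2:5],
        "middle": number[5:-4],
        "last4": number[-4:],
    }
    return {name: encrypt(val, b"key-" + name.encode())
            for name, val in parts.items()}

record = store_phone("+15551234567")
# Answering "what's the area code?" decrypts only that one field;
# the full number never needs to be reassembled.
print(decrypt(record["area"], b"key-area"))  # 555
```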
So once we built a lot of that storage layer and the sort of data security around it, the next thing is essentially, how do you control access? And that's data governance. So we spent a lot of time designing the product before writing a single line of code. One of the challenges, I think, with building a technology like this is it's not really a move fast, break things type of technology that you can bring to market. Like, the MVP doesn't really exist. It took us two years to build the MVP. You know, going back to the example of Shopify, where it took them three years to build their original version of a vault. So we had to build a lot of these things based on internal expertise as well as sort of intuition of our understanding from previous companies that we worked with. So we have a lot of people who've worked previously at companies like Oracle and Salesforce and stuff, that are used to kind of solving these complex challenges when it comes to data security and data governance. So the next thing was to build governance. And one of the big things that we sort of created there was, we need really fine grained control over how we provide access. So we wanna be able to lock down access to an application based on, you know, the vault, the column data, how they see that data, you know, whether it's masked, redacted, or plain text, even to the row level.
And if you look at the existing point and click UIs for managing policies and roles within something like Google Cloud or AWS, those are great tools, but they're very limiting. They don't really provide all the flexibility that we needed. And then on the other end of the spectrum, we could have done something like letting you write policies in JavaScript. But then it's like you're taking, I don't know, a sledgehammer to a tack in the wall, with too much power, and someone who's not an engineer wouldn't be able to do that. So what we ended up doing was inventing our own policy language with English-like rules that anyone can write, which gives you the flexibility that we need without being overly complicated. So you can say something like, you know, allow read on a person's name with redaction equal masking where the row's state equals California, or something like that. So it's very easy to write these kind of SQL-like policies. So we had data security. We had isolation.
We had controlled access to it. But how do we share that data with third party services in a secure way? So we built out something called Skyflow connections that allows us to do that. You can think of it kind of like a secure cloud function that runs within the vault and can pass data to third parties and proxy it back. And then the next thing that we built was a feature called Skyflow Studio, which is essentially a web based UI that allows people to do a lot of the creation and management of the vault. You can do it all through APIs, but you can think of the studio essentially as, like, a UI layer for the APIs. It's a way to kinda help people understand what's going on with the product, visualize it, do some of those, you know, simple operations, as well as learn how to actually use a vault within their own application.
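As a rough illustration of the English-like policy idea described above, here is a hypothetical mini-evaluator. The `ALLOW ... ON ... WITH REDACTION ... WHERE ...` grammar is paraphrased from the example in the conversation; it is not Skyflow's actual syntax, just a sketch of how such a rule could map to a row-and-column access decision:

```python
import re

# Hypothetical policy text, modeled on the spoken example; real syntax may differ.
POLICY = "ALLOW READ ON persons.name WITH REDACTION = MASKED WHERE persons.state = California"

def parse_policy(text: str) -> dict:
    pat = (r"ALLOW (\w+) ON (\w+)\.(\w+)"
           r"(?: WITH REDACTION = (\w+))?"
           r"(?: WHERE (\w+)\.(\w+) = (\w+))?")
    m = re.fullmatch(pat, text)
    action, table, column, redaction, _wt, wc, wv = m.groups()
    return {"action": action, "table": table, "column": column,
            "redaction": redaction, "where": (wc, wv) if wc else None}

def evaluate(policy: dict, table: str, column: str, row: dict):
    # Returns the redaction level if access is allowed, or None if denied.
    if table != policy["table"] or column != policy["column"]:
        return None
    if policy["where"]:
        col, val = policy["where"]
        if str(row.get(col)) != val:
            return None
    return policy["redaction"] or "PLAIN_TEXT"

p = parse_policy(POLICY)
```

With this rule, a support tool reading `persons.name` for a California row gets a masked value, any other row or column is denied outright, which is the row-level, column-level lockdown described above.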
[00:24:33] Unknown:
In terms of the integration process of adopting Skyflow and starting to integrate it into an existing application, I'm wondering what that migration process might look like where you say, I've already got an application. I already have state where my, you know, user table has addresses and demographic information maybe, and I want to push those details into the Skyflow vault so that I'm not responsible for managing and securing them anymore. I just wanna maintain those references and just some of that overall process of starting to adopt the privacy vault technology into your application architecture?
[00:25:13] Unknown:
Yeah. So it's a great question. A common question that people ask. It depends a little bit on the use case that someone's trying to solve. So if you take the situation that you're talking about, where someone essentially wants to move their users table into the vault, then they're going to create a vault schema that kind of mirrors the structure of their users table. And then wherever they're collecting PII today, the ideal situation is they're doing that from the front end. So we provide front end SDKs for web, iOS, and Android. And wherever the collection is happening, ideally, you're sending that data directly to the vault and then dealing with tokens inside your application infrastructure from there on. That doesn't always work for everybody. You know, there are situations, we have a customer, for example, that uses MuleSoft, and we just announced this partnership recently where we added an integration with MuleSoft. They were using MuleSoft as their API gateway, so all their PII is essentially flowing through MuleSoft. So we built a Skyflow connector that works in that process. So it's not coming from the front end, but essentially at the API gateway level, the PII is being detected, sent to Skyflow, and then everything after that is tokenized, and that's sort of the ideal situation. So instead of having a users table within your database storing the PII or the plain text values, you're now storing tokenized versions of that data, and that's what's being passed around. Then if you need to actually display values in your front end, you can do that from the client side SDKs, where you would pull that data directly from the vault.
Or if you need to pass it to a third party, say you do an integration with Twilio and you wanna send an SMS. Typically, you would have the phone number in your users table. Somewhere in your application back end, you would pull that phone number and then call Twilio directly. But then, of course, that phone number's being exposed in your back end. Maybe it ends up in your log files or something like that. What you would do with Skyflow instead is the phone number would be located in the vault. You would have a token that represents the phone number, and you would call Skyflow connections with that token. And then we would detokenize it in flight, essentially, send the detokenized value over to Twilio, call Twilio on your behalf, and then pass the result back to you, making it so you can do essentially secure SMS in that way. So the ideal world is that you're tokenizing as early as possible. But there are situations where things might be a little bit more constrained than that. Especially if your application or company has been around for a long time, you don't need to start by doing everything at once. You're probably gonna attack things where the biggest potential problems are today.
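The detokenize-in-flight pattern just described can be sketched with a toy vault and a mocked SMS provider. All names here are illustrative assumptions, not Skyflow's real API: the point is that the application only ever holds the token, and the "connection" swaps it for the real value on the way out:

```python
import secrets

class ToyVault:
    """Minimal stand-in for a privacy vault's tokenize/detokenize service."""
    def __init__(self):
        self._store = {}

    def tokenize(self, value: str) -> str:
        token = "tok_" + secrets.token_hex(8)
        self._store[token] = value
        return token

    def detokenize(self, token: str) -> str:
        return self._store[token]

def send_sms_via_connection(vault, phone_token, message, sms_api):
    # The connection detokenizes in flight and calls the third party
    # (here a mocked SMS API) on the application's behalf; the app's own
    # back end and logs never see the raw phone number.
    phone = vault.detokenize(phone_token)
    return sms_api(phone, message)

sent = []
def fake_twilio(phone, message):  # stand-in for a real SMS provider call
    sent.append((phone, message))
    return "queued"

vault = ToyVault()
token = vault.tokenize("+1-415-555-0134")
status = send_sms_via_connection(vault, token, "Your order shipped", fake_twilio)
```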
So 1 of those areas that we see is, as I mentioned earlier, the analytics pipeline. They wanna say, we're consolidating data within the data warehouse, so let's at least stop sending PII downstream into there, and let's start tokenizing that data. And that's a fairly easy problem to solve. It's really just introducing a de-identification layer ahead of all of those ETL or ELT pipelines. And then another, you know, the use case I mentioned earlier about PCI lock in, that's a very easy thing to solve for most companies. Usually, people who come with that problem are, you know, up and running within a matter of weeks from proof of concept all the way to live. So it's a pretty easy lift for companies to solve specific types of problems. And that's usually where people start. Then they see the value of it, and they understand how it works, and they start to add additional use cases over time.
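A de-identification layer ahead of an ELT pipeline can be as simple as swapping PII columns for tokens before the load. This toy sketch, with a made-up tokenizer standing in for a real vault call, shows the shape of that step, everything downstream of it only ever handles tokens:

```python
def deidentify(rows, pii_columns, tokenize):
    # Swap PII columns for tokens before the warehouse load, so the
    # warehouse and everything downstream only ever sees tokens.
    out = []
    for row in rows:
        clean = dict(row)  # leave the source rows untouched
        for col in pii_columns:
            if col in clean:
                clean[col] = tokenize(clean[col])
        out.append(clean)
    return out

tokens = {}
def toy_tokenize(value):
    # Deterministic toy tokenizer keyed by value (illustrative only;
    # a real vault would issue and store these tokens itself).
    return tokens.setdefault(value, f"tok_{len(tokens)}")

rows = [{"order_id": 1, "email": "ann@example.com", "total": 42.0}]
safe = deidentify(rows, ["email"], toy_tokenize)
```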
[00:28:37] Unknown:
In terms of that analytical workload, you mentioned being able to integrate it into the ELT or ETL workflow. I'm wondering if you can talk to some of the ways that the data modeling works out, where you think about, okay, am I going to put the entirety of this table into the vault because it's what has the PII, or am I just going to put the information from these columns into the vault and then, you know, put the tokens that I get back into the table that I'm replicating. And just some of the ways that using those tokens and those references to the vault plays out over the different stages of the data life cycle as it goes through cleaning and analytics and aggregations and maybe even into some machine learning use cases?
[00:29:20] Unknown:
Yeah. That's a great use case to talk about. So there might be a couple of things going on there. I think 1 good example is, I'm speaking at the Snowflake Summit next month about an integration that I did between Skyflow and Snowflake that shows sort of this full data life cycle of an application. So in that use case, what's happening is it's for a fictitious company called Insta Bread. It's a gig economy company, essentially like Instacart, but it only delivers bread, and the application is the shopper or gig worker signup process. So as part of that signup process, they have to provide PII. They need to give their bank account information so that they can essentially sign up to be a worker for the service. And what's happening there is, in the application, the collection is happening securely on the front end, sending those values directly to Skyflow.
So we can essentially create either a secure iframe, if it's, you know, web based, directly within the form that's collecting the PII, or you can use something called Skyflow Elements to create a text field that looks like a regular text field but is, you know, a secure field whose contents are gonna be sent over to Skyflow. And that data goes directly to Skyflow, and then, essentially, tokens are being passed into the application database. In this case, we're using DynamoDB on Amazon Web Services. That kick starts an actual ETL pipeline, where a Lambda function is set up to be triggered whenever new data goes into DynamoDB. It packages that data up as a Kafka message, and then that's sent to a Kafka broker. And then we use the Kafka connector into Snowflake, so the data ends up essentially in the data warehouse. But that whole surface area beyond the front end, so the application back end, the application database, the ETL pipeline, the data warehouse, any analytics that's coming from that, is essentially a derisked surface area because it's only dealing with tokens. So even if someone stole the database, they wouldn't have access to any sensitive data. It'd just be, you know, nonsense, just the tokens.
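The pipeline step just described, a Lambda firing on new DynamoDB items and packaging them as Kafka messages, might look roughly like this. The event shape follows the DynamoDB Streams record format, while the signup field names and the topic name are assumptions for illustration, and the producer is mocked so the sketch stays self-contained:

```python
import json

def lambda_handler(event, produce):
    # Fires on DynamoDB Streams events; forwards each new item to Kafka.
    # Note the item already contains only tokens, never raw PII.
    for record in event["Records"]:
        if record["eventName"] != "INSERT":
            continue
        item = record["dynamodb"]["NewImage"]
        message = {
            "shopper_id": item["shopper_id"]["S"],
            "name_token": item["name_token"]["S"],
            "bank_token": item["bank_token"]["S"],
        }
        produce("instabread.signups", json.dumps(message))

produced = []
event = {"Records": [{
    "eventName": "INSERT",
    "dynamodb": {"NewImage": {
        "shopper_id": {"S": "s-1"},
        "name_token": {"S": "tok_a1"},
        "bank_token": {"S": "tok_b2"},
    }},
}]}
lambda_handler(event, lambda topic, value: produced.append((topic, value)))
```

In a real deployment, `produce` would be a Kafka producer client and the function would be wired to the table's stream; from the broker, the Kafka connector loads the tokenized records into Snowflake as described.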
And now it's great, you've derisked everything, but how do you use it? So to use it, at least in the Snowflake example, what we're doing is, Snowflake supports a feature called external functions, and essentially you can have an external function call an API. In this case, it calls something I set up within AWS as a Lambda function that can securely call the vault on your behalf. So then you can write a query like, you know, select detokenize, pass in the token, from whatever table you're using, and that'll actually securely call the Lambda function from Snowflake and detokenize that in flight. So then you can drive your analytics if you need the raw values. You can also run, like, a get value call where you can pass back masked data for a particular metrics dashboard. Maybe you only wanna show, like, the last 4 digits of a phone number, for example. Then you can do that as well.
[00:32:12] Unknown:
As far as the access control layer, that's something that can often be very challenging, particularly as you scale usage and you try to understand, you know, what are the different roles, what access do they give, and when people have maybe an intersection of different responsibilities, how do you figure out what that Venn diagram should look like. I'm curious how you have approached some of that user experience aspect of being able to understand, if I give this person this role, what access is that going to give, or how should I think about defining the roles to make sure that I'm able to do the operations that I expect to be able to do? Because when you get to a certain level of granularity, you know, you just start to become overwhelmed of, well, there are too many options, I'm just going to say everybody gets everything. And just figuring out, what is that balancing act to make sure that people actually use the policies effectively and they actually understand what it is that they're doing.
[00:32:58] Unknown:
Yeah. I think that's a great topic because I think that's something that happens a lot. Like, you can create all the sort of security controls in the world, but if people just do what's easy and ignore them, then it doesn't matter. So some of the things that we've tried to do is that since we're storing really highly specialized data and the vault really understands kind of the underlying use cases of that data, then we can do a lot of stuff out of the box. So, for example, if you're storing a credit card, well, there's not really a lot of legitimate use cases for any user being able to see the credit card in plaintext. The only time that you need to see the credit card in plain text is probably when you're passing it to a third party merchant to do a credit card transaction. So you can essentially make it impossible to create a policy that allows someone to see a credit card in plain text.
And on top of that, if you understand the types of data that someone's storing and the types of use cases they're gonna solve with that data, you can preconfigure a lot of the roles and policies out of the box that they would need for performing the normal set of use cases for that data. We provide the flexibility to go beyond, you know, the normal set of use cases if you need that. But that's something that we can provide out of the box, which is gonna cover most situations. So something like a Social Security number is another good example, where the typical operations are, either the service should not have access to the Social Security number at all, or it has access to the last 4 in order to do some sort of customer verification.
And then the only other time that you need it in plain text is probably when you're doing something like passing it to a 3rd party to do a credit check, or doing KYC or something like that with a 3rd party. Again, similar to the credit card, there's not really a typical business case for ever seeing the Social Security number in plain text. So you can essentially just make that impossible, or very hard to do. So those are some of the things that we've tried to do: make the right things to do essentially the default. And if you make the right things to do the default and you make them easy, people will do them.
[00:35:07] Unknown:
You mentioned this a little bit earlier with the different kinds of use cases that you support, with health care and PCI and sort of the financial services. And I'm wondering if you can just talk to some of the variance in the product to be able to support those different use cases, and what are the core requirements that are shared regardless of which problem vertical you're trying to work in?
[00:35:25] Unknown:
Right. So we provide a bunch of different vaults out of the box. We call them, you know, vault templates. So we provide a, you know, PCI or sort of Fintech vault, a healthcare vault, a PII vault. Each vault essentially defines a different schema that includes, like, tables and columns and the specific data types that you'd be storing for those particular use cases. So if you look at something like the Fintech vault, it really has all the types of data that you would be storing for doing any kind of banking, KYC, transactions, moving money, essentially. All those types of use cases are essentially provided out of the box and defined within the schema.
We also provide a vault specially for Plaid, which is a well known Fintech company that we have a partnership with. So we provide an integration to Plaid, and we provide a special vault that makes that integration really easy. So we provide a lot of these things out of the box, and it kinda goes back to what I was saying earlier, where we're trying to make the right things to do the default experience. And then it also significantly lowers the barrier to entry, because this becomes not something where you need to think about, okay, what are the types of fields and columns I need to store? It's just like, okay, well, this is the use case specific or vertical specific vault for storing health care data, so I probably should start with the health care vault. And then you can, of course, modify it or, you know, scale it back or change it as you see fit. And additionally, beyond just providing these things out of the box, for a lot of our customers, we have a solutions architect team that works with them to help build out some of the proof of concept. So it's very much a process where it's really a partnership between us and the customer.
[00:37:00] Unknown:
Going back to the analytics use case, another interesting element is the question of the kind of data catalog and data lineage. And I'm wondering how you factor into some of the available tooling for being able to track that information so that if I, as a data analyst, am trying to figure out how to answer a given business question and the data that I need resides at least partially inside the Skyflow Vault being able to understand, you know, this is where the data is. This is how it, you know, ended up in this final report that I'm going to produce and just being able to integrate with the kind of data discovery, metadata catalog, data lineage aspect of building out the end to end kind of analytical experience?
[00:37:42] Unknown:
We don't necessarily support an integration at that level today, at least. But we do support, as part of, you know, some of the compliance regulations, an auditing API. So any sort of access or retrieval from the vault, or anything really that happens within the vault, is logged within the audit API and available. So someone could actually consume that API to be able to show that kind of lineage if they wanted to, but it's not something that we provide, at least today, as an integration into, like, an existing metadata system or analytics system. Yeah. And then in terms of deletion, you know, I think the big way that we try to support that is really through the tokenization service that we provide. So as long as your system is deleting the original PII that's stored within the vault, then even if you're storing tokens within your services, those tokens are now meaningless. They can't even be detokenized to the original value because that mapping's essentially been removed. So it simplifies the deletion process significantly.
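The deletion mechanism described here, where removing the vault record severs the token mapping, can be illustrated with a toy token store (not Skyflow's actual implementation). Tokens copied into app databases, logs, or the warehouse become meaningless the moment the vault entry is gone:

```python
class TokenStore:
    """Toy vault: deleting the record makes every stray token useless."""
    def __init__(self):
        self._store = {}
        self._n = 0

    def tokenize(self, value):
        self._n += 1
        token = f"tok_{self._n}"
        self._store[token] = value
        return token

    def detokenize(self, token):
        # Returns None once the mapping has been deleted.
        return self._store.get(token)

    def delete_subject(self, tokens):
        # "Right to be forgotten": drop the mapping, not chase every copy.
        for t in tokens:
            self._store.pop(t, None)

store = TokenStore()
email_token = store.tokenize("ann@example.com")
# ...email_token gets copied into app databases, logs, the warehouse...
store.delete_subject([email_token])
```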
[00:38:45] Unknown:
Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it's no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. That's where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability.
Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you're a Data Engineering Podcast listener, you get credits worth $5,000 when you become a customer. Because this concept of a privacy vault is fairly new in the market, and you're part of an emerging category, I'm curious how you approach the challenge of customer education to help them understand the value that you provide, the use cases where this technology is important and applicable to them, and just the overall process of understanding how to meet the customers where they are and raise awareness of the general utility of this approach to data privacy?
[00:40:19] Unknown:
Yeah. I mean, that's really a big part of my job, to bring the technology of the data privacy vault to the world of engineering. That's also a very exciting part and a big part of why I was thrilled to join Skyflow. I think it's a rare opportunity that you really get to bring a technology that's completely new to the world. It's like, you know, bringing the 1st database to the world, or the data warehouse or something like that, something that should be an essential component of the modern stack. So, I mean, a big part of the initiatives that I've started since I joined is around, you know, content generation as well as things like, you know, events and speaking, coming on podcasts, for example, to get the word out. But I think content is a big area that we're really invested in, because there are problems that people are trying to solve where they might not necessarily realize that the data privacy vault is the best solution, but they are out there seeking solutions to that problem. For example, if you do a search like, how do I safely secure a Social Security number in a database, people are looking for answers to that. That question is asked on Stack Overflow, Stack Exchange, all these, you know, common communities over and over and over again. You know, some of those answers are a little scary. Some of the answers are pretty good. So I ended up writing an article that talked about all the things that you need to think about when you're storing something like a Social Security number, teaching people how to solve that problem as if they were to develop that solution themselves. But, you know, an easier way to do that is to bring that into the vault, and, essentially, we take care of it. So it's really trying to educate the developer communities about how you solve these different problems, starting with if you were to build it yourself, and then what it looks like if you did it with Skyflow.
[00:41:55] Unknown:
In your experience of working at Skyflow and working with your customers, what are some of the most interesting or innovative or unexpected ways that you've seen the Skyflow technology applied?
[00:42:05] Unknown:
I think 1 of the really interesting use cases is using our data governance engine as a way to do, like, consent enforcement, or to restrict data access based on the state of a workflow. So for example, if you think about, like, a customer support agent, the worst case scenario is that the customer support agent essentially sees all of the customer data in plain text and has access to all the customer data. A better system would restrict the columns that the support agent can see and perhaps add in, you know, some masking to certain sensitive fields. So instead of seeing the Social Security number, they see the last 4 digits or something like that. And then a level better than that is you perhaps restrict the rows that the support agent can see based on region. So they can only see US customers, or maybe even only the California customers, which reduces the risk or the scope of, like, a data leak. But even better, which is something that we've seen a customer do, is to limit access based on the current queue of customers that the support agent is talking to. So that means, like, a support agent is likely only talking to a handful of customers, which means they only have access to a handful of records at a time. Even if someone socially engineered the customer support agent into giving up their credentials or something like that, the scope of that breach is, you know, a handful of records with redacted or masked data, which really reduces, you know, the attack surface that a hacker has. And, essentially, the records expire out of the queue once the queue is empty. So it's time based as well, which is a really amazing thing to do. And we support APIs that allow you to do that. And some of the things that we're now trying to look at is, well, you can do those through the APIs, but let's take a use case like that and make it even easier to do.
Let's make something that is really sort of point and click, or a single API call, to be able to set something up like that, where you could have, you know, essentially self expiring access to a limited set of records.
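The queue-scoped, time-limited access pattern described above could be sketched like this. The class, its parameters, and the TTL value are hypothetical, purely to show the shape of the access check: an agent can only read records for customers currently in their queue, and entries age out:

```python
import time

class SupportQueueAccess:
    """Toy access check: agents see only the customers in their active queue."""
    def __init__(self, ttl_seconds=900):
        self.ttl = ttl_seconds
        self.queues = {}  # agent_id -> {customer_id: enqueue_time}

    def enqueue(self, agent_id, customer_id, now=None):
        now = time.time() if now is None else now
        self.queues.setdefault(agent_id, {})[customer_id] = now

    def can_read(self, agent_id, customer_id, now=None):
        # Access is denied unless the customer is in this agent's queue
        # and the entry has not yet expired (self-expiring access).
        now = time.time() if now is None else now
        entered = self.queues.get(agent_id, {}).get(customer_id)
        return entered is not None and (now - entered) < self.ttl

acl = SupportQueueAccess(ttl_seconds=900)
acl.enqueue("agent-7", "cust-42", now=1000.0)
```

In a real system this check would sit in front of the vault's read API and compose with the column masking and row restrictions discussed earlier, so even an allowed read returns redacted fields.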
[00:44:00] Unknown:
In your experience of working at Skyflow and helping to raise awareness about it and do customer education and outreach, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:44:12] Unknown:
I think the biggest challenge goes back to what I was saying earlier about this being, like, a new category. So our biggest challenge is really an educational 1, and we're sort of shifting how people and companies are thinking about data privacy. So I think historically, data privacy has been considered, like, a CSO problem or maybe a privacy engineering problem, and we're saying that data privacy is an architectural decision. So that's a big, you know, education hurdle that needs to happen. And we need to make people understand and see how to apply the technology within their companies.
[00:44:42] Unknown:
For people who are looking to tackle some of the challenges around data privacy or privacy engineering, what are the cases where Skyflow and the data privacy vault is the wrong approach, and they're better suited to, you know, just encrypting some database fields or using privacy engineering techniques to obfuscate data for data sharing? Just some of the cases where they're considering Skyflow and it actually isn't the right answer for them.
[00:45:08] Unknown:
The situations we see with customers where, you know, maybe Skyflow is not a great fit are when someone thinks about replacing their entire existing database with Skyflow. You know, it's not really designed for that. Like, you wouldn't put your couch in a safe, for example, but you probably would put your passport and maybe your birth certificate in a safe. So you really need to understand that this is for very special, sensitive customer data. And then I think the other 1 would be, you know, trying to run analytics directly off of Skyflow. We talked about this a little bit earlier. It's not really what it's designed for, slash, its core competency. Ideally, you're driving analytics from the de-identified data and storing that within your analytical databases.
And then the other 1 I would say is, it's not really data privacy related, but it has to do with something that's considered sensitive, like API keys or passwords to a database or encryption keys. We're probably not the right choice if you're trying to store that kind of data, because, I would say, we are a more comprehensive, overall data privacy solution. You should probably use something like a secrets vault or a secrets manager. That's a more appropriate solution for that particular problem. You know, a secrets vault or secrets manager has some of the same spirit as the data privacy vault, where it's, you know, the principle of isolation.
You should be using that technology, but it's complementary, I think, to the data privacy vault.
[00:46:29] Unknown:
As you continue to build out the product and grow the company and grow the use cases, what are some of the things you have planned for the near to medium term future?
[00:46:38] Unknown:
A lot of stuff. I mentioned Skyflow Studio, which is our web based UI. We're going through a big redesign of the studio. And 1 of the things I mentioned earlier, when I talked about the history of how the vault was architected, you know, a lot of the things that we had to do were based on our own expertise and intuition. And I think we were right a lot of the time, but there are times when we were maybe not as right in terms of how people wanna use the product. So we're kind of rethinking how the studio works based on actual feedback from customers, because now we have lots of customers that are using the product, and we have a better understanding of how people wanna interact with it.
Another area that we're building out goes back to the use case I talked about with the customer support roles, where they can only see, essentially, the customer data that is currently in their queue. You know, we can support that through APIs today, but it requires some work on the customer side, or the integrator side, where they have to be doing some of that business logic on their end. It's not super complicated. They can build that out. But I think we're really focused on, let's make this easy for people so that they can do the right things, going back to a theme that we talked a little bit about earlier. We can take a lot of these use cases where maybe people have to do a little bit of work today, and we can just make that a product feature and make it really simple to do. So we're investing in a bunch of things around that, and we really wanna continue to make the best practices the default experience for people. And we can continue to expand the preconfigured governance rules based on the vault structure.
We also have a number of partnerships with companies like Plaid, MuleSoft, Move, and so on. So we're continuing to expand the number of partnerships. And some of the cool things we can do there: because the data that we're storing is highly specialized, there's really, like, a limited set of potential partners that you're gonna be passing that data to. Like, if you're storing Social Security numbers, there's only so many companies you're gonna pass a Social Security number to. So if we build, you know, partnerships and integrations with each of those companies, then you can create a really seamless developer experience, where if we know that you're storing a Social Security number, we can say, like, here are your, you know, options for integrations, and it's really, like, a point and click thing where you can be up and running within minutes, which I think is something that can be really magical for people.
[00:48:58] Unknown:
To your point of storing Social Security numbers and earlier mentioning storing credit card numbers and things like that, 1 of the things that I was just realizing we didn't discuss yet is the question of data modeling within Skyflow and being able to do some, you know, validation at the point of submission where you can say, you know, this field is supposed to be a Social Security number, so I know that it's supposed to be 3 digits, a hyphen, 2 digits, and another hyphen, and 4 digits. And just some of the ways that Skyflow is able to support some of that kind of data validation at the point that you're entering it and just some of the kind of schema development and the ways that you think about structuring the data as it lives in the vault?
[00:49:35] Unknown:
In Skyflow Studio, or even through APIs, you can define your own schemas, and that, you know, consists of creating your own tables and columns. And you can build up a column similar to how you would build it up in a database, where we support basic data types like strings and integers and booleans and dates and so forth. But we also developed over 50 different, we call them, privacy preserving data types, which are data types that essentially encapsulate all the common types of PII that you might store. So we have a Social Security data type. We have a credit card data type. And within that data type, we can automatically create validation rules so that we check the data as it goes into the vault to make sure that it, you know, looks like a Social Security number, looks like a credit card number. We can also preconfigure the privacy and security controls around it. So a common use case for something like a credit card number is that you're gonna wanna mask some portion of it. So maybe a support agent needs to be able to see the last 4 digits. We make that the default experience for, essentially, the redaction on that column.
And you can, of course, change it if you need to. But we preconfigure that pattern. We'll essentially mask everything but the last 4 digits. Similarly, we preconfigure your settings for the polymorphic encryption so that you can do encrypted operations over the column. And then finally, with tokenization, we support a bunch of different tokenization schemes, including, you know, deterministic and nondeterministic tokenization. And we also support format preserving tokenization. So we can make it so that when you tokenize a credit card number, the resulting token still looks like a credit card number. That way, if in your database you're storing, say, an email address, and you have a rule on that email address that validates that it actually looks like an email, you can still do that even if you're using a token, because the token will look like an email address. It's just not the actual email address.
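A toy version of format preserving tokenization for emails, illustrative only and not Skyflow's real scheme (a production scheme would be keyed and reversible): the token is random, yet it still passes a simple email-shape validation rule like the one described above:

```python
import re
import secrets

# A deliberately simple email-shape rule, standing in for an app's validator.
EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+\.[\w.]+$")

def fp_tokenize_email(email: str) -> str:
    # Replace the local part and domain label with random strings while
    # keeping the local@domain.tld shape, so downstream validation
    # rules keyed to "looks like an email" keep working on the token.
    local = "t" + secrets.token_hex(4)
    domain = "t" + secrets.token_hex(4)
    return f"{local}@{domain}.example"

token = fp_tokenize_email("ann@example.com")
```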
[00:51:24] Unknown:
Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:51:39] Unknown:
I think the biggest thing, and this is something we talked a little bit about, is making the right thing to do the easy thing to do. If it's complicated or not the default behavior, people tend to take shortcuts. That's how a proof of concept ends up in production with the database credentials hard coded into the source code: someone was building a demo, it was never supposed to go to production, they got it working, and then they forgot that they did that. Somehow it gets into production, or maybe they're logging plain text data or something like that. Then years pass by, maybe the engineer has since left, and suddenly there's a data breach. The company has to announce that they had an API key hard coded in the source code and someone got access, or that they were logging sensitive data in plain text without realizing it. Everyone outside the company asks, how could this happen? And the reason it happened was that the right thing to do wasn't the easy thing to do. People took a shortcut because they were trying to build a proof of concept, and through human error they forgot to fix it before going to production. We need to make the right thing the default experience; there's a lot we can build into our tooling to prevent those kinds of mistakes.
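As a concrete contrast to the hard-coded-credentials anecdote, one common way to make the right thing the easy thing is to read credentials from the environment and fail fast when they are missing. This is a generic Python sketch with illustrative variable names, not tied to any particular product:

```python
import os


def get_db_credentials() -> dict:
    """Load database credentials from the environment instead of
    hard-coding them in source, failing fast when one is missing."""
    try:
        return {
            "user": os.environ["DB_USER"],
            "password": os.environ["DB_PASSWORD"],
            # Non-secret settings can have safe defaults.
            "host": os.environ.get("DB_HOST", "localhost"),
        }
    except KeyError as missing:
        raise RuntimeError(
            f"Required credential {missing} is not set in the environment"
        ) from None
```

Failing fast at startup means a misconfigured demo breaks loudly in development instead of silently shipping a secret to production.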
[00:52:51] Unknown:
Alright. Well, thank you very much for taking the time today to join me and share the work that you and your team are doing at Skyflow, and for introducing the overall idea of the data privacy vault as a way to manage the sensitive information that virtually every application has to deal with. I appreciate all of the time and energy that you and your team are putting into that, and I hope you enjoy the rest of your day.

Thanks so much, and thanks for having me.

Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Sponsor Messages
Interview with Sean Falconer: Introduction and Background
Skyflow: History and Concept of Data Privacy Vault
Comparing Data Privacy Vault to Other Strategies
Core Use Cases and Industry Applications
Adoption Path and Decision-Making Process
Performance Implications and Mitigation Strategies
Skyflow Platform Architecture and Capabilities
Integration and Migration Process
Data Modeling and Analytical Workflows
Access Control and User Experience
Supporting Different Use Cases and Core Requirements
Data Catalog and Lineage Integration
Customer Education and Awareness
Interesting and Innovative Use Cases
Challenges and Lessons Learned
When Skyflow is Not the Right Approach
Future Plans and Developments
Data Modeling and Validation
Biggest Gap in Data Management Tooling
Conclusion and Final Thoughts