Summary
As with all aspects of technology, security is a critical element of data applications, and the different controls can be at cross purposes with productivity. In this episode Yoav Cohen from Satori shares his experiences as a practitioner in the space of data security and how to align with the needs of engineers and business users. He also explains why data security is distinct from application security and some methods for reducing the challenge of working across different data systems.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Join in with the event for the global data community, Data Council Austin. From March 28-30th 2023, they'll play host to hundreds of attendees, 100 top speakers, and dozens of startups that are advancing data science, engineering and AI. Data Council attendees are amazing founders, data scientists, lead engineers, CTOs, heads of data, investors and community organizers who are all working together to build the future of data. As a listener to the Data Engineering Podcast you can get a special discount of 20% off your ticket by using the promo code dataengpod20. Don't miss out on their only event this year! Visit: dataengineeringpodcast.com/data-council today
- RudderStack makes it easy for data teams to build a customer data platform on their own warehouse. Use their state of the art pipelines to collect all of your data, build a complete view of your customer and sync it to every downstream tool. Sign up for free at dataengineeringpodcast.com/rudder
- Hey there podcast listener, are you tired of dealing with the headache that is the 'Modern Data Stack'? We feel your pain. It's supposed to make building smarter, faster, and more flexible data infrastructures a breeze. It ends up being anything but that. Setting it up, integrating it, maintaining it—it’s all kind of a nightmare. And let's not even get started on all the extra tools you have to buy to get it to do its thing. But don't worry, there is a better way. TimeXtender takes a holistic approach to data integration that focuses on agility rather than fragmentation. By bringing all the layers of the data stack together, TimeXtender helps you build data solutions up to 10 times faster and saves you 70-80% on costs. If you're fed up with the 'Modern Data Stack', give TimeXtender a try. Head over to dataengineeringpodcast.com/timextender where you can do two things: watch us build a data estate in 15 minutes and start for free today.
- Your host is Tobias Macey and today I'm interviewing Yoav Cohen about the challenges that data teams face in securing their data platforms and how that impacts the productivity and adoption of data in the organization
Interview
- Introduction
- How did you get involved in the area of data management?
- Data security is a very broad term. Can you start by enumerating some of the different concerns that are involved?
- How has the scope and complexity of implementing security controls on data systems changed in recent years?
- In your experience, what is a typical number of data locations that an organization is trying to manage access/permissions within?
- What are some of the main challenges that data/compliance teams face in establishing and maintaining security controls?
- How much of the problem is technical vs. procedural/organizational?
- As a vendor in the space, how do you think about the broad categories/boundary lines for the different elements of data security? (e.g. masking vs. RBAC, etc.)
- What are the different layers that are best suited to managing each of those categories? (e.g. masking and encryption in storage layer, RBAC in warehouse, etc.)
- What are some of the ways that data security and organizational productivity are at odds with each other?
- What are some of the shortcuts that you see teams and individuals taking to address the productivity hit from security controls?
- What are some of the methods that you have found to be most effective at mitigating or even improving productivity impacts through security controls?
- How does up-front design of the security layers improve the final outcome vs. trying to bolt on security after the platform is already in use?
- How can education about the motivations for different security practices improve compliance and user experience?
- What are the most interesting, innovative, or unexpected ways that you have seen data teams align data security and productivity?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on data security technology?
- What are the areas of data security that still need improvements?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
- Satori
- Data Masking
- RBAC == Role Based Access Control
- ABAC == Attribute Based Access Control
- Gartner Data Security Platform Report
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- Rudderstack: ![Rudderstack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png) Businesses that adapt well to change grow 3 times faster than the industry average. As your business adapts, so should your data. RudderStack Transformations lets you customize your event data in real-time with your own JavaScript or Python code. Join The RudderStack Transformation Challenge today for a chance to win a $1,000 cash prize just by submitting a Transformation to the open-source RudderStack Transformation library. Visit [RudderStack.com/DEP](https://rudderstack.com/dep) to learn more
- Data Council: ![Data Council Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/3WD2in1j.png) Join us at the event for the global data community, Data Council Austin. From March 28-30th 2023, we'll play host to hundreds of attendees, 100 top speakers, and dozens of startups that are advancing data science, engineering and AI. Data Council attendees are amazing founders, data scientists, lead engineers, CTOs, heads of data, investors and community organizers who are all working together to build the future of data. As a listener to the Data Engineering Podcast you can get a special discount off tickets by using the promo code dataengpod20. Don't miss out on our only event this year! Visit: [dataengineeringpodcast.com/data-council](https://www.dataengineeringpodcast.com/data-council) Promo Code: dataengpod20
- TimeXtender: ![TimeXtender Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/35MYWp0I.png) TimeXtender is a holistic, metadata-driven solution for data integration, optimized for agility. TimeXtender provides all the features you need to build a future-proof infrastructure for ingesting, transforming, modelling, and delivering clean, reliable data in the fastest, most efficient way possible. You can't optimize for everything all at once. That's why we take a holistic approach to data integration that optimises for agility instead of fragmentation. By unifying each layer of the data stack, TimeXtender empowers you to build data solutions 10x faster while reducing costs by 70%-80%. We do this for one simple reason: because time matters. Go to [dataengineeringpodcast.com/timextender](https://www.dataengineeringpodcast.com/timextender) today to get started for free!
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Are you tired of dealing with the headache that is the modern data stack? It's supposed to make building smarter, faster, and more flexible data infrastructure a breeze. It ends up being anything but that. Setting it up, integrating it, maintaining it, it's all kind of a nightmare. And let's not even get started on all the extra tools you have to buy to get it to work properly. But don't worry, there is a better way. TimeXtender takes a holistic approach to data integration that focuses on agility rather than fragmentation. By bringing all the layers of the data stack together, TimeXtender helps you build data solutions up to 10 times faster and saves you 70 to 80% on costs.
If you're fed up with the modern data stack, give TimeXtender a try. Head over to dataengineeringpodcast.com/timextender where you can do two things: watch them build a data estate in 15 minutes and start for free today. Businesses that adapt well to change grow 3 times faster than the industry average. As your business adapts, so should your data. RudderStack Transformations lets you customize your event data in real time with your own JavaScript or Python code. Join the RudderStack Transformation Challenge today for a chance to win a $1,000 cash prize just by submitting a transformation to the open source RudderStack Transformation library.
Visit dataengineeringpodcast.com/rudderstack today to learn more. Your host is Tobias Macey, and today I'm interviewing Yoav Cohen about the challenges that data teams face in securing their data platforms and how that impacts the productivity and adoption of data in the organization. So, Yoav, for anybody who hasn't listened to your previous episode, can you give a brief introduction?
[00:01:54] Unknown:
Sure. Hi, Tobias. Thanks for having me. My name is Yoav. I'm the cofounder and CTO of a data security startup called Satori. What we do is help companies streamline access to their data by helping them implement just-in-time access to data and resolving a lot of the bottlenecks around getting the right data to the right people at the right time in a secure and easy way.
[00:02:16] Unknown:
And, again, for folks who didn't listen to your previous appearance on the show, can you share how you first got involved in working in data?
[00:02:29] Unknown:
Sure. So I've been fascinated with data and databases for as long as I can remember. I think I was in, like, 3rd grade or something like that when I went to this computer class, and they taught us about relations and tables and SQL and all that stuff. So it's been an area of interest of mine for a very long time. But more professionally, in my previous role at a company called Imperva, I got the chance to build a globally distributed, highly scalable data platform back in the days when it was a lot about do-it-yourself, composing all these open source components together and figuring out all sorts of challenges that later on became architectures that we know today, separation of compute and storage and all that stuff. So anyway, I got acquainted with the challenges around data management and securing access to sensitive data specifically. So that's been a lot of fun, but it did get me exposed to all of the challenges in that area.
[00:03:32] Unknown:
And so in terms of the topic at hand today, we're discussing some of the elements around data security, and that's a very broad term. And as with a lot of things in the data ecosystem, it means different things to different people. So I'm wondering if you can just start by enumerating some of the different concerns that are involved and encapsulated in that term of data security and maybe even some of the ways that term has grown to mean more and different things over the past few years.
[00:03:58] Unknown:
Yeah. Sure. That's so true. It's a very convoluted term. It can mean different things to different people. So some of the concerns of data security, like, the basics are: I have this data, and it's sitting somewhere in a database or a disk or in cloud storage. How do I secure it just there and then? Call this data at rest. Folks talk about encryption and access and all of these different things that can help secure data as it just sits there waiting to be used. Other aspects are securing data as it moves between systems and, you know, goes to users. And there's also securing access to the data. You could also think about concerns like database security, which is basically, how do I make sure that my data system is up to date, patched, and so on? But if you think about it, that specific concern is mostly being handled today at the infrastructure level by the data platform vendor. A lot of us are migrating to more managed services, so we don't have to take care of, you know, patching our database software to make sure that we have the latest security update.
I do think that managing access to the data itself is a concern that is going to be with us for the long term because I don't see how that can get solved at the infrastructure level. Like, there's no one size fits all, and a database vendor cannot make a decision about who needs access to what data. That's a customer, consumer decision. So I think that's going to be a concern that we're going to have to deal with for a long time.
[00:05:43] Unknown:
Another interesting element of this question of data security is that, particularly back in the nineties and early 2000s, that largely probably just meant, can I control access to my data warehouse, at least from the perspective of an analytical workflow? And now as we have proliferated the different types of data systems and the ways of working with data and the ways that it's being used, that has drastically increased the overall surface area of the problem. And in your experience as somebody who is working in this space of managing data security and data access control, I'm curious what you have seen as a typical order of magnitude of the number of data locations that an organization is trying to manage access and permissions within, and some of the challenges that they're facing in terms of being able to get a holistic approach to that data security and access problem?
[00:06:33] Unknown:
Yeah. It's a great question. I totally agree. If you go back 10, 20 years, then things were much simpler. Companies, organizations didn't have as many different data platforms as they have today. It was, you know, you had BI. You had reporting. You had transactional databases supporting applications, but that's just what you had. And use cases around consumption of data were quite limited to, again, BI and reporting. Also, the amounts of data that companies process, store, collect, and use have grown significantly. What we see from our customer base is, on average, many thousands of, let's call them tables, could be, like, buckets and files and other stuff, but thousands and thousands of data assets to which access needs to be managed. There's obviously the regulatory environment that has grown significantly more complex and restrictive.
And when you combine all of these things together, it's a lot to handle. Like, if you think about our average customer, multiple data platforms, you know, hundreds or thousands of users who need access to thousands of data assets, and all that needs to be, you know, conforming to various types of regulatory systems and frameworks, it's a pretty tall order for any organization to be able to handle effectively.
[00:08:11] Unknown:
And another aspect of the problem space that has grown in complexity recently is the question of the regulatory environment where, for a while, you were mainly concerned about the regulatory aspects if you were working in health care or finance and, otherwise, you know, you wanted to be a good steward of the data, but you didn't have as many legal concerns to think about. Whereas now, with the advent of GDPR and CCPA and just the broader awareness and understanding of data privacy and the impacts thereof, it has made this overall space of security, and the ways that data is being applied and where and by whom, a much more complicated problem space. And I'm wondering, what are some of the main buckets of challenges that teams are facing in trying to prioritize what security controls they need, what security controls are a nice-to-have and beneficial but not absolutely critical, and some of the ways that they're thinking about how to tackle the overall problem space of data security, data privacy, access control, and the ways that those factor into the compliance and regulation aspects?
[00:09:24] Unknown:
Yeah. So I think you can think about this in two main layers or levels. The first one is more high level, and that's basically, I would say, organizations have to do the right thing. And I don't think it's hard to know what the right thing is. So it starts by implementing the table stakes of data security. As I said before, all the vendors today provide data encryption at rest. They provide encryption of data in transit. They provide pretty good basic security controls for managing access to data. So I think the first thing that organizations have to implement are all of these pretty basic controls. And most of them, they get out of the box. And then they have to take into account why they're doing what they're doing with the data and whether that's aligned with what they're allowed to do and their purpose. I don't think the regulation is that complex.
I think it aims to protect data subjects from their data being misused. And I think organizations that want to do good and want to use the data in a responsible way can really do that, as long as they match the purpose of why they're doing things to the appropriate control. So for example, if I'm in customer success and I don't have to see all of your personal data, or need just some of it, that's something that companies need to take into account and implement. I think where the challenge is is how easy it is to go and implement those more complex controls. That's where there's a gap between what the technology provides and what organizations need in order to be able to move fast with their data, but still remain compliant and secure and responsible.
[00:11:24] Unknown:
The other interesting element of this is that there's always the gradation of how much of the problem is technical versus how much of it is organizational, and trying to map some of these technical concerns around the business requirements and the requirements of being able to ensure that you're not blocking progress and productivity for the people who are supposed to be just getting their work done. And I'm curious what you see as some of the areas of trade-off or some of the areas where data security needs to have some measure of compromise in the interest of ensuring that the business is able to have the access that they need and be able to get their work done. Because if it's just a technical problem, then, yeah, we can add all the security we want, and it'll be perfect. Nobody will access anything. There are no questions about data leakage, but then it's no use to anybody. So I'm curious what are some of the gray areas of compromise that organizations need to work through on their own?
[00:12:24] Unknown:
Yeah. So I think it's a great question. I think the root cause is technology. And I can talk all day about why the technology is lacking. But what it creates is an organizational problem. It creates a procedural problem because the technology to secure access to data, or data in general, is lacking the ability to be rolled out in a safe and easy way. So the technical problem creates a change management problem for organizations. As I said, some of the controls are already there. But as you said, you can implement them, but then no one is gonna have access. So I think the thing to crack here is how organizations can roll out these controls in a way that doesn't, you know, stop everything they have to do with data, and they have to rethink it. And it's almost like replacing the tires on your car while driving at 60 miles per hour. Right? Data is used all the time, like millions or billions of queries a day in any sizable organization.
The big question is how do you roll out these controls without stopping those businesses from doing what they need to do? That's where innovation has to come and help us, and that's what we focus on.
[00:13:50] Unknown:
And as a vendor in the space of data security and somebody who's working very closely with some of these organizations to help them figure out how to manage that balance, what do you see as some of the broad categories and the effective boundary lines for those different elements of data security that we were discussing earlier, where there are questions of the database security of who has permissions on what objects versus, you know, where you're spanning across multiple different discrete storage locations or compute locations, and you need to figure out what are the RBAC or ABAC policies, where and how do I need to apply masking, and just some of the ways to think about how to bucket those concerns so that you don't lose your mind trying to figure out an end-to-end solution for everything all at once.
[00:14:42] Unknown:
Yeah. So that's actually one of the problems that data security suffers from today: despite being an age-old practice, there's still no playbook. In other areas of computing, in IT or security, you have playbooks. Everyone knows that they need single sign-on. That's the way to manage access to applications at scale. Everyone knows they need to protect their websites from all these different types of attacks. In the data security space, we see this playbook emerging as we speak.
Gartner has a very good perspective on this. They actually coined a term called data security platform. Data security platforms are solutions that bring together several of these currently somewhat disparate capabilities in the area of data security into a single product. So that's good. If 5 years ago you had to buy 5 products to do 5 different things, today you can get one product and get these 5 integrated into one solution. And usually, these products, data security platforms, don't support just one type of, you know, data platform vendor. They support multiple. So that's definitely a bright spot in how this space is evolving. I think that what Gartner suggests, and it also follows the zero trust architecture concepts, is that it's best to focus on implementing late binding controls.
And what I mean by late binding controls is controls that come into effect, that are enforced or applied, as late in the data access lifecycle as possible. So to give you an example, think about static masking versus dynamic masking. Static masking is an early binding control. The control is implemented, is enforced, way before anyone is even accessing the data, as opposed to dynamic masking, which is a control that is enforced when someone is accessing the data. Obviously, dynamic masking is much more flexible and appropriate for today's environment than static masking because it doesn't require a lot of up-front investment in going and creating copies of your data and masking all the data up front, which could be a lot of data. And because use cases change, because access patterns change, dynamic masking is more appropriate because it's more flexible.
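As a rough illustration of the static versus dynamic masking distinction drawn here, the Python sketch below contrasts the two. The record fields, the role names, and the `mask_email` helper are hypothetical and chosen only for illustration; this is a minimal sketch of the general idea, not how Satori or any particular product implements masking.

```python
from copy import deepcopy

# Hypothetical source records; field names are illustrative only.
CUSTOMERS = [
    {"id": 1, "email": "ana@example.com", "plan": "pro"},
    {"id": 2, "email": "raj@example.com", "plan": "free"},
]


def mask_email(value: str) -> str:
    """Hide the local part of an email address, keeping the domain."""
    _, _, domain = value.partition("@")
    return "***@" + domain


def static_mask(rows):
    """Early binding: build a masked copy up front, before anyone queries it."""
    masked = deepcopy(rows)
    for row in masked:
        row["email"] = mask_email(row["email"])
    return masked


def dynamic_read(rows, role: str):
    """Late binding: decide per request, based on who is asking."""
    for row in rows:
        if role == "support_admin":  # hypothetical role allowed to see raw values
            yield dict(row)
        else:
            yield {**row, "email": mask_email(row["email"])}


# Static masking produces a second dataset that has to be stored and refreshed.
masked_copy = static_mask(CUSTOMERS)

# Dynamic masking reads the single source and applies the rule at access time.
analyst_view = list(dynamic_read(CUSTOMERS, "analyst"))
admin_view = list(dynamic_read(CUSTOMERS, "support_admin"))
```

In the dynamic case, supporting a new use case means changing the rule inside `dynamic_read`, while the static case would need another masked copy of the data.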
You can create new dynamic masking rules to fit your new use cases rather than go and create more copies of your data with static masking. And I think the same applies to role-based access control and attribute-based access control, RBAC and ABAC for short. RBAC is more: let's plan everything up front. Let's create our roles. Let's grant these roles to our users. And then, you know, by my birthright of having this role, I'm going to have different access levels to data. Whereas attribute-based access control, and, you know, if you take it to the extreme, zero trust architecture, basically says that there's no birthright access. Your level of access will be determined by different attributes that will be evaluated when you access data.
And, you know, you can take into account a lot of different attributes. It could be the department in which I work in the company. It could be my office location. It could be which network I'm using to access data or which client tool I'm using to access data. You can also have behavioral aspects. If you have, you know, a flexible enough policy engine, you can say, well, if this user has been consuming a lot of sensitive data in this session, maybe it's a good idea to validate that user's authenticity or raise an alert or something like that. All of these things are more late binding than early binding. And I think when companies start thinking about how to build their data security program, that's where they need to focus: on those late binding controls.
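To make the RBAC versus ABAC contrast above a little more concrete, here is a minimal Python sketch. Every name in it (the attribute fields, the `corp_vpn` network label, the approved tool list, the 10,000-row threshold, the `step_up_auth` outcome) is an assumption made up for illustration; it does not describe any specific policy engine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessRequest:
    """Attributes gathered at the moment of access (all names are hypothetical)."""
    department: str
    network: str                    # e.g. "corp_vpn" or "public"
    client_tool: str
    sensitive_rows_in_session: int


# RBAC-style early binding: access is fixed by a role granted ahead of time.
ROLE_GRANTS = {"analyst": {"sales.orders"}}

# Hypothetical list of client tools the organization has approved.
APPROVED_TOOLS = {"bi_tool", "sql_ide"}


def rbac_allows(role: str, asset: str) -> bool:
    """The decision depends only on the pre-assigned role."""
    return asset in ROLE_GRANTS.get(role, set())


def abac_decision(req: AccessRequest, asset: str) -> str:
    """ABAC-style late binding: evaluate attributes on every request."""
    if req.network != "corp_vpn" or req.client_tool not in APPROVED_TOOLS:
        return "deny"
    if asset.startswith("hr.") and req.department != "hr":
        return "deny"
    # Behavioral attribute: heavy use of sensitive data triggers a re-check.
    if req.sensitive_rows_in_session > 10_000:
        return "step_up_auth"
    return "allow"


heavy_user = AccessRequest("finance", "corp_vpn", "bi_tool", 25_000)
decision = abac_decision(heavy_user, "sales.orders")
```

The point of the sketch is only that `abac_decision` can return a different answer for the same user from one request to the next, while `rbac_allows` cannot.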
[00:19:02] Unknown:
Another interesting element of the security problem is we're talking about data security platforms, but security has also been a concern for IT and application systems for many years as well. And I'm curious what you see as some of the reasons that those two have largely evolved along separate tracks and what you see as some of the potentials for unification of those concerns and some of the ways that they are distinct problems that need to be solved in their own way?
[00:19:34] Unknown:
So despite data or securing data being a very hard problem, I think it has more chance of being solved universally than application access. And the reason I say this is because if you think about two different applications, they can have totally different domains of knowledge and domains of expertise. Like one application could be a financial application and another could be an HR application. Controlling access to both of these applications in a unified way is a bit impossible because they don't have the same objects.
They don't have the same concepts. However, when you think about data, data is actually more organized and it's more uniform. And yes, you have a Snowflake table and you have an S3 bucket and you have, like, a collection in MongoDB. But you can derive similarities between these different data assets, and also the types of operations you would want to perform on these data assets are somewhat similar. You can update a table, you can modify a collection, you can write a file into a cloud storage bucket. So with data, there's actually a chance of building this universal language, this universal set of concepts that would be applied to many, many data systems, maybe all of them or almost all of them. And so that's where I think there's an opportunity for data security platforms to play a role. Whereas if you think about the application space, the most advanced point we got to is just doing authentication, and authorization is largely kept at the application level. So if you think about any employee in any, you know, reasonable company today, they have, like, a single sign-on solution, could be, like, Azure AD or Okta or all of these systems. Basically, what these systems do is they do authentication into applications. And they might transfer some data about the user to the application, but then the application has its own security engine, its own authorization engine to, like, control what actions that user can do on its objects. With data, because you can derive these similarities, there's a real chance here to unify the space. And that's what gets me excited because I'm an infrastructure guy. I like things to be organized.
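The idea of a shared vocabulary across data systems can be sketched roughly as follows in Python. The `DataAsset` and `Operation` types, the asset names, and the mapping table are assumptions invented for this illustration; the native action names are only examples of the kind of platform-specific operations that would be normalized.

```python
from dataclasses import dataclass
from enum import Enum


class Operation(Enum):
    READ = "read"
    WRITE = "write"


@dataclass(frozen=True)
class DataAsset:
    platform: str   # "snowflake", "s3", "mongodb", ...
    locator: str    # platform-specific name, kept opaque here


# Illustrative mapping from platform-native actions to the shared vocabulary.
NATIVE_TO_COMMON = {
    ("snowflake", "SELECT"): Operation.READ,
    ("snowflake", "UPDATE"): Operation.WRITE,
    ("s3", "GetObject"): Operation.READ,
    ("s3", "PutObject"): Operation.WRITE,
    ("mongodb", "find"): Operation.READ,
    ("mongodb", "updateMany"): Operation.WRITE,
}

# One policy, written once, that applies regardless of the platform.
READ_ONLY_ASSETS = {
    DataAsset("s3", "finance-exports"),
    DataAsset("snowflake", "SALES.ORDERS"),
}


def normalize(platform: str, native_action: str) -> Operation:
    """Translate a platform-specific action into the common operation."""
    return NATIVE_TO_COMMON[(platform, native_action)]


def is_permitted(asset: DataAsset, native_action: str) -> bool:
    """Reads are always allowed here; writes are blocked on read-only assets."""
    op = normalize(asset.platform, native_action)
    return op is Operation.READ or asset not in READ_ONLY_ASSETS
```

Because the policy check only sees `DataAsset` and `Operation`, supporting another platform in this sketch means adding rows to the mapping rather than writing a new policy.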
I want things to be as uniform as possible because that's where we derive efficiencies. That's where companies can move faster with the data that they have.
[00:22:25] Unknown:
And another thread that we've started to touch on is the question of productivity in working with data and some of the ways that it can start to become at odds with data security controls. And I'm curious how you have seen that manifest both in your own work and in some of the ways that you are working with some of your customers at Satori, and some of the ways that that tension can lead to bad data practices?
[00:22:49] Unknown:
Yeah. So we have one of our customers. It's the biggest fintech in Canada. One of the folks that we work with there has this saying that you need to make the secure way the easy way to use data. Otherwise, folks will just try to go around it and, you know, try to avoid using the secure way or just get stuck. So there's definitely a trade-off between productivity and security if you don't use the right technology. And our goal is to eliminate that trade-off. We strongly believe that companies can be both secure and productive. And I'll tell you another story from one of our other customers, another well-known SaaS technology provider. Before using Satori, they told us that their Slack channel with the DevOps team, who manage the access to the hundreds and hundreds of different databases they have on Amazon, was filled all day long with requests.
Hey, can you grant me access to this database because I need to troubleshoot this ticket in Jira? Hey, can you open access for me to this system? So that company, they process a lot of sensitive data. They're very responsible in how they, you know, see their role as processors and custodians of that data, but they didn't have the right technology to secure it. So they were suffering from a severe productivity problem. When they started using Satori, we helped them automate all of those manual access requests and implementations of these requests into something that is very convenient for their data consumers to use. You know, they have a Slack app they can request access from or even grant themselves access in a self-service way as long as their purpose for accessing the data is legit. And the way they mitigate that is by having Satori dynamically mask sensitive data, for example.
So we help them, like, create these really simple yet very powerful workflows that help them both protect the data and get the right data to the right people immediately, just in time, and, you know, not have them wait for data. And that was a really big win for both them and us, seeing that happen.
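A self-service, just-in-time access flow like the one described in this story could look roughly like the Python sketch below. This is not Satori's implementation or API; the `Grant` type, the list of qualifying purposes, the four-hour default lifetime, and the `masked` flag are all assumptions introduced for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Grant:
    user: str
    dataset: str
    purpose: str
    expires_at: datetime
    masked: bool


# Hypothetical purposes that qualify for self-service approval.
SELF_SERVICE_PURPOSES = {"troubleshoot_ticket", "incident_response"}


def request_access(
    user: str,
    dataset: str,
    purpose: str,
    now: datetime,
    ttl: timedelta = timedelta(hours=4),
) -> Optional[Grant]:
    """Return a time-boxed grant, or None if a human needs to review the request."""
    if purpose not in SELF_SERVICE_PURPOSES:
        return None
    # Self-service grants only expose masked data, which limits the risk
    # of approving the request without a manual review.
    return Grant(user, dataset, purpose, expires_at=now + ttl, masked=True)


def is_active(grant: Grant, now: datetime) -> bool:
    """A grant stops working on its own once its lifetime has passed."""
    return now < grant.expires_at


start = datetime(2023, 3, 1, 9, 0, tzinfo=timezone.utc)
grant = request_access("dana", "billing_db", "troubleshoot_ticket", start)
needs_review = request_access("dana", "billing_db", "curiosity", start)
```

The expiry is what makes the access "just in time" in this sketch: nobody has to remember to revoke it, and nobody has to wait in a chat channel for someone to grant it.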
[00:25:26] Unknown:
And you mentioned that the goal is to make the secure way the easy way, and I'm wondering if you can talk a bit more about some of the ways that education about the motivation behind some of the security controls can help to encourage people who maybe hit a little bit of friction and just say, ah, I just wanna go and, you know, directly access the database instead of going through this proxy or going through this procedure to get the appropriate access. What is the role that education plays in that overall environment of making the secure way the easy way, but also encouraging people to understand what is the secure way and why does it matter?
[00:26:09] Unknown:
Yeah. I think it's a great question, because like any other area involving security, awareness is key in mitigating the risk. Unless people are aware of the risks to the company, and to themselves, of not using the tools and information that they have been given by the company in a responsible way, I would argue that it would be hard to find a security system that would still be effective if people are completely irresponsible. It's always us humans who are the weakest link. So I think, first of all, you have to educate your employees on what this data is that we're collecting.
Is it patient information? Is it personal information? Is it financial information? And talk to them about what could happen, not just to the company, but also to the people that data belongs to. What can happen to these people if we're irresponsible in handling the data? We can talk about identity theft. It could be insider trading. It could be health-related issues. It could be fraud. It could be a lot of bad things. One thing that hasn't changed is that bad actors will always try to leverage information, and will always try to leverage companies and systems, to gain financial advantage. And all that sensitive data that a lot of companies collect today has a lot of value in some markets.
And so I think talking to people about what's right and what's wrong is very important, because you can do all sorts of things to get access to sensitive data if you really want to. Or maybe you do things the right way, but then you're a bit careless with other things. So I think it's very important to have these conversations. Make sure your employees or data consumers are aware, and that they act in a responsible way with the data and the tools that they've been given.
[00:28:26] Unknown:
And another way that security can become challenging is when you try to bolt it on to an existing system or implement it as an afterthought. I'm wondering what you have seen as some of the impact of incorporating security in the early design and implementation phases of a platform, versus just focusing on the functional aspects of, I need to be able to get data from here to here, process it this way, and then send it over to this other place, and then saying, okay, now I need to figure out the security protocols around these data flows. The alternative is saying up front, okay, I need to be able to get this data from here to here. How do I make sure that it is properly secured? How do I make sure that I'm only pulling the data that I want into this other system, versus all of the data that might have PII or other sensitive information?
How do I process it in a way that's cognizant of which fields have that PII data? How do I make sure that I'm only exposing the attributes of this data, or only processing the aggregate attributes of this data, in a way that I'm not going to be violating any compliance, data governance, or policy requirements? And what is the overall impact on the effectiveness of those security controls when they are designed up front versus bolted on afterwards, and the impact that it can have on the total delivery time of a given data project?
[00:29:47] Unknown:
Yeah. So I think it's obviously better to plan more up front than to have security as an afterthought. The way I think that principle needs to be applied in the modern data infrastructure is not necessarily by baking all of the security concerns into your data engineering concerns. I think the best way to address that is to have a component in your data stack that is responsible for these data security aspects across the board, because your data intake is gonna change. You might introduce more data platforms or tools into your environment.
And having something that is decoupled from the data layer, something that can adjust to your changing needs, I think is the better way of implementing what you mentioned, which is planning up front. If you don't do that, it also puts a lot of dependency on your data teams, because they are the ones who would have to go and implement all of these controls in all of these different systems. They're gonna spend a lot of time on that, which is gonna take them away from their core activity, which is to generate more data products and deliver more data to more people. So I do think that it's a big challenge, and that's why companies like us are trying to offer this alternative and decouple those security concerns from the actual data layer.
[00:31:26] Unknown:
Join in with the event for the global data community, Data Council Austin. From March 28th to 30th, 2023, they'll play host to hundreds of attendees, 100 top speakers, and dozens of startups that are advancing data science, engineering, and AI. Data Council attendees are amazing founders, data scientists, lead engineers, CTOs, heads of data, investors, and community organizers who are all working together to build the future of data. As a listener to the Data Engineering Podcast, you can get a special discount of 20% off your ticket by using the promo code dataengpod20. Don't miss out on their only event this year. Visit dataengineeringpodcast.com/data-council today.
And as far as the ways that data teams are thinking about security controls and data privacy, what have you seen as some of the notable shifts or evolutions in the space from when you first started, I think it was about 3 years ago, to where you are today? And how have the overall visibility and understanding changed around the challenges involved, how to address them, and what controls are necessary to ensure that data is being used appropriately?
[00:32:41] Unknown:
Yeah. So a lot has changed in the past 3 years. I think when we started having conversations with data-driven organizations, it was more educational. They felt like they needed to understand and learn from us what the best practices are, what they need to take care of, and what's the most important thing to take care of. If you look at the conversations we have today, driven by the maturity of the market and the growing complexity, companies are much more educated. They understand that the first thing they need to take care of is the access piece.
And what I call the access piece is who can access what data and how, and making sure that their admins, DBAs, and data engineers do not have to get involved in that on a daily basis. So that's the first thing that they know they need to take care of. And on top of that, it's how can we protect that sensitive data, and deliver more data as we desensitize it. So, for example, I wanna give very broad access to a sensitive dataset as long as the sensitive data is being properly handled, for example, dynamically masked. But then, if someone needs access to the actual sensitive data, let's have them go through an approval process with the data owner to make sure that they're doing it for the right purpose and that they check all the boxes. So, yeah, I think companies today realize that they need this component.
Data teams that we talk to suffer from being overloaded with managing access. They have different workarounds and different systems. And one of the funny things that I didn't expect coming into this space, but learned by operating in it, is that when you talk about really basic security controls or paradigms, let's say role-based access control, they're implemented completely differently in different systems. Snowflake has RBAC, BigQuery has RBAC, other systems have RBAC. These are completely different implementations. It's very hard to derive similarities between these implementations.
And, you know, guess what? Data teams then have to go and deal with all that complexity about how a certain engineer 7 years ago at a certain company decided to implement what they thought is RBAC. And it's obviously not aligned with what the other engineer at that other company decided to do 8 years ago, which sounds the same but is actually very different. So that's a funny little nugget.
[00:35:44] Unknown:
Absolutely. Yeah. It's definitely funny how the same word or the same term can come to mean so many different things as you span engineering organizations, and particularly generational technology shifts. And then another aspect of the data security industry that's interesting over recent years is that it has grown to encapsulate more considerations. As we were saying, it was just, can you access the data warehouse and what tables can you access? And now it's also all the data masking and auditing. And I'm wondering what impact the recent growth in metadata systems has had on the capabilities and effectiveness of security controls, where you are more able to identify at ingest that this field has personal address data, and then propagate that labeling throughout the different transformations and stages of the data life cycle so that you can maintain visibility of which fields you need to be concerned with from a security and governance perspective and which fields are in the clear and don't need to be worried about, and some of the ways that those metadata and lineage views are being integrated into the security and compliance regime.
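The idea of labeling a field at ingest and carrying that label through downstream transformations can be sketched as a walk over column-level lineage. This is only a minimal illustration: the column names, the `pii:address` tag, and the shape of the `lineage` mapping are assumptions, not the format of any particular metadata catalog.

```python
from collections import deque


def propagate_labels(lineage, source_labels):
    """Push sensitivity labels from labeled columns to everything derived from them.

    lineage maps an upstream column name to the downstream columns built from it.
    source_labels maps a column name to the set of labels assigned at ingest.
    """
    labels = {col: set(tags) for col, tags in source_labels.items()}
    queue = deque(source_labels)
    while queue:
        col = queue.popleft()
        for child in lineage.get(col, []):
            child_labels = labels.setdefault(child, set())
            new_tags = labels[col] - child_labels
            if new_tags:
                # Only revisit a column when it gained a label, so cycles terminate.
                child_labels |= new_tags
                queue.append(child)
    return labels


# Hypothetical lineage: a raw address column flows into staging and then a mart.
lineage = {
    "raw.users.address": ["staging.users.address"],
    "staging.users.address": ["marts.shipping.destination"],
    "raw.users.id": ["staging.users.id"],
}
labels = propagate_labels(lineage, {"raw.users.address": {"pii:address"}})
```

With labels propagated this way, a security control only needs to check the label set of the column being read, rather than re-classifying data at every stage.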
[00:37:07] Unknown:
So we definitely have seen advances in metadata management in recent years. 3 years ago, if I were to ask a 2,000- or 1,000-person organization whether they were thinking about implementing a data catalog, they would say, no, that's not for us. It's only for really big companies, and you need a 10-person team to manage that, and that's not something we're gonna do. Now you see more and more companies adopting metadata management solutions for different purposes. Some just want to have really good documentation on their datasets so folks can know what they're accessing.
Some are more interested, as you mentioned, in being able to classify the data that they have and understand where they have sensitive data. The higher the quality of the metadata that's available in the environment, the better security tools can become. Obviously, there's no dependency between implementing a security tool and a metadata solution. Most security tools know how to create their own metadata, classify data themselves, and so on, without having you supply that externally. But if that's available, it's a plus. Also, from a workflow perspective, metadata solutions typically don't get into managing access to data, although that's shifting a bit.
They feel like they do want to be involved in more parts of the data life cycle than just the documentation piece. You also see an interesting subspace in the data security market called data security posture management, which is a very hot topic right now in our industry. It's aimed more at security leaders than at data leaders and data teams. And it's all about, hey, we need to understand where our data assets are. What types of databases do we have? Are we properly securing these databases at the infrastructure level? Do we have the encryption checkbox turned on, and so on? So that's something that you see many players addressing. It's not a big problem to solve, but if I were a consultant going into a company today and trying to help them improve their data security, that's the first thing I would do. I would go and understand where the data is, what types of platforms it's on, whether it's sensitive or not, and whether the table stakes, the checklist, are covered.
[00:39:47] Unknown:
And in your experience of working in the space and working with customers, what are some of the most interesting or innovative or unexpected ways that you've seen teams approach the challenge of data security and aligning that with the productivity needs of the organization?
[00:40:03] Unknown:
So I have a really good example from one of our customers. I think it was last week or the week before that I had a conversation with them, and they told me about their use case and what they're doing. It's a US-based technology company that is processing a lot of patient data from really big US-based hospitals. They have hundreds of databases on prem. They have twice as many databases in the cloud, and they have to manage access to all of these databases in something that would be considered a uniform way. Otherwise, it's very hard to stay both productive and secure. So what they actually built was this system on top of Satori that meets their users in their existing business flows.
And what I mean by that is, for example, let's say one of their customers opens a support ticket, and that support ticket gets assigned to a support engineer in Salesforce. They implemented a hook into Salesforce that automatically grants the support engineer access to the relevant data for that specific customer, assuming they would need that level of access to go and troubleshoot the issue. That happens automatically as the ticket gets assigned. If the ticket gets assigned to someone else, the original engineer loses their access and the new person gets access. And when that ticket is resolved, they all lose their access. So if you think about this, they didn't really have to go and modify how they work with data. They didn't have to implement a lot of new business processes for their employees. Obviously, they invested a lot in their back end to make this happen, and we helped them with the technology and the tooling around that. But they effectively eliminated the risk around overprivileged access to patient information from their users because they implemented this flow, and they're as productive as they were before while taking on much, much less risk.
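The ticket-driven access lifecycle described here (grant on assignment, move on reassignment, revoke on resolution) can be sketched in a few lines. This is a simplified illustration, not the customer's system or a Salesforce or Satori API; `TicketAccessManager` and its method names are hypothetical, and a real hook would call the access platform instead of updating an in-memory dictionary.

```python
class TicketAccessManager:
    """Tie data access for one customer's records to the life of a support ticket."""

    def __init__(self):
        # ticket_id -> (engineer, customer_id) for every open, assigned ticket
        self._grants = {}

    def on_assigned(self, ticket_id, engineer, customer_id):
        # Overwriting the entry drops the previous assignee's grant for this ticket.
        self._grants[ticket_id] = (engineer, customer_id)

    def on_resolved(self, ticket_id):
        # Resolving the ticket removes whatever access it was justifying.
        self._grants.pop(ticket_id, None)

    def can_access(self, engineer, customer_id):
        # Access exists only while some open ticket links this engineer to this customer.
        return (engineer, customer_id) in self._grants.values()


manager = TicketAccessManager()
manager.on_assigned("T-1", "alice", "acme-hospital")
alice_while_assigned = manager.can_access("alice", "acme-hospital")
manager.on_assigned("T-1", "bob", "acme-hospital")
alice_after_reassignment = manager.can_access("alice", "acme-hospital")
bob_after_reassignment = manager.can_access("bob", "acme-hospital")
manager.on_resolved("T-1")
bob_after_resolution = manager.can_access("bob", "acme-hospital")
```

Because access is derived from the set of open tickets rather than granted as a standing permission, there is nothing left over to clean up once the work is done.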
[00:42:17] Unknown:
And in your experience of building Satori and working in this space of data security and the technology controls around it, what are some of the most interesting
[00:42:30] Unknown:
security controls, which I mentioned before, and how things that you would expect to be quite similar are actually quite different even though they have the same name. I think one of the things that keeps on surprising me, and maybe it shouldn't surprise me because it's human nature, is how data teams prefer to be self-sufficient and independent as they implement their projects. And I'll give you an example. We have this attribute-based access control feature where you can put attributes on users and use them in your policies. Let's say that I have an attribute that says that I'm part of the Israeli office, and then I get access to a certain database because of that. And you would think that the source of these user attributes would be your identity system, the system of record where everything about your users and employees is being managed.
However, we hear from data teams that they don't necessarily have access to, or cooperation from, the teams managing those identity systems to go and implement all the things that they need. And sometimes they get pushback. Like, we don't want to put these attributes in our system because it's not aligned with how we see the world, and it's not an identity concern. You go and figure this out. And so, in some cases, they come to us and ask, hey, can you help us manage these attributes on the Satori side? Because we don't have anywhere else to go.
We're being blocked. So that keeps on surprising me. When we started out, I mistakenly thought that the world was going to be very organized, with identity in identity systems and metadata in metadata systems. But I think it's a bit messier. And, obviously, there are the human and organizational aspects that come into play and require capabilities that I didn't think we had to build, but we're building. So that's good.
[00:44:32] Unknown:
Yeah. And then you run into the situation where the identity and the metadata system is all just an Excel file somewhere.
[00:44:44] Unknown:
I ran into that last week. Exactly. Exactly that.
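The attribute-based access control example from this exchange, including the case where the data team manages some attributes itself because the identity system will not carry them, can be sketched as follows. The function names, attribute keys, and policy shape are assumptions for illustration only, and letting the identity system win on conflicts is one possible design choice rather than a description of how any product behaves.

```python
def effective_attributes(idp_attrs, local_attrs):
    """Combine identity-provider attributes with attributes managed by the data team.

    The identity provider is treated as authoritative when both sides define a key.
    """
    merged = dict(local_attrs)
    merged.update(idp_attrs)
    return merged


def is_allowed(user_attrs, policy):
    """A policy maps an attribute name to the set of values that satisfy it."""
    return all(user_attrs.get(name) in allowed for name, allowed in policy.items())


# Hypothetical user: the identity system knows the department, while the
# office attribute is maintained locally because the identity team declined it.
idp_attrs = {"department": "analytics"}
local_attrs = {"office": "israel"}
policy = {"office": {"israel"}}

with_local = is_allowed(effective_attributes(idp_attrs, local_attrs), policy)
idp_only = is_allowed(idp_attrs, policy)
```

Here the policy is only satisfied once the locally managed attribute is included, which is the gap the data teams in the story were asking to have filled.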
[00:44:49] Unknown:
And from your perspective as somebody who is building a data security platform, I'm wondering what you see as some of the areas of improvement that the industry needs to focus on and some of the ways that we can help to unify these concerns and make it less of a point to point to point to point solution?
[00:45:09] Unknown:
Yeah. So we've made a lot of progress as an industry, and I think that things like managing permissions to data, with the help of data security platforms, obviously, and things like dynamic masking are largely being handled in a pretty good way for a good share of the data platforms and use cases out there. One of the things that we hear from more forward-thinking organizations is encryption of data at rest, but not at the infrastructure level, meaning not at the disk level, but at the application level, or sometimes they call it client-side encryption. What it basically means is that, let's say I pick any SaaS-based data platform vendor today.
They offer encryption. But if I don't want to trust them with access to my sensitive data, then I would have to go and encrypt it on my side before I load it into their system, so that if they somehow get breached, or accidentally or otherwise have the option of accessing my data, it stays protected. It's a pretty advanced use case. And while the technology around encryption is super advanced, the technology around integrating encryption into the data layer is still very much in its infancy. It's still very complex, it slows things down, and that's something that I think will be a focus area. Well, not now, but maybe in a few years. But we do hear from the larger financial institutions and more forward-looking companies that that's something that they would like to do.
[00:47:00] Unknown:
Are there any other aspects of the overall practice of data security and data security platforms and some of the ways that they factor into productivity of data teams and organizations that we didn't discuss yet that you would like to cover before we close out the show?
[00:47:16] Unknown:
I think the main takeaway for our audience, with all these different options on how to secure data, is to think about this concept of early binding and late binding, and the fact that securing the point of access to the data is the most effective and flexible way of solving the problem. It looks like this is the playbook that's commonly being developed in the industry. There are some really good Gartner write-ups about data security platforms that I encourage you to check out to educate yourself about this new component that is materializing in the modern data stack. There are common problems and common solutions out there, so do your research. And I think the space today is much more advanced than it was a few years ago, and you don't have to solve everything by yourself.
Not everything has to be solved in SQL in some view or a database. There are better tools today.
[00:48:27] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:48:43] Unknown:
So the biggest gap, I think, is integrating all of these things together. You see a lot of vendors in the data space suggesting and promoting different solutions to the same problems. There's no standardization. I like to talk about the application space, where, as I said, there's not much overlap, but you do have some areas of overlap, like single sign-on or other types of protocols. That's not something we see in the data space so much. I haven't seen a unified way of managing access. I haven't seen unified security protocols.
I think that's a really big gap in data management when you talk about data security. And I think the industry's leaders should really step up and work together on building these protocols, and not just compete on price, performance, and features, and help data teams improve the interoperability of the different systems that they're using. Because at the end of the day, and I think most vendors understand this, there's no one size fits all. Companies will always want to select best of breed. But then they want all these things to be reasonably integrated and working well together so they don't suffer from productivity losses or security issues or other negative consequences of their choices.
[00:50:24] Unknown:
Absolutely. Well, thank you very much for taking the time today to join me and share the work that you're doing at Satori and your perspective on the overall space of data security and some of the ways that it can be aligned with, and not at odds with, productivity. So I appreciate all of the time and energy that you're putting into making that a tractable problem, and I hope you enjoy the rest of your day.
[00:50:44] Unknown:
Thank you so much. Thanks for having me, and you too. Have a great rest of your day.
[00:50:57] Unknown:
Thank you for listening. Don't forget to check out our other shows, Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story.
And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Guest Introduction
Yoav Cohen's Background in Data Security
Understanding Data Security Concerns
Evolution of Data Security Challenges
Regulatory Environment and Data Security
Balancing Security and Productivity
Data Security Platforms and Late Binding Controls
Unifying Data and Application Security
Productivity vs. Security in Data Management
Educating Teams on Data Security
Incorporating Security in Early Design
Shifts in Data Security Practices
Impact of Metadata Systems on Security
Innovative Approaches to Data Security
Future of Data Security Platforms
Key Takeaways and Closing Thoughts