Summary
One of the most challenging aspects of building a data platform has nothing to do with pipelines and transformations. If you are putting your workflows into production, then you need to consider how you are going to implement data security, including access controls and auditing. Different databases and storage systems all have their own method of restricting access, and they are not all compatible with each other. To simplify the process of securing your data in the cloud, Manav Mital created Cyral to provide a way of enforcing security as code. In this episode he explains how the system is architected, how it can help you enforce compliance, and what is involved in getting it integrated with your existing systems. This was a good conversation about an aspect of data management that is too often left as an afterthought.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- What are the pieces of advice that you wish you had received early in your data engineering career? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Are you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy data analytics in the cloud. Our comprehensive data-level security, auditing and de-identification features eliminate the need for time-consuming manual processes and our focus on data and compliance team collaboration empowers you to deliver quick and valuable data analytics on the most sensitive data to unlock the full potential of your cloud data platforms. Learn how we streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta.
- Today’s episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. Datadog uses machine-learning based algorithms to detect errors and anomalies across your entire stack—which reduces the time it takes to detect and address outages and helps promote collaboration between Data Engineering, Operations, and the rest of the company. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial. If you start a trial and install Datadog’s agent, Datadog will send you a free T-shirt.
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
- Your host is Tobias Macey and today I’m interviewing Manav Mital about the challenges involved in securing your data and the work that he is doing at Cyral to help address those problems.
Interview
- Introduction
- How did you get involved in the area of data management?
- What is Cyral and what motivated you to build a business focused on addressing data security in the cloud?
- Can you start by giving an overview of some of the common security issues that occur when working with data?
- What new security challenges are introduced by building data platforms in public cloud environments?
- What are the organizational roles that are typically responsible for managing security and access control to data sources and repositories?
- What are the tensions, technical or organizational, that lead to a problematic or incomplete security posture?
- What are the differences in security requirements and implementation complexity between software applications and data systems?
- What are the data systems that Cyral integrates with?
- How did you determine what platforms to prioritize?
- How does Cyral integrate into the toolchains used to deploy, maintain, and upgrade an organization’s data infrastructure?
- How does the Cyral platform address security and access control of data across an organization’s infrastructure?
- How are schema changes handled when using Cyral to enforce access control to PII or other attributes?
- How does Cyral help with reducing sprawl of data across unmonitored systems?
- What are some of the most interesting, unexpected, or challenging lessons that you learned while building Cyral?
- When is Cyral the wrong choice?
- What do you have planned for the future of the Cyral platform?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Are you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy data analytics in the cloud. Their comprehensive data-level security, auditing, and de-identification features eliminate the need for time-consuming manual processes, and their focus on data and compliance team collaboration empowers you to deliver quick and valuable data analytics on the most sensitive data to unlock the full potential of your cloud data platforms. Learn how they streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta, that's I-M-M-U-T-A, and get a 14-day free trial. And when you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster.
With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today and get a $60 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey. And today, I'm interviewing Manav Mital about the challenges involved in securing your data and the work that he is doing at Cyral to help address those problems. So, Manav, can you start by introducing yourself?
[00:01:54] Unknown:
Hi. I'm Manav Mital, the founder and CEO of Cyral. Cyral is the first-of-its-kind Cloud-native solution that makes it easy to observe, control, and protect the Data Cloud. Cyral intercepts all activity across all your data repositories, applies granular access controls, and stops anomalous behavior, all with zero impact on performance.
[00:02:21] Unknown:
And do you remember how you first got involved in the area of data management?
[00:02:24] Unknown:
Oh, yeah. Of course. So there was an early big data startup back in the day, which I joined in 2008 shortly after graduating from school. The name of the company was Aster Data. It was eventually acquired by Teradata, and that's where I met my cofounder and CTO of Cyral, Srini. So that's where both of us got our start in data management, databases, data warehousing. And then after Aster was acquired, we followed different paths, but we landed together again here at Cyral, protecting data management systems this time.
[00:03:01] Unknown:
And so you mentioned that at Cyral, you're working to help with managing security and observability of data resources in cloud environments. I'm wondering if you can give a bit more of a description of what you're building there and what motivated you to build a business focused on those aspects of data security in the cloud specifically.
[00:03:20] Unknown:
Sure thing. So what we've seen in the last few years that prompted us to build Cyral, Tobias, was the emergence of this Data Cloud. All the operational data, all the business intelligence data, the system-of-record type data that companies used to host in their databases, their data pipelines, their data warehouses: we are seeing all of this data move to the Cloud, where it now lives in services like Snowflake, BigQuery, Redshift, and S3, gets analyzed using tools like Looker and Tableau, and gets processed using tools like Fivetran, Databricks, Kafka, etcetera. And these are all now different third-party SaaS services where the crown jewels of every organization live.
This adoption of the Data Cloud has made things very simple, very manageable, very agile for engineering teams, data engineering teams, DevOps teams. Businesses benefit from it in lots of different ways. However, it has really complicated things for security teams. We saw an opportunity to build a tool that really made it easy and simple for security teams to guarantee better security and better overall manageability over all this data splattered everywhere in the Data Cloud. That's why we built Cyral. It's an interesting anecdote, actually. The name Cyral is derived from a word in my native tongue, saral, which means simple.
That was our North Star, that the Cloud is making everybody's life simple, and we should leverage the same trends and the same constructs to make it very simple for security teams to collaborate with their peers and secure their data.
[00:05:12] Unknown:
And in terms of the data resources specifically, what is it about cloud environments that introduces so much complexity in terms of being able to maintain the security of those resources? And what are some of the common issues that arise when trying to work with databases or object stores or data lakes or data warehouses in Cloud environments?
[00:05:36] Unknown:
So there are two issues at the core of it. One has to do with accessibility, and the other one has to do with tooling. Let me explain both. If you're a traditional large enterprise, think of a large bank, they would have thousands of people in their IT and technology department, and very few, a very small number of them, would actually know where the database is. Now fast forward to this new world where all the assets, all the applications, everything is moving to the Cloud: you just have to know the name of the bank to know where their database is in Snowflake, where it is in BigQuery, etc. These data repositories have become a lot more accessible, and that was by design.
That was to enable data democratization and agility for application development. But what that does is take away the various layers of defense that had been implemented in front of these databases and data warehouses, which are now irrelevant. That increases the likelihood that this data can be breached and stolen. That's one, Tobias. The second one, like I said, has to do with tooling. Companies are moving to the Cloud, and their DevOps teams and development teams are adopting infrastructure-as-code models for development and deployment. The goal of all this is to iterate very quickly on the development, on how they want to set up their infrastructure stack, and on how they want the different services to talk to each other.
Historically, security teams used to come in after the fact: once the engineering and IT teams had set up a deployment, they would come in and put all the security controls and policies in place to make sure it could not be compromised. But now the cycle time of this deployment is several orders of magnitude faster, and that makes it really, really hard for the security teams to keep pace with the engineering and IT teams working in the Cloud, and that's the other piece of this problem.
[00:07:40] Unknown:
And in terms of the breakdown of responsibilities for security of the resources and provisioning of them and the data engineers who are working on populating information into and out of these systems, how does Cyral help with unifying those different roles and responsibilities to align them along being able to ensure that the databases and data warehouses are operational and easy to deploy and maintain and interact with while also still being able to maintain the necessary security elements of things like encryption and TLS and auditability?
[00:08:19] Unknown:
Yeah. So that's a very interesting question, Tobias. What we saw was a very latent need for truly enabling security to be a shared responsibility between the development, the DevOps, the IT, and the security teams. What they lacked, or actually still lack, are the right tools and the right processes around which they can collaborate. That has been a general area of investment and innovation, and we bring the same approach to the Data Cloud. Any application, any service, any user that wants to talk to a database or a data warehouse or a data pipeline, we can help secure that interaction and that communication. The way we enable that is we are championing this methodology called Security as Code.
The idea here is just like you're using Infrastructure as Code based workflows for an application and infrastructure deployment, you can use a security as code model for deploying all your security tools, all your security policies, all your governance constraints into the same workflow. As the engineering, Dev, and IT teams release new applications or upgrade their infrastructure or change their deployments, the security controls stay in place in lockstep with all those changes. That is what really enables companies to adopt our solution very quickly and roll it out very quickly.
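To make the security-as-code idea concrete, here is a minimal sketch of policies kept as code and checked in the same workflow as infrastructure changes. It is an illustration only: `AccessPolicy`, `POLICIES`, and `validate_policies` are hypothetical names invented for this example and are not Cyral's policy format or API.

```python
from dataclasses import dataclass


# Hypothetical policy model for illustration; not Cyral's actual schema.
@dataclass(frozen=True)
class AccessPolicy:
    data_label: str                 # information type, e.g. "CREDIT_CARD"
    allowed_groups: frozenset       # identity-provider groups allowed to read
    require_audit_log: bool = True  # whether reads must be logged


# Policies live in version control next to the infrastructure definitions.
POLICIES = [
    AccessPolicy("CREDIT_CARD", frozenset({"billing"})),
    AccessPolicy("SHIPPING_ADDRESS", frozenset({"billing", "support"})),
]


def validate_policies(policies, sensitive_labels):
    """Return a list of problems; an empty list means the policy set passes."""
    problems = []
    covered = {policy.data_label for policy in policies}
    # Every sensitive information type must be covered by some policy.
    for label in sorted(sensitive_labels - covered):
        problems.append(f"no policy covers sensitive label {label}")
    for policy in policies:
        if not policy.allowed_groups:
            problems.append(f"policy for {policy.data_label} allows nobody")
        if not policy.require_audit_log:
            problems.append(f"policy for {policy.data_label} disables audit logging")
    return problems
```

A check like this could run in the same pipeline that applies infrastructure changes, so a deployment that introduces an uncovered sensitive label fails before it ships.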
[00:09:53] Unknown:
In terms of organizations that already have deployed resources, they've got databases running either on dedicated virtual machines, or they're using something like RDS or Google Cloud SQL, or they've already got a large amount of data in various S3 buckets. How do they go about using something like Cyral to be able to identify those resources and the information that they contain and be able to apply appropriate security policies to those existing resources?
[00:10:24] Unknown:
A lot of these companies that we are working with, Tobias, we're actually catching them at the right point. Right? Like, three years ago, if I came on this podcast and threw around the term Data Cloud, very few people would actually understand it. Right? Three years ago, if I went outside of a few select pockets and threw out the term Snowflake, they would not understand that I'm referring to a Cloud-based data warehouse. Just in the last year, all these models, all these tools, all these solutions have literally caught fire. The larger enterprises that we work with, they're all, at this point, either just beginning to move to the Data Cloud or just planning to move to the Data Cloud in a pretty big way, and that is where we start working with them. So all the legacy databases and data warehouses that they have, which are on prem or that have been deployed and secured using their existing investment, they're happy with. They work with us on this next generation, this new architecture of data engineering components that they're deploying.
Did I answer your question, Tobias?
[00:11:30] Unknown:
Yeah. I think the main thing I was driving at is just whether the sort of complexity arises from existing resources and being able to discover them. And if the primary issue that you're seeing is around sprawl of data and how it's being propagated across different systems, then you don't necessarily have the visibility of that. Or if it's for people who are deploying resources, and they want to ensure that there is an appropriate amount of security applied at the time of creation.
[00:11:59] Unknown:
Oh, I see. Yeah. So the way to think about it is there's a lot of sprawl, to use your word, at the data layer, where companies would be using different kinds of databases, different kinds of data warehouses, and different teams inside the same organization, the same company, would be using different stacks for accessing, managing, manipulating, analyzing the data. The approach we have taken is that we're going to embrace all these tools and all these different repositories and work uniformly across them. It sounds very audacious and very grandiose, but the key insight here is that because we are focused on these data repositories, there's like a small set of grammars. Right?
A very small set of grammars that we have to be effective for, which gives us very wide coverage across all the different components in a typical customer's data stack. Now with Cyral, when they think of security policies, when they think of access control policies, they can almost forget about whether behind the scenes it's a MongoDB or whether it's Snowflake or whether it's S3. They just think in terms of information types. Like, for example, if it's an ecommerce company, they would say that, look, the really valuable information for me is credit card data associated with my customers, or their name and shipping address. Regardless of where it is, I want only these people inside the organization to be able to access it, and only under these circumstances should an application be able to query them. And Cyral takes care of that diversity, that heterogeneity, in a very uniform way.
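The idea of writing policy against information types rather than against individual repositories can be sketched as follows. This is a minimal illustration under stated assumptions: the `LABELS` mapping, the `ALLOWED_GROUPS` table, the location strings, and `is_allowed` are all hypothetical and do not reflect how Cyral stores or evaluates policies.

```python
# Hypothetical mapping from repository-specific locations to information types.
LABELS = {
    ("mongodb", "orders.card_number"): "CREDIT_CARD",
    ("snowflake", "ANALYTICS.CUSTOMERS.CARD_NO"): "CREDIT_CARD",
    ("s3", "exports/customers.csv#address"): "SHIPPING_ADDRESS",
}

# The policy is expressed once per information type, not once per repository.
ALLOWED_GROUPS = {
    "CREDIT_CARD": {"billing"},
    "SHIPPING_ADDRESS": {"billing", "support"},
}


def is_allowed(user_groups, repo, location):
    """Decide access from the information type, whatever the repository is."""
    label = LABELS.get((repo, location))
    if label is None:
        return True  # unlabeled data is unrestricted in this sketch
    # Access is granted if the user belongs to at least one allowed group.
    return bool(set(user_groups) & ALLOWED_GROUPS[label])
```

With this shape, a support engineer is denied credit card data whether it sits in MongoDB or Snowflake, because both locations resolve to the same label.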
[00:13:33] Unknown:
And as far as the types of security issues that you're seeing, what do you see as being the most problematic natively deployed there in the first place? And some of the common points of confusion or lack of understanding as to the impact of what they're building and the particular settings that they need to take advantage of and maintain observability for?
[00:14:05] Unknown:
A really big issue that they see is, you know, when they move to the Cloud, oftentimes the transition, Tobias, will be driven by agility. Right? They want to really invest in a very modern microservices-based application. It's going to be all deployed using infrastructure as code, and that's driven by a lot of different business reasons. It could be digital transformation, or just because they want to really enable data democratization. Companies say that data is the new oil, and the way we are going to really extract value out of it is by putting it at the fingertips of everybody in the organization. Whatever the drivers may be, they end up with a very fast-changing technology stack where application services, and oftentimes databases also, spin up, do something, and then spin down. A lot of these services become ephemeral.
Across the board, these companies invest a lot in observability. They make sure that they're collecting traces and logs and metrics from everywhere, from all the different systems and components that they have deployed in the Cloud. Recently, if you've seen, investment in tools like Datadog, SignalFx, Splunk, etc. has really shot up because of this particular reason. However, when it comes to their databases, their data warehouses, their data pipelines, there are no good, easy, simple ways for them to even log that information. For example, in a database that your application is talking to, if you turn on logging, it really sinks the performance of the database, which has a downstream impact on the performance of all your applications, all your infrastructure.
Getting even very basic observability out of your database, out of your pipelines, and all these systems is very hard. It starts from there, because if you can't really see what's going on, then protecting it, securing it, managing it becomes that much harder. That's one of the big drivers of adoption of Cyral with our customers.
[00:16:07] Unknown:
And so in terms of Cyral itself, can you talk through the way that the platform is architected and what's involved in actually adding it to an existing environment and integrating with data sources like databases or object storage?
[00:16:21] Unknown:
Yes. So at Cyral, like I said earlier, a north star for us is simplicity. Right? We want to build a service which is very ergonomic in nature, very simple to deploy, easy for the teams to adopt. The underlying technology that we have built, which powers all of this, is what we call stateless interception. It allows us to intercept requests to any SQL database, NoSQL database, data warehouse, or data pipeline without impacting performance or scalability. This interception service is something that customers run locally in their own environment.
We call it a sidecar. The sidecar runs locally, it intercepts requests to any data endpoint that they may have on prem, in the Cloud, or as a third-party SaaS service, and all application, service, user, and tool requests get routed to a sidecar. Cyral provides a SaaS-based control plane from which a customer can see all the sidecars deployed in their environment, and centrally they can manage what policies they want to enforce and who can access what data, or get observability metrics routed from the sidecar to the tool of their choice. Because the sidecar is a very simple containerized service, customers are able to deploy it almost however they like. Some customers deploy it as a Kubernetes service, some of them run it as a self-managed hosted service in their own Cloud.
Some customers deploy it using Terraform or CloudFormation. We've seen all kinds of deployments.
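The interception pattern described here can be illustrated with a toy wrapper that sits between a caller and a data repository. This is only a sketch of the general idea of a stateless, in-path policy check with logging; `make_sidecar` and the callables passed to it are hypothetical, and a real sidecar would speak the database wire protocol rather than wrap a Python function.

```python
def make_sidecar(execute, is_permitted, emit_log):
    """Wrap a repository's `execute` callable with a policy check and logging.

    The wrapper keeps no state between requests: every decision depends only
    on the current (user, query) pair and the injected callables.
    """
    def handle(user, query):
        permitted = is_permitted(user, query)
        # Log every request, allowed or not, so activity is always observable.
        emit_log({"user": user, "query": query, "permitted": permitted})
        if not permitted:
            raise PermissionError(f"{user} may not run this query")
        return execute(query)
    return handle


# Toy wiring: a fake repository, a one-line policy, and an in-memory log.
logs = []
sidecar = make_sidecar(
    execute=lambda query: f"rows for: {query}",
    is_permitted=lambda user, query: "card_number" not in query or user == "billing-svc",
    emit_log=logs.append,
)
```

Injecting the policy and the log sink as callables mirrors the split described above, where enforcement runs locally and the policies and metrics destinations are managed centrally.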
[00:18:03] Unknown:
I know that Cloud native has often been associated with Kubernetes and that that's one of the driving factors in that overall space and containerization. What are the challenges for organizations who haven't already adopted things like Kubernetes in terms of being able to keep up with the rate of change as far as being able to get databases deployed and maintained and keeping them up to date and being able to leverage something like Cyral if they're just using the sort of previous generation of cloud management tools like Terraform and Ansible or SaltStack or something like that versus fully leveraging the capabilities of container orchestration?
[00:18:47] Unknown:
Kubernetes, container orchestration, they certainly complicate the problem organizations already have with visibility and with security. But even if you take all that away, right, and we think only about accessibility, there's like a customer that we are working with that, you know, has been growing really, really fast. They are very heavily invested in data analytics. They run their data warehouse in Snowflake, and every week, they would have people joining the organization who would get access to Snowflake. Right? And then because, you know, a lot of what this company does is analyzing their existing consumers' data and figuring out how to best engage them and what kind of, you know, offers to present in front of them, all sorts of data ends up in Snowflake, and most of the organization has access to the data.
And at some point, it ended up becoming a big concern for the CTO: look, I'm reasonably sure that we have good hygiene in terms of, you know, what data will end up in Snowflake and what type of applications and services are accessing it and transmitting what kind of information to Snowflake. However, he still wanted assurance and visibility and some guarantees that the data will not be misused and PII will not be visible to any user by accident. This has almost nothing to do now with Kubernetes. This has just to do with the general rate of change, where there's a Cloud-based data warehouse, namely Snowflake in this case, a large number of people are getting access to it at a rapid clip, and their analytics workload is changing at a fast rate. Even in that scenario, Cyral can be extremely useful: all requests are monitored by Cyral, and it can help organizations like these make sure that data is never inappropriately accessed, even by accident.
[00:20:44] Unknown:
And that brings up an interesting point too as far as how to effectively prevent things like leakage of PII when you don't necessarily know what is contained in a given data repository, particularly for things like Snowflake or S3 where anybody can land semi-structured or unstructured data into that source. How do you prevent the leakage of PII when you don't necessarily know what the schema is going to be ahead of time or what the sort of rules will be for eliding a given record when it's not clear upfront what exists in the dataset?
[00:21:23] Unknown:
Yeah. No. That's exactly right. Great. And this is exactly why you need a solution that works seamlessly across all data repositories. What will happen in a typical organization is there will be some tribal knowledge, if you will, around what data is sensitive and where it is stored. The challenge is that data will be living in some database. It will then move to S3. From S3, it will move somewhere else. Then it will land in Snowflake. Then somebody will read the data and push it somewhere else. And very quickly, people lose control over where the data is going and where it has landed, and it is impossible to express any kind of governance policies at the application layer or in a disaggregated way, saying that, look, this will be the governance policy for this database or this will be the governance policy for this tool. You have to think more fundamentally.
You have to think about it in terms of the data itself. That is exactly what Cyral enables. What customers will do is they will say, let's put Cyral in front of a few data repositories. Now we understand what the PII data is, where it is stored, what the structure is. Once Cyral starts seeing all the interaction, we can very quickly inspect all the different data flows and help customers map out where the data is and where it is flowing. And very soon, they start getting a lot of confidence in their own assessment of where the data is flowing across the organization, and then they can start putting in policies.
For example: look, it's okay if a developer comes in and spins up a new database in AWS and has an application talking to the database and pulling in data from somewhere else to populate the database. However, for this dev/test type thing, under no circumstances should you be reading PII about our customers. And that is where Cyral is very valuable.
[00:23:07] Unknown:
The other thing is for the case where you do know what the schema is ahead of time and you map out the fields that need to be elided for particular roles, how do you address things like schema updates and being able to identify when those underlying columns or table structures change so that the rules that you have defined upfront are no longer going to be valid and need to be updated? And how do you raise that awareness for people who are managing that infrastructure?
[00:23:37] Unknown:
Yeah. So there are two ways that we help with this. One is, again, we enable our customers to think in terms of data types. A customer says that, look, the information that I really care about could be somebody's age, it could be somebody's phone number, and it could live in this column, or it could live in this field, or it could live in this bucket. That schema, of course, you're right, could be changing all the time. So first of all, the policies are defined on these information types. Then for the enforcement of these policies, because it's a security-as-code approach, we stay in line with all requests going to a data endpoint as the applications and the underlying schema and all that information are evolving.
For schema updates, for example, the way customers use us is that if they see a service or a human being or a DBA update a schema or create a new table, they want to be flagged about it. Then they start tracking that new schema or that new column or that new table that was created, and then they can start inspecting the data which is showing up over there. If the data is coming in from what used to be some sensitive data that you had tagged, then all of a sudden you know that you need to automatically extend your policies and your visibility and your constraints to that new data location as well. And that's how Cyral helps customers constantly stay on top of their evolving data models and these schemas and tables and whatnot.
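The two steps described in this answer, flagging new columns or tables and then extending existing labels to data that flows from tagged sources, can be sketched as below. The schema snapshots, the `lineage` mapping, and both function names are hypothetical and exist only for illustration; how Cyral actually detects schema changes or tracks data flow is not specified in the conversation.

```python
def new_columns(before, after):
    """Return (table, column) pairs present in `after` but not in `before`.

    Both arguments map a table name to a list of column names.
    """
    added = []
    for table, columns in after.items():
        known = set(before.get(table, ()))
        added.extend((table, column) for column in columns if column not in known)
    return sorted(added)


def inherit_labels(added, lineage, labels):
    """Extend labels to new columns whose source column is already tagged.

    `lineage` maps a new (table, column) to the (table, column) it is
    populated from; `labels` maps tagged columns to information types.
    """
    inherited = {}
    for column in added:
        source = lineage.get(column)
        if source in labels:
            inherited[column] = labels[source]
    return inherited
```

For example, if a new `customers_copy.contact` column is populated from a `customers.phone` column already tagged as a phone number, the copy inherits the same label and therefore the same policy.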
[00:25:12] Unknown:
Today's episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. Datadog uses machine-learning-based algorithms to detect errors and anomalies across your entire stack, which reduces the time it takes to detect and address outages and helps promote collaboration between data engineering, operations, and the rest of the company. Go to dataengineeringpodcast.com/datadog today to start your free 14-day trial. And if you start a trial and install Datadog's agent, they'll send you a free T-shirt.
And going back to the concept of sprawl, where before I was talking about the preexistence of various different sources of data and locations where they might be residing, the other aspect of sprawl is where you have a well-known place where most of your data is being located, and then you're building various reports or data extracts, and those get sent to different people and then copied into things like Excel. How does Cyral help with identifying or mitigating or removing the need for that type of sprawl, where the different elements of data are being copied into multiple different locations and you don't necessarily know how they're being maintained or if they're being kept in compliance with any sort of regulatory or security regimes?
[00:26:36] Unknown:
This has to do with basic visibility. I was talking to a CISO recently, and he had a really interesting perspective on this. It was, like, look, let's just focus on solving even the most simple problems that are the highest bang for the buck, and all these super complicated problems we'll get to later. The challenge that a lot of these security leaders have is that there are no tools even for solving simple problems. Let's take this Excel scenario as an example. Let's say you have a company which is hosting a bunch of data in their database, and they're partnering with other companies, and an employee in a partner organization wants to hook up to the database and pull data out into Excel. Today, it's hard for companies to answer a very basic question: which employee in which partner accessed what data? In a lot of cases there is no way for them to do that. This is where Cyral can be helpful, where we can sit in front of all this activity and then hook into their identity provider and give them real-time visibility into who is accessing what data. And then they can implement some very simple policies saying that, look, I have a gold partner.
And at a gold partner, if there's some analyst, they should be able to read data, but only data that is relevant to that partner. They should not be able to read data that belongs to another partner. And show me exactly all the data that has been read. Another way this becomes valuable is that now you can see, based on attributes of different users, what type of data they have been reading. For example, say you see that your support engineers generally read data between 8 AM and 6 PM. If at midnight some support engineer decides to read data associated with one specific individual, even though there was no case or issue triggered against that user for that time period, that quickly becomes a red flag. So just by providing this visibility, Tobias, you can catch a lot of issues and nip them in the bud before you end up on the front page of the newspaper for having 100,000,000 user records or 200,000,000 credit card records stolen.
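Editor's sketch: the two rules described above (a partner analyst may read only their own partner's data, and an off-hours support read with no open case is a red flag) can be written down as small checks over access events. This is a minimal illustration, not Cyral's policy format or API; the `AccessEvent` fields, the role names, and the 8-to-18 working window are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class AccessEvent:
    """One observed read, enriched with identity attributes (all fields hypothetical)."""
    user: str
    role: str                # e.g. "partner_analyst" or "support_engineer"
    partner: Optional[str]   # partner organization the user belongs to, if any
    data_owner: str          # partner or individual the record belongs to
    timestamp: datetime
    open_case: bool = False  # is a support case open against data_owner?


def is_allowed(event: AccessEvent) -> bool:
    """Partner analysts may read only data that belongs to their own partner."""
    if event.role == "partner_analyst":
        return event.partner == event.data_owner
    return True  # other roles are out of scope for this sketch


def is_suspicious(event: AccessEvent, start_hour: int = 8, end_hour: int = 18) -> bool:
    """Flag support-engineer reads outside working hours when no case is open."""
    if event.role != "support_engineer":
        return False
    in_hours = start_hour <= event.timestamp.hour < end_hour
    return not in_hours and not event.open_case


if __name__ == "__main__":
    midnight_read = AccessEvent(
        user="sam", role="support_engineer", partner=None,
        data_owner="customer-42", timestamp=datetime(2020, 6, 1, 0, 15),
    )
    print(is_allowed(midnight_read), is_suspicious(midnight_read))
```

Keeping the allow/deny rule separate from the anomaly flag mirrors the distinction in the conversation: one is enforced in line, the other only surfaces activity for someone to review.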
This is what we enable security leaders and organizations to do.
[00:28:56] Unknown:
To your point too about the data breaches and unauthorized access to large volumes of data, a lot of that has to do with things like S3 or other object storage platforms where the security controls aren't properly configured. Does Cyral provide any way to gain insight into that, where you can say, here are the buckets where I have information stored, and then use the cloud APIs to introspect what the security policies are and determine whether the access is too permissive and what other systems are being used to read and write data to and from that?
[00:29:32] Unknown:
Yeah, that's exactly right. So we can certainly help with that. In fact, think of data governance frameworks. Data governance frameworks typically have different aspects to them. One big pillar is discovery: finding out where all my data is, what the different data sources are that I have, and where they are kept. Another one is classification, which classifies your data into one or many different kinds of categories. And depending on the category, you can assign severity and policies around who can access how much of the data, under what context, et cetera. Then there's another one, a very important one, which is access control.
Now that you understand where the data is and what type of data it is, you have to enforce access control to that data in real time. That is what is unique and special about Cyral. We are able to sit in line with all requests and implement that access control. Because we are seeing all this activity, we can integrate with other tools that companies may be using for classification or cataloging or discovery, or we can just start from there and help them build that out. And most importantly, whether they build something themselves or use a third-party solution or use us for that, we can make sure it is kept up to date, because we don't require our customers to do offline scans and discovery anymore; all the activity is now in front of them at their fingertips. We can make sure that catalogs and discovery engines are always kept up to date.
[00:30:59] Unknown:
Another element of the challenge of being able to build something like this, where you are monitoring access and gaining observability into the interaction patterns of these systems, is the variance in APIs that exist for different data sources. So I'm wondering how you decided which platforms to prioritize as you were building out Cyral. And what are some of the systems that you found to be most challenging to work with?
[00:31:29] Unknown:
As we were starting Cyral, we were building a first-of-its-kind product. For most of the customers that we work with, this is a new spend or a new budget item. We're not replacing something that existed before. The big bet that we made, Tobias, is that as companies move to the Cloud, they are going to use these Cloud-native, Cloud-first, Cloud-friendly data repositories like Snowflake, like BigQuery, like MongoDB Atlas, et cetera. For our first cut of going to market, that's the set of data repositories that we prioritized.
Call it luck, call it foresight, it actually worked out really well for us, because for most of the customers that we work with, we discover that 90% of their workload involves these early repositories that we had decided to undertake. And as the company has been growing, we try to stay very close to our customers, very determined to solve their most pressing needs. And then, of course, we keep on adding coverage for new types of repositories or tools that they want us to be effective for. So that's the strategy that we've taken.
[00:32:42] Unknown:
As far as your interactions with your customers and as you are building out the product and figuring out which directions to take, what are some of the biggest security challenges that organizations are dealing with or the things that are most likely to keep them up at night and lead to breaking compliance regimes?
[00:33:06] Unknown:
Yeah. So there are two big areas, two big vectors broadly, in terms of how we think of designing and prioritizing a solution and in terms of the customer needs that we tackle. One is adoption of the data Cloud, where a lot of companies are, for the first time, keeping their really sensitive, mission-critical, company-critical data in a data repository where they cannot put their arms around the infrastructure or the server where the data is stored. And this data is being analyzed using third-party SaaS tools like Looker, like Tableau, and other similar tools, and these organizations have basically no control over the data which is flowing from a repository to a tool and back again. So that's an area of concern for a lot of security, technology, and CIO leaders inside organizations, and one that we focus on. And the other one is around agility and simplicity of deployment. You can solve the grandest challenges if you don't get overwhelmed by them, and that's how we approach this. Look, data protection, heterogeneity of data, applications spinning up and down, data flowing from one place to another: it seems like a really daunting problem.
But by really being maniacally focused on building a product which is simple to use, simple to deploy with very intuitive workflows, that's how we are enabling customers to step up and solve these issues.
[00:34:40] Unknown:
And as far as the challenges that you're facing or some of the most interesting or unexpected lessons that you've learned while building Cyral, what are the things that stand out most to you?
[00:34:51] Unknown:
Well, since you're asking me now rather than last year, I would say that life can throw any kind of a curveball at you, including a global pandemic. So you have to be agile as hell and responsive as hell to adapt to a changing environment of customers' and investors' and employees' needs and all that. But I suppose I'm not alone in this. Every organization is dealing with it. But really, the most interesting challenge that we had to solve at Cyral was that of education. How do we explain what we are trying to do, the product that we're trying to build, and the benefits of this product to an audience that has not used a product like this before?
And that is both daunting and fun, because we really have to think from first principles. Look, how do I describe the problem? Is data Cloud the right word? Is there some other way to capture this? Is security as code the right way to explain how we operationalize that, or is there a different term? These have been some of the most fun, yet the most challenging, aspects of building Cyral so far. But like I said, we got very lucky that we've had the right team and the right market trends behind us, and we were able to come to a place where the Cyral positioning and the Cyral story actually resonate quickly and easily with most people that we talk to.
[00:36:18] Unknown:
Particularly for security engineers who might be familiar with software applications and the traditional access patterns and life cycles of things like a web app, what have you found to be some of the challenges that they're facing, and some of the gaps in knowledge and appreciation for the complexity, when they are tasked with understanding what policies to implement and how to manage the overall accessibility and access patterns of these larger repositories of data that have multiple stakeholders and much different access patterns?
[00:36:53] Unknown:
So they are in a bit of a bind. The business is forcing the development and IT teams to run at the speed of light. And the security teams are still trying to figure out how to best work out their operating relationship with these teams, how to best interject themselves into their release management processes, and how to best communicate with these teams about what the right policies to enforce are, where they should be enforced, how frequently they will be updated, and who will own that. These new models, these new Cloud-native architectures, have thrown a wrench in the agreement that all these teams had over the last many decades. And a lot of companies are still trying to figure out how to work together and how to operationalize the security of their overall Cloud platforms in a way that makes sense for everyone, so that they can continue to move fast while making sure that they're not opening themselves up to some massive breach, something that really erodes the trust of their customers in the business.
So that's basically what we're seeing. And again, I don't think this is unique or specific to any particular vertical or any particular segment. It's a fairly hot topic inside most organizations, and one that we are all collectively trying to figure out from different angles.
[00:38:25] Unknown:
And similarly for data engineers, who have up to now largely been working in their own silos and have maybe not been as directly integrated with security and development teams or the overall operations workflows: what are some of the areas where they're being challenged, and some of the gaps in knowledge that they might be presented with, as they become more integrated into the responsibility of [...]
[00:39:00] Unknown:
[...] stay productive, stay above water, and make sure that they can keep up with growth in both the volume of the data and the complexity of their own jobs. They have to support an increasing number of business use cases that are imposed upon them. They have to continuously adopt the latest technologies that show up in the data processing world and continuously improve their toolkit and sometimes their own skills, go for different certifications, et cetera, to make sure things are running smoothly. And they're all very security conscious, but they will often look to the security team for guidance. What are the best practices to follow here? What is the right tooling to use here? This is exactly the gap that Cyral helps cover, because ours is a service that can be adopted by these engineering teams and recommended by the security teams, and it helps both of them collaborate with each other and make sure that their data stays safe.
[00:40:02] Unknown:
And so for teams who are trying to improve their overall security posture and their capabilities to be able to move fast while keeping things appropriately locked down, what are the cases where Cyral is the wrong choice, and what are some of the alternative options for being able to manage this, either via in-house tooling or some other type of service?
[00:40:27] Unknown:
So a lot of companies are working on the more traditional, very on-prem-centric workloads, and there's still a very large number of them. For them, it's not that Cyral is the wrong choice. It's more that Cyral is probably not a pressing need. The reason is that they have a security model that has been working for them for many years, and they probably have better things to work on. Another scenario is where the sensitive data in the organization lives on file servers or inside Cloud storage services like Box and Dropbox, completely free format, sitting in the form of files used by information workers. That is where Cyral is not relevant for them either. But whenever organizations have data which is highly proprietary or very sensitive, which is core to their business and a competitive advantage for them, and which is sitting in databases, data warehouses, and data pipelines and accessed by an increasing number of stakeholders in the company, that is where Cyral is a really valuable service for them.
[00:41:51] Unknown:
Another element of what you're building at Cyral is the question of performance, where you are sitting in line with the data resources and need to be able to block requests that aren't allowed, elide certain elements of the data sources, or obfuscate things like credit cards. I know that proxies can sometimes be a performance issue. So I'm wondering how you have tackled that particular problem, and the issues that you've seen with people who may have had bad experiences with that type of system when you are trying to encourage them to try out Cyral?
[00:42:28] Unknown:
Yes. So that's a great question. At Cyral, we came up with this technology called stateless interception, and it allowed us to build what we call a sidecar, a data layer sidecar. This is something that can sit in front of your data endpoint, but it does not come with the traditional inflexibility, performance penalty, and scalability challenges that you have with proxies, or the manageability issues and applicability restrictions that you get with agents. It's a very unique and interesting thing that our engineering team came up with, which allows us to intercept requests to any of these databases and data warehouses without impacting performance and scalability.
And when we go present our solution to a large enterprise, the security team is obviously very receptive to something like us, and then we get introduced to the data engineering teams. What we saw was that they were very interested in our design considerations, in how we can be in line without the traditional challenges that people associate with these proxies, and in how it actually further simplified their lives by giving them real-time observability into their database connection tools that they did not have before. That's when we started getting more and more pull into these accounts, as we started working with engineering teams.
What we decided to do, Tobias, to answer the second part of your question, was to completely embrace this and be very open with our overall architecture and our design principles. If you go to our website, www.cyral.com, in the technology section you can read how we designed the sidecar and what the key insights are that we applied to do away with this whole notion of a proxy and build something which is very new, very lightweight, and very effective. So as opposed to other companies that try to hide their IP, we actually just put it out there on the web for everybody to read, and the reception for that has been fantastic.
[00:44:42] Unknown:
As you continue to build out the platform and try to bring on new customers and support new data systems, what are some of the things that you have planned for the future of Cyral, both technologically and from the business side?
[00:44:57] Unknown:
For us, what we have seen at Cyral is a really greenfield opportunity where companies are increasingly moving to the data Cloud. The engineering teams are increasingly adopting infrastructure as code, and we'll continue to invest in making sure we play well with the various ecosystem tools that these companies use for data management, for data analytics, for deployment, for orchestration, for monitoring, for log collection, et cetera. That's going to be our big focus for a long time: to make sure that we play well with these tools and that we remain the simplest, easiest-to-use security product for these security teams. I think as long as we keep doing that, there's a very large untapped market for us to go after.
[00:45:48] Unknown:
Well, for anybody who wants to follow along with you or get in touch, I'll have you add your preferred contact information to the show notes. And as a final question, I would just like to get your perspective on what you see as being the biggest gap on the tooling or technology that's available for data management today.
[00:46:03] Unknown:
What I have seen is that tooling for data management is a very fluid and very rapidly evolving space. Like I said, my cofounder Srini and I started working together in the big data space back in 2008. And in the last 12 years, the number of tools and the number of companies that have been built, and that are still growing up and still coming to the fore, has become very large. Data management and data engineering professionals have a very large number of choices today versus 15 years ago, when it was just Oracle, DB2, SQL Server. Remember, there was a small number of players that completely dominated data engineers' headspace, and today there are so many alternatives in front of them.
A lot of them are fairly established, and a lot of them are still just bubbling up. So I can't think of any gap. In fact, I think we are at a point where you're looking at 3 different alternatives for doing anything that anybody wants, which actually creates the complexity that Cyral is trying to address.
[00:47:10] Unknown:
Well, thank you again for taking the time today to join me and discuss the work that you've been doing with Cyral. It's definitely a very interesting project and one that is addressing a particular need for data platforms and data teams, because security is something that is and always will be challenging. So I appreciate your efforts to simplify that. So thank you again for taking the time, and I hope you enjoy the rest of your day.
[00:47:34] Unknown:
Tobias, thank you so much for having me. It was a real pleasure coming here, talking to you, and I hope you have a great day as well and have a great weekend.
[00:47:46] Unknown:
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Guest Introduction
Overview of Cyral and Its Functions
Motivation Behind Building Cyral
Challenges in Cloud Data Security
Unifying Security and Data Engineering Roles
Adoption of Data Cloud and Security Policies
Cyral's Platform Architecture
Preventing PII Leakage
Visibility and Data Sprawl
Prioritizing Platforms and Challenges
Security Challenges and Lessons Learned
Performance and Stateless Interception
Future Plans for Cyral