In this podcast recorded at RSA Conference 2018, Dino Dai Zovi, co-founder and CTO at Capsule8, talks about what continuous security is, and how you should bring more of this mindset to your security operations.
Here’s a transcript of the podcast for your convenience.
Hi, this is Dino Dai Zovi, I am the co-founder and CTO at Capsule8. In this Help Net Security podcast, I am going to talk about continuous security: what it is, and how you should bring more of this mindset to your security operations.
A lot of people have been trying to find a good name for a new approach to security. Some people call it DevSecOps, some people call it SecDevOps, and what we are trying to do with this is take a lot of the transformation that has happened in infrastructure operations through DevOps, or site reliability engineering (SRE) culture, and bring that to security. Organizations have adopted these changes, and the security teams are trying to figure out what needs to change in how they do their work, and how they protect these organizations, to adapt to that agility. We think a better description of what needs to be done here is continuous security, and I will tell you a little bit why.
A lot of companies that were early adopters of the cloud, or had large cloud-like production environments, developed this capability for continuous delivery, and that allowed them to bring features and products to market faster by embracing agility in their software engineering processes. To enable this, security teams need to take a lot of this engineering philosophy and apply it to security, and that's what we call continuous security.
Continuous security responds to change; it doesn't block or add friction. A lot of our approaches to security have traditionally worked by gating deploys, gating releases of software, and doing security reviews, and a lot of these add friction and make it harder for the organization to move fast. So when we talk about continuous security, it's a mindset that embraces automation and reacting to change, rather than fighting it. We think it is a much healthier way for engineering and security to work together for the benefit of the organization.
A good way to measure where your organization is on the spectrum of continuous delivery is how frequently releases are deployed, and I take this from a book called “Accelerate” that was recently released. You might be deploying or releasing every 100 days, every 10 days, or daily, and some organizations are deploying or releasing code 10 times a day, or even 100 times a day. There are a lot of changes that the engineering organization needs to go through in order to achieve this.
The book really breaks down what a lot of these changes are and what's behind them, and we are just now figuring out what changes the security team and the security process need to make in order to adapt to them; this is under the umbrella of what we call continuous security. So, I want to talk about a few things that will help organizations become more agile and improve security. To get a mental frame for what needs to be done, I think the best guide is what infrastructure teams have done as they transitioned to DevOps and SRE. They really embraced automation and responding to change to handle incidents, and a healthy culture of metrics, data-driven decisions, and automated responses to incidents has helped out a lot there, and it also really helps in security.
This is what they call event-triggered automation, and it is really key to being able to scale. To bring about this change, I would recommend that people start by monitoring metrics that enumerate and measure risk. There is an art to this: finding metrics that don't just measure things for measurement's sake, but that point in the direction of things you really want to change, and are good proxies for the risk your organization faces.
One example I have given for this is: what is the age, in days, of the oldest known vulnerability exposed to the Internet? This can be a vulnerability in a third-party library or in internally-developed code. This is a metric you really want to minimize, and if you measure and track what the oldest vulnerability is, you can only bring that number down by improving your processes across the entire organization. That's why it is one of those simple numbers that really gives you an idea of the health of your application security program: in a SaaS environment you depend on a lot of upstream software, and being able to incorporate a vulnerability fix and get it deployed quickly is one of the key skills of a continuous security program.
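To make that metric concrete, here is a minimal sketch of how it could be computed. The record format, finding IDs, and dates are illustrative assumptions, not part of any particular vulnerability tracker.

```python
from datetime import date

# Hypothetical open-finding records: when each vulnerability became known,
# and whether the affected service is exposed to the Internet.
findings = [
    {"id": "CVE-2017-5638", "known_since": date(2017, 3, 6), "internet_facing": True},
    {"id": "APP-231", "known_since": date(2018, 2, 1), "internet_facing": True},
    {"id": "APP-240", "known_since": date(2018, 4, 10), "internet_facing": False},
]

def oldest_exposed_vulnerability_days(findings, today=None):
    """Age in days of the oldest known, still-open, Internet-exposed vulnerability."""
    today = today or date.today()
    exposed = [f for f in findings if f["internet_facing"]]
    if not exposed:
        return 0
    oldest = min(f["known_since"] for f in exposed)
    return (today - oldest).days

print(oldest_exposed_vulnerability_days(findings, today=date(2018, 4, 20)))  # → 410
```

The single number this produces is the one to put on a dashboard: it only goes down when the slowest part of your remediation pipeline improves.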
On the infrastructure security side, another metric that I think is good is: how many days has it been since a full, clean wipe of a device? Some people also call this reverse uptime. When I started working with systems, we would brag about how long the uptime of a system was, but as we started thinking about security more, long uptime means there are a lot of opportunities for an attacker to persist on a device. So as you move towards embracing the ephemerality of systems, you actually want to minimize the uptime of systems. There is a trade-off of performance versus reliability, but by reducing that uptime you are giving an attacker less time to persist on a host if they are able to get privileged access. So measure your reverse uptime: what is the longest amount of time that a piece of infrastructure has been running on a clean install, without a full system wipe? Tracking it makes sure you are constantly cleansing the environment. One metaphor that I encourage people to think about is water: water that doesn't flow goes brackish, and you don't want to drink it. Clean water is water that flows, clear and fresh, and that's what we want our production environments to be like, full of this life-giving churn.
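Reverse uptime can be tracked with the same kind of one-number sketch. The host names and reimage dates below are made-up assumptions standing in for a real inventory system.

```python
from datetime import date

# Hypothetical inventory: the last full, clean wipe/reimage date for each host.
last_clean_wipe = {
    "web-01": date(2018, 4, 1),
    "web-02": date(2018, 3, 15),
    "db-01": date(2017, 12, 20),
}

def max_reverse_uptime_days(wipes, today=None):
    """Longest time any host has been running since its last clean install."""
    today = today or date.today()
    return max((today - wiped).days for wiped in wipes.values())

print(max_reverse_uptime_days(last_clean_wipe, today=date(2018, 4, 20)))  # → 121
```

The outlier host (here, the one reimaged four months ago) is exactly the one that gives an attacker the longest window to persist.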
The systems that measure this, identify the outliers, and drive these processes really need to be owned by the security team, and I think one of the most difficult transitions to continuous security is embracing a culture of automation in security. One of the things we have relied on for a long time is a culture of analysis, and that's important, but as organizations scale into the cloud, it becomes trivial for someone to click a box or edit a script to say, "no, I want 10,000 nodes running this instead of 10," and the size of the environment can quickly outstrip the ability of your security team to monitor and respond to what happens in it. Developing a culture of automation is also critical to security having visibility into what they need, having control over the things they need, and mitigating risk across the organization.
In the transition to continuous security, another thing that happens is that the security team begins by building these monitoring systems and generating alerts for themselves to handle, and it's good to make sure the security team handles the alerts first, because they are the ones building these systems. I call it aligning the pain with the agency: the people who have the ability to make the changes should feel the pain, because that taps into the human instinct to minimize pain, and they will fix the problems. If you write systems that send alerts to someone else, they feel the pain but they can't fix the alerts, and so the system will not improve nearly as fast as when the people who control what gets alerted, and how noisy those alerts are, are the same people who have to handle them. That scales to a point, but at some point you need to triage a lot of those alerts to individual users.
So, for instance, say you have a system that is identifying suspicious activity in production: "Hey, Dino dropped that database, did you mean to do that?" Initially you send these alerts to the security team, but at a certain stage you want to send the alert to Dino: "Hey Dino, you deleted this database in production, did you really mean to do that?" If he says yes, then the next question is: prove that it is really you. So you use two-factor authentication or some other way to prove that it is really Dino, and that I really did want to drop that database, and if it was a mistake, I am sure I'll hear about it from my manager. Having that first little triage step really reduces the number of alerts that go to the security team, and if a user says, "no, that absolutely wasn't me," that becomes a very high priority alert and the security team will jump on it pretty quickly.
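That user-first triage flow is simple enough to sketch. This is a minimal illustration only: `notify` and `confirm_via_2fa` are hypothetical callables standing in for a real chat/paging transport and a real second-factor check, not any particular product's API.

```python
# Sketch of user-first alert triage: ask the actor to confirm before paging security.
def triage_alert(alert, notify, confirm_via_2fa):
    """Route an alert to the user who acted; escalate only on denial."""
    user = alert["actor"]
    notify(user, f"You ran '{alert['action']}' in production. Was that you?")
    if confirm_via_2fa(user):
        return "acknowledged"          # user proved identity and confirmed; done
    alert["priority"] = "critical"     # user denied the action: possible compromise
    notify("security-team", f"UNCONFIRMED action by {user}: {alert['action']}")
    return "escalated"
```

The key property is that every confirmed alert is absorbed by one user interaction, and every denial arrives at the security team already marked as high priority.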
So, those are some of the ways you can adopt the mindset of continuous security, but as you get more practical, you are going to need to build systems that scale detection and response in the cloud. The first requirement on such a system is to do no harm: it cannot negatively affect production stability or performance. The second requirement is to minimize on-host resource utilization. I call this the security bill that you pay at scale, because an extra 1% of overhead on every node is paid times the number of nodes in your environment, and that can be thousands or hundreds of thousands. So that is critical to how expensive, in monetary terms, your security solution is.
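The "security bill" is simple arithmetic, and it is worth doing explicitly. The fleet size and per-node price below are illustrative assumptions, not real figures.

```python
# Back-of-the-envelope "security bill": fixed per-host overhead across a fleet.
def security_overhead_cost(nodes, cost_per_node_month, overhead_fraction):
    """Monthly dollars spent on resources consumed by the security agent."""
    return nodes * cost_per_node_month * overhead_fraction

# Assumed: 1% CPU overhead, 10,000 nodes, $100/node/month → roughly $10,000/month.
print(security_overhead_cost(10_000, 100.0, 0.01))
```

At a hundred thousand nodes the same 1% becomes six figures a month, which is why per-host overhead is a first-class requirement rather than a footnote.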
The third requirement is that security analysis needs to happen off of the potentially-compromised host, and I think of this as making sure that the evidence outruns the attacker. We know how attack chains work pretty well: an attacker is going to gain a foothold, try to escalate privileges, and, if there are monitoring systems on the host, try to hide from those monitoring systems. So the sooner that evidence can get off the host to something with a lower attack surface, the more difficult it is going to be for an attacker to cover their tracks.
The fourth requirement is that you must have a low false-positive rate. If you look at the false-positive rates that are considered acceptable for an academic machine learning result, maybe a 10% or 20% false-positive rate, those are noisy in many environments, and once you get to a reasonable scale, they are deadly. Building human teams to triage those alerts, and systems to deal with that flood of false positives, is very expensive, so this is something we really need to actively work against.
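Again, the arithmetic makes the point better than adjectives. The event volumes below are assumed, illustrative numbers for a mid-sized fleet.

```python
# Why an "acceptable" academic false-positive rate is deadly at scale:
# a small per-event rate, multiplied by production event volume, is a flood.
def false_positives_per_day(events_per_host_per_day, hosts, fp_rate):
    return events_per_host_per_day * hosts * fp_rate

# Assumed: 1,000 scored events per host per day across 10,000 hosts.
print(false_positives_per_day(1_000, 10_000, 0.10))     # 10% FP rate: ~1M alerts/day
print(false_positives_per_day(1_000, 10_000, 0.00001))  # 0.001% FP rate: ~100/day
```

A rate that sounds excellent in a paper produces on the order of a million alerts a day here; only rates several orders of magnitude lower leave a volume a human team can actually triage.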
The fifth requirement is that responses must be automated and low-latency. This is the only way to make sure that responses scale along with the environment: as the environment scales up, responses need to scale up as well, and not all of them should require a human in the loop, because the human becomes the time-bounding factor.
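One common shape for this is to contain automatically on high-confidence detections and keep humans for review. This is a sketch of that policy only: `isolate_host` and `page_oncall` are hypothetical hooks that a real system would wire to a cloud or orchestration API, and the thresholds are arbitrary.

```python
# Sketch of low-latency automated response: contain first, review after.
def handle_alert(alert, isolate_host, page_oncall):
    """Act without a human in the loop only for high-confidence, critical alerts."""
    if alert["severity"] == "critical" and alert["confidence"] >= 0.95:
        isolate_host(alert["host"])    # immediate containment, no human latency
        page_oncall(alert)             # humans review after the host is isolated
        return "contained"
    page_oncall(alert)                 # everything else waits for a human decision
    return "queued_for_review"
```

The design choice is where to draw the automation line; the point of the fifth requirement is that the line must exist, so response latency does not grow with fleet size.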
And there is a variety of open source systems that can help. The ones that I recommend are ElastAlert, which was built at Yelp; StreamAlert, built at Airbnb; and, from Capsule8, my own company, the open source sensor we released on GitHub. It is a useful piece in building a system like this for yourself: it gathers low-level system telemetry from your running servers and allows you to analyze it in whatever way you want.
And I want to talk a little bit about why we decided to write our own sensor rather than use something like Linux Audit, which is freely available, built into a lot of systems, and actually what a lot of other tools use under the hood. One of the things we found when looking at Audit is that it was built for trusted operating systems, not production environments, so as the system gets loaded at scale, its performance goes parabolic and then it will hard-block your workload. That is pretty unacceptable in most production environments, and it's fundamental to Audit and to how a lot of tools use it. So we decided we wanted something much more lightweight, something that did not require loading a kernel module or dealing with kernel code signing, and that still gave us low-level system telemetry, like in-kernel function execution. And we needed something that supported all the way back to Red Hat 6 and kernel 2.6.32 and later.
We think we've built a good solution here, and that it might be useful to other people; that's why we released it openly, and it is the basis of our full commercial product. If you'd like to learn more about our open source sensor, you can check it out on GitHub, or if you want to learn more about our full commercial offering, go to our website at www.capsule8.com.