How to Achieve Application & Cloud Security Resilience

James Chiappetta · Published in better appsec · 14 min read · Feb 16, 2023


A guide to defining and maturing a truly resilient Application and Cloud Security program through automation and data.

Written by James Chiappetta, with contributions from Dor Zusman.

Disclaimers

  • The opinions stated here are the authors’ own, not necessarily those of past, current, or future employers.
  • Mentions of books, tools, websites, and/or products/services are not formal endorsements, and the authors have not received any incentives to mention them.
  • Recommendations are based on past experiences and not a reflection of future ones.

Background

Have you ever wondered what it takes to find and fix ALL vulnerabilities and misconfigurations on an ongoing basis with the highest level of accuracy? If you have read our post on developer driven remediation then you will know about the cornucopia of detective security tools, how to measure their impact, and the importance of getting developers to drive their own remediation. But, what are the subsequent steps to deliver the most accurate and meaningful results? We will tackle this highly nuanced question and hopefully help shed light on the path to building a highly resilient security program that should stand the test of time.

First Things First

After quite a bit of research on the topic of "vulnerability triage", something became very clear: there is little clear documentation on how to effectively use the output of the security tools sprinkled across a software development lifecycle (SDLC) in a way that benefits the broader security program. Having seen the benefits of this in practice (less noise, centralization, better visibility, etc.), one of the best ways to summarize the outcome is: continuous security program improvement that drives resilience and trust.

Operationalizing automated security scanning tools, supported by strong processes, is what separates checkbox-driven security from security focused on continuous improvement. We will take you through how to piece this together by covering the following in this post:

  • Automated Security Scanning Tools: A quick recap on some of the most important detective security tools and where they fit across the software development lifecycle
  • Dialing in Tool Scanning: Where to perform comprehensive scans vs targeted ones.
  • Manual Security Processes: Building a high quality detection set (hint: continuous assessment & improvement).
  • Unlocking the power of centralization: The art of security issue root cause analysis, deduplication, and attribution.
  • Quantifying application security program resiliency: What are some measures of success?

Let’s get to it!

Automated Security Scanning Tools

There is a fairly wide array of detective security tools out there at the disposal of security engineers. Here are the ones we know well:

  • Static Application Security Testing (SAST) - code scanning for security issues.
  • Dynamic Application Security Testing (DAST) - web application fuzzing.
  • Software Composition Analysis (SCA) - software dependency security scans.
  • Secrets Scanning - code scanning for secrets (passwords, keys, tokens, etc)
  • CI/CD Pipeline Scanning - newer scanning tech that looks at the health/hygiene of code pipelines
  • Web Application Firewalls (WAF) - layer 7 (Application layer of OSI model) firewall
  • Infrastructure as Code (IaC) Scanning - scans your Terraform, CloudFormation, etc. for misconfigurations or issues
  • Network & System Vulnerability Scanning - uses network discovery techniques or targeted lists of hosts to scan for security issues
  • Docker Container Image Scanning - Scans the built artifacts, called images, that are used for containerized workloads
  • Cloud Workload Scanning - an agent or daemon set to introspect cloud based compute
  • Cloud Security Posture Management Scanning (CSPM) - reads the actual configurations of a cloud environment to enumerate issues
  • Central Vulnerability Management (CVM) - a security misconfiguration & vulnerability aggregator to avoid having to report from n tools

There are more, but let’s stick with these for now. Next, we will see how these fit into the bigger picture of the SDLC. What is important is that they all detect or handle security misconfigurations and vulnerabilities.

Note: We have covered these tools in more detail here.

Security Tools Across the SDLC

There are many security product companies in the market. Each is exceptional at something that differentiates it from the competition. You will need to perform your due diligence and decide which is best for your organization's objectives and budget. That said, how does that big list of tools fit into the bigger picture?

Security Across the SDLC

If this picture of tools seems like a lot to manage, it is. There is no magic formula for how many security engineers you need for each tool, and every tool will need maintenance. You will likely need to prioritize based on where you will get the best bang for your buck.

Code analysis is always a cost-effective way to find security issues quickly and early, but it usually requires manual effort to make sure coverage, accuracy, and scan performance are dialed in to whatever requirements you may have. For instance, you may want at least 80% language coverage, 70% true positives (actual findings), and sub-15-minute scan times. Getting there will require work. We have always set the bar higher than these metrics, but if you are just getting started, the example may be a good starting point.
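To make targets like these measurable, here is a minimal sketch that computes language coverage, true-positive rate, and the slowest scan time. The ScanRecord fields are hypothetical; adapt them to whatever metadata your tooling actually exports.

```python
# A minimal sketch of tracking code-scanning quality against example targets
# (80% language coverage, 70% true positives, sub-15-minute scans).
# The ScanRecord fields are hypothetical; adapt them to what your tooling exports.

from dataclasses import dataclass

@dataclass
class ScanRecord:
    repo: str
    languages_in_repo: set[str]     # languages present in the repo
    languages_scanned: set[str]     # languages the SAST tool actually covered
    findings_total: int
    findings_true_positive: int     # confirmed during triage
    duration_minutes: float

def scan_quality(scans: list[ScanRecord]) -> dict[str, float]:
    all_langs = set().union(*(s.languages_in_repo for s in scans))
    covered = set().union(*(s.languages_scanned for s in scans))
    total_findings = sum(s.findings_total for s in scans)
    true_positives = sum(s.findings_true_positive for s in scans)
    return {
        "language_coverage_pct": 100 * len(covered & all_langs) / max(len(all_langs), 1),
        "true_positive_pct": 100 * true_positives / max(total_findings, 1),
        "slowest_scan_minutes": max(s.duration_minutes for s in scans),
    }
```

Tracking these numbers per repo and over time is what tells you whether the "dialing in" work is actually paying off.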

Pro Tips

  • You don’t always have to buy a tool. You can implement an open source tool, and for those who are feeling adventurous, implement your own (not often advisable but sometimes necessary).
  • It’s important that you understand the broader elements of an Application Security Program. Take a look at our post about this in the CISO’s guide to a modern AppSec Program. If you are already familiar, cool beans, but for those who aren’t, feel free to take a read.

Comprehensive Security Scans vs Targeted Scans

It's common to see security engineers pick tools that promise the broadest coverage across endpoints, packages, and/or vulnerability classes. This strategy trades accuracy for coverage. Everyone's budget and capacity to manage many security tools will differ. There are three strategies to tackle this.

Comprehensive scans in passive mode

Everything gets scanned with every detection, early and often, but in a passive, non-blocking mode. Think of this as kitchen sink mode. The results are typically harder to use, as there may be more false positives than in a targeted scan. The local development environment is usually the best place to perform such scans, but you can also run them as scheduled scans out of band from production pipelines. One last point: you don't want these scans to become useless and unused, so broadly disable detections that would simply never apply in your environment to increase accuracy.

Targeted scans in validation/blocking mode

Everything gets scanned with only a set of highly vetted detections for each tool. Those detections should map to a prioritized set of issues that everyone agrees should never materialize. This means detections are at least 95% accurate (more on this next). These types of scans should contain detections used ubiquitously across all environments, so preprod/development pipelines and production are looking for the same issues. Passive, non-blocking scans should bubble these issues to the top so there are no surprises when pushing code to production.

A common example for targeted scans is detecting Cross Site Scripting (XSS) issues and preventing any production code pipeline from advancing if one is detected on an external web application. Another would be scanning your infrastructure as code for something like a Security Group permitting full network access (0.0.0.0/0).
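As a sketch of what such a blocking gate can look like, the snippet below assumes a scanner that emits a JSON list of findings and a small, agreed-upon set of "never ship" rule IDs. The report shape and rule names are illustrative assumptions, not any specific tool's format.

```python
#!/usr/bin/env python3
# A minimal sketch of a pipeline gate for targeted, blocking scans.
# The report shape and rule IDs are hypothetical -- map them to whatever
# your scanners actually produce.

import json
import sys

# The small, highly vetted set everyone agreed should never reach production.
BLOCKING_RULES = {
    "xss.reflected",           # e.g., XSS on an external web application
    "aws.sg.open-to-world",    # e.g., security group allowing 0.0.0.0/0
}

def main(report_path: str) -> int:
    with open(report_path) as report_file:
        findings = json.load(report_file)  # assumed: list of {"rule_id": ..., "location": ...}
    blocked = [f for f in findings if f.get("rule_id") in BLOCKING_RULES]
    for finding in blocked:
        print(f"BLOCKING: {finding['rule_id']} at {finding.get('location', 'unknown')}")
    return 1 if blocked else 0  # a nonzero exit code fails the pipeline stage

if __name__ == "__main__":
    sys.exit(main(sys.argv[1]))
```

Keeping the blocking set this small and this well vetted is what keeps developer trust intact when the gate does fire.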

A balanced scanning approach

While both methods will work differently for each organization, it is common wisdom to mix the two as the situation dictates. Choosing either extreme can manifest as issues later:

  • Blocking too often can result in a negative impact on the developer experience.
  • Scanning too passively can result in only discovering security issues after the fact, and not preventing them.

Using risk to guide your decisions is usually a good way to go. For example, perform very targeted scans to block critical issues from reaching production, business-critical infrastructure, but allow for more flexibility in a lower-privileged development environment.
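One simple way to express that risk-based split is a policy map keyed by environment or risk tier; the tiers and values below are illustrative assumptions, not a prescription.

```python
# A minimal sketch of choosing scan mode by risk tier -- the tier names and
# policy values here are illustrative, not prescriptive.

SCAN_POLICY = {
    # environment/risk tier -> (detection set, blocking?)
    "prod-business-critical": ("targeted-vetted", True),    # block on must-fix issues
    "prod-standard":          ("targeted-vetted", True),
    "staging":                ("comprehensive",   False),   # passive, report-only
    "dev-sandbox":            ("comprehensive",   False),
}

def policy_for(environment: str) -> tuple[str, bool]:
    # Default to passive, comprehensive scanning for unknown environments.
    return SCAN_POLICY.get(environment, ("comprehensive", False))
```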

Pro Tips

  • Every security team starts somewhere. Focusing on the list of security tools in this post, we know many will start with 4 key tools: SAST, secrets scanning, SCA and CSPM. These give security engineers visibility into code as it changes and the cloud environment as it changes.
  • XSS is a very common issue on web apps and is usually indicative of other issues.

Creating a Central Vulnerability Process

Identifying security issues is always the first leg of the journey. The subsequent steps are typically:

  1. Security issue analysis & reporting (looking at what issues are being found)
  2. Remediation (people fixing issues)

Doing this effectively means you need to move the results out of the tools into a central location, perform root cause analysis, deduplicate issues, attribute them to owners, and then have someone (security engineer or developer) validate from there.

Centralization

There have been many attempts by security product companies to solve this, but they usually focus on either application security issues OR cloud and infrastructure issues, almost never the combination of both. Why? Glad you asked. There are many underlying challenges with combining security issues from various tooling. For instance, SAST tools produce issues for code repositories, code files, and lines of code. Conversely, DAST tools produce issues for web endpoints (URLs) and HTTP parameters. These things are difficult to put together in a single place. However, it's not impossible; you just need to get the data into one place to get started.
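One way to get started is to normalize every tool's output into a single record shape before storing it centrally. The sketch below is a hypothetical schema, not any vendor's; the point is that a generic location field can hold a SAST file/line just as easily as a DAST URL/parameter or a CSPM cloud resource.

```python
# A minimal sketch of a normalized finding record for centralization.
# Field names are hypothetical; SAST (repo/file/line), DAST (URL/parameter),
# and CSPM (cloud resource) findings can share one schema once the
# tool-specific "location" is captured generically.

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Finding:
    source_tool: str              # e.g., "sast", "dast", "cspm"
    vulnerability_class: str      # e.g., "xss", "hardcoded-secret", "public-s3-bucket"
    severity: str                 # normalized: "critical" | "high" | "medium" | "low"
    asset: str                    # repo, web endpoint, or cloud account/resource
    location: dict = field(default_factory=dict)  # {"file": ..., "line": ...} or {"url": ..., "param": ...}
    owner: Optional[str] = None   # filled in during attribution
    status: str = "new"           # new -> triaged -> fixed / risk-accepted
```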

Root Cause Analysis (RCA)

Root cause analysis is the process of deciphering the issues reported by detective tools to reach a meaningful data point to act on. We all have different ways of doing this, and it can vary based on what the issue or tool is. Here is a basic flow diagram of a typical RCA process:

Vulnerability Root Cause Analysis (RCA) Flow

Note: Expect to be searching the internet or internal documents for information, trying to understand exactly what the issue is and what part of the infrastructure it lies in, manually sifting through log records, and looking for the developer or team you need to talk to.

Deduplication

When you have more than one tool in the mix, it's likely you will have duplicative results from them. Duplication usually manifests in two ways (a minimal fingerprinting sketch follows the list):

  1. Duplication from multiple scanning tools. This means the same issue was detected on 2 or more tools because of scanning overlap (i.e. 2 code scanners scanning the same repo).
  2. Duplication from multiple scans from the same or similar tools but at different points in the development process. This means the same issue is detected because you are scanning the same fundamental code on different stages of its lifecycle (i.e. a container scanner while scanning the artifact and a code scanner while scanning the repo are identifying the same CVE).
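Here is the minimal fingerprinting sketch referenced above: hash the attributes that describe what the issue is and where it lives, while ignoring which tool or pipeline stage reported it. Which fields belong in the fingerprint is a judgment call for your environment.

```python
# A minimal deduplication sketch: fingerprint each normalized finding so the
# same underlying issue reported by two tools (or two pipeline stages)
# collapses into one record.

import hashlib

def fingerprint(finding: dict) -> str:
    # Ignore which tool or stage reported it; key on what the issue is and where it lives.
    key = "|".join([
        finding.get("vulnerability_class", ""),
        finding.get("asset", ""),                  # repo, endpoint, or cloud resource
        str(finding.get("location", {}).get("file", "")),
        str(finding.get("cve_id", "")),            # e.g., same CVE from SCA and image scanning
    ])
    return hashlib.sha256(key.encode()).hexdigest()

def deduplicate(findings: list[dict]) -> list[dict]:
    seen: dict[str, dict] = {}
    for finding in findings:
        seen.setdefault(fingerprint(finding), finding)   # keep the first occurrence
    return list(seen.values())
```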

Attribution

This is where you tie an identified issue back to a person and/or system that introduced it. For instance, a developer runs a “terraform apply” with a wide open security group and your CSPM detects that infrastructure once built. Not all issues have a single root cause and you will want to account for this in your process.
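For the Terraform example above, a simplistic (and heavily caveated) way to start attribution is to ask git who last touched the offending IaC file, as in the sketch below. Real root causes are messier (shared modules, drift, changes made in the console), so treat this as a starting hint, not the answer.

```python
# A simplistic attribution sketch: given the file a misconfiguration traces
# back to, ask git who last touched it.

import subprocess

def last_author(repo_path: str, file_path: str) -> str:
    # %ae = author email of the most recent commit touching the file
    return subprocess.check_output(
        ["git", "-C", repo_path, "log", "-1", "--format=%ae", "--", file_path],
        text=True,
    ).strip()

# Example (hypothetical paths): last_author("/srv/infra", "modules/network/security_groups.tf")
```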

Validation

In short, validation is the process of confirming whether a security issue is true or not. It also helps sort out priority and remediation effort. More on this next, as this is something many security teams don't do well.

Pro Tip

Some SCA and SAST tools can automatically create pull and merge requests (PRs and MRs) when obvious dependency issues arise. This can help expedite the change management and remediation process in a big way. Always test your changes before you merge to production to ensure dependency updates don’t cause any adverse effects.

Operationalization of the Vulnerability Data

Typically when we talk about manual security processes, we are talking about Application Pentesting, Incident Detection & Response, Threat Hunting, Bug Bounty, and Manual Code Review. Those should not be forgotten, but there is one more we would like to add:

Vulnerability Triage

This is a process where you have security engineers or knowledgeable developers triage specific and agreed upon vulnerabilities that come out of tooling. There are three phases of this process.

Phase 1: Vulnerability class prioritization. Assuming you have an existing code base, your tools, such as SAST, will probably uncover a smorgasbord of vulnerabilities. You will need to review the tool output and ask the all-important questions: "which issues are actually impactful?" and "what's the complexity of a fix or exploit?". This is usually sorted out with three levers, so to speak. The first is business risk; for instance, if you have a public-facing website, then maybe you never want Cross Site Scripting (XSS) to be present on it. The second is ease of discovery and/or exploitation: how easy it is for an attacker to find and take advantage of the issue. The third is the level of effort to remedy: the amount of time/work to address the issue in development. This is not a prescribed way to prioritize, but one to consider. You may also want to consider the quantity of a specific vulnerability, but generally we have used this as a second ordering function. Last note: you want to agree across the board on which issues to focus on so you can begin backlog pruning.
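If it helps to make those three levers concrete, here is a toy scoring function; the weights and 1-5 scales are our illustrative assumptions, not a standard scoring model.

```python
# A toy sketch of the three prioritization levers described above: business risk,
# ease of discovery/exploitation, and level of effort to remedy.

def priority_score(business_risk: int, exploitability: int, remediation_effort: int) -> float:
    """All inputs on a 1 (low) to 5 (high) scale; a higher score means triage it first."""
    # Weight impact and exploitability up; subtract effort so that, between two
    # otherwise equal issues, the cheaper fix ranks higher in the queue.
    return (2.0 * business_risk) + (1.5 * exploitability) - (0.5 * remediation_effort)

# Example: XSS on a public website -- high business risk (5), easy to find (4),
# moderate fix effort (3) -> priority_score(5, 4, 3) == 14.5
```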

Phase 2: Backlog pruning and correlation. After you pick the security issues to focus on, it's time to go through the backlog. This is vital for phase 3. A clean backlog lets you establish confidence that the tool is finding the issue accurately and helps determine the best remediation plan. Usually, a security engineer or knowledgeable developer can clear out the backlog. Additionally, if there are multiple separate issues that are related to one another, they should be reported and fixed at the same time. During this process, an engineer might determine an issue is not trivial to resolve but can push a workaround for temporary remediation, such as a firewall rule.

Phase 3: Ongoing vulnerability monitoring. This is the phase where your engineers can feel confident that when an issue arises, it is worth responding to and fixing. You can also leverage the decisions you made about a vulnerability class to build high-fidelity alerts with your SOC. In this phase, you will seek to continuously improve runbooks and remediation plans and track vital metrics (more on this later) such as time to remediate.

Pro Tips

  • Build run books for how best to validate and handle these vulnerabilities as this will be key for consistency, as well as scale.
  • Open tickets with any commercial tool vendors when you find false positives or false negatives. This will help improve the tools for the broader community that uses them.
  • Leverage a ticketing system to keep track of the issues from the tool. This way you can track and measure the process far more efficiently.

Developer Driven Security Remediation

After you have picked a vulnerability class or two and have completed all of the phases of Vulnerability Triage, it's time to federate the work of fixing the issues. With (hopefully) only validated true positives remaining, you can start the process of having developers own the remediation of vulnerabilities on an ongoing basis. Over time, as you expand the vulnerability classes, the quality of their code should improve and the risk to the business should drop. This will also help new developers who aren't familiar with security vulnerabilities.

Having developers own their security remediations will allow security engineers to focus on improving the tooling, standards for remediation, education, and step into more of a partnership with developers. This process can also help identify security champions and create clarity around what developers are responsible for from a security perspective.

Executive Health & Hygiene Reporting

We don't live in a perfect world, and there will be instances where risk acceptance or management will need to take place. This is expected. You should funnel top-level metrics to senior leadership and build a risk acceptance process so they can keep a pulse on their development teams. This will drive accountability at all levels.

Scan Tool & Manual Process Performance Reporting

Once you have built out one, some, or all of these tools and processes, you have officially put security in the critical path of product and software delivery. This means you need to make sure your security controls are healthy and highly performant. If your security scans are failing or slow, someone should be alerted and work to resolve the issue. We recommend working with your Site Reliability Engineers to establish the right SLAs, SLOs, and contingency plans.
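A minimal sketch of what that health check could look like follows, with placeholder SLO thresholds and the actual alerting left to whatever paging system you already run.

```python
# A minimal sketch of a scan-health check against example SLOs. The thresholds
# and the alert hand-off are placeholders -- wire this to your real metrics and pager.

MAX_SCAN_MINUTES = 15       # example SLO: scans complete within 15 minutes
MAX_FAILURE_RATE = 0.05     # example SLO: fewer than 5% of scan runs fail

def check_scan_health(durations_minutes: list[float], failures: int, total_runs: int) -> list[str]:
    alerts = []
    slow = [d for d in durations_minutes if d > MAX_SCAN_MINUTES]
    if slow:
        alerts.append(f"{len(slow)} scans exceeded {MAX_SCAN_MINUTES} minutes")
    if total_runs and failures / total_runs > MAX_FAILURE_RATE:
        alerts.append(f"scan failure rate {failures / total_runs:.1%} exceeds SLO")
    return alerts   # hand these to whatever alerting/paging system you already use
```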

Putting the processes together

The confluence of these manual processes, plus the ones mentioned earlier (e.g. pentest or bug bounty), should uncover gaps in your automated detections. As your automated tooling improves, your manual processes should improve along with it. It's important for your security program to report and monitor the quality, performance, and health of all automated tools and manual processes.

Building a Resilient Application & Cloud Security Program

Quantifying Security Program Resiliency

Measurements are vital to maturing practically anything, security tools and processes included.

Here are a few metrics to get you started:

  • Issues identified and fixed via early threat modeling or by scanning a pre-production release. The inverse also applies: issues that made it to production and were only identified there, via an incident or bug bounty, rather than before release.
  • Vulnerability classes (hard coded secrets, XSS, public s3 buckets, etc.) detected by tool or tools and in aggregate
  • Security issues fixed, risk/severity reduced, or risk accepted on an ongoing basis by tools and in aggregate
  • Tool accuracy based on true positives, false positives, and false negatives
  • Tool engagement/usage over time. (This will help answer: Are developers using the tool locally before they ship code/product?)
  • Accuracy by vulnerability class
  • Total alerts/issues found by tools/automation vs manual process (also track what manual process uncovered the issue)
  • High or critical issues based on vulnerability/misconfiguration class fixed before a production release
  • Security issue classes where net severity adjustments were made
  • Issues deduplicated by tool or issue class

Note: These all should be tracked over time.
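To show how lightweight this can be, here is a sketch computing two of the metrics above, mean time to remediate and accuracy by vulnerability class, from hypothetical finding records pulled out of a ticketing or CVM system.

```python
# A minimal sketch of two metrics from the list above. The record fields
# ("opened_at", "fixed_at", "triage_outcome", ...) are hypothetical; pull them
# from your ticketing or central vulnerability management system.

from collections import defaultdict
from datetime import datetime
from statistics import mean

def mean_time_to_remediate_days(findings: list[dict]) -> float:
    durations = [
        (datetime.fromisoformat(f["fixed_at"]) - datetime.fromisoformat(f["opened_at"])).days
        for f in findings
        if f.get("fixed_at")
    ]
    return mean(durations) if durations else 0.0

def accuracy_by_class(findings: list[dict]) -> dict[str, float]:
    # True positive rate per vulnerability class, based on triage outcomes.
    counts: dict[str, list[int]] = defaultdict(lambda: [0, 0])   # class -> [true positives, total]
    for f in findings:
        bucket = counts[f["vulnerability_class"]]
        bucket[1] += 1
        if f.get("triage_outcome") == "true_positive":
            bucket[0] += 1
    return {cls: tp / total for cls, (tp, total) in counts.items()}
```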

Takeaways

  • There are A LOT of automated security tools on the market. Make sure you perform your due diligence before implementing any. If you are just deploying a tool to say you have it, then you should strongly consider the implications of what that means in the long run.
  • In order to achieve the highest level of trust, you will need to find the right places to leverage comprehensive scans vs targeted ones. If scans take too long and don’t render useful results, then what’s the point? Use attributes about the organization you are trying to secure to help decide what to look for and prevent from making it to production.
  • The data from security tools can be cumbersome to manage and report on if handled through each tool directly. Developers and security practitioners will waste time hopping from tool to tool. Centralizing vulnerabilities, deduplicating them, and attributing them is a key way to reduce this pain while driving efficiency back into the tools.
  • If you can get the majority of your organization, namely product, development, and security, to agree on which security issues are "must fix", then you can put the wheels in motion to make sure those issues get identified, reported, and fixed on an ongoing basis. Granted, this will take a lot of effort from your security team upfront but will yield immense dividends in the long run.
  • Your manual security processes will help you bolster a high quality detection set. If there was a security incident or a bug report, use those as an opportunity to make sure whatever was found can be found by your tools in the future.
  • It's easier than you would think to start quantifying application and cloud security program resiliency. There are basic measurements like time to remediate or vulnerability class accuracy that will start painting a clear picture.

Words of Wisdom

As the sayings go: perfect is the enemy of good, and the first step to perfection is starting. Those who build simple but effective mechanisms for continuous improvement ultimately endure and often achieve greatness.

It is impossible to improve any process until it is standardized. If the process is shifting from here to there, then any improvement will just be one more variation that is occasionally used and mostly ignored. One must standardize, and thus stabilize the process, before continuous improvement can be made. - Masaaki Imai

Contributions and Thanks

A special thanks to those who helped peer review and make this post as useful as it is: Abhishek Patole, Marshall Hallenbeck, Jeremy Shulman, Tim Lam, Brandon Wu, Tomer Schwartz, Henry Smith, John Nichols, Sean McConnell, and Luke Matarazzo.

A special thanks to you, the reader. We hope you benefited from it in some way and we want everyone to be successful at this. While these posts aren’t a silver bullet, we hope they get you started.

Please do follow our page if you enjoy this and our other posts. More to come!
