Mastering Vulnerability Elimination Starts With The Basics

Published in better appsec · 17 min read · Feb 15, 2024

From detect to protect: an overview of how to eliminate vulnerabilities from your Application and Cloud Security programs.

By: James Chiappetta with contributions from Vishal Jindal

Important Disclaimers:

  • The opinions stated here are the authors’ own, not necessarily those of past, current, or future employers.
  • The mention of books, tools, websites, and/or products/services is not a formal endorsement, and the authors have not taken any incentives to mention them.
  • Recommendations are based on past experiences and not a reflection of future ones.
  • Artificial Intelligence (AI) is a rapidly developing space and therefore any recommendations, opinions, or examples are based on present knowledge and subject to change.

Background

It may come as no surprise that CISOs, security engineers, developers, and auditors alike share a very common obligation: to manage the ever-changing state of security issues for software, cloud infrastructure, systems, and the supply chain. They rely on automated security scanning tools and manual processes to find these issues. They then report, triage, and build or validate remediation strategies. There are many ways to go about this, but no matter which way you choose, you will inevitably need to deal with limitations or changes in tools, coverage, business priorities, threats, and many other factors.

There is an ever-growing need for a robust vulnerability management and remediation program, driven by maturing compliance requirements and increasingly sophisticated supply chain compromises and 0-days (e.g. SolarWinds, Log4Shell). Not all vulnerabilities are critical, and as much as we would like to think otherwise, it is simply not practical to fix all of them. So when there isn't a 0-day forcing your hand, how do you know what to fix and what to do about those vulnerabilities? We have seen many ways to approach this problem, and this post will take a closer look at a strategy for creating an efficient vulnerability management process.

Let’s get to it!

Legacy Vulnerability Remediation Process

If you rewind time to the early 2000s and compare it to today, you will see some obvious differences. As it relates to Application & Cloud Security, there really weren’t any defined “clouds”, other than the ones in the sky. Applications tended to be compiled software, and static web applications were just beginning to be “modernized” with JavaScript.

During those times, most security teams (if they existed at all) followed a basic scan-and-patch formula: rely heavily on security scanners, then hand the reports over to engineering teams to fix.

Legacy Vulnerability Remediation

Why? Many reasons, but to keep it simple: security teams were plugging leaks as fast as they were finding them, and detection capabilities were not as advanced as they are today. They focused on largely manual, point-in-time detection of security vulnerabilities and worked with development teams on fixing those vulnerabilities with a narrow scope (i.e. only fixing a vulnerability for applications identified as vulnerable as part of a manual process). For example, if a vulnerability was found through a security scan or an attack against a single endpoint, then the security team worked with the development team to get that one issue fixed so they could move on to the next thing.

Another issue was the quality of the fix. In the early 2000s, the security programs we remember were based on active scans with tools like Nessus or Nmap. Detections were limited, and most developers weren't well versed in security, so those "fixes" often amounted to changes that stopped the scanner from flagging the issue without protecting against a real attack. Changing the listening port of a common service to a non-standard port is one example.

Things have changed since then…

Today’s Vulnerability Elimination Process

We believe a few core things have changed since the early 2000s:

  1. Development velocity has broadly increased, which has resulted in an increase in complexity. Having a single security expert in every single aspect of development simply doesn’t scale. If a tech organization is doing 50 releases a day, with code in 7 different languages, for applications running in 3 cloud providers, how can any security process keep up? Even if there were plenty of talent, we would still be doomed without automation.
  2. Developers are much more used to thinking holistically about bugs in general (not just security), and they can now systematically factor mitigations into their code.
  3. Security tools have improved a lot. New analysis technologies are now abundant, and there is much more expertise in things like variant analysis, so security practitioners can zoom out from individual vulnerabilities faster.
  4. The threat landscape has changed and continues to change at an ever-growing rate.

Now you know how we got where we are today and some challenges we face in the modern world. So, how do we build a machine that helps us manage and eventually eliminate vulnerabilities iteratively? Great question!

This brings us to our “modern” vulnerability elimination model.

  1. Detect (& Prevent) — Have the process and technology to identify security vulnerabilities accurately, efficiently, and comprehensively. Prevent vulnerabilities flagged by trusted, high-fidelity detections as early as possible.
  2. Triage — Filter and contextualize identified security vulnerabilities to reduce noise and plan for remediation.
  3. Fix — Use dashboards, policies and enforcement mechanisms to ensure identified security vulnerabilities are addressed in a timely manner.
  4. Improve — Address gaps in automation or process to ensure vulnerabilities don’t regress.

Note: Please don’t take this as some sort of panacea; rather, treat it as a reference framework.
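
To make this framework a bit more concrete, here is a minimal sketch (our own illustration, not a prescribed data model) of how a finding could be represented as it moves through the four stages:

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum


class Stage(Enum):
    DETECTED = "detect"
    TRIAGED = "triage"
    FIXED = "fix"
    IMPROVED = "improve"  # regression detection / prevention in place


@dataclass
class Finding:
    finding_id: str
    title: str
    source: str      # e.g. "sast", "dast", "bug-bounty", "pentest"
    severity: str    # e.g. "critical", "high", "medium", "low"
    asset: str       # application, repo, or cloud resource
    stage: Stage = Stage.DETECTED
    detected_at: datetime = field(default_factory=datetime.utcnow)
    history: list = field(default_factory=list)

    def advance(self, new_stage: Stage, note: str = "") -> None:
        """Record a stage transition so the full lifecycle stays auditable."""
        self.history.append((self.stage, new_stage, datetime.utcnow(), note))
        self.stage = new_stage
```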

Detect (& Prevent)

We aren’t going to say anything new here, but understanding the business you are trying to protect is the best first step before you even start detecting anything. You wouldn’t expect firefighters to drive around town looking for fires. Rather, they rely on smoke and carbon monoxide detectors, building inspections, sprinkler systems, and other sources to inform them and reduce the impact of potential fires. The fire department’s time is best spent working with the city’s safety departments to map out the city they are supposed to protect, ensure they have the right level of detections and mitigations in place, and determine which areas are high risk versus low risk.

Manual & Automated Security Processes — Detecting vulnerabilities seems like one of those things that should be as simple as pointing a scanner at something and seeing what it returns. We think (obviously) that is not 100% correct, nor sufficient. Every organization requires manual processes, like penetration testing, red teaming, bug bounty, and security reviews. Those, in conjunction with automated processes and security scanning/detections, are vital to identifying security vulnerabilities. These parallel processes help to rigorously test applications and identify simple or complex security vulnerabilities. In the past we have posted about building a resilient security program, covering manual and automated security processes extensively. Please go take a look at that and come back here (we will wait) before proceeding to the next leg of the journey: triaging!

Pipeline Prevention — If you are mature enough to support it, then the strongest defense against a vulnerability is to never have one in the first place. We understand every organization is different and will have varying code pipeline technology and standards. But if you can, placing checks and blocks in the code pipeline is an immensely effective way to prevent vulnerabilities, though not a replacement for the process we have outlined. This comes at the cost of putting security in the critical path, so you may want to be very selective about what you block. You will also want a self-service resolution path for developers, and if people are closing issues as false positives, you should triage those closures to improve your detections.
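
To illustrate what such a check could look like, here is a minimal sketch of a pipeline gate that reads a scanner’s JSON output and fails the build only on high-confidence, high-severity findings. The file name, JSON shape, and blocking threshold are assumptions for this example; adapt them to whatever your scanners actually emit:

```python
import json
import sys

# Assumed shape: a JSON file containing a list of findings, each carrying
# "severity", "confidence", and "suppressed" fields. Real scanners differ.
BLOCKING_SEVERITIES = {"critical", "high"}


def should_block(findings: list[dict]) -> list[dict]:
    """Return only the findings that should fail the pipeline."""
    return [
        f for f in findings
        if f.get("severity", "").lower() in BLOCKING_SEVERITIES
        and f.get("confidence", "").lower() == "high"
        and not f.get("suppressed", False)  # honor triaged false positives
    ]


def main(report_path: str) -> None:
    with open(report_path) as fh:
        findings = json.load(fh)

    blockers = should_block(findings)
    for f in blockers:
        print(f"[BLOCKED] {f.get('rule_id')}: {f.get('message')} ({f.get('file')})")

    # A non-zero exit fails the CI job; everything else is reported, not blocking.
    sys.exit(1 if blockers else 0)


if __name__ == "__main__":
    main(sys.argv[1] if len(sys.argv) > 1 else "scan-results.json")
```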

Pro Tips:

  • Although manual processes can vary, it is key for organizations to invest in security automation. Automated scanners not only help detect simple vulnerabilities (like hardcoded secrets) but they can help ensure vulnerabilities you expect to be patched remain that way. Automation is also the answer to scale.
  • Understand the threats specific to your company and environment — how they work and what they target. Work with your Security Operations team to get visibility on what past attacks have occurred and build a pipeline of threat data.

Triage

Once you have a funnel with incoming security vulnerabilities, you will need to build a triage function. In our experience, there are several main components of triage: Accuracy, Enrichment, Correlation, Remediation Planning, and Prioritization. Let’s quickly step through each of these.

Accuracy — In our post on Democratizing Security: Application Security Scanning, we take you through some important details on handling the output from security scanners. There are three main outcomes:

  • False Positives — These are findings that are inaccurate.
  • False Negatives — These are vulnerabilities missed by scans and then found during manual security processes by security engineers or developers. This data is instrumental in crafting custom detections that can identify similar issues in the future.
  • True Positives — These are accurately identified security findings that need to be addressed.

Note: Knowing the anatomy of your targets can improve the fidelity of your scans as well. For example, if you are doing a DAST scan on an API, the scan should be tailored to API use cases. Doing this can speed up your scans while finding more vulnerabilities that are target-specific.

Enrichment — Break down the attack paths to the vulnerabilities you are seeing. Are they even reachable from the network or internet? Is there an exploit, or can you prove that the vulnerability can be leveraged in some way? Perform blast radius analysis: assuming a vulnerability is reachable and can be leveraged, then what? Can you gain further system, network, or data access? Are there any indicators from your SOC team’s threat intel program? These are some important questions to consider when working through an initial pass on a vulnerability. This context will unlock more accurate prioritization so you don’t waste developers’ time fixing vulnerabilities they shouldn’t focus on, which Datadog showed is a big problem in their 2023 State of AppSec report.
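
As one hedged example of folding that context into prioritization, the sketch below adjusts a base score using reachability, exploit availability, blast radius, and threat intel signals. The weights and parameter names are illustrative assumptions, not a standard:

```python
def contextual_priority(base_cvss: float,
                        internet_reachable: bool,
                        exploit_available: bool,
                        sensitive_data_in_blast_radius: bool,
                        threat_intel_hits: int = 0) -> float:
    """Nudge a base score up or down using environmental context.

    Purely illustrative weights; tune them to your own environment.
    """
    score = base_cvss
    if not internet_reachable:
        score -= 2.0  # hard to reach means lower urgency
    if exploit_available:
        score += 1.5  # public exploit or proof of concept exists
    if sensitive_data_in_blast_radius:
        score += 1.0  # compromise would expose sensitive data or systems
    if threat_intel_hits:
        score += 0.5  # SOC / threat intel has seen related activity
    return max(0.0, min(10.0, round(score, 1)))


# Example: a CVSS 7.5 finding that is internal-only with no known exploit
print(contextual_priority(7.5, internet_reachable=False,
                          exploit_available=False,
                          sensitive_data_in_blast_radius=False))  # 5.5
```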

Vulnerability Correlation — If you have a legitimate security vulnerability, was it detected by a single tool or by many security tools/processes? Deduplicate the specific vulnerability as best as you can (there are commercial tools to help with this). Is the vulnerability specific to a single application or system? Or is it present across a subset of, or even worse, all applications? Performing this correlation is necessary for scoping how much work it will take to fully address the vulnerability or class of issues. Here are a few ways to think about how to correlate vulnerabilities or vulnerability classes across applications:

  • Function Specific — Vulnerability is localized to a proprietary function or set of specific functions in an app that do something similar (e.g. SQL Injections due to lack of a parameterized SQL query for a database call).
  • Application (App) Specific — Vulnerability applies generally across a single app (e.g. nearly all SQL calls aren’t parameterized).
  • Framework Specific — Vulnerability applies to a subset of apps (e.g. Log4Shell on 50% of all running apps).
  • Systemic Issue — Vulnerability impacts nearly all applications or infrastructure (e.g. Log4Shell on all apps and numerous third party systems).

Vulnerability Correlation
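
Here is a minimal sketch of how findings could be deduplicated and then bucketed by how widespread a vulnerability class is. The fingerprint fields and the scope thresholds are assumptions for illustration only:

```python
import hashlib
from collections import defaultdict


def fingerprint(finding: dict) -> str:
    """Stable identity for a finding so duplicates from multiple tools collapse."""
    key = "|".join([
        finding.get("vuln_class", ""),  # e.g. "sql-injection"
        finding.get("app", ""),
        finding.get("location", ""),    # file/function or endpoint
    ])
    return hashlib.sha256(key.encode()).hexdigest()


def correlate(findings: list[dict]) -> dict:
    """Deduplicate findings, then estimate how widespread each class is."""
    deduped = {fingerprint(f): f for f in findings}.values()

    apps_by_class = defaultdict(set)
    for f in deduped:
        apps_by_class[f["vuln_class"]].add(f["app"])

    total_apps = len({f["app"] for f in deduped}) or 1
    scopes = {}
    for vuln_class, apps in apps_by_class.items():
        ratio = len(apps) / total_apps
        if len(apps) == 1:
            scopes[vuln_class] = "app-specific"
        elif ratio < 0.5:
            scopes[vuln_class] = "framework/subset"
        else:
            scopes[vuln_class] = "systemic"
    return scopes
```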

Remediation Planning — This is where you put in the work to sort out what needs to be done. What is the level of effort, and is this a point fix or a systemic fix? That’s where correlating the vulnerability across your environment comes in handy. Will you need training, documentation, or a standard library to address the vulnerability or class of vulnerability? Be thorough and thoughtful. It will go a long way with your internal stakeholders (likely developers). One last thing: we have almost always had problems finding the owner to assign a security issue to. Correlating applications to teams and tech leads can be a burden for triagers. Our advice is to work with the sources of truth in your organization and simply reach out to product owners or tech leads before any scheduling, handoff, or ticket creation happens.

Architectural and Design Changes — If you have taken the time to perform vulnerability correlation and have identified an issue that is systemic, then this is an opportunity to see if an architectural change or refactor could be the most effective way to eradicate a specific type of vulnerability altogether. If you keep seeing similar vulnerabilities as you improve your detections, and you find opportunities to reduce the number of places where they can happen, then this is a clear opportunity to fix them once and for (almost) all. Sometimes projects like this will be very expensive (like migrating off of a specific compute type or a compiler change) and sometimes they’re quite cheap with high ROI (like eliminating frequent misconfigurations by changing to safe defaults). You will need to perform your own cost-benefit analysis here, but ultimately these types of projects can lead to strong partnerships with development teams.

Prioritization — Now it’s time to review the vulnerability with the development team and discuss real world risk and priority against other work on their plate. Make sure a ticket with vulnerability details and retest steps is on the dev team’s board.

The first order of business is usually calculating a risk score. This is where a CVSS calculator comes in handy.
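
For reference, here is a small sketch of the CVSS v3.1 base score calculation for the common scope-unchanged case, based on the published specification (the roundup helper is slightly simplified, so verify results against the official calculator):

```python
import math

# CVSS v3.1 metric weights (scope unchanged)
AV = {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.20}   # Attack Vector
AC = {"L": 0.77, "H": 0.44}                         # Attack Complexity
PR = {"N": 0.85, "L": 0.62, "H": 0.27}              # Privileges Required
UI = {"N": 0.85, "R": 0.62}                         # User Interaction
CIA = {"H": 0.56, "L": 0.22, "N": 0.0}              # C / I / A impact


def roundup(x: float) -> float:
    """Smallest number with one decimal place that is >= x (simplified)."""
    return math.ceil(x * 10) / 10


def base_score(av: str, ac: str, pr: str, ui: str,
               c: str, i: str, a: str) -> float:
    iss = 1 - (1 - CIA[c]) * (1 - CIA[i]) * (1 - CIA[a])
    impact = 6.42 * iss
    exploitability = 8.22 * AV[av] * AC[ac] * PR[pr] * UI[ui]
    if impact <= 0:
        return 0.0
    return roundup(min(impact + exploitability, 10))


# Example: AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H -> 9.8
print(base_score("N", "L", "N", "N", "H", "H", "H"))
```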

Secondly, it is important to ensure that every identified vulnerability has enough information (priority, reproducing steps, remediation information, etc.) for a developer to be able to take action on it. Just telling a developer an XSS vulnerability needs to be fixed against an endpoint doesn’t help and will result in the dev team wasting their time and losing trust in your vulnerability management process.

Along with enrichment, it is important to validate the quality of findings to ensure you are not bugging developers with false positives. Without proper triage, you will end up with a bloated backlog of findings that cannot be acted upon, and you will lose developer trust.

Fix

Once you have identified true positives and enriched them with actionable information, it is time to push those results to respective development teams. But, the work doesn’t end there. It is important to have a proper tracking and remediation process to ensure a malicious user doesn’t act on that vulnerability before your dev team takes action on it.

Remediation Handoff & Implementation — In most cases, at the other end of a security issue is a developer or engineer who needs to make the fix while juggling many other priorities. Production downtime may be more important to them right now than your critical XSS. So, if you are a security engineer, this is an opportunity to avoid the “over the fence” approach and instead be a partner. Often this will lead to wider and more valuable outcomes. Ultimately, you will need to figure out the best way to do handoffs, but developers and security should always agree on the priority, the remediation plan, and the time it will take to fix. This should be captured in a work ticket for tracking purposes.

Retest and Validation — When there is a completed fix in place, it is time to make sure the issue is fully remediated. This can be done by a senior developer or a security engineer. There are commercial tools incorporating “retest” functionality so that a new scan isn’t needed, but an engineer or developer can click a button to get instant verification.

Dashboards — These help track open security vulnerabilities by criticality and how long they have been open. It is also helpful to track the burn down of those vulnerabilities to measure the effectiveness of the vulnerability management process. Dashboarding can help identify, in real time, particular teams that need focused training, or the inverse, teams that are no longer introducing vulnerabilities after training. It can also identify whether the organization as a whole is prone to missing a particular coding standard.

Service Level Agreements (SLAs) — It is important to document a policy defining time-to-fix based on criticality of a vulnerability. Once the policy is documented, it is also important to educate development teams on the policy.
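
As an illustration, an SLA policy can be as simple as a severity-to-days mapping plus a check for overdue findings. The timeframes below are assumptions; pick numbers that match your organization’s risk appetite and compliance obligations:

```python
from datetime import datetime, timedelta

# Illustrative time-to-fix policy; adjust to your organization's requirements.
SLA_DAYS = {
    "critical": 7,
    "high": 30,
    "medium": 90,
    "low": 180,
}


def due_date(severity: str, detected_at: datetime) -> datetime:
    return detected_at + timedelta(days=SLA_DAYS[severity.lower()])


def is_overdue(severity: str, detected_at: datetime,
               now: datetime | None = None) -> bool:
    return (now or datetime.utcnow()) > due_date(severity, detected_at)


# Example: a high-severity finding detected 45 days ago is past its 30-day SLA
print(is_overdue("high", datetime.utcnow() - timedelta(days=45)))  # True
```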

Enforcement & Risk Management — Once you have defined and advertised SLAs, it is important to build enforcement mechanisms to ensure development teams are taking action within the defined SLAs. This ensures that security vulnerabilities don’t fall out of tracking mechanisms, critical vulnerabilities don’t go unaddressed, and you build an accountable process that helps improve security culture across the org. A good risk acceptance framework is also critical to include. This allows safe handling of cases where remediation within SLA is not a good time investment for the business, for example when code will be decommissioned in a short period of time or when getting a feature/product to market by a certain date is mission critical. There should be a leadership-approved process that acknowledges Security did their job in reporting the finding and transfers ownership of the risk to the development team, who then decide how and when to best deal with it.

Note: Many of these items require leadership buy-in and it is important that your leadership cares about security.

Improve

So, we have identified a vulnerability, triaged it and filed it for the development team to fix it. But… Was that the only application vulnerable to that exploit? What if a development team fixes it and it shows up again next year? These are the questions that should keep you up at night and automation is key to answering them.

Once you have received a security report, it is important to determine if there is a way to identify that vulnerability using available or custom scanners. There are many security scanners available in the market. You can easily script something in Python to identify risks specific to your org (for example a custom header field that you found to be vulnerable). These automations can help you scale up detections while managing cost and prevent regression of identified vulnerabilities in the future. This is where you turn one vulnerability into a hundred and eliminate those at scale.
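
As a minimal sketch of such a custom check, a few lines of Python with the requests library can regression-test an issue you have already fixed. The header name and target URLs below are hypothetical placeholders:

```python
import requests

# Hypothetical example: a response header your team previously found to leak
# internal hostnames. The header name and URLs are placeholders only.
VULNERABLE_HEADER = "X-Internal-Host"
TARGETS = [
    "https://app.example.com/health",
    "https://api.example.com/v1/status",
]


def check_targets(targets: list[str]) -> list[str]:
    """Return the targets that still expose the known-bad header."""
    still_vulnerable = []
    for url in targets:
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException as exc:
            print(f"[WARN] could not reach {url}: {exc}")
            continue
        if VULNERABLE_HEADER in resp.headers:
            still_vulnerable.append(url)
    return still_vulnerable


if __name__ == "__main__":
    for url in check_targets(TARGETS):
        print(f"[FINDING] {url} still returns {VULNERABLE_HEADER}")
```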

Training & Standards — Creating a central repository of security standards and training for developers is one of the best ways to drive effective security change at scale in a company. These act as the rules of the road and allow for even better security detections. Find high-risk and high-value issues to focus on. These will likely be “evergreen” for your company. Issues to consider: no secrets checked into source code, no public access for AWS S3/EC2/RDS/etc., no XSS on public websites, and no missing authorization checks on functions.

Detection Improvement — If you have a set list of “non-negotiables” (see some examples in the previous section), then you can now better find those issues and others like them. You have focus and support from the broader organization. An additional benefit beyond better detections is the avoidance of detection fatigue. Don’t get us wrong, there is a time and place for reviewing all the output from your scanners. This is where we think AI-assisted triage will have a major impact (more on this in the future).

Detection Gap & Root Cause Analysis (RCA) — If you have received a report about a security issue you missed (usually found at runtime), then it’s time to figure out why your manual and automated security processes didn’t catch it earlier in the Software Development Lifecycle (SDLC), as seen in the diagram below. If you are just getting started with your security program, then you may be doing a lot of RCAs. Make sure you have an efficient process. If there was a gap for a specific application, then you may also want to consider looping that team into the analysis process.

SDLC Alignment

However you want to look at it, there is a substantial amount of work that needs to go into embedding security detections across the SDLC and each scanner, tool, or process is an opportunity to catch security vulnerabilities.

Pro Tip: Remember, if you have automated detection for a vulnerability, feed that into the detection stage and keep letting that machine run.

Performance Measurement — We have alluded to many possible measures throughout the post. We have also written a couple of times (1)(2) about KPIs, so refer to those as well. For the sake of bringing a finer point to what to measure over time:

  • Total opened vs. closed vulnerabilities by type or class.
  • Specific CVEs or CWEs detected (or prevented) in early stage scanning (SAST/SCA) vs. late stage (DAST/runtime scanning).
  • Dollar amount and/or number of instances of a type/class of vulnerability found in Bug Bounty.
  • Developer training courses taken by vulnerability type vs. newly introduced vulnerabilities.
  • Vulnerability exploit attempts seen by the WAF vs. open vulnerabilities.
  • Vulnerabilities actioned (fixed/partially fixed/accepted/etc.) by developers in early stage vs. late stage detection.
  • Open vulnerabilities (Auth/XSS/CORS/etc.) that can be addressed with existing security standards or libraries.
  • (If you have a central code pipeline) Number of vulnerabilities prevented in code scanning from reaching the main/production branch.
  • Vulnerability classes that are growing or shrinking over time.
  • Total number of vulnerabilities detected with automation vs. manual assessment.
  • Vulnerability classes detected with SAST vs. DAST.

Note: There are many (many) more metrics and KPIs to track. These are simply illustrative of what you could track.
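
To illustrate how a couple of these could be computed from a simple findings export, here is a short sketch (the field names and source categories are assumptions about your data model):

```python
from collections import Counter

# Assumed export shape: one dict per finding with "class", "status", and "source".
findings = [
    {"class": "xss", "status": "open", "source": "dast"},
    {"class": "xss", "status": "closed", "source": "sast"},
    {"class": "sqli", "status": "closed", "source": "sast"},
    {"class": "secrets", "status": "open", "source": "pre-commit"},
]

# Open vs. closed vulnerabilities by class
by_class = Counter((f["class"], f["status"]) for f in findings)
print(by_class)

# Share of findings caught by early-stage (pre-production) detection
early_sources = {"sast", "sca", "pre-commit"}
early = sum(1 for f in findings if f["source"] in early_sources)
print(f"Detected early: {early}/{len(findings)}")
```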

Data Engineering & AI

You didn’t think we would end it without talking about how AI is going to radically change all of this, did you?

Well, if you did, sorry; we will keep it brief. Graph databases, vector databases, machine learning, and LLMs have started to change the way we manage the mass of data that goes into all of this. Our previous three posts (1)(2)(3) can start to help you better understand how. The most important thing to note is that these technological advancements don’t magically get all the work that goes into vulnerability remediation done. But they certainly help improve aspects of the job and reduce the time it takes. This ultimately means that, long term, we will hopefully spend less time on time-intensive vulnerability management practices.

Takeaways

  1. Eliminating vulnerabilities is not easy, and a mature vulnerability management program is a must for eliminating business-critical vulnerabilities in your organization. If you don’t start by mapping the business and technology landscape, then you may struggle to gain traction. But if you do, then from there you can figure out what is most important to detect and prevent using both manual and automated security processes.
  2. Don’t sit on vulnerabilities and don’t just chuck them over the fence. Your vulnerability triage process will allow you to weed out junk and improve tooling and process. More importantly, it will help you get to what actually needs to be fixed and what it will take to fix it.
  3. Finding opportunities to better partner with developers and eliminate vulnerabilities for good involves many steps. Correlating vulnerabilities, building remediation plans, properly assigning severities, and defining reasonable SLA policies is a start.
  4. Come up with a remediation plan and build durable point solutions that are low friction for the development community at your organization. This is a chance to work and partner to solve a common problem together. Hold vulnerability owners accountable. They should either fix the vulnerability or document business approval for accepting the risk past defined SLA.
  5. If a major architecture or vulnerability remediation effort is needed, then that will require effective communication outward and a dashboard so that the entire organization can see progress. Build training to ensure developers know how to avoid introducing “non-negotiable” issues early.
  6. Automation is key. Put preventions in place if and where you can. This will ensure no new issues make it into code, or at least into production. If you do, then give developers a self-service resolution path to unblock themselves.
  7. Constantly look for opportunities to improve the entire vulnerability management and elimination process. Every rotation is a chance to raise the bar for security and quality. Having the right measurements will help you both demonstrate value and build trust.
  8. And, yes, AI will undoubtedly help reduce the time and effort to do ALL of this over time, and we will explore this further in future posts.

Words of Wisdom

At the most basic level, security engineers exist to aid in the identification, remediation, and elimination of security risks. If you are a security engineer or have worked closely with them, then you know it’s easy to end up like a dog chasing its tail. It’s hard work to do the job and do it well at scale. We encourage you all to keep after it.

“Many of life’s failures are people who did not realize how close they were to success when they gave up.” — Thomas Edison

Contributions and Thanks

A special thanks to those who helped peer review and make this post as useful as it is: Dor Zusman, Luke Matarazzo, Conor Sherman, Prashant Tunwal, Alex Poloniewicz, Brandon Wu, Eric Ormes, Abhishek Patole, John Nichols, Tomer Schwartz, and Jeremy Shulman.

A special thanks to you, the reader. We hope you benefited from reading this in some way and we want everyone to be successful. While these posts aren’t a silver bullet, we hope they get you started.

Please do follow our page if you enjoy this and our other posts. More to come!
