A Survey on Software Vulnerability Exploitability Assessment

Authors:

Sarah Elder

ACM Computing Surveys, Volume 56, Issue 8

Article No.: 205, Pages 1 - 41

https://doi.org/10.1145/3648610

Published: 26 April 2024

Abstract

Knowing the exploitability and severity of software vulnerabilities helps practitioners prioritize vulnerability mitigation efforts. Researchers have proposed and evaluated many different exploitability assessment methods. The goal of this research is to assist practitioners and researchers in understanding existing methods for assessing vulnerability exploitability through a survey of exploitability assessment literature. We identify three exploitability assessment approaches: assessments based on the original, manual Common Vulnerability Scoring System (CVSS); automated Deterministic assessments; and automated Probabilistic assessments. Other than the original CVSS, the two most common sub-categories are Deterministic, Program State based assessments and Probabilistic learning model assessments.

1 Introduction

The U.S. National Vulnerability Database (NVD) has reported an increase in the number of reported vulnerabilities every year since 2016.¹ Because of the volume of vulnerabilities that must be addressed in software systems, practitioners prioritize their efforts by addressing the most pressing security risks first [9].

Understanding and assessing the exploitability of vulnerabilities is a key component in risk-based vulnerability prioritization by developers and security experts [6, 15, 65]. A frequently referenced definition of exploitability is from the Common Vulnerability Scoring System (CVSS), where exploitability is defined as “the ease and technical means by which the vulnerability can be exploited” [25]. Exploitability assessment can include a wide range of methods, from manual/expertise-based assessment of the Base Exploitability CVSS score to automated program analysis tools and machine learning models.

Literature surveys to date (e.g., [70, 76, 79, 112, 113]) have focused on specific categories of tools or techniques, such as machine learning or binary analysis tools for vulnerability assessment, within which “exploitability assessment” is one of the tasks that such tools are used for. Instead of focusing on the technical details of the techniques and tools, which have already been thoroughly covered, we look at exploitability assessment across techniques.

The goal of this research is to assist practitioners and researchers in understanding existing methods for assessing vulnerability exploitability through a survey of exploitability assessment literature.

Our survey is based on two key concepts: vulnerabilities and exploits. The NVD defines a vulnerability as “a weakness in the computational logic (e.g., code) found in software and hardware components that, when exploited, results in a negative impact to confidentiality, integrity, or availability” [89]. The verb “to exploit” means to take advantage of something for one’s personal gain [90]. In software security, the term exploit, when used as a noun, refers to the series of steps used to exploit a system, particularly when these steps are documented as computer code [13, 18, 32, 39, 118, 121]. Documentation of exploits typically includes the inputs required to trigger specific program states that result in negative security consequences. More than one exploit may be possible using the same vulnerability, and more than one vulnerability may be used by an exploit.

In this survey, we focus on software vulnerabilities. Assessments of hardware, user, or other vulnerability categories [91] are outside the scope of this work. We address the following research question:

RQ: How is the exploitability of software vulnerabilities assessed?

We performed a methodical search of the academic literature to identify papers on exploitability assessments for software vulnerabilities. We reviewed the papers and identified characteristics shared between different software vulnerability assessment methods and characteristics that can be used to differentiate between methods. We categorized the research based on those characteristics.

The contributions of this work include the following:

- A list of academic literature on exploitability assessment for software vulnerabilities.
- An analysis of the characteristics of exploitability assessments of software vulnerabilities that illustrates similarities and differences between assessment methods.

The rest of this article is organized as follows. Section 2 explains key terms and related surveys. Section 3 provides background on CVSS, a standard assessment method commonly referenced within the literature. Section 4 provides the methodology for our search for relevant papers. Section 5 provides an overview of our findings. Sections 6 through 9 describe the groups of exploitability assessment methods in greater detail. Section 10 discusses the study’s limitations. Finally, we present our discussion and conclusion in Sections 11 and 12, respectively.

2 Background and Related Work

In Section 2.1, we provide definitions for common terms that will be reused throughout the article. Then, in Section 2.2, we summarize recent surveys that partially overlap with our work and highlight the differences between the studies.

2.1 Common Terms

Next, we present brief definitions of common terms used throughout the article:

Exploit signature:

Exploit signatures are a way of identifying a specific set of steps that an attacker can use to achieve their goals [5, 62]. Exploit signatures are commonly used by Intrusion Detection systems which scan network traffic, and by Anti-Malware programs which scan computer systems to find signs of malicious attacks [5, 6, 88]. For example, a signature could be a particular pattern of network messages that are always part of the exploit.

ExploitDB:

ExploitDB is a publicly available repository of exploits submitted and maintained by penetration testers and security professionals [38].

CVE:

CVE refers to vulnerabilities that are part of the Common Vulnerabilities and Exposures (CVE) list [22]. The CVE list is managed through the CVE program and used by many tools and organizations, including the NVD [15, 89].

NVD:

The NVD is “the U.S. government repository of standards based vulnerability management data” [89]. The NVD adds CVSS scores to vulnerabilities in the CVE list and provides a public API for accessing the CVE list [89].

OSVDB:

The Open Source Vulnerability Database (OSVDB) was a publicly accessible vulnerability database active between 2002 and 2016 [27, 49, 72]. The OSVDB included CVSS scores from the NVD, as well as information from other databases and information curated for OSVDB [66].

2.2 Related Surveys

Table 1 summarizes six related surveys that partially overlap ours. The first two columns of Table 1 indicate the Reference and Year. The next three columns show the three major categories in our survey. An “X” indicates that the related work includes some of the same literature as our survey for that category. We provide an overview of these categories in Section 5. Next, the Summary column provides a brief overview of the related survey. Finally, the last column explains how we expand upon and differ from the prior work.

Table 1. Related Surveys

Reference	Year	(Manual) CVSS	Deterministic	Probabilistic	Summary	How We Expand upon Their Work (Differences)
Pendleton et al. [91]	2016	X	X	X	Survey and taxonomy of system-level security metrics, including CVSS-based and network topology based metrics	While some of the characteristics examined in this survey also apply to the vulnerability-level metrics we examine, other characteristics are unique to the system or vulnerability level.
Shoshitaishvili et al. [112]	2016		X		Describe and implement offensive and defensive binary analysis techniques from prior security research, including binary analysis for exploitability assessment	We focus on defensive techniques that could be used for assessing and prioritizing vulnerabilities. We examine a wider range of exploitability assessment techniques.
Liu et al. [79]	2022		X		Survey on binary exploitation in Industrial Control Systems, including work on the causes and consequences of exploitation, exploit categories, known cyber incidents, and mitigations	We focus on exploitability assessment for individual vulnerabilities rather than systems, exploring a wider range of contexts.
Sotos Martinez et al. [113]	2021			X	Survey on learning models for vulnerability assessment including detection, exploitability, and propagation	We look at vulnerability exploitability assessment methods beyond learning models. We examine learning models for vulnerability exploitability assessment in relation to other forms of exploitability assessment rather than other (non-exploitability) learning models.
Kotenko et al. [70]	2022			X	Review academic literature on machine learning frameworks for security and bug-finding tasks using program analysis based features
Le et al. [76]	2022			X	Categorize and describe the key tasks performed, data features used, and evaluation methods for vulnerability assessment models using machine learning, deep learning, and natural language processing

As seen in Table 1, the prior work with the greatest overlap with our study is by Pendleton et al. [91]. The categorization of the studies in our survey is heavily influenced by Pendleton et al., which we discuss further in Section 4.2. However, there are also many key differences between our study and theirs. More than half (49 out of 76) of the papers we surveyed were published after 2016, which is the year of publication for the Pendleton et al. study. There are differences in scope, which also reduce the overlap. Pendleton et al. focus on metrics, whereas we focus on the assessment methods used to produce scores like metrics. Pendleton et al. examine system-level metrics, whereas our primary focus is vulnerability-level assessment. Finally, Pendleton et al. focus on all security assessments, whereas we focus specifically on the topic of exploitability, a subset of security. Focusing on exploitability allows us to examine assessments in greater detail. As shown in Table 1, each of the remaining surveys overlaps with ours in only one category.

3 CVSS Background

The CVSS is a standardized vulnerability assessment framework used or referenced by more than half of the papers in this survey. The prevalence of CVSS in exploitability literature influenced our survey process, as we will describe in Section 4.1.1. FIRST (the Forum of Incident Response and Security Teams) [25] organizes the CVSS Special Interest Group (SIG) to maintain and improve the CVSS specification. This overview of CVSS focuses on the exploitability-related metrics and sub-metrics. The full CVSS documentation may be found on the CVSS website (https://www.first.org/cvss). The latest version of CVSS, as of our survey, was v3.1. Figure 1 illustrates which CVSS metrics are exploitability related, which we discuss in Section 3.1, and highlights differences between CVSS v2 and CVSS v3, which we discuss in Section 3.2.

Fig. 1.

3.1 Exploitability-Related CVSS Metrics

CVSS v3.1 has three metric groups: Base, Temporal, and Environmental [25]. The Base metric group includes characteristics of the vulnerability that are constant, regardless of context. The Temporal metric group captures the characteristics of a vulnerability that can change over time, such as whether an exploit has been made publicly available. The Environmental metric group includes vulnerability characteristics that may vary based on the environment, such as whether a vulnerable library is being used by an application that manages sensitive data.

In CVSS v3.1, the Base score is derived from two sub-scores, Exploitability and Impact, and adjusted based on a third Scope score. For this survey, within the Base score, we only consider the Exploitability score and its sub-metrics to be “exploitability related.” The Exploitability sub-score [25] includes four metrics: Attack Vector (AV) indicates how close an attacker must be to be able to exploit the vulnerability; Attack Complexity (AC) indicates whether exploiting the vulnerability requires conditions that the attacker does not directly control; User Interaction (UI) indicates whether a user, other than the attacker, is required to provide inputs to the application for the attacker to exploit the vulnerability; and Privileges Required (PR) indicates the level of logical access needed to exploit the vulnerability. In CVSS v3.1, the Base Exploitability sub-score is calculated as $8.22 \times AV \times AC \times UI \times PR$.
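As a rough illustration of the CVSS v3.1 Base Exploitability formula above, the sketch below multiplies the constant 8.22 by one numeric weight per metric. The weights are not given in this article; they are taken from the FIRST CVSS v3.1 specification as we understand it, and the function name `exploitability_v31` is hypothetical.

```python
# Metric weights as published in the FIRST CVSS v3.1 specification
# (an assumption of this sketch; they are not listed in this article).
AV = {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.20}  # Attack Vector
AC = {"L": 0.77, "H": 0.44}                        # Attack Complexity
UI = {"N": 0.85, "R": 0.62}                        # User Interaction
# The Privileges Required weight depends on whether Scope is Changed.
PR_UNCHANGED = {"N": 0.85, "L": 0.62, "H": 0.27}
PR_CHANGED = {"N": 0.85, "L": 0.68, "H": 0.50}


def exploitability_v31(av, ac, pr, ui, scope_changed=False):
    """Return the unrounded CVSS v3.1 Base Exploitability sub-score."""
    pr_weights = PR_CHANGED if scope_changed else PR_UNCHANGED
    return 8.22 * AV[av] * AC[ac] * UI[ui] * pr_weights[pr]


if __name__ == "__main__":
    # Network-reachable, low complexity, no privileges, no user interaction.
    print(round(exploitability_v31("N", "L", "N", "N"), 1))  # prints 3.9
```

Because every weight is below 1, the sub-score is highest when no barrier applies (network access, low complexity, no privileges, no user interaction) and shrinks as each metric becomes more restrictive.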

The Temporal group includes the Exploit Code Maturity (E) metric, which “measures the likelihood of the vulnerability being attacked” [25]. The other Temporal metrics, Remediation Level (RL) and Report Confidence (RC), are not related to exploitability.

The Environmental metric group [25] includes modified scores of all Base metrics, including the Exploitability metrics: Modified Attack Vector (MAV), Modified Attack Complexity (MAC), Modified User Interaction (MUI), and Modified Privileges Required (MPR). The remaining Environmental metrics are based on Impact. The modified scores update the Base Metric sub-scores based on environmental factors such as using a non-default configuration setting [25].

3.2 Differences in Exploitability-Related Sub-Scores between CVSS v3 and CVSS v2

Many of the publications surveyed refer to CVSS v2 since CVSS v3 was not released until 2015 [23, 101] and not officially supported in the NVD until 2019 [89]. In the Base Exploitability group of CVSS v2 [23], the Attack Vector (AV) was referred to as the Access Vector (also abbreviated AV). In CVSS v2, the Attack Complexity (AC) and User Interaction (UI) metrics were combined in the Access Complexity metric (also abbreviated AC). Finally, instead of a Privileges Required (PR) metric, CVSS v2 had an Authentication (AU) metric, which recorded the minimum number of times an attacker had to provide credentials to an application when exploiting the vulnerability. The CVSS v2 Base score also lacked the Scope (S) sub-metric [23]. The CVSS v2 Base Exploitability sub-score is calculated similarly, as $20 \times AV \times AC \times AU$; that is, using a larger constant than CVSS v3 (20 vs. 8.22).
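The CVSS v2 calculation described above can be sketched in the same way. The weights are again an assumption drawn from the FIRST CVSS v2 specification rather than from this article, and `exploitability_v2` is a hypothetical name.

```python
# Metric weights as published in the FIRST CVSS v2 specification
# (an assumption of this sketch; they are not listed in this article).
ACCESS_VECTOR = {"L": 0.395, "A": 0.646, "N": 1.0}       # Local, Adjacent, Network
ACCESS_COMPLEXITY = {"H": 0.35, "M": 0.61, "L": 0.71}    # High, Medium, Low
AUTHENTICATION = {"M": 0.45, "S": 0.56, "N": 0.704}      # Multiple, Single, None


def exploitability_v2(av, ac, au):
    """Return the unrounded CVSS v2 Base Exploitability sub-score."""
    return 20 * ACCESS_VECTOR[av] * ACCESS_COMPLEXITY[ac] * AUTHENTICATION[au]


if __name__ == "__main__":
    # Network access, low complexity, no authentication required.
    print(round(exploitability_v2("N", "L", "N"), 1))  # prints 10.0
```

Under these weights the v2 sub-score for the least restrictive combination rounds to 10.0, so v2 and v3 Exploitability sub-scores sit on different scales and should not be compared directly.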

In the Temporal metric group, the name of the metric referred to as “Exploitability” in CVSS v2 was updated to “Exploit Code Maturity” (E) in v3 to better reflect what the metric evaluates [23]. Additionally, in the Environmental metric group, CVSS v2 had two distinct sub-metrics, Collateral Damage Potential (CDP) and Target Distribution (TD), which assessed the impacts of exploiting a vulnerability. The CVSS v2 Environmental metric group did not include the Modified Base scores.

3.3 CVSS in Practice

The Base CVSS score is widely available for all vulnerabilities in the CVE list since the NVD [89] adds a Base CVSS score for each CVE. Some vendors, such as Oracle,² also use the Base CVSS score to communicate the severity of vulnerabilities in their systems.

Criticisms of CVSS include the lack of environmental or temporal information in the Base CVSS score [81], and the weakness of the score’s correlation with exploit likelihood and overall risk [5, 6, 14, 136]. These have been addressed in later versions of the CVSS specification, which emphasizes that the Base score is not intended to include temporal or environmental factors, and “CVSS is designed to measure the severity of a vulnerability and should not be used alone to assess risk” [26]. Other criticisms include the unequal distribution of CVSS scores for vulnerabilities in the CVE list [80, 115], where most CVSS scores are relatively high. Several aspects of the score distribution remain unexplored, such as whether the uneven distribution is a product of the source (i.e., whether vulnerabilities in the CVE list tend to be higher severity) or of CVSS itself.

4 Methodology

In Section 4.1, we describe the two-phase process used to collect studies for inclusion in our survey. Once the papers were collected, we organized and categorized the papers as described in Section 4.2.

4.1 Paper Collection Process

We followed a two-phase process for collecting relevant papers, consisting of a Keyword Search using Active (Machine) Learning phase and a Snowballing phase, based on the SYMBALS methodology [120]. Proponents of using active learning for literature surveys found that active learning alone can perform similarly to human researchers, with precision and recall greater than 70% [120, 139]. Combining active learning with Snowballing can produce precision and recall greater than 90% [120].
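For readers less familiar with the two measures quoted above, the following minimal sketch shows how precision and recall could be computed for a literature search. The paper identifiers are made up for illustration, and `precision_recall` is a hypothetical helper, not part of SYMBALS or FAST2.

```python
def precision_recall(selected, relevant):
    """Compute precision and recall of a set of selected papers.

    selected: papers the search process chose to include.
    relevant: papers that truly meet the inclusion criteria.
    """
    selected, relevant = set(selected), set(relevant)
    true_positives = len(selected & relevant)
    precision = true_positives / len(selected) if selected else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical example: 4 papers selected, 5 truly relevant, 3 in common.
    p, r = precision_recall({"p1", "p2", "p3", "p4"}, {"p2", "p3", "p4", "p5", "p6"})
    print(p, r)  # prints 0.75 0.6
```

Precision penalizes irrelevant papers that were included, whereas recall penalizes relevant papers that were missed; a survey process generally cares most about recall.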

The inclusion/exclusion criteria used in both phases of the review are discussed in Section 4.1.1. In the first phase, we use an active (machine) learning tool, FAST2 [139], to analyze the results returned from a keyword search, followed by further review and analysis by three researchers. In the original SYMBALS research by van Haastrecht et al. [120], the authors found that FAST2 consistently performed better than the alternative learning system, ASReview, in terms of recall (60%–90% compared to 40%–70%). We discuss the first phase further in Section 4.1.2. In the second phase, addressed in Section 4.1.3, we performed snowballing [120] to collect the papers citing and the papers cited by the studies returned in the first phase. An overview of the paper collection process is shown in Figure 2.

Fig. 2.

4.1.1 Inclusion/Exclusion Criteria.

Paper selection in both phases of the search leveraged the inclusion/exclusion criteria shown in Table 2. The Screening criteria are based on commonly used criteria for other surveys and literature reviews [68]. The Relevance criteria are based on the goal of our survey. The first column of Table 2 indicates which criteria are Screening criteria and which criteria are Relevance criteria.

Table 2. Inclusion/Exclusion Criteria

	Inclusion Criteria: Include papers if	Exclusion Criteria: Exclude papers if
Screening	The paper is available	We are unable to obtain a copy of the paper
	The entire paper is available in English	The main body of the paper is in a language other than English
	The paper is selected via peer review	The paper is not selected via peer review (e.g., Technical Report)
	The paper is selected via peer review	The paper has been retracted
	The paper is full length	The paper was not full length (e.g., short paper, poster)
Relevance	The paper proposes or evaluates exploitability assessments for individual software vulnerabilities	All exploitability assessments proposed or evaluated in the paper assess exploitability of systems as a whole rather than individual vulnerabilities
		The paper focused on network vulnerabilities or system vulnerabilities, such as configuration vulnerabilities, which are not software specific
		The assessment is not security related
	If the paper proposes or evaluates a method for computing CVSS-based scores (e.g., exclude the paper if the paper only computes or evaluates an overall severity score; i.e., exploitability-related metrics, like the Base Exploitability score, are not computed or evaluated)	The paper discusses exploit generation as a software vulnerability exploitability assessment technique

In Phase 1 and for backward snowballing, we only use the criteria from Table 2. We did not limit our search based on venue or year in our primary inclusion/exclusion criteria. As we note in Section 4.1.3, we performed additional forward snowballing to identify publications from major venues published in 2021 and 2022, since backward snowballing provided no new papers from 2021 or 2022. Hence, two additional criteria were applied for forward snowballing: year (2021–2022) and venue quality. For conferences, we determined venue quality based on the GII-GRIN-SCIE (GGS) [48] and Computing Research and Education Association of Australasia (CORE) [125] conference ranking systems. If a conference was considered tier 1 or tier 2 in GGS and had A– or higher in CORE, we consider it a “Major” venue. Neither the GGS nor CORE systems maintain rankings for journals [48]. For journals, we rely on Impact Factor reported by Journal Citation Reports [19] and H-Index reported by SCImago [108]. We consider journals with Impact Factor and H-Index scores in the top quartile to be “Major” venues. We also filter venues by discipline, as reported by CORE and GGS for conferences and by Journal Citation Reports and SCImago for journals.

4.1.2 Phase 1: Keyword Search with Active Learning.

We used a keyword search to gather results from four database indices in Table 3: ACM Digital Library, IEEE Xplore, Compendex, and Web of Science. We selected these indices because they are widely used and include major relevant publication venues. We do not include the widely used Google Scholar because the results can only be downloaded via a web crawler, which was disallowed by the Google Scholar robots.txt file³ at the time of the search.⁴ The query was constructed based on the research goal. For each index, the query required the terms “exploitability,” “vulnerability,” and “software.” Since the term “assessing” has many synonyms and forms, we designed the query to include at least one of the following terms: “metric,” “measure,” “measurement,” “assess,” or “assessment.” Table 3 provides the database-specific syntax of this query.

Table 3. Databases Examined

Name (URL)	Description	Query Syntax
ACM Digital Library (dl.acm.org)	Database & index from ACM	[All: exploitability] AND [All: vulnerability] AND [All: software] AND [[All: metric] OR [All: measure] OR [All: measurement] OR [All: assess] OR [All: assessment]]
IEEE Xplore (ieeexplore.ieee.org)	Database & index from IEEE	(“All Metadata”:exploitability) AND (“All Metadata”:vulnerability) AND (“All Metadata”:software) AND (“All Metadata”:“metric” OR “All Metadata”:“measure” OR “All Metadata”:“measurement” OR “All Metadata”:“assess” OR “All Metadata”:“assessment”)
Compendex (www.engineeringvillage.com)	Engineering database & index from Elsevier	exploitability AND vulnerability AND software AND (metric OR measure OR measurement OR assess OR assessment)
Web of Science (www.webofscience.com)	Multidisciplinary database & index	(ALL=(exploitability)) AND (ALL=(vulnerability)) AND (ALL=(software)) AND ((ALL=(metric)) OR (ALL=(measure)) OR (ALL=(measurement)) OR (ALL=(assess)) OR (ALL=(assessment)))
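The queries in Table 3 all share one logical shape: three required terms joined by AND, plus a group of assessment synonyms joined by OR. The sketch below shows one way such strings could be generated for two of the indices; the helper names `compendex_query` and `acm_query` are hypothetical and are not taken from the study.

```python
# Terms taken from the query description in Section 4.1.2.
REQUIRED = ["exploitability", "vulnerability", "software"]
ANY_OF = ["metric", "measure", "measurement", "assess", "assessment"]


def compendex_query(required=REQUIRED, any_of=ANY_OF):
    """Build the plain boolean form used for Compendex in Table 3."""
    return " AND ".join(required) + " AND (" + " OR ".join(any_of) + ")"


def acm_query(required=REQUIRED, any_of=ANY_OF):
    """Build the bracketed [All: term] form used for the ACM Digital Library."""
    must = " AND ".join(f"[All: {term}]" for term in required)
    alternatives = " OR ".join(f"[All: {term}]" for term in any_of)
    return f"{must} AND [{alternatives}]"


if __name__ == "__main__":
    print(compendex_query())
    print(acm_query())
```

Keeping the term lists separate from the index-specific syntax makes it easier to check that every database received the same logical query.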

Our original search returned 2,684 results. Three researchers used the FAST2 system [139] to perform an initial screening of the titles and abstracts of the 2,684 results, using the inclusion/exclusion criteria described in Table 2. FAST2 leverages machine learning to reduce the overall workload in the initial screening process. We applied FAST2 following the guidelines described by Yu and Menzies [139]. The FAST2 tool includes a graphical user interface that presents a list of 10 papers to classify as relevant or irrelevant. After the user has classified 10 papers, another 10 papers are selected by the tool. Initially, these papers are chosen randomly. Over time, the papers are presented based on their classification by an underlying Support Vector Machine (SVM) machine learning algorithm [138]. The FAST2 tool includes an estimated recall for the number of papers included at each step of the review. All three reviewers reached the 90% estimated recall target recommended by the authors of FAST2. Any paper selected for inclusion by at least one researcher’s application of FAST2, using the inclusion/exclusion criteria from Table 2, was included at the end of the initial screening. The initial screening reduced the list of papers from 2,684 to 160.
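The screening loop described above can be sketched roughly as follows. This is not the FAST2 implementation: a simple token-count scorer stands in for the SVM, recall estimation is omitted, and the names `screen`, `train`, and `union_of_reviewers` are hypothetical.

```python
import random
from collections import Counter

BATCH_SIZE = 10  # the tool presents papers to the reviewer 10 at a time


def train(labeled):
    """Stand-in for the SVM: +1 per token in relevant papers, -1 otherwise."""
    weights = Counter()
    for tokens, is_relevant in labeled:
        for token in set(tokens):
            weights[token] += 1 if is_relevant else -1
    return weights


def screen(papers, oracle, budget, seed=0):
    """Label papers in batches and return the ids judged relevant.

    papers: dict mapping paper id -> list of title/abstract tokens.
    oracle: function(paper id) -> bool, standing in for the human reviewer.
    budget: stop once at least this many papers have been labeled.
    """
    rng = random.Random(seed)
    unlabeled = sorted(papers)
    rng.shuffle(unlabeled)  # batches are random until a relevant paper is seen
    labels = {}
    while unlabeled and len(labels) < budget:
        if any(labels.values()):
            weights = train([(papers[p], rel) for p, rel in labels.items()])
            # Rank the remaining papers so likely-relevant ones come first.
            unlabeled.sort(key=lambda p: sum(weights[t] for t in papers[p]),
                           reverse=True)
        batch, unlabeled = unlabeled[:BATCH_SIZE], unlabeled[BATCH_SIZE:]
        for paper in batch:
            labels[paper] = oracle(paper)
    return {paper for paper, rel in labels.items() if rel}


def union_of_reviewers(results):
    """A paper is kept if at least one reviewer selected it."""
    return set().union(*results)
```

Taking the union across reviewers favors recall over precision at this stage, which fits a process where the surviving papers are read in full afterwards.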

The 160 papers were then read in detail by researchers, who removed 122 papers due to not meeting our inclusion/exclusion criteria or where the work was not articulated clearly enough to be accurately categorized. At the end of Phase 1, we included 38 papers.

4.1.3 Phase 2: Snowballing.

Once we had extracted our initial set of 38 papers, we performed snowballing by identifying papers that were cited by the initial set of papers (backward snowballing), as specified in the SYMBALS methodology [120]. Backward snowballing found no new papers from 2021 and 2022, the last 2 years of our search. Therefore, we also looked for papers that cited papers from our initial set (forward snowballing), which were published at major computer science venues in 2021 and 2022, as described in Section 4.1.1. We used the Semantic Scholar [110] and OpenCitations [92] APIs for both backward and forward snowballing. We collected 974 entries from the APIs. One researcher then analyzed the list of results to remove duplicate entries. The researcher also applied the Screening criteria from Table 2 (discussed in Section 4.1.1). After removing duplicates and applying the Screening criteria, 601 papers remained.
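The duplicate-removal step described above can be sketched as follows. This is a minimal illustration, not the procedure the authors used: the record fields `doi` and `title` and the matching rule (same DOI, or same title after normalization) are assumptions made for the example.

```python
import re


def normalize_title(title: str) -> str:
    """Lowercase a title and collapse punctuation and whitespace."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def deduplicate(entries):
    """Keep the first occurrence of each paper.

    Two entries are treated as the same paper if they share a DOI or a
    normalized title (an assumed rule; real pipelines may need fuzzier matching).
    """
    seen_dois, seen_titles = set(), set()
    unique = []
    for entry in entries:
        doi = (entry.get("doi") or "").strip().lower()
        title = normalize_title(entry.get("title") or "")
        if (doi and doi in seen_dois) or (title and title in seen_titles):
            continue
        if doi:
            seen_dois.add(doi)
        if title:
            seen_titles.add(title)
        unique.append(entry)
    return unique


# Hypothetical records, standing in for entries returned by two citation APIs.
records = [
    {"title": "Paper A", "doi": "10.1000/A1"},
    {"title": "Paper A (preprint)", "doi": "10.1000/a1"},
    {"title": "Paper  B!", "doi": None},
    {"title": "paper b", "doi": ""},
    {"title": "Paper C", "doi": "10.1000/c3"},
]
print(len(deduplicate(records)))
```

Merging results from more than one API tends to produce records with inconsistent casing and missing identifiers, which is why the sketch checks both keys.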

Of the 601 papers, 3 were extensions of work included in the previous phase of the survey; these were already known to meet the criteria from Table 2 and were therefore included. Two researchers reviewed the other 598 papers using the Relevance criteria from Table 2. The first researcher performed two iterations of the review. The second researcher then performed an initial analysis of 100 papers, and the two researchers compared their results. Based on this discussion, the second researcher reviewed the remaining 498 papers in two iterations, with a brief discussion between iterations to clarify high-level trends. For the final set of categorizations, the observed agreement was 95.2% with a Cohen’s kappa of 0.54 (95% confidence interval for the kappa: ±0.18), indicating moderate agreement [37, 73]. The two researchers then discussed and resolved the remaining disagreements. Ultimately, 31 papers were added in Phase 2, including the 3 papers that were extensions of work included in Phase 1.
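A high observed agreement alongside only moderate kappa is typical when one class dominates, because most of the raw agreement is then expected by chance. The sketch below computes both quantities from a 2×2 table of two reviewers' include/exclude decisions. The four counts are hypothetical: they were chosen to sum to 598 and to land near the reported figures, and they are not the authors' actual data.

```python
def cohens_kappa(both_include, only_first, only_second, both_exclude):
    """Return (observed agreement, Cohen's kappa) for two raters and two labels."""
    total = both_include + only_first + only_second + both_exclude
    observed = (both_include + both_exclude) / total
    # Marginal probability that each rater says "include".
    first_include = (both_include + only_first) / total
    second_include = (both_include + only_second) / total
    # Agreement expected if the two raters decided independently.
    expected = (first_include * second_include
                + (1 - first_include) * (1 - second_include))
    return observed, (observed - expected) / (1 - expected)


# Hypothetical counts (assumption): 19 papers both reviewers include,
# 14 + 15 disagreements, 550 papers both reviewers exclude.
observed, kappa = cohens_kappa(19, 14, 15, 550)
print(f"observed agreement = {observed:.3f}, kappa = {kappa:.2f}")
```

With these counts roughly 89% agreement would be expected by chance alone, which is why kappa ends up far below the raw agreement.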

4.2 Organization and Categorization

Once we had identified the set of papers, we grouped papers that were part of the same study and should be considered together in a practical application of the exploitability assessment. We consider papers to be part of the same study if (1) the papers were from the same authors and included some of the same analysis, and (2) later papers built on the prior model for exploitability. We also reviewed the studies to determine whether prior work existed that a reader would need in order to understand the proposed assessment technique but that would not itself have been classified as “exploitability” related. For example, earlier work built on by Plate et al. [95] and Ponta et al. [96–98] did not refer to the tools in terms of “exploitability.” Searching for additional papers in each study added 7 papers to the survey. Our survey ultimately includes 76 papers, 72 of which are grouped into 59 studies, whereas 4 papers propose or present the industry standards CVSS and the Exploit Prediction Scoring System (EPSS) and were written by the authors of the standards.

After the initial keyword search, an initial classification of papers was performed by two researchers and iterated upon further by the first author. We used a keywording-based approach [94] to determine the categories and characteristics useful to compare and contrast the different vulnerability assessment methods. The first author identified a set of keywords that applied to a wide range of papers, and the first and third authors then applied these keywords to a random subset of 15 papers and identified new keywords that were frequently occurring. The keywords were then grouped into categories and applied to the initial set of 38 papers. The keywords and categories were further iterated on and expanded during and after Phase 2.

Our categories aligned closely with three categories for “Metrics for Measuring Severity” of software vulnerabilities from Pendleton et al. [91]: (Manual) CVSS-based, Deterministic, and Probabilistic assessment systems. The vulnerability-related subset of the taxonomy from Pendleton et al. [91] is shown in Figure 3, with the three categories we use in our survey highlighted in red.

Fig. 3.

Based on our initial keywording and our comparison with Pendleton et al. [91], we identified three primary categories of studies based on the techniques used in each study: (Manual) CVSS-based, Deterministic, and Probabilistic assessment systems. We describe our categories in Section 5.

5 Categories of Assessment Methods in Each Study

Our final set of 59 studies and two industry standards, encompassing 76 papers, is shown in Table 4. Table 4 gives an ID for each study in the first column. Academic studies begin with an “S,” whereas the two industry standards are referred to by their acronyms: CVSS and EPSS. The second column indicates the bibliography entries for the paper(s) that were part of the study. The third column shows the year(s) the papers were published. The remaining columns indicate the category(ies) of exploitability assessments discussed in each study. A study, or even an individual publication, may cover multiple assessment methods. For example, CVSS scores from the NVD are frequently used as a baseline in evaluations of techniques in other categories, such as the learning model (LM) in S03 (see [14]). We use “M” to indicate that the assessment method is the main focus of the study and “C” to indicate that a method is primarily used for comparison.

Table 4.

ID	Bib.	Year(s)	(Manual) CVSS	Deterministic	Probabilistic (LM)	Probabilistic (O)
S01	[123]	2008		M
S02	[43]	2009				M
S03	[14]	2010	C		M
S04	[80]	2011	M
S05	[45]	2011	M
S06	[11, 12, 16, 107]	2011, 2014		M
S07	[53, 54, 55]	2012, 2018, 2019		M
S08	[5, 6]	2012, 2014	M
S09	[58, 59]	2012, 2014, 2017, 2020		M
S10	[115]	2013	M
S11	[57]	2013	M
S12	[81]	2014	M
S13	[135, 137]	2014, 2016		M
S14	[130]	2015			M
S15	[56]	2015	M
S16	[44]	2015		M
S17	[136]	2015	M
S18	[134]	2016			M
S19	[104]	2015			M
S20	[35]	2015			M
S21	[60, 95, 96, 97]	2015, 2018, 2020, 2021		M
S22	[111]	2016				M
S23	[117]	2016	M
S24	[99, 100]	2016, 2017				M
S25	[1, 2]	2016, 2018		M
S26	[7, 8]	2017, 2018			M
S27	[31]	2017	C	M
S28	[131]	2017			M
S29	[46]	2017		M
S30	[119]	2017			M
S31	[102]	2017		M
S32	[15]	2017			M
S33	[51]	2017		M
S34	[3]	2018	M
S35	[122, 124, 142]	2018, 2019, 2020		M
S36	[17, 127, 128]	2018, 2019		M
S37	[47]	2018		M
S38	[114]	2018			M
S39	[66]	2018	M
S40	[10]	2019			M
S41	[50]	2019			M
S42	[77]	2019			M
S43	[144]	2019	C	M
S44	[4]	2020	M
S45	[132]	2020			M
S46	[29]	2020		M
S47	[103]	2020	M
S48	[36]	2020			M
S49	[133]	2020		M
S50	[141]	2020			M
S51	[63]	2021			M
S52	[75, 86]	2021, 2022			M
S53	[78]	2021		M
S54	[20]	2021		M
S55	[116]	2022	C		M
S56	[64]	2022	C	M
S57	[140]	2022		M
S58	[67]	2022		M
S59	[126]	2022			M
CVSS	[25, 82, 83, 106]	2006, 2009, 2019	M
EPSS	[40, 61, 62]	2020, 2021, 2022			M

Table 4. Exploitability Assessment Methods Proposed and/or Evaluated in Each Study

M indicates that assessments from this category are the main focus of the study; C indicates assessments from this category are compared against the main category.

As discussed in Section 4.2, we classified the methods examined into three categories: Manual CVSS based, Deterministic, and Probabilistic methods. We then further sub-divided Probabilistic methods into methods based on LMs, such as SVM and neural networks, and Other (O) Probabilistic models.

In the context of our survey, assessment methods classified in the Manual CVSS based high-level category are based on the original manual/expertise-based assessment described in the CVSS specification [26] for determining the low-level CVSS sub-metrics (e.g., AV, AC, PR, and UI). Assessments based on the original CVSS specification are not purely Deterministic, since different experts may evaluate the score slightly differently, nor are they Probabilistic. Where automated Deterministic or Probabilistic methods are used to compute the CVSS sub-metrics, we classify the method as Deterministic or Probabilistic. The only assessments we identified in the literature that use a similar, manual/expertise-based approach rely on the CVSS assessment. We discuss Manual CVSS based assessments in Section 6. Studies in the Manual CVSS based category include S34 (see [3]), which evaluates what information is useful for analysts when determining metrics of the CVSS v3 Base score, including the exploitability-related sub-metrics AV, AC, UI, and PR. Similarly, S04 (see [80]) and S10 (see [115]) critique the CVSS scores from the NVD as being disproportionately high, and propose alternative equations to use with the manually derived sub-metrics.

Deterministic assessments will always provide the same output for a particular input [87]. Studies in this category typically examine rule-based systems such as S06 ([11, 12, 16, 107]), which proposes and tests an automated tool for generating an exploit for a vulnerability. In S06, the exploit generation (EG) process is defined in terms of finding inputs that meet a particular “exploitability property” using a set of pre-defined logic (i.e., rules). Section 7 discusses Deterministic assessments.

Probabilistic assessments are assessment methods that rely on a statistical, Probabilistic analysis of a set of vulnerabilities. In our study, we sub-divided Probabilistic methods into methods based on LMs such as SVM and neural networks, and Other (O) Probabilistic models. Examples of LM studies include S32 (see [15]), which examines the effectiveness of machine learning models for predicting exploit likelihood by comparing models built using different sets of features. LMs are discussed in Section 8. Other Probabilistic models include S02 (see [43]), in which the authors propose using Probabilistic models to estimate the Temporal and Environmental CVSS metrics. Other Probabilistic models are covered in Section 9.

6 Manual CVSS and CVSS-based Metrics

The CVSS SIG provides a user guide including decision trees for each CVSS Base score metric (AV, AC, PR, UI, C, I, A, and S) [26]. CVSS, as specified, is a manual, expertise-based assessment of vulnerabilities [4, 14]. As we will see in Sections 7 and 8, rule-based and machine learning models have been proposed to automate the CVSS scoring process [50, 77, 144]. However, these automated assessments are not in the original CVSS specification.

The NVD [89] provides a CVSS Base score for all vulnerabilities in the CVE list [21], and there are few vulnerability datasets available to researchers that are not connected to the CVE list [15]. The NVD-provided Base score includes the overall score and the calculation for each metric of the Base score (e.g., AV, AC, PR, and UI) and, subsequently, the calculated Exploitability subscore [89]. Maintainers of proprietary datasets such as IBM X-Force may provide their calculation of CVSS scores [116]. These assessments are at least partly expertise based, following the original CVSS specification.
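To make the step from the manually assessed metrics to the calculated Exploitability subscore concrete, the sketch below applies the CVSS v3.1 equation, Exploitability = 8.22 × AV × AC × PR × UI, using the numeric weights published in the v3.1 specification. It covers v3.1 only; CVSS v2, which several of the surveyed studies use, has a different equation and different weights. Rounding to one decimal place is an assumption about how the value is displayed.

```python
# Numeric weights for the Exploitability metrics in CVSS v3.1.
AV = {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.20}   # Attack Vector
AC = {"L": 0.77, "H": 0.44}                          # Attack Complexity
UI = {"N": 0.85, "R": 0.62}                          # User Interaction
# Privileges Required depends on whether Scope (S) is changed.
PR_SCOPE_UNCHANGED = {"N": 0.85, "L": 0.62, "H": 0.27}
PR_SCOPE_CHANGED = {"N": 0.85, "L": 0.68, "H": 0.50}


def exploitability_subscore(av, ac, pr, ui, scope_changed=False):
    """Compute the CVSS v3.1 Exploitability subscore from metric letters."""
    pr_weights = PR_SCOPE_CHANGED if scope_changed else PR_SCOPE_UNCHANGED
    raw = 8.22 * AV[av] * AC[ac] * pr_weights[pr] * UI[ui]
    return round(raw, 1)  # one-decimal display is an assumption


# Network-reachable, low complexity, no privileges, no user interaction.
print(exploitability_subscore("N", "L", "N", "N"))
```

The equation itself is fully deterministic; the expertise-based part of the assessment is the choice of the metric values passed in.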

We identified three groups of studies examining CVSS scores. The first group, discussed in Section 6.1, focuses on understanding how CVSS scores are manually assessed via user studies of the expertise-based CVSS assessment method. In Section 6.2, we discuss evaluations and criticisms of CVSS scores output by the CVSS assessment process, which primarily focus on scores from the NVD but, in some cases, examine other sources such as IBM X-Force [66, 116]. Finally, in Section 6.3, we discuss changes that have been proposed based on some of the criticisms in Section 6.2.

6.1 How CVSS Is Manually Assessed

In S34, Allodi et al. [3] evaluate what information is helpful in determining the CVSS Base score with a user study of CVSS v3. Using the CVSS scores provided in the CVSS v3 example document [24] as ground truth, the researchers provided study participants with a tutorial on determining CVSS scores. The control group was then asked to determine CVSS scores based on the original vulnerability descriptions from NVD alone. The treatment group was given vulnerability descriptions with additional information from the CVSS example document [24]. The authors examined four categories of information: information about the vulnerable asset (i.e., information on the type of system directly affected by the vulnerability), attack procedures that could be used against the vulnerability, vulnerability type information characterizing the technical root cause and result of exploiting the vulnerability, and known threats (i.e. whether there is evidence that malicious actors have exploited the vulnerability on production systems). The researchers assessed how the addition and removal of each category of information impacted participants’ error rates. They found that information on assets reduced the error rate for the A (Availability) metric of the Impact group, but asset information had little other effect. Information on the attack reduced error rates for the AV and AC metrics in the Exploitability group and the C (Confidentiality) metric of the Impact group. Information on vulnerability type was related to a reduced error rate for AC, UI, and PR metrics in the Exploitability group. However, information about known threats was related to an increased error rate for AV and AC in the Exploitability group and C and A in the Impact group. Individual differences in performance between participants were observed—for example, information on vulnerability type reduced error rates more for individuals with more security expertise [3].

In S23 (see [117]), the authors perform text mining on descriptions of CVEs in the NVD and analyze the correlation between terms and the values of the different CVSS v2 metrics (AV, AC, AU, C, I, and A) using Spearman’s rho. Their results somewhat support the analysis in S34. In S23, the authors found that the term “attack” had a moderate positive correlation (0.30) with AV and a moderate negative correlation (–0.35) with AU, whereas the correlation with AC was weak (0.02). However, terms associated with a particular type of attack, such as “cross-site-script” and “script” had a higher correlation with AC (0.42 and 0.36, respectively). Other terms that had a moderate or strong correlation with AV included “remote,” which had a positive correlation (0.53), as well as “local” and “user,” which had negative correlations (–0.70 and –0.49, respectively). Other terms that had a moderate correlation with AC included “html” and “web,” which both had positive correlations (0.40 and 0.36, respectively). Additionally, the terms “authent” and “user” had moderate to high positive correlations with AU (0.65 and 0.43, respectively). Correlation is not causation. We cannot be sure, based on S23 alone, how much these terms contribute to an individual’s ability to score the metric. However, when combined with work such as S34, we see a fuller picture.
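The kind of correlation reported in S23 can be illustrated with a small Spearman's rho computation between a binary "term present" indicator and an ordinal metric value. This is a sketch under stated assumptions, not the S23 pipeline: the six data points and the ordinal encoding of AV (0 = local, 1 = adjacent network, 2 = network) are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman's rho: the Pearson correlation of the tie-averaged ranks.

    Undefined (division by zero) if either input is constant.
    """
    rx, ry = average_ranks(xs), average_ranks(ys)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical data: does the description contain the term "remote",
# and the (assumed) ordinal AV value for the same vulnerability.
has_remote = [1, 1, 1, 0, 0, 0]
av_level = [2, 2, 1, 0, 0, 1]
print(round(spearman_rho(has_remote, av_level), 2))
```

Because both variables take few distinct values, ties are common, so the ranks are averaged rather than computed with the shortcut formula that assumes no ties.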

In S44, Allodi et al. [4] further examine how the actual expertise of the individuals applying CVSS influences the results. The authors use CVSS v3 for their experiment. They divided the participants into three groups based on expertise. The first group was graduate students in computer science with no security expertise. The second group was graduate students in computer science who had taken security courses but did not have industry expertise, and the third group was security professionals. The assessment was based on 30 examples randomly selected from 100 vulnerabilities used by the CVSS SIG when developing the standard [4]. The CVSS scores had therefore been determined previously by the CVSS SIG and could be used as “ground truth.” The authors found that computer science students who had taken security courses performed significantly better than students who had not taken security courses for the AC metric and the Impact metrics (C, I, and A). Computer science students who had taken security courses also had “borderline” better performance with the AV and PR metrics. Allodi et al. found no statistically significant difference between the two groups of students for the UI metric. The authors found that security professionals were less likely to err on the UI metric compared with both groups of students. On the AV metric, there was a borderline difference between professionals and students who had taken security courses. However, the authors found no statistically significant difference in the performance of the students who had taken security courses and the security professionals across the AC, PR, C, I, and A metrics. Their results suggest that security knowledge is helpful in producing CVSS assessments. However, students with security experience were relatively competent, producing scores comparable to those of professionals for four of the six metrics.

In S11 (see [57]), the authors propose combining multiple expert analyses of CVSS scores via the Delphi method. The Delphi method is a consensus-building technique frequently used in areas such as systems engineering [105]. The authors of S11 demonstrate the feasibility of their approach through an example with four CVEs. The authors also provide the questionnaire that would be used in their method.

6.2 Evaluations and Criticisms

We found three primary groups of evaluations of CVSS scores collected from industry organizations such as the NVD. The first group, which we discuss in Section 6.2.1, looks at the reliability of CVSS, that is, how consistently a particular vulnerability is scored using CVSS. The second group, discussed in Section 6.2.2, examines the relationship between CVSS and other exploit-related indicators, such as the presence of an exploit in ExploitDB. Finally, in Section 6.2.3, we discuss four studies that examine the distribution of CVSS scores in the NVD, for example, the percentage of scores that are classified with “Low,” “Medium,” and “High” severity or exploitability.

6.2.1 Reliability of CVSS Scores.

S15 (see [56]) and S39 (see [66]) looked at the reliability of the CVSS scores provided in the NVD. In S15, the focus of the reliability analysis is on the overall severity score, with additional analysis of information that should be added or removed as it relates to each of the sub-scores (AV, AC, and AU). We discuss the proposed changes to the underlying CVSS system from S15 in Section 6.3.2. However, the reliability analysis of S15 may be used to triangulate the analysis from S39, which examines exploitability-specific elements of CVSS in its reliability analysis. Both studies focus on CVSS v2. Neither study was able to statistically disprove the reliability of the CVSS scores from the NVD.

In S15 (see [56]), the authors perform a survey of 304 security experts from industry and academia to evaluate the overall reliability of the CVSS v2 severity scores in the NVD, as well as to evaluate the sub-metrics and structure of the CVSS framework. To understand the accuracy of the CVSS scores in the CVE list, each respondent was asked to provide a severity score for 10 vulnerabilities from the NVD. Of the 10 vulnerabilities presented to each respondent, 3 vulnerabilities were the same for all respondents to enable the authors to estimate consensus between experts, whereas 7 vulnerabilities were selected randomly from vulnerabilities in the CVE list. A total of 2,131 unique vulnerabilities were assessed across the 304 experts. The experts were provided with the description of each vulnerability, as well as the Exploitability values for AV, AC, and AU (with a brief explanation of the attribute) and Impact values for C, I, and A. The survey did not include the equation for calculating the Exploitability, Impact, or Base severity scores. Instead, the survey requested that the experts provide their own value between 1 and 10. The authors note that 38% of survey answers differed from the score provided by NVD, and claim “This is certainly a higher figure than many users of the scoring system would be comfortable with” [56]. However, it is not clear if this discrepancy is specific to the CVSS, to the scores in the NVD, or to expertise-based systems generally. The authors found no statistically significant difference between the scores provided by experts and the overall severity score from the NVD.

In S39, the authors expand on the work in S15 by examining CVSS scores from five databases: NVD, the proprietary IBM X-Force Exchange database, OSVDB, the Vulnerability Notes database from the CERT group at Carnegie Mellon (CERT-VN), and the alert database provided by Cisco for its products (Cisco). For scores from the OSVDB, CVSS scores credited to a variety of sources are included. The authors exclude scores credited to the NVD to reduce potential bias. The authors acknowledge that OSVDB, CERT-VN, and Cisco all indicate that their scores may be influenced by information from other sources, such as the NVD. The authors point to the differences between the scores in each database as evidence of independent scoring. The authors use Bayesian analysis to develop a ground truth for each CVSS sub-metric (AV, AC, AU, C, I, and A) based on the CVSS scores from the five databases. Based on this ground truth, the authors found the NVD to be the most accurate across all metrics (93% on average) and for the Exploitability metrics specifically (AV: 99%, AC: 88%, AU: 99%). The authors noted that the greatest disagreement occurred in the Access Complexity (AC) exploitability score, particularly for lower AC scores, and that Exploitability sub-metrics generally had higher disagreement than Impact sub-metrics.

6.2.2 Exploitability Scores from the NVD Compared to Publicly Available Exploit-Based Datasets.

Statistical analyses of the CVSS scores from the NVD in relation to other exploit information in S08, S03, S17, and S55 have suggested that the CVSS Exploitability score is not strongly connected with the existence of exploits in exploit databases or the likelihood of exploitation. However, the statistical analysis in S47 (see [103]) suggests that the relationship between exploits in exploit databases and the individual CVSS metrics (AV, AC, AU, C, I, A) may be stronger, particularly when factors such as the company maintaining the software (e.g., Microsoft) are controlled for.

In S08, Allodi and Massacci [5, 6] examine the relationship between CVSS v2 scores, whether a vulnerability is associated with an exploit in ExploitDB, whether a vulnerability is associated with an exploit from a commercial exploit kit (EKITS), and whether a vulnerability is associated with an exploit signature in Symantec Intrusion Detection and Anti-Malware products. The exploit signatures from Symantec (SYM) were used as the “ground truth.” The commercial exploit kit information in S08 includes automated (i.e., code script) exploits extracted from malicious websites. In S08, Allodi and Massacci [5, 6] found that the CVSS v2 scores in NVD showed little variability and that only a few of the possible values for the different sub-scores were used. The authors note, “The CVSS Base score alone is a poor risk factor from a statistical perspective.” However, the authors also found that when both CVSS scores and other factors, such as an exploit in ExploitDB or EKITS, were considered, the relationship with the SYM data (i.e., risk of exploitation) improved.

Similarly, in S03, Bozorgi et al. [14] use the CVSS Exploitability scores from the NVD as the control against which they compare their own AUTO-LM system, which we will discuss in Section 8. The labels for the AUTO-LM system were the exploit availability labels from OSVDB [14]. The authors compare the distribution of the CVSS Exploitability provided by the NVD against the signed distance to the maximum margin hyperplane separating positive and negative examples in their SVM model (i.e., their LM). The authors illustrate with histograms how their score produces a clearer distinction between vulnerabilities that the OSVDB indicates have an exploit compared to vulnerabilities that do not have an exploit.
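The score that S03 compares against CVSS Exploitability, the signed distance to the separating hyperplane, has a simple closed form for a linear SVM: (w · x + b) / ‖w‖. The sketch below evaluates it for a given weight vector. The weights, bias, and feature vectors are hypothetical and are not taken from the model in S03; training the SVM is out of scope here.

```python
import math


def signed_distance(weights, bias, features):
    """Signed distance from a feature vector to the hyperplane w.x + b = 0.

    Positive values fall on the side of the positive class (here, assumed to
    be "has an exploit"); larger magnitudes are farther from the boundary.
    """
    score = sum(w * x for w, x in zip(weights, features)) + bias
    norm = math.sqrt(sum(w * w for w in weights))
    return score / norm


# Hypothetical linear model with two features (assumption for illustration).
weights, bias = [3.0, 4.0], -5.0
print(signed_distance(weights, bias, [3.0, 4.0]))  # positive side
print(signed_distance(weights, bias, [0.0, 0.0]))  # negative side
```

Unlike a bounded score with few distinct values, this quantity is continuous, which is one reason a histogram of it can separate the two classes more visibly.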

Converting CVSS scores from the NVD into a binary exploitability score has also yielded low precision and recall in relation to the existence of exploits and exploit signatures from public databases. In S17, Younis and Malaiya [136] compare CVSS v2 against the Microsoft Rating System (MSRS) using exploits from ExploitDB as ground truth. The MSRS was a predecessor to the current Microsoft Exploitability Index [84] and Microsoft severity score [85], which are provided by Microsoft when disclosing vulnerabilities in their products to help users prioritize security patches. In S17, the authors use the median CVSS score of 8.6 as the cutoff for the confusion matrix of exploitable and not-exploitable vulnerabilities. For the MSRS, the authors used the median value of 1 as the threshold for whether a vulnerability was exploitable. In other words, vulnerabilities with an MSRS of 1 were considered exploitable, whereas vulnerabilities with a rating of 2 or 3 were considered not exploitable. Using this approach, the authors determined that CVSS had a precision of 7% and recall of 97% for Internet Explorer, and a precision of 20% and recall of 65% for Windows 7. For Internet Explorer, this threshold of the MSRS resulted in a precision of 7% and a recall of 85%. For Windows 7, the threshold for the MSRS resulted in a precision of 15% and a recall of 83%. The authors argue that the low precision and recall indicate that CVSS and MSRS are not good indicators of exploitability, and new metrics are needed.

Similarly, a preliminary comparison of CVSS v3 with proprietary measures in S55 (see [116]) examined the precision and recall of the Base Exploitability score, setting the threshold at each possible value from 0 to 3. In other words, they examined the precision and recall if a vulnerability with a Base Exploitability score of 0 or higher was considered “exploitable,” then they examined the precision and recall if a vulnerability had a Base Exploitability score of 1 or higher, and so on. For the ground truth, the authors use a combined dataset of exploit signatures from Symantec products; information extracted from Bugtraq, Tenable, Skybox, and AlienVault OTX vulnerability databases; and exploits extracted from the Contagio dataset, a publicly available list of exploit kits and malicious websites used in academic studies [5, 6, 71, 74, 93, 143]. Using a Base Exploitability threshold of 0 or 1 had approximately 85% recall,⁵ whereas using a threshold of 3 resulted in less than 20% recall. Precision values were below 20% for all thresholds. The authors of S17 and S55 [116] both use their evaluation of CVSS to argue that better exploitability measures are needed.
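
The threshold sweep described above can be sketched in a few lines. This is a minimal illustration, not the procedure used in S17 or S55: the scores and ground-truth labels are hypothetical toy data, and the function names `precision_recall` and `sweep` are assumptions introduced here.

```python
from typing import Dict, List, Tuple


def precision_recall(scores: List[float], exploited: List[bool],
                     threshold: float) -> Tuple[float, float]:
    """Treat score >= threshold as 'predicted exploitable' and compare to ground truth."""
    pairs = list(zip(scores, exploited))
    tp = sum(1 for s, e in pairs if s >= threshold and e)
    fp = sum(1 for s, e in pairs if s >= threshold and not e)
    fn = sum(1 for s, e in pairs if s < threshold and e)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def sweep(scores: List[float], exploited: List[bool],
          thresholds: List[float]) -> Dict[float, Tuple[float, float]]:
    """Compute (precision, recall) at each candidate threshold."""
    return {t: precision_recall(scores, exploited, t) for t in thresholds}


# Hypothetical toy data: Base Exploitability scores and whether an exploit was observed.
scores = [3.9, 3.9, 2.8, 2.8, 1.8, 1.8, 0.9, 3.9, 2.2, 1.2]
exploited = [True, False, False, False, True, False, False, False, False, False]

results = sweep(scores, exploited, [0, 1, 2, 3])
for t, (p, r) in results.items():
    print(f"threshold >= {t}: precision={p:.2f}, recall={r:.2f}")
```

On data like this, raising the threshold trades recall for precision, which is the pattern the studies report at a much larger scale.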

In contrast to the analysis examining the CVSS Base Exploitability score as a whole, statistical analysis by Roumani and Nwankpa [103] in S47 found that the CVSS v2 exploitability (AV, AC, AU) and Impact (C, I, A) sub-metrics from the Base score all had statistically significant relationships with the “hazard” that an exploit would be made available in ExploitDB. The authors controlled for the affected product type, number of affected software versions, number of past exploits, year of disclosure, size of the software vendor, and R&D budget of the software vendor. This suggests that the relationship between CVSS exploitability-related metrics and exploit availability may be more nuanced and requires further investigation.

6.2.3 Distribution of CVSS Scores in the NVD.

Another common critique of CVSS relates to the overall distribution of exploitability and severity values of the CVSS scores provided by the NVD. S04 (see [80]), S10 (see [115]), and S12 all critique CVSS scores in the NVD for being disproportionately “High,” which they attribute to problems with the CVSS calculation method. We discuss their proposed changes to the CVSS calculation method further in Section 6.3.1. However, in this section, we examine their criticisms and compare their analysis with the work by Gallon [45] in S05, which is less critical of the system and found a different distributional imbalance.

In S04, the authors argue, “In our opinion, the number of vulnerabilities with ‘Medium’ severity ranking should be the largest and the number of vulnerabilities with ‘High’ or ‘Low’ severity ranking [should be] much smaller” [80]. In S10, the authors argue that severity scores should have a more diverse range of values and should be more evenly distributed [115]. The authors of S12 also call for increased diversity, “The CVSS empirical values given by CVSS-SIG cannot distinguish software vulnerabilities that have identical scores but different severities” [81].

The vulnerabilities examined in S04, S10, and S12 contain considerable overlap and have similar distributions. In S04, the authors evaluate 34,093 CVE vulnerabilities published from 1999 to 2008, of which 6.8% had a “Low” CVSS severity, 47.8% had a “Medium” CVSS severity, and 45.5% had a “High” CVSS severity. In S10, the authors evaluate CVSS scores from 9,455 vulnerabilities in the NVD published between November 1, 2010, and October 31, 2012. In S10, 7.9% of the vulnerabilities had “Low” severity, 53.2% of vulnerabilities had “Medium” severity, and 38.0% of the vulnerabilities had “High” severity. Additionally, in S10, the authors analyze the distribution of all metrics of the CVSS score, including AV, AC, and AU, showing that greater than 80% of vulnerabilities in their sample of 9,455 vulnerabilities from the NVD had an AV of “Network” (the highest value) and required no authentication (AU). However, AC was more evenly distributed. In S12, the authors examine 54,432 vulnerabilities published between 2002 and 2012 [81]. In S12, Luo et al. [81] point to correlations between the sub-metrics of the Base CVSS v2 score (AV, AC, AU, C, I, and A) from the NVD, as determined by a chi-squared test, as an indicator that vulnerabilities are not being scored independently.
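
A chi-squared test of independence between two sub-metrics, of the kind S12 relies on, can be sketched as follows. The contingency counts are hypothetical, the helper names are assumptions, and the code computes only the test statistic and degrees of freedom in pure Python; it is not the analysis pipeline of Luo et al.

```python
from typing import List


def chi_squared_statistic(table: List[List[int]]) -> float:
    """Pearson chi-squared statistic for a contingency table.

    Assumes every row total and column total is non-zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat


def degrees_of_freedom(table: List[List[int]]) -> int:
    """(rows - 1) * (columns - 1) for a test of independence."""
    return (len(table) - 1) * (len(table[0]) - 1)


# Hypothetical counts. Rows: AV (Local, Adjacent Network, Network).
# Columns: AC (High, Medium, Low).
observed = [
    [30, 50, 20],
    [10, 20, 20],
    [60, 330, 460],
]

stat = chi_squared_statistic(observed)
dof = degrees_of_freedom(observed)
CRITICAL_VALUE_DOF4_ALPHA_05 = 9.488  # standard chi-squared table value
print(f"chi2={stat:.1f}, dof={dof}, "
      f"reject independence: {stat > CRITICAL_VALUE_DOF4_ALPHA_05}")
```

A statistic above the critical value indicates that the two sub-metrics are associated in the sample, which S12 interprets as a sign that they are not being scored independently.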

The findings in S04, S10, and S12 contrast with the findings by Gallon [45] in S05. Gallon examines 40,026 vulnerabilities in the NVD published between 1999 and 2009. The study found a distribution with 45% of vulnerabilities having “Low” severity, 46% of vulnerabilities having “Medium” severity, and only 9% of vulnerabilities having “High” severity. The study also found that the diversity in the combinations of sub-metric values was relatively low, particularly for the Impact sub-metrics (C, I, A). Unlike the other studies, S05 does not inherently consider this a defect of the scores in the NVD or of the CVSS itself. The study then examines how the Environmental Impact metrics included in the CVSS framework may alter the scores, for example to improve diversity, finding that the Environmental Impact scores are more likely to decrease the overall severity score.
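
The severity distributions compared in these studies come from binning scores into qualitative bands. A minimal sketch follows, assuming the NVD's CVSS v2 ranges (Low 0.0 to 3.9, Medium 4.0 to 6.9, High 7.0 to 10.0), which are taken from NVD documentation rather than from this section. The scores and function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def severity_band(score: float) -> str:
    """Map a CVSS v2 score to the NVD qualitative band (assumed ranges)."""
    if score < 4.0:
        return "Low"
    if score < 7.0:
        return "Medium"
    return "High"


def severity_distribution(scores: List[float]) -> Dict[str, float]:
    """Percentage of scores in each band. Assumes a non-empty list."""
    counts = Counter(severity_band(s) for s in scores)
    return {band: 100.0 * counts[band] / len(scores)
            for band in ("Low", "Medium", "High")}


# Hypothetical sample of CVSS v2 base scores.
sample = [2.1, 4.3, 5.0, 6.8, 7.5, 9.3, 10.0, 5.0]
distribution = severity_distribution(sample)
print(distribution)
```

Running the same binning over two different samples of NVD entries is enough to reproduce the kind of disagreement seen between S05 and the other three studies.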

6.3 Proposed Changes/Improvements

Based on analyses and criticisms discussed in Section 6.2, at least four studies suggest potential changes that could be made to the NVD. We divide the proposed changes into two groups. First, in Section 6.3.1, we discuss two studies, S04 (see [80]) and S10 (see [115]), which propose altering the equations used to calculate CVSS, including the equation used to calculate the Base Exploitability score, such as by altering how each metric (e.g., AV, AC) is weighted. They do not propose altering the underlying metrics. In Section 6.3.2, we discuss two studies, S12 (see [81]) and S15 (see [56]), which propose adding or altering the metrics themselves. In S04, S10, and S12, which are based on criticisms of the distribution of CVSS scores in the NVD as discussed in Section 6.2.3, the authors evaluate their changes by illustrating how they produce a distribution of severity scores different from the distribution of scores in the NVD. In S15 (see [56]), the proposed changes are based on survey responses, which are used to suggest that the existing categories of CVSS are inadequate.

6.3.1 Equation Changes.

As discussed in Section 6.2.3, S04 (see [80]), S10 (see [115]), and S12 all critique CVSS scores in the NVD for being disproportionately “High.” S04 and S10 suggest alternative ways of weighting and calculating CVSS Base scores and sub-scores, including the Exploitability score. However, they do not indicate that the underlying determination of AV, AC, and AU should be performed differently from the original CVSS score. All of these studies look at CVSS v2, where the Base Exploitability score was calculated as $20 \times AV \times AC \times AU$, as discussed in Section 3.2.
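
As a concrete illustration of the CVSS v2 Base Exploitability equation, the sketch below evaluates it for two metric combinations. The numeric weights are those published in the CVSS v2 specification and are not stated in this section. The function name and the `coefficient` parameter are assumptions introduced for illustration.

```python
# Metric weights from the CVSS v2 specification (not given in this section).
AV = {"Local": 0.395, "Adjacent Network": 0.646, "Network": 1.0}
AC = {"High": 0.35, "Medium": 0.61, "Low": 0.71}
AU = {"Multiple": 0.45, "Single": 0.56, "None": 0.704}


def exploitability_v2(av: str, ac: str, au: str, coefficient: float = 20.0) -> float:
    """CVSS v2 Base Exploitability: coefficient * AV * AC * AU."""
    return coefficient * AV[av] * AC[ac] * AU[au]


easiest = exploitability_v2("Network", "Low", "None")
hardest = exploitability_v2("Local", "High", "Multiple")
print(f"Network/Low/None:    {easiest:.4f}")
print(f"Local/High/Multiple: {hardest:.4f}")
```

Because the sub-score is a product of three small lookup tables, only 27 distinct values are possible, which is relevant to the low score diversity that several surveyed studies criticize.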

In S04 (see [80]), the authors scale the Exploitability score down by a factor of 10 (to $2 \times AV \times AC \times AU$) so that the Base Exploitability score has a range of 0–1 instead of 0–10, while making additional changes to the Impact score. The authors evaluate their overall proposed score against the overall CVSS severity score using AV, AC, and AU scores provided by the NVD. The authors found that their changes produce a greater percentage of “Medium” severity vulnerabilities when compared to the original CVSS calculation. The authors argue that a higher percentage of medium vulnerabilities is preferable in a severity score. In S10 (see [115]), the authors change the Exploitability score to $6 \times AV \times AC \times AU$. The combined changes to the Exploitability, Impact, and overall severity scores proposed in S10 result in a more even distribution of the severity score across the “Low,” “Medium,” and “High” levels, compared to CVSS scores in the NVD.
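
The effect of the coefficient on the score range can be checked with a short worked calculation. It assumes the maximum CVSS v2 weights from the CVSS v2 specification (AV = 1.0 for Network, AC = 0.71 for Low, AU = 0.704 for None). It also assumes, for illustration only, that S04 and S10 leave those weights unchanged; the subscripts are labels introduced here.

```latex
\begin{align*}
E^{\max}_{\text{v2}}  &= 20 \times 1.0 \times 0.71 \times 0.704 = 9.9968  \approx 10 \\
E^{\max}_{\text{S04}} &= 2  \times 1.0 \times 0.71 \times 0.704 = 0.99968 \approx 1  \\
E^{\max}_{\text{S10}} &= 6  \times 1.0 \times 0.71 \times 0.704 = 2.99904 \approx 3
\end{align*}
```

Under these assumptions the coefficient only rescales the sub-score, so any change in the severity distribution comes from how the rescaled Exploitability is combined with the revised Impact and overall equations.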

6.3.2 Metric Changes.

Based on their analysis of the distribution of CVSS scores discussed in Section 6.2.3, in S12, Luo et al. [81] propose a metric calculated based on the constituent CVSS sub-scores (AV, AC, AU, C, I, A) from the NVD, combined with temporal factors, such as the time since the vulnerability was disclosed. Unlike CVSS scores and the metrics in S04 and S10, which can be independently calculated for each CVSS score, the metric in S12 is calculated relative to other vulnerabilities within the same dataset. The authors demonstrate that their metric produces a different distribution of values than CVSS when applied to 54,432 vulnerabilities from the NVD. Luo et al. were not alone in their concerns, and many of the changes now incorporated into CVSS v3 were due to critiques similar to those of Luo et al. For example, the Scope variable was added to capture whether exploiting the vulnerability could have impacts outside the vulnerable component [23].

As discussed in Section 6.2.1, in S15 (see [56]), the authors perform a survey of 304 security experts from industry and academia to (1) evaluate the overall accuracy of the CVSS severity scores in the NVD and (2) assess whether the underlying metrics of the Base score in CVSS v2 (AV, AC, AU, C, I, and A) “are appropriate from a theoretical perspective” [56]. To understand the “appropriateness” of these metrics, the authors asked the survey respondents open-ended questions about whether any additional metrics were needed and whether the existing metrics required revision. The authors identified eight categories of suggested additions and revisions: Prevalence of the vulnerable application, Cost of Impact, Availability of an exploit, Possibility of automization, Availability of a patch, Availability of detection system signatures, Effectiveness of detection system signatures, and the use of Vulnerabilities in combination with each other. Notably, 8 of the 38 respondents who discussed “Cost of Impact” indicated that “Cost of Impact” should be considered as part of or replacing the “AU” metric, and one respondent indicated that “Cost of Impact” should be considered as part of or replacing the “AC” metric, even though both AV and AC are Exploitability metrics. The authors note that most of these proposed metrics are Environmental metrics, some of which, such as Availability of an Exploit or Cost of Impact, are explicitly covered in the existing Temporal and Environmental metrics of CVSS.

7 Automated Deterministic

We divide the automated exploitability assessment tools based on whether the determination of vulnerability exploitability is primarily a Deterministic, rule-based process or if the determination is Probabilistic. A total of 21 out of 23 studies on Deterministic, automated exploitability assessment tools in this survey base the assessment on Program State information gathered via program analysis, which we discuss in Section 7.1. A similar approach, using network analysis, is discussed in Section 7.2. However, the key underlying component of the network-based metric, a graph of the system state machine, is equally applicable to the analysis of vulnerabilities in other contexts. The tool is similar to the Program State based analysis tools. Finally, we identified one unique attacker-based Deterministic assessment, which we discuss in Section 7.3.

7.1 Program State Based

In this section, we discuss automated, Deterministic assessments that examine how a sequence of program states may lead to an “exploitability property” [12] being met. The exploitability property is based on the intended output, such as the type of exploit to be produced or the sub-metrics of the CVSS score. The exploitability property is defined in terms of the types of program analysis performed by the tool, such as static binary analysis or memory monitoring.

For example, the exploitability properties in S06 (see [11, 12, 16, 107]) are defined in terms of elements of the internal program state space such as functions and register calls—for example, “the IP register holds a value that corresponds to some function f of user input i such as f may be a call to tolower on the input i and the resulting IP points to shellcode” [12]. The authors use static analysis and binary instrumentation to extract assembly language functions from the program executable (also referred to as a “binary”) [11, 16], gathering information about the program state space. The information about program state space is used by an SMT (Satisfiability Modulo Theory) solver [12] to determine if there is an execution trace that results in the exploitability property being met. The tool uses the SMT result to produce an exploit.
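
The idea of checking whether some input satisfies an exploitability property can be illustrated with a toy model. The sketch below is not the S06 tool and does not use an SMT solver: it swaps in an exhaustive search over a two-byte input as a stand-in for constraint solving. The program model (`instruction_pointer`), the 16-bit address width, and all names are assumptions made for illustration.

```python
from itertools import product
from typing import Optional, Tuple


def to_lower(byte: int) -> int:
    """ASCII tolower applied to a single byte value."""
    return byte + 0x20 if 0x41 <= byte <= 0x5A else byte


def instruction_pointer(user_input: Tuple[int, int]) -> int:
    """Toy program model: IP is the little-endian 16-bit value of tolower(input)."""
    low, high = (to_lower(b) for b in user_input)
    return low | (high << 8)


def find_exploit_input(shellcode_addr: int) -> Optional[Tuple[int, int]]:
    """Exhaustive search standing in for an SMT query.

    Returns an input for which IP == shellcode_addr, or None if the
    property is unsatisfiable in this toy model.
    """
    for candidate in product(range(256), repeat=2):
        if instruction_pointer(candidate) == shellcode_addr:
            return candidate
    return None


reachable = find_exploit_input(0x6261)    # both address bytes are lowercase letters
unreachable = find_exploit_input(0x4142)  # address bytes are uppercase letters
print(reachable, unreachable)
```

The second query fails because `tolower` never emits bytes in the range 0x41 to 0x5A, which is the kind of input transformation constraint a real solver must reason about symbolically rather than by enumeration.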

As a different example, in S43 (see [144]), the properties are defined based on CVSS and determined based on outputs of instrumented binary analysis. For example, the property based on the AV metric was determined using the existence and observed behavior of function calls such as socket and connect when a dynamic trigger for the vulnerability is run against the target application.
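
A rule of this kind can be sketched as a lookup over the calls observed while the dynamic trigger runs. The rule below is a hypothetical simplification, not the rule set from S43: it only checks for the two calls named above and assumes that their joint presence suggests a Network attack vector, falling back to Local otherwise.

```python
from typing import Iterable

# Calls named in the text. Treating both as required is an assumption.
NETWORK_CALLS = {"socket", "connect"}


def infer_attack_vector(observed_calls: Iterable[str]) -> str:
    """Hypothetical rule: Network if all network calls were observed, else Local."""
    calls = set(observed_calls)
    if NETWORK_CALLS <= calls:
        return "Network"
    return "Local"


# Hypothetical traces recorded while running a dynamic trigger.
remote_trace = ["open", "socket", "connect", "read", "memcpy"]
local_trace = ["open", "read", "memcpy"]

print(infer_attack_vector(remote_trace))
print(infer_attack_vector(local_trace))
```

A real tool would also need to inspect the observed behavior of each call, for example its arguments, which this sketch deliberately leaves out.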

Table 5 highlights characteristics of the studies of automated, Deterministic exploitability assessments. The first two columns of Table 5 indicate the study ID and associated bibliography entries. The third column indicates the years in which the studies were published. The fourth and fifth columns pertain to the tool outputs, which we will discuss in Section 7.1.1, including the type of exploit that the tool can generate, if applicable. Columns 6 through 9 show the primary inputs to the tools, which we discuss in Section 7.1.2. Column 10 is the type of vulnerability the tool is designed to analyze. Column 11 is the language(s) of the vulnerable applications to which the tools can be applied, which relates to the type of vulnerability. Columns 10 and 11 are discussed in Section 7.1.3. Columns 12 through 14 highlight aspects of the evaluation performed in each study, which are discussed in Section 7.1.4. Column 12 indicates the types of programs (applications) that the tool is intended to be used against. Column 13 indicates the size of the dataset used in the evaluation in terms of the number of vulnerabilities. Column 14 indicates the Time to Run the tool per vulnerability (unless otherwise noted). The rows of Table 5 are ordered first by output and then by the year(s) in which the work was published since, as we will discuss in Section 7.1.1, the tool output is a key distinguishing factor between studies.

Table 5.

ID	Bib.	Year	Output	(Output) Exploit Type	Inst.: source code	Inst.: executable	Vuln.: vuln. stmt. loc.	Vuln.: dynamic trigger	Vuln. Type(s)	Lang.	Eval.: Target Software	Eval.: # Vuln.	Eval.: Time to Run
S06	[11, 12, 16, 107]	2011, 2012, 2014	Exploit	Control Flow Hijack $^{a}$		X		X	Memory	C/C++	Command-line programs	29	$<$ 1m to 3hr 41m
S07	[53, 54, 55]	2012, 2018, 2019	Exploit	Control Flow Hijack		X		X	Memory	C/C++	OS (Linux kernel), language interpreters	10	27m to 53m
S09	[58, 59]	2012, 2014	Exploit	Control Flow Hijack		X		X	Memory	C/C++	Command-line & user-level programs	33	$<$ 1m to 4hr $^{b}$
S31	[102]	2017	Exploit	Control Flow Hijack		X		X	Memory (heap based)	C/C++	Windows services & libraries	8	10m to 20hr 27m
S35	[124, 142]	2018, 2020	Exploit	Control Flow Hijack		X		X	Memory (heap based)	C/C++	User-level programs	24	$<$ 1m to 22m
S36	[17, 127, 128]	2018, 2019	Exploit	Control Flow Hijack		X		X	Memory (heap based)	C/C++	OS (Linux kernel)	27	1m to 2hr $^{c}$
S37	[47]	2018	Exploit	Control Flow Hijack		X		X	Memory	binary-based	Web browsers	5	$<$ 1m to 2m
S46	[29]	2020	Exploit	Control Flow Hijack $^{a}$		X		X	Memory (heap based)	binary-based	User-level programs	20	15m to 41m
S49	[133]	2020	Exploit	Control Flow Hijack $^{a}$		X		X	Memory	C/C++	Language Virtual Machines	11	11hr 57m to 13hr 26m
S56	[64]	2022	Exploit	Control Flow Hijack	X	X		X	Memory	C/C++	Command-line programs	38	8hr $^{d}$
S57	[140]	2022	Exploit	Control Flow Hijack		X		X	Memory (heap based)	C/C++	OS (Linux kernel)	17	5m $^{d}$
S53	[78]	2021	Exploit	Triggering Race Condition (PoC)		X	X		Concurrency bug	C/C++	OS (Linux kernel, Microsoft Windows)	10	$<$ 1m to 2m
S16	[44]	2015	Exploit	Input that is not sanitized (PoC)		X	X		Sensitive statement (user defined) $^{a}$	Java	Android apps	26	2m to 33m $^{b}$
S21	[60, 95, 96, 97]	2015, 2018, 2020, 2021	Exploit	Unexpected Behavior (PoC)	X		X		Vuln. Third-Party Components	Java	Programs using third-party libraries	627	$<$ 1m to $>$ 1hr
S58	[67]	2022	Exploit	Unexpected Behavior (PoC)	X		X		Vuln. Third-Party Components	Java	Programs using third-party libraries	42	$<$ 1m to 10m $^{c}$
S25	[1, 2]	2016, 2018	Exploit	Injection, Exec. After Redirect	X		X		Input Validation, Business Logic	PHP	Web application	26	$<$ 1m to 2hr 19m $^{b}$
S29	[46]	2017	Exploit	DoS; Injection $^{a}$	X		X		Null Ptr. Validation; Data Checks $^{a}$	Java	Android apps	2092	$<$ 1m to 3hr 27m $^{b}$
S54	[20]	2021	Exploit	Injection, Path Manipulation	X		X		Input Validation, Hardcoded Key, Dangerous Func., Open Redirect, Info. Disclosure	PHP	Web applications	403	NP
S13	[135, 137]	2014, 2016	Exploit Info.	NA	X		X		Any/Unspecified	C/C++	OS (Linux kernel), Services (Apache)	111	NP
S33	[51]	2017	Exploit Info.	NA		X		X	Memory (heap based)	C/C++	“Real-world” programs	9	$<$ 2m (avg.), 5m (max.)
S43	[144]	2019	CVSS Scores	NA		X		X	Any/Unspecified	C/C++	OS (Linux kernel), Services (Apache, FTP)	98	NP

Table 5. Studies on Automated, Rule-Based Program Analysis Tools for Assessing Exploitability

NA indicates not applicable; NP indicates not provided.

$^{a}$ Studies S06, S29, S46, S49, and S16 include a template mechanism whereby additional types of vulnerabilities and exploits can be analyzed.

$^{b}$ In S09, S16, S25, and S29, the Time to Run is reported “per application” rather than “per vulnerability.”

$^{c}$ S36 and S58 stopped their tool at the maximum timestamp, even if the tool had not achieved its goal.

$^{d}$ In S56 and S57, the tool runtime was capped at 8 hours and 5 minutes, respectively, since the tools could, theoretically, run indefinitely.

$^{e}$ In S06, the maximum range value at over 3 hours was an outlier. The vulnerability with the second-highest Time to Run only required 16 minutes.

7.1.1 Outputs.

Within the studies examining automated, Deterministic tools based on the program state space, 18 of the 21 studies (S06, S07, S09, S31, S35, S36, S37, S46, S49, S56, S57, S53, S16, S21, S58, S25, S29, and S54) examine tools which produce, as output of the exploitability analysis, an input to the system under test that can be used to trigger the exploitability property. These are sometimes referred to as EG tools [12]. The types of exploits developed by the tools in our survey are listed under “Exploit Type” (the fifth column) of Table 5. The horizontal line in Table 5 differentiates between studies where the main focus is EG tools and studies which focus on other tools.

The work in S56 (see [64]) illustrates the relationship between EG tools and other tools. In addition to building an EG tool, the authors determine a vulnerability’s severity score based on the properties of the exploits generated as part of EG. The properties assessed to determine the severity score in S56 overlap with those examined to determine severity in S33 (see [51]), a non-EG study. For example, both studies use the number of bytes over-written by an exploit of a buffer overflow vulnerability. Similarly, in S13 (see [135, 137]), another non-EG study, the authors determine vulnerability exploitability based on reachability analysis very similar to the EG tools examined in S21 (see [60, 95, 96, 97]) and S16 (see [44]), which only provide a “PoC” exploit to determine the reachability of vulnerabilities in third-party libraries. However, the tool in S13 does not provide the exploit.

The three studies which do not focus on EG (S43, S33, and S13) still produce an output indicating whether a different exploitability property is met. For example, in S43 (see [144]), the authors analyze whether an execution trace triggers specific functions associated with lower or higher CVSS v3 AV, AC, PR, and UI scores. Among other findings, the authors found that statements using the phrase chmod, the name for the utility for changing privileges in Linux-based OS [52], are associated with higher PR (Privileges Required) scores.

7.1.2 Inputs.

All of the automated, rule-based exploitability assessment tools require at least two inputs: an instance of the application containing the vulnerability (Inst.) and information about the vulnerability itself (Vuln.). The instance of the application may be in the form of source code, or in the form of an executable (often referred to as a “binary”) that has been compiled from the source code. Starting with source code may seem to provide more options, since source code can be compiled into an executable more easily than an executable can be reverse-engineered. However, He et al. [51] argue that starting with the executable is more advantageous since “In practice, it is common that program source code is unavailable.” Six studies (S13, S21, S58, S25, S29, and S54) start with only the source code for the instance of the vulnerable application. One study (S56) uses both the source code and an executable form. Fifteen studies (S06, S07, S09, S16, S31, S33, S35, S36, S37, S43, S46, S49, S53, S56, and S57) use a compiled (binary) executable form of the instance of the vulnerable application. S07 also requires a set of test cases for the entire target application, such as a regression test suite [55], as part of its inputs.

Information about the vulnerability is either the location of the vulnerable statement within the codebase (vuln. stmt. loc.), which is typically extracted through static analysis, or an application input that triggers unintended behavior indicating a vulnerability, which may be referred to as a dynamic trigger. A dynamic trigger is typically produced through dynamic analysis that does not have access to source code. As can be seen in Table 5, vuln. stmt. loc. is typically used by tools that focus on the analysis of the source code, whereas a dynamic trigger is more frequently used with binary analysis.

7.1.3 Vuln. Type(s) and Language.

As can be seen in Table 5, program analysis for assessing memory vulnerabilities in C/C++ programs is an area of considerable prior research. Memory vulnerabilities, particularly heap vulnerabilities [54, 55, 140], require analysis of a program in the context of the system within which it will be run. For example, the exploitability of heap vulnerabilities depends on the memory allocator in the system within which the vulnerable application is running. Consequently, most tools targeting heap vulnerabilities involve dynamic analysis in which the program is running in a particular context. In contrast, tools examining other types of vulnerabilities in other languages tend to rely on static analysis.

7.1.4 Evaluation.

Table 5 highlights three key aspects of each tool’s implementation and evaluation: the types of software targeted by each tool (Target Software); the number of vulnerabilities (# Vuln.) examined in the evaluation which we use as a common measure for the size of the dataset in each study; and the range of values for the Time to Run each tool, which is assessed per vulnerability unless otherwise noted. For studies that included more than one paper, all papers in the same study targeted similar software. In Table 5, we include the # Vuln. and Time to Run from the largest, most recent evaluation. We discuss the setup of the different evaluations in terms of Target Software and size of the dataset (# Vuln examined) in Section 7.1.4.1. We will discuss the Effectiveness of the different approaches on these datasets in Section 7.1.4.2, alongside the Time to Run.

Setup. As is seen in Table 5, the different Program State based methods examine a wide range of target systems as part of their evaluation. This section provides an overview of the Target Software for each evaluation and the Number of Vulnerabilities examined as part of the evaluation, and we discuss how these setup parameters may have influenced the results in Section 7.1.4.2:

Target Software:

The Target Software for the tool(s) used in the study is shown in the 12th column of Table 5. Where the authors do not specify a class of software program the tool is intended for, we generalized the Target Software based on the systems analyzed as part of the empirical evaluation. For example, S35 and S46 were evaluated against programs from Capture-the-Flag competitions. S35 refers to their targets simply as “programs,” whereas S46 refers to the CTF programs as “user-level applications”; therefore, we refer to the target application as “user-level programs.” As seen in Table 5, seven of the studies focused on services, command-line, or user-level vulnerabilities. Six focus on Operating Systems. Two studies examine language interpreters and language virtual machines. Two studies focus on client programs of third-party libraries. Two studies focus on Android applications (apps). Two studies focus on web applications. One study examined web browsers.

Number of Vulnerabilities (# vuln).

Column 13 of Table 5 shows the number of vulnerabilities used in the largest evaluation for each system. Many of these datasets are relatively small. The largest set of vulnerabilities analyzed was extracted from 835 Android apps using static analysis tools in study S29 (see [46]).

Results. This section focuses on effectiveness and the time a tool takes to run. Effectiveness, as measured in terms of precision and recall, is less consistently reported in the literature, particularly for EG tools. However, studies often evaluate other attributes, for example, impact-related attributes such as the amount of data that can be corrupted in a memory exploit [51, 55, 64]. While impact evaluations may be appropriate for the tool under evaluation, they are outside the scope of exploitability, as noted in Section 3. The most consistently examined metrics were those for the Time to Run the tools. Even within the Time to Run metric, comparing evaluations can be challenging due to differences in the evaluation setups.

Effectiveness.

Measures of Effectiveness, such as precision and recall, are important considerations for practitioners when using vulnerability assessment tools [97]. However, the differences in setup between evaluations and small dataset size (e.g., seven of the studies examine fewer than 20 vulnerabilities as seen in Table 5) complicate comparisons between studies. Of the 17 Deterministic, Program State based studies where the exploitability of the vulnerabilities is known, at least 10 studies report no false negatives (S13, S33, S06, S07, S09, S31, S37, S49, S56, and S53), suggesting a recall of 100%. Additionally, of the studies examining tools that output an exploit, only S25 (see [1, 2]) reports false positives (i.e., false information introduced into the report) or precision. In S25, only five false positives were produced, which covered 2 of the 29 total vulnerabilities examined. Other studies that do provide either precision or statistics on incorrect information that could be interpreted as false positives are S43 (see [144]), which only reports two incorrect responses, and S13 (see [135]), which reports precision of 47.83% when run against the Apache HTTP server, and 78% against the Linux kernel.
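As a reminder of how the two measures discussed above relate to the reported counts, the following minimal sketch computes precision and recall from true positives, false positives, and false negatives. The counts in the example are hypothetical and are not taken from any of the surveyed studies; a tool with no false negatives obtains a recall of 100% regardless of its precision.

```python
def precision(true_pos: int, false_pos: int) -> float:
    """Fraction of findings reported as exploitable that really are exploitable."""
    reported = true_pos + false_pos
    return true_pos / reported if reported else 0.0


def recall(true_pos: int, false_neg: int) -> float:
    """Fraction of truly exploitable vulnerabilities that the tool reported."""
    exploitable = true_pos + false_neg
    return true_pos / exploitable if exploitable else 0.0


# Hypothetical evaluation: 8 correct reports, 2 incorrect reports, nothing missed.
EXAMPLE_PRECISION = precision(true_pos=8, false_pos=2)
EXAMPLE_RECALL = recall(true_pos=8, false_neg=0)
print(f"precision={EXAMPLE_PRECISION:.2f} recall={EXAMPLE_RECALL:.2f}")
```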

Time to Run.

The use of small datasets may be partly explained by the time it takes to run many of these tools, shown in the last column of Table 5. S57 and S58 set a maximum time limit for how long to run the tool. For the other studies, we provide either the range of the time to run or the average Time to Run if the publication does not provide the range. S43 and S13 do not include an evaluation of Time to Run. A lower range does not inherently mean a tool is always faster. As indicated by the Target Software and # Vuln. columns, different tools were evaluated against different vulnerabilities in different systems. Tools run against simpler, CTF-based “user-level programs” in S35 and S46 will likely have a smaller state space and shorter Time to Run than tools run against the Language Virtual Machines examined in S49. The discrepancy in performance between systems is examined in S09 (see [58, 59]), where the authors found that their tool could produce exploits in less than 5 seconds on simple example code, but required 4 hours to run when applied to a larger piece of software, Foxit PDF Reader, containing more than a million lines of code [59]. Similarly, in S06, the authors found that their tool required nearly 4 hours to run on one executable that had been encoded (“packed”) [16]. In comparison, the second-highest runtime in S06 was only 16 minutes. As noted by Ponta et al. [97], in S21, based on their work at the company SAP, the time required to run a tool influences when and how it can be used. While the vulnerability detection component of their tool only required 76 seconds to run and could be included in frequent scans, the exploitability analysis often required hours to run and was more likely to be included in deeper scans run shortly before release or in manual analysis performed after a release.

7.2 Network System State Based

We identified one study, S01 (see [123]), which predates the program analysis work but leverages a similar, Deterministic vulnerability exploitability metric as part of their network security analysis. They model the network as a graph of system states and measure the exploitability of a vulnerability as the in-degree of the vulnerability within the system state graph. Although this graph is based on a network and not a program, their graph of system states is quite similar to the graphs used to analyze the system state of a program in other Deterministic systems such as S16 (see [44]) and S13 (see [135, 137]). The assessment in S01 focuses on system-level metrics that their vulnerability-level metric contributes to rather than evaluating the vulnerability exploitability metric.
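The in-degree idea can be illustrated with a toy example. The sketch below is not the graph construction used in S01, which is not reproduced here; the node names and edges are hypothetical, and the graph is represented simply as a list of directed (source, target) pairs in which an edge means that the source state can reach the target.

```python
def in_degrees(edges):
    """Count incoming edges for every node in a directed graph."""
    degrees = {}
    for source, target in edges:
        degrees.setdefault(source, 0)  # make sure pure sources appear with degree 0
        degrees[target] = degrees.get(target, 0) + 1
    return degrees


# Hypothetical system state graph: two states lead to vuln_A, one leads to vuln_B.
EDGES = [
    ("internet", "vuln_A"),
    ("workstation", "vuln_A"),
    ("vuln_A", "vuln_B"),
]

DEGREES = in_degrees(EDGES)
print(DEGREES)
```

Under this reading, vuln_A would be rated as more exploitable than vuln_B because more states lead to it.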

7.3 Attacker Based

We identified one unique study, S27 (see [31]), in which “exploitability” was defined based on attacker characteristics and activity. In S27 (see [31]), the authors propose a metric, Threat Agent Count (TAC), that indicates the quantity of threat actors capable of exploiting a particular vulnerability. The authors evaluate their metric by comparing six vulnerability prioritization policies: (1) a FIFO (first-in-first-out) policy; (2) prioritizing vulnerabilities with the highest CVSS score first; (3) prioritizing vulnerabilities with the highest score from the adapted CVSS framework proposed in S04 (see [80]) first; (4) prioritizing fixing vulnerabilities with the highest TAC score first; (5) prioritizing fixing vulnerabilities based on highest CVSS score, then highest TAC score in case of a tie; and (6) prioritizing fixing vulnerabilities based on highest TAC value, then the highest CVSS value in the case of a tie. They evaluate and compare the policies based on how much the policy exposes a software system to vulnerabilities with a known exploit. Exposure is measured as a function of time $t$, where $E_t$ is the set of vulnerabilities with known exploits that have yet to be fixed at time $t$, such that $\mathrm{Exposure}(t) = \sum_{i=0}^{t} |E_i|$. The authors used 1,000 randomly selected vulnerabilities to represent the vulnerable Information System, which was evaluated via simulations of 1,000 timesteps $t$. At each timestep, 1 vulnerability was fixed according to the policy under evaluation. The authors report their results at the end of each quarter of the simulation (i.e., every 250 timesteps). The policy of prioritizing vulnerabilities based on TAC, then by CVSS in the case of a tie (policy 6), had the lowest exposure at the end of each quarter, followed by prioritization entirely based on TAC, where in the case of ties, the first-found vulnerability is removed first (policy 4). This leads the authors to claim that TAC “is a significantly better predictor of exploitable vulnerabilities than CVSS score” [31].
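The cumulative exposure measure can be sketched in a few lines. This is an illustration under stated assumptions rather than the simulation code of S27: the three vulnerability records and their field names (cvss, tac, has_exploit) are hypothetical, E_0 is taken to be the backlog before any fix, and E_i for i >= 1 is taken to be the backlog after the i-th fix. Because Python's sort is stable, ties keep the first-found order.

```python
def exposure_curve(vulns, priority_key):
    """Return [Exposure(0), Exposure(1), ...] when one vulnerability is fixed per timestep."""
    queue = sorted(vulns, key=priority_key, reverse=True)  # highest priority first
    open_exploited = sum(v["has_exploit"] for v in queue)  # |E_0|
    total = open_exploited
    curve = [total]
    for fixed in queue:
        open_exploited -= fixed["has_exploit"]  # fixing removes it from E_i if it had an exploit
        total += open_exploited                 # running sum of |E_i|
        curve.append(total)
    return curve


# Hypothetical backlog of three vulnerabilities.
VULNS = [
    {"id": "V1", "cvss": 9.8, "tac": 1, "has_exploit": False},
    {"id": "V2", "cvss": 5.0, "tac": 7, "has_exploit": True},
    {"id": "V3", "cvss": 7.5, "tac": 3, "has_exploit": True},
]

CVSS_THEN_TAC = exposure_curve(VULNS, lambda v: (v["cvss"], v["tac"]))  # like policy 5
TAC_THEN_CVSS = exposure_curve(VULNS, lambda v: (v["tac"], v["cvss"]))  # like policy 6
print(CVSS_THEN_TAC, TAC_THEN_CVSS)
```

In this toy backlog the TAC-first ordering fixes both exploited vulnerabilities before the unexploited one, so its cumulative exposure stops growing earlier. The example only shows how the measure is computed and says nothing about how the policies compare on real data.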

8 Automated Probabilistic Assessments: Learning Models

Another body of research builds automated prediction models that take input from existing datasets and convert the data into a set of features, which are then analyzed using machine learning or neural networks to produce a value indicative of a high-level concept. We collectively refer to these machine learning and neural network based models as LMs. Table 6 summarizes the LMs identified in our survey. The first two columns indicate the study ID and corresponding bibliography entry, as shown previously in Table 4. The next column indicates the Year(s) in which the studies were published. Columns 4 through 9 indicate aspects of the data used as the “ground truth” (GT) for training/testing of supervised learning models, which we discuss in Section 8.1. Columns 10 through 14 indicate types of data used as Features (FT) for the models, which we discuss in Section 8.2. The data sources for GT and FT will be covered in their respective sections and are shown in Column 15. The number of vulnerabilities in the dataset used to train and test the model is shown in Column 16. Where more than one model is examined, we focus on the newest model proposed in each study. If it is unclear which model is the newest based on publication date, we use the largest. Finally, in the last column of Table 6, we list the Modeling Technique(s), such as SVM, which are examined in each study. We briefly discuss trends in modeling techniques in Section 8.3. However, we leave a detailed analysis of the technical details of modeling techniques to surveys focused on the LM techniques, such as the works by Sotos Martinez et al. [113], Kotenko et al. [70], and Le et al. [76] described in Section 2.

Table 6.

In Table 6, we separate the LMs into four groups based on GT and FT, as indicated by the single and double lines, to highlight high-level trends in LM exploitability assessment research. These divisions are based on the following distinctions:

GT—Base CVSS “exploitability” vs. Other Exploit indicators:

There appears to be a clear distinction between models that target the sub-metrics of the CVSS score, including Exploitability-related sub-metrics, and models that target other exploit-related information.

FT—Program analysis vs. Metadata and natural language processing:

As can be seen in Table 6, studies tend to focus either on features gathered through program analysis or on features gathered from other sources such as vulnerability reports and social media posts. Non–program analysis features can be in the form of metadata, such as the number of references or the CWE (Common Weakness Enumeration) scores associated with a vulnerability, or can be extracted via Natural Language Processing (NLP) techniques.

As can be seen in the “Year” (third) column of Table 6, the use of program analysis based FT to predict CVSS scores has only become a popular research topic in recent years (since 2021). In comparison, the use of Program Analysis based FT to predict exploits goes back to 2016, whereas the use of vulnerability reports and other FT for both CVSS and other exploit-related predictions goes back to 2015 and 2010, respectively. We discuss types of GT and FT in more detail in Sections 8.1 and 8.2.

8.1 What Is Used as the Ground Truth Labels for Training/Testing (GT)?

All of the models in this section are supervised machine learning or neural network models, requiring training data that includes a set of features on which the model is based and labels indicating the ground truth for each instance (e.g., the exploitability for each vulnerability). Empirical studies of machine learning models typically perform their evaluation by withholding a subset of the labeled data to use in testing the model once it has been trained.
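The holdout procedure described above can be sketched as follows. The function name, the 20% test fraction, and the fixed seed are assumptions made for illustration; individual studies may use different fractions, cross-validation, or time-based splits.

```python
import random


def holdout_split(records, test_fraction=0.2, seed=0):
    """Shuffle labeled records and withhold a fraction of them for testing."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # local generator keeps the split reproducible
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (training set, test set)


# Hypothetical labeled data: (vulnerability index, has_exploit label).
LABELED = [(i, i % 3 == 0) for i in range(10)]
TRAIN, TEST = holdout_split(LABELED)
print(len(TRAIN), len(TEST))
```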

The models can be split into two distinct groups based on their GT. The first group is models that use the base CVSS score directly or indirectly from NVD as the target of their model, which we discuss in Section 8.1.1. The second group uses one or more of a combination of indicators, including exploits from a database like ExploitDB, Exploit Signatures, specially dedicated exploit flags in vulnerability databases, and other exploit-related indicators. As shown in Table 6, authors can combine multiple sources as part of the same GT, hampering further orthogonal classification. We discuss this second group of Other Exploit-Related GT in Section 8.1.2.

As we discuss in Section 6, and as is documented in work on LMs such as EPSS where the maintainers report the features that contribute most to the LM, exploit and exploitability indicators are only loosely related to each other and are also related to factors such as the organization that maintains the software (e.g., Microsoft). Hence, there is no clear “winner” among types and sources of GT. Furthermore, as can be seen in the Data Source(s) column (Column 15) of Table 6, the existing work primarily uses publicly available datasets, which are known to provide higher coverage of large organizations such as Microsoft. More work is needed to understand how these models generalize to less well known organizations and smaller projects, or to data collected internally, which may be too sensitive to release publicly.

8.1.1 Base CVSS.

As discussed in Section 3, CVSS is one of the most common standards for software assessment. Since we are primarily concerned with the exploitability part of the assessment, we focus on papers in which learning algorithms are used to target the sub-metrics of CVSS, including those related to exploitability. As seen in Table 6, the sub-metrics of the Base CVSS score provided by the NVD are used as a GT for S14, S38, S41, S42, S48, S51, S52, and S59.

8.1.2 Other Exploit-Related Indicators.

While the Exploitability sub-metrics of the Base CVSS score are an industry-recognized standard for “exploitability,” there is less consensus on which other GT are indicators of “exploitability.” In S03, S18, S26, S28, S30, S40, and S45, the authors indicate that their models predict concepts such as “exploitability” or “whether a vulnerability is exploitable.” However, S19, S20, S32, and EPSS do not indicate that they are explicitly “exploitability” models, instead indicating that they are models of “likelihood of exploitation” (i.e., the likelihood that an attacker will exploit a vulnerability). In S03, the authors assert that these concepts of “exploitability” and “likelihood of exploitation” should be highly related [14], whereas other researchers have found that indicators of exploitability may only explain part of the likelihood of exploitation [5, 103]. However, these models focus on the same practical observations, regardless of what they represent. As can be seen in Table 6, the “ground truth” used for training/testing (GT) “exploitability” models such as in S18 or S26 may be the same GT used in “likelihood of exploitation” models such as in S20 or S32. Meanwhile, S32 examines CVSS Exploitability scores as FT for the model. The lack of usability testing or practitioner feedback on exploitability LMs in the academic literature further complicates the conceptual classification of these models, that is, whether the FT or the GT should be considered “exploitability.” Consequently, we focus on the actual observations used in each of the LMs both for the ground truth (GT) and for the features (FT) used to predict GT.

Exploitability Changes over Time: Temporal vs. Non-Temporal Models. As noted previously by Le et al. [76] in their analysis of LM, temporal aspects of exploitability are a theme in LM. At least three studies (S03, S20, and S55) targeted a temporal value. We note which studies include a temporal model in the fifth column of Table 6.

In both S03 (see [14]) and S20 (see [34, 35]), the authors examine and compare a model for non-temporal, binary classification (i.e., whether an exploit exists), and a prediction model for Time to Exploit—that is, the difference between when an exploit is publicly available and when the vulnerability was disclosed. Both authors note that the time to exploit may be more useful for practitioners in determining how to budget their resources. However, both authors found the non-temporal model to have higher accuracy than the temporal model. In S03, the accuracy decreased from 89% to 79% [14] between the non-temporal and temporal models. In S20, Edkrantz [34] found that optimal models for the non-temporal classification had 81% to 82% accuracy, whereas their attempts to construct the temporal model resulted in precision, recall, and F1 scores so low (all below 0.6) that the authors did not investigate the temporal model further.
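The two prediction targets compared in S03 and S20 can be derived from the same pair of dates. The sketch below assumes day-level granularity and uses hypothetical dates; the function names are illustrative and do not come from either study.

```python
from datetime import date
from typing import Optional


def has_exploit(exploit_published: Optional[date]) -> bool:
    """Non-temporal, binary target: does a public exploit exist at all?"""
    return exploit_published is not None


def time_to_exploit(disclosed: date, exploit_published: date) -> int:
    """Temporal target: days from vulnerability disclosure to public exploit.

    The value is negative when the exploit was public before disclosure.
    """
    return (exploit_published - disclosed).days


# Hypothetical vulnerability disclosed on 2020-01-10 with an exploit on 2020-02-09.
DAYS = time_to_exploit(date(2020, 1, 10), date(2020, 2, 9))
print(DAYS)
```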

In S55 (see [116]), the authors had better results than in S03 or S20 but predicted a slightly different concept. In S55, Suciu et al. [116] develop a model to predict “over time the likelihood that a functional exploit will be developed,” a concept they refer to as “expected exploitability.” As described by the authors, “functional exploits go beyond proofs-of-concept (PoCs) [emphasis added] to achieve the full security impact prescribed by the vulnerability” [116]. The models of S55 had precision greater than 0.8 and recall greater than 0.6 with an overall AUC of 0.73 [116].

Types. The non-CVSS types of indicator used as GT labels can be divided into four categories: Exploit Datasets, Exploit Signature Datasets, Exploit Flags in Vulnerability Datasets, and four indicators that were unique to a particular model, which we classify as “Other” and discuss individually. Different data sources within each of these four categories can also influence the performance of the resulting model. For example, as shown by Allodi and Massacci [5, 6] as early as 2012 in S08 and examined further by other research (e.g., S19 [104]), different exploit datasets may contain exploits for different vulnerabilities. Furthermore, as shown in Table 6, many papers use multiple data sources. Where multiple sources are used, a vulnerability only has to be determined as “exploitable” by one of the sources to be labeled as a True Positive [10, 104, 116].
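The rule for combining multiple GT sources amounts to a logical OR over the sources. In the sketch below, the source names and the CVE identifiers are placeholders chosen for illustration and do not correspond to real records.

```python
def label_exploitable(cve_id, sources):
    """Label a vulnerability positive if any one GT source lists it."""
    return any(cve_id in listed for listed in sources.values())


# Hypothetical GT sources, each mapping to the set of vulnerabilities it flags.
SOURCES = {
    "exploit_dataset": {"CVE-0000-0001"},
    "signature_dataset": {"CVE-0000-0002"},
    "exploit_flag": set(),
}

LABELS = {
    cve: label_exploitable(cve, SOURCES)
    for cve in ("CVE-0000-0001", "CVE-0000-0002", "CVE-0000-0003")
}
print(LABELS)
```

One consequence of this rule is that adding a source can only turn negative labels into positive ones, so the positive class grows with the number of sources combined.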

Exploit Datasets.

The most common observation used as a GT label is whether an exploit is available in a dataset, being used in eight of the studies. As seen in the last column of Table 6, the most common source of exploits is ExploitDB (used for GT in S19, S20, S32, S40, S45, and S18). In S55 (see [116]), the authors consider exploits from ExploitDB to be PoC exploits rather than “functional” exploits. Instead, the authors of S55 used exploits from commercial databases: Metasploit,⁶ a penetration testing tool with a corresponding database, as well as other commercial exploit tools including Canvas⁷ and the D2 Elliot Web Exploitation Framework.⁸ In S55, the authors also used exploits and other malware samples from the Contagio dataset, which has been used in other research [74, 93, 143]. Similarly, in S30, the authors use exploits from a variety of datasets created in prior work, including exploits previously used in the Mayhem publication from S06.

Exploit Signatures.

Another set of observations used as training/test data are sets of exploit signatures. As we discuss in Section 2.1, an exploit signature uniquely identifies a particular exploit as part of security tools such as Intrusion Detection systems. Exploit signatures are associated with “likelihood of exploitation” as much as exploitability [62, 88]. Exploit signature datasets take two forms:

—

First is the existence of the signatures themselves, most frequently gathered from Symantec (a commercial vendor) as was done in S26 (see [7, 8]) and S55 (see [116]). Signatures are also available through the SNORT⁹ and Suricata¹⁰ Intrusion Detection platforms, which were used in earlier versions of EPSS [62] as part of their dataset from Proofpoint.

—

The second form of signature data, the frequency with which the signatures are observed, is only used in two studies whose models are not explicitly exploitability models. Signature detection frequency is considered to be a more robust estimate of the likelihood of exploitation than the signatures alone [88]. However, telemetry data on the frequency with which vulnerabilities are exploited and the context in which exploitation occurs is inherently sensitive [33] and requires special permissions to access. The authors of S19 (see [104]) used the WINE (Worldwide Intelligence Network Environment) dataset [33], a set of attack signature observations collected by Symantec between 2008 and 2014 [129] as part of their GT. More recently, developers and maintainers of the EPSS model (see [40, 62]) have worked with AlienVault and Proofpoint, two commercial Intrusion Detection System vendors, to obtain more recent data on which exploits had been actively observed.

Exploit Flag in Vulnerability Database.

Some vulnerability databases include a specific flag about whether an exploit is known to exist for a particular vulnerability or has been seen in the wild, even if the database does not include the exploit itself or how the information was collected. For example, the OSVDB included information about vulnerabilities, including whether a vulnerability had “an available, rumored, or private exploit” as well as the date an exploit was first recorded [14]. Since the OSVDB is no longer available, this observation is used less frequently. However, in S50 (see [141]), published in 2020, the authors use the Vulners vulnerability database, which also has information on whether an exploit is available [141].

Other.

Four studies (S19, S55, S03, and S40) use GT data unique to those studies:

—

In S19, Sabottke et al. [104] consider vulnerabilities with a Microsoft Exploitability Index [84] of 0 or 1 to be “exploited” as part of their GT.

—

In S55 (see [116]), in addition to other GT described previously, the authors used the Exploit Code Maturity levels of the Temporal CVSS score from commercial sources¹¹ including IBM X-Force Exchange and Tenable Nessus. The authors of S55 also use NLP rules to extract evidence of exploits and of exploitation in the wild from databases, including BugTraq, Tenable, Skybox, and AlienVault OTX.
在 S55（见[116]）中，除了之前描述的其他 GT 之外，作者还使用了来自商业来源的 Temporal CVSS 评分的 Exploit Code Maturity 级别，包括 IBM X-Force Exchange 和 Tenable Nessus。S55 的作者还使用 NLP 规则从数据库中提取利用证据和野外利用的证据，包括 BugTraq、Tenable、Skybox 和 AlienVault OTX。

—

In their model for time to exploit in S03, Bozorgi et al. [14] use additional data from prior work by Frei et al. [42] to obtain more accurate information on when vulnerability exploits were released.
在他们的 S03 利用时间模型中，Bozorgi 等人[14]利用 Frei 等人[42]先前工作的额外数据，以获得关于漏洞利用发布时间的更准确信息。

—

In the second of the two models examined in S40 (see [10]), the authors identify 12 vulnerabilities used by a threat actor known as APT 28. The authors note that vulnerabilities targeted by APT 28 all appear in one of the following services: Adobe Flash, Java, Windows, Microsoft Office, and Microsoft Word. The authors then label all vulnerabilities in these five services as likely to be exploited, assuming APT 28 will continue to invest resources in exploiting the same services.
在 S40（见[10]）中考察的两个模型中的第二个模型中，作者确定了被称为 APT 28 的威胁行为者使用的 12 个漏洞。作者指出，APT 28 所针对的漏洞都出现在以下服务之一中：Adobe Flash、Java、Windows、Microsoft Office 和 Microsoft Word。然后，作者将这些五个服务中的所有漏洞标记为可能被利用，假设 APT 28 将继续投资资源来利用相同的服务。
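
Two of the study-specific GT rules above are simple enough to sketch in code. The following is an illustrative sketch only: the function names and the record fields are hypothetical and are not taken from S19 or S40, and the real studies combine these rules with other GT sources.

```python
# Services named in the S40 description; the set name is hypothetical.
TARGETED_SERVICES = {
    "Adobe Flash", "Java", "Windows", "Microsoft Office", "Microsoft Word",
}

def label_by_exploitability_index(index):
    """S19-style rule: an index of 0 or 1 is treated as 'exploited'."""
    return index in (0, 1)

def label_by_targeted_service(service, targeted=TARGETED_SERVICES):
    """S40-style rule: any vulnerability in a targeted service is
    labeled 'likely to be exploited'."""
    return service in targeted
```

Note that the second rule labels by service rather than by vulnerability, so every vulnerability in a listed service receives the same label.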

All forms of exploit-related GT have limitations. For example, some types of vulnerability may be over- or under-represented in exploit signature data due to how current Intrusion Detection and Anti-Malware systems are implemented [40, 62]. For input validation vulnerabilities that result in XSS (Cross-Site Scripting), the signature may target the attacker’s behavior of attempting to input a script rather than a specific vulnerability. These generic signatures may not be mapped to a particular vulnerability, which can result in the under-representation of XSS in some datasets. However, some authors consider exploit signatures a better indicator that a useful exploit can be created [116].

8.2 What Categories of Information Are Commonly Used as Features (FT)?

We categorize the features used in the different LMs into several categories, each of which is described in more detail in this section. Table 6 also shows the features used in each paper.

8.2.1 Program Analysis.

The first category of features used in models comprises those derived through program analysis. The studies identified in our survey that use program analysis as a feature include S52 (see [75, 86]), S59 (see [126]), S18 (see [134]), and S30 (see [119]), as shown in Table 6. The use of program analysis techniques in these LMs indicates some similarity between these models and the Deterministic tools discussed in Section 7.1, which also rely on program analysis.

Both S52 (see [75, 86]) and S59 (see [126]) rely on neural network models to extract features from source code. As indicated in the “# Vuln” column of Table 6, for both S52 and S59, the LM performs its classification at the level of the actual statements in a commit (S52) or functions (S59) rather than the entire vulnerability, which may contain multiple broken statements or functions. As seen in Table 6, the statements or functions map to a smaller number of vulnerabilities from the CVE list in the NVD, compared to other models.

In the first study in S52 (see [86]), features are extracted and classified at the commit level. In contrast, the second study takes a more detailed approach, classifying the individual vulnerable statements within each commit. The authors of S52 then examine different models for extracting the context of the vulnerable statement, such as examining the full function rather than just the vulnerable statement. The authors found that adding function context improved classification for all scores, but improved classification for AV and AC less than for other categories. For example, AV and AC showed 6.4% and 6.5% improvements when function context was used instead of only the vulnerable statement, whereas AU showed a 9% improvement (see [75]).

S59 focuses on vulnerabilities in the Linux kernel. Instead of basing the model on the functional source code, the model extracts function descriptions from the source code that have been formatted according to the kernel-doc format [126]. Precision and recall for the models in S59 for different exploitability values (AV, AC, AU) were relatively high, ranging between 86% and 95%.

In S18, Younis et al. [134] examine the discriminative power of eight metrics extracted from the function call graph of two applications: Apache HTTP server and the Linux kernel. The authors also analyze the predictive power of these metrics when different feature selection and machine learning algorithms are applied. The authors found that Count Path, which measures the number of paths in the call graph that go through the vulnerable function [109, 134], had the highest discriminative power according to Welch’s t-test [134]. SLOC (Source Lines of Code), a measure of code size, and Called-by Functions (also known as Out-Degree), which measures the number of functions called by the vulnerable function [134], also had statistically significant discriminative power. The discriminative power was not statistically significant for any other features, including Cyclomatic Complexity and Calling Functions (also known as In-Degree). The authors then applied Correlation-based, Wrapper subset evaluation, and Principal Component Analysis feature selection techniques with Logistic Regression, Naive Bayes, Random Forest, and SVM machine learning algorithms. Each model performed best when paired with a different feature selection technique. The models’ precision on the Apache HTTP server ranged from 44% to 84%, whereas the models’ recall ranged from 60% to 83%. However, for the Linux kernel, none of the models had a recall score over 70%, the threshold set by the authors for acceptable model performance [134]. The authors suggest that these program analysis features are less predictive for the Linux kernel because controlling the OS provides higher value to an attacker.
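
To make the call graph metrics in S18 more concrete, the following minimal sketch computes three of them on a toy graph. Everything in it is an assumption made for illustration: the graph, the function names, and in particular the reading of Count Path as the number of entry-to-leaf call paths that pass through the function. It is not the tooling used by Younis et al. [134], and it only handles acyclic graphs.

```python
from functools import lru_cache

# Hypothetical call graph: caller -> list of callees (acyclic).
CALL_GRAPH = {
    "main": ["parse_request", "log"],
    "parse_request": ["read_header", "copy_body"],
    "read_header": ["copy_body"],
    "copy_body": ["memcpy_wrapper"],
    "log": [],
    "memcpy_wrapper": [],
}

def callees_count(graph, fn):
    """Number of distinct functions that fn calls."""
    return len(set(graph.get(fn, [])))

def callers_count(graph, fn):
    """Number of distinct functions that call fn."""
    return sum(1 for callees in graph.values() if fn in callees)

def count_paths_through(graph, entry, fn):
    """Entry-to-leaf call paths that pass through fn (acyclic graphs only)."""
    @lru_cache(maxsize=None)
    def paths_between(src, dst):
        if src == dst:
            return 1
        return sum(paths_between(nxt, dst) for nxt in graph.get(src, []))

    @lru_cache(maxsize=None)
    def paths_to_leaf(src):
        callees = graph.get(src, [])
        if not callees:
            return 1
        return sum(paths_to_leaf(nxt) for nxt in callees)

    # Paths reaching fn from the entry, times the ways to continue to a leaf.
    return paths_between(entry, fn) * paths_to_leaf(fn)
```

For the toy graph, `count_paths_through(CALL_GRAPH, "main", "copy_body")` counts the two routes from `main` that reach `copy_body`, each of which continues to a single leaf.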

In S30 (see [119]), the authors build a tool, which they refer to as Exniffer, to assess the exploitability of memory corruption vulnerabilities detected due to system crashes. As described by the authors, “A crash as a result of a safety-critical bug may not necessarily be exploitable as it may not depend on the malicious inputs, whereas a crash due to a security bug necessarily depends on the malicious inputs” [119]. Similar to EG models for memory corruption vulnerabilities, the authors start with an executable that is run in an instrumented environment. When a crash occurs, the authors extract “static features” such as the x86 instruction being executed at the time of the crash (e.g., mov eax,[ecx]), and the type of exception thrown (e.g., memory access violation, floating point exception). The authors also extract “dynamic features” using the PIN analysis program to simulate LBR (Last-Branch-Record), a program tracing functionality available in many processors. These dynamic features include whether the crash occurred during the execution of a loop and the type of branch instruction (e.g., jump, function call, function return) most recently executed. Exniffer labeled the data using a variety of sources, including prior work and exploits found online and downloaded by the authors. They use an SVM algorithm to classify vulnerabilities as exploitable or non-exploitable, and apply RFE (Recursive Feature Elimination) to rank features. The top three features included corruption of a backtrace, Null Memory operand, and Executable Extended Instruction Pointer memory segment. Overall, their tool had a precision of 0.96 and a recall of 0.81.

8.2.2 Vulnerability Reports.

The most common source of features in vulnerability prediction models is vulnerability information available in reports such as those provided by the NVD. Some of this information is available in pre-processed categories, such as CVSS Base score vector elements, vulnerability type, and the product/vendor responsible for maintaining the software. However, these models also use features extracted from un-categorized text, such as the description field in the NVD reports, using NLP techniques. Models using data from vulnerability reports generally have strong performance. Sources and results vary between studies, suggesting that the success of these models depends on other factors evaluated in the same study, such as the GT. For example, Suciu et al. [116] report an AUC higher than 0.9 for their temporal metric for exploit development, which also includes Exploit Data based features, whereas the (non-temporal) experiments that led to EPSS [62], using slightly different GT sources and fewer Exploit Data based features, had an AUC between 0.78 and 0.85.

One concern with vulnerability report based models is ensuring that the features of the model do not contain information about the GT that would be unavailable for vulnerabilities whose GT is unknown. For example, in S45 (see [132]), the authors note that when training data is labeled with information such as exploits in ExploitDB, and references that may include a link to the label data (e.g., an exploit from ExploitDB) are used as features for the LM, the features will inherently contain the predicted variable. Using a wide range of features, they found that reference-based features only improved a model if those references included the exploits used for GT.
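
One way to guard against this kind of label leakage is to drop any reference that points at the source used for labeling before the references are turned into features. The sketch below is an assumption-laden illustration, not the procedure used in S45: the function name, the domain set, and the example URLs are hypothetical.

```python
from urllib.parse import urlparse

# Assumed label source: references hosted here would reveal the GT label.
GT_DOMAINS = {"exploit-db.com", "www.exploit-db.com"}

def non_leaking_references(urls, gt_domains=GT_DOMAINS):
    """Keep only references whose host is not a ground-truth source."""
    return [u for u in urls if urlparse(u).hostname not in gt_domains]
```

Comparing model performance with and without this filter gives a rough sense of how much of a reported score comes from features that contain the predicted variable.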

8.2.3 Social Network.

The use of social network features was first proposed by Sabottke et al. [104] in S19, who build features based on X (formerly known as Twitter) posts that mention vulnerabilities, specifically CVEs. In addition to applying NLP to the content of the posts themselves, the authors of S19 used the following features: number of tweets about the vulnerability, # users tweeting about the vulnerability with minimum T followers, # users tweeting about the vulnerability with minimum T friends, # retweets/replies, # replies, # tweets favorited, Avg # hashtags mentions per tweet, Avg # URLs mentions per tweet, Avg # user mentions per tweet, # verified accounts tweeting about the vulnerability, Avg age of accounts tweeting about the vulnerability, and Avg # of tweets per account. In S19, the authors found that incorporating text and social network features from X may improve the precision of LMs for likelihood of exploitation.
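
Several of the S19 features are simple aggregates over the posts that mention a CVE. The following sketch shows how a few of them could be computed. The `Tweet` record, its field names, and the feature keys are hypothetical stand-ins chosen for illustration and do not reflect the data schema used by Sabottke et al. [104].

```python
from dataclasses import dataclass

@dataclass
class Tweet:
    # Hypothetical record for one post mentioning a given CVE.
    user: str
    followers: int
    verified: bool
    is_retweet: bool
    hashtags: int
    urls: int

def social_features(tweets, min_followers):
    """Aggregate a handful of per-vulnerability social features."""
    n = len(tweets)
    popular = {t.user for t in tweets if t.followers >= min_followers}
    verified = {t.user for t in tweets if t.verified}
    return {
        "n_tweets": n,
        "n_users_min_followers": len(popular),
        "n_retweets": sum(t.is_retweet for t in tweets),
        "n_verified_users": len(verified),
        # Averages are reported as 0.0 when there are no tweets.
        "avg_hashtags": sum(t.hashtags for t in tweets) / n if n else 0.0,
        "avg_urls": sum(t.urls for t in tweets) / n if n else 0.0,
    }
```

Here `min_followers` plays the role of the threshold T in the feature list above.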

In S32 (see [15]), the authors compare whether a model built from text features extracted from X, as done in S19, performs better than a model using only text features from the NVD and CVSS score based features. In contrast with the prior work in S19, the authors of S32 find that the X information does not consistently improve the model in terms of precision and recall, and its value is particularly questionable in scenarios where additional infrastructure will need to be established to extract information from X. S55 similarly did not find X features to be good predictors and excluded them from their primary model. The different precision and recall findings of S32 and S19 also highlight a problem found in several studies, including S19’s attempts to compare their work with S03: the replicability of results from exploitability LMs is low.

In S26, the authors examine how “darkweb” data from TOR network sites can be used to predict vulnerabilities using a dataset known as D2Web. As part of this work, they also examine the social network between users within the D2Web data, using measures such as the in-degree and out-degree, within the D2Web social network, of users who mention a vulnerability.

8.2.4 Exploit Data.

In comparison with the other studies that use exploit information as FT, in S55, Suciu et al. [116] propose a novel model primarily based on exploit information. In S55, the model is designed to predict when a functional exploit will be developed (i.e., an exploit that can achieve an attacker’s goals). Their model extracts FT from exploits that have been already released but which may only cause unexpected behavior, such as a crash. The exploits used as FT are referred to as PoC exploits, which are extracted from ExploitDB, BugTraq, and Vulners databases. The FT in S55 include the programming language of the PoC, # of language-reserved keywords used in the PoC, measures of code size and complexity of the PoC, n-grams (e.g., words) extracted from the PoC, and n-grams extracted from text and comments in the PoC. The authors found that models based on combined feature sets with PoC exploit features, language n-grams from vulnerability write-ups, CVSS scores, product/vendor information, and CWE types performed better in terms of AUC than using the PoC exploit information FT alone or using the other FT without exploit information.
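
The PoC-derived features described above (reserved keyword counts, code size, and n-grams) can be illustrated with a short sketch. It assumes the PoC is written in Python so that the standard `keyword` module can stand in for a reserved word list, and it uses a simple identifier regex as the tokenizer. Both choices, and the function and key names, are illustrative assumptions rather than the feature extractor used in S55.

```python
import keyword
import re
from collections import Counter

def poc_features(source, n=2):
    """Extract simple size, keyword, and n-gram features from PoC code."""
    lines = [ln for ln in source.splitlines() if ln.strip()]
    # Identifier-like tokens only; numbers and punctuation are ignored.
    tokens = re.findall(r"[A-Za-z_][A-Za-z_0-9]*", source)
    ngrams = Counter(zip(*(tokens[i:] for i in range(n))))
    return {
        "n_lines": len(lines),
        "n_tokens": len(tokens),
        "n_reserved": sum(1 for t in tokens if keyword.iskeyword(t)),
        "ngrams": ngrams,
    }
```

A PoC in another language would need that language's reserved word list and, ideally, a real lexer instead of the regex.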

8.2.5 Timeline.

In three studies, the FT include features based on the time of the publication for vulnerability information, and other available information on vulnerability and exploit timelines. Most of these features are based on the timeline FT used in S03 (see [14]). In S03, Bozorgi et al. [14] use FT that include key dates extracted from the OSVDB and CVE, the difference between the last modified date of the CVE documentation and the date the CVE was created, and the difference between the last modified date and the date the vulnerability was published on OSVDB. These last two features, the differences between the last modified date and the creation and disclosure dates, were among the top 10 weighted features and were therefore also used by Sabottke et al. [104] in S19 to evaluate how much different types of features may contribute to an LM for exploitability. Similarly, in S50, Zhang and Li [141] extracted features from the Vulners database, including the difference between the modified date and the original published date for the vulnerability, and the difference between the last seen date and the original published date. EPSS (see [40, 62]) also uses a timeline measure, the number of days since the CVE was published, which is in the top 30 contributing features for EPSS [40].
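
Timeline features of this kind reduce to differences between dates. The sketch below shows the arithmetic with Python's `datetime.date`. The function name, parameter names, and feature keys are hypothetical, and expressing each difference in whole days is an assumption made for illustration.

```python
from datetime import date

def timeline_features(created, published, last_modified, today):
    """Date-difference features, each expressed in whole days."""
    return {
        "modified_minus_created_days": (last_modified - created).days,
        "modified_minus_published_days": (last_modified - published).days,
        "days_since_published": (today - published).days,
    }
```

The first two keys loosely mirror the S03-style differences and the third mirrors the EPSS-style age measure described above.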

8.3 Modeling Technique(s) Examined

The final column of Table 5 highlights the types of modeling techniques examined in each study. As can be seen in the table, neural networks and large-language models, such as Bi-directional Long Short Term Memory, Gated Recurrent Unit, and the CodeBERT model, have been increasingly popular avenues of research since 2019. However, gaps remain in determining the efficacy and usability of these models in a real-world scenario.

9 Other Automated, Probabilistic Assessments (Non-LM)

We identified three studies, S02 (see [43]), S24 (see [99, 100]), and S22 (see [111]), which propose or evaluate exploitability assessments that rely on Probabilistic models but are not based on machine learning. In S02 (see [43]) and S24 (see [99, 100]), the authors use an equation based on a Pareto Distribution observed by Frei et al. [41]. In S02, the authors use the Frei equation to estimate part of the Temporal metrics from CVSS. S24 uses the Frei et al. equation to build Non-linear Statistical Models to estimate the probability of being exploited as a function of time.

In S22, however, the authors use a statistical survival model to determine appropriate rules to be used for vulnerability prioritization. While S22 uses similar techniques to other studies, such as the hazard models used to evaluate CVSS in S47 (see [103]), their approach to determining exploitability is different from other studies, placing S22 in a unique sub-category within the “Other Probabilistic” model sub-category.

9.1 Exploit Availability Distribution Function from Frei et al.

In 2006, Frei et al. [41] examined vulnerability and exploit information based on vulnerabilities published in the OSVDB and NVD from 1996 to 2006. Their analysis focused on four points in the vulnerability lifecycle: the time of vulnerability Discovery, the time of vulnerability Disclosure, the (earliest) time of Exploit availability for the vulnerability, and the time of Patch availability. The timeline for each vulnerability was extracted from a number of sources, including three public organizations (the CERT organization at Carnegie Mellon, the French Security Incident Response Team, and the milw0rm hacktivist group) as well as four companies that provide security-related software: ISS (Internet Security Systems) X-Force (now IBM X-Force), Secunia, Symantec SecurityFocus, Packetstorm, and Metasploit [41]. As part of their analysis, they identified a function matching the distribution of Exploit Availability in terms of time t since Disclosure:

F(t) = 1 - \left(\frac{0.0016}{t}\right)^{0.260}.

The Exploit Availability function determined by Frei et al. was used by two studies in our survey, S02 (see [43]) and S24 (see [99, 100]) to estimate exploitability.
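
As a worked illustration, the function can be evaluated directly. The sketch below assumes t is measured in days since Disclosure, which this section does not state, and treats values of t below the scale constant 0.0016 as having probability 0, following the usual support of a Pareto distribution. The function and parameter names are hypothetical.

```python
def exploit_availability(t, scale=0.0016, shape=0.260):
    """Evaluate F(t) = 1 - (scale / t) ** shape.

    Assumptions: t is in days since disclosure, and F(t) = 0 for
    t below the scale constant (the support of a Pareto distribution).
    """
    if t < scale:
        return 0.0
    return 1.0 - (scale / t) ** shape
```

Under these assumptions the function is non-decreasing in t: it gives roughly 0.81 at t = 1 and roughly 0.92 at t = 30.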

As discussed in Section 3, the NVD does not provide estimates of the Temporal or Environmental scores. In S02, Frühwirth and Mannisto [43] analyze the difference in CVSS v2 metrics when they use probability distributions to approximate the Temporal and Environmental metrics. For the “Exploit Code Maturity Metric,” the primary exploitability-specific Temporal or Environmental metric in CVSS v2, the authors use the equation from Frei et al. [41]. In S02 (see [43]), the authors found that on a set of 720 vulnerabilities recorded between January 5 and March 20, 2009, the CVSS score computed using their probability-based system for the Environmental and Temporal metrics resulted in differences ranging from a 2.5 point decrease to a 2.5 point increase from the score computed using default values, with 60% of scores having a 0.5% to 1% decrease from the default values. Overall, this increased the number of “Low” severity vulnerabilities, which the authors argue would result in cost savings, assuming that lower-severity vulnerabilities are less expensive to fix [43].

Similarly, in S24 (see [99, 100]), the authors use the same model fit by Frei et al. to estimate an “exploitability” score they describe as the likelihood that a vulnerability will be exploited before being patched or disclosed. The authors combine this exploitability score with the overall CVSS severity score and an estimate of the likelihood that a vulnerability will be patched, also based on Frei’s work, to develop non-linear statistical models to estimate the probability of the exploitation of vulnerability as a function of time. This model is intended to improve on their previous model (see [99]), which used the CVSS Base Exploitability metric as part of their process. The authors of S24 refer to their overall models of the probability of exploitation as “exploitability” models [100]. In S24, the authors evaluate their risk models using R^{2} and Residual analysis. R^{2} values can range between 0 and 1, where 1 indicates that the targeted phenomenon (e.g., risk) can be perfectly explained by the predictors [30]. The adjusted R^{2} value is used to account for larger numbers of predictor variables in a model. The adjusted R^{2} for the initial model in S24 was 0.85 [99], whereas the most advanced model had an adjusted R^{2} of 0.96 [99].
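
For readers unfamiliar with these statistics, the following sketch computes R^{2} and the adjusted R^{2} using their standard textbook definitions. It is not taken from S24. The function names are hypothetical, and the adjustment assumes an ordinary regression setting with n observations and p predictors.

```python
def r_squared(observed, predicted):
    """R^2 = 1 - SS_res / SS_tot."""
    mean = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

def adjusted_r_squared(r2, n_samples, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)
```

Because the penalty factor (n - 1) / (n - p - 1) grows with the number of predictors, adding a predictor raises the adjusted value only if it improves the fit enough to offset the penalty.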

9.2 Developing Prioritization Rules Based on Exploit Likelihood Analysis

In S22 (see [111]), the authors use a Cox Proportional Hazard survival model to examine how vulnerability characteristics, such as Severity, relate to the amount of time between when a vulnerability is discovered and when an exploit is published. The authors found that Disclosure Status (whether the vulnerability had been disclosed at the time of discovery) was the strongest predictor of the length of time between when a vulnerability is discovered and the exploit is published, followed by severity. The authors propose a set of rules based on the results of their analysis, such as “Immediately disclosed, highly severe, remote vulnerability targeted at open-source, infrastructure software” should be a top priority for patching [111]. The authors suggest that managers could use a similar modeling and rule-development process to prioritize patching efforts.

10 Limitations

Given the volume of exploitability research, some published studies may not appear in this survey. However, although this is not a formal Systematic Literature Review, we used the SYMBALS methodology to reduce the likelihood that we would miss key papers.

We also may have introduced bias in our selection of papers. However, as noted in Section 4, we had at least two researchers involved in our paper selection process to reduce the risk of introducing bias from a single individual.

In Table 5, we use the range of values reported for Time to Run as a comparable statistic. We recognize that Range is not the most descriptive statistic [30]. However, Range was the statistic we could extract most consistently across studies for Time to Run and provides some estimation of variability [30] as well as the known “worst-case scenario” for runtimes. We use footnotes in Table 5 and the explanation in Section 7.1.4 to provide additional information on studies where the minimum and maximum are particularly unrepresentative, such as S06.

Finally, a potential limitation of the study is that the characteristics and categorization in our study may be unreliable or incomplete. We do not claim that our categorization or the characteristics examined in this study are the only possible categories and characteristics of note in exploitability research. However, we base our categorization on prior work to improve its reliability. One of the areas where our categorization may require alteration based on future research is in the Manual CVSS category. We found that several properties of CVSS, such as the accuracy of manual exploitability assessment techniques, had only been evaluated in the context of the original CVSS specification. Studies comparing CVSS with other systems use the CVSS Exploitability scores from the NVD as a whole, and there is limited analysis examining individual properties of the CVSS system. This is not a new problem, as evidenced by other surveys such as that of Pendleton et al. [91], where CVSS was given its own category, and that of Le et al. [76], where CVSS-based outputs dominate several sub-categories. In the future, as more research examines the individual properties of the methods for producing CVSS scores, this category should be broken down further. In our survey, for studies that use different methods for assessing CVSS scores, such as the automated method in S43 (see [144]), we categorize these studies based on their methodology rather than “Manual CVSS.”

11 Discussion

We begin our discussion in Section 11.1 examining high-level trends in exploitability assessment research. In Section 11.2, we examine how the categories of exploitability assessment techniques might be helpful to practitioners, depending on the values and context in which the techniques are being deployed. Finally, in Section 11.3, we focus on gaps and future research opportunities.

11.1 Temporal Trends

In this section, we examine temporal trends and cross-cutting factors in exploitability assessment systems. Since Table 4 is ordered by publication year, it is a useful reference for observing and discussing temporal trends.

CVSS—Less Novel, More Standard.

As can be seen in Table 4, earlier research (e.g., research prior to 2015) tended to focus on evaluating and proposing CVSS-based models. Later works such as S43, S55, and S56 seem to have accepted CVSS as a baseline against which to compare other exploitability assessments.

Rise of the (Learning) Machines.

Another trend, seen in Table 4, is that while the volume of CVSS-based evaluations has declined, the volume of LM research has increased. This trend is further reflected in the number of recent surveys focused exclusively on LM, as shown in Section 2.2.

Increased Focus on Program Analysis.

As noted in Section 8, within the Probabilistic LM models, we see an increasing trend toward using program analysis based features. This may be due to improvements in program analysis techniques, which have also spurred innovation in Deterministic assessments, leading to deterministic program analysis techniques such as S21 being used in industry settings [97].

11.2 Aligning with Vulnerability Prioritization Values from Industry

As described by de Smale et al. [28], practitioners make tradeoffs in determining what vulnerabilities to prioritize or even what information to collect for vulnerability management. Which methods and tools are useful for a particular organization will depend on the tradeoffs it makes. We frame our discussion of the different exploitability assessment techniques using the three sets of value tradeoffs that de Smale et al. identified through interviews with practitioners. In this discussion, we are not proposing that one technique should be used instead of another. We use de Smale et al.'s sets of opposing values to highlight how some categories of technique may align with specific values.

Independent analysis vs. Trust. The first set of values discussed by de Smale et al. [28] is how much organizations trust information provided by others, compared with the time and effort required to perform an in-depth, independent analysis of the vulnerability in the organization's context. CVSS scores provided by a third party, such as the NVD, require a high amount of trust. As noted by Ponta et al. [97] in their "Lessons Learned" as part of transitioning S21 into industry practice, developers may be particularly skeptical of metadata-based assessments, and as shown in Table 6, many of the proposed LM approaches for exploitability assessment are based on metadata from vulnerability reports. However, if an organization has sufficient data from its own sources, it may be able to build and use a more independent LM. Using Deterministic tools may provide more independent information for that organization's context but does require Time to Run. Similarly, evaluating CVSS manually requires expertise and time but can be done independently, with the Environmental metrics of CVSS specifically designed to be tailored to a particular context.

Proactive vs. Reactive. Smaller organizations tend to take a primarily reactive approach [28], often relying on information pushed to them by government organizations—and may therefore be more likely to rely heavily on CVSS scores from the NVD. Gathering sufficient information to leverage an LM model requires a highly proactive approach, whereas Deterministic methods lie somewhere in between. Having LM scores publicly available, such as with EPSS, may facilitate the use of LM results in a reactive context.

Formalized vs. Ad-Hoc Processes. Probabilistic approaches, which require a collection of vulnerabilities to perform statistical analysis, may be challenging to incorporate into ad-hoc processes. As with Proactive vs. Reactive decisions, the availability of a pre-calculated EPSS (LM) score may mitigate the need for a formalized process but also reduces the ability to tailor the score to the organization's context. CVSS-based scores can be more readily calculated on an ad-hoc basis. Similarly, the Program State based Deterministic assessments may be more readily incorporated into ad-hoc processes, although some work may be required to set up, configure, and run the program analysis tools. As discussed in Section 7.1.4, given the time required to run many of the Deterministic Program State based assessments, these approaches may be less useful if an organization is attempting to analyze all vulnerabilities at once on a regular basis as part of a formalized process such as Continuous Integration.
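To make the tradeoff above more concrete, the following minimal sketch shows one way a pre-calculated probabilistic score could be consumed in an ad-hoc or reactive setting, without the organization training its own model. The vulnerability identifiers, the scores, the `prioritize` helper, and the 0.1 threshold are all hypothetical values chosen for illustration; they are not taken from EPSS, the NVD, or any surveyed study, and the ordering rule is only one of many an organization might choose.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vulnerability:
    vuln_id: str                # hypothetical identifier
    cvss_base: float            # third-party CVSS base score, 0.0 to 10.0
    exploit_probability: float  # pre-calculated probabilistic score in [0, 1]


def prioritize(vulns, probability_threshold=0.1):
    """Order vulnerabilities for triage.

    Entries at or above the (assumed) probability threshold come first,
    highest probability first; the remainder follow, highest CVSS first.
    """
    likely = [v for v in vulns if v.exploit_probability >= probability_threshold]
    rest = [v for v in vulns if v.exploit_probability < probability_threshold]
    likely.sort(key=lambda v: v.exploit_probability, reverse=True)
    rest.sort(key=lambda v: v.cvss_base, reverse=True)
    return likely + rest


# Made-up backlog used only to exercise the function.
backlog = [
    Vulnerability("VULN-A", 9.8, 0.02),
    Vulnerability("VULN-B", 6.5, 0.45),
    Vulnerability("VULN-C", 7.5, 0.01),
    Vulnerability("VULN-D", 5.3, 0.12),
]

ordered_ids = [v.vuln_id for v in prioritize(backlog)]
print(ordered_ids)  # ['VULN-B', 'VULN-D', 'VULN-A', 'VULN-C']
```

Because both scores are supplied by third parties in this sketch, nothing in it is tailored to the organization's context, which is the limitation noted above for pre-calculated scores.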

11.3 Research Gaps and Future Directions

We found studies that used other assessment methods as a control group in their experiments (i.e., to demonstrate that their proposed technique is an improvement). However, we found relatively little research on how assessment methods might work well together. One of the closest studies in this regard may be S22 (see [111]) discussed in Section 9.2, in which the authors propose building more deterministic rules for vulnerability prioritization based on the results of a statistical model. However, even in S22, the study focuses on the relationship between factors within the statistical model rather than evaluating the combination of techniques. The length of time and resources to run many of the Deterministic, Program State based models discussed in Section 7.1 may make it difficult for such tools to be used at a scale that could contribute to developing an LM. Future research may focus on the assessment techniques that could be used to better determine when to deploy Program State based models.

The usability of exploitability assessment techniques is under-studied. Studies S34 (see [3]) and S44 (see [4]) examine user concerns when applying CVSS. For Deterministic methods, S57 determined that there were misconceptions among experts about the relative merits and drawbacks of different features in EG tools for heap-based vulnerabilities, and S21 (see [60, 95, 96, 97]) includes a summary of lessons learned from transitioning their Deterministic tool into practical use at SAP. However, we could not find studies examining usability-related aspects of the Probabilistic models, nor formal usability studies for many of the Deterministic models.

Finally, while work such as the reliability analysis of exploit stabilization in EG tools for heap overflow vulnerabilities in S57 (see [140]) and larger-scale analyses in studies such as S21 (see [60, 95, 96, 97]) begin to look at effectiveness using more systematic, scalable evaluations, most studies examining Program State based programs focus on smaller vulnerability datasets. While smaller datasets may be understandable due to the complications that arise from the state-space explosion, their generalizability is less known. To build datasets to provide accurate information such as general precision and recall [97], it may be beneficial to perform additional research into what practitioners mean by “exploitability” and what they expect from exploitability assessment.
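As a reminder of what precision and recall would measure for an exploitability assessment, the sketch below computes both from a set of vulnerabilities an assessment flags as exploitable and a set treated as ground truth. The identifiers and set contents are invented for illustration, and what counts as ground truth (for example, an available exploit versus observed exploitation) is exactly the kind of assumption that depends on what practitioners mean by "exploitability".

```python
def precision_recall(flagged, ground_truth):
    """Return (precision, recall) for a set of flagged vulnerability ids.

    flagged: ids the assessment labels as exploitable.
    ground_truth: ids treated as actually exploitable (an assumption
    that depends on the chosen definition of exploitability).
    """
    true_positives = len(flagged & ground_truth)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    return precision, recall


# Hypothetical labels, not drawn from any dataset in the survey.
flagged = {"V1", "V2", "V3", "V4"}
ground_truth = {"V2", "V4", "V5"}

precision, recall = precision_recall(flagged, ground_truth)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.50 recall=0.67
```

With a small dataset, a single mislabeled vulnerability shifts both values substantially, which is one reason the generalizability of results on smaller datasets is hard to judge.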

12 Conclusion

We surveyed 59 studies and two standards covering 76 papers proposing or evaluating exploitability assessment methods for software vulnerabilities. These exploitability assessment methods can be divided into three groups: CVSS-based, Deterministic, and Probabilistic assessments using a similar structure to the metric taxonomy proposed by Pendleton et al. [91]. Deterministic, State-based assessments and Probabilistic LM assessments are the most prominent sub-categories, with more than 20 studies of assessments in each of the State-based and LM categories.

Over the years, exploitability assessment has evolved from a theoretical analysis to multiple practical implementations. We have, as a community, answered the question How can we assess the exploitability of software vulnerabilities? What remains only partially explored, however, is How should we assess the exploitability of software vulnerabilities? Are some assessment methods easier to integrate with existing practices? What are the long-term benefits of using different methods? Many such questions remain unanswered or under-explored.

Acknowledgments

We thank all members of the Realsearch research group for their valuable feedback.

Footnotes

¹ https://nvd.nist.gov/general/visualizations/vulnerability-visualizations/cvss-severity-distribution-over-time

² https://www.oracle.com/security-alerts/cvssscoringsystem.html

³ A robots.txt file indicates which pages the website maintainers intend to allow a crawler to visit [69].

⁴ https://scholar.google.com/robots.txt

⁵ Values are estimated based on a chart from the work. Numeric results were not provided outside the bar chart. The preliminary evaluation motivated the main model proposed in the work, which we discuss in Section 8.

⁶ https://www.rapid7.com/products/metasploit/

⁷ https://www.immunityinc.com/products/canvas/

⁸ https://www.d2sec.com/elliot.html

⁹

¹⁰

¹¹ The Temporal score is not available from the NVD or many other public databases [116].

References

[1] Abeer Alhuzali, Birhanu Eshete, Rigel Gjomemo, and V. N. Venkatakrishnan. 2016. Chainsaw: Chained automated workflow-based exploit generation. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (CCS’16). ACM, New York, NY, USA, 12.
