We found three primary groups of evaluations of CVSS scores collected from industry organizations such as the NVD. The first group, which we discuss in Section 6.2.1, looks at the reliability of CVSS; in other words, how consistently a particular vulnerability is scored using CVSS. The second group, discussed in Section 6.2.2, examines the relationship between CVSS and other exploit-related indicators, such as the presence of an exploit in ExploitDB. Finally, in Section 6.2.3, we discuss four studies that examine the distribution of CVSS scores in the NVD, for example, the percentage of scores that are classified with “Low,” “Medium,” and “High” severity or exploitability.
6.2.1 Reliability of CVSS Scores.
S15 (see [56]) and S39 (see [66]) looked at the reliability of the CVSS scores provided in the NVD. In S15, the reliability analysis focuses on the overall severity score, with additional analysis of information that should be added or removed as it relates to each of the sub-scores (AV, AC, and AU). We discuss the changes to the underlying CVSS system proposed in S15 in Section 6.3.2. Here, the reliability analysis of S15 may be used to triangulate the analysis from S39, which examines exploitability-specific elements of CVSS in its reliability analysis. Both studies focus on CVSS v2. Neither study was able to statistically disprove the reliability of the CVSS scores from the NVD.
In S15 (see [56]), the authors survey 304 security experts from industry and academia to evaluate the overall reliability of the CVSS v2 severity scores in the NVD, as well as the sub-metrics and structure of the CVSS framework. To understand the accuracy of the CVSS scores in the CVE list, each respondent was asked to provide a severity score for 10 vulnerabilities from the NVD. Of the 10 vulnerabilities presented to each respondent, 3 were the same for all respondents so that the authors could estimate consensus between experts, whereas 7 were selected randomly from vulnerabilities in the CVE list. A total of 2,131 unique vulnerabilities were assessed across the 304 experts. The experts were provided with the description of each vulnerability, as well as the Exploitability values for AV, AC, and AU (with a brief explanation of each attribute) and Impact values for C, I, and A. The survey did not include the equation for calculating the Exploitability, Impact, or Base severity scores. Instead, the survey requested that the experts provide their own value between 1 and 10. The authors note that 38% of survey answers differed from the score provided by NVD, and claim “This is certainly a higher figure than many users of the scoring system would be comfortable with” [56]. However, it is not clear whether this discrepancy is specific to the CVSS, to the scores in the NVD, or to expertise-based systems generally. The authors found no statistically significant difference between the scores provided by experts and the overall severity score from the NVD.
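For context on what the S15 respondents were not shown, the CVSS v2 Base equations can be sketched as below. The weights and constants come from the public CVSS v2 specification rather than from S15 itself, and the dictionary and function names are illustrative choices.

```python
# Sketch of the CVSS v2 Base equations (constants from the CVSS v2 specification).
# Dictionary and function names are illustrative and do not come from S15.

AV = {"L": 0.395, "A": 0.646, "N": 1.0}    # Access Vector: Local, Adjacent, Network
AC = {"H": 0.35, "M": 0.61, "L": 0.71}     # Access Complexity: High, Medium, Low
AU = {"M": 0.45, "S": 0.56, "N": 0.704}    # Authentication: Multiple, Single, None
CIA = {"N": 0.0, "P": 0.275, "C": 0.660}   # C, I, A impact: None, Partial, Complete


def cvss2_base(av, ac, au, c, i, a):
    """Return (base, impact, exploitability), each rounded to one decimal."""
    exploitability = 20 * AV[av] * AC[ac] * AU[au]
    impact = 10.41 * (1 - (1 - CIA[c]) * (1 - CIA[i]) * (1 - CIA[a]))
    f_impact = 0.0 if impact == 0 else 1.176
    base = round((0.6 * impact + 0.4 * exploitability - 1.5) * f_impact, 1)
    return base, round(impact, 1), round(exploitability, 1)


# Vector AV:N/AC:L/Au:N/C:P/I:P/A:P
print(cvss2_base("N", "L", "N", "P", "P", "P"))  # (7.5, 6.4, 10.0)
```

Because the Base score is a deterministic function of the six sub-metrics, an expert who is given the sub-metric values but not the equation is in effect being asked to estimate this function by judgment.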
In S39, the authors expand on the work in S15 by examining CVSS scores from five databases: NVD, the proprietary IBM X-Force Exchange database, OSVDB, the Vulnerability Notes database from the CERT group at Carnegie Mellon (CERT-VN), and the alert database provided by Cisco for its products (Cisco). For the OSVDB, CVSS scores credited to a variety of sources are included, but the authors exclude scores credited to the NVD to reduce potential bias. The authors acknowledge that OSVDB, CERT-VN, and Cisco all indicate that their scores may be influenced by information from other sources, such as the NVD, and point to the differences between the scores in each database as evidence of independent scoring. The authors use Bayesian analysis to develop a ground truth for each CVSS sub-metric (AV, AC, AU, C, I, and A) based on the CVSS scores from the five databases. Based on this ground truth, the authors found the NVD to be the most accurate across all metrics (93% on average) and for the Exploitability metrics specifically (AV: 99%, AC: 88%, AU: 99%). The authors noted that the greatest disagreement occurred in the Access Complexity (AC) exploitability score, particularly for lower AC scores, and that Exploitability sub-metrics generally had higher disagreement than Impact sub-metrics.
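The idea of scoring each database against a consensus can be illustrated with a much simpler stand-in for the Bayesian model in S39: take the most common value across databases as the reference for each vulnerability, then measure how often each database agrees with it. The majority vote, the toy data, and all names below are assumptions made for illustration; the model used in S39 is not reproduced here.

```python
from collections import Counter


def consensus_agreement(scores):
    """scores: {vuln_id: {database: metric_value}} for one sub-metric (e.g., AC).

    Returns {database: fraction of its scored vulnerabilities that match the
    majority value}. Vulnerabilities with a tied vote are skipped.
    """
    hits, totals = Counter(), Counter()
    for per_db in scores.values():
        ranked = Counter(per_db.values()).most_common(2)
        if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
            continue  # no unique majority value, so no reference to compare with
        reference = ranked[0][0]
        for db, value in per_db.items():
            totals[db] += 1
            hits[db] += value == reference
    return {db: hits[db] / totals[db] for db in totals}


# Toy Access Complexity values (hypothetical identifiers and scores).
toy = {
    "vuln-1": {"NVD": "L", "X-Force": "L", "Cisco": "M"},
    "vuln-2": {"NVD": "M", "X-Force": "M", "Cisco": "M"},
    "vuln-3": {"NVD": "H", "X-Force": "L"},  # tie, skipped
}
print(consensus_agreement(toy))
```

A majority vote weights every database equally, which is one reason a probabilistic model that estimates per-database error rates may be preferred when the sources are not equally careful.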
6.2.2 Exploitability Scores from the NVD Compared to Publicly Available Exploit-Based Datasets.
Statistical analyses of the CVSS scores from the NVD in relation to other exploit information in S08, S03, S17, and S55 have suggested that the CVSS Exploitability score is not strongly connected with the existence of exploits in exploit databases or the likelihood of exploitation. However, the statistical analysis in S47 (see [103]) suggests that the relationship between exploits in exploit databases and the individual CVSS metrics (AV, AC, AU, C, I, A) may be stronger, particularly when factors such as the company maintaining the software (e.g., Microsoft) are controlled for.
In S08, Allodi and Massacci [5, 6] examine the relationship between CVSS v2 scores and three indicators: whether a vulnerability is associated with an exploit in ExploitDB, whether it is associated with an exploit from a commercial exploit kit (EKITS), and whether it is associated with an exploit signature in Symantec Intrusion Detection and Anti-Malware products. The exploit signatures from Symantec (SYM) were used as the “ground truth.” The commercial exploit kit information in S08 includes automated (i.e., code script) exploits extracted from malicious websites. Allodi and Massacci [5, 6] found that the CVSS v2 scores in NVD showed little variability and that only a few of the possible values for the different sub-scores were used. The authors note, “The CVSS Base score alone is a poor risk factor from a statistical perspective.” However, the authors also found that when CVSS scores were considered together with other factors, such as an exploit in ExploitDB or EKITS, the relationship with the SYM data (i.e., risk of exploitation) improved.
Similarly, in S03, Bozorgi et al. [14] use the CVSS Exploitability scores from the NVD as the control against which they compare their own AUTO-LM system, which we discuss in Section 8. The labels for the AUTO-LM system were the exploit availability labels from OSVDB [14]. The authors compare the distribution of the CVSS Exploitability score provided by the NVD against the signed distance to the maximum margin hyperplane separating positive and negative examples in their SVM model (i.e., their LM). The authors illustrate with histograms how their score produces a clearer distinction between vulnerabilities that the OSVDB indicates have an exploit and vulnerabilities that do not.
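The quantity compared against the CVSS Exploitability score can be sketched for the linear case: given a weight vector w and bias b, the signed distance of a feature vector x to the separating hyperplane is (w·x + b)/‖w‖, with the sign indicating the predicted class. The sketch assumes a linear SVM, and the weights and points are made-up values; the features and trained model from S03 are not reproduced.

```python
import math


def signed_distance(w, b, x):
    """Signed distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


w, b = [3.0, 4.0], -5.0                     # hypothetical trained weights and bias
print(signed_distance(w, b, [3.0, 4.0]))    # 4.0, on the positive side
print(signed_distance(w, b, [0.0, 0.0]))    # -1.0, on the negative side
```

Unlike a score drawn from a small set of discrete values, this distance is continuous, which is what makes a histogram of it per class informative.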
Converting CVSS scores from the NVD into a binary exploitability score has also yielded low precision and recall in relation to the existence of exploits and exploit signatures from public databases. In S17, Younis and Malaiya [136] compare CVSS v2 against the Microsoft Rating System (MSRS) using exploits from ExploitDB as ground truth. The MSRS was a predecessor to the current Microsoft Exploitability Index [84] and Microsoft severity score [85], which Microsoft provides when disclosing vulnerabilities in its products to help users prioritize security patches. In S17, the authors use the median CVSS score of 8.6 as the cutoff for the confusion matrix of exploitable and not-exploitable vulnerabilities. For the MSRS, the authors used the median value of 1 as the threshold: vulnerabilities with an MSRS of 1 were considered exploitable, whereas vulnerabilities with a rating of 2 or 3 were considered not exploitable. Using this approach, the authors determined that CVSS had a precision of 7% and recall of 97% for Internet Explorer, and a precision of 20% and recall of 65% for Windows 7. The MSRS threshold resulted in a precision of 7% and a recall of 85% for Internet Explorer, and a precision of 15% and a recall of 83% for Windows 7. The authors argue that the low precision and recall indicate that CVSS and MSRS are not good indicators of exploitability and that new metrics are needed.
Similarly, a preliminary comparison of CVSS v3 with proprietary measures in S55 (see [116]) examined the precision and recall of the Base Exploitability score, setting the threshold at each possible value from 0 to 3. In other words, the authors examined the precision and recall if a vulnerability with a Base Exploitability score of 0 or higher was considered “exploitable,” then if a vulnerability with a Base Exploitability score of 1 or higher was considered “exploitable,” and so on. For the ground truth, the authors use a combined dataset of exploit signatures from Symantec products; information extracted from Bugtraq, Tenable, Skybox, and AlienVault OTX vulnerability databases; and exploits extracted from the Contagio dataset, a publicly available list of exploit kits and malicious websites used in academic studies [5, 6, 71, 74, 93, 143]. Using a Base Exploitability threshold of 0 or 1 had approximately 85% recall, whereas using a threshold of 3 resulted in less than 20% recall. Precision values were below 20% for all thresholds. The authors of S17 and S55 [116] both use their evaluation of CVSS to argue that better exploitability measures are needed.
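The thresholding procedure used in S17 and S55 can be sketched as follows: a vulnerability is predicted “exploitable” when its score meets the threshold, and the predictions are compared with a ground-truth exploit label. The function name and the toy scores and labels are hypothetical and are not data from either study.

```python
def precision_recall(scores, exploited, threshold):
    """Treat score >= threshold as a predicted exploit; compare with ground truth."""
    pairs = list(zip(scores, exploited))
    tp = sum(1 for s, e in pairs if s >= threshold and e)       # predicted and exploited
    fp = sum(1 for s, e in pairs if s >= threshold and not e)   # predicted, not exploited
    fn = sum(1 for s, e in pairs if s < threshold and e)        # missed exploits
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy Exploitability scores and exploit labels (hypothetical).
scores = [0.5, 1.2, 2.8, 3.0, 1.9, 0.2]
exploited = [False, True, False, True, False, False]

for threshold in range(4):  # sweep thresholds 0, 1, 2, 3
    p, r = precision_recall(scores, exploited, threshold)
    print(f"threshold >= {threshold}: precision={p:.2f} recall={r:.2f}")
```

Raising the threshold can only shrink the set of vulnerabilities flagged as exploitable, so recall never increases as the threshold rises; precision may move in either direction depending on the data.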
In contrast to the analyses examining the CVSS Base Exploitability score as a whole, statistical analysis by Roumani and Nwankpa [103] in S47 found that the CVSS v2 Exploitability (AV, AC, AU) and Impact (C, I, A) sub-metrics from the Base score all had statistically significant relationships with the “hazard” that an exploit would be made available in ExploitDB. The authors controlled for the affected product type, number of affected software versions, number of past exploits, year of disclosure, size of the software vendor, and R&D budget of the software vendor. This suggests that the relationship between CVSS exploitability-related metrics and exploit availability may be more nuanced and requires further investigation.
6.2.3 Distribution of CVSS Scores in the NVD.
Another common critique of CVSS relates to the overall distribution of exploitability and severity values of the CVSS scores provided by the NVD. S04 (see [80]), S10 (see [115]), and S12 (see [81]) all critique CVSS scores in the NVD for being disproportionately “High,” which they attribute to problems with the CVSS calculation method. We discuss their proposed changes to the CVSS calculation method further in Section 6.3.1. In this section, we examine their criticisms and compare their analysis with the work by Gallon [45] in S05, which is less critical of the system and found a different distributional imbalance.
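The distributions discussed in this section come from binning Base scores into qualitative bands. A minimal sketch, assuming the NVD bands for CVSS v2 (Low 0.0 to 3.9, Medium 4.0 to 6.9, High 7.0 to 10.0), is shown below; the function names and toy scores are hypothetical and are not taken from any of the studies.

```python
from collections import Counter


def severity_v2(score):
    """Map a CVSS v2 Base score to the NVD qualitative band."""
    if score < 4.0:
        return "Low"
    if score < 7.0:
        return "Medium"
    return "High"


def severity_distribution(scores):
    """Return the percentage of scores in each band, rounded to one decimal."""
    counts = Counter(severity_v2(s) for s in scores)
    return {band: round(100 * counts[band] / len(scores), 1)
            for band in ("Low", "Medium", "High")}


toy_scores = [2.1, 4.3, 5.0, 6.8, 7.5, 9.3, 10.0, 5.0]  # hypothetical Base scores
print(severity_distribution(toy_scores))
# {'Low': 12.5, 'Medium': 50.0, 'High': 37.5}
```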
In S04, the authors argue, “In our opinion, the number of vulnerabilities with ‘Medium’ severity ranking should be the largest and the number of vulnerabilities with ‘High’ or ‘Low’ severity ranking [should be] much smaller” [80]. In S10, the authors argue that severity scores should have a more diverse range of values and should be more evenly distributed [115]. The authors of S12 also call for increased diversity: “The CVSS empirical values given by CVSS-SIG cannot distinguish software vulnerabilities that have identical scores but different severities” [81].
The vulnerabilities examined in S04, S10, and S12 overlap considerably and have similar distributions. In S04, the authors evaluate 34,093 CVE vulnerabilities published from 1999 to 2008, of which 6.8% had a “Low” CVSS severity, 47.8% had a “Medium” CVSS severity, and 45.5% had a “High” CVSS severity. In S10, the authors evaluate CVSS scores from 9,455 vulnerabilities in the NVD published between November 1, 2010, and October 31, 2012, of which 7.9% had “Low” severity, 53.2% had “Medium” severity, and 38.0% had “High” severity. Additionally, in S10, the authors analyze the distribution of all metrics of the CVSS score, including AV, AC, and AU, showing that more than 80% of the 9,455 vulnerabilities in their sample had an AV of “Network” (the highest value) and required no authentication (AU). However, AC was more evenly distributed. In S12, Luo et al. [81] examine 54,432 vulnerabilities published between 2002 and 2012. They point to correlations between the sub-metrics of the Base CVSS v2 score (AV, AC, AU, C, I, and A) from the NVD, as determined by a chi-squared test, as an indicator that the sub-metrics are not being scored independently.
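A chi-squared test of independence between two sub-metrics, of the kind referred to in S12, can be sketched as follows. The contingency table holds counts of vulnerabilities for each pair of values (for example, AV in rows and AU in columns); the counts below are made up, and the tables from S12 are not reproduced.

```python
def chi_squared(table):
    """Pearson chi-squared statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if the two metrics were independent.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return stat, dof


# Hypothetical counts: rows = AV (Local, Network), columns = AU (None, Single).
stat, dof = chi_squared([[10, 20], [30, 40]])
print(round(stat, 4), dof)
```

The statistic is compared with the chi-squared critical value for the given degrees of freedom (3.841 at the 0.05 level for one degree of freedom); a larger statistic is evidence against independence of the two sub-metrics.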
The findings in S04, S10, and S12 contrast with the findings by Gallon [45] in S05. The author examines 40,026 vulnerabilities in the NVD published between 1999 and 2009 and found a distribution with 45% of vulnerabilities having “Low” severity, 46% having “Medium” severity, and only 9% having “High” severity. The author also found that the diversity in the combinations of sub-metric values was relatively low, particularly for the Impact sub-metrics (C, I, A). Unlike the other studies, S05 does not inherently consider this a defect of the scores in the NVD or of the CVSS itself. The author then examines how the Environmental Impact metrics included in the CVSS framework may alter the scores, for example to improve diversity, finding that the Environmental Impact scores are more likely to decrease the overall severity score.