
Open access

A Systematic Literature Review on Automated Software Vulnerability Detection Using Machine Learning

Authors:

Zhen Ming (Jack) Jiang

Nachiappan Nagappan


ACM Computing Surveys, Volume 57, Issue 3

Article No.: 55, Pages 1 - 36

https://doi.org/10.1145/3699711

Published: 11 November 2024


Abstract

In recent years, numerous Machine Learning (ML) models, including Deep Learning (DL) and classic ML models, have been developed to detect software vulnerabilities. However, there is a notable lack of comprehensive and systematic surveys that summarize, classify, and analyze the applications of these ML models in software vulnerability detection. This absence may lead to critical research areas being overlooked or under-represented, resulting in a skewed understanding of the current state of the art in software vulnerability detection. To close this gap, we propose a comprehensive and systematic literature review that characterizes the different properties of ML-based software vulnerability detection systems using six major Research Questions (RQs).

Using a custom web scraper, our systematic approach extracts a set of studies from four widely used online digital libraries: ACM Digital Library, IEEE Xplore, ScienceDirect, and Google Scholar. We manually analyzed the extracted studies to filter out work unrelated to software vulnerability detection, then built taxonomies and answered the RQs. Our analysis indicates a significant upward trend in applying ML techniques to software vulnerability detection over the past few years. Prominent conference venues include the International Conference on Software Engineering (ICSE), the International Symposium on Software Reliability Engineering (ISSRE), the Mining Software Repositories (MSR) conference, and the ACM International Conference on the Foundations of Software Engineering (FSE), whereas Information and Software Technology (IST), Computers & Security (C&S), and the Journal of Systems and Software (JSS) are the leading journal venues.

Our results reveal that 39.1% of the subject studies use hybrid data sources, whereas 37.6% use benchmark data for software vulnerability detection. Code-based data are the most commonly used data type among the subject studies, with source code being the predominant subtype. Graph-based and token-based input representations are the most popular techniques, accounting for 57.2% and 24.6% of the subject studies, respectively. Among input embedding techniques, graph embedding and token vector embedding are the most frequently used, accounting for 32.6% and 29.7% of the subject studies, respectively. Additionally, 88.4% of the subject studies use DL models, with recurrent neural networks and graph neural networks being the most popular subcategories, whereas only 7.2% use classic ML models. Among the vulnerability types covered by the subject studies, CWE-119, CWE-20, and CWE-190 are the most frequent. In terms of tools, Keras (with the TensorFlow backend) and PyTorch are the most frequently used model-building libraries, each used in 42 studies, and Joern is the most popular tool for code representation, used in 24 studies.

Finally, we summarize the challenges and future directions in the context of software vulnerability detection, providing valuable insights for researchers and practitioners in the field.

1 Introduction

Automatic vulnerability identification is essential for ensuring software security [99]. Successes in Machine Learning (ML) have inspired considerable interest in using ML models to find vulnerabilities in general (traditional) software systems [145]. ML models excel at detecting subtle patterns and correlations in large datasets [6]. They can automatically extract important features from raw data such as source code and detect hidden patterns that may reveal software defects. This capacity is critical in vulnerability detection, because vulnerabilities frequently involve subtle code characteristics and dependencies. In addition, ML models can handle a wide range of data types and formats, including source code [26], textual information [56], and numerical features such as commit characteristics [114], and can use these data representations to discover vulnerabilities effectively. This versatility enables researchers to combine a variety of data sources and features for comprehensive vulnerability detection.

Although many studies have used ML models to detect software vulnerabilities, there has been no comprehensive, systematic review that consolidates the various approaches and characteristics of these techniques. Such a systematic survey would help practitioners and researchers better understand the current state-of-the-art tools for vulnerability detection and could inspire future studies. This study conducts a comprehensive and detailed survey to review, analyze, describe, and classify software vulnerability detection studies from different perspectives. We analyzed 138 studies published in many flagship software engineering journals and conferences from January 2011 to June 2024. In this study, we investigated the following Research Questions (RQs):

• RQ1: What is the trend of studies?
  – RQ1.1: What is the trend of studies over time?
  – RQ1.2: What is the distribution of publication venues?
• RQ2: What are the characteristics of software vulnerability detection datasets?
  – RQ2.1: What are the sources of the datasets?
  – RQ2.2: What are the most commonly used data types?
  – RQ2.3: What are the most commonly used input representations?
  – RQ2.4: What are the most commonly used embedding approaches?
• RQ3: What is the distribution of ML and Deep Learning (DL) models used for software vulnerability detection?
• RQ4: What are the most frequent types of vulnerabilities covered in the subject studies?
• RQ5: What are the most frequently used tools for software vulnerability detection?
• RQ6: What are possible challenges and open directions in software vulnerability detection?

This article makes the following contributions:

• We thoroughly analyze 138 studies that use ML models to detect security vulnerabilities in terms of publication trends, distribution of publication venues, and types of contributions.
• We conduct a comprehensive analysis of the datasets, data processing, data representations, model architectures, tools, and types of vulnerabilities covered in the subject studies.
• We provide a classification of the ML models used in vulnerability detection based on their architectures.
• We discuss distinct technical challenges of using ML techniques in vulnerability detection and outline key future directions.
• We share our results and analysis data as a replication package¹ so that other researchers can easily follow and extend this work.

We believe that this work is valuable for researchers and practitioners in software engineering and cybersecurity, especially those focused on software vulnerability detection and mitigation. It also benefits policymakers, software providers, and other stakeholders interested in improving software security and reducing cyberattack risks, informing their software development, procurement, and risk management decisions.

The rest of the article is organized as follows. Section 2 provides background information and reviews related work. Section 3 outlines the research methodology proposed in this article. Section 4 addresses the RQs and presents the corresponding results. Section 5 discusses potential threats to the validity of this study. Finally, Section 6 presents the conclusion and suggests future directions.

2 Background and Related Work

In this section, we begin by defining vulnerability and outlining the key steps in detecting software vulnerabilities. We then review related surveys, emphasizing how they differ from our own.

2.1 Background

Software vulnerability management is crucial for ensuring software security and integrity [119]. With the increasing reliance on software for critical operations like financial transactions [39], vulnerabilities pose serious risks, including unauthorized access and service disruption. Effective management is essential for protecting user privacy, maintaining system availability, and ensuring trustworthiness. There are multiple steps in software vulnerability management, including vulnerability detection, vulnerability analysis, and vulnerability remediation. In the following subsections, we elaborate on each step in detail.

2.1.1 Vulnerability Detection.

Vulnerability detection is a critical part of the overall process of managing software vulnerabilities [11]. It consists of detecting possible security weaknesses in software systems that attackers may exploit. Several traditional techniques are commonly used for vulnerability detection. In manual code auditing, human experts examine the source code thoroughly to detect coding flaws, unsafe procedures, and possible vulnerabilities. Static analysis [35] uses automated tools to analyze the source code or compiled binaries without executing the software under test. Dynamic analysis [67, 102] evaluates the behavior of software while it is running: the software is executed in a controlled environment or through automated tests while its execution and interactions with system resources are monitored. However, dynamic analysis can incur significant system overhead [167].

One dynamic approach is fuzz testing for software vulnerability detection [42]. In fuzz testing, the input space of the program under test is identified, and inputs are then mutated randomly or according to a set of predefined rules to generate malformed inputs as well as boundary input values (i.e., edge cases). These tainted values are intended to reach parts of the program that do not properly validate their inputs, where they can trigger serious security vulnerabilities such as denial of service or remote code execution.

Hybrid code analysis [25] combines the benefits of static and dynamic analysis to increase the effectiveness of software vulnerability detection. Static analysis examines code without executing it. Its key strength is that it can quickly scan the entire codebase and identify flaws before the code runs; however, it often produces many false positives and has limited context on runtime behavior [52]. Dynamic analysis, in contrast, runs the code and monitors its behavior in real time. It excels at finding runtime issues such as memory leaks.² Its main drawback is that it is resource intensive, because the program under test must be run to explore its different code paths.

The hybrid approach leverages the strengths of both techniques to achieve more comprehensive coverage. Despite these benefits, implementing hybrid code analysis involves technical complexities, such as integrating and synchronizing static and dynamic tools. It also demands significant computational resources and time, which can slow down development.
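To make the mutation step of fuzz testing concrete, the following minimal Python sketch mutates a valid seed input (bit flips, boundary-value overwrites, truncation) and records the inputs that crash the target. The target `parse_record`, the `OutOfBoundsRead` exception that simulates a memory-safety crash, and the mutation operators are hypothetical illustrations, not the design of any specific fuzzer discussed in the cited work.

```python
import random


class OutOfBoundsRead(Exception):
    """Stands in for a memory-safety violation that would crash a C program."""


def parse_record(data: bytes) -> bytes:
    """Hypothetical target: a length-prefixed record parser with a missing bounds check."""
    if not data:
        raise ValueError("empty input")  # rejected cleanly: not a vulnerability
    declared_len, payload = data[0], data[1:]
    if declared_len > len(payload):
        # A C implementation would read past the end of the buffer here
        # (an out-of-bounds read); we simulate the crash with an exception.
        raise OutOfBoundsRead(f"declared {declared_len}, have {len(payload)}")
    return payload[:declared_len]


BOUNDARY_BYTES = [0x00, 0x01, 0x7F, 0x80, 0xFF]  # edge-case byte values


def mutate(seed: bytes, rng: random.Random) -> bytes:
    """Apply one random mutation: bit flip, boundary-value overwrite, or truncation."""
    data = bytearray(seed)
    op = rng.randrange(3)
    if op == 0 and data:
        i = rng.randrange(len(data))
        data[i] ^= 1 << rng.randrange(8)
    elif op == 1 and data:
        data[rng.randrange(len(data))] = rng.choice(BOUNDARY_BYTES)
    else:
        data = data[: rng.randrange(len(data) + 1)]  # malformed (short) input
    return bytes(data)


def fuzz(seed: bytes, iterations: int = 1000, rng_seed: int = 0) -> list[bytes]:
    """Run the target on mutated inputs and collect those that crash it."""
    rng = random.Random(rng_seed)
    crashes = []
    for _ in range(iterations):
        candidate = mutate(seed, rng)
        try:
            parse_record(candidate)
        except ValueError:
            pass  # graceful rejection of invalid input
        except OutOfBoundsRead:
            crashes.append(candidate)
    return crashes


crashes = fuzz(bytes([3]) + b"abc")
print(f"{len(crashes)} crashing inputs, e.g. {crashes[0]!r}")
```

Real fuzzers add coverage feedback, corpus management, and crash triage; this loop only shows why malformed and boundary inputs reach code that fails to validate them.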

2.1.2 Vulnerability Analysis.

After vulnerabilities are detected, the next step in software vulnerability management is vulnerability analysis and assessment [130]. This step further examines the identified vulnerabilities to assess their severity, impact, and potential exploitability. First, severity refers to the potential impact a vulnerability could have if exploited [15]. Accurately assessing severity allows organizations to prioritize their response and focus on high-severity vulnerabilities that pose significant threats to the security and functionality of the software system. Second, accurately assessing impact helps determine the potential consequences a vulnerability may have for the organization [43]. Impact refers to the manifestations of exploiting a vulnerability, such as denial of service [53] or data breaches; understanding it lets organizations make informed decisions about the urgency and priority of remediation efforts. Third, accurate assessment helps estimate exploitability [14], that is, the likelihood that an attacker will successfully exploit the vulnerability to infiltrate the software system.
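Severity is often quantified with a standardized score. The text above does not name a scoring system; as an illustrative assumption, the sketch below computes the CVSS v3.1 base score, which combines exploitability metrics (attack vector, attack complexity, privileges required, user interaction) with impact metrics (confidentiality, integrity, availability), mirroring the severity, impact, and exploitability dimensions described above. The weights and formulas follow our reading of the FIRST CVSS v3.1 specification, and vector strings are assumed to omit the `CVSS:3.1/` prefix.

```python
import math

# CVSS v3.1 base-metric weights.
AV = {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.2}
AC = {"L": 0.77, "H": 0.44}
PR_UNCHANGED = {"N": 0.85, "L": 0.62, "H": 0.27}
PR_CHANGED = {"N": 0.85, "L": 0.68, "H": 0.5}  # privileges weigh more if scope changes
UI = {"N": 0.85, "R": 0.62}
CIA = {"H": 0.56, "L": 0.22, "N": 0.0}


def roundup(value: float) -> float:
    """Smallest number with one decimal place that is >= value (CVSS v3.1 Roundup)."""
    int_input = round(value * 100000)
    if int_input % 10000 == 0:
        return int_input / 100000.0
    return (math.floor(int_input / 10000) + 1) / 10.0


def base_score(vector: str) -> float:
    """Base score for a vector such as 'AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H'."""
    m = dict(part.split(":") for part in vector.split("/"))
    changed = m["S"] == "C"
    # Impact sub-score (ISS) from confidentiality, integrity, and availability.
    iss = 1 - (1 - CIA[m["C"]]) * (1 - CIA[m["I"]]) * (1 - CIA[m["A"]])
    if changed:
        impact = 7.52 * (iss - 0.029) - 3.25 * (iss - 0.02) ** 15
    else:
        impact = 6.42 * iss
    pr = (PR_CHANGED if changed else PR_UNCHANGED)[m["PR"]]
    exploitability = 8.22 * AV[m["AV"]] * AC[m["AC"]] * pr * UI[m["UI"]]
    if impact <= 0:
        return 0.0
    total = impact + exploitability
    return roundup(min(1.08 * total if changed else total, 10))


print(base_score("AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H"))
```

A network-reachable flaw that needs no privileges or user interaction and fully compromises confidentiality, integrity, and availability (the vector above) lands at the top of the "Critical" band, which is the kind of result that drives prioritization.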

2.1.3 Vulnerability Remediation.

Software vulnerability remediation [59] is the process of resolving detected vulnerabilities through techniques such as patching, code modification, and repair. Its fundamental goal is to eliminate or mitigate vulnerabilities to improve the security and dependability of the software system. One common approach to vulnerability remediation is applying patches provided by software vendors or open source communities [156]. Patches are updates or fixes that address specific vulnerabilities or weaknesses identified in a software system.
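A patch is commonly distributed as a diff between the vulnerable and the fixed version of a file. The sketch below uses Python's standard `difflib` to produce a unified diff for a hypothetical C function in which an unbounded `strcpy` (a classic buffer overflow, CWE-120) is replaced by a bounded copy; the function and file names are illustrative only.

```python
import difflib

# Hypothetical vulnerable C function: strcpy does not check the destination size.
vulnerable = """\
void copy_name(char *dst, const char *src) {
    strcpy(dst, src);
}
""".splitlines(keepends=True)

# Hypothetical patched version: bounded copy plus explicit NUL termination.
patched = """\
void copy_name(char *dst, const char *src, size_t dst_size) {
    strncpy(dst, src, dst_size - 1);
    dst[dst_size - 1] = '\\0';
}
""".splitlines(keepends=True)

patch = "".join(
    difflib.unified_diff(vulnerable, patched, fromfile="a/name.c", tofile="b/name.c")
)
print(patch)
```

The `-` and `+` lines in the output are exactly the kind of vulnerability-fixing code change that ML studies later mine from repositories as labeled training data.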

2.1.4 ML for Software Vulnerability Detection.

ML approaches, which combine data analysis and pattern recognition to find security vulnerabilities, have revolutionized software vulnerability detection [145]. These techniques can improve the accuracy and efficiency of vulnerability detection, potentially enabling automated detection, faster analysis, and the identification of previously undisclosed vulnerabilities. A common application of ML in vulnerability detection is classifying code snippets [27], software binaries, or code changes collected from open source repositories such as GitHub or from vulnerability databases such as Common Vulnerabilities and Exposures (CVE). ML models are trained on labeled datasets in which each sample is marked as vulnerable or non-vulnerable. The models then generalize from these examples and classify new instances based on the patterns they have learned. This allows vulnerabilities to be discovered automatically without manual examination, considerably reducing the time and effort required for analysis.
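To illustrate learning from labeled examples, the sketch below trains a multinomial naive Bayes classifier on token counts from a handful of hypothetical, hand-labeled C snippets (1 = vulnerable, 0 = not vulnerable). Naive Bayes and the toy data are assumptions chosen for brevity; most surveyed studies use DL models trained on far larger datasets.

```python
import math
import re
from collections import Counter

TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokenize(code: str) -> list[str]:
    """Split code into identifiers, numbers, and single punctuation characters."""
    return TOKEN.findall(code)


class NaiveBayes:
    """Multinomial naive Bayes over token counts, with add-one (Laplace) smoothing."""

    def fit(self, snippets, labels):
        self.classes = sorted(set(labels))
        self.log_prior = {c: math.log(labels.count(c) / len(labels)) for c in self.classes}
        self.counts = {c: Counter() for c in self.classes}
        for code, label in zip(snippets, labels):
            self.counts[label].update(tokenize(code))
        self.vocab = set().union(*self.counts.values())
        self.totals = {c: sum(self.counts[c].values()) for c in self.classes}
        return self

    def log_score(self, code: str, c) -> float:
        denom = self.totals[c] + len(self.vocab)
        return self.log_prior[c] + sum(
            math.log((self.counts[c][tok] + 1) / denom) for tok in tokenize(code)
        )

    def predict(self, code: str):
        return max(self.classes, key=lambda c: self.log_score(code, c))


# Hypothetical training data: 1 = vulnerable, 0 = not vulnerable.
train = [
    ("strcpy(buf, input);", 1),
    ("gets(line);", 1),
    ("sprintf(out, fmt, user);", 1),
    ("strncpy(buf, input, sizeof(buf) - 1);", 0),
    ("fgets(line, sizeof(line), stdin);", 0),
    ("snprintf(out, sizeof(out), fmt, user);", 0),
]
model = NaiveBayes().fit([s for s, _ in train], [y for _, y in train])
print(model.predict("strcpy(dest, name);"))
```

With six training snippets the model mostly keys on individual API names such as `strcpy` versus `fgets`; richer representations and larger datasets are what let the surveyed models go beyond such surface cues.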

ML models for detecting software vulnerabilities offer promising advantages over traditional methodologies; we discuss each below. Automation is a significant advantage: ML models can automatically scan and analyze large codebases or system configurations, detecting potential vulnerabilities without requiring human intervention for each case [12]. This automation speeds up detection, allowing security teams to focus on verifying and mitigating vulnerabilities rather than on manual analysis. With regard to efficiency and scalability, ML approaches offer faster analysis. Traditional vulnerability detection techniques rely on manual inspection or the application of predefined rules [128]. In contrast, ML approaches can evaluate enormous volumes of data in parallel and generate predictions quickly, dramatically shortening the time needed to find vulnerabilities. With regard to detection effectiveness, ML models can uncover previously unknown vulnerabilities, commonly known as zero-day vulnerabilities [5]. By learning patterns and generalizing from labeled data, these models may detect signs of vulnerabilities they were not specifically trained on. This capability improves overall security by helping to identify and address unknown weaknesses in software before attackers exploit them [1].

Figure 1 shows the overall pipeline of software vulnerability detection. The pipeline for software vulnerability detection using ML models involves several key stages.

Fig. 1.

The first stage is data collection, where data is gathered from various sources such as benchmark datasets including but not limited to the National Vulnerability Database (NVD) and the National Institute of Standards and Technology (NIST) Software Assurance Reference Dataset (SARD), code repositories (GitHub), and specific open source projects (LibTIFF, FFMPEG).
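Benchmark sources such as NVD are usually consumed programmatically. Assuming the JSON layout of the NVD CVE API 2.0 (a `vulnerabilities` list whose `cve` objects carry an `id` and `weaknesses` entries holding CWE identifiers), the sketch below extracts CWE labels per CVE. The two records are hypothetical placeholders; in practice the JSON would be fetched from the NVD API or a downloaded data feed.

```python
import json


def cwe_labels(nvd_json: str) -> dict[str, list[str]]:
    """Map each CVE ID to the CWE IDs listed in its 'weaknesses' field."""
    labels = {}
    for item in json.loads(nvd_json).get("vulnerabilities", []):
        cve = item["cve"]
        cwes = sorted({
            desc["value"]
            for weakness in cve.get("weaknesses", [])
            for desc in weakness.get("description", [])
            # Skip placeholders such as "NVD-CWE-Other" / "NVD-CWE-noinfo".
            if desc.get("value", "").startswith("CWE-")
        })
        labels[cve["id"]] = cwes
    return labels


# Hypothetical, abbreviated response with placeholder CVE IDs.
sample = json.dumps({
    "vulnerabilities": [
        {"cve": {"id": "CVE-0000-0001",
                 "weaknesses": [{"source": "nvd@nist.gov", "type": "Primary",
                                 "description": [{"lang": "en", "value": "CWE-119"}]}]}},
        {"cve": {"id": "CVE-0000-0002",
                 "weaknesses": [{"description": [{"lang": "en", "value": "NVD-CWE-Other"}]}]}},
    ]
})
print(cwe_labels(sample))
```

Labels of this form are what later allow studies to report results per CWE type, such as CWE-119 or CWE-20.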

The data preprocessing stage involves tokenization, parsing (using tools such as Joern³), normalization, and feature extraction to convert raw code into analyzable formats.
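Normalization often maps user-defined identifiers to generic symbols so that a model learns code structure rather than naming conventions. The sketch below is a simplified, regex-based version for C-like code; the `KEEP` list of keywords and library calls and the `VARn`/`FUNn` naming scheme are illustrative assumptions, and a production pipeline would rely on a real parser such as Joern.

```python
import re

TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|->|\+\+|--|==|!=|<=|>=|&&|\|\||\S")
# Illustrative subset of C keywords and library calls that are kept verbatim.
KEEP = {"if", "else", "for", "while", "return", "int", "char", "void", "size_t",
        "sizeof", "strcpy", "strncpy", "memcpy", "malloc", "free", "gets", "fgets"}


def normalize(code: str) -> list[str]:
    """Tokenize C-like code, mapping user-defined names to VARn / FUNn symbols."""
    code = re.sub(r"/\*.*?\*/|//[^\n]*", " ", code, flags=re.S)  # drop comments (naively)
    code = re.sub(r'"(?:\\.|[^"\\])*"', "STR", code)             # collapse string literals
    tokens = TOKEN.findall(code)
    mapping, out = {}, []
    for i, tok in enumerate(tokens):
        if re.fullmatch(r"[A-Za-z_]\w*", tok) and tok not in KEEP and tok != "STR":
            if tok not in mapping:
                # An identifier directly followed by "(" is treated as a function name.
                prefix = "FUN" if i + 1 < len(tokens) and tokens[i + 1] == "(" else "VAR"
                n = sum(1 for sym in mapping.values() if sym.startswith(prefix))
                mapping[tok] = f"{prefix}{n + 1}"
            tok = mapping[tok]
        out.append(tok)
    return out


print(normalize("void copy(char *dst, char *src) { strcpy(dst, src); } // copy"))
```

Keeping security-relevant calls such as `strcpy` verbatim while anonymizing everything else is the design choice that preserves the signal a detector needs.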

The data representation stage is where the preprocessed data is converted into appropriate representations, including graph-based representations such as control flow or dataflow graphs, token representations, or numerical attributes.
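Graph-based representations turn code into nodes and edges that a GNN can consume. Building control-flow or data-flow graphs for C normally requires a tool such as Joern, so, as a dependency-free stand-in, the sketch below builds an abstract syntax tree (AST) graph for a Python snippet with the standard `ast` module. Using an AST rather than a control-flow or data-flow graph, and Python rather than C, are simplifying assumptions.

```python
import ast


def ast_graph(source: str) -> tuple[list[str], list[tuple[int, int]]]:
    """Return node labels (AST node type names) and parent -> child edges."""
    labels: list[str] = []
    edges: list[tuple[int, int]] = []

    def visit(node: ast.AST, parent: "int | None") -> None:
        idx = len(labels)  # number nodes in pre-order so every occurrence is distinct
        labels.append(type(node).__name__)
        if parent is not None:
            edges.append((parent, idx))
        for child in ast.iter_child_nodes(node):
            visit(child, idx)

    visit(ast.parse(source), None)
    return labels, edges


labels, edges = ast_graph("x = y + 1")
print(labels)
print(edges)
```

Each syntax node becomes a graph node labeled with its type, and parent-child relations become edges; control-flow or data-flow edges would be layered on top of such a skeleton.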

In the feature extraction stage, once the data is represented in an appropriate form, these representations are converted into suitable features using different embedding techniques such as graph embedding or token vector embedding.
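Graph embedding methods typically propagate information along edges and then pool node vectors into a fixed-size representation. The sketch below shows unweighted mean aggregation over each node's neighbors followed by mean pooling, starting from one-hot node-type features. It deliberately omits the learned weight matrices and nonlinearities of a real GNN, so it illustrates the message-passing idea rather than any specific model from the surveyed studies.

```python
def one_hot(labels: list[str], vocab: list[str]) -> list[list[float]]:
    """Encode each node's type label as a one-hot vector over the vocabulary."""
    return [[1.0 if lab == v else 0.0 for v in vocab] for lab in labels]


def propagate(features: list[list[float]], edges: list[tuple[int, int]]) -> list[list[float]]:
    """One round of mean aggregation over each node and its (undirected) neighbors."""
    neighbors = [[i] for i in range(len(features))]  # start with a self-loop
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)
    dim = len(features[0])
    return [
        [sum(features[j][d] for j in nbrs) / len(nbrs) for d in range(dim)]
        for nbrs in neighbors
    ]


def graph_embedding(labels, edges, vocab, rounds: int = 2) -> list[float]:
    """Propagate features `rounds` times, then mean-pool into one graph vector."""
    h = one_hot(labels, vocab)
    for _ in range(rounds):
        h = propagate(h, edges)
    return [sum(col) / len(h) for col in zip(*h)]


emb = graph_embedding(["Call", "Name", "Name"], [(0, 1), (0, 2)], vocab=["Call", "Name"], rounds=1)
print(emb)
```

Whatever the graph's size, the output has one entry per node type, which is what lets graphs of different sizes be fed to the same downstream classifier.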

In the model selection and training stage, appropriate DL models (e.g., Recurrent Neural Networks (RNNs), Graph Neural Networks (GNNs), Transformers, Autoencoders, and Deep Belief Networks (DBNs)), as well as traditional ML models (e.g., Support Vector Machines (SVMs), Decision Trees, and Random Forests), are chosen based on the characteristics of the data. The training process includes splitting the data into training and test sets, feature engineering, hyperparameter tuning, and applying suitable training algorithms.
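The training steps listed above (splitting, hyperparameter tuning, fitting) can be sketched with a plain logistic-regression classifier on two numerical features, in the spirit of the commit characteristics mentioned earlier. The synthetic data generator, the features, and the learning-rate grid are hypothetical; a validation split is used for tuning so that the test set is touched only once.

```python
import math
import random


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def make_data(n: int, seed: int = 0):
    """Hypothetical commits: two scaled features and a toy labeling rule."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = (rng.random(), rng.random())  # e.g. scaled lines added, files touched
        data.append((x, 1 if x[0] + x[1] > 1.0 else 0))
    return data


def split(data, train=0.6, val=0.2, seed=0):
    """Shuffle, then cut into train / validation / test partitions."""
    data = list(data)
    random.Random(seed).shuffle(data)
    a, b = int(len(data) * train), int(len(data) * (train + val))
    return data[:a], data[a:b], data[b:]


def train_logreg(data, lr: float, epochs: int = 100):
    """Logistic regression fitted with per-sample (stochastic) gradient descent."""
    w0 = w1 = b = 0.0
    for _ in range(epochs):
        for (x0, x1), y in data:
            g = sigmoid(w0 * x0 + w1 * x1 + b) - y  # d(log loss)/d(logit)
            w0, w1, b = w0 - lr * g * x0, w1 - lr * g * x1, b - lr * g
    return w0, w1, b


def accuracy(model, data) -> float:
    w0, w1, b = model
    return sum((w0 * x0 + w1 * x1 + b > 0) == (y == 1) for (x0, x1), y in data) / len(data)


train_set, val_set, test_set = split(make_data(300))
# Hyperparameter tuning: choose the learning rate with the best validation accuracy.
best_lr = max([0.01, 0.1, 1.0], key=lambda lr: accuracy(train_logreg(train_set, lr), val_set))
# Refit on train + validation, then evaluate once on the held-out test set.
model = train_logreg(train_set + val_set, best_lr)
print(best_lr, accuracy(model, test_set))
```

In the surveyed studies the same loop is usually expressed with a framework such as PyTorch or Keras, but the separation of training, validation, and test data is the part that carries over directly.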

Finally, model evaluation is often conducted using cross-validation, performance metrics (e.g., accuracy, precision, and recall), confusion matrices, and ablation studies. This step assesses whether the models are accurate and reliable enough for detecting software vulnerabilities.
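The evaluation measures named above can be computed directly from a confusion matrix. The sketch below derives accuracy, precision, recall, and F1 (added here as a common companion metric) and generates k-fold cross-validation splits. The folds are contiguous for simplicity; in practice data are usually shuffled, and often stratified by label, before splitting.

```python
def confusion_matrix(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels, with 1 = vulnerable."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    return tp, fp, fn, tn


def metrics(y_true, y_pred) -> dict:
    tp, fp, fn, tn = confusion_matrix(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": (tp + tn) / len(y_true), "precision": precision,
            "recall": recall, "f1": f1}


def k_fold_indices(n: int, k: int):
    """Yield (train_idx, test_idx) pairs for k contiguous folds over n samples."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size


print(metrics([1, 1, 1, 0, 0, 0, 0, 0, 0, 0], [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]))
```

For instance, `metrics([1] + [0] * 9, [0] * 10)` reports an accuracy of 0.9 but a recall of 0.0: a model that never flags anything looks accurate on imbalanced vulnerability data, which is why precision and recall are reported alongside accuracy.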

2.2 Related Work

There have been several existing survey papers on software vulnerabilities in the literature. In this section, we analyze the existing papers based on different aspects as shown in Table 1.

Table 1.

No.	Study	Data Source	Representation	Embedding	Models	Vulnerability Types	Tools
1	Le et al. [72]	✓	✗	✓	✓	✗	✗
2	Ghaffarian & Shahriari [40]	✓	✓	✓	✓	✗	✗
3	Lin et al. [86]	✓	✓	✓	✓	✗	✗
4	Zeng et al. [173]	✓	✓	✓	✓	✗	✗
5	Semasaba et al. [124]	✓	✓	✗	✓	✓	✗
6	Sun et al. [133]	✓	✗	✗	✗	✓	✗
7	Kritikos et al. [69]	✓	✗	✗	✗	✓	✓
8	Khan & Parkinson [66]	✗	✗	✗	✗	✗	✗
9	Nong et al. [112]	✗	✗	✗	✗	✗	✗
10	Chakraborty et al. [12]	✗	✗	✗	✗	✗	✗
11	Liu et al. [90]	✗	✗	✗	✗	✗	✗
12	Our survey	✓	✓	✓	✓	✓	✓

Table 1. Comparison of Contributions between Our Survey and the Existing Related Surveys/Reviews

The table's columns represent different aspects of the surveys: the data sources, code representation, feature embedding, ML models, vulnerability types, and tools employed for model building or dataset processing. Data Source indicates whether the survey reviewed vulnerability detection data sources. Representation indicates whether the survey considered source code representation in its analysis. Embedding indicates whether the survey considered feature embedding. Models (the sixth column) indicates whether the survey covered ML models. Vulnerability Types indicates whether the survey considered vulnerability types based on their Common Weakness Enumeration (CWE) numbers. The last column, Tools, indicates whether the survey covered tools used for software vulnerability detection.

The surveys by Ghaffarian and Shahriari [40] and Kritikos et al. [69] are the closest to ours in terms of data-driven detection of security vulnerabilities. They analyzed ML-based software vulnerability detection from various aspects, as shown in Table 1. However, there are a couple of differences compared to our work. Specifically, our work surveys vulnerability detection with respect to two aspects: a better understanding of attack patterns (through the vulnerability types covered) and the tools used for software vulnerability detection. Understanding different types of vulnerabilities gives researchers insights into various attack patterns, enabling them to design detection techniques that can identify both known and unknown attack patterns. Understanding the tools used for software vulnerability detection reveals technological trends and helps researchers leverage these tools for reproducibility. It also highlights the strengths and weaknesses of existing tools, guiding new developments. Popular tools offer community support, documentation, and shared knowledge, accelerating innovation and the practical application of research.

Le et al. [72] reviewed data-driven vulnerability assessment and prioritization studies, covering prior research that leverages ML and data mining methods to assess and prioritize software vulnerabilities. The major difference from our work is that we review software vulnerability detection techniques, that is, techniques for identifying potential vulnerabilities in software systems, whereas they survey assessment and prioritization techniques.

Lin et al. [86] examined the literature on using DL and neural-network-based techniques to detect software vulnerabilities. The major difference from our work is that we also analyze publication trends of software vulnerability detection papers in journals and conferences, which provides a comprehensive understanding of the publishing patterns in this research area. Trend analysis can shed light on the distribution of research output across publication venues and on the shifting preferences of researchers and authors.

Zeng et al. [173] discussed the growing focus on exploitable software vulnerabilities and the development of detection methods, especially those using ML techniques. Their survey reviews 22 recent studies employing DL for vulnerability detection, identifies four significant game-changers in the field, and compares them based on data sources, feature representation, DL models, and detection tools. Our survey differs in two key ways. First, we analyze publication trends in software vulnerability detection across journals and conferences, providing a comprehensive understanding of research trends. Second, we cover additional aspects beyond data sources, feature representation, and ML models, including vulnerability types and detection tools.

Kritikos et al. [69] and Sun et al. [133] focused on cybersecurity and aimed to improve cyber resilience. Sun et al. [133] discussed the paradigm shift in understanding and protecting against cyber threats, from reactive detection to proactive prediction, with an emphasis on new research on cybersecurity incident prediction systems that use many types of data sources. Kritikos et al. [69] discussed the challenges of migrating applications to the cloud and ensuring their security, focusing on vulnerability management during the application lifecycle and on the use of open source tools and databases to better secure applications. While both surveys aim to improve application security, they differ from ours in focus and techniques. They mainly provide guidance and tools to support vulnerability management during the application lifecycle, whereas our survey focuses on ML-based software vulnerability detection, which aims to automate the identification of vulnerabilities in source code or repository data (e.g., commit characteristics).

Khan and Parkinson [66] focused on vulnerability assessment, the process of finding and fixing vulnerabilities in a computer system before hackers can exploit them. Their survey highlights the need for more research into automated vulnerability mitigation strategies that can effectively secure software systems. In contrast, ML-based vulnerability identification on source code analyzes a software system's source code to spot security flaws: instead of evaluating the safety of the entire system, it concentrates on finding vulnerabilities in the code itself.

Nong et al. [112] explored the open science aspects of studies on software vulnerability detection, arguing that research on open science problems in software engineering, and in software vulnerability detection in particular, is scarce. The authors conducted an exhaustive literature study and identified 55 relevant studies that propose DL-based vulnerability detection approaches. They investigated open science aspects including availability, executability, reproducibility, and replicability, and found that 25.5% of the examined approaches provide open source tools.

Chakraborty et al. [12] investigated how state-of-the-art DL-based vulnerability prediction approaches perform in real-world vulnerability prediction scenarios and found that their performance drops by more than 50%. The key difference from our survey is scope: we focus on the use of ML models for software vulnerability detection and characterize the different stages of the detection pipeline, whereas they examine the problems that arise when state-of-the-art DL models are applied to software vulnerability detection.

Liu et al. [90] discussed the increasing popularity of DL techniques in software engineering research due to their ability to address software engineering challenges without extensive manual feature engineering. The major difference compared to our study is that we focus on the usage of ML techniques in software vulnerability detection pipelines, whereas they emphasize replicability and reproducibility of the results reported in software engineering research studies.

3 Methodology

3.1 Sources of Information

In this article, we conduct a systematic survey, following prior work [65, 116], to collect and examine studies published between January 2011 and June 2024 that use ML techniques for software vulnerability detection. Figure 2 depicts the overall workflow of our systematic approach. As data sources, we target four popular and widely used digital libraries: ACM Digital Library, ScienceDirect, IEEE Xplore, and Google Scholar. We developed a web crawler⁴ based on the Selenium⁵ and Beautiful Soup⁶ libraries because a crawler offers a reliable, scalable, and effective way to collect relevant records from the web, which is particularly useful for a systematic literature review.
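
The crawler is described only by the libraries it uses. As a hedged illustration of the parsing step alone, the sketch below extracts result titles from a search-results page. It swaps Beautiful Soup for Python's standard-library `html.parser` so that it runs without third-party packages, and it assumes, hypothetically, that each result title is an `<h3>` element with class `result-title`; real digital libraries use different and frequently changing markup. The Selenium step that renders the page is omitted.

```python
from html.parser import HTMLParser


class ResultTitleParser(HTMLParser):
    """Collects the text of <h3 class="result-title"> elements.

    The tag/class pair is a hypothetical placeholder, not the markup
    of any real digital library.
    """

    def __init__(self, tag="h3", css_class="result-title"):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self._inside = False
        self._buffer = []
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and self.css_class in classes:
            self._inside = True
            self._buffer = []

    def handle_data(self, data):
        # Text inside nested tags (e.g., <a>) is kept as part of the title.
        if self._inside:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self.tag and self._inside:
            # Collapse internal whitespace so titles compare cleanly later.
            self.titles.append(" ".join("".join(self._buffer).split()))
            self._inside = False


def extract_titles(html):
    """Return the result titles found in one page of HTML."""
    parser = ResultTitleParser()
    parser.feed(html)
    parser.close()
    return parser.titles
```

In a Selenium-driven crawler, the `html` argument would typically be the WebDriver's `page_source` after the results have rendered.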

Fig. 2.

The period between January 2011 and June 2024 is an appropriate interval for extracting software vulnerability detection studies for two main reasons. First, the volume and diversity of software vulnerabilities have increased: during the past decade, the number and diversity of discovered and reported software vulnerabilities have grown significantly.⁷ As of 2021, NVD contained 150,000 CVE records.⁸ This growth created a need for more sophisticated and effective vulnerability detection methods, which in turn led to the development of new data-driven techniques. Second, ML and data analytics have advanced considerably. The past decade has seen the development of DL algorithms [44, 55], natural language processing techniques [89], and other data-driven approaches that are highly effective in detecting software vulnerabilities.

3.2 Search Terms

Following existing surveys [72, 86, 124, 173], we devised the following search terms:

“vulnerability detection” OR “Deep Transfer Learning Vulnerability Detection” OR “Transfer Learning Software Vulnerability Detection” OR “Transfer Learning Software Bug Detection” OR “Software Vulnerability Detection” OR “Vulnerability Detection Using Deep Learning” OR “Source Code Security Bug Prediction” OR “Source Code Vulnerability Detection” OR “Source Code Bug Detection” OR “Vulnerability Detection on Source Code Using Deep Learning”
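
The ten phrases above are combined with OR into a single query. The sketch below assembles such a string, quoting each phrase so that libraries supporting exact-phrase search treat it as one unit and dropping blank or case-insensitive duplicate phrases. The survey does not report the exact query syntax submitted to each library, and each library has its own syntax and length limits, so the function name and output format here are assumptions.

```python
SEARCH_TERMS = [
    "vulnerability detection",
    "Deep Transfer Learning Vulnerability Detection",
    "Transfer Learning Software Vulnerability Detection",
    "Transfer Learning Software Bug Detection",
    "Software Vulnerability Detection",
    "Vulnerability Detection Using Deep Learning",
    "Source Code Security Bug Prediction",
    "Source Code Vulnerability Detection",
    "Source Code Bug Detection",
    "Vulnerability Detection on Source Code Using Deep Learning",
]


def build_boolean_query(phrases):
    """Join phrases into one OR query, quoting each as an exact phrase."""
    cleaned = []
    seen = set()
    for phrase in phrases:
        phrase = " ".join(phrase.split())  # normalize internal whitespace
        key = phrase.lower()
        if phrase and key not in seen:  # skip blanks and case-insensitive repeats
            seen.add(key)
            cleaned.append(f'"{phrase}"')
    return " OR ".join(cleaned)
```

Deduplicating case-insensitively keeps the query short without changing what most search engines match, since their phrase search is usually case-insensitive.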

Using these keywords and our web crawler, we collected more than 15K initial records⁹ from the digital libraries shown in Figure 2. We then manually analyzed and filtered these records in three stages, based on paper titles, abstracts, and full contents, respectively. The following subsections explain each stage in detail.

3.3 Study Selection and Quality Assessment

Selecting the studies to include in our survey involves three stages: (1) an initial selection based on titles, (2) a selection after reviewing abstracts, and (3) a final selection after reading the full articles. The initial search results contain entries unrelated to software vulnerability detection, probably because of incidental keyword matches. We manually checked each paper and removed these irrelevant papers to ensure the quality of our survey dataset. We also observed duplicate papers among the search results, since the same study can be indexed by multiple databases, and we discarded these duplicates manually.
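
The survey removed duplicates manually. An automated pre-pass, sketched below, can group records that share a normalized title and remember which libraries indexed each one, leaving a reviewer to confirm the merges. The normalization (lower-casing, stripping accents and punctuation) and the record fields `title` and `source` are assumptions; distinct papers with near-identical titles could collide, which is why manual confirmation still matters.

```python
import re
import unicodedata


def normalize_title(title):
    """Lower-case, strip accents and punctuation, and collapse whitespace."""
    text = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9]+", " ", text.lower())
    return text.strip()


def deduplicate(records):
    """Keep the first record per normalized title, collecting every source library."""
    merged = {}
    for record in records:
        key = normalize_title(record["title"])
        if key in merged:
            merged[key]["sources"].add(record["source"])
        else:
            merged[key] = {"title": record["title"], "sources": {record["source"]}}
    return list(merged.values())
```

For example, “Learning to Detect Vulnerable Code: A Graph-Based Approach” (a hypothetical title) from one library and “learning to detect vulnerable code - a graph based approach” from another collapse into a single entry with both sources recorded.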

The inclusion criteria are as follows: (1) the study is peer reviewed (i.e., we do not include arXiv papers), (2) the study reports experimental results, (3) the study proposes a novel ML technique, (4) the study improves existing data-driven vulnerability detection techniques, and (5) the input to the ML models is source code, text, commits, bytecode, or a combination of these. We also apply the following exclusion criteria to filter out irrelevant papers: (1) studies in other engineering domains (e.g., electrical, mechanical, or aerospace engineering), (2) studies addressing static analysis, dynamic analysis, hybrid analysis, or mutation testing, (3) reviews or surveys, (4) studies on vulnerability detection in web and Android applications, (5) books, book chapters, tutorials, and technical reports, and (6) studies on malware detection on mobile devices, intrusion detection, or bug detection based on static code attributes (e.g., cyclomatic complexity).
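
The criteria were applied by reading each paper. To keep such decisions consistent across reviewers, they can be encoded as a checklist over fields that a reviewer fills in. In the sketch below, the `Candidate` fields and the topic tags are hypothetical, and inclusion criteria (3) and (4) are read as alternatives (a study must either propose a novel ML technique or improve an existing data-driven one), which is our interpretation rather than something the survey states.

```python
from dataclasses import dataclass, field

# Inclusion criterion (5): allowed model inputs.
ALLOWED_INPUTS = {"source code", "text", "commit", "byte-code"}
# Exclusion criteria (3) and (5): publication types that are out of scope.
EXCLUDED_PUB_TYPES = {"book", "chapter", "tutorial", "technical report", "review", "survey"}
# Exclusion criteria (1), (2), (4), and (6), as hypothetical topic tags.
EXCLUDED_TOPICS = {
    "other engineering domain",
    "static analysis", "dynamic analysis", "hybrid analysis", "mutation testing",
    "web application", "android application",
    "mobile malware", "intrusion detection", "static code attributes",
}


@dataclass
class Candidate:
    title: str
    pub_type: str
    peer_reviewed: bool
    has_experiments: bool
    proposes_novel_ml: bool
    improves_existing: bool
    input_types: set = field(default_factory=set)
    topics: set = field(default_factory=set)


def passes_screening(c):
    """True if a candidate meets every inclusion criterion and no exclusion criterion."""
    included = (
        c.peer_reviewed
        and c.has_experiments
        and (c.proposes_novel_ml or c.improves_existing)
        and bool(c.input_types)
        and c.input_types <= ALLOWED_INPUTS
    )
    excluded = c.pub_type in EXCLUDED_PUB_TYPES or bool(c.topics & EXCLUDED_TOPICS)
    return included and not excluded
```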

3.3.1 Title Filtering Stage.

In this stage, we filter studies based on their titles. Because titles convey limited information about a study, we only assess relevance to the initial keywords by asking: does the title contain specific keywords or phrases that are central to software vulnerability detection? For example, the title “Toward Hardware-Based IP Vulnerability Detection and Post-Deployment Patching in Systems-on-Chip” includes our search keyword “vulnerability detection,” but its context shows that the study targets hardware and systems-on-chip rather than software engineering.
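
The title check is a manual judgment, but a keyword pass can help triage roughly 15K titles before reading them. The sketch below reports whether a title contains a core phrase and which out-of-scope context terms appear, then applies a simple keep rule. Both term lists are hypothetical illustrations, not lists from the survey, and a keyword list cannot capture context reliably, so rejected titles would still merit a human glance.

```python
import re

# Hypothetical term lists; the survey states the question, not a word list.
CORE_PHRASES = (
    "vulnerability detection",
    "vulnerability prediction",
    "bug detection",
    "bug prediction",
)
OFF_TOPIC_CONTEXT = ("hardware", "system-on-chip", "systems-on-chip", "circuit", "fpga")


def title_flags(title):
    """Return (has_core_phrase, off_topic_terms_found) for one title."""
    t = title.lower()
    has_core = any(p in t for p in CORE_PHRASES)
    # Word boundaries stop "fpga" matching inside longer words, for instance.
    off_topic = [w for w in OFF_TOPIC_CONTEXT if re.search(r"\b" + re.escape(w) + r"\b", t)]
    return has_core, off_topic


def keep_title(title):
    """Triage rule: keep titles with a core phrase and no off-topic context."""
    has_core, off_topic = title_flags(title)
    return has_core and not off_topic
```

Applied to the hardware example above, the sketch finds the core phrase but also flags “hardware” and “systems-on-chip,” so the title is set aside, matching the manual decision described in the text.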

After the manual analysis on approximately 15K records, we collected 398 unique studies for further evaluation.

Abstract Filtering Stage. Given the studies retained by the previous stage, we thoroughly analyzed their abstracts. We decomposed each abstract into four major sections (Context, Objective, Approach, and Results/Findings), since research paper abstracts often follow this structure.

This filtering stage yielded 202 unique papers for further verification.

Content Filtering Stage. In this stage, we analyze the full content of each remaining study in detail. Because the full text provides far more detail than the title or abstract, we devised a set of criteria questions and use the answers to assess the quality of each paper: if the answers to these questions are positive, the study is relevant; otherwise, we remove the paper from further examination. The questions are as follows: (1) Does the introduction clearly state a research goal related to software vulnerability detection? (2) Does the proposed vulnerability detection approach use ML or DL techniques? (3) Is the technique well defined and repeatable? (4) Does the study make an explicit contribution to software vulnerability detection? (5) Is there a clear methodology for validating the technique? (6) Are the subject projects selected for validation suitable for the research goals? (7) Are the employed datasets relevant to software vulnerability detection? (8) Are the types of input data to the DL and ML models relevant to software vulnerability detection (valid data types include source code, binary code, text, and commit metrics)? (9) Are there control techniques or baselines that demonstrate the effectiveness of the software vulnerability detection technique? (10) Are the evaluation metrics relevant to the research objectives (e.g., do they evaluate the effectiveness of the proposed technique)? (11) Do the presented results align with the research objectives, and are they presented clearly?
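
Recording an explicit answer for each of the eleven questions makes the content-filtering decision auditable. The sketch below encodes the rule stated above, keeping a study only if every answer is positive. The short criterion labels are our paraphrases of questions (1) to (11), and the function name is hypothetical.

```python
# Paraphrased labels for questions (1) to (11), in order.
CRITERIA = (
    "clear research goal on software vulnerability detection",
    "uses ML or DL techniques",
    "defined and repeatable technique",
    "explicit contribution to vulnerability detection",
    "clear validation methodology",
    "suitable subject projects",
    "relevant datasets",
    "relevant input data types",
    "control techniques or baselines",
    "relevant evaluation metrics",
    "results aligned with objectives",
)


def content_filter(answers):
    """Return (keep, failed_criteria); keep only if all eleven answers are positive."""
    missing = [c for c in CRITERIA if c not in answers]
    if missing:
        raise ValueError(f"unanswered criteria: {missing}")
    failed = [c for c in CRITERIA if not answers[c]]
    return not failed, failed
```

Returning the failed criteria, not just a boolean, records why a paper was dropped, which helps when reviewers later compare decisions.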

The filtering process in this stage resulted in 138 subject studies to address the RQs. We used these 138 studies to create taxonomies which are explained in detail in the next section.

3.4 Taxonomy Development and Classification Methodology

In this section, we present the methodology used to develop our taxonomy and to classify the selected papers according to our RQs. Following existing studies [53], we build the taxonomy incrementally. It is grounded in a systematic analysis of the literature, guided by RQs designed to explore different dimensions of software vulnerability detection. Each RQ serves as a focal point for our classification, which keeps the approach structured and coherent.

Extraction of Relevant Information. We carefully examined each selected paper to extract the text segments relevant to the RQs. For RQ2, which concerns dataset sources, we examine the experimental setup section of each study, since this is where authors most commonly describe their datasets.¹⁰ This tells us which types of datasets the authors used to evaluate their proposed software vulnerability detection techniques. For RQ3, we analyzed the section describing the proposed approach and identified the ML and DL models employed. One of the clearest sources of this information is the overall architecture, which depicts the entire process of the proposed technique. For RQ4, we examine the vulnerability types covered by each study. These are often expressed using the CWE system, which makes them easy to locate: we search each paper for keywords starting with “CWE” and record any CWE IDs we find. Because some papers name a vulnerability type without its CWE ID (e.g., they write “Integer Overflow” instead of CWE-190), we search for both CWE IDs and related vulnerability keywords; if neither appears, we note that the paper does not specify which vulnerability types it aims to detect. For RQ5, we thoroughly analyzed the experimental sections, particularly the implementation sections, to extract the tools used to build the ML models; we observed that authors usually introduce these tools with keywords such as “implementation” or “built.” For RQ6, we examine the introduction of each study, where authors often state explicitly which software vulnerability detection problem they address.
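
The CWE search described for RQ4 can be partly automated. The sketch below extracts explicit CWE identifiers with a regular expression and infers further IDs from weakness names. The name-to-ID map holds only a handful of well-known CWE entries chosen for illustration; a usable map would need many more names and synonyms. Matches are candidates only, since a paper may mention a CWE in related work without targeting it.

```python
import re

# CWE identifiers such as "CWE-190", "cwe 787", or "CWE119".
CWE_ID = re.compile(r"\bCWE[-\s]?(\d+)\b", re.IGNORECASE)

# Illustrative, deliberately incomplete map from weakness names to CWE IDs.
CWE_NAMES = {
    "integer overflow": 190,
    "buffer overflow": 120,
    "use after free": 416,
    "sql injection": 89,
    "null pointer dereference": 476,
}


def extract_cwes(text):
    """Return (explicit CWE IDs, IDs inferred only from weakness names)."""
    explicit = {int(m) for m in CWE_ID.findall(text)}
    # Treat hyphens like spaces so "use-after-free" matches "use after free".
    lowered = re.sub(r"[-\s]+", " ", text.lower())
    inferred = {cwe for name, cwe in CWE_NAMES.items() if name in lowered}
    return explicit, inferred - explicit
```

Keeping explicit and inferred IDs separate lets a reviewer check the name-based guesses, which are the more error-prone of the two.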

Create Preliminary Taxonomies. Initially, we establish a preliminary taxonomy that groups the studies based on defined RQs, which provides a basic framework for organizing the studies in a meaningful and systematic manner. For example, for the first study, we create preliminary taxonomies for RQ1 through RQ6. After thoroughly addressing all RQs for a given study, we move on to the next study.

Iterative Refinement. Once the initial taxonomy is created, we expand and refine it as the analysis of each RQ proceeds across all subject studies. The authors assign each new paper to the preliminary taxonomy; if a paper does not fit any existing category, we create a new category that reflects its unique characteristics. To check the accuracy of the taxonomy, the second and third authors, who were not involved in creating it, randomly selected 20 papers and checked the assigned categories for discrepancies, marking every disagreement they found. All authors then discussed and resolved these disagreements. The initial disagreement rate was 30%; after a second round of review and cross-checking of the papers, all disagreements were eliminated.
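
The incremental procedure above (assign a paper to an existing category, or open a new one) and the agreement check can be sketched as follows. The function names and category labels are hypothetical. The rate computed here is simple per-paper disagreement; if the survey's 30% was computed this way over the 20 sampled papers, it would correspond to 6 papers, although the survey does not state the unit. Percent agreement also does not correct for chance agreement, unlike measures such as Cohen's kappa.

```python
def assign(taxonomy, paper_id, label):
    """Add a paper to a category, creating the category if needed.

    Returns True when a new category was created.
    """
    created = label not in taxonomy
    taxonomy.setdefault(label, []).append(paper_id)
    return created


def disagreement_rate(primary, reviewer):
    """Fraction of reviewer-checked papers whose category differs from the primary label.

    Both arguments map paper IDs to category labels.
    """
    checked = [pid for pid in reviewer if pid in primary]
    if not checked:
        raise ValueError("no overlapping papers to compare")
    disagreements = [pid for pid in checked if primary[pid] != reviewer[pid]]
    return len(disagreements) / len(checked), disagreements
```

Returning the disputed paper IDs alongside the rate gives the team the exact list to discuss in the resolution round.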

Resolving Disagreements. During the extraction process, if we encountered conflicting information or interpretations, we collaboratively discussed these discrepancies to reach a consensus. This collaborative effort ensured that our classification remained consistent and accurate. By following this rigorous methodology, we ensured that our taxonomy is grounded in detailed and systematic analyses of the literature. This approach provides a clear and coherent framework for classifying the selected papers and addressing each RQ comprehensively.

4 Results

In this section, we present our analyses and findings to address the RQs.

4.1 RQ1: What Is the Trend of Studies?

To understand the publication trend, we examined the publication dates and venues of the studies.

4.1.1 RQ1.1: What Is the Trend of Studies over Time?

Figure 3 shows the publication trend of software vulnerability detection studies published between January 2011 and June 2024. The number of publications has gradually increased over the years.

Fig. 3.

We also analyze the cumulative number of publications, shown in Figure 3. The slope of the curve fitted to this distribution increases markedly between 2020 and 2024, suggesting that using ML techniques for software vulnerability detection has become a prevalent trend since 2020.
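
The survey does not report the per-year counts behind Figure 3 or how the curve was fitted. The sketch below therefore assumes a simple approach: accumulate yearly counts and fit an ordinary least-squares line to the cumulative curve within a window of years; the slope is then roughly the average number of new publications per year in that window. The counts in the tests are toy values, not the survey's data.

```python
from itertools import accumulate


def cumulative(counts_by_year):
    """Return (sorted years, cumulative publication counts)."""
    years = sorted(counts_by_year)
    return years, list(accumulate(counts_by_year[y] for y in years))


def ols_slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two points")
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


def slope_between(counts_by_year, start, end):
    """Slope of the cumulative curve restricted to years start..end (inclusive)."""
    years, cum = cumulative(counts_by_year)
    pts = [(y, c) for y, c in zip(years, cum) if start <= y <= end]
    return ols_slope([p[0] for p in pts], [p[1] for p in pts])
```

Because the review period ends in June 2024, the 2024 count covers only half a year; a window that includes 2024 would understate the recent slope unless that partial year is accounted for.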

4.1.2 RQ1.2: What Is the Distribution of Publication Venues?

Overall, we reviewed 138 studies from various publication venues: 61 from conferences and symposiums and 77 from journals. Table 2 shows the distribution of studies per publication venue. Conferences and symposiums account for 44.2% of the studies, and journals for 55.8%. ICSE, ISSRE, MSR, and FSE are the conference venues with the most studies (9, 6, 5, and 5, respectively), whereas among journal venues, IST, C&S, and JSS have the most studies, with 13, 12, and 12, respectively.
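
As a consistency check on Table 2, the sketch below sums the per-venue counts transcribed from the table and computes the conference/journal split rounded to one decimal place. The variable and function names are illustrative.

```python
# Study counts per venue, transcribed from Table 2 in the order listed there.
CONFERENCE_COUNTS = [9, 6, 5, 5, 4, 3] + [2] * 7 + [1] * 15  # 28 conference venues
JOURNAL_COUNTS = [13, 12, 12, 6, 5, 4, 4, 3] + [2] * 3 + [1] * 12  # 23 journal venues


def venue_shares(conference_counts, journal_counts):
    """Return totals and percentage shares (one decimal place) per venue type."""
    conf, jour = sum(conference_counts), sum(journal_counts)
    total = conf + jour
    return {
        "conference": (conf, round(100 * conf / total, 1)),
        "journal": (jour, round(100 * jour / total, 1)),
        "total": total,
    }
```

The sums reproduce the table's Overall row (61 and 77 studies, 138 in total), and the rounded shares, 44.2% and 55.8%, add up to 100%.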

Table 2. Distribution of Publications Based on Conference and Journal Venues

Conference Venue	# Studies	References	Journal Venue	# Studies	References
ICSE	9	[11, 113, 129, 135, 146, 147, 150, 155, 170]	IST	13	[9, 10, 17, 30, 31, 108, 126, 127, 139, 149, 158, 175, 181]
ISSRE	6	[153, 169, 172, 179, 180, 185]	C&S	12	[36, 45, 47, 63, 68, 77, 131, 132, 138, 148, 152, 164]
MSR	5	[19, 38, 54, 56, 105]	JSS	12	[7, 8, 13, 16, 32, 91, 98, 106, 114, 136, 143, 171]
FSE	5	[78, 81, 103, 111, 184]	TDSC	6	[83, 84, 87, 95, 188, 189]
IJCAI	4	[23, 34, 96, 187]	TSE	5	[26, 82, 122, 151, 174]
ASE	3	[73, 110, 177]	TIFS	4	[58, 142, 154, 157]
NDSS	2	[85, 125]	ISA	4	[134, 159, 176, 182]
NeurIPS	2	[3, 183]	TOSEM	3	[20, 109, 190]
TrustCom	2	[93, 165]	TKDE	2	[76, 97]
OOPSLA	2	[79, 118]	IS	2	[41, 64]
CCS	2	[115, 163]	ESA	2	[92, 140]
ICLR	2	[28, 71]	CN	1	[178]
QRS	2	[74, 141]	TFS	1	[94]
USENIX	1	[162]	SQJ	1	[33]
MASCOTS	1	[37]	PL	1	[80]
KDDM	1	[107]	P&S	1	[75]
ISSTA	1	[22]	Nature	1	[60]
IJCNN	1	[48]	KBS	1	[186]
ICTAI	1	[117]	FGCS	1	[49]
ICECCS	1	[21]	EAAI	1	[144]
ICBD	1	[168]	CEE	1	[120]
GLOBECOM	1	[166]	BRA	1	[4]
DSAA	1	[104]	ASC	1	[57]
CDSN	1	[137]
CARS	1	[70]
SANER	1	[29]
ENTCC	1	[62]
MCSoC	1	[46]
Overall	61			77

4.2 RQ2: What Are the Characteristics of Software Vulnerability Detection Datasets?

In this section, we examine the data used in vulnerability detection studies and analyze their sources, data types, and data representations.

4.2.1 RQ2.1: What Is the Source of Datasets?

One of the main challenges in ML-based software vulnerability detection is the scarcity of data available for model training [19, 88], and research on how to obtain sufficient datasets for training such models remains limited. We therefore analyze the sources of the datasets used in the subject studies. Our analysis shows that these datasets fall into four broad categories: Benchmark, Hybrid, Open Source Software, and Repository sources. Hybrid sources are used by 39.1% of the subject studies, which combine several kinds of data, such as benchmarks, repositories, and open source projects, to obtain a comprehensive, multi-faceted resource for software vulnerability detection [36, 141]. Combining sources yields richer and more diverse information, which is critical for building and validating robust vulnerability detection systems. Benchmark datasets, used by 37.6% of the subject studies, play a crucial role by providing standardized, high-quality data that researchers can use to evaluate and compare the effectiveness of their detection techniques [127, 159]. Benchmarks make it easier to build ML models for software vulnerability detection, but they may not include zero-day vulnerabilities, which can have a significant impact.

In 13.7% of the subject studies, datasets are collected from online repositories, which we classify as the Repository category. These datasets are gathered from publicly available projects hosted on sites such as GitHub or Stack Overflow [28, 118, 184], which hold a wealth of data, including source code, commit histories, issue trackers, and documentation. Repositories keep detailed records of every change made to a codebase, such as commit messages, diffs, and timestamps [101]. This history enables researchers to trace the lifecycle of a vulnerability from its introduction to its resolution (see, e.g., Iannone et al. [61]). The fourth source, open source software, is used by 9.4% of the subject studies and provides a rich and diverse source of data for software vulnerability detection [126, 163]. These projects are publicly accessible and typically have a large community of contributors who continuously update and maintain the code; examples include FFmpeg, QEMU, OpenSSH, and LibTIFF. Because these projects are open, they are often inspected carefully by numerous developers, which can lead to the discovery and documentation of various vulnerabilities.
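
Tracing a vulnerability's lifecycle through repository history usually starts from the commit that fixed it. The sketch below flags candidate vulnerability-fixing commits from their messages, either because they cite a CVE ID (format CVE-YYYY-NNNN, with a sequence number of four or more digits) or because they match a keyword heuristic. The keyword lists are hypothetical, and locating the commit that introduced the vulnerability (e.g., with SZZ-style blame analysis) is not shown.

```python
import re

CVE_ID = re.compile(r"\bCVE-(\d{4})-(\d{4,})\b", re.IGNORECASE)
# Hypothetical keyword lists for fixes that do not cite a CVE ID.
FIX_KEYWORDS = ("fix", "patch", "prevent")
SECURITY_TERMS = ("overflow", "use after free", "out of bounds", "injection", "vulnerab")


def classify_commit(message):
    """Return the CVE IDs cited in a commit message and whether it looks like a fix."""
    ids = sorted({f"CVE-{year}-{seq}" for year, seq in CVE_ID.findall(message)})
    # Normalize hyphens so "use-after-free" matches "use after free".
    text = message.lower().replace("-", " ")
    heuristic = any(k in text for k in FIX_KEYWORDS) and any(t in text for t in SECURITY_TERMS)
    return {"cve_ids": ids, "candidate_fix": bool(ids) or heuristic}
```

Substring matching is deliberately loose (for instance, “fix” also matches “prefix”), so flagged commits are candidates for manual review rather than ground-truth labels.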

Table 3 shows the detailed distribution of the benchmark data used in the subject studies. SARD and NVD are the most widely used sources in the Benchmark category. SARD is a large collection of test cases created specifically for testing software systems. It was developed by NIST¹¹ as part of its efforts to improve the quality and safety of software systems, and it offers a wide range of synthetic and real-world test cases that reflect many kinds of software vulnerabilities. The other major source of benchmark data is NVD, a comprehensive repository of publicly disclosed software vulnerabilities. NVD entries are based on the CVE system, which provides standardized identifiers and descriptions for each vulnerability; CVE IDs are assigned by CVE Numbering Authorities¹² and are the cornerstone of NVD. Each NVD entry includes detailed information about the vulnerability, such as its description, its severity (scored with the Common Vulnerability Scoring System), the affected software versions, references to related advisories, and mitigation advice. Smartbugs Wild¹³, a smart contract dataset, is the third most commonly used benchmark (12 studies). It contains more than 47K real-world smart contracts mined from the Ethereum main network, which makes it useful for testing and assessing vulnerability detection techniques for smart contracts. Note that a key factor in the validity of a benchmark dataset is continuous updating: as vulnerabilities evolve and more zero-day vulnerabilities emerge, benchmarks must be updated to reflect the latest software vulnerability patterns. This is one reason why researchers do not rely solely on benchmark data when building ML models.
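
NVD entries carry CVSS base scores. When severity is needed as a label or a filter (a hypothetical use here; the survey does not say how the subject studies use it), a numeric base score can be mapped to the qualitative rating scale defined in the CVSS v3.x specification: None (0.0), Low (0.1 to 3.9), Medium (4.0 to 6.9), High (7.0 to 8.9), and Critical (9.0 to 10.0). Other CVSS versions are not handled by this sketch.

```python
def cvss_v3_severity(score):
    """Map a CVSS v3.x base score (0.0 to 10.0) to its qualitative severity rating."""
    if not 0.0 <= score <= 10.0:
        raise ValueError(f"CVSS base scores range from 0.0 to 10.0, got {score}")
    if score == 0.0:
        return "None"
    # Base scores have one decimal place, so "< 4.0" covers 0.1 to 3.9, and so on.
    if score < 4.0:
        return "Low"
    if score < 7.0:
        return "Medium"
    if score < 9.0:
        return "High"
    return "Critical"
```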

Table 3. Detailed Distribution of Benchmark Sources

No.	Source	# Studies	References
1	SARD	33	[9, 11, 20, 21, 31, 34, 36, 37, 47, 49, 63, 68, 83, 84, 85, 87, 95, 137, 138, 139, 140, 141, 142, 143, 148, 152, 155, 157, 158, 165, 179, 180, 189]
2	NVD	32	[10, 11, 19, 30, 31, 32, 36, 37, 49, 54, 63, 68, 70, 73, 83, 84, 85, 94, 95, 113, 132, 137, 142, 143, 155, 157, 158, 159, 174, 179, 180, 189]
3	Smartbugs Wild	12	[7, 13, 57, 91, 103, 104, 105, 134, 153, 169, 177, 185]
4	Big-Vul	8	[32, 38, 81, 98, 108, 110, 129, 188]
5	ReVeal	6	[78, 98, 140, 150, 151, 174]
6	Juliet Test Suite	5	[23, 29, 75, 148, 164]
7	ESC	5	[76, 96, 97, 169, 187]
8	D2A	5	[22, 29, 127, 140, 174]
9	SolidiFi-benchmark	5	[7, 103, 104, 105, 134]
10	Fan et al.	4	[22, 78, 150, 151]
11	VulDeePecker	4	[16, 17, 140, 190]
12	VSC	4	[76, 96, 97, 187]
13	NDSS	3	[71, 75, 107]
14	PROMISE	3	[74, 146, 172]
15	FUNDED	2	[60, 174]
16	F-Droid	2	[26, 122]
17	Android/iOS	2	[26, 122]
18	SySeVR	2	[17, 190]
19	Others	25	[7, 32, 33, 41, 46, 48, 57, 60, 64, 70, 73, 81, 82, 95, 129, 134, 135, 136, 140, 142, 164, 166, 174, 181, 182]
–	Unique Total	99	–

Table 4 shows the detailed distribution of the Repository data sources. GitHub is the most popular source of data for software vulnerability detection, used by 27 subject studies. One benefit of GitHub as a data source is that it provides access to real-world code written by developers, which can be used to train and test vulnerability detection models. The second most common repository source (20 studies) is the CVE system, a widely recognized framework for identifying, cataloging, and referencing publicly disclosed vulnerabilities. Each vulnerability in the CVE system receives a unique identifier, known as a CVE ID (e.g., CVE-2023-33976), which makes it easy to reference and communicate about the vulnerability across platforms and tools. CVE entries describe each vulnerability in detail, outlining the nature of the issue, the affected software, and its potential impact. The third most common repository source (13 studies) is Etherscan,¹⁴ a popular blockchain explorer for Ethereum. Etherscan provides extensive information about Ethereum transactions, addresses, tokens, and smart contracts, including a deployed contract’s source code (if verified), its transactions, and its execution history. Users can access the complete transaction history of a smart contract, including function calls, input parameters, and transaction results.

Table 4.

No.	Source	# Studies	References
1	GitHub	27	[10, 11, 19, 20, 28, 48, 54, 73, 79, 80, 93, 94, 106, 111, 113, 114, 115, 118, 120, 132, 142, 147, 149, 159, 175, 176, 184]
2	CVE	20	[9, 11, 19, 38, 47, 58, 60, 75, 87, 94, 113, 115, 131, 132, 141, 147, 152, 154, 174, 176]
3	Etherscan	13	[4, 7, 8, 58, 62, 82, 120, 125, 135, 168, 171, 176, 178]
4	Bugzilla	4	[19, 114, 166, 184]
5	Jira	3	[19, 80, 184]
6	PyPI	1	[3]
–	Unique Total	51	–

Table 4. Detailed Distribution of Repositories Used for Collecting Data

4.2.2 RQ2.2: What Are the Most Commonly Used Data Types?

When it comes to detecting software vulnerabilities, datasets can have varying data types. Existing software vulnerability detection models, for example, can find vulnerabilities in source code or commits. It is crucial to carefully examine these data types, as they require different preprocessing techniques and must be represented differently when using ML models. Additionally, distinct data types necessitate different architectural approaches for ML models. This section provides an overview of the various data types and their distributions. We classified the data types of the employed datasets into four broad categories: Code, Text, Numerical, and Hybrid.

The majority of the subject studies (92.7%) primarily focus on analyzing source code for software vulnerability detection, underscoring the importance of code-level analysis in identifying vulnerabilities. Repository-level data, such as textual reports and logs, account for 1.4%, whereas commit characteristics (numerical data) account for 2.8%. Additionally, 2.8% of the studies adopt a hybrid approach, combining both code-level analysis and repository-level data.

Table 5 details the data type categories used in the subject studies. In total, 128 subject studies use code-based data, and the dominant data type in this category is source code, used by 108 studies [34, 179]. Binary code is the second most common data type in the code-based category, used in 18 subject studies [58, 117].

Table 5.

Category	Data Type	# Studies	Total	References
Code-based	Source code	108	128	[3, 7, 8, 9, 10, 11, 13, 16, 20, 21, 22, 23, 26, 28, 29, 31, 32, 34, 36, 37, 38, 41, 47, 48, 49, 54, 60, 63, 64, 68, 70, 73, 74, 77, 78, 79, 80, 81, 82, 83, 84, 85, 87, 91, 92, 93, 94, 95, 96, 97, 98, 103, 104, 106, 108, 109, 110, 113, 118, 120, 122, 126, 127, 129, 131, 132, 134, 135, 136, 137, 138, 140, 141, 142, 143, 144, 146, 147, 149, 150, 151, 152, 153, 154, 157, 158, 159, 162, 163, 165, 169, 170, 171, 172, 174, 175, 176, 177, 179, 180, 181, 183, 185, 186, 187, 188, 189, 190]
	Binary code	18		[4, 45, 46, 57, 58, 62, 71, 75, 105, 107, 117, 125, 139, 148, 164, 168, 178, 182]
	Image	2		[76, 155]
Hybrid	–	4	4	[17, 19, 30, 56]
Commit Metrics	–	4	4	[111, 114, 115, 166]
Text	–	2	2	[33, 184]
Unique Total	–	–	138	–

Table 5. Detailed Data Types Used in the Subject Studies

4.2.3 RQ2.3: What Are the Most Commonly Used Input Representations?

As noted in earlier sections, research on software vulnerability detection relies on diverse data sources and data types. This variability calls for different representation strategies, architectural approaches, and design assumptions for ML models.

We classified the input representations of the employed datasets into five broad categories: Graph, Token, Tree, Commit Metrics, and Hybrid. The most popular input representation is Graph, used by 57.2% of the subject studies. Token is second, used by a substantial portion (24.6%) of the subject studies. Tree representation is the third most common approach, at 11.5%. The Commit Metrics and Hybrid categories are the least common, accounting for 2.8% and 2.1% of the subject studies, respectively. The following paragraphs describe each category in detail.

Graph/Tree-Based Representation [63, 126]. This type of representation enables the detection of complex patterns and relationships between code elements. By representing source code as a graph or tree, we can capture not only the syntax and structure of the code but also its semantics, control flow, and dataflow. Widely used graph/tree-based representations include the AST (Abstract Syntax Tree) [100, 161] and the CPG (Code Property Graph) [34, 41, 183], into which the source code is transformed before it is passed to the model.
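As a concrete illustration of the tree-based case, the sketch below uses Python's standard `ast` module to parse a one-line snippet and flatten its Abstract Syntax Tree into a list of node labels plus parent-to-child edges, the form a tree or graph model typically consumes. It is a minimal sketch under assumptions: most subject studies analyze C/C++ rather than Python, the helper name `ast_to_graph` is hypothetical, and a CPG would additionally add control-flow and data-flow edges that this example omits.

```python
import ast

def ast_to_graph(source: str):
    """Flatten a Python AST into (node_labels, edges).

    node_labels[i] is the node type of node i (identifier names are appended
    for Name nodes); edges holds (parent, child) index pairs, i.e. syntactic
    containment only. Load/Store context markers are skipped as noise.
    """
    labels, edges = [], []

    def visit(node, parent):
        idx = len(labels)
        label = type(node).__name__
        if isinstance(node, ast.Name):
            label += ":" + node.id
        labels.append(label)
        if parent is not None:
            edges.append((parent, idx))
        for child in ast.iter_child_nodes(node):
            if not isinstance(child, ast.expr_context):
                visit(child, idx)

    visit(ast.parse(source), None)
    return labels, edges

labels, edges = ast_to_graph("buf = data[i]")
# On Python 3.9+:
print(labels)  # ['Module', 'Assign', 'Name:buf', 'Subscript', 'Name:data', 'Name:i']
print(edges)   # [(0, 1), (1, 2), (1, 3), (3, 4), (3, 5)]
```

Because every node except the root has exactly one parent, the edge list of a tree always has one entry fewer than the node list; graph representations such as CPGs break this property by adding extra edge types.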

Token-Based Representation [45, 140]. This type treats source code as a sequence of string tokens and transforms it into token vectors. Tokenization breaks a string of text or source code into smaller units (tokens) that serve as the basis for further analysis; for source code, tokens include keywords, operators, variables, and other elements of the programming language syntax. The tokens are then turned into numerical vectors that ML algorithms can process.
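The tokenization and vectorization steps can be sketched as follows. The regular expression is a simplified, hypothetical lexer for C-like code (it ignores comments, string literals, and many operators), and the names `tokenize`, `build_vocab`, and `encode` are illustrative; real pipelines use a proper lexer and often normalize identifiers before building the vocabulary.

```python
import re

# Hypothetical lexer for C-like code: identifiers/keywords, integer literals,
# a few multi-character operators, then any other single non-space character.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|==|!=|<=|>=|->|\+\+|--|&&|\|\||\S")

def tokenize(code: str) -> list[str]:
    return TOKEN_RE.findall(code)

def build_vocab(token_seqs) -> dict[str, int]:
    """Give every distinct token an integer id; id 0 is reserved for unknown tokens."""
    vocab = {"<unk>": 0}
    for seq in token_seqs:
        for tok in seq:
            vocab.setdefault(tok, len(vocab))
    return vocab

def encode(tokens, vocab) -> list[int]:
    return [vocab.get(tok, 0) for tok in tokens]

toks = tokenize("strcpy(buf, input);")
vocab = build_vocab([toks])
print(toks)                 # ['strcpy', '(', 'buf', ',', 'input', ')', ';']
print(encode(toks, vocab))  # [1, 2, 3, 4, 5, 6, 7]
```

The integer ids are only an intermediate step: an embedding layer (discussed under RQ2.4) maps each id to a dense vector before the sequence reaches the model.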

Commit Metrics [114, 115]. This type represents code commits using metrics extracted from them. Commit-level features, such as the number of code changes, the number of modified lines, and the programming language used, can be used to train ML models. These models can then learn associations between commit attributes and the presence of vulnerabilities, allowing new commits to be flagged automatically as potentially vulnerable.
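A minimal sketch of commit-metric extraction, assuming the commit is available as unified-diff text (as printed by `git diff`). The feature set below (files changed, lines added and deleted, file extensions as a rough proxy for the languages touched) is a hypothetical example, not the feature set of any particular subject study.

```python
import os

def commit_metrics(diff_text: str) -> dict:
    """Extract a few commit-level features from a unified diff.

    Simplification: assumes well-formed `git diff` output and counts every
    line starting with '+' or '-' (other than the file headers) as changed.
    """
    files, added, deleted = set(), 0, 0
    for line in diff_text.splitlines():
        if line.startswith("+++ "):
            path = line[4:]
            if path != "/dev/null":
                files.add(path[2:] if path.startswith("b/") else path)
        elif line.startswith("--- "):
            continue  # old-file header, not a deleted line
        elif line.startswith("+"):
            added += 1
        elif line.startswith("-"):
            deleted += 1
    return {
        "files_changed": len(files),
        "lines_added": added,
        "lines_deleted": deleted,
        "lines_modified": added + deleted,
        "file_extensions": sorted({os.path.splitext(f)[1] for f in files}),
    }

# Hypothetical commit adding a length check before memcpy.
SAMPLE_DIFF = "\n".join([
    "diff --git a/src/parse.c b/src/parse.c",
    "--- a/src/parse.c",
    "+++ b/src/parse.c",
    "@@ -10,2 +10,3 @@",
    "-    memcpy(dst, src, len);",
    "+    if (len > sizeof(dst)) return -1;",
    "+    memcpy(dst, src, len);",
    "     return 0;",
])

metrics = commit_metrics(SAMPLE_DIFF)
print(metrics)
```

The numeric fields would then form one row of the feature matrix for a classifier such as those listed in Table 8.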

Hybrid Representation [19, 30]. This type combines several representations to discover software security vulnerabilities. Combining diverse representations of the input yields a richer and more comprehensive view of the source code, which can improve the performance of vulnerability prediction and detection techniques.

Table 6 breaks down the representation techniques by the artifact (source code, binary code, or text) to which they are applied. Graph/Tree-based representation is clearly the most prevalent technique, employed by 96 studies in total. These studies represent the input to ML models in four forms: source code as a graph, source code as a tree, binary code as a graph, and binary code as a tree. Source code as a graph is the predominant representation, used by 71 studies. In addition, 33 subject studies employ a token-based representation: 23 represent source code as a token sequence, 9 model binary code as tokens, and 2 represent text as token sequences.

Table 6.

Category	Artifact	# Studies	Total	References
Graph/Tree	Source code as a graph	71	96	[3, 7, 8, 9, 10, 11, 13, 20, 21, 29, 31, 34, 36, 38, 41, 47, 49, 60, 63, 64, 68, 70, 77, 78, 79, 81, 82, 84, 85, 91, 92, 93, 95, 96, 97, 103, 104, 106, 110, 120, 126, 127, 129, 132, 134, 136, 138, 141, 143, 144, 147, 150, 151, 152, 153, 154, 157, 158, 165, 170, 172, 174, 175, 179, 180, 181, 183, 185, 187, 188, 189]
	Source code as a tree	15		[22, 28, 32, 74, 80, 83, 87, 94, 98, 142, 146, 159, 176, 177, 186]
	Binary code as a graph	8		[46, 58, 75, 105, 117, 139, 148, 178]
	Binary code as a tree	1		[4]
Token	Source code as tokens	23	33	[16, 23, 26, 37, 48, 54, 56, 73, 108, 109, 113, 118, 122, 131, 135, 137, 140, 149, 162, 163, 169, 171, 190]
	Binary code as tokens	9		[45, 57, 62, 71, 107, 125, 164, 168, 182]
	Text as tokens	2		[33, 184]
Commit Metrics	–	4	4	[111, 114, 115, 166]
Hybrid	–	3	3	[17, 19, 30]
Image	–	2	2	[76, 155]
Unique Total	–	–	138	–

Table 6. Distribution of Input Representations in the Subject Studies

Figure 4 shows how the input representations used in software vulnerability detection studies are distributed over time. Graph-based representation clearly dominates the other input representation techniques. A likely reason is that graphs provide a natural and intuitive way to represent the structural relationships within source code: by modeling code as a graph, the relationships between functions, classes, methods, and variables can be captured effectively. Token-based representation has also gained popularity, peaking in 2023, probably because it offers a fine-grained view of the code while simplifying analysis; reducing code to a sequence of tokens makes it straightforward to apply ML models.

Fig. 4.

4.2.4 RQ2.4: What Are the Most Commonly Used Embedding Approaches?

In this section, we examine the embedding methods that convert the representations explored in the previous section into inputs that ML models can consume. Those representations are symbolic (e.g., token strings, trees, and graphs) and cannot be processed directly by ML models, so researchers apply embedding approaches that translate them into numerical form. We discuss these embedding techniques in the following paragraphs.

Graph Embedding (32.6%) [97, 117]. This is the most commonly used embedding technique among the subject studies (32.6%). It is mostly used with graph neural networks because it can capture the structural relationships between different code components.
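To show what a graph embedding computes, the sketch below applies the graph convolution propagation rule used by GCNs (the most frequent graph model in Table 7) to a toy four-node code graph and mean-pools the node vectors into a single graph-level embedding. The graph, the feature sizes, and the random (untrained) weights are assumptions for illustration; in practice the weights are learned and the node features are derived from the code itself.

```python
import numpy as np

def gcn_layer(adj, features, weight):
    """One GCN-style propagation step: ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    a_hat = adj + np.eye(adj.shape[0])            # add self-loops
    d_inv_sqrt = np.diag(1.0 / np.sqrt(a_hat.sum(axis=1)))
    norm_adj = d_inv_sqrt @ a_hat @ d_inv_sqrt    # symmetric normalization
    return np.maximum(norm_adj @ features @ weight, 0.0)

def graph_embedding(adj, features, weights):
    """Stack GCN layers, then mean-pool node vectors into one graph vector."""
    h = features
    for w in weights:
        h = gcn_layer(adj, h, w)
    return h.mean(axis=0)

# Toy code graph: node 1 (e.g., a call statement) linked to nodes 0, 2, 3;
# edges are treated as undirected for simplicity.
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 1],
                [0, 1, 0, 0],
                [0, 1, 0, 0]], dtype=float)
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                       # hypothetical 8-dim node features
weights = [rng.normal(size=(8, 16)), rng.normal(size=(16, 16))]
emb = graph_embedding(adj, x, weights)
print(emb.shape)  # (16,)
```

Each layer lets a node mix in information from its neighbors, so two layers let information travel two hops along the code graph before pooling.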

Token Vector Embedding (29.7%) [79, 190]. This is the second most popular technique, used in 29.7% of the examined papers. The input is converted into a sequence of tokens, each token is mapped to a numeric representation (typically a vector), and the resulting sequence is fed into ML models for training.
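A token vector embedding is, at its core, a lookup table with one row per vocabulary entry. The sketch below assumes a toy vocabulary and a randomly initialized table (in practice the table is learned jointly with the model or pretrained, for example with word2vec); the function name `embed_tokens` is hypothetical.

```python
import numpy as np

def embed_tokens(token_ids, table):
    """Look up one row of the embedding table per token id -> (seq_len, dim)."""
    return table[np.asarray(token_ids)]

vocab = {"<unk>": 0, "strcpy": 1, "(": 2, "buf": 3, ",": 4, "input": 5, ")": 6, ";": 7}
dim = 4
rng = np.random.default_rng(42)
# Random stand-in for a learned embedding table: one row per vocabulary entry.
table = rng.normal(scale=0.1, size=(len(vocab), dim))

ids = [vocab.get(t, 0) for t in ["strcpy", "(", "buf", ",", "input", ")", ";"]]
seq = embed_tokens(ids, table)
print(seq.shape)  # (7, 4): one vector per token, ready for an RNN or CNN
```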

Hybrid (16.6%) [19, 41]. We find that 16.6% of the subject studies use multiple embedding techniques to convert inputs for ML models. Different embedding techniques capture different aspects of the data, so by combining several techniques researchers can leverage the complementary information each one provides. For example, some embedding techniques may focus on syntax, whereas others may capture semantic or contextual information.
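One simple way to combine embeddings, shown here only as an assumed example, is to pool each view into a fixed-size vector, normalize each part, and concatenate them. Other fusion strategies exist, so `hybrid_embedding` is a hypothetical helper and not the method of [19] or [41].

```python
import numpy as np

def l2_normalize(v, eps=1e-12):
    return v / (np.linalg.norm(v) + eps)

def hybrid_embedding(token_vecs, graph_vec):
    """Fuse a sequence view and a structural view of the same function.

    token_vecs: (seq_len, d1) token embeddings, mean-pooled to (d1,)
    graph_vec:  (d2,) graph-level embedding
    Each part is L2-normalized before concatenation so that neither view
    dominates the other merely because of its scale.
    """
    token_part = l2_normalize(token_vecs.mean(axis=0))
    graph_part = l2_normalize(graph_vec)
    return np.concatenate([token_part, graph_part])

rng = np.random.default_rng(1)
fused = hybrid_embedding(rng.normal(size=(7, 4)), rng.normal(size=16))
print(fused.shape)  # (20,)
```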

Transformer Embedding (7.2%) [48, 153]. Transformer embeddings are used in 7.2% of the subject studies. Despite this lower prevalence, Transformers are notable because their strong capabilities in natural language processing can be adapted to code analysis.
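The mechanism that lets Transformers build context-aware embeddings is self-attention. The sketch below implements single-head scaled dot-product self-attention over a sequence of token vectors, using random projection matrices as stand-ins for learned weights. It omits positional encodings, multiple heads, and the feed-forward layers of a full Transformer; pretrained models such as CodeBERT or GraphCodeBERT (Table 7) would supply learned weights instead.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    x: (seq_len, d_model). Returns (outputs, weights): each output row is a
    weighted mix of all value vectors, so every token embedding becomes
    aware of the rest of the sequence.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(7, 8))                   # e.g. 7 token embeddings of width 8
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
out, attn = self_attention(x, w_q, w_k, w_v)
print(out.shape, attn.shape)  # (7, 8) (7, 7)
```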

Others (13.7%) [126, 146, 163]. The remaining 13.7% of the studies use embedding techniques that appear rarely and fit none of the groups above; we classify them as Others.

4.3 RQ3: What Is the Distribution of ML and DL Models Used for Software Vulnerability Detection?

In this section, we provide detailed information about the ML models used for software vulnerability detection. We first analyze how model usage is distributed across the subject studies and then examine how the usage of specific DL models has evolved over time. We do not analyze the distribution of classic ML models in the same depth, since they are far less prevalent than DL models; however, we provide a comprehensive list of the classic ML models commonly used in the subject studies.

The majority of studies (88.4%) use DL models for software vulnerability detection [82, 127, 159], whereas only 7.2% of the studies use classic ML models [19, 107, 184]. Some of the subject studies also use a combination of DL and ML models, accounting for 1.4% of studies. The remaining (2.8%) are classified as Others.

Figure 5 illustrates the usage trend of DL models for detecting software vulnerabilities from 2016 to 2024. DL models first appear among the subject studies in 2016, and since then the use of RNNs for vulnerability detection has trended upward. The figure also shows a rising trend in the use of GNNs for vulnerability detection from 2021 to 2024. One possible explanation is that GNNs can exploit the structural information (e.g., control and data flow) encoded in graph representations of source code, which sequential models such as RNNs do not capture directly, and can therefore learn more meaningful semantic representations of the input.

Fig. 5.

Table 7 shows the distribution of DL models used in the subject studies. Recurrent models are the most commonly used class of DL models for software vulnerability detection (65 studies). In this category, BiLSTM is the most frequently used model, appearing in 20 studies, followed by GRU (14 studies) and LSTM (13 studies). Graph models are the second most widely used class (63 studies). GCN is the most prevalent graph model, appearing in 22 studies, followed by GNN, GGNN, and GAT in 13, 9, and 8 subject studies, respectively. The prevalence of these models highlights the importance of capturing graph structures and the relationships between code elements for vulnerability detection. Convolutional models are used in 19 studies. Although they are less prevalent than recurrent or graph models, CNNs are still considered effective at capturing local patterns and features in vulnerability detection tasks.
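The local-pattern argument for CNNs can be made concrete with a TextCNN-style feature extractor: each filter slides over windows of consecutive token embeddings, and max-pooling keeps each filter's strongest response anywhere in the sequence. This is a minimal sketch with random, untrained filters and hypothetical sizes; real models use several filter widths, learned weights, and a classifier on top of the pooled features.

```python
import numpy as np

def conv1d_maxpool(seq, filters, bias):
    """TextCNN-style feature extractor (assumes seq_len >= filter width).

    seq:     (seq_len, dim) token embeddings
    filters: (n_filters, width, dim); each filter spans `width` consecutive tokens
    bias:    (n_filters,)
    Returns (n_filters,): each filter's strongest ReLU response over all positions.
    """
    n_filters, width, _ = filters.shape
    n_windows = seq.shape[0] - width + 1
    responses = np.empty((n_windows, n_filters))
    for i in range(n_windows):
        window = seq[i:i + width]   # (width, dim): a local n-gram of tokens
        responses[i] = np.tensordot(filters, window, axes=([1, 2], [0, 1])) + bias
    responses = np.maximum(responses, 0.0)   # ReLU
    return responses.max(axis=0)             # max-pool over positions

rng = np.random.default_rng(3)
seq = rng.normal(size=(7, 4))            # 7 token embeddings of width 4
filters = rng.normal(size=(3, 3, 4))     # 3 hypothetical filters of width 3
feats = conv1d_maxpool(seq, filters, np.zeros(3))
print(feats.shape)  # (3,)
```

Because of the max-pooling step, a filter that responds to a suspicious token pattern fires regardless of where in the function that pattern occurs.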

Table 7.

Category	Model Name	# Studies	Total	References
Recurrent Models	BiLSTM	20	65	[47, 57, 63, 64, 83, 85, 87, 113, 120, 131, 136, 139, 148, 159, 168, 176, 182, 185, 189, 190]
	GRU	14		[47, 54, 63, 73, 76, 77, 78, 79, 125, 139, 142, 144, 147, 181]
	LSTM	13		[26, 28, 63, 82, 94, 95, 125, 139, 149, 152, 158, 181, 186]
	BGRU	10		[32, 63, 68, 75, 83, 84, 138, 139, 164, 190]
	TreeLSTM	3		[8, 78, 147]
	RNN	3		[37, 139, 154]
	BRNN	2		[109, 139]
Graph Models	GCN	22	63	[7, 8, 20, 21, 31, 41, 46, 75, 78, 81, 91, 97, 104, 110, 127, 136, 147, 150, 172, 179, 181, 187]
	GNN	13		[3, 8, 10, 11, 20, 28, 105, 106, 136, 143, 151, 152, 183]
	GGNN	9		[29, 31, 36, 77, 92, 129, 142, 154, 188]
	GAT	8		[20, 38, 41, 46, 110, 174, 175, 178]
	RGCN	4		[13, 31, 158, 180]
	HGNN	1		[103]
	RGAT	1		[30]
	DGCNN	1		[117]
	HGCN	1		[68]
	GCL	1		[144]
	BGNN	1		[10]
	GGRU	1		[157]
Convolutional Models	CNN	11	19	[17, 37, 48, 56, 73, 74, 79, 137, 138, 155, 190]
	TextCNN	6		[9, 64, 132, 141, 164, 176]
	TextRCNN	1		[30]
	QCNN	1		[60]
General Models	FCN	2	13	[81, 170]
	TCN	2		[16, 17]
	Autoencoders	1		[71]
	Memory Neural Network	1		[23]
	GAN	1		[109]
	Feed Forward	1		[118]
	Representation Learning	1		[108]
	DRSN	1		[16]
	DCN	1		[76]
	Others	1		[4]
	DBN	1		[146]
Transformers	BERT	2	9	[93, 134]
	GraphCodeBERT	1		[153]
	CodeBERT	1		[111]
	HGT	1		[165]
	GPT-4	1		[98]
	GPT-3.5-turbo	1		[135]
	Code-T5	1		[169]
	Transformer Encoder	1		[171]
Attention Models	–	8	8	[22, 34, 49, 62, 80, 96, 140, 177]
Unique Total	–	–	124	–

Table 7. Distribution of DL Models in the Subject Studies

Table 8 shows the distribution of classic ML models used in the subject studies. Random Forest is the most frequently used, appearing in seven studies, followed by Naive Bayes, SVM, and KNN with 5, 4, and 4 occurrences, respectively. Random Forest is an ensemble learning method that builds multiple decision trees and combines their outputs into a final prediction; this ensemble approach improves the robustness and accuracy of detection, which makes it effective for detecting software vulnerabilities. Naive Bayes is popular because it is computationally efficient and easy to implement. It requires less training data than more complex algorithms and is fast in both the training and prediction phases [2, 50, 51].
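To illustrate why Naive Bayes is cheap to train, the sketch below implements a multinomial Naive Bayes classifier with add-one (Laplace) smoothing over code tokens; training is a single counting pass. The four labeled snippets are a hypothetical toy dataset, not data from the subject studies, and a real detector would need far richer features and training data.

```python
import math
from collections import Counter

def train_nb(docs, labels, alpha=1.0):
    """Multinomial Naive Bayes with add-alpha smoothing.

    docs: list of token lists; labels: parallel list of class labels.
    Returns (classes, log_priors, log_likelihoods).
    """
    classes = sorted(set(labels))
    vocab = {tok for doc in docs for tok in doc}
    log_prior = {c: math.log(labels.count(c) / len(labels)) for c in classes}
    counts = {c: Counter() for c in classes}
    for doc, c in zip(docs, labels):
        counts[c].update(doc)
    log_lik = {}
    for c in classes:
        denom = sum(counts[c].values()) + alpha * len(vocab)
        log_lik[c] = {t: math.log((counts[c][t] + alpha) / denom) for t in vocab}
    return classes, log_prior, log_lik

def predict_nb(model, doc):
    classes, log_prior, log_lik = model
    def score(c):
        # Tokens never seen during training are ignored.
        return log_prior[c] + sum(log_lik[c][t] for t in doc if t in log_lik[c])
    return max(classes, key=score)

# Hypothetical toy data: unbounded copies labeled vulnerable, bounded ones safe.
docs = [["strcpy", "(", "buf", ",", "input", ")"],
        ["gets", "(", "buf", ")"],
        ["strncpy", "(", "buf", ",", "input", ",", "n", ")"],
        ["fgets", "(", "buf", ",", "n", ",", "stdin", ")"]]
labels = ["vulnerable", "vulnerable", "safe", "safe"]
model = train_nb(docs, labels)
print(predict_nb(model, ["strcpy", "(", "dst", ",", "src", ")"]))  # vulnerable
```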

Table 8.

Category	Model Name	# Studies	Total	References
Classic ML Models	Random Forest	7	38	[19, 70, 94, 114, 122, 166, 184]
	Naive Bayes	5		[19, 70, 114, 122, 184]
	SVM	4		[19, 115, 122, 184]
	K-NN	4		[19, 95, 122, 184]
	Logistic Regression	3		[70, 114, 184]
	AdaBoost	3		[19, 33, 184]
	Decision Tree	2		[70, 122]
	Gradient Boosting	2		[19, 184]
	PCA	1		[162]
	Kernel Machine	1		[107]
	ADTree	1		[114]
	TAN	1		[70]
	Gradient Boosting Classifier	1		[33]
	SGDClassifier	1		[33]
	AdaBoostClassifier	1		[33]
	TrAdaBoost	1		[33]
Distance/Similarity Measures	–	3	3	[45, 58, 163]
Language Models	N-gram	1	1	[126]
Unique Total	–	–	14	–

Table 8. Distribution of Classic ML and Other Models in the Subject Studies

Table 8 also shows one study that uses an n-gram model for software vulnerability detection [126]. N-gram models capture local context through word-sequence probabilities: an n-gram model predicts the likelihood of a word from the preceding n − 1 words, which describes the local structure of a language well [18, 123]. N-gram models are therefore effective at identifying patterns within sequences of tokens (e.g., words, characters, or code elements). Applied to code, an n-gram model can be trained on large codebases to learn the typical sequences of code elements.
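The sketch below builds a bigram model (n = 2, so each token is predicted from the single preceding token) with add-one smoothing and scores sequences by their mean log-probability. Using a low score to flag unusual code is an assumed usage for illustration; the training sequence is hypothetical, and this is not the model of [126].

```python
import math
from collections import Counter

class BigramModel:
    """Bigram (n = 2) language model over code tokens with add-one smoothing:
    P(tok | prev) = (count(prev, tok) + 1) / (count(prev) + V),
    where count(prev) counts `prev` as a context and V is the vocabulary size.
    """

    def __init__(self, sequences):
        self.vocab = {tok for seq in sequences for tok in seq}
        self.context_counts, self.bigram_counts = Counter(), Counter()
        for seq in sequences:
            padded = ["<s>"] + list(seq)     # <s> marks the start of a sequence
            self.context_counts.update(padded[:-1])
            self.bigram_counts.update(zip(padded, padded[1:]))

    def prob(self, prev, tok):
        return (self.bigram_counts[(prev, tok)] + 1) / (
            self.context_counts[prev] + len(self.vocab))

    def avg_log_prob(self, seq):
        """Mean per-token log-probability; lower means less typical code."""
        padded = ["<s>"] + list(seq)
        logs = [math.log(self.prob(p, t)) for p, t in zip(padded, padded[1:])]
        return sum(logs) / len(logs)

train = [["if", "(", "len", "<", "size", ")", "memcpy", "(", "dst", ",",
          "src", ",", "len", ")", ";"]] * 3
model = BigramModel(train)
typical = train[0]
unusual = [";", ")", "len", "(", "memcpy"]
print(model.avg_log_prob(typical) > model.avg_log_prob(unusual))  # True
```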

4.3.1 Comparison of ML Models with Manual Code Analysis.

ML models offer substantial advantages over conventional manual code analysis for software vulnerability detection. By automating the analysis of massive amounts of code, ML-based detection is efficient and scalable, which is essential in today's software development environment, where complex systems and frequent changes demand quick and comprehensive security evaluations. Automation also reduces the human error inherent in manual inspection while speeding up detection, and ML models make proactive threat detection and continuous monitoring easier. Even with these benefits, however, manual code analysis remains essential in some critical situations. Human analysts are best placed to handle special cases such as zero-day vulnerabilities [5], that is, vulnerabilities whose exploits are discovered and used before developers have a chance to mitigate them.

4.3.2 Transfer Learning for Software Vulnerability Detection.

Transfer learning is crucial for software vulnerability detection for two reasons. First, high-quality labeled datasets for software vulnerability detection are often scarce and expensive to produce, because labeling requires expert knowledge [19, 87, 95]. Second, software vulnerability detection often requires understanding domain-specific languages and contexts, which can vary widely between applications and systems [33, 95].

Among the studies we reviewed, six use transfer learning for software vulnerability detection. Liu et al. [95] reduced distribution disparities between domains by improving cross-domain representations with a metric transfer learning framework. With this method, the model still generalizes well when the projects or vulnerability types differ between the training and test data. Du et al. [33] presented a system for detecting software vulnerabilities that uses the transfer learning algorithm TrAdaBoost. By using labeled bug reports from one project to predict issue categories in another project where labeled data is insufficient, their method identifies bug types across several projects. Sendner et al. [125] tailored transfer learning to smart contract vulnerability detection. Their method, ESCORT, uses a shared feature extractor to learn the semantics of the bytecode, with separate branches for different kinds of vulnerabilities. This transfer learning capability makes ESCORT more flexible, since new vulnerability types can be added with less data. Zhou et al. [182] presented an adversarial multi-task learning framework that integrates shared and task-specific components to maximize feature extraction, using adversarial transfer learning to reduce noise and interference between private and general features. Li et al. [77] explored the identification of cross-domain vulnerabilities with VulGDA, a system that combines graph embedding and deep domain adaptation. VulGDA transforms source code samples into graph representations to capture syntactic and semantic links and improves feature extraction by generating domain-invariant features. Zhang et al. [174] proposed CPVD, a cross-domain vulnerability detection method that uses labeled data from a source domain to predict vulnerability labels in a target domain. CPVD encodes code as property graphs and uses a graph attention network and a convolution pooling network for feature extraction.
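For readers unfamiliar with TrAdaBoost, the sketch below implements the instance re-weighting step of the original TrAdaBoost formulation, which is what lets it transfer knowledge from a labeled source domain: source instances that the current weak learner misclassifies lose weight (they look unlike the target domain), while misclassified target instances gain weight as in AdaBoost. It shows a single update under the assumption that the weak learner's 0/1 errors are already known; training the weak learner and combining the rounds are omitted, and this is not Du et al.'s [33] full pipeline. The function name `tradaboost_update` is hypothetical.

```python
import math

def tradaboost_update(w_src, w_tgt, err_src, err_tgt, n_rounds):
    """One TrAdaBoost re-weighting step.

    w_src, w_tgt:     current weights of source- and target-domain instances
    err_src, err_tgt: 1 if this round's weak learner misclassified the instance, else 0
    n_rounds:         total number of boosting rounds N
    """
    # Weighted error of the weak learner on the target domain only.
    eps = sum(w * e for w, e in zip(w_tgt, err_tgt)) / sum(w_tgt)
    if not 0 < eps < 0.5:
        raise ValueError("TrAdaBoost assumes 0 < target error < 0.5")
    beta_t = eps / (1 - eps)
    beta = 1 / (1 + math.sqrt(2 * math.log(len(w_src)) / n_rounds))
    new_src = [w * beta ** e for w, e in zip(w_src, err_src)]        # shrink bad source points
    new_tgt = [w * beta_t ** (-e) for w, e in zip(w_tgt, err_tgt)]   # boost hard target points
    return new_src, new_tgt, beta_t

new_src, new_tgt, beta_t = tradaboost_update(
    w_src=[1.0] * 4, w_tgt=[1.0] * 4,
    err_src=[1, 0, 0, 0], err_tgt=[1, 0, 0, 0], n_rounds=10)
print(new_src, new_tgt)
```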

4.4 RQ4: What Is the Most Frequent Type of Vulnerability Covered in the Subject Studies?

Software vulnerability detection datasets cover different vulnerability types; for example, the NVD and SARD benchmarks together cover 96 types of vulnerabilities. This RQ summarizes the vulnerability types most frequently covered by the subject studies and how often each appears. Table 9 shows the statistics. The CWE-Type column gives the CWE identifier of each type.¹⁵ The CWE website offers several views for categorizing vulnerabilities, including by software development, by hardware design, and by research concepts. Table 9 follows the research-concepts view, as it closely matches the vulnerability types reported in the subject studies.

Table 9.

Category	CWE-Type	Severity Score	# Studies	Total	References
Resource	CWE-119	–	29	121	[9, 11, 16, 20, 29, 30, 34, 36, 37, 38, 47, 49, 71, 75, 85, 87, 95, 98, 107, 110, 121, 131, 132, 140, 141, 142, 151, 159, 164]
	CWE-476	–	13		[11, 29, 30, 36, 47, 98, 110, 131, 132, 142, 151, 152, 159]
	CWE-399	–	13		[9, 16, 34, 37, 49, 75, 85, 98, 110, 131, 132, 141, 159]
	CWE-400	–	10		[9, 20, 30, 47, 132, 141, 142, 143, 159, 165]
	CWE-22	–	10		[9, 20, 30, 38, 41, 140, 141, 142, 151, 159]
	CWE-787	–	9		[9, 11, 20, 38, 98, 132, 141, 151, 165]
	CWE-125	–	9		[9, 11, 20, 98, 110, 132, 141, 151, 152]
	CWE-416	–	9		[11, 29, 30, 98, 110, 131, 132, 151, 159]
	CWE-122	–	7		[9, 11, 23, 121, 138, 141, 152]
	CWE-121	–	6		[11, 121, 138, 141, 152, 164]
	CWE-362	–	6		[98, 110, 131, 140, 142, 151]
Validation	CWE-20	–	13	37	[9, 20, 30, 38, 98, 110, 131, 132, 141, 142, 151, 159, 165]
	CWE-78	–	9		[9, 20, 41, 75, 83, 141, 142, 151, 165]
	CWE-841	–	8		[4, 8, 62, 125, 134, 168, 169, 171]
	CWE-200	–	7		[30, 38, 98, 131, 132, 140, 142]
Numeric	CWE-190	–	23	36	[4, 8, 9, 20, 29, 38, 58, 62, 98, 110, 120, 125, 131, 132, 140, 141, 142, 143, 151, 152, 165, 168, 182]
	CWE-189	–	7		[30, 87, 98, 110, 131, 132, 190]
	CWE-191	–	6		[4, 125, 140, 142, 168, 182]
Unique Total	–	–	–	48	–

Table 9. Top Vulnerability Types Covered in the Subject Studies

Table 9 indicates that the vulnerability category receiving the most attention is Resource vulnerabilities, with 121 mentions across the subject studies (a single study can cover several CWE types). This category concerns how a system manages its resources, which are created, used, and released according to a predefined set of instructions. CWE-119 (Improper Restriction of Operations within the Bounds of a Memory Buffer) [95, 107, 121] is the vulnerability type most frequently addressed by the subject studies, covered by 29 studies. It occurs when software reads from or writes to a memory location outside the bounds of a buffer. Within this category, Null Pointer Dereference (CWE-476) is among the next most frequent types, covered by 13 subject studies (tied with CWE-399). This vulnerability occurs when a program reads or writes memory through a pointer that is NULL (i.e., does not hold a valid memory address), typically because it was not properly initialized or checked.

Validation-related vulnerabilities are the second major family of vulnerability types, covered by 37 subject studies. In this family, attackers exploit input or output data that is malformed or not properly validated. CWE-20 [20, 142] is the most frequent type in this family, covered by 13 subject studies. CWE-20 (Improper Input Validation) refers to software that does not properly validate its input, which leaves it open to attackers who supply crafted input data. This occurs when input data is not verified to be safe or consistent with the predefined specifications. CWE-78 is the second most frequent type in this family, covered by 9 subject studies [11, 20, 38]. This weakness is OS command injection, in which an external attacker can influence an OS command that the software constructs from insufficiently validated input supplied by another component.
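To make the difference between CWE-20 and CWE-78 concrete, the sketch below contrasts a command string built by concatenating untrusted input with two mitigations: allowlist validation and passing the value as a separate argument (or, if a shell string is unavoidable, quoting it with `shlex.quote`). The `ping` command and the hostname pattern are illustrative assumptions, and nothing is executed.

```python
from __future__ import annotations

import re
import shlex

# Allowlist for a hostname argument: letters, digits, dots, and hyphens,
# starting and ending with a letter or digit (illustrative, not RFC-complete).
HOSTNAME_RE = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9.-]{0,251}[A-Za-z0-9])?")


def vulnerable_command(host: str) -> str:
    """CWE-78 pattern: untrusted input is concatenated into a shell command string."""
    return "ping -c 1 " + host


def validate_host(host: str) -> str:
    """CWE-20 mitigation: reject input that does not match the expected format."""
    if not HOSTNAME_RE.fullmatch(host):
        raise ValueError(f"invalid hostname: {host!r}")
    return host


def safe_argv(host: str) -> list[str]:
    """Argument vector for subprocess.run(argv) without shell=True: the host is
    validated and passed as one argument, so no shell ever parses it."""
    return ["ping", "-c", "1", validate_host(host)]


def quoted_command(host: str) -> str:
    """Fallback when a shell string is unavoidable: quote the untrusted part."""
    return "ping -c 1 " + shlex.quote(host)


malicious = "example.com; rm -rf /"
print(vulnerable_command(malicious))  # a shell would run this as two commands
print(quoted_command(malicious))      # ping -c 1 'example.com; rm -rf /'
print(safe_argv("example.com"))       # ['ping', '-c', '1', 'example.com']
```

Validation and argument separation address different weaknesses: the allowlist enforces the input specification (CWE-20), while avoiding the shell removes the injection channel (CWE-78) even if validation is incomplete.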

Numeric vulnerabilities are the third most frequent family covered in the subject studies, accounting for 36 studies in total. Within this family, Integer Overflow (CWE-190) is the most frequently covered vulnerability type [20, 58, 142]. Integer overflow occurs when an arithmetic operation attempts to produce a numeric value outside the range that can be represented with a given number of bits. For example, an 8-bit unsigned integer can represent values from 0 to 255, whereas a 32-bit signed integer typically ranges from -2,147,483,648 to 2,147,483,647. When an arithmetic operation produces a value beyond these limits, an overflow occurs; the result typically wraps around, and if it is then used, for example, as a buffer size, it can lead to memory corruption.
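Python integers never overflow, so the sketch below emulates fixed-width wraparound explicitly to reproduce the ranges mentioned above. The allocation example (`count * elem_size` computed in a 32-bit unsigned type) is a hypothetical illustration of how CWE-190 can feed into a buffer overflow.

```python
def wrap_unsigned(value: int, bits: int) -> int:
    """Reduce `value` modulo 2**bits, which is how C unsigned arithmetic wraps."""
    return value % (1 << bits)


def wrap_signed(value: int, bits: int) -> int:
    """Two's-complement wraparound. Signed overflow is undefined behavior in C,
    so this models only the outcome commonly seen on real hardware."""
    value = wrap_unsigned(value, bits)
    return value - (1 << bits) if value >= (1 << (bits - 1)) else value


def mul_overflows(a: int, b: int, bits: int = 32) -> bool:
    """Check, before multiplying, whether a * b exceeds the unsigned `bits`-bit range."""
    return a != 0 and b > ((1 << bits) - 1) // a


print(wrap_unsigned(255 + 1, 8))            # 0
print(wrap_signed(2_147_483_647 + 1, 32))   # -2147483648

# Hypothetical allocation-size computation in a 32-bit unsigned type:
count, elem_size = 0x40000001, 4
size = wrap_unsigned(count * elem_size, 32)
print(size)                              # 4: the buffer would be far too small
print(mul_overflows(count, elem_size))   # True: a guard would reject this request
```

The guard divides instead of multiplying, so the check itself cannot overflow; this is the usual way such checks are written in C.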

4.5 RQ5: What Are the Most Frequently Used Tools for Software Vulnerability Detection?

In this section, we summarize the most commonly used tools for software vulnerability detection. Table 10 shows their distribution. We grouped the tools into three categories: Model Building Tools, Code Analysis/Compilation, and Data Tools.


Category	Tool Name	# Studies	Total	References
Model Building Tools	Keras/TensorFlow	42	116	[10, 11, 16, 17, 23, 26, 29, 32, 37, 41, 60, 62, 63, 64, 68, 71, 74, 76, 83, 84, 85, 87, 95, 96, 97, 106, 107, 109, 114, 118, 125, 131, 139, 142, 144, 148, 149, 154, 164, 186, 189, 190]
	PyTorch	42		[3, 8, 9, 13, 20, 21, 22, 28, 30, 31, 38, 48, 49, 54, 57, 64, 68, 77, 82, 91, 120, 127, 129, 132, 138, 140, 141, 143, 143, 151, 152, 155, 170, 174, 175, 178, 179, 180, 181, 183, 185, 188]
	Scikit-learn	11		[19, 33, 41, 54, 60, 62, 70, 87, 142, 174, 175]
	GenSim	9		[41, 54, 64, 87, 95, 140, 154, 174, 183]
	DGL	6		[7, 8, 129, 174, 175, 180]
	Theano	2		[26, 85]
	sent2vec	2		[155, 188]
	Transformers	2		[38, 48]
Code Analysis/Compilation	Joern	24	35	[9, 11, 22, 29, 30, 68, 71, 110, 129, 132, 137, 141, 142, 143, 147, 150, 152, 155, 170, 174, 175, 183, 188, 189]
	Soot	3		[79, 80, 142]
	Clang	2		[22, 48]
	tree-sitter	2		[152, 153]
	CodeSensor	2		[94, 95]
	ANTLR	2		[135, 142]
Data Tools	NetworkX	5	9	[7, 9, 41, 141, 155]
	NLTK	4		[54, 56, 73, 114]
Unique Total	–	–	96	–

Table 10. Most Commonly Used Tools for Software Vulnerability Detection

As the table shows, Keras with the TensorFlow backend¹⁶ and PyTorch¹⁷ are the most commonly used libraries for building ML-based software vulnerability detection techniques, each used in 42 studies. Scikit-learn¹⁸ is the third most popular library for model building, used in 11 studies. Scikit-learn provides a user-friendly and consistent API, which makes it easy to implement and experiment with various ML algorithms, including classifiers such as Logistic Regression, SVM, Decision Trees, Random Forests, k-Nearest Neighbors (KNN), and Naive Bayes. GenSim¹⁹ is the fourth most commonly used tool for building software vulnerability detection models (9 studies). It handles large corpora efficiently and provides topic modeling and word embedding functionality (e.g., Word2Vec), which makes it useful for embedding tokens and text during model building. DGL²⁰ is the fifth most commonly used model building tool, used in 6 studies. DGL is designed specifically for constructing and training GNNs, which makes it a common choice for researchers and practitioners working on graph-related problems. It abstracts much of the complexity of implementing GNNs and provides easy-to-use APIs for building and applying various GNN models.
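To illustrate what a GNN library such as DGL abstracts away, below is a hand-rolled, dependency-free sketch of one message-passing round (mean aggregation over incoming edges plus a self-loop) followed by a mean readout over a toy code graph. This is not DGL's API: the graph, node names, and features are hypothetical, and real GNN layers also apply learned weight matrices and nonlinearities, which are omitted here.

```python
from __future__ import annotations

# Hypothetical toy code graph: node -> successors (e.g., control-flow edges).
EDGES = {
    "n0_decl": ["n1_read"],
    "n1_read": ["n2_copy"],
    "n2_copy": [],
}
# Hypothetical 2-dimensional initial node features (e.g., token embeddings).
FEATURES = {
    "n0_decl": [1.0, 0.0],
    "n1_read": [0.0, 1.0],
    "n2_copy": [1.0, 1.0],
}


def mean(vectors: list[list[float]]) -> list[float]:
    """Element-wise mean of equally sized vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def message_passing_step(edges, features):
    """One round: each node averages its own vector with the vectors of its predecessors."""
    incoming = {node: [] for node in features}
    for src, dsts in edges.items():
        for dst in dsts:
            incoming[dst].append(features[src])  # message along edge src -> dst
    return {node: mean(incoming[node] + [vec]) for node, vec in features.items()}


def readout(features) -> list[float]:
    """Graph-level embedding: mean over node vectors, which a classifier would consume."""
    return mean(list(features.values()))


hidden = message_passing_step(EDGES, FEATURES)
print(hidden)  # {'n0_decl': [1.0, 0.0], 'n1_read': [0.5, 0.5], 'n2_copy': [0.5, 1.0]}
print(readout(hidden))
```

Stacking several such rounds lets information from distant statements reach each node, which is the property graph-based vulnerability detectors rely on.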

In the category of Code Analysis/Compilation, the most commonly used tool is Joern, used in 24 studies in total. Joern, first proposed by Yamaguchi et al. [160], converts source code into a graph representation (a code property graph) that combines the AST, CFG, and PDG. The second most commonly used tool for code processing is Soot,²¹ a Java analysis framework used in 3 studies that provides several intermediate representations of Java bytecode, such as Jimple.
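Joern uses its own language frontends and builds a richer graph than a bare AST, so the following is only a rough, stdlib-only illustration of the first step such tools perform: turning code into AST edges. It uses Python's built-in `ast` module on a hypothetical Python snippet; it does not reproduce Joern's code property graph, which additionally merges CFG and PDG edges.

```python
from __future__ import annotations

import ast


def ast_edges(source: str) -> list[tuple[str, str]]:
    """Parse `source` and return (parent, child) node-type pairs, i.e., the AST as an edge list."""
    tree = ast.parse(source)
    edges = []
    for parent in ast.walk(tree):
        for child in ast.iter_child_nodes(parent):
            edges.append((type(parent).__name__, type(child).__name__))
    return edges


snippet = "def copy(dst, src):\n    dst[:] = src\n"
for parent, child in ast_edges(snippet):
    print(parent, "->", child)
```

Expression-context nodes such as `Load` and `Store` also appear as children in this edge list; a real pipeline would usually filter them out and attach token or type features to each node before embedding the graph.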

In the category of Data Tools, NetworkX is the most commonly used tool, used in five studies in total. NetworkX²² represents graphs with native Python data structures (nested dictionaries), which allows seamless integration with other Python libraries and makes it easy to manipulate and explore graph data. NLTK²³ (4 studies) provides robust tools for breaking source code and text into tokens, a preprocessing step that token-based analysis of software vulnerability data depends on.
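The sketch below shows the dictionary-of-dictionaries idea without importing NetworkX: each node maps to a dict of neighbors, and each neighbor maps to an edge-attribute dict. The data-flow graph and the `add_edge`/`reachable` helpers are hypothetical; with NetworkX itself, `DiGraph.add_edge` and `networkx.has_path` provide equivalent functionality.

```python
from __future__ import annotations

from collections import deque


def add_edge(graph: dict, u: str, v: str, **attrs) -> None:
    """Add a directed edge u -> v with attributes, creating both nodes as needed."""
    graph.setdefault(u, {})[v] = attrs
    graph.setdefault(v, {})


def reachable(graph: dict, source: str, target: str) -> bool:
    """Breadth-first search: is `target` reachable from `source`?"""
    seen, queue = {source}, deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for neighbor in graph.get(node, {}):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return False


# Hypothetical data-flow graph extracted from a C function.
flow: dict = {}
add_edge(flow, "argv[1]", "len", kind="data")
add_edge(flow, "argv[1]", "strcpy", kind="data")
add_edge(flow, "sanitize", "printf", kind="data")

print(reachable(flow, "argv[1]", "strcpy"))  # True: untrusted input reaches strcpy
print(reachable(flow, "argv[1]", "printf"))  # False
print(flow["argv[1]"]["strcpy"])             # {'kind': 'data'}
```

Because the graph is plain dictionaries, it can be serialized, inspected, or converted into tensors for a GNN without any special API.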

4.6 RQ6: What Are Possible Challenges and Open Directions in Software Vulnerability Detection?

4.6.1 Challenges.

Challenge 1: Heterogeneous Data Sources. The biggest challenge in learning-based vulnerability detection is that current models do not adequately capture the comprehensive semantics of complex vulnerabilities [26, 27, 126]. Existing ML models often fail to capture the complex patterns of software vulnerabilities because they treat source code like natural language. Unlike natural language, source code carries structural and logical information that can only be recovered through AST, data flow, and control flow analysis. To address this, the detection pipeline should use rich representations such as control flow and data flow graphs, together with suitable embeddings that convert these representations into a numerical format for graph-based neural networks.
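The point that code is not just a bag of words can be made concrete with a hypothetical pair of C snippets. The sketch tokenizes them with a simple regular expression (a stand-in for a tokenizer such as NLTK's, not its API) and shows that their token multisets are identical even though only one of them guards the `memcpy` with a length check. The `copy_is_guarded` helper is an illustrative, deliberately crude structural check, not a real parser.

```python
from __future__ import annotations

import re
from collections import Counter

# Identifiers, integer literals, then any other single non-space character.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")

# Hypothetical snippets: SAFE guards the copy; UNSAFE copies before checking.
SAFE = "if (n < 16) { memcpy(buf, src, n); }"
UNSAFE = "memcpy(buf, src, n); if (n < 16) { }"


def tokenize(code: str) -> list[str]:
    return TOKEN_RE.findall(code)


def copy_is_guarded(tokens: list[str]) -> bool:
    """Crude structural check: does `memcpy` appear inside the braces of an `if`?"""
    if_depth = 0             # how many `if` blocks we are currently inside
    awaiting_block = False   # saw `if`, waiting for its opening brace
    for tok in tokens:
        if tok == "if":
            awaiting_block = True
        elif tok == "{" and awaiting_block:
            if_depth += 1
            awaiting_block = False
        elif tok == "}" and if_depth > 0:
            if_depth -= 1
        elif tok == "memcpy":
            return if_depth > 0
    return False


print(Counter(tokenize(SAFE)) == Counter(tokenize(UNSAFE)))  # True: same bag of tokens
print(copy_is_guarded(tokenize(SAFE)), copy_is_guarded(tokenize(UNSAFE)))  # True False
```

A representation that keeps control structure (an AST or CFG) distinguishes the two snippets directly, whereas a bag-of-tokens model cannot.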

Challenge 2: Detection Granularity. The effectiveness of DL models in identifying vulnerabilities depends on input granularity. Many current models use coarse-grained inputs such as methods and files. To achieve finer granularity, program slicing can select the statements that are crucial for detection, but the slices must be chosen carefully so that they reduce noise without discarding vulnerability-relevant statements. Existing tools focus on library/API calls and operations as slicing criteria, but these alone are insufficient. A promising approach is to use code changes from GitHub, focusing on added and deleted lines, which often have the greatest impact on vulnerability detection.
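As a minimal sketch of how slicing narrows the input, the code below computes a backward data-flow slice over a hypothetical straight-line program, using a sensitive call (`memcpy`) as the slicing criterion, in the spirit of the API-call criteria mentioned above. It assumes no branches or loops and ignores control dependences, so it is far simpler than the slicers used in the subject studies.

```python
from __future__ import annotations

# Hypothetical straight-line program: (statement text, variables defined, variables used).
PROGRAM = [
    ("n = read_int()",      {"n"},   set()),
    ("log('start')",        set(),   set()),
    ("buf = alloc(16)",     {"buf"}, set()),
    ("msg = fmt(n)",        {"msg"}, {"n"}),
    ("memcpy(buf, src, n)", set(),   {"buf", "src", "n"}),
]


def backward_slice(program, criterion: int) -> list[str]:
    """Keep the statements the criterion depends on through data flow only
    (no control dependences, no branches or loops)."""
    relevant = set(program[criterion][2])  # variables whose definitions are still needed
    kept = [criterion]
    for i in range(criterion - 1, -1, -1):
        _, defs, uses = program[i]
        if defs & relevant:                # statement i defines a needed variable
            kept.append(i)
            relevant = (relevant - defs) | uses
    return [program[i][0] for i in sorted(kept)]


for stmt in backward_slice(PROGRAM, criterion=4):
    print(stmt)
```

The slice keeps `n = read_int()`, `buf = alloc(16)`, and the `memcpy` call while dropping the logging and formatting statements, which is the noise reduction the paragraph refers to.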

Challenge 3: Lack of Training Data. A significant weakness of DL models, particularly in software vulnerability detection, is their need for large amounts of labeled data [24, 111]. In domains such as image classification, abundant labeled data and pre-trained models enable effective DL training. In software vulnerability detection, however, data scarcity is a major issue because ground-truth labels are difficult to obtain. Platforms such as Stack Overflow, GitHub, and issue-tracking systems provide extensive records, but labeling them is often manual and challenging. Automatic labeling is a potential solution but tends to generate many false positives. Some researchers use unsupervised classification, but this method also has limited precision.
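The false-positive problem of automatic labeling can be illustrated with a hypothetical keyword heuristic applied to commit messages. The keyword list, messages, and ground-truth labels below are invented for illustration; the point is that surface keywords both over-label ("Stack Overflow", "security policy") and miss silent fixes.

```python
from __future__ import annotations

import re

# Hypothetical keyword heuristic of the kind used to label commits automatically.
SECURITY_RE = re.compile(
    r"\b(overflow|vulnerab\w*|cve-\d{4}-\d+|security|xss|injection)\b", re.IGNORECASE
)

# Hypothetical commit messages with manually assigned ground truth
# (True = the commit really fixes a vulnerability).
SAMPLES = [
    ("Fix heap buffer overflow in png_read_row", True),
    ("Update Stack Overflow link in README", False),
    ("Add security policy document", False),
    ("Check length before memcpy in parse_header", True),
    ("Bump version to 2.1.0", False),
]


def auto_label(message: str) -> bool:
    """Label a commit as vulnerability-fixing if its message matches a keyword."""
    return bool(SECURITY_RE.search(message))


def precision_recall(samples) -> tuple[float, float]:
    predicted = [(auto_label(msg), truth) for msg, truth in samples]
    tp = sum(1 for p, t in predicted if p and t)
    fp = sum(1 for p, t in predicted if p and not t)
    fn = sum(1 for p, t in predicted if not p and t)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


print(precision_recall(SAMPLES))  # (0.3333333333333333, 0.5)
```

Two of the three positive labels are wrong here, which is why automatically labeled data usually needs manual verification before it is used for training.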

4.6.2 Open Directions.

Multi-Modal Learning. Source code snippets alone may not be sufficient to build accurate and effective vulnerability detection models. Feeding additional artifacts into ML models can increase detection performance. For example, adding code comments as input can noticeably improve classification performance. Some subject studies that use commits [19] argue that source code alone is not enough and that commit characteristics are required as metadata for software vulnerability detection.
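One simple way to combine such artifacts is early fusion: compute a feature vector per modality and concatenate them into one model input. The sketch below assumes a hypothetical `Commit` record and toy hand-crafted features (risky-call counts, keyword flags, change size); real studies typically use learned embeddings per modality instead.

```python
from __future__ import annotations

import re
from dataclasses import dataclass

# Illustrative list of C calls often associated with memory-safety bugs.
RISKY_CALLS = ("strcpy", "sprintf", "gets", "memcpy")
FIX_WORDS = {"fix", "bounds", "check"}  # hypothetical keyword feature


@dataclass
class Commit:
    """Hypothetical multi-modal sample: code, comments, message, and change metadata."""
    code: str
    comments: str
    message: str
    lines_added: int
    lines_deleted: int
    files_changed: int


def code_features(code: str) -> list[float]:
    """Count calls to each risky function in the changed code."""
    return [float(len(re.findall(rf"\b{name}\s*\(", code))) for name in RISKY_CALLS]


def text_features(text: str) -> list[float]:
    """Two toy features for natural-language text: mentions a fix word, word count."""
    words = re.findall(r"[a-z]+", text.lower())
    return [float(any(w in FIX_WORDS for w in words)), float(len(words))]


def metadata_features(commit: Commit) -> list[float]:
    return [float(commit.lines_added), float(commit.lines_deleted), float(commit.files_changed)]


def fuse(commit: Commit) -> list[float]:
    """Early fusion: concatenate per-modality vectors into a single model input."""
    return (code_features(commit.code)
            + text_features(commit.comments)
            + text_features(commit.message)
            + metadata_features(commit))


sample = Commit(
    code="strcpy(dst, src); memcpy(a, b, n);",
    comments="// check bounds",
    message="Fix overflow",
    lines_added=3, lines_deleted=1, files_changed=1,
)
print(fuse(sample))
```

Keeping each modality in its own function makes it easy to ablate one input (for example, dropping comments) and measure how much it contributes.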

Just-in-Time Vulnerability Detection. One possible direction for software vulnerability detection is the just-in-time (JIT) approach. This approach focuses on detecting vulnerabilities as they are introduced, for example, when a change is committed, thereby offering near real-time protection [56, 114]. It allows faster reaction to, and mitigation of, vulnerabilities before they are exploited.
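A JIT check inspects each change as it arrives rather than rescanning the whole codebase. The sketch below uses `difflib.ndiff` to extract the lines a change adds and applies a hypothetical regex rule as a stand-in for a trained JIT model; the file contents are invented for illustration.

```python
from __future__ import annotations

import difflib
import re

# Illustrative rule standing in for a trained JIT model.
RISKY_CALL_RE = re.compile(r"\b(strcpy|gets|sprintf)\s*\(")


def added_lines(old: str, new: str) -> list[str]:
    """Lines introduced by a change; difflib.ndiff prefixes them with '+ '."""
    diff = difflib.ndiff(old.splitlines(), new.splitlines())
    return [line[2:] for line in diff if line.startswith("+ ")]


def flag_change(old: str, new: str) -> list[str]:
    """Inspect only the added lines of a change, as a commit-time check would."""
    return [line for line in added_lines(old, new) if RISKY_CALL_RE.search(line)]


OLD = "int f(const char *s) {\n  char buf[8];\n  return 0;\n}\n"
NEW = "int f(const char *s) {\n  char buf[8];\n  strcpy(buf, s);\n  return 0;\n}\n"
print(flag_change(OLD, NEW))  # ['  strcpy(buf, s);']
```

Restricting analysis to the diff keeps the check fast enough to run in a pre-commit hook or CI job, at the cost of missing vulnerabilities that arise from interactions with unchanged code.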

Leveraging Foundation Models (LLMs) for Vulnerability Detection. Recently, LLMs have been used in a wide variety of software engineering tasks, including automatic program repair [59], test case generation, and root cause analysis of incidents in cloud environments. However, the application of LLMs to software vulnerability detection has not yet been explored as comprehensively as it should be. In our survey, we identified some subject studies that use LLMs for software vulnerability detection [32, 98, 135, 169]. However, such studies remain rare compared with the widespread use of typical DL models.

5 Threats to Validity

In this section, we discuss threats to the validity of each RQ addressed in this study.

RQ1: Trend of Studies. The selection of studies might be biased if certain types of studies are more likely to be indexed or retrieved by our web crawler. To address selection bias, we defined diverse key terms to extract the research papers most relevant to software vulnerability detection. The target papers had to use ML-based software vulnerability detection techniques. To increase the accuracy of data selection, we refined the initial search results in three steps to ensure that the most relevant studies were selected for taxonomy creation and refinement. These steps were performed by multiple authors simultaneously. The choice of digital libraries could affect construct validity if the libraries do not equally represent all relevant studies. To mitigate this threat, we selected the most widely used digital libraries: ACM Digital Library, ScienceDirect, IEEE Xplore, and Google Scholar. These libraries are representative of the software vulnerability detection field because they contain a sufficient number of records that match our key terms for data extraction. A major threat to the external validity of RQ1 is that the trends we observed from January 2011 to June 2024 may not hold for future research beyond this period. As technologies evolve rapidly, new techniques and tools for software vulnerability detection may emerge. However, we believe that our findings accurately represent the state of the art in software vulnerability detection at the time of this study.

RQ2: Characteristics of Software Vulnerability Detection Datasets. Datasets might focus on specific types of software or languages, which threatens the generalizability of our findings. To mitigate this limitation, we focused on software vulnerability detection in three major language domains: Java, C/C++, and smart contracts. Java is prevalent in enterprise and web applications, C/C++ is fundamental to system and performance-critical programming, and smart contracts are crucial in blockchain technology. This diverse selection reduces selection bias, provides a holistic view of vulnerabilities, and makes the findings more broadly applicable to real-world software development contexts. Although our findings are based on datasets from studies published between January 2011 and June 2024, we expect the identified characteristics to remain relevant to future datasets as software vulnerability detection techniques continue to advance. We provide detailed criteria and procedures for selecting and analyzing datasets, enabling other researchers to replicate and validate our findings and thus enhancing the generalizability and reliability of our conclusions.

RQ3: Distribution of ML and DL Models in Software Vulnerability Detection. There are multiple threats to this RQ. First, ML models evolve quickly, and models that are effective today might soon become obsolete or be replaced by more advanced ones. To mitigate this threat, we extended our study selection to cover the last 2 years, that is, 2023 and 2024, to capture the most recent ML techniques for software vulnerability detection. This resulted in identifying three promising studies that use foundation models for software vulnerability detection.

RQ4: Frequent Software Vulnerabilities. To ensure construct validity in this RQ, it is crucial to provide clear and precise definitions of each type of vulnerability. We first identified reputable sources such as OWASP and MITRE's CWE. OWASP provides a widely recognized list of common security vulnerabilities, particularly in web applications. CWE offers a comprehensive list of software weaknesses with detailed descriptions and classifications. We then reviewed the subject studies and identified the types of vulnerabilities that are mentioned frequently. These vulnerabilities can often be identified by CWE IDs explicitly mentioned in the research papers.

RQ5: Tools for Software Vulnerability Detection. The threat to this RQ is that the selection of tools may be biased by factors such as popularity (e.g., TensorFlow and PyTorch). This can skew the findings toward well-known tools and neglect equally effective but less publicized options. To mitigate this threat, we classified the tools into three broad categories. For each category, we reported both the most and the least popular tools, aiming for a balanced mix that avoids over-representing any particular subset.

RQ6: Challenges and Open Directions. To ensure the construct validity of this RQ, we thoroughly analyzed two key sections of each study. First, we examined the context part of the abstract to gain a general understanding of the problem being addressed. Next, we analyzed the introduction to extract relevant text that further elaborates on the problem. By combining this information, we generalized the problem and created a concise taxonomy for classification.

6 Conclusion

In this study, we conducted a systematic survey to investigate various characteristics of ML-based software vulnerability detection studies using six RQs. We extracted initial studies from four widely used online digital libraries (ACM Digital Library, IEEE Xplore, ScienceDirect, and Google Scholar) using a custom web scraper. After manually filtering out studies unrelated to software vulnerability detection, we created taxonomies and addressed the RQs.

Our findings indicate a notable increase in the use of ML techniques to detect software vulnerabilities in recent years. Prominent conference venues include ICSE, ISSRE, MSR, and FSE, whereas the leading journal venues are IST, C&S, and JSS. Additionally, 39.1% of the subject studies use hybrid data sources, whereas 37.6% use benchmark data for software vulnerability detection. Among the data types analyzed, code-based data is the most prevalent, with source code being the most common sub-type. Graph-based and token-based input representations are the most popular techniques, used in 57.2% and 24.6% of the studies, respectively. For input embedding, graph embeddings and token vector embeddings are the most frequently employed methods, appearing in 32.6% and 29.7% of the studies, respectively. Furthermore, 88.4% of the examined studies use DL models, with RNNs and GNNs being the most popular, whereas only 7.2% use traditional ML models. The most frequently addressed vulnerability types are CWE-119, CWE-20, and CWE-190. In terms of tools for software vulnerability detection, Keras and PyTorch are the most widely used, and Joern is the leading tool for code analysis and representation. Finally, we summarized the challenges and future directions in software vulnerability detection, providing insights for researchers and practitioners in the field. This comprehensive survey aims to bridge the existing gap and provide a clearer understanding of the current landscape and future opportunities in detecting software vulnerabilities using ML techniques.

Footnotes

¹ https://github.com/dmc1778/CSURSurvey

² Please note that it is possible to detect memory leak vulnerabilities using static analysis techniques; however, dynamic analysis is more effective than static analysis for this purpose.

³ https://joern.io/

⁴ https://github.com/dmc1778/CSURSurvey

⁵ https://pypi.org/project/selenium/

⁶ https://pypi.org/project/beautifulsoup4/

⁷ https://nvd.nist.gov/general/news

⁸ https://nvd.nist.gov/general/brief-history

⁹ https://github.com/dmc1778/CSURSurvey

¹⁰ Please note that if we could not find the dataset name and source in the experimental setup section, we looked for them in other sections of the paper.

¹¹ https://www.nist.gov/

¹² https://www.cve.org/ProgramOrganization/CNAs

¹³ https://github.com/smartbugs/smartbugs-wild

¹⁴ https://etherscan.io/

¹⁵ https://cwe.mitre.org/

¹⁶ https://www.tensorflow.org/

¹⁷ https://pytorch.org/

¹⁸ https://scikit-learn.org

¹⁹ https://radimrehurek.com/gensim/

²⁰ https://www.dgl.ai/

²¹ https://soot-oss.github.io/soot/

²² https://networkx.org/

²³ https://www.nltk.org/

References

[1] Faranak Abri, Sima Siami-Namini, Mahdi Adl Khanghah, Fahimeh Mirza Soltani, and Akbar Siami Namin. 2019. Can machine/deep learning classifiers detect zero-day malware with high accuracy? In Proceedings of the 2019 IEEE International Conference on Big Data (Big Data'19). IEEE, 3252–3259.
