CVEfixes: Automated Collection of Vulnerabilities and Their Fixes from Open-Source Software. (arXiv:2107.08760v1 [cs.SE])

Data-driven research on the automated discovery and repair of security
vulnerabilities in source code requires comprehensive datasets of real-life
vulnerable code and its fixes. To assist in such research, we propose a
method to automatically collect and curate a comprehensive vulnerability
dataset from Common Vulnerabilities and Exposures (CVE) records in the public
National Vulnerability Database (NVD). We implement our approach in a fully
automated dataset collection tool and share an initial release of the resulting
vulnerability dataset named CVEfixes.

The CVEfixes collection tool automatically fetches all available CVE records
from the NVD, gathers the vulnerable code and corresponding fixes from
associated open-source repositories, and organizes the collected information in
a relational database. Moreover, the dataset is enriched with meta-data such as
programming language, and detailed code and security metrics at five levels of
abstraction. The collection can easily be repeated to keep the dataset up to
date with newly discovered or patched vulnerabilities. The initial release of
CVEfixes spans all published CVEs up to 9 June 2021, covering 5365 CVE records
across 1754 open-source projects, addressed in a total of 5495
vulnerability-fixing commits.
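
To make the relational organization concrete, the following is a minimal
sketch of how CVE records, repositories, and fixing commits could be linked
and queried. The table names, column names, and sample rows are hypothetical
and chosen for illustration only; they are not the schema shipped with
CVEfixes. The sketch assumes that one CVE may be addressed by more than one
commit.

```python
import sqlite3

# Hypothetical, simplified schema. Table and column names are illustrative
# assumptions, not the actual CVEfixes schema.
SCHEMA = """
CREATE TABLE cve (
    cve_id      TEXT PRIMARY KEY,
    description TEXT
);
CREATE TABLE repository (
    repo_url TEXT PRIMARY KEY,
    language TEXT
);
CREATE TABLE fix_commit (
    commit_hash TEXT NOT NULL,
    repo_url    TEXT NOT NULL REFERENCES repository(repo_url),
    cve_id      TEXT NOT NULL REFERENCES cve(cve_id),
    PRIMARY KEY (commit_hash, repo_url, cve_id)
);
"""


def build_demo_db():
    """Create an in-memory database populated with placeholder rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    # Placeholder data only; these are not real CVE records or repositories.
    conn.execute(
        "INSERT INTO cve VALUES (?, ?)",
        ("CVE-0000-0001", "placeholder record"),
    )
    conn.execute(
        "INSERT INTO repository VALUES (?, ?)",
        ("https://example.org/demo/project", "C"),
    )
    conn.executemany(
        "INSERT INTO fix_commit VALUES (?, ?, ?)",
        [
            ("aaaa111", "https://example.org/demo/project", "CVE-0000-0001"),
            ("bbbb222", "https://example.org/demo/project", "CVE-0000-0001"),
        ],
    )
    conn.commit()
    return conn


def fixes_per_cve(conn):
    """Return (cve_id, language, number_of_fixing_commits) tuples."""
    return conn.execute(
        "SELECT c.cve_id, r.language, COUNT(*) "
        "FROM fix_commit f "
        "JOIN cve c ON c.cve_id = f.cve_id "
        "JOIN repository r ON r.repo_url = f.repo_url "
        "GROUP BY c.cve_id, r.language "
        "ORDER BY c.cve_id"
    ).fetchall()
```

Calling `fixes_per_cve(build_demo_db())` joins the three tables and counts
fixing commits per CVE and language. Queries of this shape are one way the
research tasks listed below could select their training or analysis data.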

CVEfixes supports various types of data-driven software security research,
such as vulnerability prediction, vulnerability classification, vulnerability
severity prediction, analysis of vulnerability-related code changes, and
automated vulnerability repair.