SPI: Automated Identification of Security Patches via Commits. (arXiv:2105.14565v2 [cs.CR] UPDATED)

Security patches in open-source software, providing security fixes to
identified vulnerabilities, are crucial in protecting against cyberattacks.
Despite the National Vulnerability Database (NVD) publishes identified
vulnerabilities, a vast majority of vulnerabilities and their corresponding
security patches remain beyond public exposure, e.g., in the open-source
libraries that are heavily relied on by developers. An extensive security
patches dataset could help end-users such as security companies, e.g., building
a security knowledge base, or researchers, e.g., aiding in vulnerability
research. To curate security patches including undisclosed patches at a large
scale and low cost, we propose a deep neural-network-based approach built upon
commits of open-source repositories. We build security patch datasets that
include 38,291 security-related commits and 1,045 CVE patches from four C
libraries. We manually verify each commit, among the 38,291 security-related
commits, to determine if they are security-related. We devise a deep
learning-based security patch identification system that consists of two neural
networks: one commit-message neural network that utilizes pretrained word
representations learned from our commits dataset; and one code-revision neural
network that takes code before and after revision and learns the distinction on
the statement level. Our evaluation results show that our system outperforms
SVM and K-fold stacking algorithm, achieving as high as 87.93% F1-score and
precision of 86.24%. We deployed our pipeline and learned model in an
industrial production environment to evaluate the generalization ability of our
approach. The industrial dataset consists of 298,917 commits from 410 new
libraries that range from a wide functionality. Our experiment results and
observation proved that our approach identifies security patches effectively
among open-sourced projects.