Modeling Skewness in Vulnerability Discovery

HyunChul Joh a* and Yashwant K. Malaiya b

A vulnerability discovery model attempts to model the rate at which vulnerabilities are discovered in a software product. Recent studies have shown that the S-shaped Alhazmi-Malaiya Logistic (AML) vulnerability discovery model often fits better than other models and demonstrates superior prediction capabilities for several major software systems. However, the AML model is based on the logistic distribution, which assumes a symmetrical discovery process with a peak in the center. Hence, it can be expected that when the discovery process does not follow a symmetrical pattern, a discovery model based on an asymmetrical distribution might perform better. Here, the relationship between the performance of S-shaped vulnerability discovery models and the skewness in target vulnerability datasets is examined. To study the possible dependence on the skew, alternative S-shaped models based on the Weibull, Beta, Gamma and Normal distributions are introduced and evaluated. The models are fitted to data from eight major software systems. The applicability of the models is examined using two separate approaches: a goodness-of-fit test to see how well the models track the data, and prediction capability using average error and average bias measures. It is observed that an excellent goodness of fit does not necessarily result in superior prediction capability. The results show that when prediction capability is considered, all the right-skewed datasets are represented better by the Gamma distribution-based model. The symmetrical models tend to predict better for left-skewed datasets; the AML model is found to be the best among them. Copyright © 2013 John Wiley & Sons, Ltd.

Keywords: data models; security; empirical studies; vulnerability discovery model (VDM); skewness

1. Introduction

Before software developers release a product to customers, it must not only satisfy the functional and technical requirements but also be sufficiently reliable and secure. After the release, developers must ensure that patches are available as soon as possible for the vulnerabilities that will be discovered. If software development managers can make accurate projections of the vulnerability discovery process, they can optimally allocate the resources likely to be needed for rapid patch development. A quantitative characterization of vulnerability discovery rates is necessary to assess the risks associated with the product.

A vulnerability is defined as a defect or weakness in the security system which might be exploited by a malicious user, causing loss or harm.1 A critical vulnerability can give an attacker the ability to gain full control of the system or to leak highly sensitive information. For non-security-related software defects, the most widely used reliability metrics are residual fault density and failure intensity.2 These measures support data-driven quantitative analysis methods that developers can use to control development in order to achieve target reliability levels. The software reliability growth models (SRGMs) that attempt to relate defect discovery to testing time form a core part of the software reliability engineering discipline.3,4 The vulnerability discovery models (VDMs) proposed recently are somewhat analogous to SRGMs, but there are significant differences. Vulnerabilities, which are security-related defects, tend to have a different profile than ordinary software defects.5,6 Ordinary defects found after release are frequently ignored and not fixed until the next release because they do not represent a high degree of risk. On the other hand, software developers need to patch vulnerabilities right after they are found because of the high risks they represent.
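The AML model referenced above is commonly given in the literature as the logistic cumulative curve Ω(t) = B / (BCe^(−ABt) + 1), where B is the total number of vulnerabilities eventually discovered and A and C are fitted parameters; since the paper later evaluates skewness of the discovery process, a minimal sketch of both ideas may help. The formula and the parameter values below are illustrative assumptions drawn from the cited AML literature, not quantities stated in this excerpt:

```python
import math

def aml_cumulative(t, A, B, C):
    """Cumulative vulnerabilities at time t under the AML logistic model:
    Omega(t) = B / (B*C*exp(-A*B*t) + 1)."""
    return B / (B * C * math.exp(-A * B * t) + 1)

def sample_skewness(xs):
    """Fisher-Pearson sample skewness g1 = m3 / m2**1.5."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

# Illustrative (hypothetical) parameters, not fitted values from the paper.
A, B, C = 0.002, 100.0, 2.0
months = range(120)
cumulative = [aml_cumulative(t, A, B, C) for t in months]

# Monthly discovery counts are the first differences of the cumulative curve.
monthly = [b - a for a, b in zip(cumulative, cumulative[1:])]

# Expand the counts into per-vulnerability discovery times; because the
# logistic rate curve is symmetric about its peak, the skewness of this
# sample should be close to zero.
times = [t for t, m in zip(months, monthly) for _ in range(round(m))]
print(f"sample skewness of discovery times: {sample_skewness(times):.2f}")
```

A right-skewed dataset (early peak, long tail of late discoveries) would yield a clearly positive value here, which is the situation where the paper finds the Gamma-based model to predict better than the symmetric AML model.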
The security issues can greatly impact organizations such as banks, brokerage houses, on-line merchants and government offices, as well as individuals. A quantitative analysis of the software vulnerability discovery process is required for optimizing testing, maintenance and risk assessment of software systems, because quantitative methods provide actual data-driven analysis. For a quantitative assessment to become feasible, the software systems need to have been around for a sufficiently long time, so that the related datasets are significant enough to be analyzed.7,8

a School of General Studies, Gwangju Institute of Science and Technology, Gwangju, Korea
b Computer Science Department, Colorado State University, Fort Collins, CO 80523, USA
*Correspondence to: HyunChul Joh, School of General Studies, Gwangju Institute of Science and Technology, B-312 GIST College, 123 Cheomdan-gwagiro, Buk-gu, Gwangju, 500-712, Korea. E-mail: joh@gist.ac.kr

Research Article. Qual. Reliab. Engng. Int. 2014, 30, 1445–1459. DOI: 10.1002/qre.1567. Published online 2 September 2013 in Wiley Online Library (wileyonlinelibrary.com).