21-25 March 2022
Academia Sinica
Europe/Zurich timezone

Software defect prediction: A study on software metrics using statistical and machine learning methods

Mar 21, 2022, 3:30 PM
Room 1

Oral Presentation Track 10: Artificial Intelligence (AI)


Dr Marco Canaparo (INFN CNAF)


Software defect prediction aims to identify defect-prone software modules in order to allocate testing resources [1, 2]. Software testing plays an essential role in the software development life cycle: its criticality is demonstrated by the significant amount of money companies spend on it [3]. Moreover, in recent decades software systems have become more and more complex in order to meet functional and non-functional requirements [4], and this complexity represents a fertile environment for defects. Several researchers have striven to develop models able to identify defective modules, with the aim of reducing the time and cost of software testing [5, 6]. Such models are typically trained on software measurements, i.e. software metrics [7]. Software metrics are of paramount importance in the field of software engineering because they describe the characteristics of a software project, such as size, complexity and code churn [8]. They reduce the subjectivity of software quality assessment and can be relied on for decision making, e.g. to decide where to focus software tests [9, 10].
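To make the notion of a software metric concrete, here is a minimal sketch (in Python, purely for illustration; real studies use dedicated measurement tools) of two of the simplest size metrics mentioned above, lines of code and comment count:

```python
def size_metrics(source: str) -> dict:
    """Count total, blank, comment, and effective code lines of a snippet.

    A toy illustration of size metrics; assumes Python-style '#' comments.
    """
    lines = source.splitlines()
    blank = sum(1 for l in lines if not l.strip())
    comments = sum(1 for l in lines if l.strip().startswith("#"))
    loc = len(lines) - blank - comments  # non-blank, non-comment lines
    return {"total": len(lines), "blank": blank, "comment": comments, "loc": loc}

snippet = """# add two numbers
def add(a, b):

    return a + b
"""
print(size_metrics(snippet))  # → {'total': 4, 'blank': 1, 'comment': 1, 'loc': 2}
```

Complexity and code-churn metrics follow the same idea: a numeric summary of a module that a prediction model can consume as a feature.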

Our study is based on several software metrics datasets derived from different software projects [11, 12, 13]. The collected metrics belong to three main categories: size, complexity and object-oriented [14, 15]. Our research has highlighted a lack of consistency [16] among metric names: on the one hand, some metrics have similar names but measure different software features; on the other hand, metrics with different names measure similar software features. The involved datasets are both labelled and unlabelled, i.e. they may (or may not) contain information on the defectiveness of the software modules. Moreover, some datasets include metrics computed by leveraging metric thresholds [17], available by default in the software applications used to perform the measurements.
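The naming inconsistency described above is typically handled by mapping each dataset's column names onto a canonical vocabulary before merging. A minimal Python sketch (all metric and dataset names here are invented for illustration, not taken from the actual datasets):

```python
# Map dataset-specific metric names onto canonical names.
# Hypothetical aliases: real datasets vary, e.g. "loc" vs "SLOC" for size.
CANONICAL = {
    "loc": "lines_of_code", "LinesOfCode": "lines_of_code", "SLOC": "lines_of_code",
    "v(g)": "cyclomatic_complexity", "CC": "cyclomatic_complexity",
    "wmc": "weighted_methods_per_class", "WMC": "weighted_methods_per_class",
}

def harmonize(row: dict) -> dict:
    """Rename metric keys to canonical names; keep unknown keys unchanged."""
    return {CANONICAL.get(k, k): v for k, v in row.items()}

labelled_row = {"loc": 120, "v(g)": 7, "defective": True}   # labelled dataset
unlabelled_row = {"SLOC": 95, "CC": 4}                      # unlabelled dataset
print(harmonize(labelled_row))
print(harmonize(unlabelled_row))
```

After harmonization, rows from both sources expose `lines_of_code` and `cyclomatic_complexity`, so they can be analysed together even though their original headers differed.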

Software defect prediction models use both statistical and machine learning (ML) methods, as described in previous literature [18, 19]. Because of the characteristics of the data, which usually follow a non-Gaussian distribution, this work includes techniques such as Decision Tree, Random Forest, Support Vector Machine, LASSO and Stepwise Regression. We have also employed statistical techniques that enable us to compare all these algorithms through performance indicators such as precision, recall and accuracy, as well as nonparametric tests [20].
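A standard nonparametric comparison in this setting is the Friedman test recommended by Demšar [20]: each classifier is ranked on every dataset, and the test checks whether the average ranks differ more than chance would allow. A self-contained sketch (the accuracy values are invented for illustration; tie handling is omitted for brevity):

```python
def friedman_statistic(scores):
    """Friedman chi-square over a scores matrix.

    scores[i][j] = performance of classifier j on dataset i (higher is better).
    Returns (average ranks per classifier, chi-square statistic).
    Assumes no ties within a dataset, so plain ordinal ranks suffice.
    """
    n, k = len(scores), len(scores[0])
    rank_sums = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda j: row[j], reverse=True)
        for rank, j in enumerate(order, start=1):  # rank 1 = best on this dataset
            rank_sums[j] += rank
    avg_ranks = [s / n for s in rank_sums]
    chi2 = (12 * n / (k * (k + 1))) * (sum(r * r for r in avg_ranks) - k * (k + 1) ** 2 / 4)
    return avg_ranks, chi2

# Rows: datasets; columns: e.g. Decision Tree, Random Forest, SVM (toy numbers).
acc = [[0.78, 0.84, 0.80],
       [0.71, 0.79, 0.74],
       [0.83, 0.88, 0.81],
       [0.69, 0.75, 0.72]]
ranks, chi2 = friedman_statistic(acc)
print(ranks, round(chi2, 2))  # → [2.75, 1.0, 2.25] 6.5
```

With k = 3 classifiers the statistic has k − 1 = 2 degrees of freedom; 6.5 exceeds the 0.05 critical value of 5.99, so in this toy example the null hypothesis of equal performance would be rejected, and a post-hoc test (e.g. Nemenyi, also in [20]) would then localize the differences.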

To make our study available to the research community, we have developed an open-source and extensible R application that helps researchers load the selected kinds of datasets, filter them according to their features, and apply all the mentioned statistical and ML techniques.
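The load → filter → analyse workflow the application supports can be outlined as plain functions. The authors' tool is written in R, so this Python sketch only mirrors the described steps; the function names and metric columns are invented, not the tool's actual API:

```python
def load_dataset(rows):
    """Stand-in loader: each row is a dict of metric name -> value."""
    return list(rows)

def filter_features(rows, features):
    """Keep only the requested metric columns, plus the label if present."""
    keep = set(features) | {"defective"}
    return [{k: v for k, v in r.items() if k in keep} for r in rows]

data = load_dataset([
    {"loc": 120, "wmc": 9, "cbo": 4, "defective": True},
    {"loc": 40,  "wmc": 2, "cbo": 1, "defective": False},
])
subset = filter_features(data, ["loc", "wmc"])
print(subset)  # the "cbo" column has been filtered out
```

The filtered subset would then be passed to the statistical and ML techniques described above.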

[1] Akimova EN, Bersenev AY, Deikov AA, et al. A survey on software defect prediction using deep learning. Mathematics. 2021;9(11):1180. doi: 10.3390/math9111180.

[2] Peng He, Bing Li, Xiao Liu, Jun Chen, and Yutao Ma. 2015. An empirical study on software defect prediction with a simplified metric set. Inf. Softw. Technol. 59, C (March 2015), 170–190. DOI:https://doi.org/10.1016/j.infsof.2014.11.006

[3] S. Huda et al., "A Framework for Software Defect Prediction and Metric Selection," in IEEE Access, vol. 6, pp. 2844-2858, 2018, doi: 10.1109/ACCESS.2017.2785445.

[4] M. Cetiner and O. K. Sahingoz, "A Comparative Analysis for Machine Learning based Software Defect Prediction Systems," 2020 11th International Conference on Computing, Communication and Networking Technologies (ICCCNT), 2020, pp. 1-7, doi: 10.1109/ICCCNT49239.2020.9225352.

[4] L. Šikić, P. Afrić, A. S. Kurdija and M. ŠIlić, "Improving Software Defect Prediction by Aggregated Change Metrics," in IEEE Access, vol. 9, pp. 19391-19411, 2021, doi: 10.1109/ACCESS.2021.3054948.

[5] Meiliana, S. Karim, H. L. H. S. Warnars, F. L. Gaol, E. Abdurachman and B. Soewito, "Software metrics for fault prediction using machine learning approaches: A literature review with PROMISE repository dataset," 2017 IEEE International Conference on Cybernetics and Computational Intelligence (CyberneticsCom), 2017, pp. 19-23, doi: 10.1109/CYBERNETICSCOM.2017.8311708.

[6] R. Jadhav, S. D. Joshi, U. Thorat and A. S. Joshi, "A Survey on Software Defect Prediction in Cross Project," 2019 6th International Conference on Computing for Sustainable Global Development (INDIACom), 2019, pp. 1014-1019.

[7] Wang, H., Khoshgoftaar, T.M., & Seliya, N. (2011). How Many Software Metrics Should be Selected for Defect Prediction? FLAIRS Conference.

[8] T. Honglei, S. Wei and Z. Yanan, "The Research on Software Metrics and Software Complexity Metrics," 2009 International Forum on Computer Science-Technology and Applications, 2009, pp. 131-136, doi: 10.1109/IFCSTA.2009.39.

[9] H. M. Olague, L. H. Etzkorn, S. Gholston and S. Quattlebaum, "Empirical Validation of Three Software Metrics Suites to Predict Fault-Proneness of Object-Oriented Classes Developed Using Highly Iterative or Agile Software Development Processes," in IEEE Transactions on Software Engineering, vol. 33, no. 6, pp. 402-419, June 2007, doi: 10.1109/TSE.2007.1015.

[10] Ulan, M., Löwe, W., Ericsson, M. et al. Weighted software metrics aggregation and its application to defect prediction. Empir Software Eng 26, 86 (2021). https://doi.org/10.1007/s10664-021-09984-2

[11] M. D'Ambros, M. Lanza and R. Robbes, "An extensive comparison of bug prediction approaches," 2010 7th IEEE Working Conference on Mining Software Repositories (MSR 2010), 2010, pp. 31-41, doi: 10.1109/MSR.2010.5463279.

[12] Tóth Z., Gyimesi P., Ferenc R. (2016) A Public Bug Database of GitHub Projects and Its Application in Bug Prediction. In: Gervasi O. et al. (eds) Computational Science and Its Applications -- ICCSA 2016. ICCSA 2016. Lecture Notes in Computer Science, vol 9789. Springer, Cham. https://doi.org/10.1007/978-3-319-42089-9_44

[13] M. Shepperd, Q. Song, Z. Sun and C. Mair, "Data Quality: Some Comments on the NASA Software Defect Datasets," in IEEE Transactions on Software Engineering, vol. 39, no. 9, pp. 1208-1215, Sept. 2013, doi: 10.1109/TSE.2013.11.

[14] Malhotra, R., & Jain, A. (2012). Fault Prediction Using Statistical and Machine Learning Methods for Improving Software Quality. Journal of Information Processing Systems, 8(2), 241–262

[15] S. R. Chidamber and C. F. Kemerer, "A metrics suite for object oriented design," in IEEE Transactions on Software Engineering, vol. 20, no. 6, pp. 476-493, June 1994, doi: 10.1109/32.295895

[16] Siket, I., Beszédes, Á., & Taylor, J. (2014). Differences in the Definition and Calculation of the LOC Metric in Free Tools.

[17] A. Boucher, M. Badri, Software metrics thresholds calculation techniques to predict fault-proneness: An empirical comparison, Information and Software Technology, Volume 96, 2018, Pages 38-67, ISSN 0950-5849, https://doi.org/10.1016/j.infsof.2017.11.005.

[18] Esteves, G., Figueiredo, E., Veloso, A. et al. Understanding machine learning software defect predictions. Autom Softw Eng 27, 369–392 (2020). https://doi.org/10.1007/s10515-020-00277-4

[19] Atif F., Rodriguez M., Araújo L., Amartiwi U., Akinsanya B., Mazzara M. (2021). A Survey on Data Science Techniques for Predicting Software Defects. doi: 10.1007/978-3-030-75078-7_31.

[20] Demšar, J. (2006). Statistical Comparisons of Classifiers over Multiple Data Sets. Journal of Machine Learning Research, 7(1), 1-30.

Primary authors

Elisabetta Ronchieri (INFN CNAF), Mr Gianluca Bertaccini (Department of Statistical Sciences, University of Bologna), Dr Marco Canaparo (INFN CNAF)
