21-25 March 2022
Academia Sinica
Europe/Zurich timezone

Software defect prediction: A study on software metrics using statistical and machine learning methods

Mar 21, 2022, 3:30 PM
Room 1

Oral Presentation, Track 10: Artificial Intelligence (AI)


Dr Marco Canaparo (INFN CNAF)


Software defect prediction aims to identify defect-prone software modules so that testing resources can be allocated effectively [1, 2]. Software testing plays an essential role in the software development life cycle: its criticality is shown by the significant share of spending that companies allocate to it [3]. Furthermore, in recent decades software systems have become increasingly complex in order to meet functional or non-functional requirements [4], and this complexity provides fertile ground for defects. Several researchers have striven to develop models that identify defective modules, with the aim of reducing the time and cost of software testing [5, 6]. Such models are typically trained on software measurements, i.e. software metrics [7]. Software metrics are of paramount importance in software engineering because they describe characteristics of a software project such as size, complexity and code churn [8]. They reduce the subjectivity of software quality assessment and can be relied on for decision making, e.g. to decide where to focus software tests [9, 10].

Our study is based on software metrics datasets of different kinds, derived from different software projects [11, 12, 13]. The collected metrics belong to three main categories: size, complexity and object-oriented [14, 15]. Our research has highlighted a lack of consistency [16] among metric names: on the one hand, some metrics have similar names but measure different software features; on the other hand, metrics with different names measure similar software features. The datasets involved are both labelled and unlabelled, i.e. they may or may not contain information on the defectiveness of the software modules. Moreover, some datasets include metrics computed using metric thresholds [17], which are available by default in the software applications used to perform the measurements.
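The naming inconsistency described above can be illustrated with a minimal Python sketch. It is not taken from the authors' R application: the dataset labels, raw metric names and canonical names are all hypothetical, chosen only to show why a renaming table may need to be keyed on the dataset as well as on the metric name.

```python
from typing import Dict, Tuple

# Hypothetical alias table: (dataset, raw metric name) -> canonical name.
# Keying on the dataset as well as the name lets the same raw name map to
# different canonical metrics in different datasets.
ALIASES: Dict[Tuple[str, str], str] = {
    ("dataset_a", "loc"): "lines_of_code",
    ("dataset_b", "LOC_TOTAL"): "lines_of_code",    # different name, same feature
    ("dataset_b", "loc"): "executable_lines",       # same name, different feature
    ("dataset_a", "v(g)"): "cyclomatic_complexity",
    ("dataset_b", "CYCLOMATIC_COMPLEXITY"): "cyclomatic_complexity",
}


def harmonise(dataset: str, row: Dict[str, float]) -> Dict[str, float]:
    """Rename the metrics of one module; unknown names are kept unchanged."""
    return {ALIASES.get((dataset, name), name): value for name, value in row.items()}


module_a = harmonise("dataset_a", {"loc": 120, "v(g)": 7})
module_b = harmonise("dataset_b", {"LOC_TOTAL": 95, "loc": 80, "wmc": 4})
print(module_a)
print(module_b)
```

After harmonisation, the two modules share the key `lines_of_code`, while the raw `loc` column of the second dataset is kept apart under its own canonical name.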

Software defect prediction models use both statistical and machine learning (ML) methods, as described in previous literature [18, 19]. Because the data are usually not Gaussian-distributed, this work includes techniques such as Decision Tree, Random Forest, Support Vector Machine, LASSO and Stepwise Regression. We have also employed statistical techniques that make it possible to compare all these algorithms by means of performance indicators such as precision, recall and accuracy, as well as nonparametric tests [20].
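As a reminder of how the performance indicators named above are defined, the following Python sketch computes precision, recall and accuracy for a binary defect classifier from its confusion-matrix counts. It is an illustration under stated assumptions rather than the authors' implementation: the function name and the label encoding (1 for defective, 0 for non-defective) are chosen here for the example, and the toy labels are invented.

```python
from typing import Dict, Sequence


def classification_scores(actual: Sequence[int], predicted: Sequence[int]) -> Dict[str, float]:
    """Precision, recall and accuracy for binary labels (1 = defective)."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")

    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # defective, flagged
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # clean, flagged
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # defective, missed
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # clean, not flagged

    # Return 0.0 when a denominator is empty instead of dividing by zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall, "accuracy": accuracy}


# Toy example: 8 modules, 3 of them actually defective.
actual = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0]
scores = classification_scores(actual, predicted)
print(scores)
```

In defect datasets the defective class is often the minority, so accuracy alone can look high even when many defective modules are missed; reporting precision and recall alongside it makes that visible.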

To make our study available to the research community, we have developed an open-source, extensible R application that helps researchers load the selected kinds of datasets, filter them according to their features and apply all the statistical and ML techniques mentioned above.

