DOI: https://doi.org/10.15368/theses.2019.123
Available at: https://digitalcommons.calpoly.edu/theses/2111
Date of Award
12-2019
Degree Name
MS in Computer Science
Department/Program
Computer Science
College
College of Engineering
Advisor
Davide Falessi
Advisor Department
Computer Science
Advisor College
College of Engineering
Abstract
Data from software repositories are a very useful asset for building different kinds of
models and recommender systems aimed at supporting software developers. Specifically,
the identification of likely defect-prone files (i.e., classes in Object-Oriented systems)
helps in prioritizing, testing, and analysis activities. This work focuses on automated
methods for labeling a class in a version as defective or not. The most used methods
for automated class labeling belong to the SZZ family and fail in various circumstances.
Thus, recent studies suggest the use of the affect version (AV) as provided by
developers and available in an issue tracker such as JIRA. However, in many circumstances,
the AV might not be usable because it is unavailable or inconsistent. The
aim of this study is twofold: 1) to measure the AV availability and consistency in
open-source projects, 2) to propose, evaluate, and compare to SZZ, a new method
for labeling defective classes which is based on the idea that defects have a stable
life-cycle in terms of the proportion of versions needed to discover the defect and to fix
the defect. Results related to 212 open-source projects from the Apache ecosystem,
featuring a total of about 125,000 defects, show that the AV cannot be used in the
majority (51%) of defects. Therefore, it is important to investigate automated methods
for labeling defective classes. Results related to 76 open-source projects from the
Apache ecosystem, featuring a total of about 6,250,000 classes that are affected
by 60,000 defects and spread over 4,000 versions and 760,000 commits, show that the
proposed method for labeling defective classes is, on average among projects and defects,
more accurate, in terms of Precision, Kappa, F1, and MCC, than all previously
proposed SZZ methods. Moreover, the improvement in accuracy from combining SZZ
with defects' life-cycle information is statistically significant but practically irrelevant
(
overall and on average, more accurate via defects' life-cycle than any SZZ method.