Date of Award


Degree Name

MS in Computer Science


Computer Science


College of Engineering


Davide Falessi

Advisor Department

Computer Science

Advisor College

College of Engineering


Data from software repositories are a very useful asset for building different kinds of models and recommender systems aimed at supporting software developers. Specifically, the identification of likely defect-prone files (i.e., classes in Object-Oriented systems) helps in prioritizing, testing, and analysis activities. This work focuses on automated methods for labeling a class in a version as defective or not. The most used methods for automated class labeling belong to the SZZ family and fail in various circumstances. Thus, recent studies suggest the use of the affected version (AV) as provided by developers and available in issue trackers such as JIRA. However, in many circumstances, the AV cannot be used because it is unavailable or inconsistent. The aim of this study is twofold: 1) to measure AV availability and consistency in open-source projects, and 2) to propose, evaluate, and compare to SZZ a new method for labeling defective classes, which is based on the idea that defects have a stable life-cycle in terms of the proportion of versions needed to discover the defect and to fix the defect. Results related to 212 open-source projects from the Apache ecosystem, featuring a total of about 125,000 defects, show that the AV cannot be used for the majority (51%) of defects. Therefore, it is important to investigate automated methods for labeling defective classes. Results related to 76 open-source projects from the Apache ecosystem, featuring a total of about 6,250,000 classes that are affected by 60,000 defects and spread over 4,000 versions and 760,000 commits, show that the proposed method for labeling defective classes is, on average among projects and defects, more accurate in terms of Precision, Kappa, F1, and MCC than all previously proposed SZZ methods. Moreover, the improvement in accuracy from combining SZZ with defects' life-cycle information is statistically significant but practically irrelevant; overall and on average, labeling is more accurate via the defects' life-cycle than via any SZZ method.
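
The life-cycle idea in the abstract can be illustrated with a minimal sketch. The abstract gives no formula, so everything here is an assumption made for illustration: versions are numbered 1, 2, 3, ...; each defect has an introduction version (IV), an opening version (OV, when the issue was reported), and a fix version (FV); and the "proportion" is taken to be (FV - IV) / (FV - OV). The function names `proportion`, `estimate_iv`, and `affected_versions` are hypothetical and are not taken from the thesis.

```python
from statistics import mean


def proportion(iv: int, ov: int, fv: int) -> float:
    """Assumed life-cycle ratio: versions from introduction to fix,
    divided by versions from opening (report) to fix."""
    # A defect opened and fixed in the same version would give a zero
    # denominator, so the window is floored at 1 (an assumption).
    window = max(fv - ov, 1)
    return (fv - iv) / window


def estimate_iv(ov: int, fv: int, p: float) -> int:
    """Estimate the introduction version of a defect whose AV is missing,
    using a proportion p learned from defects with a known AV."""
    window = max(fv - ov, 1)
    iv = round(fv - window * p)
    # The defect cannot predate version 1 or be introduced after it was opened.
    return max(1, min(iv, ov))


def affected_versions(iv: int, fv: int) -> list[int]:
    """Versions assumed defective: from IV up to, but excluding, FV."""
    return list(range(iv, fv))


# Defects with a usable AV, given as (IV, OV, FV) version indices.
known = [(1, 3, 4), (2, 4, 6)]
p_avg = mean(proportion(iv, ov, fv) for iv, ov, fv in known)

# A defect without a usable AV: opened in version 5, fixed in version 7.
iv_hat = estimate_iv(ov=5, fv=7, p=p_avg)
labels = affected_versions(iv_hat, 7)
```

Under these assumptions, a class touched by the fix for this defect would be labeled defective in every version listed in `labels`. An SZZ-style method would instead trace the fix commit back through the version history to find where the defect was introduced.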