Key points are not available for this paper at this time.
Context: There is considerable diversity in the range and design of computational experiments to assess classifiers for software defect prediction. This is particularly so, regarding the choice of classifier performance metrics. Unfortunately some widely used metrics are known to be biased, in particular F1.
Yao et al. (Wed,) studied this question.