It’s not a Bug, it’s a Feature: How Misclassification Impacts Bug Prediction

Author(s): Kim Herzig, Sascha Just, Andreas Zeller
Venue: International Conference on Software Engineering
Date: 2013

Type of Experiment: Quasi-Controlled Experiment
Sample Size: 7401
Class/Experience Level: Graduate Student
Participant Selection: Classwork
Data Collection Method: Project Artifact(s)


Herzig et al. describe in detail how they manually classified 7401 issue reports from 5 open source Java projects. The process had two stages: each issue was inspected by two evaluators, and if their classifications (bug, rfe, impr, doc, refac, other) conflicted, both original evaluators conducted a third, joint inspection. The effort alone was significant: about 4 minutes per issue report on average, totaling 725 hours (90 working days).
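
The two-stage procedure described above can be sketched roughly as follows. This is a minimal illustration, not the authors' tooling: the function name `classify_issues`, the `resolve` callback standing in for the joint inspection, and the issue IDs are all hypothetical.

```python
from typing import Callable, Dict

# Category labels as listed in the summary above.
CATEGORIES = {"bug", "rfe", "impr", "doc", "refac", "other"}


def classify_issues(
    first: Dict[str, str],
    second: Dict[str, str],
    resolve: Callable[[str, str, str], str],
) -> Dict[str, str]:
    """Combine two evaluators' labels into one final label per issue.

    first / second map issue id -> label from each evaluator.
    resolve is a hypothetical stand-in for the joint inspection and is
    only called when the two labels conflict.
    """
    final = {}
    for issue_id, label_a in first.items():
        label_b = second[issue_id]
        if label_a not in CATEGORIES or label_b not in CATEGORIES:
            raise ValueError(f"unknown category for {issue_id}")
        if label_a == label_b:
            final[issue_id] = label_a  # evaluators agree, no third stage
        else:
            final[issue_id] = resolve(issue_id, label_a, label_b)
    return final


# Invented example data: the evaluators disagree on the second issue only.
evaluator_1 = {"PROJ-1": "bug", "PROJ-2": "rfe"}
evaluator_2 = {"PROJ-1": "bug", "PROJ-2": "impr"}
conflicts = []


def joint_inspection(issue_id: str, label_a: str, label_b: str) -> str:
    # In the study this was a manual joint review; here we record the
    # conflict and (arbitrarily, for illustration) settle on "impr".
    conflicts.append(issue_id)
    return "impr"


final_labels = classify_issues(evaluator_1, evaluator_2, joint_inspection)
```

Only conflicting issues reach the third step, which keeps the more expensive joint review limited to the cases where it is needed.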

From this effort, Herzig et al. were able to make some substantial claims that could call into question large bodies of previous work. In general, issue report classifications in open source projects are unreliable: more than 40% of issue reports are incorrectly classified (e.g., labeled as a bug when the report actually describes a feature). More than 30% of bug reports are not actually bugs. More importantly, these classification errors affect software defect prediction: 39% of files marked as defective never actually had a bug, and between 16% and 40% of the files predicted to be in the top 10% most error-prone do not actually belong in that category.
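
To make these measures concrete, here is a small sketch of how such rates could be computed from a tracker's original labels and a set of manually verified labels. The function names, issue IDs, file names, and labels are invented for illustration and are not the paper's data or code.

```python
from typing import Dict, Set


def misclassification_rate(original: Dict[str, str], manual: Dict[str, str]) -> float:
    """Share of issue reports whose tracker label differs from the manual label."""
    mismatches = sum(1 for issue in original if original[issue] != manual[issue])
    return mismatches / len(original)


def false_bug_rate(original: Dict[str, str], manual: Dict[str, str]) -> float:
    """Share of reports filed as 'bug' that the manual review did not confirm as bugs."""
    reported = [issue for issue, label in original.items() if label == "bug"]
    not_bugs = [issue for issue in reported if manual[issue] != "bug"]
    return len(not_bugs) / len(reported)


def wrongly_defective_files(
    issue_files: Dict[str, Set[str]],
    original: Dict[str, str],
    manual: Dict[str, str],
) -> Set[str]:
    """Files flagged as defective via tracker labels but never linked to a confirmed bug."""
    flagged = set()
    confirmed = set()
    for issue, files in issue_files.items():
        if original[issue] == "bug":
            flagged |= files
        if manual[issue] == "bug":
            confirmed |= files
    return flagged - confirmed


# Invented toy data: four issue reports and the files their fixes touched.
original = {"X-1": "bug", "X-2": "bug", "X-3": "rfe", "X-4": "bug"}
manual = {"X-1": "bug", "X-2": "rfe", "X-3": "rfe", "X-4": "impr"}
issue_files = {
    "X-1": {"a.java"},
    "X-2": {"b.java"},
    "X-3": {"c.java"},
    "X-4": {"a.java", "d.java"},
}

overall = misclassification_rate(original, manual)
false_bugs = false_bug_rate(original, manual)
wrong_files = wrongly_defective_files(issue_files, original, manual)
```

In this toy data, half of the reports are mislabeled, and two of the three files flagged as defective were never touched by a confirmed bug fix, which is the kind of distortion the paper warns about.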

All future research into software defect prediction (SDP) should seriously consider the implications of this paper. If you must use open source projects, you will likely need to manually verify issue reports in some way; otherwise, your results will likely be questionable.