Local versus Global Lessons for Defect Prediction and Effort Estimation

Author(s): Tim Menzies, Andrew Butcher, David Cok, Andrian Marcus, Lucas Layman, Forrest Shull, Burak Turhan, Thomas Zimmermann
Venue: IEEE Transactions on Software Engineering
Date: 2013

Type of Experiment: Controlled Experiment
Sample Size: 9
Class/Experience Level: Graduate Student
Participant Selection: Classwork
Data Collection Method: Project Artifact(s)


Menzies et al. focus on which source of data is most appropriate when building a software defect prediction model. Most of the literature in the field indicates that within-project data leads to the highest-performing models. However, some research indicates that cross-project data can produce models that perform comparably well and are generally more robust, with the added advantage that they can be applied to a new project before any within-project training data exists.

In this paper, Menzies et al. provide evidence that conflicts with their own previous findings: within-project data is not necessarily the best source for software defect prediction. Using a (somewhat) novel clustering approach (WHERE/WHICH), they conclude that "the best lessons for a project from one source come from neighboring clusters with data from nearby sources, but not inside that source." This means that software defect prediction (SDP) should ignore existing source divisions: all available data, regardless of origin, should be pooled and considered for use. From that pooled data, the clusters neighboring project 'X' should be identified and used to train the model that is applied to 'X'.
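The pool-then-learn-from-neighbors idea can be sketched as follows. This is a minimal illustration, not the authors' WHERE/WHICH implementation: plain k-means stands in for the WHERE clusterer, a nearest-neighbor lookup stands in for the WHICH rule learner, and the data, labels, and test point are all synthetic.

```python
import math
import random

random.seed(1)

# Pool instances from ALL sources, ignoring project boundaries
# (synthetic 2-D "metrics" with a synthetic defect label).
points = [(random.random(), random.random()) for _ in range(200)]
labels = [1 if x + y > 1.0 else 0 for x, y in points]

def dist(a, b):
    return math.dist(a, b)

# Simple k-means: a stand-in for the paper's WHERE clusterer.
def kmeans(pts, k=4, iters=20):
    centers = random.sample(pts, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in pts:
            groups[min(range(k), key=lambda i: dist(p, centers[i]))].append(p)
        centers = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
    return centers

centers = kmeans(points)
assign = [min(range(len(centers)), key=lambda i: dist(p, centers[i]))
          for p in points]

# For a test instance: find its own cluster, then train on the data in
# the NEAREST NEIGHBORING cluster rather than in its own cluster.
test = (0.7, 0.6)
own = min(range(len(centers)), key=lambda i: dist(test, centers[i]))
neighbor = min((i for i in range(len(centers)) if i != own),
               key=lambda i: dist(centers[own], centers[i]))

train = [(p, l) for p, l, a in zip(points, labels, assign) if a == neighbor]

# Trivial 1-nearest-neighbor "model" over the neighbor cluster's data:
# a stand-in for the WHICH rule learner.
pred = min(train, key=lambda pl: dist(test, pl[0]))[1]
print(pred)
```

The key step is that `train` deliberately excludes the test instance's own cluster, mirroring the paper's finding that the best lessons come from nearby data rather than from inside the same source.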