Authors: Tim Menzies, Andrew Butcher, David Cok, Andrian Marcus, Lucas Layman, Forrest Shull, Burak Turhan, Thomas Zimmermann
Venue: Transactions on Software Engineering
Date: 2013
Type of Experiment: Controlled Experiment
Sample Size: 9
Class/Experience Level: Graduate Student
Participant Selection: Classwork
Data Collection Method: Project Artifact(s)
Menzies et al. focus on which source of data is most appropriate for building a software defect prediction (SDP) model. A majority of the literature in the field indicates that within-project data leads to the best-performing models. However, some research has indicated that across-project data can lead to models that perform comparably well and are generally more robust (with the added advantage that they can be applied to a new project that has no within-project training data).
In this paper, Menzies et al. provide evidence that conflicts with their own previous findings: within-project data is not necessarily the best source for software defect prediction. Using a (somewhat) novel clustering approach (WHERE/WHICH), they conclude that "the best lessons for a project from one source come from neighboring clusters with data from nearby sources, but not inside that source." This means that SDP should ignore all existing source divisions: all available data should be pooled, the clusters neighboring a target project 'X' should be identified within that pool, and those clusters should be used to train the model applied to 'X'.
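The pooling-and-neighbor-clustering idea can be sketched roughly as follows. This is a hypothetical illustration, not the paper's actual WHERE/WHICH algorithms: it uses a plain k-means clustering and a toy pooled dataset, and the function and field names (`neighbor_training_set`, `source`, `features`, `defective`) are invented for the example.

```python
import random

random.seed(0)  # deterministic toy run


def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def kmeans(points, k, iters=20):
    """Minimal k-means; returns the final centroids."""
    centroids = random.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[min(range(k), key=lambda i: dist(p, centroids[i]))].append(p)
        centroids = [
            [sum(c) / len(g) for c in zip(*g)] if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    return centroids


def neighbor_training_set(all_rows, target_source, k=3):
    """Pool data from every source, cluster it, and keep the rows that fall in
    the same clusters as the target project's data -- but drop the target's
    own rows, so the model is trained only on nearby data from other sources."""
    points = [r["features"] for r in all_rows]
    centroids = kmeans(points, k)

    def cluster_of(p):
        return min(range(k), key=lambda i: dist(p, centroids[i]))

    target_clusters = {cluster_of(r["features"])
                       for r in all_rows if r["source"] == target_source}
    return [r for r in all_rows
            if cluster_of(r["features"]) in target_clusters
            and r["source"] != target_source]


# Toy pooled dataset: rows from three projects, each with a two-metric
# feature vector and a defect label (all values are made up).
rows = [{"source": s, "features": [x, y], "defective": d}
        for s, x, y, d in [
            ("A", 1.0, 1.1, 0), ("A", 0.9, 1.0, 0), ("A", 5.0, 5.2, 1),
            ("B", 1.1, 0.9, 0), ("B", 5.1, 4.9, 1), ("B", 5.2, 5.0, 1),
            ("C", 9.0, 9.1, 1), ("C", 8.9, 9.2, 1),
        ]]

# Training data for project "A": neighboring rows from other sources only.
train = neighbor_training_set(rows, target_source="A")
```

The resulting `train` set would then feed whatever defect predictor is in use; the key point mirrored from the paper is that the target project's own rows are excluded while nearby rows from other sources are kept.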