Does Bug Prediction Support Human Developers? Findings from a Google Case Study

Author(s): Lewis, C.; Lin, Z.; Sadowski, C.; Zhu, X.; Ou, R.; Whitehead, E.J.
Venue: International Conference on Software Engineering (ICSE), San Francisco, CA
Date: 2013

Type of Experiment: Controlled Experiment
Sample Size: 19
Class/Experience Level: Professional
Participant Selection: Volunteers (responded to a call for participation)
Data Collection Method: Observation

Quality
4

A few bug prediction algorithms are in widespread use, but the authors note that these algorithms have not been well tested by multiple empirical studies in software development settings.

The authors sought to answer the following three research questions:

  • According to expert opinion, given a collection of bug prediction algorithms, which is preferred?
  • What are the desirable characteristics a bug prediction algorithm should have?
  • Using the knowledge gained from the first two questions to design a suitable algorithm, do developers modify their behavior when presented with bug prediction results?

The researchers set up an experiment that applied existing bug prediction algorithms to two large code bases at Google (projects A and B): two FixCache variants (Duration Cache and Cache-20) and the Rahman algorithm. Each algorithm produced a list of files it identified as bug-prone. The 19 volunteers then had 30 minutes to go through each list for either project A or B and mark each file as 'bug-prone', 'not bug-prone', 'ambivalence' (no feeling either way), or 'unknown' (no experience with the file). Finally, the volunteers were asked to rate each list as a whole using the same markings. The experiment was double-blind: neither the researchers nor the volunteers knew which list came from which algorithm. The volunteers were allowed to ask questions, but the answers were kept vague so as not to sway their markings.

The Rahman algorithm performed significantly better than the FixCache variants at identifying bug-prone files. Both the Duration Cache and Cache-20 lists mostly contained unknown files; only the Rahman list contained files that the interviewees both knew and felt to be bug-prone. This result does not necessarily mean that the files on the Duration Cache and Cache-20 lists are not bug-prone, only that developers did not have enough experience with those files to comment.
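For context, the core idea of the Rahman algorithm is to rank files by how many bug-fixing commits have touched them and flag the top of that ranking, while FixCache instead maintains a rolling "cache" of files considered likely to contain bugs. The sketch below is a minimal illustration of the Rahman-style ranking only; the commit data format, file names, and the cutoff of 20 files are assumptions for illustration, not the paper's implementation.

    # Minimal sketch of a Rahman-style ranking: count, per file, the number of
    # bug-fixing commits that touched it, then report the most-fixed files.
    # Assumes bug-fixing commits can already be identified (e.g., via linked bug IDs).
    from collections import Counter
    from typing import Iterable, List, Tuple

    def rahman_rank(bug_fix_commits: Iterable[List[str]],
                    top_n: int = 20) -> List[Tuple[str, int]]:
        """Rank files by the number of distinct bug-fixing commits touching them.

        bug_fix_commits: each element is the list of file paths modified by one
        bug-fixing commit. top_n is an illustrative cutoff, not from the paper.
        """
        counts = Counter()
        for files in bug_fix_commits:
            counts.update(set(files))  # count each file at most once per commit
        return counts.most_common(top_n)

    # Hypothetical commit data, purely for demonstration:
    commits = [
        ["src/parser.cc", "src/lexer.cc"],
        ["src/parser.cc"],
        ["src/net/socket.cc", "src/parser.cc"],
    ]
    print(rahman_rank(commits, top_n=2))  # [('src/parser.cc', 3), ('src/lexer.cc', 1)]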

The researchers then asked what developers desired of a bug prediction algorithm. They wanted something that was:

  • Actionable - clear steps to fix a flagged area
  • Obvious Reasoning - a strong, visible, and obvious reason why the flagging took place
  • Bias Towards the New - developers cared more about new and recently changed files than about old, stable ones

Additionally, they wanted an algorithm that is scalable and parallelizable.

The researchers then modified the Rahman algorithm to bias it towards recently changed files, but the developers still found no real use for it. The authors hope that a bug prediction tool will one day be created that answers these developer desires.
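The paper's modified scoring is not reproduced in this summary, but the "bias towards the new" idea can be illustrated by weighting each bug-fixing commit by its recency, so that recent fixes dominate a file's score. The exponential decay and the 90-day half-life below are assumptions chosen for illustration, not the paper's actual formula or parameters.

    # Illustrative sketch only: a recency-weighted bug-fix count, so that files
    # with recent bug fixes rank higher than files with equally many old fixes.
    import math
    import time
    from collections import defaultdict
    from typing import Dict, Iterable, List, Optional, Tuple

    def time_weighted_rank(bug_fixes: Iterable[Tuple[str, float]],
                           now: Optional[float] = None,
                           half_life_days: float = 90.0,
                           top_n: int = 20) -> List[Tuple[str, float]]:
        """Rank files by a recency-weighted count of bug-fixing commits.

        bug_fixes: (file_path, commit_timestamp) pairs, one per bug-fixing commit
        that touched the file. half_life_days controls how quickly old fixes stop
        mattering; the value here is an assumed parameter, not from the paper.
        """
        now = time.time() if now is None else now
        decay = math.log(2) / (half_life_days * 86400.0)  # per-second decay rate
        scores: Dict[str, float] = defaultdict(float)
        for path, timestamp in bug_fixes:
            scores[path] += math.exp(-decay * (now - timestamp))  # recent fix ~ 1.0
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

With a decay of this shape, a fix from today counts roughly twice as much as a fix from three months ago, which is one simple way to encode the developers' preference for newer signals over old, stable files.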
