Does Bug Prediction Support Human Developers? Findings from a Google Case Study

Author(s): C. Lewis, Z. Lin, C. Sadowski, X. Zhu, R. Ou, E.J. Whitehead
Venue: 2013 35th International Conference on Software Engineering (ICSE)
Date: 18-26 May 2013

Type of Experiment: Case Study
Sample Size: 19
Class/Experience Level: Professional
Participant Selection: Subjects responded to email request sent out to two project teams at Google.
Data Collection Method: Survey


This paper explores the helpfulness of bug prediction software to professional software developers. The authors of this paper partnered with Google's Engineering Tools department to evaluate different bug prediction algorithms and attempt to develop one for use at Google.

First, the authors evaluated three of the most popular bug prediction algorithms: FixCache with a reduced cache size, FixCache with output ranked by duration, and Rahman. They interviewed 19 members of two anonymous projects at Google, showing each person the list of files flagged by each of the three algorithms and asking them to rate how bug-prone they believed each list to be. Developers strongly preferred the Rahman algorithm over the other two. The authors also asked their interviewees which characteristics they would most like in a bug prediction algorithm. The top three were actionable messages, obvious reasoning, and bias towards the new. None of the algorithms matched all three characteristics. Beyond these characteristics, the authors were also concerned with scalability. The FixCache algorithm takes too much cache space and is not easily parallelizable, which made it a bad fit for a company with a code base as large as Google's. The Rahman algorithm, on the other hand, met Google's scalability needs. For these reasons, the authors decided to modify the Rahman algorithm to create a bug prediction algorithm for Google.

Next, the authors created the Time-Weighted Risk (TWR) algorithm, based on the Rahman algorithm, and deployed it through Google's code review software, Mondrian. During code review, the system attached a comment to line 1 of any file flagged by the algorithm. Three months later, the authors generated a list of the files flagged by the system and analyzed two measures for reviews that contained those bug-prone files: the average time from submission to approval and the average number of comments per review.
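To make the idea of time weighting concrete, the following is a minimal sketch rather than the authors' implementation. It assumes each bug-fixing commit to a file contributes a logistic weight based on its normalized timestamp, so that recent fixes count far more than old ones. The function names, the input layout (a dictionary of file paths to commit timestamps), and the steepness constant of 12 are all assumptions made for illustration.

```python
import math


def twr_score(fix_times, start, end, steepness=12.0):
    """Sum logistic weights over one file's bug-fixing commit timestamps.

    Assumption: timestamps are normalized so that `start` maps to 0 and
    `end` (the most recent point in the history) maps to 1.
    """
    span = end - start
    if span <= 0:
        raise ValueError("end must be later than start")
    score = 0.0
    for ts in fix_times:
        t = (ts - start) / span
        # Weight is about 0.5 for a fix at t = 1 and near 0 for a fix at t = 0.
        score += 1.0 / (1.0 + math.exp(-steepness * t + steepness))
    return score


def rank_files(fix_history, start, end, top_n=None):
    """Return file paths ordered from highest to lowest score."""
    scores = {
        path: twr_score(times, start, end)
        for path, times in fix_history.items()
    }
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked if top_n is None else ranked[:top_n]
```

Under these assumptions, a file with a single very recent fix outranks a file with several fixes from the start of the history, which reflects the "bias towards the new" that developers asked for. A plain count of bug-fixing commits would rank the two files the other way around.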

In conclusion, the authors found that deploying bug prediction had no observable effect on developer behavior. Feedback from the developers showed that, in order to find the tool useful, they needed a way to remove the bug-prone flag from a file. Actionable messages should therefore be a key feature of any bug prediction software intended for use by developers.