Sound Empirical Evidence in Software Testing

Author(s): Gordon Fraser, Andrea Arcuri
Venue: Software Engineering (ICSE), 2012 34th International Conference
Date: 2012

Sample Size: 8784
Class/Experience Level: Other
Participant Selection: "Randomly selected 100 Java projects from SourceForge, which is the most popular open source repository"
Data Collection Method: Project Artifact(s)


This article recognizes the strong need for testing in software. The authors acknowledge the time investment required to write good tests, and analyze a technique for automatically generating tests that achieve branch coverage in object-oriented software. They explain that, for complex testing techniques, researchers must rely on empirical studies to support their results, since mathematical proofs about such techniques are too difficult. However, a technique that performs well in the lab during an empirical study may not perform well in the real world, and the case studies used in empirical studies are usually not chosen systematically. They propose and execute an empirical study in which the choice of case studies (drawn from open source software) is statistically sound, offering a more systematic approach to choosing case studies.

For their experiment, they randomly selected 100 Java projects from SourceForge, which together contained 8784 classes. Because the projects were selected at random, the sample is statistically representative of open source software. Then, EVOSUITE (a search-based test generation tool) was applied to each project 10 times with different random seeds to account for the randomness of the search algorithm. They showed that "test generation can indeed achieve high coverage – but only on a certain type of classes". They found that environment dependencies were the key factor hindering test coverage, and suggested future research in this area.
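
The sampling and repetition described above can be sketched in a few lines. This is only an illustration of the experimental design, not the authors' scripts: the function names (sample_projects, average_coverage), the project identifiers, and the stand-in fake_tool are all hypothetical, and the sketch does not invoke EVOSUITE itself.

```python
import random
from statistics import mean


def sample_projects(all_projects, k=100, seed=0):
    """Draw k projects uniformly at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(list(all_projects), k)


def average_coverage(target, run_tool, runs=10):
    """Run a randomized tool once per seed and average the reported coverage."""
    return mean(run_tool(target, seed) for seed in range(runs))


def fake_tool(target, seed):
    # Hypothetical stand-in for a test generator: returns a
    # pseudo-random "coverage" value in [0, 1) that depends on the seed.
    return random.Random(f"{target}-{seed}").random()


# Assumed population of project identifiers (placeholder names).
projects = [f"project-{i}" for i in range(1000)]
chosen = sample_projects(projects, k=100, seed=42)
coverage = {p: average_coverage(p, fake_tool, runs=10) for p in chosen}
```

Averaging over several seeds matters because a single run of a randomized search can be unusually lucky or unlucky; fixing the sampling seed keeps the selection itself reproducible.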

The authors assert that their "selection of 100 SourceForge projects ... can serve as a corpus of classes for the field of test generation for object-oriented software." After completing a statistically sound analysis of EVOSUITE, the authors explain that "empirical analysis (Section III) has shown that 90.7% of classes may lead to interactions with their environment". So for most applications, EVOSUITE would not achieve high test coverage, whereas an empirical study of the tool on a hand-selected sample could produce high test coverage (by choosing classes that do not interact with their environment). The authors explain that software engineering case studies suffer from threats to external validity, and that these threats are the main barrier to converting research into practice. They conclude by posing a challenge to the research community: to "develop novel testing techniques to achieve at least 80% of bytecode branch coverage" on the corpus described above. They hope that techniques reaching this 80% coverage will be more applicable to real-world situations.