Recovering Traceability Links Between Code and Documentation

Author(s): Antoniol, G., Canfora, G., Casazza, G., De Lucia, A., Merlo, E.
Venue: IEEE Transactions on Software Engineering
Date: October 2002

Type of Experiment: Case Study
Sample Size: 8
Class/Experience Level: Undergraduate Student, Graduate Student
Data Collection Method: Project Artifact(s)



The authors studied the application of two information retrieval (IR) techniques to tracing source code to documentation: a probabilistic model and a vector space model with term frequency-inverse document frequency (tf-idf) weighting. In the first experiment, they applied these techniques to trace C++ source code to manual pages that had been generated from the source code. In the second experiment, they applied the techniques to trace Java code to functional requirements. Lastly, they compared the results of the IR techniques with those of two groups of students. Both groups received the functional requirements and source code from the second experiment; one group also received the ranked list of candidate links obtained from the probabilistic IR method.
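
To make the vector space idea concrete, here is a minimal sketch of tf-idf ranking. It is not the authors' implementation: the tokenizer, the unsmoothed idf formula, the function names, and the sample requirement texts are all assumptions chosen for illustration. Text extracted from a source code component is treated as the query, and each document is scored by cosine similarity to it.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic runs only (a simplifying assumption)."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(token_lists):
    """Return one sparse tf-idf vector (a dict) per token list, plus the idf table."""
    n = len(token_lists)
    df = Counter(term for tokens in token_lists for term in set(tokens))
    idf = {term: math.log(n / count) for term, count in df.items()}
    vectors = [
        {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for tokens in token_lists
    ]
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either has zero length."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_documents(query_text, documents):
    """Rank document names by similarity to the query, most similar first."""
    names = list(documents)
    doc_vectors, idf = tfidf_vectors([tokenize(documents[name]) for name in names])
    # Query terms that never occur in the documents get weight 0.
    query_vector = {
        term: tf * idf.get(term, 0.0)
        for term, tf in Counter(tokenize(query_text)).items()
    }
    scores = [(name, cosine(query_vector, vec)) for name, vec in zip(names, doc_vectors)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical requirement texts and a query built from a code component's words.
requirements = {
    "req_login": "the user shall log in with a password",
    "req_report": "the system shall print a monthly sales report",
}
for name, score in rank_documents("class LoginForm checks user password", requirements):
    print(name, round(score, 3))
```

The resulting ranked list plays the role of the candidate links that an analyst would then inspect from the top down.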

The first two experiments show comparable results for the probabilistic and vector space models. The probabilistic model achieves higher recall when fewer documents are retrieved, whereas the vector space model approaches 100% recall faster as more documents are retrieved. In the last experiment, both groups of students performed better than the IR method alone, and the group that received the ranked list of candidate links outperformed the control group.
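
The trade-off between recall and the number of documents retrieved can be illustrated with a short sketch. The function name, the cut-off parameter `k`, and the sample data are hypothetical and carry no results from the study; the code only shows how recall (and its usual companion, precision) would be computed for the top `k` entries of a ranked list.

```python
def precision_recall_at(ranked, correct, k):
    """Precision and recall when only the top-k ranked documents are retrieved."""
    retrieved = ranked[:k]
    hits = sum(1 for doc in retrieved if doc in correct)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(correct) if correct else 0.0
    return precision, recall


# Hypothetical ranked list for one code component and its true links.
ranked = ["doc_a", "doc_b", "doc_c", "doc_d"]
correct = {"doc_a", "doc_c"}
for k in range(1, len(ranked) + 1):
    precision, recall = precision_recall_at(ranked, correct, k)
    print(k, round(precision, 2), round(recall, 2))
```

Raising `k` can only keep recall the same or increase it, while precision typically falls, which is why a method is judged by how few documents must be retrieved to reach a given recall.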