Extreme Programming: A Survey of Empirical Data from a Controlled Case Study

Author(s): Pekka Abrahamsson and Juha Koskela
Venue: International Symposium on Empirical Software Engineering (ISESE '04)
Date: Aug. 2004

Type of Experiment: Case Study
Sample Size: 4

Quality: 3

SUMMARY
Extreme Programming: A Survey of Empirical Data from a Controlled Case Study describes and analyzes the data from an empirical, controlled case study in which four software engineers were tasked with implementing a web-based system (7,698 LOC, 820 hours) in eight weeks using extreme programming practices. The purpose of the paper is to give researchers and practitioners in software engineering an empirical reference point on extreme programming. The empirical data comes from five system releases, each of which was tested by 17 customers. Overall, the system release defect density was 1.43 defects/KLOC, overall team productivity was 16.90 LOC/hour, and rework costs were 9.8% of the total development effort. Quantitative data covered time, size, and defect rates; qualitative data was gathered from development diaries maintained by the developers, recordings of post-mortem analysis sessions, and developer interviews. The quality of the data was systematically monitored by the project manager, dedicated metrics, an on-site customer, and the customer organization’s management. The project’s development schedule and resources were fixed, so only the delivered functionality was somewhat flexible. Because the system had a large number of potential users (300+), the requirements were well known before the project was even initiated and could not be modified.
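
As a rough illustration of how these headline metrics relate, the following back-of-the-envelope sketch (in Python, using the figures reported above) recomputes them. Note that 7,698 LOC over 820 total hours is only about 9.4 LOC/hour, so the reported 16.90 LOC/hour presumably covers programming effort alone; that split is an assumption, flagged in the comments.

    # Back-of-the-envelope check of the reported project metrics.
    # Figures come from the study as summarized above; the split of the
    # 820 total hours into coding vs. other effort is an assumption.
    LOC = 7698             # delivered lines of code
    TOTAL_HOURS = 820      # total project effort in hours
    DEFECT_DENSITY = 1.43  # reported defects per KLOC
    PRODUCTIVITY = 16.90   # reported LOC per hour

    # Implied number of defects across the five releases.
    print(DEFECT_DENSITY * LOC / 1000)   # ~11.0 defects

    # Productivity over *total* effort is well below the reported figure,
    # suggesting 16.90 LOC/hour was computed over programming hours only.
    print(LOC / TOTAL_HOURS)             # ~9.39 LOC/hour
    print(LOC / PRODUCTIVITY)            # ~455 implied programming hours

    # Rework cost: 9.8% of total development effort.
    print(0.098 * TOTAL_HOURS)           # ~80 hours of rework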

The team comprised 5th-6th year university students with one to four years of industrial experience in software development, all of whom were well versed in Java and object-oriented techniques. Two weeks prior to project launch, the team prepared by self-studying two basic books on extreme programming.

At the completion of the project, the data showed that roughly 10% of the overall effort was spent on planning release contents. Project management activities (data collection and analysis, monitoring project progress, and developing a project plan) required 13.4% of the total effort. Project meetings took 4.5%, and the bulk of the developers' effort, 54.7% of the project total, went to unit test development, production code, development spikes, and refactoring. Additionally, the team had major difficulties decomposing the contents of the first release into small tasks, which aligns with other extreme programming research claiming that the ability to estimate accurately is a skill learned only over time, through experience. Even so, estimation error variance remained high throughout the project, and no clear data was reported to support an improvement in “guesstimation” as the project progressed. However, while estimates were not accurate in terms of error percentage, there was identifiable improvement in terms of hours lost to faulty estimates, as the sketch below illustrates.
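
To make that distinction concrete, here is a minimal sketch (with hypothetical numbers, not taken from the paper) of why percentage error can stay flat while the hours lost to faulty estimates improve: as releases are decomposed into smaller tasks, the same relative error costs fewer absolute hours.

    # Hypothetical illustration: the same 50% relative estimation error
    # costs far fewer absolute hours once tasks are decomposed finely.
    def estimation_stats(estimated_hours, actual_hours):
        """Return (error percentage, absolute hours lost) for one task."""
        hours_lost = abs(actual_hours - estimated_hours)
        error_pct = hours_lost / estimated_hours * 100
        return error_pct, hours_lost

    # Early release: one large, coarsely decomposed task.
    print(estimation_stats(40, 60))  # (50.0, 20) -> 20 hours lost

    # Later release: the same relative error on a much smaller task.
    print(estimation_stats(4, 6))    # (50.0, 2) -> only 2 hours lost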

The proportion of hours dedicated to tasks dropped from the initial 70% to 50-60% in the two-week release cycles, and to below 50% in the one-week cycle. This result indicates that overhead increases in very short development cycles. The authors also stress that a one-week release cycle to end-user testing was not always appreciated by users, who found it disturbing to have a new version in hand after only one week.

Although time-consuming, testing resulted in 38 improvement suggestions (new or improved user functionality) that improved the overall system. Surprisingly, one of the important outcomes of the study is the finding that even though a customer was on-site for the developers to interact with, there turned out to be little need for “actual customer involvement” in the project. In fact, the majority of the customer’s involvement was limited to the planning game (42.8%) and acceptance testing (29.9%). The presence of an on-site customer, however, did seem to positively affect the team in other, less tangible ways.

Pair programming was mandatory for the first release, after which programmers could choose whether or not to pair. The fact that pair programming time remained well above 70% in subsequent releases is taken to reflect how comfortable the team was with the practice. The data obtained does not, however, show a relation between the use of pair programming and the level of productivity achieved.

Refactoring data, on the other hand, is asserted to reveal interesting insights. The highest level of productivity was achieved when only 5.9% of the effort was spent on refactoring, indirectly supporting the argument that extensive refactoring decreases team productivity.
