A case study of an empirical approach to component requirements in developing a plagiarism detection tool

Author(s): Hanakawa, Noriko, and Mike Barker
Venue: 13th Asia Pacific Software Engineering Conference (APSEC'06)
Date: 2006

Type of Experiment: Case Study
Sample Size: 76



The paper describes how the authors extracted detailed component requirements for a plagiarism detection tool by automatically collecting empirical data on students’ behavior while investigating websites. The tool highlights which sentences of a report (including Japanese sentences) were copied from websites, together with the corresponding URLs. The empirical data are logs from WebTracer, a monitoring web browser that records the students’ website-investigation behavior; the logs are then used to build a web document space for the plagiarism detection tool. The paper concludes that, using the collected empirical data and a prototype, the detailed requirements produced a plagiarism detection tool with 71% accuracy.

Process Outline

  1. Create a candidate scenario by detecting the target search engines, the keywords used for searching, and the number of web pages examined per search engine
  2. Build a web document space based on the candidate scenario, using a program that collects the web pages
  3. Run the prototype to detect plagiarized sentences in the reports
  4. Evaluate the prototype's results by comparing the detected plagiarized sentences
  5. Tune the candidate scenario if necessary

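Steps 2 and 3 above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the document space is modeled as a simple URL-to-text mapping, and matching is naive verbatim sentence containment (the paper does not specify its matching algorithm, and its Japanese support would need a proper sentence segmenter).

```python
import re

def sentences(text):
    # Naive sentence splitter (assumption); handles "." and the
    # Japanese full stop, but a real segmenter would be needed.
    return [s.strip() for s in re.split(r"[.!?。]", text) if s.strip()]

def detect_plagiarism(report, document_space):
    # document_space: {url: page_text} built from the candidate
    # scenario (step 2). Flags report sentences that appear verbatim
    # in a collected page, with the corresponding URL (step 3).
    hits = []
    for sent in sentences(report):
        for url, page in document_space.items():
            if sent in page:
                hits.append((sent, url))
                break
    return hits

space = {"http://example.com/a": "Plagiarism is copying. Other text."}
print(detect_plagiarism("Plagiarism is copying. My own words.", space))
```

Only the first sentence of the sample report is flagged, since the second does not appear in any page of the document space.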
With the following detailed requirements for building the web document space, the plagiarism detection tool reached 71% accuracy, which was sufficient for this experiment:

  1. Use the top six keywords
  2. Generate keyword lists by combining up to six keywords
  3. Extract 10 clauses from the report assignment and add them to the keyword lists
  4. After each Google search with the keyword lists, download the first 10 web pages returned
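Requirements 1 and 2 can be sketched as below. The keyword-extraction method is an assumption (simple frequency ranking); the paper only states that the top six keywords are used and that keyword lists combine up to six of them.

```python
import re
from collections import Counter
from itertools import combinations

def top_keywords(text, n=6):
    # Assumed extraction: rank words by frequency. The paper does not
    # specify how the top six keywords are chosen.
    words = re.findall(r"\w+", text.lower())
    return [w for w, _ in Counter(words).most_common(n)]

def keyword_lists(keywords):
    # Requirement 2: keyword lists are all combinations of up to
    # six keywords (here, up to all of the given keywords).
    lists = []
    for r in range(1, len(keywords) + 1):
        lists.extend(combinations(keywords, r))
    return lists

kws = top_keywords("alpha beta gamma delta epsilon zeta alpha")
queries = keyword_lists(kws)
```

With six keywords this yields 63 keyword lists (all non-empty combinations), each of which would drive one Google search in requirement 4.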