Detecting Similar Repositories on GitHub

Author(s): Yun Zhang, David Lo, Pavneet Singh Kochhar, Xin Xia, Quanlai Li, Jianling Sun
Venue: Automated Software Engineering
Date: 2017

Type of Experiment: Controlled Experiment
Sample Size: 50
Class/Experience Level: Undergraduate Student
Participant Selection: PhD students evaluated repositories for similarity.
Data Collection Method: Observation


This paper presents RepoPal, a novel method for detecting similar repositories on GitHub, and compares RepoPal's recommendations with those of CLAN, an existing system that recommends similar repositories for Java projects. CLAN attempts to determine the most similar repositories for a given 'query' repository by analyzing the Java API usage patterns in the source files of these repositories. CLAN improved on prior research in this space; RepoPal, however, draws on significantly more of the information publicly available on GitHub to formulate its recommendations. The authors define repository similarity using three heuristics.

Heuristic 1: Readme-based relevance uses term frequency-inverse document frequency (TF-IDF) analysis, a natural language processing technique, to compute a similarity score from the text of each repository's readme file. The motivating idea is that repositories described in similar language are very likely to be similar.
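The readme comparison in Heuristic 1 can be illustrated with a minimal sketch. It assumes a simple lowercase tokenizer, a smoothed IDF weighting, and cosine similarity as the comparison step; the summary above only names TF-IDF, so these choices, the function names, and the sample readme strings are all hypothetical and are not taken from the paper.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each term once per document
    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        # Smoothed IDF (an assumption): log((1 + N) / (1 + df)) + 1
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in term_freq.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up readme snippets, for illustration only.
readmes = [
    "A fast JSON parser and serializer for Java",
    "Lightweight JSON parser library for Java applications",
    "Image filters and photo editing toolkit",
]
vecs = tfidf_vectors(readmes)
sim_json = cosine(vecs[0], vecs[1])   # two JSON libraries
sim_other = cosine(vecs[0], vecs[2])  # JSON library vs. image toolkit
print(sim_json > sim_other)
```

Under these assumptions, the two JSON-related readmes score higher against each other than either does against the image toolkit, which is the behavior the heuristic relies on.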

Heuristic 2: Stargazer-based relevance uses publicly available GitHub data to compute the similarity between two 'stargazers' (GitHub users who star a repository to bookmark it and follow its progress). Stars on GitHub signify community interest in a repository. The premise behind this similarity score is: given two users A and B, what proportion of all the repositories starred by A or B were starred by both (the number of commonly starred repositories divided by the total number of distinct repositories the two users starred)?
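The proportion described for Heuristic 2 is what is commonly called the Jaccard similarity of the two users' star sets. The sketch below shows that calculation only; the function name and the star lists are hypothetical, and how the paper aggregates user-level scores into a repository-level score is not covered here.

```python
def star_similarity(stars_a, stars_b):
    """Jaccard similarity: repositories starred by both users / repositories starred by either."""
    a, b = set(stars_a), set(stars_b)
    union = a | b
    if not union:
        return 0.0  # neither user has starred anything
    return len(a & b) / len(union)


# Made-up star lists, for illustration only.
user_a = {"org1/json-lib", "org2/http-client", "org3/test-runner"}
user_b = {"org1/json-lib", "org2/http-client", "org4/stream-engine", "org5/web-framework"}

score = star_similarity(user_a, user_b)  # 2 shared out of 5 distinct repositories
print(score)
```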

Heuristic 3: Time-based relevance examines the time elapsed between the repositories a given GitHub user stars. The intuition is that as a user explores repositories in search of a solution to their current problem, they may star related repositories that appear useful in order to keep tabs on their progress and potentially use them in the future, so repositories starred close together in time are more likely to be related. This heuristic is an interesting way to evaluate repository similarity because it uses temporal data to infer user behavior instead of examining the contents of the repository.
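One way to turn the intuition behind Heuristic 3 into a number is to score a pair of repositories by the inverse of the gap between the times a user starred them. The summary above does not give the exact formula, so the inverse-gap form, the `floor_seconds` guard against division by zero, and the timestamps below are all assumptions made for illustration.

```python
from datetime import datetime


def time_relevance(starred_at_1, starred_at_2, floor_seconds=60.0):
    """Score two stars by the same user: the smaller the time gap, the higher the score.

    floor_seconds is an assumed lower bound on the gap so that two stars
    with identical timestamps do not cause a division by zero.
    """
    gap = abs((starred_at_1 - starred_at_2).total_seconds())
    return 1.0 / max(gap, floor_seconds)


# Made-up timestamps, for illustration only.
t0 = datetime(2017, 1, 1, 12, 0)
same_session = time_relevance(t0, datetime(2017, 1, 1, 12, 10))  # starred 10 minutes apart
months_apart = time_relevance(t0, datetime(2017, 4, 1, 12, 0))   # starred 3 months apart
print(same_session > months_apart)
```

Any monotonically decreasing function of the gap would express the same idea; the inverse is used here only because it is the simplest.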

The authors' experiment began by collecting 1,000 Java repositories hosted on GitHub so that RepoPal's recommendations could be compared with those of CLAN. Java repositories were used because CLAN operates only on Java source files. However, the authors claim that RepoPal is general and should perform comparably on other sets of repositories.

The comparative analysis showed statistically significant improvements over CLAN on each of the three evaluation metrics. The first was the proportion of successful top-5 recommendations among all recommendations, where a recommendation counted as successful if it contained at least one 'highly relevant' or 'relevant' repository. The second, Confidence, was the median and mean of the relevance degrees participants gave to all repositories recommended by the system. The third, Precision, was the proportion of relevant and highly relevant repositories in the recommendation set for a given query. RepoPal's improvements on these metrics ranged from 36% to over 66%, with significance tested at the 99.9% confidence level.
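The three metrics can be sketched as follows. The code assumes participants rated each recommended repository on a 1-to-5 scale where 4 means 'relevant' and 5 means 'highly relevant'; that scale, the threshold constant, the function names, and the sample ratings are assumptions for illustration and are not data from the paper.

```python
from statistics import mean, median

RELEVANT_THRESHOLD = 4  # assumed: 4 = 'relevant', 5 = 'highly relevant'


def success_rate(ratings_per_query):
    """Share of queries whose top-5 list contains at least one relevant repository."""
    hits = sum(
        1 for ratings in ratings_per_query
        if any(r >= RELEVANT_THRESHOLD for r in ratings)
    )
    return hits / len(ratings_per_query)


def confidence(ratings_per_query):
    """Mean and median rating over every recommended repository."""
    flat = [r for ratings in ratings_per_query for r in ratings]
    return mean(flat), median(flat)


def precision(ratings):
    """Share of one query's recommendations rated relevant or highly relevant."""
    return sum(1 for r in ratings if r >= RELEVANT_THRESHOLD) / len(ratings)


# Made-up ratings for three queries, five recommendations each.
ratings = [
    [5, 4, 2, 1, 3],
    [2, 2, 1, 3, 1],
    [4, 4, 5, 3, 2],
]

print(success_rate(ratings))            # 2 of the 3 queries have a relevant hit
print(confidence(ratings))              # (mean, median) over all 15 ratings
print([precision(r) for r in ratings])  # per-query precision
```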