A Large Scale Study of License Usage on GitHub

Author(s): Christopher Vendome
Venue: International Conference on Software Engineering
Date: 2015

Type of Experiment: Case Study
Sample Size: 16221
Class/Experience Level: Undergraduate Student, Graduate Student, Professional
Participant Selection: Open source Java projects mined from GitHub
Data Collection Method: Observation


This paper addresses how licenses in open source projects evolve as development communities grow and new legal issues arise. It reports the results of a large empirical study conducted over the development history of 16,221 open source Java projects found on GitHub. The study aims to describe license usage and adoption over the course of ten years and to explain patterns of interaction between licenses.

Identifying and classifying licenses took two parts. First, the study used the MARKOS code analyzer to extract the licenses of individual files at each commit. Then, the data from each project was compiled and analyzed to classify licenses by version number and commit date. With the licenses classified, the study charted the relative usage of licenses per year and compared how new versions of licenses came to dominate previous versions. The results of the study demonstrate how the open source community benefits strongly from licenses that offer more legal protection, yet are still permissive. For example, the Apache v2 license was widely used because it granted the original author patent protection.
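
The per-year relative usage step can be illustrated with a minimal Python sketch. It is not the paper's tooling: the function name `relative_usage_by_year`, the record layout (commit date, license name), and the sample data are all assumptions made for illustration, and the license extraction performed by the MARKOS code analyzer is taken as already done.

```python
from collections import Counter, defaultdict
from datetime import date


def relative_usage_by_year(records):
    """Return {year: {license: share}} from (commit_date, license_name) pairs.

    Hypothetical helper: each record stands for one license observation
    already extracted from a file at a commit.
    """
    counts = defaultdict(Counter)
    for commit_date, license_name in records:
        counts[commit_date.year][license_name] += 1

    shares = {}
    for year, counter in counts.items():
        total = sum(counter.values())
        # Share of each license among all observations in that year.
        shares[year] = {lic: n / total for lic, n in counter.items()}
    return shares


# Made-up observations, not data from the study.
sample = [
    (date(2004, 3, 1), "Apache-1.1"),
    (date(2004, 6, 9), "Apache-2.0"),
    (date(2005, 1, 15), "Apache-2.0"),
    (date(2005, 2, 20), "Apache-2.0"),
    (date(2005, 7, 4), "GPL-2.0"),
    (date(2005, 9, 30), "Apache-1.1"),
]

usage = relative_usage_by_year(sample)
for year in sorted(usage):
    print(year, usage[year])
```

Comparing the share of one license version against its predecessor across consecutive years is one simple way to see a newer version overtaking an older one.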