G-Repo: a Tool to Support MSR Studies on GitHub

Simone Romano, Maria Caulo, Matteo Buompastore, Leonardo Guerra, Anas Mounsif, Michele Telesca, Maria Teresa Baldassarre, and Giuseppe Scanniello
University of Bari, Bari, Italy
University of Basilicata, Potenza, Italy
simone.romano@uniba.it, maria.caulo@unibas.it, {matteo.buompastore, leonardo.guerra, anas.mounsif, michele.telesca001}@studenti.unibas.it, mariateresa.baldassarre@uniba.it, giuseppe.scanniello@unibas.it

Abstract: GitHub currently hosts more than 100 million public repositories. This has made it a very popular platform for conducting Mining Software Repositories (MSR) studies. Researchers have been exploiting the information stored in GitHub (e.g., commits, pull requests, or issues) to investigate both developer- and project-related aspects. GitHub provides the REST API to make queries without cloning repositories. In this tool-demo paper, we highlight some issues we noticed when conducting an MSR study on GitHub by using the REST API, and we present G-Repo: a tool developed to support researchers in tackling these issues and to ease the creation of datasets for MSR studies. We also provide a manually annotated dataset with information about the kind and the (spoken) languages of 1,500 repositories hosted on GitHub. A video showing the functioning of G-Repo is available at: https://youtu.be/mb9CIALBFZk.

Index Terms: MSR, GitHub, G-Repo

I. INTRODUCTION

Mining Software Repositories (MSR) is a research field in the Software Engineering area that has grown significantly in the last two decades. MSR studies aim to gather empirical evidence about both developers and software projects by exploiting the valuable information available in software repositories [1], [2]. To conduct an MSR study, researchers usually perform the following steps:

1) Define the context, i.e., define which repositories are of interest for the study (e.g., the 1,500 most popular Java repositories hosted on GitHub¹).
2) Create the raw dataset by identifying potential repositories of interest and collecting the related data. It is often necessary to clone the repositories onto one's own work computer.
3) Clean the raw dataset by removing, for example, those repositories that do not concern software development (e.g., repositories hosted on GitHub can actually be websites or tutorials [3]).
4) Analyze the clean dataset by processing the collected data, thus gathering empirical evidence.

Depending on the context of MSR studies, researchers can either exploit publicly available datasets (e.g., [4]) or interact with forges² (e.g., GitHub), which provide easy access to repository data [5], to build their own datasets (e.g., [6], [7]). GitHub is one of the most suitable forges for MSR purposes since it currently hosts more than 100 million repositories. GitHub makes public repositories available to any user for cloning and forking. It also shares information about both repositories and the actions performed by users/developers within repositories (e.g., who made a change or when) through the GitHub REST API (version v3).³ Among other uses, this API allows gathering information about repositories without cloning them from GitHub onto one's own work computer. GitHub has also recently integrated some social features: for example, users can watch and star projects, and follow other users.

¹ https://github.com
² A forge is a web-based collaborative platform for both developing and sharing software applications.

When conducting MSR studies on GitHub, researchers need to tackle some issues related to the repositories hosted on GitHub and to the REST API. Recently, we took a snapshot of the 1,500 most popular (i.e., top-starred) Java repositories, excluding forks⁴ (selecting a certain number of top-starred repositories is one of the common ways to define the context of an MSR study, e.g., [8], [9]), and we ran into the following issues:

1) API limitations.
The Search (REST) API⁵, which allows searching for specific items (e.g., repositories or users) on GitHub, returns at most 1,000 results per query. Thus, if more than 1,000 results satisfy a query, the results are truncated at 1,000. Also, results are paginated with at most 100 results per page. This means that, for a query returning 1,000 results, ten HTTPS requests to the Search API (one for each page of results) are needed to get all 1,000 results. Finally, each authenticated GitHub user can send at most 30 HTTPS requests per minute (while non-authenticated GitHub users are limited to ten HTTPS requests per minute).
2) Wrong programming language of repositories. When searching for repositories (through the Search API) by a certain programming language, the results can (wrongly) include repositories that do not contain files written in that programming language.
3) Repositories that do not concern software development. GitHub can be used for purposes other than software development [3]. Therefore, the search of repositories (through the Search API) can return repos-

³ docs.github.com/en/free-pro-team@latest/rest
⁴ A fork is a copy of a repository.
⁵ docs.github.com/en/free-pro-team@latest/rest/reference/search
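
The Search API limits listed under issue 1 (API limitations) can be illustrated with a small planning sketch. It is not taken from G-Repo: the function names, the example query `language:java`, and the choice of sorting by stars are assumptions made for illustration only. The code sends no requests; it only computes how many result pages a single query can reach under the limits stated in the text and builds the corresponding page URLs for the repository search endpoint.

```python
import math
from urllib.parse import urlencode

# Limits as reported in the text for the GitHub Search API (REST v3).
MAX_RESULTS_PER_QUERY = 1000    # results beyond this are truncated
MAX_RESULTS_PER_PAGE = 100      # pagination limit
REQUESTS_PER_MINUTE_AUTH = 30   # authenticated users
REQUESTS_PER_MINUTE_ANON = 10   # non-authenticated users

SEARCH_URL = "https://api.github.com/search/repositories"


def reachable_results(total_matches):
    """Number of results a single query can actually return."""
    return min(total_matches, MAX_RESULTS_PER_QUERY)


def pages_needed(total_matches, per_page=MAX_RESULTS_PER_PAGE):
    """Number of HTTPS requests (one per page) needed for one query."""
    return math.ceil(reachable_results(total_matches) / per_page)


def page_urls(query, total_matches):
    """Build one URL per result page (hypothetical helper, no I/O)."""
    urls = []
    for page in range(1, pages_needed(total_matches) + 1):
        params = {
            "q": query,
            "sort": "stars",
            "order": "desc",
            "per_page": MAX_RESULTS_PER_PAGE,
            "page": page,
        }
        urls.append(SEARCH_URL + "?" + urlencode(params))
    return urls


def rate_limit_windows(num_requests, authenticated=True):
    """Number of one-minute rate-limit windows the requests span."""
    limit = REQUESTS_PER_MINUTE_AUTH if authenticated else REQUESTS_PER_MINUTE_ANON
    return math.ceil(num_requests / limit)


if __name__ == "__main__":
    wanted = 1500  # size of the snapshot mentioned in the text
    print("reachable with one query:", reachable_results(wanted))
    print("requests needed:", pages_needed(wanted))
    for url in page_urls("language:java", wanted)[:2]:
        print(url)
```

Under these limits, a single query cannot cover a 1,500-repository snapshot, since only the first 1,000 results are reachable. Any strategy for collecting the rest (for example, issuing several narrower queries) has to be layered on top of this planning step while staying within the per-minute request limit.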