OpenHub: A scalable architecture for the analysis of software quality attributes

Gabriel Farah
Universidad de Los Andes
Cra 1E #19-40, Bogotá - Colombia
57 1 339-4949
g.farah38@uniandes.edu.co

Juan Sebastian Tejada
Universidad de Los Andes
Cra 1E #19-40, Bogotá - Colombia
57 1 339-4949
js.tejada157@uniandes.edu.co

Dario Correal
Universidad de Los Andes
Cra 1E #19-40, Bogotá - Colombia
57 1 339-4949
dcorreal@uniandes.edu.co

ABSTRACT
There is currently a vast array of open source projects available on the web, and although they are searchable by name or description in search engines, there is no way to search for projects by how well they perform on a given set of quality attributes (e.g., usability or maintainability). With OpenHub, we present a scalable and extensible architecture for the static and runtime analysis of open source repositories written in Python; we describe the architecture and pinpoint future research possibilities it enables.

Categories and Subject Descriptors
D.2.7 [Software Engineering]: Distribution, Maintenance and Enhancement.

General Terms
Algorithms, Measurement, Experimentation.

Keywords
GitHub, Python, Quality Attributes, Architecture

1. INTRODUCTION
Currently, the web hosts millions of freely accessible open source repositories across services such as GitHub, BitBucket, and SourceForge. Finding the right repository to use in a project accounts for a large part of the development pipeline: developers search for repositories by name or description, then continuously integrate and test candidates until one meets a desired set of quality attributes (e.g., performance, testability, or usability) [5]. This is a time-consuming task, since there is no proper tool to search for repositories by how well they perform on a given set of quality attributes. OpenHub aims to be a system that provides easy access to information about millions of open source repositories and their performance on different quality attributes.
It also aims to be easily extensible, allowing the analysis of quality attributes and technologies/languages beyond the ones covered in this paper. Prior to the release of this paper, we had crawled more than 1.7 million projects and analyzed more than 140,000 Python repositories, processing a total of 9.8 TB of data with the OpenHub architecture. The resulting 3 GB dataset is available for download. In this paper, we present the general design, go through the challenges and limitations of working with the dataset, and outline research opportunities that emerge from it.

2. DATA COLLECTION
The crawler component was implemented to constantly crawl all public repositories available through GitHub, store them in a MongoDB data store (or update them if they already exist), and eventually push them to a queue for quality attribute analysis. Figure 1 presents the BSON document schema used to store the repository data; instructions for loading it into MongoDB can be found at goo.gl/fXqVWq. The JSON documents stored in the MongoDB instance are exactly as provided by the GitHub API (http://developer.github.com/v3/repos/), with additional fields added for each quality attribute analyzed.

Figure 1. BSON Schema used by OpenHub.

Figure 2 presents an overview of the process and the main components present in OpenHub.
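The store-or-update step described above can be sketched as follows. This is a minimal illustration, not the OpenHub implementation: with pymongo the upsert would be a single call such as collection.update_one({"id": repo["id"]}, {"$set": repo}, upsert=True), but here a plain dict stands in for the MongoDB collection so the sketch runs without a database, and all function and variable names are assumptions.

```python
from typing import Any, Dict, List


def store_or_update(store: Dict[int, Dict[str, Any]],
                    queue: List[int],
                    repo: Dict[str, Any]) -> None:
    """Upsert a raw GitHub repo document and enqueue its id for analysis."""
    # The document is kept exactly as the GitHub API returns it; the extra
    # fields for each quality attribute are added later by the analyzers.
    doc = store.setdefault(repo["id"], {})  # create the entry if it is new
    doc.update(repo)                        # refresh stale fields in place
    queue.append(repo["id"])                # hand off for quality analysis


# Crawling the same repository twice updates it instead of duplicating it.
store: Dict[int, Dict[str, Any]] = {}
queue: List[int] = []
store_or_update(store, queue, {"id": 1, "full_name": "user/repo", "stars": 3})
store_or_update(store, queue, {"id": 1, "full_name": "user/repo", "stars": 5})
```

Keying the upsert on GitHub's numeric repository id (rather than the name) is what lets the crawler revisit repositories continuously without creating duplicates, even when a repository is renamed.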