On Using Machine Learning to Automatically Classify Software Applications into Domain Categories

Mario Linares-Vásquez, Collin McMillan, Denys Poshyvanyk, Mark Grechanik

Abstract

Software repositories hold applications that are often categorized to improve the effectiveness of various maintenance tasks. Properly categorized applications allow stakeholders to identify requirements related to their applications and predict maintenance problems in software projects. Manual categorization is expensive, tedious, and laborious, which is why automatic categorization approaches are gaining widespread importance. Unfortunately, for various legal and organizational reasons, the applications' source code is often not available, making it difficult to categorize these applications automatically. In this paper, we propose a novel approach in which we use Application Programming Interface (API) calls from third-party libraries to automatically categorize the software applications that use these API calls. Our approach is general: because API calls can be extracted from both source code and bytecode, different categorization algorithms can be applied to repositories that contain applications in either form. We compare our approach to a state-of-the-art approach that uses machine learning algorithms for software categorization, and conduct experiments on two large Java repositories: an open-source repository containing 3,286 projects and a closed-source repository with 745 applications, where the source code was not available. Our contribution is twofold: we propose a new approach that makes it possible to categorize software projects without any source code using a small number of API calls as attributes, and we carry out a comprehensive empirical evaluation of automatic categorization approaches.

Keywords Closed-source · Open-source · Software categorization · Machine learning
1 Introduction

Software repositories have mushroomed in the past decade, with many of them containing massive amounts of source code and other software artifacts. To facilitate browsing and searching of these repositories, software systems are placed into categories (e.g., text editors, financial, or databases). Since many stakeholders are engaged in maintaining software, they benefit from properly categorized software repositories for two reasons. First, grouping applications with similar features allows stakeholders to decide what features they should implement in their own applications that belong to the same groups or categories (Kawaguchi et al. 2006; Dumitru et al. 2011). Second, stakeholders can determine what problems or bugs are common to many applications in the same category, and in turn predict what problems or bugs other applications from that category are likely to encounter (Weiss et al. 2007; Zimmermann et al. 2009); this type of prediction could be used as a quality assurance technique to recognize typical bad smells or mistakes in the code that should be avoided during programming. Automatic categorization of software applications in repositories is increasingly gaining acceptance since it significantly reduces manual effort (Di Lucca et al. 2002; Ugurel et al. 2002;