Is a Single Model Enough? MuCoS: A Multi-Model Ensemble Learning Approach for Semantic Code Search

Lun Du ∗2, lun.du@microsoft.com, Microsoft Research Asia, Beijing, China
Xiaozhou Shi ∗3, xzh0u.sxz@gmail.com, Beijing University of Technology, Beijing, China
Yanlin Wang, yanlwang@microsoft.com, Microsoft Research Asia, Beijing, China
Ensheng Shi 3, s1530129650@stu.xjtu.edu.cn, Xi’an Jiaotong University, Xi’an, China
Shi Han, shihan@microsoft.com, Microsoft Research Asia, Beijing, China
Dongmei Zhang, dongmeiz@microsoft.com, Microsoft Research Asia, Beijing, China

ABSTRACT
Recently, deep learning methods have become mainstream in code search since they are better at capturing semantic correlations between code snippets and search queries and have shown promising performance. However, code snippets carry diverse information from different dimensions, such as business logic, specific algorithms, and hardware communication, so it is hard for a single code representation module to cover all the perspectives. On the other hand, as a specific query may focus on one or several perspectives, it is difficult for a single query representation module to represent different user intents. In this paper, we propose MuCoS, a multi-model ensemble learning architecture for semantic code search. It combines several individual learners, each of which emphasizes a specific perspective of code snippets. We train the individual learners on different datasets that contain different perspectives of code information, and we use a data augmentation strategy to obtain these datasets. We then ensemble the learners to capture comprehensive features of code snippets. Experiments show that MuCoS outperforms existing state-of-the-art methods. Our source code and data are anonymously available at https://github.com/Xzh0u/MuCoS.

CCS CONCEPTS
• Software and its engineering → Reusability; Search-based software engineering; • Information systems → Novelty in information retrieval.
∗ Equal Contribution
2 Corresponding Author
3 Work performed during the internship at MSRA

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
CIKM ’21, November 1–5, 2021, Virtual Event, QLD, Australia
© 2021 Association for Computing Machinery.
ACM ISBN 978-1-4503-8446-9/21/11...$15.00
https://doi.org/10.1145/3459637.3482127

KEYWORDS
code search, ensemble learning, data augmentation, deep learning

ACM Reference Format:
Lun Du, Xiaozhou Shi, Yanlin Wang, Ensheng Shi, Shi Han, and Dongmei Zhang. 2021. Is a Single Model Enough? MuCoS: A Multi-Model Ensemble Learning Approach for Semantic Code Search. In Proceedings of the 30th ACM International Conference on Information and Knowledge Management (CIKM ’21), November 1–5, 2021, Virtual Event, QLD, Australia. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3459637.3482127

1 INTRODUCTION
Code search is the most frequent developer activity in the software development process [16]. Reusable code examples help improve the efficiency of developers in their development process [1, 18]. Given a natural language query that describes the developer’s intent, the goal of code search is to find the most relevant code snippet from a large source code corpus.

Many code search engines have been developed. They mainly rely on traditional information retrieval (IR) techniques such as keyword matching [13] or a combination of text similarity and Application Program Interface (API) matching [14].
Recently, many works have applied deep learning methods [3, 8, 20, 22, 24] to code search [2, 4, 5, 7, 10–12, 18, 23, 25, 26], using neural networks to capture deep semantic correlations between natural language queries and code snippets, and have achieved promising performance improvements. These methods employ various types of model structures, including sequential models [2, 4, 5, 7, 10, 18, 23, 25, 26], graph models [6, 12], and transformers [4].

Existing deep learning code search methods mainly use a single model to represent queries and code snippets. However, code may carry diverse information from different dimensions, such as business logic, specific algorithms, and hardware communication, making it hard for a single code representation module to cover all the perspectives. On the other hand, as a specific query may focus on several perspectives, it is difficult for a single query representation module to represent different user intents.

To address the problems above, we propose MuCoS: Multi-Model for Code Search. First, we use a data augmentation strategy to train multiple models that focus on different perspectives of code.