Is a Single Model Enough? MuCoS: A Multi-Model Ensemble
Learning Approach for Semantic Code Search
Lun Du ∗,2
lun.du@microsoft.com
Microsoft Research Asia
Beijing, China
Xiaozhou Shi ∗,3
xzh0u.sxz@gmail.com
Beijing University of Technology
Beijing, China
Yanlin Wang
yanlwang@microsoft.com
Microsoft Research Asia
Beijing, China
Ensheng Shi 3
s1530129650@stu.xjtu.edu.cn
Xi’an Jiaotong University
Xi’an, China
Shi Han
shihan@microsoft.com
Microsoft Research Asia
Beijing, China
Dongmei Zhang
dongmeiz@microsoft.com
Microsoft Research Asia
Beijing, China
ABSTRACT
Recently, deep learning methods have become mainstream in code
search because they capture semantic correlations between code
snippets and search queries better than earlier techniques and show
promising performance. However, code snippets carry diverse
information from different dimensions, such as business logic,
specific algorithms, and hardware communication, so it is hard for a
single code representation module to cover all of these perspectives.
Likewise, because a specific query may focus on one or several
perspectives, it is difficult for a single query representation
module to represent different user intents. In this paper, we propose
MuCoS, a multi-model ensemble learning architecture for semantic code
search. It combines several individual learners, each of which
emphasizes a specific perspective of code snippets. We train the
individual learners on different datasets that contain different
perspectives of code information, and we obtain these datasets with a
data augmentation strategy. We then ensemble the learners to capture
comprehensive features of code snippets. The experiments show that
MuCoS achieves better results than the existing state-of-the-art
methods. Our source code and data are anonymously available at
https://github.com/Xzh0u/MuCoS.
CCS CONCEPTS
· Software and its engineering → Reusability; Search-based
software engineering; · Information systems → Novelty in
information retrieval.
∗ Equal Contribution
2 Corresponding Author
3 Work performed during the internship at MSRA
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for components of this work owned by others than ACM
must be honored. Abstracting with credit is permitted. To copy otherwise, or republish,
to post on servers or to redistribute to lists, requires prior specific permission and/or a
fee. Request permissions from permissions@acm.org.
CIKM ’21, November 1–5, 2021, Virtual Event, QLD, Australia
© 2021 Association for Computing Machinery.
ACM ISBN 978-1-4503-8446-9/21/11. . . $15.00
https://doi.org/10.1145/3459637.3482127
KEYWORDS
code search, ensemble learning, data augmentation, deep learning
ACM Reference Format:
Lun Du, Xiaozhou Shi, Yanlin Wang, Ensheng Shi, Shi Han, and Dongmei
Zhang. 2021. Is a Single Model Enough? MuCoS: A Multi-Model Ensemble
Learning Approach for Semantic Code Search. In Proceedings of the 30th
ACM International Conference on Information and Knowledge Management
(CIKM ’21), November 1–5, 2021, Virtual Event, QLD, Australia. ACM, New
York, NY, USA, 5 pages. https://doi.org/10.1145/3459637.3482127
1 INTRODUCTION
Code search is the most frequent developer activity in the software
development process [16]. Reusable code examples help improve the
efficiency of developers during development [1, 18]. Given a
natural language query that describes the developer’s intent, the
goal of code search is to find the most relevant code snippet from a
large source code corpus.
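The task definition above can be illustrated with a minimal sketch, assuming that queries and code snippets are mapped into a shared vector space and ranked by cosine similarity. The functions `embed`, `cosine`, and `search` and the toy corpus are hypothetical names introduced only for illustration; the bag-of-tokens `embed` is a crude stand-in for a learned neural encoder and is not the method proposed in this paper.

```python
import math
import re
from collections import Counter


def embed(text):
    """Hypothetical stand-in for a learned encoder: lowercase token counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query, corpus, top_k=1):
    """Return the top_k snippets ranked by similarity to the query."""
    q = embed(query)
    ranked = sorted(corpus, key=lambda s: cosine(q, embed(s)), reverse=True)
    return ranked[:top_k]


# Toy corpus (assumed for illustration only).
corpus = [
    "def read_file(path): return open(path).read()",
    "def sort_list(items): return sorted(items)",
    "def send_bytes(port, data): port.write(data)",
]
best = search("sort a list of items", corpus)[0]
```

In this sketch only the scoring function would change if the token counts were replaced by neural representations; the ranking step stays the same.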
Many code search engines have been developed. They mainly rely on
traditional information retrieval (IR) techniques such as keyword
matching [13] or a combination of text similarity and Application
Program Interface (API) matching [14]. Recently, many works have
applied deep learning methods [3, 8, 20, 22, 24] to code search
[2, 4, 5, 7, 10–12, 18, 23, 25, 26], using neural networks to capture
deep semantic correlations between natural language queries and code
snippets, and have achieved promising performance improvements. These
methods employ various types of model structures, including
sequential models [2, 4, 5, 7, 10, 18, 23, 25, 26], graph models
[6, 12], and transformers [4].
Existing deep learning code search methods mainly use a single
model to represent queries and code snippets. However, code may
carry diverse information from different dimensions, such as
business logic, specific algorithms, and hardware communication,
making it hard for a single code representation module to cover all
of these perspectives. Likewise, because a specific query may focus
on several perspectives, it is difficult for a single query
representation module to represent different user intents.
To address the problems above, we propose MuCoS: Multi-Model
for Code Search. First, we use a data augmentation strategy to
train multiple models that focus on different perspectives of code.
Short Paper Track CIKM ’21, November 1–5, 2021, Virtual Event, Australia