CodeMatch: Obfuscation Won’t Conceal Your Repackaged App
Leonid Glanz, Sven Amann, Michael Eichberg, Michael Reif, Ben Hermann, Johannes Lerch, and
Mira Mezini
Technische Universität Darmstadt
Germany
{glanz,amann,eichberg,reif,hermann,mezini}@cs.tu-darmstadt.de,lerch@st.informatik.tu-darmstadt.de
ABSTRACT
An established way to steal the income of app developers, or to
trick users into installing malware, is the creation of repackaged
apps. These are clones of, typically, successful apps. To conceal
their nature, they are often obfuscated by their creators. But, given
that it is a common best practice to obfuscate apps, a trivial
identification of repackaged apps is not possible. The problem is further
intensified by the prevalent usage of libraries. In many apps, the
size of the overall code base is basically determined by the used
libraries. Therefore, two apps whose obfuscated code bases are
very similar are not necessarily repackages of each other.
To reliably detect repackaged apps, we propose a two-step
approach which first focuses on the identification and removal of
the library code in obfuscated apps. This approach, LibDetect,
relies on code representations which abstract over several parts
of the underlying bytecode to be resilient against certain obfusca-
tion techniques. Using this approach, we are able to identify on
average 70% more used libraries per app than previous approaches.
After the removal of an app’s library code, we then fuzzy hash the
most abstract representation of the remaining app code to ensure
that we can identify repackaged apps even if very advanced ob-
fuscation techniques are used. This makes it possible to identify
repackaged apps. Using our approach, we found that ≈ 15% of all
apps in Android app stores are repackages.
CCS CONCEPTS
· Security and privacy → Software reverse engineering;
· Software and its engineering → Software libraries and repositories;
· Applied computing → System forensics;
KEYWORDS
library detection, repackage detection, obfuscation, code analysis
ACM Reference format:
Leonid Glanz, Sven Amann, Michael Eichberg, Michael Reif, Ben Hermann,
Johannes Lerch, and Mira Mezini. 2017. CodeMatch: Obfuscation Won’t
Conceal Your Repackaged App. In Proceedings of ESEC/FSE’17, Paderborn,
Germany, September 04-08, 2017, 11 pages.
https://doi.org/10.1145/3106237.3106305
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for components of this work owned by others than ACM
must be honored. Abstracting with credit is permitted. To copy otherwise, or republish,
to post on servers or to redistribute to lists, requires prior specific permission and/or a
fee. Request permissions from permissions@acm.org.
ESEC/FSE’17, September 04-08, 2017, Paderborn, Germany
© 2017 Association for Computing Machinery.
ACM ISBN 978-1-4503-5105-8/17/09...$15.00
1 INTRODUCTION
Popular apps in the Google Play Store are installed on millions of
devices. This attracts malicious actors to create altered, repackaged
versions of those apps to steal the original owner’s revenue, or to
trick users and infect their mobile devices with malware. Detecting
such repackaged apps is therefore necessary for a secure and viable
app market.
Several techniques for repackage detection have already been
proposed and can be broadly classified as being code-agnostic [20,
42, 43], graph-based [10, 15, 16, 25, 47], user-interface-based [17,
41], and code-signature-based [9, 22, 39, 45, 46]. Code-agnostic
approaches hash internal files of an app without considering the
file content or type; as a result, the hashes can be evaded by
single-bit changes. Graph-based techniques derive the control-flow,
data-flow, or call graph of the analyzed app and measure the similarity
by comparing isomorphic sub-graphs of the derived properties.
Given that graph matching is a hard problem, these approaches
potentially suffer from scalability issues [15]. Those approaches
which try to abstract from the concrete graphs to achieve scalability,
e.g., by using metrics, suffer from high false-positive rates [10].
User-interface-based techniques also construct a graph, but use views as
nodes and the transitions from one view to another as edges. These
graphs can easily be fooled by changing or introducing pseudo-
views. Code-signature-based approaches create signatures based
on an app’s code to address the weaknesses of the graph-based
approaches; the proposed approach also belongs to this category.
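
The fragility of code-agnostic hashing described above can be illustrated with a minimal sketch. It is not taken from any of the cited tools: the function names and the stand-in file content are hypothetical, and SHA-256 is assumed only as a representative cryptographic hash. The sketch flips a single bit in a byte sequence and compares the digests before and after.

```python
import hashlib


def file_digest(content: bytes) -> str:
    """Content-agnostic fingerprint: SHA-256 over the raw bytes."""
    return hashlib.sha256(content).hexdigest()


def flip_lowest_bit(content: bytes, index: int) -> bytes:
    """Return a copy of content with the lowest bit of one byte flipped."""
    altered = bytearray(content)
    altered[index] ^= 0x01
    return bytes(altered)


def differing_hex_digits(a: str, b: str) -> int:
    """Count the positions at which two equal-length hex digests differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


# Stand-in bytes for a file inside an app package (hypothetical content).
original = b"stand-in for the bytes of a resource file in an app" * 8
repacked = flip_lowest_bit(original, 0)

digest_original = file_digest(original)
digest_repacked = file_digest(repacked)

print(digest_original == digest_repacked)  # False
print(differing_hex_digits(digest_original, digest_repacked), "of 64 hex digits differ")
```

Because the digest depends on every bit of the input, an exact-match comparison of such hashes no longer links the altered file to the original, although the two files differ in only one bit.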
Challenges. Code transformations are a challenge for all existing
repackage detection techniques. Developers regularly minify and
optimize their apps to increase performance. Additionally, they
obfuscate their apps to protect their intellectual property. However,
attackers also apply obfuscation to hide malicious code and to evade
signature-based detectors, such as anti-virus software.
Current repackage detection techniques can only handle basic
forms of obfuscation such as one-by-one identifier renaming,
replacing types, and reordering of fields and methods [7, 31]. More
sophisticated obfuscation techniques, such as moving classes
between packages or changing Android API calls, are not supported.
Our evaluation of Google Play Store apps revealed that 60% [21]
are at least partially obfuscated and that at least 20% use more
advanced techniques. The effectiveness of repackage detection is
further inhibited by the prevalent reuse of libraries in apps.
Wang et al. [39] reported that more than 60% of the sub-packages in
Android apps belong to library code. Hence, separating the library
code from the app code is necessary. Otherwise, apps which use
(nearly) the same libraries automatically share a large portion of the
overall code base and are always identified as repackages, even if