CodeMatch: Obfuscation Won’t Conceal Your Repackaged App

Leonid Glanz, Sven Amann, Michael Eichberg, Michael Reif, Ben Hermann, Johannes Lerch, and Mira Mezini
Technische Universität Darmstadt, Germany
{glanz,amann,eichberg,reif,hermann,mezini}@cs.tu-darmstadt.de, lerch@st.informatik.tu-darmstadt.de

ABSTRACT
An established way to steal the income of app developers, or to trick users into installing malware, is the creation of repackaged apps. These are clones of (typically) successful apps. To conceal their nature, they are often obfuscated by their creators. However, given that obfuscating apps is a common best practice, a trivial identification of repackaged apps is not possible. The problem is further intensified by the prevalent usage of libraries. In many apps, the size of the overall code base is largely determined by the libraries used. Therefore, two apps whose obfuscated code bases are very similar do not have to be repackages of each other.

To reliably detect repackaged apps, we propose a two-step approach which first focuses on the identification and removal of the library code in obfuscated apps. This approach, LibDetect, relies on code representations which abstract over several parts of the underlying bytecode to be resilient against certain obfuscation techniques. Using this approach, we are able to identify on average 70% more used libraries per app than previous approaches. After the removal of an app’s library code, we then fuzzy hash the most abstract representation of the remaining app code, so that repackaged apps can be identified even if very advanced obfuscation techniques are used. Using our approach, we found that 15% of all apps in Android app stores are repackages.
CCS CONCEPTS
• Security and privacy → Software reverse engineering; • Software and its engineering → Software libraries and repositories; • Applied computing → System forensics;

KEYWORDS
library detection, repackage detection, obfuscation, code analysis

ACM Reference format:
Leonid Glanz, Sven Amann, Michael Eichberg, Michael Reif, Ben Hermann, Johannes Lerch, and Mira Mezini. 2017. CodeMatch: Obfuscation Won’t Conceal Your Repackaged App. In Proceedings of ESEC/FSE’17, Paderborn, Germany, September 04-08, 2017, 11 pages. https://doi.org/10.1145/3106237.3106305

© 2017 Association for Computing Machinery. ACM ISBN 978-1-4503-5105-8/17/09.

1 INTRODUCTION
Popular apps in the Google Play Store are installed on millions of devices. This attracts malicious actors to create altered, repackaged versions of those apps to steal the original owner’s revenue, or to trick users and infect their mobile devices with malware. Detecting such repackaged apps is therefore necessary for a secure and viable app market. Several techniques for repackage detection have already been proposed and can be broadly classified as being code-agnostic [20, 42, 43], graph-based [10, 15, 16, 25, 47], user-interface-based [17, 41], and code-signature-based [9, 22, 39, 45, 46].
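As a small illustration of the code-agnostic category, the following Python sketch hashes each file of an app without interpreting its content and compares two apps by the fraction of file digests they share. The file names, the file contents, and the helper names `file_digest`, `flip_bit`, and `shared_files` are hypothetical, and the scheme is a simplification for illustration, not the method of any cited approach. It shows that flipping a single bit in every file is enough to remove all matching digests.

```python
import hashlib
from typing import Mapping


def file_digest(data: bytes) -> str:
    """Hash raw file bytes without looking at the file type or structure."""
    return hashlib.sha256(data).hexdigest()


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with exactly one bit inverted."""
    mutated = bytearray(data)
    mutated[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(mutated)


def shared_files(app_a: Mapping[str, bytes], app_b: Mapping[str, bytes]) -> float:
    """Jaccard similarity over the sets of per-file digests of two apps."""
    digests_a = {file_digest(content) for content in app_a.values()}
    digests_b = {file_digest(content) for content in app_b.values()}
    union = digests_a | digests_b
    return len(digests_a & digests_b) / len(union) if union else 1.0


# Hypothetical app contents; real APK entries would be read from the archive.
original = {
    "classes.dex": b"dex bytecode placeholder",
    "res/icon.png": b"png placeholder",
    "assets/config.json": b'{"key": "value"}',
}

# A "repackaged" copy in which a single bit of every file was flipped.
repackaged = {name: flip_bit(content, 0) for name, content in original.items()}

identical_score = shared_files(original, dict(original))
evaded_score = shared_files(original, repackaged)
print(identical_score, evaded_score)
```

Here `identical_score` is 1.0 for an unmodified copy, while `evaded_score` drops to 0.0 although only one bit per file differs.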
Code-agnostic approaches hash internal files of an app without considering the file content or type; as a result, the hashes can be evaded by single-bit changes. Graph-based techniques derive the control-flow, data-flow, or call graph of the analyzed app and measure the similarity by comparing isomorphic sub-graphs of the derived properties. Given that graph matching is a hard problem, these approaches potentially suffer from scalability issues [15]. Those approaches which try to abstract from the concrete graphs to achieve scalability, e.g., by using metrics, suffer from high false positive rates [10]. User-interface-based techniques also construct a graph, but use views as nodes and the transitions from one view to another as edges. These graphs can easily be fooled by changing or introducing pseudo-views. Code-signature-based approaches create signatures based on an app’s code to address the weaknesses of the graph-based approaches; our approach also belongs to this category.

Challenges. Code transformations are a challenge for all existing repackage detection techniques. Developers regularly minify and optimize their apps to increase performance. Additionally, they obfuscate their apps to protect their intellectual property. However, attackers also apply obfuscation to hide malicious code and to evade signature-based detectors, such as anti-virus software.

Current repackage detection techniques can only handle basic forms of obfuscation such as one-by-one identifier renaming, replacing types, and reordering of fields and methods [7, 31]. More sophisticated obfuscation techniques, such as moving classes between packages or changing Android API calls, are not supported. Our evaluation of Google Play Store apps revealed that 60% [21] are at least partially obfuscated and that at least 20% use more advanced techniques.

The effectiveness of repackage detection is further inhibited by the prevalent reuse of libraries in apps. Wang et al.
[39] reported that more than 60% of the sub-packages in Android apps belong to library code. Hence, separating the library code from the app code is necessary. Otherwise, apps which use (nearly) the same libraries automatically share a large portion of the overall code base and are always identified as repackages, even if