China Communications • July 2013 71
NETWORK TECHNOLOGY AND APPLICATION

Parallelized Jaccard-Based Learning Method and MapReduce Implementation for Mobile Devices Recognition from Massive Network Data

LIU Jun¹, LI Yinzhou¹, Felix Cuadrado², Steve Uhlig², LEI Zhenming¹

¹ Beijing Key Laboratory of Network System Architecture and Convergence, Beijing University of Posts and Telecommunications, Beijing 100876, China
² Department of Electronic Engineering and Computer Science, Queen Mary, University of London, London E1 4NS, UK

Abstract: Accurate and scalable mobile device recognition is critically important for mobile network operators and ISPs to understand their customers’ behaviours and enhance their user experience. In this paper, we propose a novel method for mobile device model recognition that uses statistical information derived from large amounts of mobile network traffic data. Specifically, we create a Jaccard-based coefficient measure method to identify a proper keyword representing each mobile device model from massive unstructured textual HTTP access logs. To handle the large amount of traffic data generated by large mobile networks, the method is designed as a set of parallel algorithms and is implemented on the MapReduce framework, a distributed parallel programming model with proven low-cost and high-efficiency features. Evaluations using real data sets show that our method can accurately recognise mobile client models while meeting the scalability and producer-independency requirements of large mobile network operators. Results show that a 91.5% accuracy rate is achieved for recognising mobile client models from 2 billion records, which is dramatically higher than existing solutions.

Key words: mobile device recognition; data mining; Jaccard coefficient measurement; distributed computing; MapReduce

Received: 2012-12-12 Revised: 2013-03-11 Editor: YUAN Baozong

I. INTRODUCTION

With the increasing popularity of user-friendly mobile clients (smartphones, pads, and tablets), coupled with capable mobile applications (location-aware, multimedia, and social applications) and supported by advanced cellular communication technology, mobile clients have become a part of people’s lives. To some extent, a mobile client and its usage data could be a natural candidate to support and eventually host a user’s digital representative [1]. More specifically, the attributes of a mobile client that operators care about, including model, price, and features, depict the characteristics of a group of users as convincingly as traditional personal information such as age, sex, and occupation. Moreover, the capabilities of a mobile client have a remarkable impact on the user experience and on the user’s desire for a given application. Therefore, it is a new challenge as well as an opportunity for mobile network operators and ISPs to understand the attributes and capabilities of their customers’ mobile clients and the associated behaviour patterns, in order to design more efficient market promotion and achieve a better user experience.

As a critical step to address this challenge, it is imperative to extract mobile client models from massive network data. Unfortunately, there is little well-done work that can help mo-