Distributed Phishing Detection by Applying Variable Selection using Bayesian Additive Regression Trees

Saeed Abu-Nimeh (1), Dario Nappa (2), Xinlei Wang (2), and Suku Nair (1)

SMU HACNet Lab
Computer Science and Engineering Dept.
Southern Methodist University
Dallas, TX 75275
(1) {sabunime, nair}@engr.smu.edu
(2) {dnappa, swang}@smu.edu

Abstract: Phishing continues to be one of the most damaging attacks, causing huge monetary losses for both financial institutions and their customers. Mobile devices are now widely used to access the Internet, and therefore to access financial and confidential data. However, unlike PCs and wired devices, such devices lack basic defensive applications to protect against various types of attacks. As a consequence, phishing has recently evolved to target mobile users through Vishing and SMishing attacks. This study presents a client-server distributed architecture that detects phishing e-mails by taking advantage of automatic variable selection in Bayesian Additive Regression Trees (BART). When combined with other classifiers, BART improves their predictive accuracy. Furthermore, the overall architecture proves well suited to resource-constrained environments.

I. INTRODUCTION

Phishing attacks now appear in various types and forms. Yet traditional attacks delivered by spoofed e-mails remain the dominant type of phishing. Here the bad actor forges e-mails that mimic legitimate ones and then sends them to victims using mailers. Victims are lured into divulging confidential information, such as credit card details, social security numbers, or online login credentials.

Vishing, or Voice over Internet Protocol (VoIP) phishing, has recently emerged as a new vector of phishing attacks, as it is easy for phishers to set up and take down. The attack can be carried out by setting up a free VoIP account and then using caller ID spoofing to mimic the phone numbers of legitimate financial institutions.
Furthermore, because of the ubiquity of mobile devices and the various applications available on them to access the Internet, many users rely on BlackBerry devices, PDAs, or even cell phones to access their bank accounts and store sensitive personal data. New forms of phishing attacks that target mobile devices are on the rise. SMS phishing, dubbed SMishing, is an emerging vector of phishing attacks in which the victim receives a short message service (SMS) message and is lured into clicking on a URL that downloads malware or redirects to fraudulent sites.

Surely, only a few solutions are available to mitigate phishing attacks on mobile devices. In addition, several solutions that are ubiquitous on desktop and wired computers are generally not as readily available on wireless and mobile devices. This is due to several known limitations of such devices. Because of power constraints, processing capabilities and storage capacities are limited, which in turn affects the security and privacy solutions built to protect users of such devices against various attacks. As a result, various attacks, including phishing, can easily take advantage of the limited or absent security and defense applications on these devices.

Although Bayesian Additive Regression Trees (BART) has proven to be competitive in classifying spam e-mails, previous research [1] showed that it is very demanding in terms of memory consumption and computational time for learning. As a consequence, it cannot be deployed on resource-constrained devices. In this study we propose a distributed architecture for the detection of phishing e-mails in a mobile environment. The motivation behind the distributed architecture is to harden attack detection at the client level and conceal the overhead associated with BART at the server level. A mutual feedback mechanism is deployed between the server and the clients.
At the server side, that is, the mail transfer agent (MTA), BART is applied to classify the majority of the e-mails received by the MTA. At the client side, lighter machine learning approaches are used to classify phishing e-mails on resource-constrained devices, taking advantage of automatic variable selection in BART.

The rest of the paper is organized as follows. In Section II we present related work and describe BART briefly. In Section III we explain our distributed architecture in detail. Section IV presents the experimental studies. The results are discussed in Section V. We draw conclusions and motivate future work in Section VI.

II. RELATED WORK

In [2], the authors investigated the application of Hill Climbing, Simulated Annealing, and Threshold Accepting techniques as feature selection algorithms for spam filtering and compared their performance against Linear Discriminate