Soft computing based imputation and hybrid data and text mining: The case of predicting the severity of phishing alerts

Kancherla Jonah Nishanth a, Vadlamani Ravi a, Narravula Ankaiah a, Indranil Bose b,⇑

a Institute for Development and Research in Banking Technology (IDRBT), Castle Hills Road #1, Masab Tank, Hyderabad-500 057, AP, India
b Indian Institute of Management Calcutta, Diamond Harbour Road Joka, Kolkata 700 104, West Bengal, India

article info

Keywords: Data imputation; K-means clustering; Multilayer perceptron; Phishing alerts; Probabilistic neural networks; Text mining

abstract

In this paper, we employ a novel two-stage soft computing approach for data imputation to assess the severity of phishing attacks. The imputation method involves K-means algorithm and multilayer perceptron (MLP) working in tandem. The hybrid is applied to replace the missing values of financial data which is used for predicting the severity of phishing attacks in financial firms. After imputing the missing values, we mine the financial data related to the firms along with the structured form of the textual data using multilayer perceptron (MLP), probabilistic neural network (PNN) and decision trees (DT) separately. Of particular significance is the overall classification accuracy of 81.80%, 82.58%, and 82.19% obtained using MLP, PNN, and DT respectively. It is observed that the present results outperform those of prior research. The overall classification accuracies for the three risk levels of phishing attacks using the classifiers MLP, PNN, and DT are also superior.

© 2012 Elsevier Ltd. All rights reserved.

1. Introduction

In statistics, imputation is the substitution of some value for a missing data point or a missing component of a data point. Once all missing values have been imputed, the dataset can then be analyzed using standard techniques for complete data. Missing data in real life data sets is an unavoidable problem in many disciplines.
For analyzing the available data, the completeness and quality of the data play a major role because the inferences made from complete data are more accurate than those made from incomplete data (Abdella & Marwala, 2005). For example, researchers rarely find a survey data set with complete entries (Hai & Shouhong, 2010). The respondents may not give complete information because of negligence, privacy reasons, or ambiguity of the survey questions. The missing values may be important for analyzing the data, so data imputation plays a major role in this situation. Data imputation is also very useful in control-based applications such as traffic monitoring, industrial processes, telecommunications and computer networks, automatic speech recognition, financial and business applications, and medical diagnosis, among others.

Data may be missing from databases because of data entry errors, system failures at the time of data retrieval, or several other reasons such as sensor failures, noisy channels, and cultural issues in updating the databases. According to Little and Rubin (2002), missing data fall into three categories: (i) missing completely at random (MCAR), (ii) missing at random (MAR), and (iii) not missing at random (NMAR). MCAR occurs if the probability of a missing value on some variable X is independent of both the variable itself and the values of any other variables in the dataset. For example, if the age of the husband is missing in a customer database, its absence does not depend on any other variable in the database, such as those recorded for the wife. MAR occurs if the probability of a missing value of some variable X is independent of the variable itself, but the pattern of missing data can be traced or predicted from other variables in the database. For example, if the income of a person is missing, then one can predict the missing value by using the values of profession and age.
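The difference between the MCAR and MAR mechanisms described above can be illustrated with a small simulation. The following is only a minimal sketch under stated assumptions: the synthetic age/income records, the function names (`make_records`, `mask_mcar`, `mask_mar`, `missing_rate`), the age cutoff, and the deletion probabilities are all hypothetical and are not taken from the paper.

```python
import random

def make_records(n, seed=0):
    """Build synthetic (hypothetical) records with an age and an income field."""
    rng = random.Random(seed)
    records = []
    for _ in range(n):
        age = rng.randint(20, 65)
        income = 1000 * age + rng.gauss(0, 5000)  # arbitrary illustrative relation
        records.append({"age": age, "income": income})
    return records

def mask_mcar(records, p, seed=1):
    """MCAR: every income value is deleted with the same probability p."""
    rng = random.Random(seed)
    return [dict(r, income=None) if rng.random() < p else dict(r)
            for r in records]

def mask_mar(records, p_young, p_old, cutoff=40, seed=2):
    """MAR: the deletion probability depends only on the observed age,
    never on the income value itself."""
    rng = random.Random(seed)
    masked = []
    for r in records:
        p = p_young if r["age"] < cutoff else p_old
        masked.append(dict(r, income=None) if rng.random() < p else dict(r))
    return masked

def missing_rate(records, keep):
    """Fraction of missing incomes among the records selected by `keep`."""
    group = [r for r in records if keep(r)]
    return sum(r["income"] is None for r in group) / len(group)

if __name__ == "__main__":
    data = make_records(20000)
    for name, masked in [("MCAR", mask_mcar(data, 0.2)),
                         ("MAR", mask_mar(data, 0.4, 0.05))]:
        young = missing_rate(masked, lambda r: r["age"] < 40)
        old = missing_rate(masked, lambda r: r["age"] >= 40)
        print(f"{name}: missing rate young={young:.3f}, old={old:.3f}")
```

Under MCAR the missing rate should be roughly the same in both age groups, whereas under MAR it differs between groups and can therefore be predicted from the observed age.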
NMAR occurs when the probability of a missing value of some variable X depends on the variable X itself. For instance, if citizens do not participate in a survey, then NMAR occurs. MCAR and MAR data are recoverable, whereas NMAR data are irrecoverable.

Missing data creates problems in many research areas such as data mining, mathematics, statistics, and various other fields (Abdella & Marwala, 2005). To handle incomplete or missing data, several techniques based on statistical analysis have been reported (García-Laencina, Sancho-Gómez, & Figueiras-Vidal, 2010). These methods include mean substitution methods, hot deck imputation, regression methods, expectation maximization, and multiple imputation methods. Other machine learning based methods include self-organizing maps (Merlin, Sorjamaa, Maillet, & Lendasse,

0957-4174/$ - see front matter © 2012 Elsevier Ltd. All rights reserved.
http://dx.doi.org/10.1016/j.eswa.2012.02.138
⇑ Corresponding author. Tel.: +91 33 2467 8300x157.
E-mail addresses: jonah.nishanth@gmail.com (K.J. Nishanth), rav_padma@yahoo.com (V. Ravi), ankireddy.cse@gmail.com (N. Ankaiah), indranil_bose@yahoo.com (I. Bose).

Expert Systems with Applications 39 (2012) 10583–10589
Contents lists available at SciVerse ScienceDirect
journal homepage: www.elsevier.com/locate/eswa