Copyright © 2018 Chaman Verma et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

International Journal of Engineering & Technology, 7 (4) (2018) 3392-3396
Website: www.sciencepubco.com/index.php/IJET
doi: 10.14419/ijet.v7i4.14045
Research paper

An Ensemble approach to identifying the student gender towards information and communication technology awareness in European schools using machine learning

Chaman Verma 1 *, Veronika Stoffová 2, Zoltán Illés 3
1,3 Department of Media and Educational Informatics, Faculty of Informatics, Eötvös Loránd University, Budapest, Hungary
2 Department of Mathematics and Computer Science, Faculty of Education, Trnava University in Trnava, Trnava, Slovakia
*Corresponding author E-mail: chaman@inf.elte.hu

Abstract

Data mining and machine learning play an important role in both research estimation and learning. The present study identifies the gender of students from the answers they gave in a survey on information and communication technology (ICT) in European schools. The student dataset, which consists of 156 attributes and 50478 instances, is tested to identify student gender. The ensemble predictive model is developed after comparing the prediction accuracy achieved by various supervised machine learning classifiers, namely Support Vector Machine (SVM), Random Forest (RF), Naïve Bayes (NB), Artificial Neural Network (ANN) and J48 tree, under various k-fold cross-validation settings. The K-nearest neighbor classifier (IBk or KNN) is also trained on the dataset with varying values of k at 8-fold cross-validation. The dichotomous target variable is gender, and 131 predictors belonging to ICT in education are taken into consideration after applying feature reduction methods.
Findings of the study reveal that the maximum prediction accuracy is gained by SVM (76%) at each fold as compared to the other classifiers. RF at 6-fold correctly identifies 23535 females, and SVM at 2-fold correctly predicts 14678 males. The authors also found that the lowest prediction accuracy is achieved by the NB classifier at each fold. Finally, the ensemble predictive model is presented by joining the best classifiers, namely SVM at 2-fold, ANN at 2-fold and RF at 6-fold, for accurate identification of student gender over the dataset. The ensemble confusion matrix also indicates that prediction is highest for female students as compared to male students, based on their responses to the survey.

Keywords: Binary Classification; Confusion Matrix; FPR; TPR.

1. Introduction

Educational data mining has emerged as a very important area of research for revealing presentable and applicable knowledge from large educational data repositories. Data mining algorithms are used to obtain hidden information and desired benefits from these large data repositories [6]. Recently, the analysis of educational data, for instance learning analytics, academic analytics, educational data mining, predictive analytics and learners' analytics, has emerged as an innovative area of research [7]. Machine learning (ML) is the process of estimating unknown dependencies or structures in a system using a limited number of observations; it is used in data mining applications to retrieve hidden information and to support decision-making [1]. ML methods include rote learning, learning by analogy, and inductive learning, which includes methods of learning by examples and learning by experimentation and discovery [12].
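The ensemble idea summarized in the abstract, where the class predictions of several classifiers are joined and the result is read off a confusion matrix in terms of TPR and FPR, can be sketched roughly as follows. This is only a minimal illustration that assumes simple majority (hard) voting; the text here does not state how the three models are actually combined. The function names and the six toy labels are hypothetical and are not taken from the study's dataset.

```python
from collections import Counter


def majority_vote(*model_predictions):
    """Combine per-model label lists by simple (hard) majority voting.

    Assumes an odd number of models and two classes, so ties cannot occur.
    """
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(*model_predictions)]


def confusion_counts(actual, predicted, positive="female"):
    """Return (TP, FN, FP, TN), treating `positive` as the positive class."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive:
            if p == positive:
                tp += 1
            else:
                fn += 1
        else:
            if p == positive:
                fp += 1
            else:
                tn += 1
    return tp, fn, fp, tn


def tpr_fpr(tp, fn, fp, tn):
    """True positive rate TP/(TP+FN) and false positive rate FP/(FP+TN)."""
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr


# Hypothetical toy labels standing in for the outputs of three trained models.
actual = ["female", "female", "female", "male", "male", "male"]
svm_pred = ["female", "female", "male", "male", "male", "female"]
ann_pred = ["female", "male", "female", "male", "female", "female"]
rf_pred = ["female", "female", "female", "female", "male", "male"]

ensemble_pred = majority_vote(svm_pred, ann_pred, rf_pred)
counts = confusion_counts(actual, ensemble_pred)
tpr, fpr = tpr_fpr(*counts)
print(ensemble_pred, counts, tpr, fpr)
```

Hard voting is chosen here only because it needs nothing beyond the predicted labels; a weighted or probability-based combination would be an equally plausible reading of "joining" the classifiers.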
According to [11], for classification and regression problems various classifiers can be used for learning decision trees, rules, Bayes networks, artificial neural networks and support vector machines, and different knowledge representation models can be used to support decision-making methods. Ensemble learning models that combine multiple learners have been theoretically and empirically shown to provide significantly better performance than single weak learners, especially when dealing with high-dimensional, complex regression and classification problems [12]. OneR, J48 and Naïve Bayes (42.9%) obtained classification accuracy below 50% when classifying student age against ICT attitude [9]. Artificial Neural Networks (ANN) have a large generalization capability and can approximate functions used for both regression and classification [3]. According to [13,5], SVM is a discriminative method that learns boundaries between classes and performs binary classification based on separating hyperplanes; the separator is chosen to maximize the distance between the hyperplane and the nearest training vectors, which are called support vectors. According to [14], in KNN each data sample is assigned to the majority class of its k closest neighbors, where k is a parameter. The training data samples are vectors in a multidimensional feature space, each with a given target class label. Logistic regression was applied to develop a model for the early and reliable prediction of students' pass or fail status at the undergraduate level [8]. Key demographic variables and assignment marks were used in supervised machine learning algorithms (decision trees, artificial neural networks, naïve Bayes classifier, instance-based learning, logistic regression and support vector machines) to predict students' performance at the Hellenic Open University [10]. Gender is one of the principal determinants of the probability of dropping out.
In the binomial probit model they used, males have a higher probability of dropping out relative to the reference group of females [2]. In addition, experi-