Copyright © 2018 Chaman Verma et al. This is an open access article distributed under the Creative Commons Attribution License, which
permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
International Journal of Engineering & Technology, 7 (4) (2018) 3392-3396
Website: www.sciencepubco.com/index.php/IJET
doi: 10.14419/ijet.v7i4.14045
Research paper
An Ensemble approach to identifying the student gender towards information and communication technology awareness in European schools using machine learning
Chaman Verma 1*, Veronika Stoffová 2, Zoltán Illés 3
1,3 Department of Media and Educational Informatics, Faculty of Informatics, Eötvös Loránd University, Budapest, Hungary
2 Department of Mathematics and Computer Science, Faculty of Education, Trnava University in Trnava, Trnava, Slovakia
*Corresponding author E-mail: chaman@inf.elte.hu
Abstract
Data mining and machine learning play an important role in both research estimation and learning. The present study identifies the
gender of students from their answers to a survey on information and communication technology (ICT) in European schools. The
student dataset, which consists of 156 attributes and 50478 instances, is used to identify student gender. An ensemble predictive
model is developed after comparing the prediction accuracy achieved by several supervised machine learning classifiers, namely
Support Vector Machine (SVM), Random Forest (RF), Naïve Bayes (NB), Artificial Neural Network (ANN) and the J48 tree, under
various k-fold cross-validation settings. The K-nearest neighbor classifier (IBk or KNN) is also trained on the dataset with varying
values of k at 8-fold cross-validation. The dichotomous target variable is gender, and 131 predictors relating to ICT in education are
taken into consideration after applying feature reduction methods. The findings reveal that the maximum prediction accuracy is
achieved by SVM (76%) at each fold as compared to the others. A total of 23535 females are correctly identified by RF at 6-fold, and
14678 males are correctly predicted by SVM at 2-fold. The authors also found that the lowest prediction accuracy is achieved by the
NB classifier at each fold. Finally, the ensemble predictive model is presented by joining the best classifiers, namely SVM at 2-fold,
ANN at 2-fold and RF at 6-fold, to accurately identify student gender over the dataset. The ensemble confusion matrix also shows
that more female students than male students are correctly predicted from their survey responses.
Keywords: Binary Classification; Confusion Matrix; FPR; TPR.
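To make the terms in the abstract and keywords concrete, the following minimal Python sketch combines the label predictions of three classifiers and derives the confusion matrix counts, the true positive rate (TPR) and the false positive rate (FPR). It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say how the best classifiers are joined, so simple majority voting is assumed here, and the function names and the six toy labels are hypothetical and are not taken from the survey dataset.

```python
from collections import Counter


def majority_vote(*prediction_lists):
    """Combine per-classifier label lists by majority vote, instance by instance.

    Assumption: with an odd number of voters and two classes, ties cannot occur.
    """
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*prediction_lists)]


def confusion_counts(actual, predicted, positive):
    """Return (TP, FP, TN, FN) for a binary problem, treating `positive` as the positive class."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1
            else:
                fp += 1
        else:
            if a == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def rates(tp, fp, tn, fn):
    """Return (TPR, FPR); a rate is reported as 0.0 when its denominator is zero."""
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr


# Hypothetical toy labels ("F" = female, "M" = male); not data from the study.
actual = ["F", "F", "F", "M", "M", "M"]
svm_pred = ["F", "F", "M", "M", "M", "F"]
ann_pred = ["F", "M", "F", "M", "F", "F"]
rf_pred = ["F", "F", "F", "F", "M", "M"]

ensemble_pred = majority_vote(svm_pred, ann_pred, rf_pred)
tp, fp, tn, fn = confusion_counts(actual, ensemble_pred, positive="F")
tpr, fpr = rates(tp, fp, tn, fn)
print(ensemble_pred, (tp, fp, tn, fn), tpr, fpr)
```

In this toy example each individual classifier makes two errors, while the voted prediction makes only one, which is the behaviour that motivates an ensemble. Treating "F" as the positive class is an arbitrary choice for the illustration.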
1. Introduction
Educational data mining has emerged as an important area of
research for extracting usable knowledge from large educational
data repositories. Data mining algorithms are used to obtain
hidden information and desired benefits from these large data
repositories [6]. Recently, the analysis of educational data, for
instance learning analytics, academic analytics, educational data
mining, predictive analytics and learners' analytics, has emerged
as an innovative area of research [7]. Machine learning
(ML) is the process of estimating unknown dependencies or
structures in a system using a limited number of observations; it
is used in data mining applications to retrieve hidden information
and to support decision-making [1]. ML methods include rote
learning, learning by analogy, and inductive learning, the last of
which covers learning by examples and learning by
experimentation and discovery [12]. According to [11], for
classification and regression problems various classifiers can be
learned, such as decision trees, rules, Bayes networks, artificial
neural networks and support vector machines, and different
knowledge representation models can be used to support
decision-making methods. Multiple
ensemble learning models have been theoretically and empirically
shown to provide significantly better performance than single
weak learners, especially when dealing with high-dimensional,
complex regression and classification problems [12]. In a study
classifying student age from ICT attitude, the OneR, J48 and
Naïve Bayes (42.9%) techniques each obtained below 50%
classification accuracy [9]. Artificial Neural Networks (ANN) have a
large generalization capability and can approximate functions
used for both regression and classification [3]. According to
[13,5], SVM is a discriminative method that learns boundaries
between classes and performs binary classification based on a
separating hyperplane; the separator is chosen to maximize the
distance between the hyperplane and the nearest training vectors,
which are called support vectors. According to [14], in KNN
each sample is assigned to the majority class of its k closest
neighbors, where k is a parameter. The training data samples are
vectors in a multidimensional feature space, each with a given
target class label. Logistic regression was applied to develop the
model for the early and reliable prediction of students' pass or
fail status at the undergraduate level [8]. Key demographic
variables and assignment marks were used in supervised machine
learning algorithms (decision trees, artificial neural networks,
naïve Bayes classifier, instance-based learning, logistic regression
and support vector machines) to predict students' performance at
the Hellenic Open University [10]. Gender is one of the principal
determinants of the probability of dropping out. In the binomial
probit model used in that study, males have a higher probability
of dropping out
relative to the reference group of females [2]. In addition, experi-