Indonesian Journal of Electrical Engineering and Computer Science
Vol. 21, No. 2, February 2021, pp. 1151~1159
ISSN: 2502-4752, DOI: 10.11591/ijeecs.v21.i2.pp1151-1159
Journal homepage: http://ijeecs.iaescore.com

Knowledge discovery from gene expression dataset using bagging lasso decision tree

Umu Sa’adah, Masithoh Yessi Rochayani, Ani Budi Astuti
Faculty of Mathematics and Natural Sciences, Universitas Brawijaya, Indonesia

Article Info

Article history:
Received Jun 18, 2020
Revised Aug 11, 2020
Accepted Sep 7, 2020

Keywords:
Bagging
Decision tree
Feature selection
Gene expression
High-dimensional

ABSTRACT

Classifying high-dimensional data is a challenging task in data mining. Gene expression data is a type of high-dimensional data that has thousands of features. This study proposes a method to extract knowledge from high-dimensional gene expression data through feature selection and classification. Lasso was used to select features, and the classification and regression tree (CART) algorithm was used to construct the decision tree model. To examine the stability of the lasso decision tree, we performed bootstrap aggregating (bagging) with 50 replications. The gene expression data used was an ovarian tumor dataset that has 1,545 observations, 10,935 gene features, and a binary class. The findings showed that the lasso decision tree could produce an interpretable model that is theoretically correct and had an accuracy of 89.32%. Meanwhile, the model obtained from the majority vote gave an accuracy of 90.29%, an increase of about 1 percentage point over the single lasso decision tree model. This small increase in accuracy indicates that the lasso decision tree classifier is stable.

This is an open access article under the CC BY-SA license.

Corresponding Author:
Umu Sa’adah
Department of Statistics
Universitas Brawijaya
Jalan Veteran, Malang, Indonesia
Email: u.saadah@ub.ac.id

1. INTRODUCTION

Gene expression data have been used to study the differences in gene characteristics between patients with certain diseases and normal people. The major challenge in analyzing gene expression data is that the number of predictors (genes) is far larger than the number of samples. Gene expression data is a type of high-dimensional data that consists of thousands, even tens of thousands, of gene features, while the sample size is only in the hundreds. Therefore, a specific strategy is needed to deal with the dimensionality problem in gene expression data.

One strategy for classifying high-dimensional data is to reduce the dimension. There are two approaches to dimension reduction, namely feature extraction and feature selection. The common dimension reduction approach for gene expression data is feature selection, which eliminates irrelevant and redundant features. The study in [1] investigated the influence of feature selection on the accuracy of gene expression data classification and found that feature selection can increase accuracy by up to 9%.

Several methods that combine feature selection and classification have been applied to the classification of gene expression data. Assawamakin et al. [2] used recursive feature elimination (RFE) to select genes and a support vector machine (SVM) to classify several gene expression datasets. Kang et al. [3] proposed a hybrid method of Relaxed Lasso and Generalized SVM for the multiclass classification of gene expression data. In that paper, Kang et al. reported the selected genes, but these results are not validated