Decision Support in Attribute Selection with Machine Learning Approach

Wagner Arbex, Brazilian Agricultural Research Corporation (Embrapa), Juiz de Fora, MG, Brazil, wagner.arbex@embrapa.br
Fabyano Fonseca e Silva, Federal University of Viçosa (UFV), Viçosa, MG, Brazil
Marcos Vinícius Gualberto Barbosa da Silva, Brazilian Agricultural Research Corporation (Embrapa), Juiz de Fora, MG, Brazil
Fabrízzio Condé de Oliveira, Federal University of Juiz de Fora (UFJF), Juiz de Fora, MG, Brazil
Luis Varona, University of Zaragoza (UNIZAR), Zaragoza, Spain
Rui da Silva Verneque, Brazilian Agricultural Research Corporation (Embrapa), Juiz de Fora, MG, Brazil
Carlos Cristiano Hasenclever Borges, Federal University of Juiz de Fora (UFJF), Juiz de Fora, MG, Brazil

Abstract: This paper proposes a method to simultaneously select the most relevant single nucleotide polymorphism (SNP) markers, the attributes, for the characterization of any measurable phenotype described by a continuous variable, using support vector regression (SVR) with the Pearson VII Universal Kernel (PUK). The proposed study is multiattribute, in that it considers several markers simultaneously to explain the phenotype, and is based jointly on statistical tools, machine learning and computational intelligence.

Keywords: decision support; attribute selection; machine learning; SVR; computational modeling

INTRODUCTION

Single nucleotide polymorphisms (SNPs) are an abundant form of genomic variation, distinct from rare variants [1], and the basic assumption of genome-wide association studies (GWAS) is that the evaluated characteristic can be explained by this type of marker. The traditional approach is to identify the markers that have a high association with the phenotype through the p-value of the regression coefficient (beta) from a linear regression of the phenotype on each SNP.
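The single-marker screening described above can be sketched as follows. This is only an illustrative sketch, not the authors' pipeline: the function name `single_marker_scan`, the 0/1/2 genotype coding, the simulated data and the large-sample normal approximation to the p-value (used in place of the exact Student t distribution) are all assumptions made for the example.

```python
import math
import numpy as np


def single_marker_scan(genotypes, phenotype):
    """Regress the phenotype on each SNP separately.

    genotypes: (n_animals, n_snps) array, assumed coded 0/1/2.
    phenotype: (n_animals,) continuous trait values.
    Returns a list of (beta, p_value) tuples, one per SNP.
    """
    X = np.asarray(genotypes, dtype=float)
    y = np.asarray(phenotype, dtype=float)
    n = X.shape[0]
    yc = y - y.mean()
    results = []
    for j in range(X.shape[1]):
        xc = X[:, j] - X[:, j].mean()
        sxx = float(xc @ xc)
        if sxx == 0.0:
            # Monomorphic marker: the slope is undefined.
            results.append((float("nan"), float("nan")))
            continue
        beta = float(xc @ yc) / sxx            # least-squares slope
        resid = yc - beta * xc
        sigma2 = float(resid @ resid) / (n - 2)  # residual variance
        se = math.sqrt(sigma2 / sxx)           # standard error of beta
        if se == 0.0:
            p = 0.0 if beta != 0.0 else 1.0
        else:
            # Two-sided p-value, normal approximation (assumption).
            p = math.erfc(abs(beta / se) / math.sqrt(2.0))
        results.append((beta, p))
    return results


# Simulated example: only the first of five markers affects the trait.
rng = np.random.default_rng(0)
snps = rng.integers(0, 3, size=(244, 5))
trait = 2.0 * snps[:, 0] + rng.normal(size=244)
for idx, (beta, p) in enumerate(single_marker_scan(snps, trait)):
    print(f"SNP {idx}: beta={beta:.3f}, p={p:.3g}")
```

Markers would then be ranked by p-value, which is the step the next sentences build on. Each marker is tested in isolation here, so interactions between markers are not captured.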
After this step, the most relevant SNPs are analyzed for proximity to some region that is associated with that feature or with other features that can be indirectly correlated with the phenotype in question. Therefore, an alternative approach is to increase the number of markers, also considering those with small correlations with the trait. However, this creates two problems: the number of markers is high and many of them are correlated. According to [2], such analysis requires statistical methods that handle the selection of covariates (i.e., the multicollinearity problem) and the regularization of the estimation process (i.e., the problem of dimensionality). Other regression techniques, such as ridge regression and partial least squares regression [3], were created to address this problem. On the other hand, machine learning algorithms such as the support vector machine (SVM), applied in GWAS to classification problems with multiple markers, have demonstrated satisfactory performance, as in [4], [5] and [6].

This study aims to propose a method that can simultaneously evaluate several SNPs in relation to a phenotype described by a continuous variable, unlike the dichotomous case-control phenotypes addressed by the majority of GWAS. This brings two immediate benefits relative to the standard methodology: one relating to the various levels of the phenotype and the other to the complex simultaneous interactions that may occur between the various markers.

To demonstrate the proposed method, a sample of 343 genotyped bulls provided by the Brazilian Agricultural Research Corporation (Embrapa) was used; only 244 of these animals have female offspring, which allows the evaluated phenotype to be measured.
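As a rough illustration of how a multi-marker model compares animals on all SNPs at once, the sketch below computes the Pearson VII Universal Kernel (PUK) between genotype vectors. The function name `puk_kernel`, the default values `sigma=1.0` and `omega=1.0`, and the toy genotypes are assumptions for the example; the parameter values used in the study are not stated in this section.

```python
import numpy as np


def puk_kernel(A, B, sigma=1.0, omega=1.0):
    """Pearson VII Universal Kernel between the rows of A and B.

    K(x, z) = 1 / (1 + (2 * ||x - z|| * sqrt(2**(1/omega) - 1) / sigma)**2)**omega
    sigma and omega control the width and tailing of the Pearson VII peak.
    """
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    # Pairwise squared Euclidean distances via the expansion of ||a - b||^2.
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    dist = np.sqrt(np.maximum(sq, 0.0))  # clip tiny negatives from rounding
    scale = 2.0 * np.sqrt(2.0 ** (1.0 / omega) - 1.0) / sigma
    return 1.0 / (1.0 + (scale * dist) ** 2) ** omega


# Toy genotypes (assumed 0/1/2 coding): three animals, three markers.
toy = np.array([[0, 1, 2], [2, 1, 0], [1, 1, 1]])
gram = puk_kernel(toy, toy)
print(gram)
```

Because every kernel value depends on the full genotype vector of both animals, all markers contribute jointly to the similarity. In scikit-learn, a function with this signature can be passed as a callable kernel, for example `SVR(kernel=puk_kernel)`, although that is only one possible way to fit such a model.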