The 4th Joint Symposium on Computational Intelligence (JSCI 4), 2 February 2018, Bangkok, Thailand

SNP selection for Porcine breed classification by a hybrid information gain and genetic algorithm

Wanthanee Rathasamuth*, Kitsuchart Pasupa†, and Sissades Tongsima‡

*Faculty of Information Technology, King Mongkut’s Institute of Technology Ladkrabang, Bangkok 10520, Thailand
†National Center for Genetic Engineering and Biotechnology (BIOTEC), National Science and Technology Development Agency (NSTDA), Pathum Thani 12120, Thailand
Email: *rathasamuth.wan@gmail.com, †kitsuchart@it.kmitl.ac.th, ‡sissades@biotec.or.th

Abstract: A single nucleotide polymorphism (SNP) is a variation in a DNA sequence that can be linked to a particular trait of an organism. Good SNP selection can help identify a porcine breed that grows fast and gives a high yield. SNP selection can be done with a computerized feature selection method and a classification technique. At present, an effective classification model can handle only a small number of features efficiently; too large a number may cause over-fitting in the classification. Therefore, the SNPs, or features, need to be reduced to an optimum number for an effective porcine SNP analysis. This paper proposes an approach that reduces the number of features in porcine SNP analysis to an optimum number by a hybrid of Information Gain (IG) and genetic algorithm (GA) techniques. A performance test demonstrated that this approach was able to select a minimum number of features (1.51% of the total number of features) that provided an average classification accuracy of 94.02%, compared with 95.28% when all of the features were used.

Index Terms: Feature selection, Bioinformatics, Machine learning, Single Nucleotide Polymorphisms.

I. INTRODUCTION

China might be the first country to have started selectively breeding wild pigs, 5,000 years ago. Pig breeding in Thailand was heavily influenced by Chinese immigrants to the country.
Today, pigs are an economically important animal in Thailand, so selecting the right breed for the geographical location is a very important issue. The diverse physical traits of pigs result from differences in DNA base sequences, which are called single nucleotide polymorphisms (SNPs). A thorough porcine SNP analysis can determine the SNPs that provide good growth and reproduction. The difficulty is that a single organism has millions of SNPs, so a manual SNP analysis by an expert is out of the question, not to mention the huge amount of other resources it would need.

Today, a good way to address this issue is to use bioinformatics, an integration of computer science, biology, mathematics, and engineering. Machine learning [1] has been applied to genomics, proteomics, microarrays, and systems biology for the classification of genes. Several classification techniques for bioinformatics are presented in [1], such as support vector machines (SVM), decision trees, neural networks, Bayesian classifiers, and nearest neighbors. These classification techniques cannot effectively handle too large a number of features, which may cause the commonly encountered over-fitting problem (high accuracy on the training dataset but low accuracy on the testing dataset). Reducing the features to an optimal subset can therefore make a classification attempt successful.

Papers that deal with this issue include [2], a review of feature selection applied to bioinformatics. It reports three types of feature selection: filter methods, such as the t-test and information gain (IG); wrapper methods, such as genetic algorithms (GA) and other nature-inspired algorithms; and embedded methods, such as random forests and decision trees. In [3], several nature-inspired algorithms used to perform feature selection were reviewed. The main point was how to increase selection efficiency and reduce prediction error.
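
As a rough illustration of how a filter method such as IG scores each feature without involving a classifier, the sketch below computes the information gain of discrete SNP features with respect to breed labels and keeps the highest-scoring ones. The 0/1/2 genotype coding, the toy data, and the function names are assumptions made for this example; they are not taken from the paper and do not reproduce its experimental setup.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature, labels):
    """IG = H(labels) - H(labels | feature) for one discrete feature."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    # Weighted entropy of the labels within each feature value.
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def rank_by_information_gain(columns, labels, top_k):
    """Return the indices of the top_k columns by IG, plus all IG scores."""
    scores = [information_gain(col, labels) for col in columns]
    order = sorted(range(len(columns)), key=lambda i: scores[i], reverse=True)
    return order[:top_k], scores


# Hypothetical toy data: 3 SNPs (one list per SNP, genotypes coded 0/1/2)
# observed in 6 animals from 2 breeds.
breeds = ["A", "A", "A", "B", "B", "B"]
snps = [
    [0, 0, 0, 2, 2, 2],  # separates the two breeds perfectly
    [0, 1, 2, 0, 1, 2],  # carries no information about the breed
    [0, 0, 1, 1, 2, 2],  # partially informative
]

selected, scores = rank_by_information_gain(snps, breeds, top_k=2)
print(selected)
print([round(s, 3) for s in scores])
```

Because each SNP is scored on its own, the cost grows linearly with the number of features, which is why a filter of this kind is often used to shrink the feature set before a more expensive search.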
The most common problems found were too large a number of features and too small a training dataset. These are some of the challenges in classification analysis.

The conceptual frameworks of feature selection by filter, wrapper, and embedded methods are quite different. Filter methods are simple and efficient: they select features that have high index values, independently of the classification method. Wrapper methods rely on the classification method to select an optimal subset of features that provides high classification accuracy. Each round of feature evaluation includes a classification step; therefore, a large number of features results in a long computation time. Embedded methods are not very different from wrapper methods but include a feature reduction step that reduces their computation time. Because wrapper and embedded methods rely on a classification step to select an optimal subset of features, the selected features can fit the training dataset better, but their predictions may suffer from over-fitting. Filter methods, on the other hand, tend to suffer less from over-fitting.

This paper proposes a hybrid feature selection technique that combines IG with GA (IG+GA) to select the best porcine SNPs for the classification of pig breeds. The next section describes the datasets, followed by the experimental framework and its results.

II. DATASETS

The dataset used in this study consists of SNP data from 677 pig samples of 22 breeds: 356 samples from the dataset of porcine colonization of the Americas [4] and 321 samples from the dataset of the Project of Porcine Breed Improvement by Selection according to Whole Genome SNP Data supported
978-616-455-375-0 ©2018 JSCI