ICACSIS 2013 ISBN: 978-979-1421-19-5

Clustering Metagenome Fragments Using Growing Self Organizing Map

Marlinda Vasty Overbeek, Wisnu Ananta Kusuma, Agus Buono
Department of Computer Science, Faculty of Mathematics and Natural Science, Bogor Agricultural University
Email: marlinda_vasty@yahoo.com, ananta@ipb.ac.id, pudesha@yahoo.co.id

Abstract: Microorganism samples taken directly from the environment are not easy to assemble because they contain mixtures of microorganisms. If the sample complexity is very high and the sample comes from a highly diverse environment, assembling the DNA sequences becomes more difficult because interspecies chimeras can occur. To avoid this problem, in this research we propose binning based on composition using unsupervised learning. We employed trinucleotide and tetranucleotide frequencies as features and the GSOM algorithm as the clustering method. GSOM was implemented to map the features into a high-dimensional feature space. We tested our method using a small microbial community dataset. The quality of the clusters was evaluated based on the following parameters: topographic error, quantization error, and error percentage. The evaluation results show that the best clusters are obtained using GSOM with tetranucleotide frequency.

I. INTRODUCTION

METAGENOMICS is the study of highly complex microbial communities, which allows culture-independent analysis [1], [2]. As we know, only 1% of microorganisms can be cultured by standard cultivation techniques. The rest must be taken directly from the environment; such a sample is called a metagenome sample. This kind of sample contains mixtures of microorganisms. This characteristic makes the assembly process more difficult because it yields more interspecies chimeras [5]. To solve the problem, we used a binning process before or after assembling the metagenome fragments.
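The trinucleotide and tetranucleotide frequency features mentioned in the abstract can be illustrated with a minimal sketch. The function name `kmer_frequencies`, the normalization to relative frequencies, the skipping of windows that contain ambiguous bases, and the toy fragment are all assumptions made for illustration; the feature extraction actually used in this research may differ.

```python
from itertools import product


def kmer_frequencies(sequence, k):
    """Return the relative frequency of every DNA k-mer, in lexicographic order."""
    alphabet = "ACGT"
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = sequence.upper()
    # Slide a window of width k along the fragment, one base at a time.
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        # Assumption: windows containing ambiguous bases (e.g. N) are skipped.
        if window in counts:
            counts[window] += 1
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[kmer] / total for kmer in kmers]


fragment = "ACGTACGTAC"  # hypothetical toy fragment, not from the paper's dataset
tri = kmer_frequencies(fragment, 3)    # 4**3 = 64 features
tetra = kmer_frequencies(fragment, 4)  # 4**4 = 256 features
print(len(tri), len(tetra))
```

Under these assumptions, each fragment becomes a fixed-length vector (64 values for trinucleotides, 256 for tetranucleotides) regardless of fragment length, which is the form a clustering method such as GSOM takes as input.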
Binning is a technique to classify or cluster organisms based on taxonomy [5], [6]. There are two binning approaches. The first is binning based on homology, such as BLAST [7], [8] and MEGAN [9]. The second is the composition-based approach, which applies unsupervised or supervised learning as the method and uses oligonucleotide frequencies as input in the feature space. Many applications have been developed based on this approach. Some applications that employ unsupervised learning are TETRA [10], Self Organizing Clustering [11], Self Organizing Map [12], and Growing Self Organizing Map [1], [13]. Those that use supervised learning are PhyloPythia [14], Naive Bayesian Classification [15], and Phymm [16].

One previous study used GSOM combined with oligonucleotide frequencies to explore genome signatures. Clear species-specific separation of sequences was obtained in the test with fragments > 8 kbp. The fragments were derived from 30 species, separated into 3 datasets of 10 species each [1].

In this research, we employed binning based on composition with unsupervised learning. We used 1 kbp DNA sequences derived from 18 species. We read the fragments uniformly. The previous research [1] used long fragments (8 kbp). Using short fragments (1 kbp) gave poor performance [5], [17]. In this research, we aim to overcome the limitation of using short fragments. The purpose of this research is to determine the performance of GSOM in clustering metagenome fragments of short length (1 kbp).

II. MATERIAL AND METHODS

Growing Self Organizing Map (GSOM)

GSOM consists of 3 main phases (Figure 1): the initialization phase, the growing phase, and the smoothing phase [18], [19].

Initialization phase

In this phase, the algorithm initializes four starting nodes, which are randomly selected from the input dataset. The initialization nodes are shown in Figure 2. Next, the global parameter Growth Threshold (GT) is calculated for the given dataset according to the user requirement. The GT value is defined as: