Dnyati.S.Randhave et al., International Journal of Computer Science and Mobile Computing, Vol. 3, Issue 7, July 2014, pg. 832-840
© 2014, IJCSMC All Rights Reserved
Available Online at www.ijcsmc.com
A Monthly Journal of Computer Science and Information Technology
ISSN 2320–088X

RESEARCH ARTICLE

Generative Topic Modeling in Taxonomic Structure of Genomic Data using LDA

Dnyati.S.Randhave¹, S.N.Deshmukh²
¹Department of Computer Science and IT, Dr. Babasaheb Ambedkar Marathwada University, Aurangabad, India
²Department of Computer Science and IT, Dr. Babasaheb Ambedkar Marathwada University, Aurangabad, India
¹dnyati.randhave@gmail.com; ²sndeshmukh@hotmail.com

Abstract- Probabilistic topic models have been developed for applications in various domains, such as text mining and information retrieval. In this work, we focus on probabilistic topic models based on LDA; specifically, a probabilistic topic model is proposed for data analysis and function analysis using a homology-based approach and a composition-based approach. We aim to develop a new method that can analyze the genome-level composition of DNA sequences in order to characterize a set of common genomic features shared by the same species and to identify their functional roles. To this end, we first show that a generative topic model can be used to model the taxon abundance information obtained by the homology-based approach and to study the microbial core. The model considers each sample as a 'document' that has a mixture of functional groups, while each functional group (also known as a 'latent topic') is a weighted mixture of species. Therefore, estimating the generative topic model for taxon abundance data uncovers the distribution over latent functions (latent topics) in each sample.
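To make the sample-as-document analogy concrete, the following is a minimal sketch of the generative process that LDA assumes, applied to taxon abundance data. It is an illustration only, not the estimation procedure used in this work: the species names, the two functional groups, their species weights, and the Dirichlet parameter are all hypothetical values chosen for the example.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one probability vector from a Dirichlet distribution
    by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_sample(n_reads, alpha, topic_species, species_names, rng):
    """Generate taxon counts for one sample ('document') under the
    LDA generative process."""
    # Mixture over functional groups (latent topics) for this sample.
    theta = sample_dirichlet(alpha, rng)
    counts = {name: 0 for name in species_names}
    for _ in range(n_reads):
        # Pick a latent functional group according to the sample's mixture.
        z = rng.choices(range(len(theta)), weights=theta)[0]
        # Pick a species according to that group's weights over species.
        species = rng.choices(species_names, weights=topic_species[z])[0]
        counts[species] += 1
    return theta, counts


# Hypothetical toy setup: four species and two functional groups.
SPECIES = ["species_A", "species_B", "species_C", "species_D"]
TOPIC_SPECIES = [
    [0.6, 0.3, 0.1, 0.0],  # functional group 0: weights over species
    [0.0, 0.1, 0.3, 0.6],  # functional group 1: weights over species
]

rng = random.Random(0)
theta, counts = generate_sample(1000, [0.5, 0.5], TOPIC_SPECIES, SPECIES, rng)
print("functional-group mixture:", theta)
print("taxon abundance counts:", counts)
```

Estimating the model reverses this process: given only the observed counts for many samples, inference recovers plausible values for the per-sample mixtures and the per-group species weights.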
Second, we apply a composition-based approach that breaks DNA sequences down into sub-reads called 'N-mers' and represents the sequences by their N-mer frequencies. We then introduce the Latent Dirichlet Allocation (LDA) model to study the genome-level statistical patterns (a.k.a. latent topics) of the N-mer features. Each estimated latent topic represents a certain component of the whole genome.

Keywords- Data mining, Bioinformatics (genome or protein) databases, Language models, Metagenomics
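The composition-based representation can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes overlapping windows with a step of one base, an alphabet restricted to A, C, G, and T, and frequencies normalized to sum to one. The function name `nmer_frequencies` and the example sequence are hypothetical.

```python
from collections import Counter
from itertools import product


def nmer_frequencies(sequence, n):
    """Return normalized frequencies of all overlapping N-mers
    in a DNA sequence, over the full 4**n vocabulary."""
    alphabet = "ACGT"
    seq = sequence.upper()
    counts = Counter()
    for i in range(len(seq) - n + 1):
        nmer = seq[i:i + n]
        # Assumption: skip windows containing ambiguous bases such as 'N'.
        if set(nmer) <= set(alphabet):
            counts[nmer] += 1
    total = sum(counts.values())
    # Fixed vocabulary so every sequence maps to a vector of equal length.
    vocabulary = ["".join(p) for p in product(alphabet, repeat=n)]
    return {w: (counts[w] / total if total else 0.0) for w in vocabulary}


# Hypothetical example sequence with one ambiguous base.
freqs = nmer_frequencies("ACGTACGTNACG", 2)
for nmer, freq in freqs.items():
    if freq > 0:
        print(nmer, round(freq, 3))
```

In the document analogy, each sequence plays the role of a document and each N-mer plays the role of a word, so the resulting count or frequency vectors can serve as the input to a topic model.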