Dnyati.S.Randhave et al, International Journal of Computer Science and Mobile Computing, Vol.3 Issue.7, July- 2014, pg. 832-840
© 2014, IJCSMC All Rights Reserved
Available Online at www.ijcsmc.com
International Journal of Computer Science and Mobile Computing
A Monthly Journal of Computer Science and Information Technology
ISSN 2320–088X
IJCSMC, Vol. 3, Issue. 7, July 2014, pg.832 – 840
RESEARCH ARTICLE
Generative Topic Modeling in Taxonomic
Structure of Genomic Data using LDA
Dnyati.S.Randhave¹, S.N.Deshmukh²
¹Department of Computer Science and IT & Dr. Babasaheb Ambedkar Marathwada University, Aurangabad, India
²Department of Computer Science and IT & Dr. Babasaheb Ambedkar Marathwada University, Aurangabad, India
¹dnyati.randhave@gmail.com; ²sndeshmukh@hotmail.com
Abstract- Probabilistic topic models have been developed for applications in domains such as text mining and
information retrieval. In this work, we develop probabilistic topic models based on Latent Dirichlet Allocation
(LDA) for data analysis and functional analysis of genomic data, using a homology-based approach and a
composition-based approach. Our aim is a method that can analyze the genome-level composition of DNA
sequences, in order to characterize a set of common genomic features shared by the same species and infer their
functional roles. First, we show that a generative topic model can be used to model the taxon abundance
information obtained by the homology-based approach and to study the microbial core. The model treats each
sample as a ‘document’ that is a mixture of functional groups, while each functional group (also known as a
‘latent topic’) is a weighted mixture of species. Estimating the generative topic model for taxon abundance data
therefore uncovers the distribution over latent functions (latent topics) in each sample. Second, we use a
composition-based approach that breaks DNA sequences into sub-reads called ‘N-mers’ and represents each
sequence by its N-mer frequencies. We then apply the LDA model to study the genome-level statistical patterns
(the latent topics) of the N-mer features. Each estimated latent topic represents a certain component of the
whole genome.
Keywords- Data mining, Bioinformatics (genome or protein) databases, Language models, Metagenomics
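The composition-based pipeline outlined in the abstract (N-mer counting followed by LDA) can be illustrated with a minimal sketch. The abstract does not say how the model is estimated, so the collapsed Gibbs sampler below is an assumption chosen for brevity, not necessarily the authors' method. The function names `nmer_counts` and `lda_gibbs`, the hyperparameters `alpha` and `beta`, and the toy sequences are all hypothetical.

```python
import random
from collections import Counter


def nmer_counts(sequence, n):
    """Count overlapping N-mers (substrings of length n) in a DNA sequence."""
    sequence = sequence.upper()
    return Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))


def lda_gibbs(docs, num_topics, vocab_size, alpha=0.1, beta=0.01, iters=50, seed=0):
    """Collapsed Gibbs sampling for LDA (an assumed estimation method).

    docs is a list of documents, each a list of integer word ids.
    Returns the (doc_topic, topic_word) count matrices after sampling.
    """
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Conditional probability of each topic, up to a constant.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return doc_topic, topic_word


# Toy, made-up sequences; each sequence plays the role of one 'document'.
sequences = ["ACGTACGTAC", "TTTTGGGGTT", "ACGTTTGGAC"]
n = 2
counts = [nmer_counts(s, n) for s in sequences]
vocab = sorted(set().union(*counts))
index = {nmer: i for i, nmer in enumerate(vocab)}
# Expand each count table into a list of N-mer ids, one per occurrence.
docs = [[index[m] for m in c.elements()] for c in counts]
doc_topic, topic_word = lda_gibbs(docs, num_topics=2, vocab_size=len(vocab))
```

Each row of `doc_topic` counts how many N-mer occurrences in one sequence were assigned to each latent topic, and each row of `topic_word` counts how often each N-mer was assigned to that topic. Normalizing these rows (after adding `alpha` or `beta`) gives point estimates of the per-sequence topic mixture and the per-topic N-mer distribution. The same sampler would apply to the homology-based setting if samples are used as documents and species as words.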