International Journal on Recent and Innovation Trends in Computing and Communication
ISSN: 2321-8169, Volume: 2, Issue: 5, 1078-1080
IJRITCC | May 2014, Available @ http://www.ijritcc.org

Text Classification of Database of Genotypes and Phenotypes in Heart, Lung and Blood Studies

Mr. Suresh S Kolekar
Department of Computer Engineering and Information Technology
College of Engineering Pune, India
e-mail: suresh.kolekar123@gmail.com

Mr. Satish S Kumbhar
Department of Computer Engineering and Information Technology
College of Engineering Pune, India
e-mail: satish116@gmail.com

Abstract: The database of Genotypes and Phenotypes (dbGaP) was developed by the National Center for Biotechnology Information (NCBI) to archive and distribute the results of studies that have examined the interaction of genotype and phenotype. It is a public repository for individual-level phenotype, exposure, genotype, and sequence data and the associations between them. Finding the studies relevant to a particular interest accurately and completely is a challenging task because the dbGaP Entrez system relies on keyword-based search. Text mining is an emerging research field that enables users to extract useful information from text documents; it covers retrieval, classification, clustering, and machine learning techniques for classifying text documents. This paper surveys text classification, the text classification process, and different text classification algorithms for classifying studies in the database of Genotypes and Phenotypes.

Keywords: text classification; machine learning; database; genome-wide association studies; GWAS.

I. INTRODUCTION

The National Library of Medicine (NLM), part of the National Institutes of Health (NIH), announces the introduction of dbGaP, a new database designed to archive and distribute data from genome-wide association (GWA) studies. GWA studies explore the association between specific genes and observable traits, such as blood pressure and weight, or the presence or absence of a disease or condition (phenotype information). Connecting phenotype and genotype data provides information about the genes that may be involved in a disease process or condition, which can be critical for better understanding the disease and for developing new diagnostic methods and treatments. dbGaP, the database of Genotypes and Phenotypes, will for the first time provide a central location where interested parties can see all study documentation and view summaries of the measured variables in an organized and searchable web format.

At the time of writing, dbGaP contained 468 studies, including more than 144,716 phenotype variables. However, retrieving relevant studies accurately and completely is a challenging issue, because the phenotypic information related to studies is often stored in a non-standardized format. For particular queries, the dbGaP Entrez system returns several studies that are irrelevant, and it does not make clear how particular studies are selected or why they appear in a particular order. Thus, users have to review each study description carefully to determine which studies are relevant, which can become a laborious and time-consuming task when many studies are retrieved.

Text classification is an important part of text mining in the field of document retrieval. Current research on text classification aims to improve the quality of text representation and to develop high-quality text classifiers. Machine learning techniques have been actively explored for text classification.
Among these are the naive Bayes classifier, k-nearest neighbor classifiers, support vector machines, and neural networks.

II. CHALLENGING ISSUES IN EXISTING SYSTEM

The dbGaP Entrez system returns studies that are not always relevant to the query. Consequently, users have to review each study description carefully to retrieve the relevant studies, which can become a laborious and time-consuming task when many studies are retrieved.

A. Manual Performance Evaluation of dbGaP Search System

Four hundred and sixty-eight studies were available in dbGaP on May 1, 2014. Each title and abstract was manually reviewed and annotated into heart, lung, blood, and other categories. Four different labels were therefore used to annotate the documents. These labels are shown in Table I and were assigned manually according to their relevance to each document.
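
To make the connection between manual annotation and machine learning concrete, the following is a minimal sketch of a multinomial naive Bayes classifier, one of the techniques named in the introduction, trained on labeled text. It is an illustration under stated assumptions, not the system evaluated in this paper: the training sentences are invented stand-ins for annotated dbGaP titles, and the names `TRAIN`, `tokenize`, and `NaiveBayes` are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

# Invented toy examples standing in for manually annotated study titles.
# They are NOT real dbGaP records; labels follow the four categories above.
TRAIN = [
    ("coronary artery disease and myocardial infarction cohort", "heart"),
    ("genetics of atrial fibrillation and heart failure", "heart"),
    ("asthma and chronic obstructive pulmonary disease genetics", "lung"),
    ("lung function decline in smokers with pulmonary fibrosis", "lung"),
    ("sickle cell anemia and hemoglobin levels", "blood"),
    ("platelet count and blood clotting disorders", "blood"),
    ("genome wide study of breast cancer risk", "other"),
    ("alzheimer disease and cognitive decline in aging", "other"),
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, examples):
        self.class_docs = Counter()               # documents per label
        self.word_counts = defaultdict(Counter)   # token counts per label
        self.vocab = set()
        for text, label in examples:
            self.class_docs[label] += 1
            for tok in tokenize(text):
                self.word_counts[label][tok] += 1
                self.vocab.add(tok)
        self.total_docs = sum(self.class_docs.values())
        return self

    def log_scores(self, text):
        """Return log P(label) + sum of log P(token | label) per label."""
        # Tokens never seen in training are ignored.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        vocab_size = len(self.vocab)
        scores = {}
        for label, n_docs in self.class_docs.items():
            score = math.log(n_docs / self.total_docs)
            total = sum(self.word_counts[label].values())
            for tok in tokens:
                count = self.word_counts[label][tok]
                score += math.log((count + 1) / (total + vocab_size))
            scores[label] = score
        return scores

    def predict(self, text):
        """Return the label with the highest log score."""
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


model = NaiveBayes().fit(TRAIN)
print(model.predict("pulmonary disease in asthma patients"))  # prints "lung"
```

With only eight training sentences the sketch shows the mechanics rather than realistic accuracy; in practice the manually labeled titles and abstracts would serve as the training and evaluation data.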