Dragon exploratory system on Hepatitis C Virus (DESHCV)

Samuel K. Kwofie a, Aleksandar Radovanovic b, Vijayaraghava S. Sundararajan a, Monique Maqungo a, Alan Christoffels a, Vladimir B. Bajic b,*

a South African National Bioinformatics Institute, University of the Western Cape, Private Bag X17, Modderdam Road, Bellville, Cape Town, South Africa
b Computational Bioscience Research Center, King Abdullah University of Science and Technology, Thuwal 23955-6900, Saudi Arabia

1. Introduction

The rising chronicity and global infection rate of Hepatitis C Virus (HCV) make it necessary to unlock the molecular etiology underlying the pathophysiology of HCV-related diseases such as liver cancer. The wealth of molecular data in the corpus of published biomedical literature could be leveraged to support the discovery of novel anti-viral drugs, cellular receptors and appropriate predictive biomarkers. Most of the data derived from high-throughput and "omics" experiments exist in a variety of formats, which makes cross-data integration difficult. The development of HCV-specific databases as repositories of information usable in cross-discipline biology research is therefore vital. The Los Alamos Hepatitis C Virus sequence database (http://hcv.lanl.gov) offers annotated sequences and analysis tools (Kuiken et al., 2005). The Los Alamos hepatitis C immunology database (http://hcv.lanl.gov/content/immuno/immuno-main.html) is a repository of biocurated immunological epitopes integrated with retrieval and analysis tools (Yusim et al., 2005). The Japanese HCV database integrated in the HVDB (http://s2as02.genes.nig.ac.jp) comprises phylogenetic data and provides embedded Java viewers for visualizing phylogenetic trees and the HCV genome.
The European Hepatitis C Virus database (euHCVdb, http://euhcvdb.ibcp.fr) provides annotated sequences, analysis tools, and information on protein structure and function (Combet et al., 2007). The Hepatitis C Virus sequence and immunology database and analytical applications (HCVdb, http://www.hcvdb.org/index.asp?bhcp=1) offers analyzed protein sequences and features, epitopes, and curated knowledge on protein interactions and function. Binding site finder (BSFINDER, http://wilab.inha.ac.kr/bsfinder) enables prediction of HCV binding site residues and potential interacting protein partners using a support vector machine (Chen and Han, 2009). A comprehensive review of selected HCV-related databases has highlighted the capabilities, utilities and applications of these resources (Kuiken et al., 2006). Hepatitis C Virus-specific databases contain much useful information on molecular biology, sequences, immunology, protein structure and function, viral

Infection, Genetics and Evolution 11 (2011) 734–739

ARTICLE INFO

Article history:
Received 30 April 2010
Received in revised form 30 November 2010
Accepted 8 December 2010
Available online 29 December 2010

Keywords:
Hepatitis C Virus
Text-mining
Dictionaries
Biomedical concepts
Database
Hypotheses generation

ABSTRACT

Even though Hepatitis C Virus (HCV) cDNA was characterized about 20 years ago, the molecular etiology underlying HCV infections is still insufficiently understood. Current global rates of infection and its increasingly chronic character are causes of concern for health policy experts. The vast amount of data accumulated from biochemical, genomic, proteomic, and other biological analyses allows for novel insights into the structure and life cycle of HCV and the functions of its proteins. Biomedical text-mining is a useful approach for analyzing the growing corpus of published scientific literature on HCV.
We report here the first comprehensive online resource for HCV-customized biomedical text-mining, the Dragon exploratory system on Hepatitis C Virus (DESHCV), a text-mining and relationship-exploring knowledgebase developed by exploring the literature on HCV. The pre-compiled dictionaries existing in the dragon exploratory system (DES) were enriched with biomedical concepts pertaining to HCV proteins, their name variants and symbols, making the system suitable for targeted information exploration and knowledge extraction focused on HCV. A set of 32,895 abstracts, retrieved from the PubMed database using keyword searches related to HCV, was processed by recognizing concepts drawn from several dictionaries. The web query interface enables retrieval of information using specified concepts, keywords and phrases, and generates text-derived association networks and hypotheses, which can be tested to identify potentially novel relationships between different concepts. Such an approach could also support the search for diagnostic or even therapeutic targets. DESHCV thus represents an online literature-based discovery resource, freely accessible to academic and non-profit users via http://apps.sanbi.ac.za/DESHCV/ and its mirror site http://cbrc.kaust.edu.sa/deshcv/.

© 2010 Elsevier B.V. All rights reserved.

* Corresponding author. E-mail address: vladimir.bajic@kaust.edu.sa (V.B. Bajic).

Infection, Genetics and Evolution, journal homepage: www.elsevier.com/locate/meegid
1567-1348/$ – see front matter. doi:10.1016/j.meegid.2010.12.006
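The abstract describes dictionary-based concept recognition over abstracts, followed by the generation of text-derived association networks. A minimal sketch of that general idea is shown below. It is not the DESHCV implementation: the tiny dictionary, the function names (`compile_dictionary`, `recognize_concepts`, `association_network`) and the rule that two concepts are associated when they co-occur in the same abstract are all assumptions made for illustration.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical mini-dictionary (NOT the DES dictionaries):
# canonical concept -> name variants and symbols.
DICTIONARY = {
    "NS5A": ["NS5A", "nonstructural protein 5A"],
    "NS3": ["NS3", "NS3 protease"],
    "core protein": ["core protein", "HCV core"],
    "hepatocellular carcinoma": ["hepatocellular carcinoma", "HCC", "liver cancer"],
}


def compile_dictionary(dictionary):
    """Build one case-insensitive, whole-term regex per canonical concept."""
    patterns = {}
    for concept, variants in dictionary.items():
        # Try longer variants first so "NS3 protease" wins over "NS3".
        ordered = sorted(variants, key=len, reverse=True)
        alternatives = "|".join(re.escape(v) for v in ordered)
        patterns[concept] = re.compile(
            r"(?<!\w)(?:" + alternatives + r")(?!\w)", re.IGNORECASE
        )
    return patterns


def recognize_concepts(text, patterns):
    """Return the set of canonical concepts mentioned in one abstract."""
    return {concept for concept, pattern in patterns.items() if pattern.search(text)}


def association_network(abstracts, patterns):
    """Count, for each concept pair, the abstracts in which both occur."""
    edges = Counter()
    for text in abstracts:
        found = sorted(recognize_concepts(text, patterns))
        edges.update(combinations(found, 2))
    return edges


if __name__ == "__main__":
    abstracts = [
        "NS5A interacts with the HCV core in hepatocytes.",
        "Liver cancer risk is linked to NS5A expression.",
        "The NS3 protease is a drug target.",
    ]
    network = association_network(abstracts, compile_dictionary(DICTIONARY))
    for (a, b), weight in network.most_common():
        print(f"{a} -- {b}: {weight}")
```

In this sketch, edge weights are simple co-occurrence counts. Case-insensitive matching of short symbols such as "HCC" can produce false positives, so a real system would need disambiguation beyond what is shown here.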