GenSAS – An Online Integrated Genome Sequence Annotation Pipeline

T. Lee, C. Peace, S. Jung, P. Zheng, and D. Main
Horticulture and Landscape Architecture
Washington State University
Pullman, WA, U.S.A.

I. Cho
Computer Science and Information Systems
Saginaw Valley State University
University Center, MI, USA

Abstract – We present a web-based genome sequence annotation tool called GenSAS, the Genome Sequence Annotation Server. Advances in DNA sequencing technology and Web technologies have significantly reduced the costs associated with sequencing an organism’s genome, including the operating costs of hardware and software. However, researchers need to use multiple annotation tools to increase the accuracy of the annotation results, and the labor cost of analyzing the sequence data is still high because each tool uses its own non-uniform input and output formats. Many individual web-based tools are available for gene annotation, but GenSAS is unique in that it offers a one-stop website with a single graphical interface for running multiple structural and functional annotation tools, as well as for visualization and manual curation of the genome. GenSAS improves overall performance by distributing the work between the server and the client machine, and it supports a streamlined workflow that further simplifies the tasks and reduces the overall processing time.

Keywords – genomics tool; sequencing; annotation; web application; genome browser

I. INTRODUCTION

Many gene prediction methods and tools are available to help with gene identification. Gene prediction (or identification), however, is a complex task, and no single tool works as a cure-all. Each tool has its own strengths and weaknesses, and it is common for researchers to use multiple types of gene prediction tools in gene annotation and to combine the results to increase accuracy. Many tools work in a command-line environment and produce output in non-standard text formats or in MS Excel format.
Command-line tools are inherently hard for bench scientists to use, and textual output makes it hard to understand the spatial relationships between genomic data. A genome viewer such as GBrowse [1], which visually displays these spatial relationships, is therefore needed to view the gene annotation results. To use GBrowse (or another genome viewer), users have to provide the input data in the specific formats that the viewer requires. Converting one format to another can be done relatively easily with a programming language like Perl, but it requires biologists to consult with bioinformaticists. The development of GenSAS was initiated to provide biologists with a one-stop resource where they can use multiple gene annotation tools and view the results in a graphical interface.

Although GBrowse is one of the most commonly used genome browsers (or viewers), users sometimes experience an uneven, choppy display in the browser. This happens because the results to be displayed (e.g., the resulting images) are prepared on the server and sent back to the browser on the client machine. Thus, the server not only has to run the genomic tools but also has to render the result image to be displayed in the web browser. The image is rather large and, even with a fast network connection, may cause network congestion. When the user wants to move to a region outside the visible page in the browser, the whole page needs to be reloaded: the client action (clicking the mouse to move) is sent to the server, and the server processes the user input to prepare a new image, which is then sent back to the client machine. The user may also need to open a new window to view the results of each tool and to move from one window to another, which further distracts the user's attention. We developed the graphical interface of GenSAS to overcome most of these problems.
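As an illustration of the format conversion mentioned earlier in this section, the following is a minimal Python sketch that turns tab-delimited gene predictions into GFF3, a nine-column format accepted by genome viewers such as GBrowse. The input layout (sequence name, start, end, strand, gene identifier), the source label `hypothetical_predictor`, and the function names are assumptions made for this example; they do not describe the output of any particular tool or the conversion code used in GenSAS.

```python
import csv
import io

GFF3_HEADER = "##gff-version 3"


def parse_tabular(text):
    """Read tab-delimited rows, skipping blank lines and '#' comment lines."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    return [row for row in reader if row and not row[0].startswith("#")]


def tabular_to_gff3(rows, source="hypothetical_predictor"):
    """Convert (seqid, start, end, strand, gene_id) rows to GFF3 lines.

    The five-column input layout is an assumption for this sketch.
    """
    lines = [GFF3_HEADER]
    for seqid, start, end, strand, gene_id in rows:
        start, end = int(start), int(end)
        # GFF3 coordinates are 1-based with start <= end; strand carries direction.
        if start > end:
            start, end = end, start
        if strand not in {"+", "-"}:
            strand = "."  # unknown or unstranded feature
        fields = [
            seqid,             # column 1: sequence identifier
            source,            # column 2: program that produced the feature
            "gene",            # column 3: feature type
            str(start),        # column 4: start
            str(end),          # column 5: end
            ".",               # column 6: score (not available here)
            strand,            # column 7: strand
            ".",               # column 8: phase (only meaningful for CDS)
            f"ID={gene_id}",   # column 9: attributes
        ]
        lines.append("\t".join(fields))
    return lines


# Made-up example input; the second row lists its coordinates in reverse order.
example = "contig1\t120\t980\t+\tgene0001\ncontig1\t2400\t1500\t-\tgene0002\n"
gff3_lines = tabular_to_gff3(parse_tabular(example))
print("\n".join(gff3_lines))
```

A real converter would also need to escape reserved characters in the attributes column and to handle the multi-level features (mRNA, exon, CDS) that most gene predictors report, which is part of why such conversions are tedious to do by hand for every tool.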
Section II reviews the tools used in gene annotation and the conventional approaches taken in the gene annotation process. Section III presents the features of GenSAS in detail. Section IV concludes the paper with a summary and future work.

II. GENE ANNOTATION TOOLS AND GENE ANNOTATION PROCESS

Once a genome has been sequenced and the resulting pieces of DNA sequence have been assembled into a set of overlapping DNA segments, the process of genome annotation starts. The purpose of genome annotation is to understand the content of the genome by locating genes and other sequence features and by determining putative gene function. There are many genome annotation tools, and each has its own strengths, weaknesses, and species-specific features, making it unwise to draw conclusions based on only one gene prediction or homolog identification tool. Annotation can be categorized into structural and functional annotation [2, 3].

A. Functional Annotation

Functional annotation is the process of attaching biological information to the genomic elements identified during structural annotation (discussed in the next section). Such information includes, but is not limited to, name of protein,