A Little Semantic Web Goes a Long Way in Biology

K. Wolstencroft, A. Brass, I. Horrocks, P. Lord, U. Sattler, R. Stevens, and D. Turi

School of Computer Science, University of Manchester, UK

Abstract. We show how state-of-the-art Semantic Web technology can be used in e-Science, in particular to automate the classification of proteins in biology. We show that the resulting classification was of comparable quality to one performed by a human expert, and how investigations using the classified data even resulted in the discovery of significant information that had previously been overlooked, leading to the identification of a possible drug target.

1 Introduction

Semantic Web research has seen impressive strides in the development of languages, tools, and other infrastructure. In particular, the OWL ontology language, the Protégé ontology editor, and OWL reasoning tools such as FaCT++ and Racer are now in widespread use.

In this paper, we report on an application of Semantic Web technology in the domain of biology, where an OWL ontology and an OWL classification tool called the Instance Store were used to automate the classification of protein data. We show that the resulting classification was of comparable quality to one performed by a human expert, and how investigations using the classified data even resulted in the discovery of information that had previously been overlooked.

While this example focuses on a particular protein family and a particular set of model organisms, the technique should be applicable to other protein families, and to data from any sequenced genome. In fact, we believe that similar techniques should be applicable to a wide range of investigations in biology, and in e-Science more generally. If this proves to be the case, then Semantic Web technology is set to have a major impact on e-Science.

Background and Motivation. The volume of genomic data is increasing at a seemingly exponential rate.
In particular, high-throughput technology has enabled the generation of large quantities of DNA sequence information. However, this sequence data needs further analysis before it is useful to most biologists. This process, called annotation, augments the raw DNA sequence, and the protein sequence derived from it, with significant quantities of additional information describing the biological context. One important process during annotation is the classification of proteins into different families. This is an important step in understanding the molecular