FiberID—A Technique to Identify Fibrous Protein Subclasses

Peter Waltman,1 Anselm Blumer,1* and David Kaplan2

1 Department of Computer Science, Tufts University, Medford, Massachusetts 02155
2 Department of Biomedical Engineering, Science and Technology Center, Tufts University, Medford, Massachusetts 02155

ABSTRACT

Fibrous proteins such as collagen, silk, and elastin play critical biological roles, yet they have been the subject of few projects that use computational techniques to predict either their class or their structure. In this article, we present FiberID, a simple yet effective method for identifying and distinguishing three fibrous protein subclasses from their primary sequences. Using a combination of amino acid composition and fast Fourier measurements, FiberID can classify fibrous proteins belonging to these subclasses with high accuracy by using two standard machine learning techniques (decision trees and naïve Bayesian classifiers). After presenting our results, we present several fibrous sequences that are regularly misclassified by FiberID as sequences of potential interest for further study. Finally, we analyze the decision trees developed by FiberID for potential insights regarding the structure of these proteins. Proteins 2007;66:127–135. © 2006 Wiley-Liss, Inc.

Key words: fibrous proteins; automatic classification; decision trees; naïve Bayes

INTRODUCTION

In recent years, significant bioinformatic research has focused on addressing the problem of identifying and predicting protein structure and function. Because the general problem of protein structure prediction from first principles has proven to be computationally intractable, most of this research has shifted towards approximation and classification algorithms that attempt to categorize new sequences by recognizing attributes in these that are similar to those of previously solved structures.
Although effective, predictive techniques that employ this strategy are limited to recognizing only those structures they have been trained to recognize, and the overwhelming majority of the publicly available techniques are based on the recognition of globular and transmembrane proteins. With the exception of the coiled-coils (a metafamily that includes both globular and fibrous proteins), for which several algorithms are available, only one method could be found that applies to all fibrous proteins. However, as discussed below, this method is of limited use. Because fibrous proteins have received little attention thus far in the protein databases, despite their significant biological role, many of the publicly available tools are not optimized for researchers working with these proteins.

To address this deficiency, we introduce a new method, FiberID, which quickly and accurately classifies three noncoiled-coil fibrous protein subclasses (elastins, collagens, and fibroins/spidroins). Although we recognize that this method will not fully answer the needs of all fibrous protein researchers, it clearly illustrates the viability of computational techniques that work with these proteins, and our hope is that it will serve to motivate further research towards developing new predictive techniques for all fibrous proteins.

Motivation

Fibrous proteins are the "materials of life," and as such play a major role in determining the functional processes in biological systems.1 Examples of the various types of fibrous proteins include collagens (the material of skin, bones, teeth, and tendons), actins (muscle matter), keratins (fur, scales, and claws), elastins (spongy materials such as the lining of the lungs), and fibroins (silks). As one would expect because of their distinct functional role in biology, fibrous proteins differ significantly in structural composition from both transmembrane and globular proteins.
For example, fibrous proteins are frequently characterized by long lengths (in some cases containing over 2000 residues) and highly repetitive structures. In addition, each of the various subclasses of fibrous proteins is typically characterized by a common secondary structure. Examples of this include the triple helix that characterizes collagens, as well as the pleated beta sheets that characterize the fibroins/spidroins. Moreover, individual subclasses can sometimes be further characterized by patterns within these secondary structures, such as the seven-residue-long repeat of coiled coils, commonly called the heptad repeat. It is this repetitive nature that allows fibrous proteins to be organized into the complex macrostructures (e.g., hair, skin, and extracellular matrix) that are associated with a given protein or subclass.

Grant sponsor: NSF-DMR and NIH.
*Correspondence to: Anselm Blumer, Department of Computer Science, Tufts University, Medford, MA 02155. E-mail: ablumer@cs.tufts.edu
Received 23 October 2005; Revised 26 April 2006; Accepted 20 June 2006
Published online 12 October 2006 in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/prot.21128
© 2006 Wiley-Liss, Inc. PROTEINS: Structure, Function, and Bioinformatics 66:127–135 (2007)
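The repetitive nature described in the Motivation section is what makes amino acid composition and Fourier measurements plausible features for these proteins. The following is a minimal sketch of that idea, not the FiberID implementation. The function names (`composition`, `power_spectrum`, `dominant_period`) and the toy sequence are hypothetical, and a plain discrete Fourier transform is used in place of a fast Fourier transform so that the example needs only the Python standard library.

```python
import cmath
from collections import Counter


def composition(seq):
    """Return the fraction of each residue type in the sequence."""
    counts = Counter(seq)
    n = len(seq)
    return {aa: c / n for aa, c in counts.items()}


def power_spectrum(seq, residue):
    """Power of the DFT of the 0/1 indicator signal for one residue.

    Entry k of the returned list is the power at frequency k/n
    cycles per residue, for k = 0 .. n // 2.
    """
    n = len(seq)
    signal = [1.0 if aa == residue else 0.0 for aa in seq]
    mean = sum(signal) / n
    centered = [x - mean for x in signal]  # remove the zero-frequency term
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, x in enumerate(centered)
        )
        spectrum.append(abs(coeff) ** 2)
    return spectrum


def dominant_period(seq, residue):
    """Period (in residues) of the strongest nonzero-frequency component."""
    spec = power_spectrum(seq, residue)
    k = max(range(1, len(spec)), key=lambda i: spec[i])
    return len(seq) / k


# Toy sequence (an assumption for illustration, not data from the paper):
# glycine at every third position.
toy = "GPP" * 10
print(composition(toy))
print(dominant_period(toy, "G"))
```

For the toy sequence, glycine makes up one third of the residues and the strongest spectral component has a period of three residues. Real sequences are far less regular, so a classifier would more likely use the spectral power at several candidate periods as features rather than a single dominant period.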