Local phase quantization texture descriptor for protein classification

Sheryl Brahnam 1, Loris Nanni 2, Jian-Yu Shi 3, Alessandra Lumini 2

1 Computer Information Systems, Missouri State University, 901 S. National, Springfield, MO 65804, USA
sbrahnam@missouristate.edu
2 DEIS, IEIIT-CNR, Università di Bologna, Viale Risorgimento 2, 40136 Bologna, Italy
loris.nanni@unibo.it; alessandra.lumini@unibo.it
3 School of Life Science, School of Computer Science and Technology, Northwestern Polytechnical University, Xi’an, China
jianyushi@nwpu.edu.cn

Abstract

In this work we propose a method for protein classification based on a texture descriptor, called local phase quantization, that uses phase information computed from an image extracted from the 3-D tertiary structure of a given protein. To build this texture, the Euclidean distance is calculated between all pairs of atoms that belong to the protein backbone. Moreover, we study classifier fusion with a state-of-the-art method for describing proteins: Chou’s pseudo amino acid descriptor. Our experiments show that fusing the two approaches improves on the performance of Chou’s pseudo amino acid descriptor alone. We use support vector machines as our base classifier. The effectiveness of our approach is demonstrated using four benchmark datasets (protein fold recognition, DNA-binding protein recognition, biological processes and molecular functions recognition/enzyme classification).

Keywords: protein classification; texture descriptors; primary structure; local phase quantization; support vector machines.

1 Introduction

Finding effective feature extraction methods is still one of the most important ongoing issues in protein classification [4]. There are two general views on how extraction should be accomplished: the indirect and the direct methods. Indirect representation of protein spatial structure is based on the widely held assumption that structural features are closely related to sequence composition [7, 8].
Thus, this method extracts features from the sequence. Perhaps the most famous indirect representation is pseudo amino acid (PseAA) composition [10], with its many variants (see, for instance, [11-14]). In the direct approach, feature extraction is accomplished via an analysis of the protein's spatial structure. Direct representations can be grouped into three general types: one based on the spatial atom distribution [15], a second on topological structure [16], and a third on geometrical shape [17]. Generally, the indirect representation is lower in computational cost but provides a higher dimensional feature set, whereas the direct representation is higher in computational cost but provides a lower dimensional feature set. While the lower computational cost of the indirect approach is desirable, its higher dimensional representation requires the application of the most advanced techniques in pattern recognition; see, e.g., [3, 18-20].

In this paper we apply a new pattern recognition technique that combines an indirect descriptor (Chou’s pseudo amino acid descriptor) with a direct representation (namely, protein spatial structure features extracted from the distance matrix). The experimental results show that combining direct and indirect descriptors using an ensemble of classifiers outperforms previous standalone approaches.

The remainder of this paper is organized as follows. In section 2, we introduce our feature extraction methods and ensemble approach. In section 3, we report experimental results obtained on four benchmark databases. Finally, in section 4, we summarize results and draw a few conclusions.

2 Proposed approach

In [9] the authors show that Haralick features and the Radon transform produce a good texture descriptor for the distance matrix of the protein backbone. The main aim of this work is to propose a single set of texture features that works well for this problem.
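The distance matrix referred to above can be illustrated with a minimal sketch. It assumes the backbone atom coordinates have already been parsed (for example, from a structure file) into a list of (x, y, z) tuples. The function names, the toy coordinates, and the linear rescaling to 256 gray levels are illustrative assumptions and are not taken from the paper.

```python
import math


def backbone_distance_matrix(coords):
    """Return the symmetric N x N matrix of pairwise Euclidean distances.

    `coords` is a sequence of (x, y, z) backbone atom positions.
    """
    n = len(coords)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(coords[i], coords[j])  # Euclidean distance
            matrix[i][j] = d
            matrix[j][i] = d  # the matrix is symmetric
    return matrix


def to_gray_levels(matrix, levels=256):
    """Linearly rescale distances to integers in [0, levels - 1].

    This normalization is an assumption made for illustration; it simply
    turns the distance matrix into an image a texture descriptor can read.
    """
    top = max((max(row) for row in matrix), default=0.0)
    if top == 0.0:
        return [[0] * len(row) for row in matrix]
    scale = (levels - 1) / top
    return [[round(value * scale) for value in row] for row in matrix]


# Hypothetical toy "backbone" of three atoms.
toy_coords = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (3.0, 4.0, 12.0)]
distances = backbone_distance_matrix(toy_coords)
image = to_gray_levels(distances)
```

The resulting matrix has a zero diagonal and is symmetric, so a texture descriptor applied to it sees each pairwise distance twice; whether to use the full matrix or only one triangle is a design choice this sketch leaves open.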
The protein descriptor used in our experiments is Chou’s well-known pseudo amino acid descriptor [11]. The architecture of our best-performing system is presented in figure 2. A general description of each step in our approach is provided below.

Int'l Conf. Bioinformatics and Computational Biology | BIOCOMP'10 | 159