Parameterization and Classification of the Protein Universe via Geometric Techniques

Ashish V. Tendulkar 1, Pramod P. Wangikar 2 *, Milind A. Sohoni 3, Vivekanand V. Samant 2 and Chetan Y. Mone 2

1 Kanwal Rekhi School of Information Technology, Indian Institute of Technology Bombay, Powai, Mumbai 400 076, India
2 Department of Chemical Engineering, Indian Institute of Technology, Bombay, Powai, Mumbai 400 076, India
3 Department of Computer Science and Engineering, Indian Institute of Technology Bombay, Powai, Mumbai 400 076, India

We present a scheme for the classification of 3487 non-redundant protein structures into 1207 non-hierarchical clusters by using recurring structural patterns of three to six amino acids as keys of classification. This results in several signature patterns, which seem to decide membership of a protein in a functional category. The patterns provide clues to the key residues involved in functional sites as well as in protein–protein interaction. The discovered patterns include a “glutamate double bridge” of superoxide dismutase, the functional interface of the serine protease and inhibitor, interface of homo/hetero dimers, and functional sites of several enzyme families. We use geometric invariants to decide superimposability of structural patterns. This allows the parameterization of patterns and discovery of recurring patterns via clustering. The geometric invariant-based approach eliminates the computationally explosive step of pair-wise comparison of structures. The results provide a vast resource for the biologists for experimental validation of the proposed functional sites, and for the design of synthetic enzymes, inhibitors and drugs.

Keywords: geometric invariants; clustering; protein structure comparison; functional site; protein–protein interface

Introduction

Unraveling of the evolutionary relationships between proteins has been of central interest to biologists for decades.
This is achieved via either primary sequence alignment 1,2 or overall structural alignment. 3,4 The conserved amino acids in the sequence alignment led to the construction of family-wise sequence signatures, which are now well documented. 5,6 The conserved amino acids are deemed to be important for function. 7

On the structural side, several methods are known for optimal pair-wise alignment of protein structures. Algorithms such as Protein Structure Alignment by Distance Matrices 8 and Vector Alignment Search Tool 9 are widely used, each optimizing a different measure of similarity. An extensive all-against-all structure comparison has led to hierarchical classification systems such as SCOP 10 and CATH. 11 These are extremely useful in understanding structural, evolutionary and functional relationships among proteins of known structure.

Much of the structure analysis hinges on the hypothesis that nearly all proteins have structural similarities and, in many cases, share a common evolutionary origin. 4,5 Furthermore, substructures comprising a small number of amino acids are known to be conserved across several proteins. Methods have been developed to search for user-defined conformations, which are typically useful in searching for known active sites 12,13 or peptide segment conformations. 14 We have described an unbiased graph-theoretic approach for detection of recurring side-chain patterns from protein families. 15 It is now well accepted that a family of proteins with a common function typically conserves a functional site made up of a small number of amino acids, which is not necessarily detectable from the sequence signatures. 15,16 A classic example is the Ser/His/Asp catalytic triad of serine proteases. 17 Thus, our first objective was to establish