J. theor. Biol. (2002) 216, 361–365
doi:10.1006/jtbi.2001.2512, available online at http://www.idealibrary.com

Amino Acid Encoding Schemes from Protein Structure Alignments: Multi-dimensional Vectors to Describe Residue Types

Kuang Lin, Alex C. W. May and William R. Taylor

Division of Mathematical Biology, National Institute for Medical Research, The Ridgeway, Mill Hill NW7 1AA, UK

(Received on 28 August 2001, Accepted in revised form on 5 December 2001)

Bioinformatic software has used various numerical encoding schemes to describe amino acid sequences. Orthogonal encoding, employing 20 numbers to describe the amino acid type of one protein residue, is often used with artificial neural network (ANN) models. However, this can increase the model complexity, thus leading to difficulty in implementation and poor performance. Here, we use ANNs to derive encoding schemes for the amino acid types from protein three-dimensional structure alignments. Each of the 20 amino acid types is characterized with a few real numbers. Our schemes are tested on the simulation of amino acid substitution matrices. These simplified schemes outperform the orthogonal encoding on small data sets. Using one of these encoding schemes, we generate a colouring scheme for the amino acids in which comparable amino acids are in similar colours. We expect it to be useful for visual inspection and manual editing of protein multiple sequence alignments. © 2002 Elsevier Science Ltd. All rights reserved.

Introduction

The artificial neural network (ANN) is a sophisticated modelling technique capable of modelling extremely complex functions and automatically learning the structure of data (Bishop, 1995). ANNs have been widely applied to many different problems in bioinformatics (for reviews, see Baldi & Brunak, 1998; Wu & McLarty, 2000). In neural network methodology, samples are often subdivided into "training" and "testing" sets.
The training set is a set of examples used for "learning": fitting the parameters (i.e. weights) of a neural network. The testing set is a distinct set of examples used to assess the performance of a trained neural network. It is important to maintain a strict separation of these data sets, with the testing set being applied only after determination of network architecture and connection weights.

A basic assumption in neural network training (and model optimization approaches of other machine learning methods) is that the training data exhibit an underlying systematic aspect but are corrupted with random noise (Bishop, 1995). The central goal of model optimization is to produce a system able to make good predictions for cases not in the training set. This requires the model to represent the underlying mechanism correctly. Training an over-complex model may fit the noise, not just the signal, leading to "overfitting". Such a model will have low training error but a much higher testing error. Generally, its performance on new cases will be poor. The best way to avoid overfitting is to use

Author to whom correspondence should be addressed. E-mail: wtaylor@nimr.mrc.ac.uk
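The overfitting behaviour described above can be illustrated with a simple polynomial-regression analogue (a sketch for illustration only, not a method from the paper): data with an underlying linear mechanism plus random noise, fitted by a simple model and an over-complex one.

```python
# Illustrative sketch of overfitting: an over-complex model fits the
# noise in the training set, not just the underlying signal.
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    """Underlying systematic aspect (a line) corrupted with random noise."""
    x = np.linspace(0.0, 1.0, n)
    y = 2.0 * x + rng.normal(scale=0.2, size=n)
    return x, y

x_train, y_train = sample(10)  # "training" set
x_test, y_test = sample(50)    # distinct "testing" set

def fit_and_errors(degree):
    """Fit a polynomial on the training set; return (train, test) MSE."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_err, test_err

simple = fit_and_errors(1)    # matches the underlying mechanism
complex_ = fit_and_errors(9)  # over-complex: can fit the noise exactly

# The over-complex model always attains the lower training error,
# but it typically generalizes worse on the held-out testing set.
```

Since the degree-9 model contains the degree-1 model as a special case, its training error can never be higher; judging it by training error alone would therefore always favour the over-complex model, which is exactly why a strictly separate testing set is needed.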