International Journal of Computational Bioscience, 2010

MATCHING OBSERVED ALPHA HELIX LENGTHS TO PREDICTED SECONDARY STRUCTURE ⋆

Brian D. Cloteaux * and Nadezhda Serova **

Abstract

Because of the complexity in determining the 3D structure of a protein, the use of partial information determined from experimental techniques can greatly reduce the overall computational expense. We investigate the problem of matching experimentally observed lengths of helices to the predicted secondary structure of a protein. We give a simple and fast algorithm for producing a library of potential solutions. We test our algorithm by performing a series of computational experiments for predicting the alpha helix placement of proteins with an already known order. These tests suggest that our method, if given a good prediction of the protein’s secondary structure, can generate high quality lists of potential placements of the helix lengths onto the protein sequence.

Key Words

Protein structure, alpha helix placement

1. Introduction

Understanding how specific proteins fold, or arrange themselves in three-dimensional (3D) space based on environmental and internal chemical constraints, is necessary to determine how these proteins function. But even when the amino acid sequence of the proteins (1D structure) is known, the prediction of their corresponding 3D structure is an extremely challenging problem. The challenge is both experimental and computational. Proteins require precise environments to fold properly.
Because of the numerous complications in measuring proteins under the correct environment, experimental methods for ascertaining the 3D arrangements are expensive, time consuming, and of limited accuracy. X-ray crystallography, for example, is a powerful technique in the determination of these structures; however, it is ineffective in proteins that are not easily crystallized, such as membrane proteins. Many other methods provide only partial information about the protein’s 3D structure.

* Applied and Computational Mathematics Division, National Institute of Standards and Technology, Gaithersburg, Maryland, USA; e-mail: brian.cloteaux@nist.gov
** Department of Computer Science, University of Maryland, Baltimore County, Baltimore, Maryland, USA; e-mail: nserova1@umbc.edu
⋆ Official contribution of the National Institute of Standards and Technology; not subject to copyright in the United States. A conference version of this paper was published at the 2009 Computational Structural Bioinformatics Workshop [1].
Recommended by Dr. L. Elnitski (10.2316/J.2010.210-1024)

At the same time, computationally determining the 3D structure of proteins is, in general, intractable. To reduce the difficulty of this problem, a recent approach has been to computationally match experimental observations to the 3D structure of the protein [2–4].

This paper extends an original investigation by He, Lu, and Pontelli [5] into the problem of matching the lengths of alpha helices observed with electron cryomicroscopy to the predicted areas of secondary structure. Electron cryomicroscopy can be used to produce a density map of some proteins. Although with current technology the resolutions of such maps are relatively low, certain secondary structures such as alpha helices can still be identified. Using electron cryomicroscopy, the lengths of the alpha helices can be observed, but the exact location of these helices on the protein sequence is not clear.
To help overcome this limitation, He, Lu, and Pontelli suggested matching these observed lengths to the predicted probabilities of the protein’s secondary structure. An example of the correspondence between protein sequences and the observed secondary structure is shown in Fig. 1. These probabilities on the placement of alpha helices onto the 1D sequence are generally based on the placement for similar sequences in other known proteins and have inherently limited accuracy. Thus, matching observed lengths to the predicted secondary structure produces a set of probable arrangements of the observed lengths that can then be used as a starting point in determining 3D structure.

This article offers two contributions to the matching of observed lengths to their placement on the 1D protein structure. The first is an examination of the complexity and necessity of computing the optimal length placement. We give evidence that computing optimal solutions may not be worth the computational expense. The second is a new approach to computing possible arrangements. He, Lu, and Pontelli