Identifying subset errors in multiple sequence alignments

Aparna Roy, Bruck Taddese, Shabana Vohra, Phani K. Thimmaraju, Christopher J.R. Illingworth, Lisa M. Simpson, Keya Mukherjee, Christopher A. Reynolds* and Sree V. Chintapalli

School of Biological Sciences, University of Essex, Wivenhoe Park, Colchester CO4 3SQ, UK

Communicated by Ramaswamy H. Sarma

(Received 12 November 2012; final version received 22 January 2013)

Multiple sequence alignment (MSA) accuracy is important, but there is no widely accepted method of judging the accuracy that different alignment algorithms give. We present a simple approach to detecting two types of error, namely block shifts and the misplacement of residues within a gap. Given a MSA, subsets of very similar sequences are generated through the use of a redundancy filter, typically using a 70–90% sequence identity cut-off. Subsets thus produced are typically small and degenerate, and errors can be easily detected even by manual examination. The errors, albeit minor, are inevitably associated with gaps in the alignment, and so the procedure is particularly relevant to homology modelling of protein loop regions. The usefulness of the approach is illustrated in the context of the universal but little known [K/R]KLH motif that occurs in intracellular loop 1 of G protein coupled receptors (GPCR); other issues relevant to GPCR modelling are also discussed.
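The subset-generation step described above can be sketched as follows. This is a minimal illustration and not the authors' redundancy filter: the greedy grouping against the first member of each subset, the definition of identity (identical residues over columns where neither row has a gap), and the names `pairwise_identity` and `similar_subsets` are all assumptions made for the example.

```python
from typing import Dict, List

GAP = "-"  # assumed gap character


def pairwise_identity(a: str, b: str) -> float:
    """Fraction of identical residues over columns where neither row is gapped.

    This is one of several possible identity definitions (an assumption here).
    """
    if len(a) != len(b):
        raise ValueError("aligned rows must have equal length")
    compared = 0
    identical = 0
    for x, y in zip(a, b):
        if x == GAP or y == GAP:
            continue  # skip columns with a gap in either row
        compared += 1
        if x == y:
            identical += 1
    return identical / compared if compared else 0.0


def similar_subsets(msa: Dict[str, str], cutoff: float = 0.8) -> List[List[str]]:
    """Greedily group aligned rows whose identity to a subset's first member
    is at least `cutoff` (hypothetical stand-in for a redundancy filter)."""
    subsets: List[List[str]] = []
    for name, row in msa.items():
        for subset in subsets:
            representative = msa[subset[0]]
            if pairwise_identity(row, representative) >= cutoff:
                subset.append(name)
                break
        else:
            # no existing subset is similar enough; start a new one
            subsets.append([name])
    return subsets


if __name__ == "__main__":
    # Toy alignment, invented for illustration only.
    toy_msa = {
        "seqA": "MKKLHTAV--LLG",
        "seqB": "MKKLHTAVS-LLG",
        "seqC": "MRKLHSAV--LLG",
        "seqD": "-----WPQEEYYN",
    }
    for members in similar_subsets(toy_msa, cutoff=0.8):
        print(members)
```

Within each subset the rows are nearly identical, so a residue stranded on the wrong side of a gap (as in `seqB` relative to `seqA` in the toy data) stands out on inspection of the subset alone.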
Keywords: redundancy; errors; multiple sequence alignments; alignment accuracy; alignment errors; homology modelling

Introduction

Multiple sequence alignments (MSAs) play a central role in bioinformatics as they lay the foundation for identifying conserved residues, identifying functional sites in proteins (Dean et al., 2001; Gouldson et al., 2001; Lichtarge, Bourne, & Cohen, 1996; Madabushi et al., 2004; Mihalek, Res, & Lichtarge, 2004), predicting secondary structure (Cuff, Clamp, Siddiqui, Finlay, & Barton, 1998; Jones, 1999; McGuffin, Bryson, & Jones, 2000; Rost, 1996), and implying phylogeny (Felsenstein, 1989). The accuracy of the alignments is important, as different methods can lead to different conclusions in phylogenetic analysis and different structures in homology models (Dickson, Wahl, Fernandes, & Gloor, 2010; Loytynoja & Goldman, 2008; Martin, MacArthur, & Thornton, 1997; Venclovas, 2001, 2003; Venclovas & Margelevicius, 2005; Venclovas, Zemla, Fidelis, & Moult, 2001; Wong, Suchard, & Huelsenbeck, 2008). The issue of accurately assessing the quality of a MSA is not straightforward (Aniba, Poch, Marchler-Bauer, & Thompson, 2010b; Edgar, 2010). Here, we present a simple method for identifying a small proportion of the badly aligned residues in MSAs by comparing subsets of similar sequences.

In principle, dynamic programming (Needleman & Wunsch, 1970) permits the determination of an optimum alignment, but for more than about 10–20 sequences, CPU demands dictate the use of heuristic approaches, as in Clustal (Larkin et al., 2007; Thompson, Gibson, Plewniak, Jeanmougin, & Higgins, 1997). However, the optimum defined by the objective function does not necessarily guarantee that the MSA has optimal biological meaning, making it difficult to assess MSAs.
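For orientation, the pairwise form of the dynamic programming approach can be sketched as below. This is a minimal illustration under assumed parameters: the scores (match +1, mismatch -1, linear gap penalty -1) and the function name `needleman_wunsch` are choices made for the example, not values taken from the text, and practical programs use substitution matrices and affine gap penalties. The table here has one dimension per sequence, which is why the exact approach becomes expensive as sequences are added.

```python
from typing import Tuple


def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1,
                     gap: int = -1) -> Tuple[int, str, str]:
    """Global pairwise alignment with a linear gap penalty (assumed scoring)."""
    n, m = len(a), len(b)

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i: int, j: int) -> int:
        # substitution score for a[i-1] against b[j-1]
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align two residues
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    i, j = n, m
    out_a, out_b = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    best, row_a, row_b = needleman_wunsch("KKLH", "KLH")
    print(best)
    print(row_a)
    print(row_b)
```

Several alignments can share the optimal score; the traceback above returns only one of them, which is one way in which an alignment that is optimal under the objective function need not be the biologically meaningful one.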
In this multiple sequence context, reliability is usually sought by choosing an alignment program that performs well against a series of curated MSA benchmarks such as BAliBASE, Oxbench and Prefab (Aniba, Poch, Marchler-Bauer, & Thompson, 2010a; Aniba et al., 2010b; Edgar, 2004; Raghava, Searle, Audley, Barber, & Barton, 2003). There are systematic approaches for identifying badly aligned residues in MSAs (where there is no reference to e.g. benchmarks) (Blouin, Perry, Lavell, Susko, & Roger, 2009; Dickson et al., 2010; Lassmann & Sonnhammer, 2007), and there are several approaches for scoring alignments and for removing badly aligned sequences or correcting poorly aligned regions (Lassmann & Sonnhammer, 2005a; Muller, Creevey,

*Corresponding author. Email: reync@essex.ac.uk

Journal of Biomolecular Structure and Dynamics, 2013
http://dx.doi.org/10.1080/07391102.2013.770371
Copyright © 2013 Taylor & Francis