news & views

Biocatalysis, that is, performing chemical transformations with biological catalysts, has made great inroads over the last few years. Owing in large part to their superior chemo-, regio- and enantioselectivity and specificity, enzymes are increasingly used in low- and high-value-added transformations ranging from hydrolysis of cellulose to the generation of chiral alcohols and amines for pharma applications (ref. 1). A significant fraction of new chemical entities under development in pharma feature chiral amine centers, but the enzymes needed to access these molecules are currently rare. To address this problem, new (R)-transaminases, rarer than their (S)-specific counterparts, were developed by applying a sequence-based algorithm to exclude sequences leading to known (S)-amine and (R)- or (S)-amino acid specificity (ref. 2).

Nowadays, superior specificity and selectivity are often achieved through protein engineering. This designing of protein sequences has evolved rapidly. Site-directed mutagenesis ushered in the first generation of protein engineering, rational design. However, as protein design rules are to this day not completely understood, rational design did not always succeed. In the second generation, combinatorial protein engineering, often termed 'directed evolution', was introduced and practiced with good success, using protocols such as DNA shuffling and/or recombination-dependent PCR (refs. 3,4). However, owing to the large protein sequence space, library sizes quickly explode in hyperexponential fashion with rising n. Given that hits (that is, protein variants significantly improved over background) are rare, large libraries are necessary in combinatorial protein engineering to improve the chances for a hit. The goal of the third generation of protein engineering, which is data driven, is to shrink the size of libraries (producing 'focused libraries') while increasing the chances for a hit (refs. 5–7).
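To give a feel for how quickly combinatorial libraries grow, the short sketch below counts variants under two simple models. It is an illustration only: the text does not define n, so the code assumes n is the number of positions varied at once, and the function names, the 20-letter amino acid alphabet and the 300-residue example protein are choices made here, not values from the article.

```python
from math import comb

AMINO_ACIDS = 20  # standard proteinogenic alphabet (assumption for this sketch)


def full_randomization_size(n_positions: int) -> int:
    """Variants when n chosen positions are each fully randomized."""
    return AMINO_ACIDS ** n_positions


def substitution_variants(protein_length: int, n_substitutions: int) -> int:
    """Variants carrying exactly n substitutions anywhere in the protein.

    Choose which positions change, then one of the 19 other residues at each.
    """
    return comb(protein_length, n_substitutions) * (AMINO_ACIDS - 1) ** n_substitutions


if __name__ == "__main__":
    # Hypothetical 300-residue protein, used only to show the trend.
    for n in (1, 2, 3, 4):
        print(n, full_randomization_size(n), substitution_variants(300, n))
```

Even at small n the second count far exceeds what can be screened experimentally, which is the motivation for focused libraries.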
The article on page 807 of this issue is a prime example of data-driven protein engineering combined with probing naturally existing diversity (ref. 2). The strategy in this work had four steps (Fig. 1): (i) to evaluate related enzyme amino acid sequences to find sequence patterns in existing (R)- and (S)-amino acid and amine transaminases that could point to unknown transaminases; (ii) to predict relevant positions and positions that would need to be varied in the target sequence; (iii) to develop an annotation algorithm for sequence motifs to exclude unwanted activities; and (iv) to identify sequences using the annotation algorithm, develop the corresponding proteins and test them for function.

From the crystal structures of (S)-specific α-amino acid transaminases (α-TAs) and branched-chain aminotransferases (BCAT) and amino acid alignment, the authors discovered that the presence of a hydrophobic residue in position 95 (phenylalanine, not tyrosine) and the absence of lysine or arginine in position 40 indicated specificity toward amines rather than amino acids. Next, residues 107–109, which contact the substrate in the active site, were found to be rather conserved in (S)-specific BCAT and (R)-specific DATA (D-amino acid aminotransferase) sequences of proven functionality. An algorithm was developed to exclude sequences showing either of these conserved motifs, with the argument that the remaining sequences should have (R)-specific amine specificity. Lastly, almost 6,000 sequences annotated as BCAT or class IV pyridoxal-5′-phosphate-dependent proteins from the National Center for Biotechnology Information database were analyzed, and 21 sequences were identified that met all the criteria specified above. Ten of those 21 (48%) were found to have significant levels of (R)-transaminase activity.
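The exclusion logic described above can be sketched as a simple sequence filter. This is a minimal illustration, not the authors' algorithm: it assumes the sequences are already aligned so that positions 40, 95 and 107–109 line up with the numbering used in the text, and the two motifs in `EXCLUDED_107_109` are hypothetical placeholders, because the commentary does not list the actual conserved residues. All function names are likewise invented for this sketch.

```python
from typing import Dict, List

# Hypothetical placeholder motifs: the text says residues 107-109 are
# conserved in (S)-specific BCAT and (R)-specific DATA sequences but does
# not list them, so these strings are invented for illustration only.
EXCLUDED_107_109 = {"BCAT": "GTA", "DATA": "RSK"}


def residue(aligned_seq: str, position: int) -> str:
    """Return the residue at a 1-based alignment position ('' if too short)."""
    return aligned_seq[position - 1:position]


def is_candidate(aligned_seq: str) -> bool:
    """Apply the three rules from the text to one pre-aligned sequence."""
    if len(aligned_seq) < 109:
        return False  # cannot inspect positions 107-109
    # Rule 1: position 95 must be phenylalanine (tyrosine is rejected).
    if residue(aligned_seq, 95) != "F":
        return False
    # Rule 2: position 40 must be neither lysine nor arginine.
    if residue(aligned_seq, 40) in ("K", "R"):
        return False
    # Rule 3: positions 107-109 must not match a known BCAT or DATA motif.
    if aligned_seq[106:109] in EXCLUDED_107_109.values():
        return False
    return True


def select_candidates(aligned: Dict[str, str]) -> List[str]:
    """Return the names of all sequences that pass every rule."""
    return [name for name, seq in aligned.items() if is_candidate(seq)]


def make_toy_sequence(p40: str, p95: str, p107_109: str, length: int = 120) -> str:
    """Build an artificial poly-alanine sequence with chosen key residues."""
    seq = ["A"] * length
    seq[39] = p40
    seq[94] = p95
    seq[106:109] = list(p107_109)
    return "".join(seq)


if __name__ == "__main__":
    toy = {
        "passes": make_toy_sequence("A", "F", "AAA"),
        "lys40": make_toy_sequence("K", "F", "AAA"),
        "tyr95": make_toy_sequence("A", "Y", "AAA"),
        "bcat_like": make_toy_sequence("A", "F", "GTA"),
    }
    print(select_candidates(toy))
```

In the work discussed, the authors' own filter reduced almost 6,000 database sequences to 21 candidates, ten of which (48%) proved active.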
This percentage is almost identical to the one for small libraries (~20 variants) aimed at thermal stabilization of proteins according to another concept using both crystal structures and sequence alignment, namely structure-guided consensus (ref. 8). The results presented here demonstrate that developing sequences already extant in nature but so far merely annotated can provide a more targeted, faster path to new activity or specificity than directed evolution. With ever more genomes being sequenced, the number of annotated but undeveloped sequences keeps rising rapidly. The two key steps are (i) picking key residues related to the desired activity or specificity and (ii) filtering out nonpertinent annotated sequences on the basis of their amino acid fingerprints with clever algorithms. This procedure requires a certain number of existing, successfully characterized examples (for (i)) and sufficiently many functionally proven, annotated sequences of analogs

PROTEIN ENGINEERING

Check nature first, then evolve

Ten significantly active new (R)-transaminases, still very rare enzymes, were found among 21 designed variants obtained from nothing more than existing transaminase structures and alignment of pertinent fingerprints of annotated sequences.

Andreas S Bommarius

Figure 1 | Evaluation of active site environment and of fingerprints in multiple sequence alignments of transaminases. Filtering of known or undesired sequences led to 21 annotated sequences, of which 10 turned out to be (R)-transaminases with enantiomeric purity up to 99.6% enantiomeric excess. [Figure labels: Structures; Sequences; Fingerprints; Enzyme with desired specificity. Illustration: Katherine Vicari]

NATURE CHEMICAL BIOLOGY | VOL 6 | NOVEMBER 2010 | www.nature.com/naturechemicalbiology 793
© 2010 Nature America, Inc. All rights reserved.