ARTICLES https://doi.org/10.1038/s42256-022-00532-1

1 Computational Biology and Bioinformatics Program, Yale University, New Haven, CT, USA. 2 Department of Applied Mathematics, Yale University, New Haven, CT, USA. 3 Department of Mathematics, Yale University, New Haven, CT, USA. 4 Department of Genetics, Yale University, New Haven, CT, USA. 5 Department of Computer Science, Yale University, New Haven, CT, USA. e-mail: smita.krishnaswamy@yale.edu

The primary challenge in sequence-based protein design is the vast space of possible sequences. A small protein of 30 residues (the mean length in eukaryotes is 472 residues; ref. 1) translates into a total search space of roughly 10^39 sequences, far beyond the reach of modern high-throughput screening technologies. This obstacle is further exacerbated by epistasis (higher-order interactions between amino acids at distant residues in the sequence), which makes it difficult to predict the effect of small changes in the sequence on its properties (ref. 2). Together, this motivates the need for approaches that can better leverage sequence–function relationships, often described using fitness landscapes (ref. 3), to more efficiently generate protein sequences with desired properties. To address this problem, we propose a data-driven deep generative approach called Regularized Latent Space Optimization (ReLSO). ReLSO leverages the greater abundance of labelled data arising from recent improvements in library generation and phenotypic screening technologies to learn a highly structured latent space of joint sequence and structure information. Further, we introduce novel regularizations to the latent space in ReLSO such that molecules can be optimized and redesigned directly in the latent space using gradient ascent on the fitness function.
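The combinatorial arithmetic behind this claim can be checked directly. The constants below are just the 20 canonical amino acids and a 30-residue sequence, as in the example above:

```python
import math

N_AMINO_ACIDS = 20   # canonical amino acids per position
SEQ_LENGTH = 30      # residues in the small protein considered above

search_space = N_AMINO_ACIDS ** SEQ_LENGTH
print(f"20^{SEQ_LENGTH} = {search_space:.3e}")  # ~1.074e+39
print(f"log10(search space) = {SEQ_LENGTH * math.log10(N_AMINO_ACIDS):.2f}")
```

Even at the upper end of modern screening throughput (10^9 variants per library), an exhaustive search would cover only about one part in 10^30 of this space.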
Although the fitness of a protein (we use this term generally to refer to some quantifiable level of functionality that an amino-acid sequence possesses: for example, binding affinity, fluorescence, catalysis and stability) is more directly a consequence of its folded, three-dimensional structure rather than strictly its amino-acid sequence, it is often preferable to connect fitness directly to sequence, since structural information may not always be available. Indeed, when generating a library of variants for therapeutic discovery or synthetic biology, either through a designed, combinatorial approach or by random mutagenesis, it is cost prohibitive to solve for the structure of each of the typically 10^3–10^9 variants produced. Here we observe that protein design is fundamentally a search problem in a complex and vast space of amino-acid sequences. For most biologically relevant proteins, sequence length can range from a few tens to several thousands of residues (ref. 1). Since each position of an N-length sequence may contain one of 20 possible amino acids, the resulting combinatorial space (20^N sequences) is often too large to search exhaustively. Notably, this problem arises even when considering only the canonical amino acids, notwithstanding the growing number of non-canonical alternatives (ref. 4). A major consequence of the scale of this search space is that most publicly available datasets, although high-throughput in scale, capture only a small fraction of possible sequence space; thus, the vast majority of possible variants are left unexplored. To navigate the sequence space, an iterative search procedure called directed evolution (ref. 5) is often applied, where batches of randomized sequences are generated and screened for a function or property of interest. The best sequences are then carried over to the next round of library generation and selection.
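The iterative procedure just described can be sketched as a greedy search loop. This is a minimal illustration, not a laboratory protocol: the `fitness` callable is a hypothetical stand-in for an experimental screen, and the round, batch and seed parameters are illustrative:

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def directed_evolution(seed, fitness, rounds=10, batch=50, rng=None):
    """Toy directed-evolution loop: each round, screen a batch of
    single-site mutants of the current best sequence and carry the
    fittest sequence into the next round (greedy hill climbing)."""
    rng = rng or random.Random(0)
    best = seed
    for _ in range(rounds):
        variants = []
        for _ in range(batch):
            s = list(best)
            pos = rng.randrange(len(s))
            s[pos] = rng.choice(AMINO_ACIDS)
            variants.append("".join(s))
        # Keep the incumbent so fitness never decreases between rounds.
        best = max(variants + [best], key=fitness)
    return best

# Toy screen: reward tryptophan content.
evolved = directed_evolution("A" * 20, fitness=lambda s: s.count("W"))
```

Because each round retains only the locally best variant, this style of search can stall at a local maximum of the fitness landscape.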
Effectively, this searches sequence space using a hill-climbing approach and, as a consequence, is susceptible to local maxima that may obscure the discovery of better sequences. Other approaches to protein design include structure-based design (refs. 6,7), where ideal structures are chosen a priori and the task is to fit a sequence to the design. Recently, several promising approaches have emerged incorporating deep learning into the design (refs. 8,9), search (refs. 10,11) and optimization (ref. 12) of proteins. However, these methods are typically used for in silico screening by training a model to predict fitness scores directly from the input amino-acid sequences. Recent approaches have also utilized reinforcement learning to optimize sequences (ref. 13). Although these approaches are valuable for reducing the experimental screening burden by proposing promising sequences, the challenge of navigating the sequence space remains unaddressed.

Transformer-based protein generation with regularized latent space optimization

Egbert Castro1, Abhinav Godavarthi2, Julian Rubinfien3, Kevin Givechian4, Dhananjay Bhaskar4 and Smita Krishnaswamy1,2,4,5

The development of powerful natural language models has improved the ability to learn meaningful representations of protein sequences. In addition, advances in high-throughput mutagenesis, directed evolution and next-generation sequencing have allowed for the accumulation of large amounts of labelled fitness data. Leveraging these two trends, we introduce Regularized Latent Space Optimization (ReLSO), a deep transformer-based autoencoder, which features a highly structured latent space that is trained to jointly generate sequences as well as predict fitness. Through regularized prediction heads, ReLSO introduces a powerful protein sequence encoder and a novel approach for efficient fitness landscape traversal.
Using ReLSO, we explicitly model the sequence–function landscape of large labelled datasets and generate new molecules by optimizing within the latent space using gradient-based methods. We evaluate this approach on several publicly available protein datasets, including variant sets of anti-ranibizumab and green fluorescent protein. We observe a greater sequence optimization efficiency (increase in fitness per optimization step) using ReLSO compared with other approaches, where ReLSO more robustly generates high-fitness sequences. Furthermore, the attention-based relationships learned by the jointly trained ReLSO models provide a potential avenue towards sequence-level fitness attribution information.

NATURE MACHINE INTELLIGENCE | www.nature.com/natmachintell
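The gradient-based latent-space optimization described above can be illustrated numerically. This is a minimal sketch, not ReLSO's implementation: the quadratic "fitness" surface, the two-dimensional latent space and the step size are all illustrative assumptions standing in for a learned, differentiable fitness predictor:

```python
import numpy as np

def gradient_ascent(z0, fitness_grad, step=0.1, n_steps=100):
    """Move a latent code uphill on a differentiable fitness surface
    by repeated plain gradient steps."""
    z = np.asarray(z0, dtype=float)
    for _ in range(n_steps):
        z = z + step * fitness_grad(z)
    return z

# Toy fitness surface with a single optimum at z* = (1, 2):
# f(z) = -||z - z*||^2, so grad f(z) = -2 (z - z*).
target = np.array([1.0, 2.0])
grad = lambda z: -2.0 * (z - target)

z_opt = gradient_ascent(np.zeros(2), grad)  # converges to ~[1.0, 2.0]
```

In ReLSO itself, the analogous gradient comes from the jointly trained fitness-prediction head, and the optimized latent point is then decoded back into an amino-acid sequence.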