Consensus Sequence Prediction With LSTMs- Submission to PLOS Synbio Review Michael Teti 1, , Rachel St Clair 2 , Abrian Miller 2 , William Hahn 3, , Elan Barenholtz 3, , with the Center for Complex Systems and Brain Sciences 1 Florida Atlantic University, Boca Raton, FL, United States 33431 USA * CorrespondingAuthor@rstclair2012@fau.edu Abstract Bioinformatics can be immensely valuable with the aid of machine learning algorithms to process and analyze data. In this work we use TensorFlow, an open-source machine learning library, to implement a Long Short-Term Memory (LSTM) neural network on the task of categorizing proteins based on their primary amino acid sequence. The network was employed on two separate datasets — homeodomain vs. non-homeodomain, and artemisinin binding vs. non-artemisinin binding — to show its ability to generalize. The first dataset containing proteins with one of two categories (binding or non-binding to a target antimalarial drug Artemisinin) were given as input to the LSTM, which learned to recognize characteristic patterns within the sequences of the overall class, accurately identify the proper category of a novel sequence, and predict an unknown sequence’s binding ability with high accuracy. The second dataset involved proteins containing and not containing homeodomain consensus sequence. The model was again trained to distinguished novel features of the two categories (homeodomain and no homeodomain) and classify a novel sequence accurately. When the network was prompted to define the feature most activating nodes for proteins in the consensus containing class, the model predicted a sequence highly similar to the theoretically accepted homeodomain consensus. This computational model supports the use of LSTM’s in proteomics and should be considered for use in a wide array of research dataset types to analyze, sort, and predict copious amounts of information. LSTM — Proteins — Machine Learning — Tensorflow Introduction 1 As bioinformatics data grows exponentially, our need for collecting, sorting, and 2 analyzing it becomes an imperative aspect of this research, and methods that 3 incorporate machine learning, a class of techniques where artificial networks learn from 4 data, will be extremely important. Long Short Term Memory (LSTM), a type of 5 recurrent neural network first described in [4], is a relatively innovative approach being 6 used in speech recognition [1]. With a few manipulations, a novel deep learning neural 7 network model is proposed here for a new application: protein function given primary 8 amino acid sequence. The predicted class of a protein can be used in part with wet-lab 9 research to establish the protein’s function as well as narrow down important sequence 10 fragments for that function. The pragmatic use of machine learning before traditional 11 genetic engineering and synthetic biology approaches can save multitudes of time, 12 PLOS 1/8