Information-Theoretic Model of Evolution over Protein Communication Channel

Liuling Gong, Nidhal Bouaynaya, and Dan Schonfeld

Abstract: In this paper, we propose a communication model of evolution and investigate its information-theoretic bounds. The process of evolution is modeled as the retransmission of information over a protein communication channel, where the transmitted message is the organism's proteome encoded in the DNA. We compute the capacity and the rate distortion functions of the protein communication system for the three domains of life: Archaea, Bacteria, and Eukaryotes. The tradeoff between the transmission rate and the distortion in noisy protein communication channels is analyzed. As expected, comparison between the optimal transmission rate and the channel capacity indicates that the biological fidelity does not reach the Shannon optimal distortion. However, the relationship between the channel capacity and the rate distortion achieved for different biological domains provides tremendous insight into the dynamics of the evolutionary processes of the three domains of life. We rely on these results to provide a model of genome sequence evolution based on the two major evolutionary driving forces: mutations and unequal crossovers.

Index Terms: Protein communication system, channel capacity, rate distortion theory, nonhomogeneous Poisson process.

1 INTRODUCTION

In this work, we describe the evolutionary process of transmitting information from generation to generation using communication and information theory. The transmission of genetic material during reproduction resembles the engineering problem of transmitting information over a channel. Every organism contains DNA, its genome sequence, which encodes the information required to create proteins, the functional machinery of the organism. During cell duplication or reproduction, the genomic material is copied to create the offspring's genome.
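The correspondence between reproduction and transmission over a channel can be made concrete with a small simulation. The following is a minimal sketch under stated assumptions, not the model developed in this paper: it assumes independent point substitutions at a fixed per-base probability (the value 0.05 is arbitrary), and the names `translate` and `replicate` are hypothetical. The codon table is the standard genetic code.

```python
import random

BASES = "TCAG"
# Standard genetic code, listed in TCAG codon order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: aa
    for (a, b, c), aa in zip(
        ((a, b, c) for a in BASES for b in BASES for c in BASES), AMINO_ACIDS
    )
}


def translate(dna):
    """Decoder: read codons from the first base until a stop codon or the end."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


def replicate(dna, error_rate, rng):
    """Channel: copy the genome, substituting each base with probability error_rate.

    Assumption: errors are independent point substitutions only.
    """
    copied = []
    for base in dna:
        if rng.random() < error_rate:
            base = rng.choice([b for b in BASES if b != base])
        copied.append(base)
    return "".join(copied)


if __name__ == "__main__":
    rng = random.Random(0)          # fixed seed so runs are repeatable
    parent_genome = "ATGGCCAAGTAA"  # toy genome, decodes to "MAK"
    child_genome = replicate(parent_genome, error_rate=0.05, rng=rng)
    print("parent protein:", translate(parent_genome))
    print("child protein: ", translate(child_genome))
```

Comparing the parent and child proteins over many replications gives an empirical view of the distortion introduced by the channel; the substitution probability plays the role of the channel noise level.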
This duplication of genetic material is typically error-prone [1]. By decoding the genome into proteins, the organism comes into being. The decoding process is almost universal for all organisms and is called translation in molecular biology.

Hence, we have a biological system composed of three elements: the encoded message (DNA), a noisy medium of transmission or channel (DNA storage and replication), and a decoder (the translation process). Since the output of the decoder is the organism's proteome, and the objective of a communication system is to receive messages from a source and to transmit them through a channel to a destination (see Fig. 1), the source of the biological communication system should generate the proteome. We note two main differences between the biological information processing system and an engineered communication system. The first is that biology does not encode proteins into DNA; it only decodes genes into proteins. The second is that, unlike an engineered communication system, the biological communication system is not designed to minimize transmission errors. Otherwise, evolution would not be possible. Intuitively, there has to be a balance between preserving the cell's identity through reliable transmission of its protein set and allowing errors to occur so that evolution can proceed.

The biological communication system is shown in Fig. 2, and we will refer to it as the protein communication system [2], since the transmitted and received messages are protein sequences. It is important to reiterate that the encoding process in the protein communication system is only a mathematical model of the protein information captured by the DNA.

To clarify this abstraction, consider the following analogy with an engineering communication system for video transmission. We want to transmit a video stored in a computer to other computers. The initial computer maintains an MPEG code of the video.
Assuming that the computer at the receiver has the decoder required to decode MPEG files into videos, transmission of the video message to other computers only requires sending the corresponding MPEG code. At the receiver, the MPEG file will be decoded to display the desired video. Assume further that the first MPEG code was created by chance. Therefore, this system never encodes a video into MPEG. It only decodes MPEG to display a video. Nonetheless, the proper communication model for this video transmission system relates to the transmission of the video between the sender and receiver. Note that, in this system, the only signal transmitted is the MPEG code and not the video. We also note that, although the MPEG file is decoded by the receiver to reconstruct the video, the original video was never encoded by the sender. Yet, from an engineering communication system perspective, the information transmitted between the sender and receiver relates to the video, whereas the MPEG code is simply used to represent the video over the communication channel; i.e., "video → MPEG → MPEG → video" even though the process "video → MPEG" never takes place. Fig. 3 summarizes the analogy

IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 8, NO. 1, JANUARY/FEBRUARY 2011, p. 143

• L. Gong and D. Schonfeld are with the Department of Electrical and Computer Engineering, University of Illinois at Chicago, 851 S. Morgan Street, Chicago, IL 60607-7053. E-mail: {lgong4, dans}@uic.edu.
• N. Bouaynaya is with the Department of Systems Engineering, Donaghey College of Information Science and Systems Engineering, University of Arkansas at Little Rock, 2801 S. University Avenue, Little Rock, AR 72204. E-mail: nxbouaynaya@ualr.edu.

Manuscript received 28 Jan. 2008; revised 6 Oct. 2008; accepted 21 Dec. 2008; published online 5 Jan. 2009.
For information on obtaining reprints of this article, please send e-mail to: tcbb@computer.org, and reference IEEECS Log Number TCBB-2008-01-0019.
Digital Object Identifier no. 10.1109/TCBB.2009.1.
1545-5963/11/$26.00 © 2011 IEEE. Published by the IEEE CS, CI, and EMB Societies & the ACM.