International Journal of Computer Applications (0975-8887), Volume 56, No. 3, October 2012

Adaptive Self-Correcting Floating Point Source Coding Methodology for a Genomic Encryption Protocol

Harry C. Shaw, Comm. Systems Branch, NASA/GSFC, Greenbelt, MD
Sayed Hussein, ECE Dept., George Washington Univ., Washington, DC
Hermann Helgert, ECE Dept., George Washington Univ., Washington, DC

ABSTRACT
We address the problem of creating an adaptive source coding algorithm for a genomic encryption protocol using a small alphabet such as the nucleotide bases represented in the genetic code. For codewords derived from an alphabet of N plaintext symbols, each with probability of occurrence p, we describe a mapping into a floating point representation of the codewords, which are translated into genomic codewords derived from a novel modification of the Shannon-Fano-Elias coding process. Errors in the reverse decoding process are processed through an adaptive, self-correcting codebook to determine the best-fit codeword decoding solution. A genetic algorithmic approach to error correction within the source coding is also summarized.

General Terms
Data Confidentiality and Network Authentication

Keywords
Source coding, genetic algorithms, probability mass functions, Shannon-Fano-Elias.

1. INTRODUCTION
Genomic encryption protocols are being widely studied for implementation in advanced information security [1], [2]. In this paper, we present a source coding system for subsequent encryption via a system that emulates the mechanisms of regulation of gene expression [3]. However, utilization of such a protocol requires an efficient source coding scheme that is optimized for the requirements of the electronic domain (bandwidth and channel efficiency, error detection and correction, signal recovery in the presence of noise and interferers, etc.). Here, we address the mapping of a plaintext source code alphabet into genomic codes using the matrix cofactors of a solution of linear equations.
The transmitted data content is a series of floating point matrix cofactors. At the receiving end, the receiver applies a decoding algorithm to recover and invert the cofactor matrix and corrects the rounding and floating point errors via an adaptive source codebook. A genetic algorithm provides an efficient method to correct errors in received codewords based upon the fitness of approximated codewords. Codeword lengths are adaptable based upon the entropy of the user-selected source. This source could be a user plaintext, the selected genome of one or more species, or other sources as required. The genomic alphabet can consist of the four most commonly found nucleotides (adenine, cytosine, guanine and thymine). It can be expanded to include epigenetic marking (methyl-cytosine) [4], mutagenic base modifications (xanthine, hypoxanthine) [5], the RNA base uracil, and so forth. The method is extensible to the proteome and other domains within the space of gene expression products.

A large variety of methods have been published to utilize DNA transcription and translation in cryptographic systems. DNA cryptography using the central dogma of biology has been proposed for mobile ad hoc networks [6]. It takes plaintext through a process of DNA→RNA→Amino Acid coding. A combination of DNA computing and Elliptic Curve Cryptography has been described [7] for a powerful form of DNA encryption. It permits encrypted traffic over communication links which may not be secure. A symmetric key block cipher approach using DNA transcription and translation has been demonstrated in [8]. Other forms of DNA encryption include:
Image compression encryption using a DNA-based alphabet and a genetic algorithm based compression scheme [9].
DNA encryption utilizing gel electrophoresis images and a molecular checksum [10].
A steganographic approach using DNA as a natural template for hidden messages [11].
DNA watermarks to identify genetically modified organisms, utilizing the DNA-Crypt algorithm, which permits a user to insert encrypted data into a genome of choice [12], [13].

2. THE METHOD
2.1 High-level description of the transmitter source coding process
Consider a memoryless source generating letters from an alphabet $A^1 = \{a_1, a_2, \ldots, a_n\}$, with letters drawn according to a probability mass function $P = \{p_1, p_2, \ldots, p_n\}$. Let the source generate a message $X$ such that $X = x_1 x_2 \ldots x_i \in A^i$, where $i$ represents the word order of the message. The message $X$ is serialized and subdivided into character blocks of size $r$, and the $r$-sized blocks are arranged into $k$-sized word blocks in a set $L$, as shown in Figure 1. The words are lexicographically coded in a format whose leading term is the Huffman decimal code for the first letter and whose subsequent terms, indexed up to $k$, are the Huffman decimal codes for the remaining letters. If the character blocks are too long, the precision and accuracy of subsequent floating point computations become a concern. Therefore, the character block size is made adaptable to the floating point capabilities of the transmitting and receiving systems.
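
The blocking step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the sample message, and in particular the rule in `max_block_size` (dividing the number of reliable decimal digits of a float by an assumed number of decimal digits per letter code) are hypothetical choices made only to show how a block size $r$ might be tied to floating point capability.

```python
import sys


def split_into_blocks(message, r):
    """Split a serialized message into character blocks of size r.

    The final block may be shorter than r if len(message) is not a multiple of r.
    """
    return [message[i:i + r] for i in range(0, len(message), r)]


def group_into_words(blocks, k):
    """Arrange character blocks into word blocks containing k blocks each."""
    return [blocks[i:i + k] for i in range(0, len(blocks), k)]


def max_block_size(digits_per_letter, float_digits=sys.float_info.dig):
    """Hypothetical rule for adapting r to the platform's float precision.

    Assumes each letter code occupies `digits_per_letter` decimal digits and
    that a block's codes must fit within `float_digits` reliable digits.
    """
    return max(1, float_digits // digits_per_letter)


message = "HELLOWORLD"   # assumed sample plaintext
r, k = 3, 2              # assumed block and word sizes
blocks = split_into_blocks(message, r)
words = group_into_words(blocks, k)
print(blocks)  # ['HEL', 'LOW', 'ORL', 'D']
print(words)   # [['HEL', 'LOW'], ['ORL', 'D']]
```

Under these assumptions, a transmitter and receiver would agree on the smaller of their two `max_block_size` values so that neither side exceeds its usable precision.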