IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 56, NO. 10, OCTOBER 2010 5223

Correcting Deletions Using Linear and Cyclic Codes

Khaled A. S. Abdel-Ghaffar, Member, IEEE, Hendrik C. Ferreira, Senior Member, IEEE, and Ling Cheng, Associate Member, IEEE

Abstract—Linear and cyclic codes are typically used to combat substitution errors. However, synchronization errors, associated with the deletion and insertion of symbols, can cause severe performance degradation unless the coding scheme possesses the capability to recover from such errors. It is shown that linear codes of rate greater than 1/2 cannot correct deletion or insertion errors but there are linear codes of rate 1/2 that can correct these errors. Although cyclic codes, except for repetition codes, cannot correct deletion or insertion errors, two approaches are investigated to yield codes, based on cyclic codes, that can correct these errors. In the first approach, it is shown that a binary or nonbinary cyclic code of rate at most 1/3 or 1/2, respectively, can be extended by one symbol to make it capable of correcting synchronization errors. In the second approach, a cyclic code of rate at most 1/2 is expurgated by appropriately deleting codewords such that the expurgated code is capable of correcting synchronization errors. It is shown that deleting codewords costs at most two information bits if the code is binary and one information symbol if the code is nonbinary.

Index Terms—Cyclic code, deletion, expurgated code, extended code, insertion, linear code, substitution error, synchronization error.

I. INTRODUCTION

In most communication and storage channels, substitution errors, in which a transmitted symbol is received as another symbol, are the most common type of errors. For this reason, coding techniques are widely used to combat such errors. However, channels may also suffer from synchronization errors.
These errors are associated with not receiving a transmitted symbol, which is called a deletion error, or with receiving a spurious symbol that was not transmitted, which is called an insertion error. In some applications, such as the Internet, symbols, representing packets, are transported over a communication network via a set of links and nodes connecting the source to the destination. A failure in any part of the communication route may cause a packet to be lost, resulting in a deletion error. The rate of packet loss ranges from 0.6% to 1.4% depending on the distance from server to user [13]. It has also been observed that the loss rate can be much higher, ranging between 10% and 50%, over short periods of time [15]. Deletion and insertion errors can have a devastating effect on the reliability of the communication channel even if powerful codes are used to correct substitution errors.

Manuscript received May 26, 2009; revised May 13, 2010. Date of current version September 15, 2010. This work was supported in part by the National Science Foundation (NSF) under Grant CCF-0727478 and in part by the National Research Foundation (NRF) under Grant 66422. The material in this paper was presented in part at the IEEE International Symposium on Information Theory, Nice, France, June 24–29, 2007.
K. A. S. Abdel-Ghaffar is with the Department of Electrical and Computer Engineering, University of California, Davis, CA 95616 USA (e-mail: ghaffar@ece.ucdavis.edu).
H. C. Ferreira and L. Cheng are with the Department of Electrical and Electronic Engineering Science, University of Johannesburg, Auckland Park, 2006, South Africa (e-mail: hcferreira@uj.ac.za; lcheng@uj.ac.za).
Communicated by M. Blaum, Associate Editor for Coding Theory.
Digital Object Identifier 10.1109/TIT.2010.2059790
Therefore, there is a compelling reason to consider codes that not only correct substitution errors but can also recover from deletion and insertion errors [3], [4], [8]–[10], [14], [18], [21], [23]. For an interesting and accessible survey on deletion correcting codes, the reader is referred to [20].

In this paper, we show that linear codes of rate greater than 1/2 cannot correct a single deletion or a single insertion, although there are linear codes of rate 1/2 that can correct such errors. This contradicts a construction by Sloane [20] of linear deletion correcting codes of rate greater than 1/2. Actually, we will show that the construction presented in [20] cannot lead to linear codes. Our results address a question raised by Sloane [20] regarding optimal linear single deletion correcting codes. In particular, we determine the minimum number of check symbols needed in a linear deletion correcting code. For example, using computer search, it is reported in [20] that there is no binary linear deletion correcting code. This follows immediately from our results.

Not only are all linear codes of rate greater than 1/2 incapable of correcting synchronization errors, but all cyclic codes, except for repetition codes, are also incapable of correcting them. This is unfortunate since cyclic codes are the most widely used class of codes for correcting substitution errors due to the ease of their implementation. This motivated us to study coding schemes, based on cyclic codes, that can correct deletions and insertions. In particular, we study extending and expurgating cyclic codes for this purpose. Our results pertain to low-rate cyclic codes. We show that by judiciously extending a cyclic code by one symbol, i.e., inserting one extra symbol in each codeword, we obtain a code capable of correcting synchronization errors provided that the cyclic code has rate at most 1/3 or 1/2 depending, respectively, on whether or not the code is binary.
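The claims above about linear and cyclic codes can be checked by brute force on toy examples. The following Python sketch is an illustration only and is not taken from the paper: the function names (`deletion_ball`, `corrects_single_deletion`, `span`) and the three example codes are our own choices. It uses the standard criterion that a code corrects a single deletion if and only if no two distinct codewords yield the same sequence after one symbol is deleted from each. The symbol-doubling code is merely one simple rate-1/2 linear example and is not claimed to be the construction developed in this paper.

```python
from itertools import product

def deletion_ball(word):
    """All sequences obtained from `word` by deleting exactly one symbol."""
    return {word[:i] + word[i + 1:] for i in range(len(word))}

def corrects_single_deletion(code):
    """True if the one-deletion balls of distinct codewords are pairwise disjoint."""
    owner = {}  # received sequence -> codeword that produced it first
    for c in code:
        for y in deletion_ball(c):
            if owner.setdefault(y, c) != c:
                return False  # two codewords collide after one deletion
    return True

def span(generators, q=2):
    """All linear combinations of the generator rows over F_q (q prime)."""
    n = len(generators[0])
    return {
        tuple(sum(a * g[j] for a, g in zip(coeffs, generators)) % q
              for j in range(n))
        for coeffs in product(range(q), repeat=len(generators))
    }

# Illustrative codes (assumptions of this sketch, not examples from the paper):
parity = span([(1, 0, 1), (0, 1, 1)])         # [3, 2] parity-check code, rate 2/3
cyclic = span([(1, 0, 1, 0), (0, 1, 0, 1)])   # [4, 2] cyclic code, not a repetition code
doubled = span([(1, 1, 0, 0), (0, 0, 1, 1)])  # [4, 2] code sending each bit twice, rate 1/2

for name, code in [("parity", parity), ("cyclic", cyclic), ("doubled", doubled)]:
    print(name, corrects_single_deletion(code))
```

In the doubled code every run of equal symbols has even length, so a single deletion leaves exactly one run of odd length and the missing symbol can be restored there; the other two codes contain pairs of codewords, such as 1010 and 0101, that share a sequence after one deletion.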
We also consider expurgating a cyclic code, i.e., deleting codewords from it, such that the resulting expurgated code is capable of correcting synchronization errors. For a cyclic code of rate 1/2 or less, we determine the maximum size of an expurgated code with this error correcting capability. We show that deleting codewords from a cyclic code of rate 1/2 or less to obtain an expurgated code that can correct deletions and insertions costs at most two information bits if the code is binary and one information symbol if the code is nonbinary.

In this paper, we assume that the beginning and the end of each received sequence corresponding to a transmitted codeword are known, which allows for independent decoding of the codewords. (This assumption, which is common in the literature, can be achieved by inserting periodic markers between codewords.) We also assume that each codeword may