IEEE TRANSACTIONS ON COMPUTERS, VOL. C-29, NO. 4, APRIL 1980
Correspondence
On Uniquely Decipherable Codes with Two Codewords
RONALD V. BOOK AND SAI CHOI KWAN
Abstract: It is shown that every uniquely decipherable code with just two
codewords has finite delay. In addition, if a uniquely decipherable code with
just two codewords is full, then it is trivial.
Index Terms: Finite delay, semigroups, uniquely decipherable codes.
Suppose that S is a finite set of strings. The question of whether
an arbitrary string can be expressed as the concatenation of strings
in S is easy to answer: construct a deterministic finite-state acceptor
M that recognizes exactly the strings in S*, the set of strings obtained
as finite concatenations of strings from S, and run M on the given string. A
second question is whether every string in S* has a unique factorization,
or parse, as the concatenation of strings in S; for some choices
of S this is true and for others false. This question is decidable; see, e.g.,
[2]. Suppose that S is such that every string in S* does have a unique
parse as the concatenation of strings in S. How difficult is it to obtain
this parse? Some choices of S require that any parsing algorithm have
memory bounded by a function of the length of the input string while
other choices of S are such that a parsing algorithm need have only
finite memory.
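The first two questions above can be made concrete with a short sketch. The function names `in_star` and `is_uniquely_decipherable` are illustrative and do not come from the paper. `in_star` stands in for the finite-state acceptor M by filling a table of reachable positions in one left-to-right scan. The unique-factorization check uses the Sardinas-Patterson test, a standard decision procedure that is an assumption here: the text only states that the question is decidable and cites [2].

```python
def in_star(w, S):
    """Return True if w is a concatenation of zero or more strings from S."""
    # reachable[i] is True when the prefix w[:i] belongs to S*.
    reachable = [False] * (len(w) + 1)
    reachable[0] = True  # the empty string is the empty concatenation
    for i in range(len(w)):
        if not reachable[i]:
            continue
        for s in S:
            if s and w.startswith(s, i):
                reachable[i + len(s)] = True
    return reachable[len(w)]


def is_uniquely_decipherable(S):
    """Sardinas-Patterson test: True if every string in S* has one parse."""
    code = set(S)
    if "" in code:
        return False

    def residuals(A, B):
        # All t with a + t == b for some a in A and b in B.
        return {b[len(a):] for a in A for b in B if b.startswith(a)}

    # Dangling suffixes left when one codeword is a proper prefix of another.
    dangling = {b[len(a):] for a in code for b in code
                if a != b and b.startswith(a)}
    frontier = set(dangling)
    while frontier:
        if frontier & code:
            return False  # a dangling suffix is itself a codeword: two parses
        new = residuals(code, frontier) | residuals(frontier, code)
        frontier = new - dangling  # only suffixes not seen before
        dangling |= frontier
    return True


if __name__ == "__main__":
    print(in_star("abaab", {"ab", "aba"}))           # True: aba + ab
    print(is_uniquely_decipherable({"ab", "aba"}))   # True
    print(is_uniquely_decipherable({"a", "ab", "ba"}))  # False: a+ba == ab+a
```

Both functions answer yes/no questions only; neither says how much memory is needed to recover the parse itself, which is the question the rest of the note addresses.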
Here we consider the situation where S contains just two strings,
say $S = \{x, y\}$. We show that if $xy \neq yx$, then a single-scan parsing
algorithm with finite memory can be used to parse strings in S*. The
methods used are those of the theory of variable-length codes and of
semigroups. The results are new but can be obtained as corollaries
of more sophisticated theorems. In this note simple and direct proofs
are provided.
In [2], [3] it is shown that variable-length codes are related to
information-lossless automata. Connections between the study of
variable-length codes and the study of subsemigroups of free
semigroups exist [8], [9]. In particular, if $\Sigma$ is a finite set of
symbols, then the set $\Sigma^*$ of all strings over $\Sigma$ is the free
semigroup (with identity $e$) generated by $\Sigma$, and a set
$S \subseteq \Sigma^*$ is a uniquely decipherable (UD) code if the
subsemigroup $S^*$ of $\Sigma^*$ generated by $S$ is a free subsemigroup
and has $S$ as its minimal generating set. If $S$ is a uniquely
decipherable code, then $x \in S^*$ is a message and $s \in S$ is a codeword.
A code $S \subseteq \Sigma^*$ is trivial if $S \subseteq \Sigma$, i.e., each
codeword is a string of length one.
Consider subsemigroups of $\Sigma^*$ generated by sets with just two
elements. For any $x, y \in \Sigma^*$, the following are equivalent [1], [5],
[7]:
1) $\{x, y\}^*$ is a free subsemigroup of $\Sigma^*$;
2) $xy \neq yx$;
3) there do not exist $z \in \Sigma^*$ and $p, q > 0$ such that $x = z^p$
and $y = z^q$.
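The equivalence of conditions 2) and 3) can be checked mechanically on small cases. The following is a minimal sketch, not taken from the paper: the helper names are hypothetical, and it assumes nonempty strings and uses the standard fact that two nonempty strings are powers of a common string exactly when they have the same primitive root.

```python
from itertools import product


def primitive_root(w):
    """Shortest z such that w equals z repeated n times for some n >= 1."""
    n = len(w)
    for d in range(1, n + 1):
        if n % d == 0 and w[:d] * (n // d) == w:
            return w[:d]
    raise ValueError("w must be nonempty")


def have_common_root(x, y):
    """True if x = z^p and y = z^q for some string z and p, q > 0."""
    return primitive_root(x) == primitive_root(y)


def check_equivalence(alphabet="ab", max_len=4):
    """Compare conditions 2) and 3) on all nonempty strings up to max_len."""
    words = ["".join(t) for n in range(1, max_len + 1)
             for t in product(alphabet, repeat=n)]
    return all((x + y != y + x) == (not have_common_root(x, y))
               for x in words for y in words)


if __name__ == "__main__":
    print(primitive_root("ababab"))       # ab
    print(have_common_root("ab", "aba"))  # False, and indeed ab+aba != aba+ab
    print(check_equivalence())            # True on this finite sample
```

An exhaustive check over short strings is evidence, not a proof; the proofs are in [1], [5], [7].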
This fact leads us to consider codes with just two codewords.
A UD code $S \subseteq \Sigma^*$ has finite delay if there exists an integer
$t$ such that for any message $w \in S^*$, examining the prefix of $w$ of
length at most $t$ allows one to determine the first codeword occurring in
$w$'s unique factorization as a message in $S^*$. A UD code $S$ is said to
have delay $k$ if $k$ is the smallest integer that has this property.
Manuscript received October 26, 1978; revised May 28, 1979. This research
was supported in part by the National Science Foundation under Grant
MCS77-11360.
The authors are with the Department of Mathematics, University of
California at Santa Barbara, Santa Barbara, CA 93106.
If S is a UD code, then S* is a regular set so that the question of
whether a given string is a concatenation of codewords from S can
be answered by scanning the string once from left to right with a
finite-state acceptor that recognizes the strings in S*. If one wishes to
decode a message in S*, that is, to parse a string in S* in terms of the
codewords in S, then it may not be possible to do this using only finite
memory and a single scan of the string. However, if S has finite delay,
then one can construct a finite-state machine that will accept those
strings that are in S* and give as output the unique decoding of those
strings.
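To illustrate how finite delay permits single-scan decoding, here is a minimal sketch of a decoder that looks at no more than t unread symbols before committing to a codeword. It is not the finite-state machine construction the text refers to: the function names are hypothetical, the lookahead t is supplied by the caller, codewords are assumed nonempty, and the candidate test is recomputed at each step instead of being compiled into a transition table.

```python
def in_star(w, S):
    """True if w is a concatenation of zero or more codewords from S."""
    reachable = [True] + [False] * len(w)
    for i in range(len(w)):
        if reachable[i]:
            for s in S:
                if w.startswith(s, i):
                    reachable[i + len(s)] = True
    return reachable[len(w)]


def is_message_prefix(p, S):
    """True if p is a prefix of some message in S* (codewords nonempty)."""
    if not p:
        return True
    return any(
        s.startswith(p) if len(p) <= len(s)
        else p.startswith(s) and is_message_prefix(p[len(s):], S)
        for s in S
    )


def decode_with_lookahead(w, S, t):
    """Parse w left to right, inspecting at most t unread symbols per step."""
    codewords = sorted(set(S))
    parse, i = [], 0
    while i < len(w):
        window = w[i:i + t]
        if i + t >= len(w):
            # The window holds all remaining input, so it must itself parse.
            candidates = [s for s in codewords
                          if window.startswith(s)
                          and in_star(window[len(s):], codewords)]
        else:
            # Keep s if the window is a prefix of some message starting with s.
            candidates = [s for s in codewords
                          if (s.startswith(window) if len(window) <= len(s)
                              else window.startswith(s)
                              and is_message_prefix(window[len(s):], codewords))]
        if len(candidates) != 1 or not w.startswith(candidates[0], i):
            raise ValueError(
                f"cannot determine a codeword at position {i} with lookahead {t}")
        parse.append(candidates[0])
        i += len(candidates[0])
    return parse


if __name__ == "__main__":
    print(decode_with_lookahead("abaabab", {"ab", "aba"}, 4))
    # ['aba', 'ab', 'ab']
```

For the code {ab, aba} a lookahead of 3 is not enough: the window `aba` is consistent both with the codeword aba and with ab followed by another codeword, so the sketch raises `ValueError` in that case.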
It is easy to see that the UD code
I1,1
0,lOj with three codewords
does not have finite delay: 1 1010101010... can be interpreted as 1-
1-010-1-010-1---- or 10-1-010-1-010---- [2]. Our first result is that
any UD code with two codewords has finite delay.
Theorem 1: If S is a UD code with two codewords, then S has finite
delay.
Proof: Let $S = \{x, y\} \subseteq \Sigma^*$ be a UD code. Clearly, if neither
$x$ is a prefix of $y$ nor $y$ is a prefix of $x$, then $S$ has finite delay
$\le \min(|x|, |y|)$ ($|z|$ denotes the length of the string $z$). Thus, suppose
$x$ is a prefix of $y$ and let $k \ge 1$ be the largest integer such that $x^k$
is a prefix of $y$. Let $u$ be the string such that $y = x^k u$ ($u \neq e$, the
null string, else $yx = xy$ and $\{x, y\}$ is not a UD code). By the maximality
of $k$, $x$ is not a prefix of $u$.
Let $m_1 = |x|$ and $m_2 = |y|$. Suppose $S$ does not have finite delay. Then
for every integer $n > 0$ there is a message $w \in S^*$ such that the first
codeword in $w$'s unique factorization cannot be determined until a
prefix of $w$ of length greater than $n$ is examined. In particular, there
is a string whose prefix of length $m_1 + m_2$ initially has two
factorizations, the first beginning with $y$ and the second beginning with $x$.
The first factorization of the prefix of length $m_1 + m_2$ is $yx$ since it
begins with $y$, there are only two codewords ($x$ and $y$), and $y = x^k u$.
Because the second factorization begins with $x$ and $y = x^k u$, the
prefix of length $m_1(k + 1)$ of the second factorization is $x^{k+1}$.
By the choice of $u$, $x$ is not a prefix of $u$, so $x^{k+1}$ being a
prefix of $yx = x^k u x$ implies that $u$ is a prefix of $x$. Let $x = uv$
($v \neq e$, else $xy = yx$). Now we have the following.
The first factorization of the prefix of length $m_1 + m_2$ is
$$yx = x^k u x = x^k u u v,$$
and the second factorization of the prefix of length $m_1 + m_2$ is
$$x^{k+1} u = x^k u v u.$$
Thus $x^k u u v = x^k u v u$ and by cancellation $uv = vu$. This implies
$xy = uv(uv)^k u = (uv)^k u v u = (uv)^k u u v = yx$, which contradicts the
hypothesis that $\{x, y\}$ is a UD code. $\square$
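The quantities in the proof can be computed for a concrete pair of codewords. The sketch below is an illustration under stated assumptions, not part of the paper: the function name is hypothetical, and it requires that x be a proper prefix of y with xy different from yx. It recovers k and u, forms the two competing prefixes yx and x^(k+1)u (the latter is simply xy), and returns the number of symbols after which they first differ. By the argument of the proof this number never exceeds |x| + |y|; the sketch does not claim that it equals the delay as defined above.

```python
def sufficient_lookahead(x, y):
    """Return (k, u, t) for a two-word code {x, y} with x a proper prefix of y.

    k and u are as in the proof (y = x^k u with k maximal); t is one more
    than the length of the longest common prefix of yx and x^(k+1) u.
    """
    if not x or x == y or not y.startswith(x) or x + y == y + x:
        raise ValueError("need x a nonempty proper prefix of y and xy != yx")
    k = 0
    while y.startswith(x * (k + 1)):
        k += 1
    u = y[len(x) * k:]
    first = y + x               # start of any longer message beginning with y
    second = x * (k + 1) + u    # the competing start considered in the proof
    assert second == x + y      # x^(k+1) u is just xy
    common = 0
    while first[common] == second[common]:
        common += 1             # terminates because xy != yx
    return k, u, common + 1


if __name__ == "__main__":
    print(sufficient_lookahead("ab", "aba"))    # (1, 'a', 4)
    print(sufficient_lookahead("ab", "ababa"))  # (2, 'a', 6)
```

For x = ab and y = aba the two prefixes are abaab and ababa, which differ at the fourth symbol, so four symbols of lookahead separate the two cases.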
Theorem 1 can also be obtained as a corollary to a result of Linna
[6], which says that any UD code with infinite delay is contained in
the message set of a UD code with finite delay that has strictly fewer
codewords.
Lentin and Schützenberger (Corollary 1 of [4]) have established
the following fact.
Proposition: A necessary and sufficient condition for two strings
$x$ and $y$ to be powers of the same string is that $xy$ and $yx$ contain a
common left prefix of length $|x| + |y| - \gcd(|x|, |y|)$.
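The Proposition lends itself to a brute-force sanity check on short strings. This is a minimal sketch with hypothetical helper names, assuming nonempty strings and again using equality of primitive roots as the test for "powers of the same string". It reads "a common prefix of length n" as "a common prefix of length at least n".

```python
from itertools import product
from math import gcd


def common_prefix_length(a, b):
    """Length of the longest common prefix of a and b."""
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n


def primitive_root(w):
    """Shortest z such that w equals z repeated n times for some n >= 1."""
    for d in range(1, len(w) + 1):
        if len(w) % d == 0 and w[:d] * (len(w) // d) == w:
            return w[:d]
    raise ValueError("w must be nonempty")


def satisfies_prefix_condition(x, y):
    """True if xy and yx agree on the first |x| + |y| - gcd(|x|, |y|) symbols."""
    bound = len(x) + len(y) - gcd(len(x), len(y))
    return common_prefix_length(x + y, y + x) >= bound


def check_proposition(alphabet="ab", max_len=5):
    """Compare both sides of the Proposition on all nonempty short strings."""
    words = ["".join(t) for n in range(1, max_len + 1)
             for t in product(alphabet, repeat=n)]
    return all(
        satisfies_prefix_condition(x, y)
        == (primitive_root(x) == primitive_root(y))
        for x in words for y in words
    )


if __name__ == "__main__":
    # xy = abaabaab and yx = abaababa agree on 6 symbols; the bound is 7.
    print(satisfies_prefix_condition("aba", "abaab"))  # False
    print(check_proposition())                         # True on this sample
```

The pair aba, abaab shows that the bound cannot be lowered by one: the two products agree on 3 + 5 - 1 - 1 = 6 symbols although the strings are not powers of a common string.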
0018-9340/80/0400-0324$00.75 © 1980 IEEE