Linear Pattern Matching with Swaps for Short Patterns

Tomáš Flouri, Xhevi Qafmolla

Abstract— The Pattern Matching problem with swaps is a variation of the classical pattern matching problem. It consists of finding all the occurrences of a pattern P in a text T when an unrestricted number of disjoint local swaps is allowed. In this paper, we present a new, efficient method for the Swap Matching problem with short patterns. In particular, we present an algorithm constructing a non-deterministic finite automaton for a given pattern P which, when transformed to a deterministic finite automaton, serves as a pattern matcher running in time O(n), where n is the length of the input text T.

I. INTRODUCTION

Finding all the occurrences of a given pattern in a text, i.e. classical pattern matching, is one of the most basic and well-studied problems in computer science, with practical applications in many areas such as computational biology, communications, data mining and multimedia. For example, the Boyer-Moore algorithm is implemented in the emacs “s” command and in UNIX’s “grep”. UNIX’s “diff” command uses the longest common subsequence algorithm [9] presented by Chvatal et al. in 1972.

The tremendous and continuous expansion of these fields, however, created the need for a more general theoretical foundation of the pattern matching concept. Research has emerged in two directions: generalized matching and approximate matching. In generalized matching one seeks exact occurrences of the pattern in the text, but matching does not mean equality. Instead, matching is done with “don’t cares”, less-than matching, or a matching relation defined by a graph on the alphabet. In approximate matching one seeks approximate matches of the pattern. The closeness of a match is measured in terms of the number of primitive operations necessary to convert the string into an exact match.
This number is called the edit distance, or Levenshtein distance, between the string and the pattern. Primitive operations can be insertion, deletion, substitution, and transposition (also called swapping).

In our paper we focus on the problem of Pattern Matching with Swaps, also known as the Swap Matching problem. In the Swap Matching context, we say that the pattern P of length m matches the given text T of length n at location i when an unrestricted number of adjacent characters of the pattern can be swapped so that it becomes identical to a substring of T starting or ending at i, given that all swaps are disjoint, i.e. no character is involved in more than one swap. Both P and T are sequences of characters drawn from the same finite character set Σ of size σ.

This research has been partially supported by the Ministry of Education, Youth and Sports under research program MSM 6840770014, and by the Czech Science Foundation as project No. 201/09/0807. T. Flouri and X. Qafmolla are with the Faculty of Electrical Engineering, Department of Computer Science and Engineering, Czech Technical University in Prague, Czech Republic {flourt1,qafmox1}@fel.cvut.cz

To name just a few applications of this definition, we could mention mistyping in text pattern search, transmission noise adjustment in communications, or the finding of close mutations in biology. For example, in the phenomenon of gene mutation, swaps are observed in a disease called Spinal Muscular Atrophy [14]. Such cases serve as a convincing pointer to further theoretical study of swaps in computer science.

The Swap Matching problem was introduced in 1995, as one of the open problems in nonstandard string matching, by Muthukrishnan [16]. Amir et al. have since done extensive research in this area, producing many interesting results. They first provided an algorithm of O(nm^{1/3} log m log σ) time complexity for an alphabet of size two (see [2]).
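
As a concrete illustration of the swap-match definition given earlier in this section, the following minimal sketch checks the definition directly. It is a naive O(nm) scan and not the automaton-based method of this paper or any of the cited algorithms; the function names `swap_matches_at` and `swap_match_positions` are hypothetical, positions are 0-based, and a match is reported by the position where the matched substring starts.

```python
def swap_matches_at(pattern, text, i):
    """Return True if pattern swap-matches the substring of text starting at i.

    Scans left to right. At each position the pattern character either
    matches directly, or it is exchanged with its right neighbour. Each
    character takes part in at most one swap, so the swaps are disjoint.
    """
    m = len(pattern)
    if i < 0 or i + m > len(text):
        return False
    j = 0
    while j < m:
        if pattern[j] == text[i + j]:
            # Direct match, no swap needed at this position.
            j += 1
        elif (j + 1 < m
              and pattern[j] == text[i + j + 1]
              and pattern[j + 1] == text[i + j]):
            # Swap pattern[j] and pattern[j + 1]; both are now consumed.
            j += 2
        else:
            return False
    return True


def swap_match_positions(pattern, text):
    """Return all start positions where pattern swap-matches text (naive scan)."""
    return [i for i in range(len(text) - len(pattern) + 1)
            if swap_matches_at(pattern, text, i)]


print(swap_match_positions("abc", "bacacbabc"))  # [0, 3, 6]
```

Here "abc" matches "bac" and "acb" through one adjacent swap each, and "abc" exactly. It does not match "cba", since that would require exchanging non-adjacent characters.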
They also showed that alphabets of larger sizes could be reduced to size two with an O(log^2 σ) time overhead. Later, in 1998, Amir et al. also studied some restrictive cases [5] for which they could obtain an algorithm of O(n log^2 m) time complexity. In 2000, Amir et al. sought to reduce the overhead of their 1998 algorithm with the method of alphabet size reduction [3], now introducing an overhead of only O(log σ). More recently, in 2003, Amir et al. found a new solution of O(n log m log σ) time using overlap matching [4]. It is important to mention that all of the above lines of research are based on the Fast Fourier Transform (FFT).

The first efficient solution without FFT was introduced in 2008 by Iliopoulos and Rahman [13]. Their approach first modeled the problem using graph theory and then, using bit parallelism, they developed an efficient algorithm running in O((n+m) log m) time. The constraint was that the pattern size must be comparable to the word size of the target machine, thus limiting their algorithm to small patterns. More recently, in 2009, Cantone et al. continued the bit-parallelism approach, introducing an algorithm named CROSS-SAMPLING [7]. The algorithm has a worst-case time complexity of O(nm) and a space complexity of O(σ) for short patterns fitting in a few machine words. In the same year, Campanelli et al. presented an efficient way [6] of solving the Swap Matching problem with small patterns, with O(nm^2) time complexity in general. Their algorithm was named BACKWARDS-CROSS-SAMPLING and inherited many properties of the original CROSS-SAMPLING algorithm, but was based on a right-to-left scan of the