Methods for the Efficient Discovery of Large Item-Indexable Sequential Patterns

Rui Henriques 1,2, Cláudia Antunes 2, and Sara C. Madeira 1,2

1 KDBio, Inesc-ID, Instituto Superior Técnico, Universidade de Lisboa
2 Dep. Computer Science and Engineering, IST, Universidade de Lisboa
{rmch,claudia.antunes,sara.madeira}@tecnico.ulisboa.pt

Abstract. An increasingly relevant set of tasks, such as the discovery of biclusters with order-preserving properties, can be mapped to a sequential pattern mining problem on data with item-indexable properties. An item-indexable database, typically observed in biomedical domains, does not allow item repetitions per sequence and is commonly dense. Although multiple methods have been proposed for the efficient discovery of sequential patterns, their performance rapidly degrades over item-indexable databases. The target tasks for these databases benefit from lengthy patterns and tolerate local mismatches. However, existing methods that relax noise constraints to increase the typically short length of sequential patterns scale poorly, aggravating the already critical efficiency problem. In this work, we first propose a new sequential pattern mining method, IndexSpan, which is able to mine sequential patterns over item-indexable databases with heightened efficiency. Second, we propose a pattern-merging procedure, MergeIndexBic, to efficiently discover lengthy noise-tolerant sequential patterns. The superior performance of IndexSpan and MergeIndexBic against competitive alternatives is demonstrated on both synthetic and real datasets.

1 Introduction

Sequential pattern mining (SPM) has been proposed to deal efficiently with the discovery of frequent precedences and co-occurrences in itemset sequences. SPM methods can be applied to solve tasks centered on extracting order-preserving regularities, such as the discovery of flexible (bi)clusters [14].
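As the abstract notes, an item-indexable sequence contains no repeated items, so every item maps to exactly one position in its sequence. The following minimal sketch illustrates only this property and is not the IndexSpan method. For simplicity it assumes that each sequence is a plain list of single items rather than a list of itemsets, and all function names (is_item_indexable, build_position_index, supports, support) are hypothetical.

```python
from typing import Dict, Hashable, List, Sequence

Item = Hashable


def is_item_indexable(database: Sequence[Sequence[Item]]) -> bool:
    """Return True if no sequence in the database repeats an item."""
    return all(len(set(seq)) == len(seq) for seq in database)


def build_position_index(seq: Sequence[Item]) -> Dict[Item, int]:
    """Map each item to its single position in an item-indexable sequence."""
    if len(set(seq)) != len(seq):
        raise ValueError("sequence repeats an item; it is not item-indexable")
    return {item: pos for pos, item in enumerate(seq)}


def supports(index: Dict[Item, int], pattern: Sequence[Item]) -> bool:
    """Check that the pattern items occur at strictly increasing positions."""
    last = -1
    for item in pattern:
        pos = index.get(item)
        if pos is None or pos <= last:
            return False
        last = pos
    return True


def support(database: Sequence[Sequence[Item]], pattern: Sequence[Item]) -> int:
    """Count the sequences in which the pattern occurs as a subsequence."""
    indexes: List[Dict[Item, int]] = [build_position_index(s) for s in database]
    return sum(supports(idx, pattern) for idx in indexes)


# Toy database (illustrative values only, not taken from the paper).
db = [
    ["a", "b", "c", "d"],
    ["b", "a", "c", "d"],
    ["a", "c", "b", "d"],
]
```

Because each item has a unique position, checking whether a sequence supports a pattern takes one dictionary lookup per pattern item instead of a scan over the sequence. In the toy database, the pattern a, c, d is supported by all three sequences, whereas a, b, d is supported by only two.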
Such tasks commonly rely on a more restricted form of sequences, item-indexable sequences, which do not allow item repetitions per sequence. Illustrative examples of item-indexable databases include sequences derived from microarrays, molecular interactions, consumer ratings, ordered shopping, and task scheduling, among many others. However, these tasks are characterized by two major challenges. First, they are inherently hard, owing to two factors: a high average number of items per transaction and high data density. Second, order-preserving solutions are optimally described by lengthy noise-tolerant sequential patterns [5].

Although existing SPM approaches can be applied over item-indexable databases, they suffer from two problems. First, they show inefficiencies due to the commonly observed density levels and high average transaction length of these datasets, which lead to a combinatorial explosion of sequential patterns under low support thresholds [14]. Additionally, the few dedicated methods able to