Clustering and Categorization Applied to Cryptanalysis

Claudia Oliveira, José Antônio Xexéo, and Carlos André Carvalho

Departamento de Engenharia de Sistemas, Instituto Militar de Engenharia
Rio de Janeiro, Brazil

Abstract. This paper presents a new application for Information Retrieval techniques. We introduce the use of clustering and categorization in the attack of cryptosystems. In order to clearly present the fundamentals and understand the workings and the implications of this new technique, we developed a procedure for keylength determination in the process of cryptanalysis of polyalphabetic ciphers, the core of any attack of this type of ciphers. The basic premises are: first, a cryptogram is a normal document written in an unknown language; secondly, Information Retrieval Techniques are extremely useful in detecting string patterns in ordinary texts and might be helpful with cryptograms as well.

1 Introduction

This paper presents a new technique in cryptanalysis. It proposes a new field of investigation which links Information Retrieval (IR) to Cryptology, through the use of text categorization techniques in the attack of cryptosystems. The goal of our work in this initial stage is to argue for the feasibility of using these techniques as cryptanalysis instruments. In order to clearly present the fundamentals and understand the workings and the implications of this IR technique, as an example, we developed a procedure for keylength determination in the process of cryptanalysis of simple polyalphabetic ciphers.

Information Retrieval systems have relied heavily on the "bag-of-words" model of document representation [5], which simply encodes the frequency of each word in a text and disregards word order and any language-specific linguistic knowledge. Even the language in question is not taken into account, although statistical features highly influence the model.
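As a minimal sketch of the bag-of-words representation just described, the following Python fragment counts word frequencies and shows that two texts with the same words in a different order receive the same representation. The function name `bag_of_words`, the whitespace tokenization, the lowercasing, and the two sample strings are assumptions made for illustration; they are not taken from the paper.

```python
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Map each whitespace-delimited word to its frequency.

    Word order is discarded; only the counts survive.
    (Hypothetical helper: the tokenization rule is an assumption.)
    """
    return Counter(text.lower().split())


# Two illustrative documents with identical words in a different order.
doc_a = "the key hides the text"
doc_b = "the text hides the key"

bow_a = bag_of_words(doc_a)
bow_b = bag_of_words(doc_b)

# Both documents map to the same bag, since order is not encoded.
same_representation = bow_a == bow_b
```

Nothing in this sketch depends on the language of the text: any procedure that splits a string into tokens can be substituted for `split()`.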
Many IR tasks are solved by assuming that a given text A is conceptually more similar to text B than to text C, even if their contents are not known. This perspective is valuable in the speculation about the linguistic features that remain after a text has been encrypted.

In the context of this application, a word is a string of symbols, either in a plaintext or in a ciphertext; in ciphertexts, words can be blocks of a fixed size. As a consequence, the vocabulary is the set of distinct words generated by a cipher. The cipher key can be viewed as a linguistic property that determines a new language (ciphertexts), with its particular vocabulary and word frequency distribution. Therefore, the work explores the idea that the analysis of linguistic