Unsupervised Spectral Learning of FSTs

Raphaël Bailly    Xavier Carreras    Ariadna Quattoni
Universitat Politècnica de Catalunya
Barcelona, 08034
rbailly,carreras,aquattoni@lsi.upc.edu

Abstract

Finite-State Transducers (FSTs) are a standard tool for modeling paired input-output sequences and are used in numerous applications, ranging from computational biology to natural language processing. Recently Balle et al. [4] presented a spectral algorithm for learning FSTs from samples of aligned input-output sequences. In this paper we address the more realistic, yet challenging, setting where the alignments are unknown to the learning algorithm. We frame FST learning as finding a low-rank Hankel matrix satisfying constraints derived from observable statistics. Under this formulation, we provide identifiability results for FST distributions. Then, following previous work on rank minimization, we propose a regularized convex relaxation of this objective, based on minimizing a nuclear norm penalty subject to linear constraints, which can be solved efficiently.

1 Introduction

This paper addresses the problem of learning probability distributions over pairs of input-output sequences, also known as the transduction problem. A pair of sequences consists of an input sequence, built from an input alphabet, and an output sequence, built from an output alphabet. Finite-State Transducers (FSTs) are one of the main probabilistic tools used to model such distributions and have been used in numerous applications ranging from computational biology to natural language processing. A variety of algorithms for learning FSTs have been proposed in the literature; most are based on EM optimization [9, 11] or on grammatical inference techniques [8, 6]. In essence, an FST can be regarded as an HMM that generates bi-symbols of combined input-output symbols. The input and output symbols may be generated jointly or independently conditioned on the previous observations.
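The view of an FST as an HMM over bi-symbols can be made concrete with a small sketch. The following code scores a fixed aligned sequence of bi-symbols with the standard forward recursion; the state space, transition matrix, bi-symbol inventory, and all probability values are illustrative assumptions, not parameters from the paper:

```python
import numpy as np

# A 2-state HMM whose emissions are bi-symbols, i.e. (input, output) pairs,
# including gapped bi-symbols such as ("A", "-") and ("-", "A").
# All numerical parameters below are made up for illustration.
T = np.array([[0.7, 0.3],        # state-transition probabilities
              [0.4, 0.6]])
pi = np.array([0.5, 0.5])        # initial state distribution

bisymbols = [("G", "G"), ("A", "G"), ("A", "-"), ("-", "A")]
E = np.array([[0.4, 0.2, 0.2, 0.2],   # emission probabilities, one row per state
              [0.1, 0.3, 0.3, 0.3]])

def forward_prob(seq):
    """Forward recursion: probability that the HMM emits this exact
    bi-symbol sequence (termination probabilities are ignored)."""
    idx = {b: k for k, b in enumerate(bisymbols)}
    alpha = pi * E[:, idx[seq[0]]]
    for b in seq[1:]:
        alpha = (alpha @ T) * E[:, idx[b]]
    return float(alpha.sum())

p = forward_prob([("G", "G"), ("A", "-"), ("-", "A")])
print(p)
```

A full FST model would attach such an HMM to the enlarged bi-symbol alphabet described below, so that one run of the machine generates one particular alignment of an input-output pair.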
A particular generation pattern constitutes what we call an alignment. For example, the input sequence GAATTCAG and the output sequence GGATCGA admit, among others, the following four alignments:

GAATTCAG-    GAATTCAG-    GAATTC-AG    GAATTC-AG
GGA-TC-GA    GGAT-C-GA    GGA-TCGA-    GGAT-CGA-

To be able to handle different alignments, a special empty symbol ε is added to the input and output alphabets. With this enlarged set of bi-symbols, the model is able to generate an input symbol (resp. an output symbol) without an output symbol (resp. an input symbol). These special bi-symbols are represented by the pairs (x, ε) (resp. (ε, y)). As an example, the first alignment above corresponds to the two possible representations

(G,G)(A,G)(A,A)(T,ε)(T,T)(C,C)(A,ε)(G,G)(ε,A)  and  (G,G)(ε,G)(A,ε)(A,A)(T,ε)(T,T)(C,C)(A,ε)(G,G)(ε,A).

Under this model the probability of observing a pair of un-aligned input-output sequences is obtained by marginalizing over all possible alignments. Recently, following a trend of work on spectral learning algorithms for finite state machines [14, 2, 17, 18, 7, 16, 10, 5], Balle et al. [4] presented an algorithm for learning FSTs where the input to the algorithm is a sample of aligned input-output sequences. As with most spectral methods, the core idea of this algorithm is to exploit low-rank decompositions of a Hankel matrix representing
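The marginalization over alignments admits an edit-distance-style dynamic program. The sketch below computes it for a memoryless (single-state) toy model, in which a joint emission, a deletion, and an insertion each carry a fixed weight; the weight functions and their values are hypothetical, and a real FST would additionally carry hidden-state transitions as in the model above:

```python
from functools import lru_cache

# Hypothetical bi-symbol weights for a single-state toy transducer:
# w_joint(a, b) emits (a, b); w_del(a) emits (a, eps); w_ins(b) emits (eps, b).
def w_joint(a, b): return 0.2 if a == b else 0.05
def w_del(a): return 0.1
def w_ins(b): return 0.1

def pair_weight(x, y):
    """Total weight of (x, y): sum over all alignments (bi-symbol
    sequences) of the product of the bi-symbol weights."""
    @lru_cache(maxsize=None)
    def D(i, j):
        if i == 0 and j == 0:
            return 1.0
        total = 0.0
        if i > 0 and j > 0:
            total += D(i - 1, j - 1) * w_joint(x[i - 1], y[j - 1])
        if i > 0:
            total += D(i - 1, j) * w_del(x[i - 1])
        if j > 0:
            total += D(i, j - 1) * w_ins(y[j - 1])
        return total
    return D(len(x), len(y))

def brute_force(x, y):
    """Explicitly enumerate all bi-symbol sequences (both orders of
    adjacent gap emissions count as distinct alignments)."""
    if not x and not y:
        return 1.0
    total = 0.0
    if x and y:
        total += w_joint(x[0], y[0]) * brute_force(x[1:], y[1:])
    if x:
        total += w_del(x[0]) * brute_force(x[1:], y)
    if y:
        total += w_ins(y[0]) * brute_force(x, y[1:])
    return total

print(pair_weight("GAAT", "GGA"))  # matches the brute-force enumeration
```

Note that the recursion counts (x, ε)(ε, y) and (ε, y)(x, ε) as distinct alignments, consistent with the two bi-symbol representations of a single drawn alignment discussed above.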