Improving Predictive Entry of Finnish Text Messages using IRC Logs

Miikka Silfverberg, Helsinki University, Dept. of Modern Languages, Email: miikka.silfverberg@helsinki.fi
Mirka Hyvärinen, Helsinki University, Dept. of Modern Languages, Email: mirka.hyvarinen@helsinki.fi
Tommi Pirinen, Helsinki University, Dept. of Modern Languages, Email: tommi.pirinen@helsinki.fi

Abstract—We describe a predictive text entry system for Finnish that combines the open-source morphological analyzer Omorfi with a lexical model compiled from Internet Relay Chat (IRC) logs. The system is implemented as a weighted finite-state transducer (WFST) using the freely available WFST library HFST. We show that using IRC logs to train the system gives a substantial improvement in recall over a baseline system that uses word frequencies computed from the Finnish Wikipedia. We evaluate our system against the predictive text entry systems in three widely available mobile phone models and establish that we achieve comparable recall.

I. INTRODUCTION

Mobile phone text messages are a hugely popular means of communication, but mobile phones are not especially well suited for inputting text because of their small size and often limited keyboard. There are several technological solutions for inputting text on mobile phones and other limited-keyboard devices. This paper is concerned with a technology called predictive text entry, which exploits redundancy in natural language to enable efficient text entry using limited keyboards (typically having 12 keys).

There has been a lot of research into improving predictive text entry, but it has mainly been concerned with improving the statistical model or other technical aspects of text entry, such as the keyboard layout. In this paper, we investigate what role the training data used in constructing the predictive text entry system plays in recall.
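The ambiguity that predictive text entry has to resolve on a 12-key keypad can be sketched as follows. This is only a minimal illustration, not the WFST-based system described in this paper: the placement of ä, å and ö on keys 2 and 6 is an assumption about the usual Finnish layout, and the frequency counts are hypothetical values chosen for the example.

```python
from collections import defaultdict

# Letters on a 12-key keypad. Placing ä/å on key 2 and ö on key 6 is an
# assumption about the usual Finnish layout.
KEYPAD = {
    "2": "abcäå",
    "3": "def",
    "4": "ghi",
    "5": "jkl",
    "6": "mnoö",
    "7": "pqrs",
    "8": "tuv",
    "9": "wxyz",
}
LETTER_TO_KEY = {
    letter: key for key, letters in KEYPAD.items() for letter in letters
}


def to_keys(word):
    """Map a word to the key sequence that types it."""
    return "".join(LETTER_TO_KEY[ch] for ch in word.lower())


def build_index(frequencies):
    """Group words by key sequence, most frequent candidate first."""
    index = defaultdict(list)
    for word in frequencies:
        index[to_keys(word)].append(word)
    for candidates in index.values():
        candidates.sort(key=lambda w: -frequencies[w])
    return dict(index)


# Hypothetical counts, for illustration only.
TOY_FREQUENCIES = {"talo": 30, "valo": 12, "tako": 1, "hän": 50}
INDEX = build_index(TOY_FREQUENCIES)

# "talo", "valo" and "tako" all share the key sequence 8256, so the
# frequencies decide the order in which they are suggested.
print(INDEX[to_keys("talo")])
```

With these toy counts, the three words typed with the same keys are suggested in order of decreasing frequency, which is why the choice of training data directly affects which suggestion appears first.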
Ideally, the training data for any data-driven system, such as predictive text entry, should resemble the input data of the system as closely as possible. In the case of a predictive text entry system, the ideal training data would thus consist of text messages. Since text messages are difficult to come by and there are legal restrictions on using them, other sources of training data should be considered. In this paper, we use Internet Relay Chat (IRC) logs to train a predictive text input system for Finnish. In IRC, multiple users chat in public chat rooms; these conversations are often logged, and the logs are posted on the Internet. We show that using IRC logs as training data gives a significant improvement over a baseline system trained using data from the Finnish Wikipedia.

To the best of our knowledge, there have been no earlier inquiries into using IRC logs for training predictive text entry systems. IRC log material is nevertheless very well suited for the task, since a lot of material is available in different languages. Like text messages, it resembles spoken language and typically consists of short messages of a couple of hundred characters.

We evaluate our system against the predictive text entry in a number of widely available mobile phones (Nokia 9200, Nokia C7 and Samsung SGH-M310) and show that we achieve comparable recall. This demonstrates that it is possible to construct an accurate predictive text entry system without resorting to actual text message data. Even estimating the parameters of the system can be accomplished without using actual text message data.

Because Finnish is a morphologically complex language, our system uses a morphological analyzer for Finnish, Omorfi [1], rather than simply using a word list. The word forms found in Omorfi are given probabilities according to their frequency in the Finnish Wikipedia.
These probabilities are combined with similar probabilities computed from IRC logs, so that the final probability given for a word form is a combination of the probabilities given by Omorfi and the IRC log model.

Since Omorfi is implemented as a weighted finite-state transducer (WFST), we implemented our predictive text entry system in the weighted finite-state framework. We used HFST [2], a freely available open-source C++ interface, for constructing and utilizing WFSTs.

This paper is organized as follows. In section II, we present earlier work on improving the recall of predictive text entry systems. In section III, we present the predictive text entry task. In section IV, we explain how to augment a morphological analyzer with word frequencies computed from IRC logs and how such a system is used to disambiguate between suggestions corresponding to an ambiguous input sequence. After this, we present the morphological analyzer, Omorfi, and the HFST interface in section V, and present the IRC log data used for training our model and the text message data used for evaluation in section V-B. We evaluate our system in section VII and present some general and closing remarks in sections VIII and IX.

II. RELATED WORK

Improving predictive text entry is a widely researched problem. The improvements can be divided into two broad categories: (i) improving the statistical model used in predictive