Hindi Text Normalization

K. Panchapagesan, Partha Pratim Talukdar, N. Sridhar Krishna, Kalika Bali, A. G. Ramakrishnan

Hewlett-Packard Labs India, 24 Salarpuria Arena, Hosur Road, Bangalore, India
Email: {partha.talukdar, nsridhar, kalika}@hp.com

Indian Institute of Science, Bangalore, India
Email: ramkiag@ee.iisc.ernet.in

Birla Institute of Technology & Science, Pilani, Rajasthan, India
Email: panchapagesan.k@gmail.com

Abstract

All areas of language and speech technology, directly or indirectly, require handling of real (unrestricted) text. For example, Text-to-Speech systems need to work directly on real text, whereas Automatic Speech Recognition systems depend on language models that are trained on text. This paper reports our ongoing effort on Hindi Text Normalization. We describe a novel approach to text normalization in which tokenization and initial token classification are combined into one stage, followed by a second level of token sense disambiguation. Tokenization and initial token classification are performed using a lexical analyser that is derived from various token definitions in the form of regular expressions. For the second level of token sense disambiguation, the application of decision lists and decision trees is explored. Token-to-word rules are then applied; these are specific to each token type and to each format within a token type.

1 Introduction

All areas of language and speech technology, directly or indirectly, require handling of real (unrestricted) text [Sproat et al., 2001]. For example, Text-to-Speech (TTS) systems need to work directly on real text, whereas Automatic Speech Recognition (ASR) systems depend on language models that are trained on text. In real text, many non-standard representations of words appear, e.g., numbers (year, time, ordinal, cardinal, floating point), abbreviations,