Overview of FIRE-2015 Shared Task on Mixed Script Information Retrieval

Royal Sequiera, Monojit Choudhury
Microsoft Research Lab India
{a-rosequ,monojitc}@microsoft.com

Parth Gupta, Paolo Rosso
Technical University of Valencia, Spain
{pgupta,prosso}@dsic.upv.es

Shubham Kumar
IIT, Patna
shubh07071994@gmail.com

Somnath Banerjee, Sudip Kumar Naskar, Sivaji Bandyopadhyay
Jadavpur University, Kolkata
{sb.cse.ju,sudip.naskar}@gmail.com, sivaji_cse_ju@yahoo.com

Gokul Chittaranjan
QuaintScience, Bangalore
gokulchittaranjan@gmail.com

Amitava Das
IIIT, Sri City
amitava.das@iiits.in

Kunal Chakma
NIT, Agartala
kchax4377@gmail.com

ABSTRACT
The Transliterated Search track has been organized for the third year in FIRE-2015. The track had three subtasks. Subtask I was on language labeling of words in code-mixed text fragments; it was conducted for 8 Indian languages: Bangla, Gujarati, Hindi, Kannada, Malayalam, Marathi, Tamil, Telugu, mixed with English. Subtask II was on ad-hoc retrieval of Hindi film lyrics, movie reviews and astrology documents, where both the queries and documents were either in Hindi written in Devanagari or in Roman transliterated form. Subtask III was on transliterated question answering, where the documents as well as questions were in Bangla script or Roman transliterated Bangla. A total of 24 runs were submitted by 10 teams, of which 14 runs were for Subtask I and 10 runs for Subtask II. There was no participation in Subtask III. The overview presents a comprehensive report of the subtasks, datasets, runs submitted and performances.

1. INTRODUCTION
A large number of languages, including Arabic, Russian, and most of the South and South East Asian languages, are written using indigenous scripts. However, websites and user-generated content (such as tweets and blogs) in these languages are often written using Roman script due to various socio-cultural and technological reasons [1].
This process of phonetically representing the words of a language in a non-native script is called transliteration [2, 3]. A lack of standard keyboards, a large number of scripts, as well as familiarity with English and QWERTY keyboards have given rise to a number of transliteration schemes for generating Indian language text in Roman transliteration. Some of these are an attempt to standardise the mapping between the Indian language script and the Roman alphabet, e.g., ITRANS (http://www.aczoom.com/itrans/), but mostly the users define their own mappings that the readers can understand given their knowledge of the language. Transliteration, especially into Roman script, is used abundantly on the Web not only for documents, but also for user queries that intend to search for these documents.

A challenge that search engines face while processing transliterated queries and documents is that of extensive spelling variation. For instance, the word dhanyavad (“thank you” in Hindi and many other Indian languages) can be written in Roman script as dhanyavaad, dhanyvad, danyavad, danyavaad, dhanyavada, dhanyabad, and so on. The aim of this shared task is to systematically formalize several research problems that one must solve to tackle this unique situation prevalent in Web search for users of many languages around the world, develop related data sets and test benches, and, most importantly, build a research community around this important problem, which has received very little attention to date.

In this shared task track, we have hosted three subtasks. Subtask 1 was on language labeling of short text fragments; this is one of the first steps before one can tackle the general problem of mixed script information retrieval. Subtask 2 was on ad-hoc retrieval of Hindi film lyrics, movie reviews and astrology documents, which are some of the most searched items in India, and thus are perfect and practical examples of transliterated search.
We introduced a third subtask this year on mixed script question answering.

In the first subtask, participants had to classify words in a query as English or a transliterated form of an Indian language word. Unlike last year, though, we did not ask for the back-transliteration of the Indic words in the native scripts. This decision was made due to the observation that the most successful runs from previous years had used off-the-shelf transliteration APIs (e.g., Google Indic input tool), which defeats the purpose of a research shared task. In the second subtask, participants had to retrieve the top few documents from a multi-script corpus for queries in Devanagari or Roman transliterated Hindi. For the last two years, this task was run only for Hindi film lyrics. This year, movie reviews and astrology documents, both in transliterated Hindi and in Devanagari, were also added to the document collection. The queries, apart from being in different scripts, were also in mixed languages (e.g., dil chahta hai lyrics). In the third subtask, the participants were required to provide answers to a set of factoid questions written in Romanized Bangla.

This paper provides an overview of the Mixed Script Information Retrieval track and its datasets at the seventh Forum for Information Retrieval Conference 2015 (FIRE ’15). The track was coordinated jointly by Microsoft Research India, Technical University of Valencia, and Jadavpur University, and was supported by BMS College of Engineering, Bangalore. The track on mixed script IR con-