INTERNATIONAL JOURNAL OF TRANSLATION
VOL. 32, NO. 1-2, JAN-DEC 2020 (ISSN 0970-9819)

Development of a News Text Corpus of Indian English for Technology, Translation, and Teaching

NILADRI SEKHAR DASH
Indian Statistical Institute, Kolkata

ABSTRACT

Building corpora from the texts of digital newspapers has become a popular practice in corpus linguistics, language technology, and machine translation for various technical, cognitive, linguistic, and practical reasons. Linguistic data and extralinguistic information extracted from news texts in a customized form are useful for studying lexical patterns, sentence types, lexical correlations, discourse structures, and sense variations, and for applications such as sense annotation, sentiment analysis, information embedding, machine translation, and language teaching. Keeping these applications in mind, in this paper we describe the process we use to develop a corpus of news texts from an online Indian English newspaper. We also describe the methods we apply to refine and process the data using language processing techniques and to store it by category in a database, generating a structured corpus along with some metadata. The present corpus contains 100 million running words compiled from news media texts. The corpus is an important representative of one of the regional varieties of Indian English and a valuable resource for understanding various aspects of Indian English and its usage in diverse sociocultural contexts.

Keywords: Newspaper, extraction, processing, crawler, parts of speech, tagger, metadata