AIETI’15 • January 2015 • pp.74-76

iCorpora: Compiling, Managing and Exploring Multilingual Data

Hernani Costa* (a), Gloria Corpas Pastor (a), Miriam Seghiri (a) and Ruslan Mitkov (b)
(a) LEXYTRAD, University of Malaga, Spain
(b) RIILP, University of Wolverhampton, UK
{hercos,gcorpas,seghiri}@uma.es, r.mitkov@wlv.ac.uk

Abstract

In the last decade, there has been a growing interest in bilingual and multilingual corpora. In translation in particular, their benefits have been demonstrated by several authors (cf. Bowker and Pearson, 2002; Bowker, 2002; Zanettin et al., 2003; Corpas Pastor and Seghiri, 2009). Their objectivity, reusability, multiplicity and applicability of uses, easy handling and quick access to large volumes of data are just some of their advantages. Thus, it is not surprising that corpora have come to be considered an essential resource in several research domains such as translation, language learning, stylistics, sociolinguistics, terminology, language teaching, and automatic and assisted translation, amongst others. Nevertheless, the lack of sufficient, up-to-date parallel corpora and linguistic resources for narrow domains and poorly-resourced languages is currently one of the major obstacles to further advancement in these areas. One potential solution to the shortage of parallel translation data is the exploitation of non-parallel bilingual and multilingual text resources, also known as comparable corpora, i.e. corpora that include similar types of original texts in one or more languages using the same design criteria (cf. EAGLES, 1996; Corpas Pastor, 2001:158). Even though comparable corpora can compensate for the shortage of linguistic resources and ultimately improve automated translation quality for under-resourced languages and narrow domains, for example, data collection itself poses a significant technical challenge.
The solution proposed in the iCorpora project and presented in this article is to exploit the fact that comparable corpora are much more widely available than parallel translation data. This ongoing project aims to increase the flexibility and robustness of the compilation, management and exploration of both comparable and parallel corpora by creating a new web-based application from scratch. iCorpora intends to fulfil not only translators’ and interpreters’ needs (Costa et al., 2014a,b), but also those of other professionals and ordinary people, either by overcoming some of the usability problems found in the compilation tools currently available on the market (e.g. BootCaT (Baroni and Bernardini, 2004) and WebBootCat (Baroni et al., 2006)) or by addressing their limitations and performance issues. iCorpora will aggregate three applications: iCompileCorpora, iManageCorpora and iExploreCorpora. The first application, iCompileCorpora (Costa et al., 2014c), can be seen as a layered model comprising a manual layer, a semi-automatic web-based layer and a semi-automatic Cross-Language Information Retrieval (CLIR) layer. This design will not only increase the flexibility and robustness of the compilation process, but will also hierarchically extend the features of the manual layer to the semi-automatic web-based layer and then to the semi-automatic CLIR layer (i.e. the CLIR layer will automatically translate the queries into other languages (Talvensaari et al., 2007)). iManageCorpora will be specially designed to: manage corpora (i.e. it will allow users to edit, copy and paste sentences and documents from and to documents and corpora respectively, as well as to organise corpora into domains and sub-domains); measure the similarity between documents; and explore the representativeness of the corpora (cf. Corpas Pastor and Seghiri, 2009).
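To make the idea of measuring the similarity between documents more concrete, the following is a minimal sketch and not the method used in iManageCorpora, which this abstract does not specify. It assumes a simple bag-of-words representation and cosine similarity; the function names `tokenize` and `cosine_similarity` are hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens (a deliberately naive tokenizer)."""
    return re.findall(r"\w+", text.lower())


def cosine_similarity(doc_a: str, doc_b: str) -> float:
    """Cosine similarity between the term-frequency vectors of two documents.

    Assumption: documents are compared as bags of words; the abstract does not
    state which similarity measure the project actually uses.
    """
    freq_a, freq_b = Counter(tokenize(doc_a)), Counter(tokenize(doc_b))
    # Only terms shared by both documents contribute to the dot product.
    dot = sum(freq_a[t] * freq_b[t] for t in freq_a.keys() & freq_b.keys())
    norm_a = math.sqrt(sum(c * c for c in freq_a.values()))
    norm_b = math.sqrt(sum(c * c for c in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)
```

Under these assumptions, two documents with identical word counts score close to 1 and two documents with no words in common score 0, so the value could serve as a rough filter for texts that do not fit a given domain.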
Finally, iExploreCorpora intends to offer a set of concordance features, such as searching for words in context and automatically extracting the most frequent words and multi-word units, amongst others.

* Hernani Costa is supported by the People Programme (Marie Curie Actions) of the European Union’s Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471. The research reported in this work has also been partially carried out in the framework of the Educational Innovation Project TRADICOR (PIE 13-054, 2014-2015); the R&D project INTELITERM (ref. no. FFI2012-38881, 2012-2015); and the R&D Project for Excellence TERMITUR (ref. no. HUM2754, 2014-2017).