Mining Comparable Bilingual Text Corpora for Cross-Language Information Integration

Tao Tao
Department of Computer Science
University of Illinois at Urbana Champaign

ChengXiang Zhai
Department of Computer Science
University of Illinois at Urbana Champaign

ABSTRACT

Integrating information in multiple natural languages is a challenging task that often requires manually created linguistic resources such as a bilingual dictionary or examples of direct translations of text. In this paper, we propose a general cross-lingual text mining method that does not rely on any of these resources, but can exploit comparable bilingual text corpora to discover mappings between words and documents in different languages. Comparable text corpora are collections of text documents in different languages that are about similar topics; such text corpora are often naturally available (e.g., news articles in different languages published in the same time period). The main idea of our method is to exploit frequency correlations of words in different languages in the comparable corpora and discover mappings between words in different languages. Such mappings can then be used to further discover mappings between documents in different languages, achieving cross-lingual information integration. Evaluation of the proposed method on a 120MB Chinese-English comparable news collection shows that the proposed method is effective for mapping words and documents in English and Chinese. Since our method only relies on naturally available comparable corpora, it is generally applicable to any language pairs as long as we have comparable corpora.

Categories and Subject Descriptors: H.3.3 [Information Search and Retrieval]: Text Mining

General Terms: Algorithms

Keywords: Cross-lingual text mining, comparable corpora, frequency correlation, document alignment

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. KDD’05, August 21–24, 2005, Chicago, Illinois, USA. Copyright 2005 ACM 1-59593-135-X/05/0008 ...$5.00.

1. INTRODUCTION

As more information becomes available online, we have also seen more and more information in different natural languages such as English, Spanish, and Chinese. The web today consists of documents in many different languages. For a user who is interested in finding information from all the documents in different languages, it would be very useful if we could integrate related information in multiple languages [1].

Currently, cross-lingual information integration is often achieved through performing cross-lingual information retrieval (CLIR) [17], which allows a user to retrieve documents in language A with a query in language B. Most CLIR techniques rely on manually created linguistic resources such as a bilingual dictionary or examples of direct translations of words and documents [16]. Such resources may not always be available, especially for minority languages; in such a case, how to perform cross-lingual information integration would be a significant challenge.

In this paper, we propose a cross-lingual text mining method that can exploit comparable bilingual text corpora to perform cross-lingual information integration without requiring any additional linguistic resources. Comparable text corpora are collections of text documents in different languages that are about the same or similar topics.
For example, news articles published in the same time period tend to report the same important international events in various topics such as politics, business, science and sports. Such data are naturally available to us, so it would be very interesting to study how to exploit them to perform cross-lingual information integration. Even when we have manually created clean bilingual resources such as a bilingual dictionary, it may still be desirable to exploit such comparable corpora for two reasons: (1) New words and phrases are constantly introduced and it would be hard to keep updating a dictionary to include them all. Approaches such as what we propose in this paper can potentially be used to acquire translation knowledge about such new words/phrases from comparable corpora and help lexicographers update dictionaries. (2) Since comparable corpora are additional resources, we may expect to achieve better performance by combining the exploitation of comparable corpora with that of a bilingual dictionary.

We frame the problem of cross-lingual information integration as one involving mapping or linking words and documents in different languages. While comparable corpora have been studied extensively in the existing literature (e.g., [6, 10, 15, 5, 2, 8, 13]), almost all existing work assumes some kind of bilingual dictionary or translation examples to start with. We study how to map words and documents from comparable bilingual corpora without requiring any additional linguistic resources such as a bilingual dictionary. Our basic idea is to exploit the fact that the frequency distributions of topically related words in different languages