RETRIEVAL OF BROADCAST NEWS SPEECH IN MANDARIN CHINESE COLLECTED IN TAIWAN USING SYLLABLE-LEVEL STATISTICAL CHARACTERISTICS

Berlin Chen 1,2, Hsin-min Wang 1, and Lin-shan Lee 1,2

1 Institute of Information Science, Academia Sinica
2 Dept. of Computer Science & Information Engineering, National Taiwan University
Taipei, Taiwan, Republic of China

ABSTRACT

Spoken document retrieval has been extensively studied in recent years because of its high potential in a variety of applications. Considering the monosyllabic structure of the Chinese language, a whole class of indexing features for retrieval of spoken documents in Mandarin Chinese using syllable-level statistical characteristics has been studied, and very encouraging experimental results on retrieval of broadcast news speech collected in Taiwan were obtained. This paper reports some interesting initial results and findings obtained in this research.

1. INTRODUCTION

Network technologies and Internet activity have created a completely new information era. Intelligent and efficient information retrieval techniques that give Internet users easy access to spoken documents, such as broadcast radio and television programs, are highly desired and have been extensively studied in recent years [1-6]. At the same time, the DARPA Hub-4 contest that began in 1995 has been evaluating the use of large-vocabulary continuous speech recognition (LVCSR) to transcribe audio recordings of broadcast news, and many evaluation results have been reported so far [7]. Despite all these developments, automatic recognition and efficient retrieval of broadcast radio and television news speech remain very challenging research topics because of the wide variety of speaking styles and acoustic conditions [8-10], and because of the various problems specific to spoken document retrieval.

Several different approaches to spoken document retrieval (SDR) have been developed in recent years.
Word-based retrieval approaches have been very popular and successful, although they face the potential problems of either having to know the query words in advance or requiring a lexicon large enough to cover the growing, dynamic content of diverse broadcast news [3]. Other researchers have proposed subword-based approaches, which have their own problems yet to be solved. A syllable-based indexing approach for retrieval of Chinese speech information using speech queries, developed earlier in Taipei [6], belongs to the latter category.

In this paper, recent results from a new phase of research on the retrieval of real-world Mandarin broadcast news speech collected in the Taiwan area are reported. A whole class of indexing features based on syllable-level statistical characteristics has been studied, motivated by various considerations regarding the structure of the Chinese language. Very encouraging results have been obtained and are included.

2. CONSIDERATIONS OF USING SYLLABLE-LEVEL STATISTICAL CHARACTERISTICS

In the Chinese language, each of the large number of characters (at least 10,000 are commonly used) is pronounced as a single syllable and is a morpheme with its own meaning, so new words are easily generated every day by combining a few characters. For example, combining the characters 電 (electricity) and 腦 (brain) gives the new word 電腦 (computer), and combining the characters 股 (stock), 市 (market), 長 (long), and 紅 (red) gives the new word 股市長紅 (stock prices remain high for a long time) in business news. In many cases the meaning of such a word is related to the meanings of its component characters. Such new words also include many proper nouns, such as personal names and organization names, which are simply arbitrary combinations of a few characters, as well as many domain-specific terms like the examples mentioned above.
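
As a minimal illustration of the word-formation examples above (not the indexing method of this paper), the following Python sketch shows that a newly coined word missing from a lexicon still has a well-defined syllable sequence that can be located in a document's syllable stream. The table SYLLABLE_OF, the toy LEXICON, the helper names, the toneless pinyin spellings, the sample document, and the characters chosen to match the glosses (electricity, brain, stock, market, long, red) are all assumptions made for illustration.

```python
# Hypothetical toy pronunciation table: character -> toneless pinyin syllable.
# The entries are illustrative assumptions, not data from the paper.
SYLLABLE_OF = {
    "電": "dian", "腦": "nao",
    "股": "gu", "市": "shi", "長": "chang", "紅": "hong",
    "今": "jin", "天": "tian",
}

# Hypothetical toy lexicon that does not contain the newly coined words.
LEXICON = {"電", "腦", "股市", "今天"}


def to_syllables(text):
    """Map each character to its syllable (one syllable per character)."""
    return [SYLLABLE_OF[ch] for ch in text]


def contains_sequence(stream, pattern):
    """Return True if `pattern` occurs as a contiguous run inside `stream`."""
    n = len(pattern)
    return any(stream[i:i + n] == pattern for i in range(len(stream) - n + 1))


new_word = "股市長紅"
document = to_syllables("今天股市長紅")  # syllable stream of a toy document

print(new_word in LEXICON)                                  # False
print(to_syllables(new_word))                               # the word's syllables
print(contains_sequence(document, to_syllables(new_word)))  # True
```

The lexicon lookup fails for the new word, while the syllable-level comparison still finds it, because the match never requires the syllables to be decoded into lexicon words.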
Many such words are very often the right keys for retrieval, because they usually carry the core information or characterize the subject topic. In most cases, however, these words that are so important for retrieval are simply not included in any lexicon. The out-of-vocabulary problem is therefore believed to be especially important for Chinese information retrieval, and this is a major reason why syllable-level statistical characteristics make great sense for the problem here. In other words, the syllables represent characters with meaning, and in the retrieval process they do not have to be decoded into words, which may not exist in the lexicon.

In fact, syllable-level information makes great sense for retrieval of Chinese information because of the more general monosyllabic structure of the language. Although there are more than 10,000 commonly used Chinese characters, a nice feature of the Chinese language is that all characters are monosyllabic and the total number of phonologically allowed Mandarin syllables is only 1,345. A syllable is therefore usually shared by many homonym characters with completely different meanings. Each Chinese word is composed of one to several characters (or syllables), so combinations of these 1,345 syllables give an almost unlimited number of Chinese words. In other words, each syllable may stand for many different characters with different meanings, whereas a combination of several specific syllables very often corresponds to only a few homonym polysyllabic words, if not a unique one. As a result, comparing the input query and the documents to be retrieved on the basis of segments of several syllables may provide a very good measure of the similarity between them, as can be found in the