MDS-based Visualization Method for Multiple Speech Corpora

Kimiko YAMAKAWA 1, Tomoko MATSUI 2 and Shuichi ITAHASHI 1,3

1 National Institute of Informatics (NII), Japan
2 The Institute of Statistical Mathematics (ISM), Japan
3 National Institute of Advanced Industrial Science & Technology (AIST), Japan

{jin,itabashi}@nii.ac.jp, tmatsui@ism.ac.jp

Abstract

The purpose of this study is to visualize the similarities between speech corpora. Speech data are indispensable for promoting speech research. A wide variety of speech corpora have recently been developed in many countries. Corpus diversification has given users many choices for corpus selection. To help users make use of these various corpora, we propose a new feature visualization method based on corpus attributes. First, we listed eight attributes of the speech corpora. Then, we selected a few items for each attribute, resulting in 58 items in all. Each item takes on a '1' or '0' depending on whether the corpus has the attribute or not. The set of corpus features is represented as a 58-dimensional vector. Then, the vectors are converted into a similarity matrix and analyzed using multidimensional scaling (MDS). We analyzed the speech corpora distributed by the Speech Resources Consortium (NII-SRC). The results showed that it is possible to visualize the similarities between multiple speech corpora using the proposed method. We also tested the effectiveness of the proposed method by analyzing six imaginary corpora having some specified attributes. This result supports the idea of searching for a specific corpus according to a user's needs.

Index Terms: Speech corpus, Corpus attribute, Visualization, MDS

1. Introduction

Speech data are indispensable for promoting speech research. The performance of computers has greatly increased, and it is now possible to process large amounts of speech data on smaller computers.
There has been a great deal of speech recognition research using probabilistic models based on large speech corpora. These probabilistic models have been widely used in technological applications such as speech and language processing. They also seem to be effective methods for various fields of linguistics. Moreover, a wide variety of speech corpora have been developed in many countries. Corpus diversification has given users many choices for corpus selection; on the other hand, users have to select a suitable corpus for the intended purpose from a huge variety of corpora. In order for users to easily utilize these various corpora, we propose a new feature visualization method based on the corpus attributes. Although there are comprehensive analyses of various corpora [1], there has not been a study that tried to visualize the similarities among speech corpora. We have already proposed a new feature visualization method based on corpus attributes using multidimensional scaling (MDS) [2]. This paper reports the results of an experiment conducted to test the effectiveness of the proposed method.

2. Visualization method

2.1. Corpus attributes

We listed the possible attributes of speech corpora based on Itahashi and Kuwabara's classifications [3, 4]. There are eight groups of attributes: input devices, input environments, number of speakers, speaking styles, data modes, speech modes, languages, and purposes. We then selected a few items for each attribute, resulting in 58 items in all. Table 1 shows the proposed corpus attributes. Each item takes on a '1' or '0' depending on whether the corpus has the attribute or not. The set of corpus features is represented as a 58-dimensional vector.

2.2. MDS-based method

The corpus attribute vectors are converted into a distance matrix $D_{m \times m}$, where $m$ is the number of corpora. The distance is defined according to the Euclidean model. Then, the barycentric coordinate matrix $Z_{m \times m}$ is found from the distance matrix $D$.
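As a minimal sketch of the encoding and distance steps described above, the following Python snippet builds binary attribute vectors and their Euclidean distance matrix. The item names and the three toy corpora are hypothetical and use only 6 items rather than the 58 items of the paper; they are not taken from Table 1.

```python
import math

# Hypothetical attribute items (the paper uses 58; these 6 are illustrative only).
ITEMS = ["close_mic", "telephone", "read", "dialog", "japanese", "english"]

# Hypothetical toy corpora, each described by the set of items it has.
CORPORA = {
    "A": {"close_mic", "read", "japanese"},
    "B": {"close_mic", "dialog", "japanese"},
    "C": {"telephone", "dialog", "english"},
}

def encode(attrs):
    """Return a 0/1 vector: 1 if the corpus has the item, 0 otherwise."""
    return [1 if item in attrs else 0 for item in ITEMS]

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def distance_matrix(vectors):
    """Symmetric m x m matrix of pairwise Euclidean distances."""
    return [[euclidean(u, v) for v in vectors] for u in vectors]

names = sorted(CORPORA)                       # fixed row/column order
vectors = [encode(CORPORA[n]) for n in names]
D = distance_matrix(vectors)

for name, row in zip(names, D):
    print(name, [round(d, 3) for d in row])
```

For 0/1 vectors, the squared Euclidean distance equals the number of items on which two corpora differ, so corpora that share most attributes end up close together.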
The coordinate matrix $Z_{m \times m}$ may be given by the following equation:

$$
Z_{ij} = \frac{1}{2}\left( \sum_{i=1}^{m} \frac{d_{ij}^{2}}{m} + \sum_{j=1}^{m} \frac{d_{ij}^{2}}{m} - \sum_{i=1}^{m}\sum_{j=1}^{m} \frac{d_{ij}^{2}}{m^{2}} - d_{ij}^{2} \right) \tag{1}
$$

The coordinate values $Z_{ij}$ obtained from Eq. 1 are placed in two-dimensional space. The goodness of fit $\phi$ of the dimensional reduction on the latent space is given by the following equation:

$$
\phi(r) = \sum_{t=1}^{r} \lambda_{t}^{2} \Big/ \sum_{t=1}^{n} \lambda_{t}^{2} \tag{2}
$$

where $r$ is the number of dimensions and $\lambda_{t}$ is the $t$-th eigenvalue of the coordinate matrix $Z_{m \times m}$.

3. Experiment

3.1. Corpus specification

We analyzed 23 speech corpora distributed by the Speech Resources Consortium (NII-SRC) [5, 6]. Table 2 shows the corpus list for this study. #1 PASL-DSR and #23 ASJ-JIPDEC are continuous speech corpora, and #3 TMW and #22 FW03 are isolated word corpora. #2 UT-ML is a multilingual speech corpus. #4 GSR-JD is a Japanese dialect corpus. #5 RWCP-SP96, #6 RWCP-SP97 and #10 PASD are spoken dialog corpora. #7 RWCP-SP99 and #18 JEIDA-JCSD are read speech corpora. #8 RWCP-SP01 is a meeting speech corpus.

Accepted after peer review of full paper. Copyright 2008 ISCA. September 22-26, Brisbane, Australia. doi: 10.21437/Interspeech.2008-462
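The double-centering step of Eq. 1 and the fit measure of Eq. 2 in Section 2.2 can be sketched as follows, assuming NumPy is available. The function names `double_center` and `classical_mds` are hypothetical. Placing the points by scaling the leading eigenvectors with the square roots of their eigenvalues is the standard classical MDS step and is an assumption here, since the text does not spell it out. The four-point input is a toy example, not data from the paper.

```python
import numpy as np

def double_center(D):
    """Eq. 1: Z_ij = 0.5 * (row mean + column mean - grand mean - d_ij^2)."""
    D2 = np.asarray(D, dtype=float) ** 2
    row_mean = D2.mean(axis=1, keepdims=True)  # (1/m) * sum over j
    col_mean = D2.mean(axis=0, keepdims=True)  # (1/m) * sum over i
    grand_mean = D2.mean()                     # (1/m^2) * double sum
    return 0.5 * (row_mean + col_mean - grand_mean - D2)

def classical_mds(D, r=2):
    """Return r-dimensional coordinates and the fit phi(r) of Eq. 2."""
    Z = double_center(D)
    eigvals, eigvecs = np.linalg.eigh(Z)       # Z is symmetric
    order = np.argsort(eigvals)[::-1]          # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Assumed placement: eigenvectors scaled by sqrt of non-negative eigenvalues.
    scale = np.sqrt(np.clip(eigvals[:r], 0.0, None))
    coords = eigvecs[:, :r] * scale
    # Eq. 2 uses squared eigenvalues in numerator and denominator.
    fit = np.sum(eigvals[:r] ** 2) / np.sum(eigvals ** 2)
    return coords, fit

# Toy example: corners of a unit square, which are exactly two-dimensional.
pts = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
D = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
coords, fit = classical_mds(D, r=2)
print(coords.shape, round(float(fit), 6))
```

Because the toy points already lie in a plane, two dimensions reproduce every pairwise distance; for real attribute vectors the fit would generally be below 1 and indicates how much structure the two-dimensional plot retains.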