An Empirical Study on Dimensionality Optimization in Text Mining for Linguistic Knowledge Acquisition

Yu-Seop Kim 1, Jeong-Ho Chang 2, and Byoung-Tak Zhang 2

1 Division of Information and Telecommunication Engineering, Hallym University, Kang-Won, Korea 200-702
yskim01@hallym.ac.kr
2 School of Computer Science and Engineering, Seoul National University, Seoul, Korea 151-744
{jhchang, btzhang}@bi.snu.ac.kr ⋆⋆⋆

Abstract. In this paper, we empirically search for the optimal dimensionality of two data-driven models, the Latent Semantic Analysis (LSA) model and the Probabilistic Latent Semantic Analysis (PLSA) model. These models are used to build linguistic semantic knowledge, which can in turn be used to estimate contextual semantic similarity for target word selection in English-Korean machine translation. We also employ the k-Nearest Neighbor (k-NN) learning algorithm. We diversify our experiments by analyzing the covariance between the value of k in k-NN learning and the accuracy of selection, in addition to that between the dimensionality and the accuracy. Although we could not find a regular relationship between the dimensionality and the accuracy, we could find the optimal dimensionality, the one with the soundest distribution of data in our experiments.

Keywords: Knowledge Acquisition, Text Mining, Latent Semantic Analysis, Probabilistic Latent Semantic Analysis, Target Word Selection

1 Introduction

The data-driven models in this paper are highly beneficial in natural language processing applications because building new linguistic knowledge is very expensive, whereas these models need only raw text data, called untagged corpora. LSA is construed as a practical expedient for obtaining approximate estimates of meaning similarity among words and text segments, and it has been applied to various applications. In LSA, the choice of dimensionality can also be of great importance [Landauer 98].
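As a brief illustration of where dimensionality enters LSA, the standard formulation can be sketched as follows. This is the textbook form rather than anything specific to the experiments in this paper, and the symbols X, U_k, Σ_k, V_k, e_j, and the hatted vectors are notation assumed here for illustration.

```latex
% X: m-by-n term-by-document matrix built from the untagged corpus (assumed notation)
% U_k, V_k: the first k left and right singular vectors; \Sigma_k: the k largest singular values
X \approx X_k = U_k \Sigma_k V_k^{\top}

% Representation of document j in the k-dimensional latent space
% (e_j is the j-th standard basis vector, so this is column j of \Sigma_k V_k^{\top})
\hat{d}_j = \Sigma_k V_k^{\top} e_j

% Cosine similarity between two documents in the reduced space
\mathrm{sim}(d_i, d_j) =
  \frac{\hat{d}_i^{\top} \hat{d}_j}{\lVert \hat{d}_i \rVert \, \lVert \hat{d}_j \rVert}
```

Here k is the dimensionality whose choice is at issue: a very small k collapses many distinctions between words and text segments, while a k close to the full rank reproduces the raw co-occurrence counts with little generalization.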
The PLSA model is based on a statistical model called the aspect model [Hofmann 99c]. In this paper, we ultimately tried to find a regular tendency of covariance

⋆⋆⋆ This work was supported by the Korea Ministry of Science and Technology under the BrainTech Project.
K.-Y. Whang, J. Jeon, K. Shim, J. Srivastava (Eds.): PAKDD 2003, LNAI 2637, pp. 111–116, 2003.
© Springer-Verlag Berlin Heidelberg 2003