Y. Shi et al. (Eds.): MCDM 2009, CCIS 35, pp. 266–274, 2009.
© Springer-Verlag Berlin Heidelberg 2009
A Comparison of SVD, SVR, ADE and IRR for
Latent Semantic Indexing
Wen Zhang¹, Xijin Tang², and Taketoshi Yoshida¹
¹ School of Knowledge Science, Japan Advanced Institute of Science and Technology,
1-1 Ashahidai, Tatsunokuchi, Ishikawa 923-1292, Japan
{zhangwen,yoshida}@jaist.ac.jp
² Institute of Systems Science, Academy of Mathematics and Systems Science,
Chinese Academy of Sciences, Beijing 100080, P.R. China
xjtang@amss.ac.cn
Abstract. Singular value decomposition (SVD) and its variants, namely singular
value rescaling (SVR), approximation dimension equalization (ADE) and iterative
residual rescaling (IRR), have recently been proposed for latent semantic
indexing (LSI). Although all of them rest on the same linear algebraic method for
term-document matrix computation, namely SVD, their underlying motivations
concerning LSI differ. In this paper, a series of experiments is conducted to
examine their effectiveness for LSI in practical text mining applications,
including information retrieval, text categorization and similarity measure.
The experimental results demonstrate that SVD and SVR perform better than the
other LSI methods in these applications. ADE and IRR, by contrast, do not
perform well in text mining applications using LSI, because their approximation
matrices differ too much from the original term-document matrix in Frobenius
norm.
Keywords: Latent Semantic Indexing, Singular Value Decomposition, Singular
Value Rescaling, Approximation Dimension Equalization, Iterative Residual
Rescaling.
1 Introduction
As computer networks become the backbone of science and the economy, enormous
quantities of machine-readable documents become available. The fact that about 80
percent of business is conducted on unstructured information [1] creates a great
demand for efficient and effective text mining techniques, which aim to discover
high-quality knowledge from unstructured information. Unfortunately, the usual
logic-based programming paradigm has great difficulty capturing the fuzzy and often
ambiguous relations in text documents. For this reason, text mining, which is also
known as knowledge discovery from texts, has been proposed to deal with the
uncertainty and fuzziness of language and to disclose hidden patterns (knowledge)
among documents.
Typically, information is retrieved by literally matching terms in documents with
terms of a query. However, lexical matching methods can be inaccurate when they are