Categorization of Malay Documents using Latent Semantic Indexing

Nordianah Ab Samat, Masrah Azrifah Azmi Murad, Rodziah Atan, Muhammad Taufik Abdullah

Faculty of Computer Science and Information Technology
Universiti Putra Malaysia, 43400 Serdang, Selangor
E-mail: nordianahSamat@gmail.com

Faculty of Computer Science and Information Technology
Universiti Putra Malaysia, 43400 Serdang, Selangor
Tel: 03-89466546, Fax: 03-89466576
E-mail: masrah@fsktm.upm.edu.my

Faculty of Computer Science and Information Technology
Universiti Putra Malaysia, 43400 Serdang, Selangor
Tel: 03-89466574, Fax: 03-89466576
E-mail: rodziah@fsktm.upm.edu.my

Faculty of Computer Science and Information Technology
Universiti Putra Malaysia, 43400 Serdang, Selangor
Tel: 03-89466529, Fax: 03-89466576
E-mail: taufik@fsktm.upm.edu.my

ABSTRACT

Document categorization is a widely researched area of information retrieval. A popular approach to categorizing documents is the Vector Space Model (VSM), which represents texts as feature vectors. Categorization based on the VSM suffers from noise caused by synonymy and polysemy. This paper therefore proposes an approach to clustering Malay documents based on semantic relations between words. The method is based on a model first formulated in the context of information retrieval, called Latent Semantic Indexing (LSI). This model uses Singular Value Decomposition (SVD) to produce a vector representation of each document, and familiar clustering techniques can then be applied in the resulting space. LSI produced good document clusters, with documents on related subjects appearing in the same cluster.

Keywords
Latent Semantic Indexing, Document Clustering, K-means, Malay Language

1.0 INTRODUCTION

Information retrieval can be defined broadly as the study of how to determine and retrieve from a corpus of stored information the portions which are relevant to particular information needs (Van Rijsbergen, 1979).
The goal of an information retrieval system is to locate relevant documents in response to a user’s query while retrieving as few irrelevant documents as possible. In order to represent the documents efficiently, those with similar topics or contents are clustered together. Categorization groups similar documents together based on their dominant features. The idea of clustering search results is not new, and has been investigated in depth in information retrieval (Osinski, Stefanowski & Weiss, 2004; Shankaran, Uma & Mani, 2003). It rests on the so-called cluster hypothesis: clustering may benefit users of an information retrieval system because results that are relevant to the user are likely to be close to each other in the document space, and therefore tend to fall into relatively few clusters.

Research on Malay natural language processing has reached the level of document retrieval (Hamzah & Sembok, 2005a), but has not yet extended to automatic categorization based on semantics. Thus, this paper proposes a framework for document clustering using latent semantic indexing (Deerwester, Dumais, Furnas, Landauer & Harshman, 1990) in the context of Malay natural language processing. Nevertheless, we believe that the method developed in this research can also be applied to other languages.

The paper is organized as follows. In the next section, related work on document categorization techniques is discussed. Section 3 describes the LSI method, which was designed to overcome the deficiencies of the classic vector space model. Section 4 describes the algorithm used to perform document clustering, and section 5 reports preliminary results and gives some examples of the clusters obtained. Finally, section 6 concludes the paper.
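
To make the deficiency of the classic vector space model concrete, the following is a minimal sketch, not the implementation used in this paper. It builds plain term-frequency vectors and compares them with cosine similarity. The three short Malay phrases are hypothetical toy examples chosen for illustration: "kereta" and "motokar" both mean "car", and "laju" and "pantas" both mean "fast". The helper names `term_vector` and `cosine` are likewise assumptions.

```python
import math
from collections import Counter


def term_vector(text):
    """Return a bag-of-words term-frequency vector (whitespace tokenization)."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    # Only terms present in both documents contribute to the dot product.
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy documents (not taken from the paper's corpus).
doc_a = term_vector("kereta laju")          # "fast car"
doc_b = term_vector("motokar pantas")       # same meaning, different words
doc_c = term_vector("kereta laju kereta")   # shares surface terms with doc_a

sim_ab = cosine(doc_a, doc_b)  # synonyms share no terms
sim_ac = cosine(doc_a, doc_c)  # overlapping terms

print(f"sim(a, b) = {sim_ab:.3f}")
print(f"sim(a, c) = {sim_ac:.3f}")
```

Because `doc_a` and `doc_b` have no surface terms in common, their similarity under this representation is zero even though they express the same idea. This is the synonymy problem that motivates projecting documents into a latent semantic space before clustering.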