IJSRD - International Journal for Scientific Research & Development | Vol. 1, Issue 3, 2013 | ISSN (online): 2321-0613
All rights reserved by www.ijsrd.com

Clustering Algorithm for Gujarati Language

Miral Patel¹, Prem Balani²
¹Research Scholar, ²Assistant Professor
¹,²GCET, Vallabh Vidyanagar, Anand, Gujarat, India

Abstract: Natural language processing (NLP) is still an active research area, and it now attracts researchers worldwide. NLP involves analyzing a language based on its structure and then tagging each word with its grammatical category. Here we have a set of 50,000 tagged Gujarati words, and we cluster them using an algorithm that we have defined ourselves. Many clustering techniques are available, e.g. single linkage, complete linkage, and average linkage. Here the number of clusters to be formed is not known in advance, so it depends on the type of data set provided. Clustering is a preprocessing step for stemming. Stemming is the process of extracting the root from a word, e.g. cats = cat + s, where cat is a noun and s marks the plural form.

Keywords: Stemmer, Gujarati Stemmer, Stemming, POS, Gujarati cluster, clustering of Gujarati words.

I. INTRODUCTION

Natural language processing (NLP) is a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages. As such, NLP is related to the area of human–computer interaction. Many challenges in NLP involve natural language understanding, that is, enabling computers to derive meaning from human or natural language input [1].

Linguistics is the scientific study of human language. It can be broadly divided into three subfields of study: language form, language meaning, and language in context.

Stemming is the process of extracting root words from inflected (grammar-based) words, for example grAhako = grAhak + o and mAhiwinI = mAhiwi + nI.
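The root-plus-suffix split in the examples above can be illustrated with a minimal Python sketch. It assumes a simple longest-match suffix-stripping rule; the function name `split_suffix` and the suffix list are hypothetical and chosen only for illustration (only the suffixes "o" and "nI" come from the examples in the text). It is not the stemmer described in this paper.

```python
# Hypothetical suffix list: only "o" and "nI" are taken from the examples
# in the text; a real stemmer would need a much larger, validated list.
SUFFIXES = ["nI", "o"]


def split_suffix(word, suffixes=SUFFIXES):
    """Return (root, suffix); suffix is '' when no listed suffix matches."""
    # Try longer suffixes first so that the longest match wins.
    for suffix in sorted(suffixes, key=len, reverse=True):
        # Require a non-empty root to remain after stripping.
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[:-len(suffix)], suffix
    return word, ""


for w in ["grAhako", "mAhiwinI"]:
    root, suffix = split_suffix(w)
    print(w, "=", root, "+", suffix)
```

The comparison is case-sensitive on purpose, because the transliteration used in the examples distinguishes upper-case and lower-case letters.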
Both examples, grAhako and mAhiwinI, are nouns. Grammar has many other categories, such as adverb, adjective, pronoun, and conjunction. The preprocessing step for clustering is part-of-speech (POS) tagging. POS tagging labels each word of a sentence with its grammatical identifier, for example mAhiwinI ‘NN, where ‘NN denotes the noun category. In the same way, the entire input test corpus is tagged with labels.

Two approaches are available: supervised and unsupervised. Many clustering techniques are available, such as single linkage, complete linkage, and average linkage.

In single linkage, the similarity between two groups is defined as the maximum similarity between any member of one group and any member of the other. Groups only need to be similar in a single pair of members in order to be merged [3].

In complete linkage, the similarity of two clusters is calculated as the minimum similarity between any member of one cluster and any member of the other. As in single linkage, the probability of an element merging with a cluster is determined by a single member of the cluster. However, in this case the least similar member is considered instead of the most similar [3].

In average linkage, the similarity between two groups of points is defined as the mean similarity between points in one cluster and those of the other. In contrast to single linkage, each element needs to be relatively similar to all members of the other cluster, rather than to just one. Average linkage clusters tend to be relatively round or ellipsoid [3].

For the clustering process we used a supervised approach, since we had a fixed number of categories for clustering. The categories we worked with are listed in Table 1.

Category   50,529
NN         Noun
JJ         Adjective
PRP        Preposition
PSP        Postposition
CC         Conjunction
VM         Verb Main
VAUX       Verb Auxiliary
NNC        Special symbol

Table 1: List of Tags

II. PROPOSED ALGORITHMS FOR CLUSTERING IN GUJARATI
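As background to this section, the three linkage criteria summarized in the Introduction (single, complete, and average linkage) can be written as one small function. The following Python sketch is illustrative only and is not the algorithm proposed in this paper. The names `linkage_similarity` and `prefix_similarity` are hypothetical, and the shared-prefix similarity is an assumed toy measure used only to make the example runnable.

```python
def prefix_similarity(a, b):
    """Toy similarity (an assumption): shared-prefix length divided by the
    longer word's length. Both words are assumed to be non-empty."""
    shared = 0
    for x, y in zip(a, b):
        if x != y:
            break
        shared += 1
    return shared / max(len(a), len(b))


def linkage_similarity(cluster_a, cluster_b, sim, method="single"):
    """Similarity between two clusters under the chosen linkage criterion."""
    # Similarity of every cross-cluster pair of members.
    scores = [sim(a, b) for a in cluster_a for b in cluster_b]
    if method == "single":      # most similar pair decides
        return max(scores)
    if method == "complete":    # least similar pair decides
        return min(scores)
    if method == "average":     # mean over all pairs
        return sum(scores) / len(scores)
    raise ValueError("unknown linkage method: " + method)


# Words taken from the examples in the Introduction.
cluster_a = ["grAhako"]
cluster_b = ["grAhak", "mAhiwi"]
for method in ("single", "complete", "average"):
    print(method, linkage_similarity(cluster_a, cluster_b, prefix_similarity, method))
```

Because single linkage looks only at the best-matching pair, it rates these two clusters as similar even though one member of the second cluster shares no prefix with grAhako; complete linkage rates them as dissimilar for the same reason, and average linkage falls in between.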