Short Text Clustering using Numerical data based on N-gram

Rajiv Kumar
Department of Computer Science Engineering
Lovely Professional University, Phagwara, Punjab, India
Email: rajivbhatia1000@gmail.com

Robin Prakash Mathur
Department of Computer Science Engineering
Lovely Professional University, Phagwara, Punjab, India
Email: robin.14597@lpu.co.in

Abstract—Short text messages, especially mobile SMSs, contain not only pure textual strings but also numeric values. Existing systems filter out and discard these numeric values. In our research, a new approach has been developed that uses numeric values for feature extraction in the clustering process. We propose an algorithm that uses an n-gram approach to retrieve the pre-strings and post-strings of each numeric value, after which the similarity between documents is calculated. Partitioning is done to separate two types of documents: pure textual documents and mixed documents. Text messaging is gaining popularity as a means of pushing short, informative notifications to users at any time. Using numerical values through n-grams plays an important role in the efficient clustering of text messages.

Keywords—N-gram, Clustering, VSM

I. INTRODUCTION

Textual data is being produced at very high speed on a daily basis. Text mining includes feature extraction, classification, and clustering of these text documents. Most text messages originate from mobile phones, emails, and various social networking sites. Clustering is an unsupervised learning method; it does not make use of pre-defined classes for grouping objects. In our research, we concentrate on the clustering of short text messages. VSM (Vector Space Model) with k-means is a major technique employed for clustering short commercial text messages. An improved version of k-means for mining web documents, which preserves conceptual similarity, is also available.
Another version of k-means is kea-means, which lessens the dependency on k for choosing the initial number of clusters. Many supporting techniques, such as tf-idf, SVD, PCA, stemming, and n-grams, are used in the overall clustering process.

II. EXISTING WORK

Present research mainly covers feature extraction, text summarization, and stemming of words. Algorithms have been developed to extract features word by word and character by character. The vector space model is used both directly and in modified versions. The Euclidean measure seems somewhat obsolete and limited, whereas the cosine measure is commonly used in the vector space model. The cosine measure is a mathematically well-founded technique for calculating the similarity between two vectors. Creating the vectors requires calculating term frequency and inverse document frequency. Another similarity technique is the Jaccard method. It does not require creating vectors; it is a simple method based on the intersection and union of two sets of strings. In our implementation, we develop a module for Jaccard similarity.

Short commercial messages are clustered using VSM and k-means. This approach is based on single-pass, flat, hard clustering. A new algorithm named ArtCM has been developed that uses minimum and maximum thresholds [1]. Present work also includes dimension reduction techniques such as ICA (Independent Component Analysis) and latent semantic analysis (also known as latent semantic indexing, LSI). Research shows that ICA and LSI produce better results than conventional projection methods. Another existing study modifies k-means to remove the dependency on the value of k for the initial clusters. It extracts key phrases, and these key phrases determine the total number of initial clusters.

Closest to our research, existing work has converted numerical data to categorical data. That work creates a numerical data set and a categorical data set; the numerical data is then converted into another category, and clustering is performed.

III. PROPOSED WORK

Since we are dealing with text messages, it matters that these messages also contain numeric figures. Existing research uses only textual strings for document clustering. Our research makes use of these numeric figures for clustering. We also know that numeric data in text messages, taken as it is, does not play any significant role in clustering. We therefore use an n-gram technique to access the prefix and postfix strings of each numeric figure and then process them further. Our research mainly concentrates on mobile SMSs and short indicative emails. These messages generally contain notifications for people. For example, a student can receive notifications of his or her fee payment or exam schedule. In our text clustering technique, the prefix and postfix phrases of numeric values are retrieved and stored in memory. Hence, less memory is required for storage than when all string phrases are stored in memory.

978-1-4799-4236-7/14/$31.00 © 2014 IEEE
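The prefix and postfix extraction described in the proposed work, together with the partitioning into pure textual and mixed documents and the Jaccard measure mentioned in the existing work, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names (`tokenize`, `numeric_contexts`, `partition`, `jaccard`), the tokenization rule, the context width `n = 2`, and the `<num>` placeholder for numbers that appear inside a context are all assumptions made for illustration.

```python
import re

# Assumed tokenizer: a numeric token is a run of digits, optionally joined by
# . , : / - (so amounts, times and dates stay whole); a word is a run of letters.
TOKEN = re.compile(r"\d+(?:[.,:/-]\d+)*|[a-z]+")


def tokenize(message):
    """Lower-case a message and split it into word and numeric tokens."""
    return TOKEN.findall(message.lower())


def is_numeric(token):
    """A token is numeric if it starts with a digit (see TOKEN above)."""
    return token[0].isdigit()


def numeric_contexts(message, n=2):
    """Return the set of prefix and postfix word n-grams around each number.

    Each context is a tuple tagged "pre" or "post". Numbers that fall inside
    a context are replaced by "<num>" so that differing values do not reduce
    similarity (an assumption of this sketch).
    """
    tokens = tokenize(message)
    masked = ["<num>" if is_numeric(t) else t for t in tokens]
    contexts = set()
    for i, token in enumerate(tokens):
        if not is_numeric(token):
            continue
        prefix = tuple(masked[max(0, i - n):i])
        postfix = tuple(masked[i + 1:i + 1 + n])
        if prefix:
            contexts.add(("pre",) + prefix)
        if postfix:
            contexts.add(("post",) + postfix)
    return contexts


def partition(messages):
    """Split messages into (pure textual, mixed), where mixed means the
    message contains at least one numeric token."""
    textual, mixed = [], []
    for message in messages:
        has_number = any(is_numeric(t) for t in tokenize(message))
        (mixed if has_number else textual).append(message)
    return textual, mixed


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two sets (0.0 if both empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


if __name__ == "__main__":
    m1 = "Your fee of Rs 5000 is due on 12/05/2014"
    m2 = "Your fee of Rs 7200 is due on 20/06/2014"
    print(numeric_contexts(m1))
    print(jaccard(numeric_contexts(m1), numeric_contexts(m2)))
```

In this sketch the two fee notifications differ only in their numeric values, so their context sets are identical and their Jaccard similarity is 1.0, while a message without any number yields an empty context set and falls into the pure textual partition. Only the contexts around numbers are kept, which reflects the memory argument made above.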