Term Similarity and Weighting Framework for Text Representation

Sadiq Sani, Nirmalie Wiratunga, Stewart Massie, and Robert Lothian

School of Computing, The Robert Gordon University, Aberdeen AB25 1HG, Scotland, UK

Abstract. Expressiveness of natural language is a challenge for text representation since the same idea can be expressed in many different ways. Therefore, terms in a document should not be treated independently of one another since together they help to disambiguate and establish meaning. Term-similarity measures are often used to improve representation by capturing semantic relationships between terms. Another consideration for representation involves the importance of terms. Feature selection techniques address this by using statistical measures to quantify feature usefulness for retrieval related tasks. In this paper we present a framework that combines term-similarity and weighting for text representation. This allows us to comparatively study the impact of term similarity, term weighting and any synergistic effect that may exist between them. Study of term similarity is based on approaches that exploit term co-occurrences within the document and sentence contexts whilst term weighting uses the popular Chi-squared test. Our results on text classification tasks show that the combined effect of similarity and weighting is far superior to each independently and that this synergistic effect is obtained regardless of the co-occurrence context granularity. We also introduce a novel term-similarity mining approach using lexical co-occurrence profiles which consistently out-performs both the standard co-occurrence approaches to similarity mining and SVM.

1 Introduction

While unstructured, natural language text is convenient for human consumption, computers still find it difficult to process such information with satisfactory precision.
This is because the lexical content of natural language text can be quite different from its intended meaning due to inherent ambiguities in natural language such as synonymy (different terms having similar meaning) and polysemy (the same term having multiple different meanings). Representation of text documents is of interest to many research fields such as Information Retrieval, Natural Language Processing and Textual CBR. The standard Bag of Words (BOW) representation is a naive approach in that it operates at the lexical level, treating terms as independent features [15]. Such a strategy is well suited in domains where vocabulary usage remains consistent. However in the