Int. J. Intelligent Systems Technologies and Applications, Vol. 14, Nos. 3/4, 2015
Copyright © 2015 Inderscience Enterprises Ltd.
A new term weighting scheme for text categorisation
Fatiha Barigou
Laboratory of Computer Science of Oran,
Department of Computer Science,
University of Oran 1, Ahmed Ben Bella,
Oran 31000, Algeria
Email: fatbarigou@gmail.com
Abstract: Recently, the study of term weighting schemes has increasingly
attracted the attention of researchers in the field of text categorisation (TC).
Unlike information retrieval, TC is a supervised learning task that makes use of
prior information about the distribution of training documents across the
predefined categories. This information, although omitted from traditional
weighting schemes, is considered very useful and has been widely used for
term selection and for building classifiers. This paper studies and analyses a
new weighting measure intended to improve the performance of k nearest neighbours
(kNN)-based TC.
Keywords: text categorisation; term weighting; supervised term weighting
scheme; kNN; k nearest neighbours.
Reference to this paper should be made as follows: Barigou, F. (2015)
‘A new term weighting scheme for text categorisation’, Int. J. Intelligent
Systems Technologies and Applications, Vol. 14, Nos. 3/4, pp.256–272.
Biographical notes: Fatiha Barigou graduated from the Department of
Computer Science, University of Oran 1, Algeria. In 2012, she received
her PhD in Computer Science from the University of Oran 1. She is currently
a Research Member of the Laboratory of Computer Science of Oran. Her research
interests include natural language processing, information extraction,
information retrieval, knowledge-based systems, pattern recognition and data
mining.
1 Introduction
Nowadays, electronic information is abundantly available. The World Wide Web,
for example, is continually enriched with new content: companies are storing more
and more data, email is becoming an extremely popular form of communication and old
manuscripts are now available in digital form. All this complex information would be
meaningless if our ability to access it effectively did not increase too. For this, we need
tools to organise and access this data. One successful solution to this
problem is automatic text categorisation (TC).
The task of TC consists in assigning new documents to predefined categories
on the basis of knowledge gained during a training phase, in which a classification
system is built from a set of labelled training examples using a learning algorithm.
According to Sebastiani (2002), building an automated TC system is based on three main