A Modular System for Rule-based Text Categorisation

Marco Del Tredici, Malvina Nissim
Expert System, University of Bologna
mdeltredici@expertsystem.it, malvina.nissim@unibo.it

Abstract

We introduce a modular rule-based approach to text categorisation which is more flexible and less time consuming to build than a standard rule-based system, because it works with a hierarchical structure and allows for reusability of rules. When compared on a case study to the currently more widespread machine learning models, our modular system shows competitive results. It also has the advantage of reducing manual effort over time, since fewer rules must be written when moving to a (partially) new domain, whereas the annotation of training data is always required in the same amount.

Keywords: text categorisation, rule-based, hierarchical structure

1. Introduction and Background

Automatic text categorisation is the task of classifying a set of unknown documents into a finite number of preselected categories. It is also known as “document classification”. Nowadays, thanks to the availability of a large quantity of preclassified documents in digital form and of effective learning algorithms, the dominant approach is based on supervised machine learning techniques, where a classifier is built by learning from a set of manually labelled documents (Sebastiani, 2002; Sebastiani, 2005). Statistical approaches have thus gradually been favoured over rule-based ones, even though the latter had achieved competitive results (Dejong, 1982; Jacobs and Rau, 1988; Hayes and Weinstein, 1990; Goodman, 1990). This is also because of portability issues: creating new annotated sets for training statistical models is generally less time consuming and requires a lesser degree of expertise than creating entire new sets of rules (Sebastiani, 1999; see also Yang and Liu (1999) and Steinbach et al. (2000) for a comparison including unsupervised methods).
However, the Pascal challenge in Large Scale Hierarchical Text Classification, now in its fourth edition¹, highlights the need for ways of dealing with large amounts of data where distributions are skewed at different levels of the hierarchy, implying that learning methods are challenged by the distribution and statistical dependence of the classes (Kosmopoulos et al., 2010).

In this paper we explore the possibility of making a more flexible, reusable rule-based system aimed at improving portability while eliminating the annotation effort typical of supervised machine learning approaches. Specifically, rather than creating a unique set of rules that directly defines a target category, we suggest a double categorisation process, in which atomic categories are first created as independent units and then combined into complex structures corresponding to the target classes. Such a modular rule-based system also lends itself naturally to the aforementioned hierarchical document classification task, as it reflects the structural dependencies of the categories (see Section 5 for a detailed discussion of this issue).

¹ http://lshtc.iit.demokritos.gr/LSHTC4_CALL

2. Method

The method we propose is a modular two-stage categorisation process. Final categories, which we call complex categories (Hayes and Weinstein, 1990), can be seen as the sum of several basic categories, which we call atomic categories. Given the final target categories, several relevant atomic categories are created and stored in a database as independent units. Complex categories are then built up by aggregating appropriate basic categories among those available, with a specific weighting strategy (see details of phase 2 below). By combining them in different ways, new complex structures can be formed. The underlying guiding principle, which addresses the portability problem typical of rule-based systems, is that basic categories constitute atoms of information that can be reused.
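
The two-stage idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the weighted sum used to aggregate atomic scores, the threshold, and all category names, weights and scores are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class ComplexCategory:
    """A target category built from reusable atomic categories."""
    name: str
    weights: dict  # atomic category name -> weight (assumed weighting scheme)

    def score(self, atomic_scores):
        # Assumed aggregation: weighted sum of the atomic scores.
        # Atomic categories that did not fire for the document count as 0.
        return sum(w * atomic_scores.get(atom, 0.0)
                   for atom, w in self.weights.items())


def categorise(atomic_scores, complex_categories, threshold=0.0):
    """Return the names of complex categories scoring above the threshold,
    ordered from highest to lowest score."""
    scored = [(c.score(atomic_scores), c.name) for c in complex_categories]
    return [name for s, name in sorted(scored, reverse=True) if s > threshold]


# Hypothetical atomic scores for one document, standing in for stage-one output.
atomic_scores = {"football": 3.0, "transfer_market": 2.0, "stock_exchange": 1.0}

# Hypothetical complex categories; "transfer_market" is reused in two of them.
sports_business = ComplexCategory(
    "sports_business", {"football": 1.0, "transfer_market": 2.0})
finance = ComplexCategory(
    "finance", {"stock_exchange": 2.0, "transfer_market": 0.5})
politics = ComplexCategory("politics", {"elections": 1.0})

ranked = categorise(atomic_scores, [sports_business, finance, politics])
print(ranked)
```

In this sketch, adding a new target category only requires a new dictionary of weights over atomic categories that already exist, which is the kind of reuse the modular design aims for.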
As an advantage over statistical methods, no new training sets have to be annotated when dealing with new categories. And as an advantage with respect to rule-based approaches, no new rules have to be written completely from scratch when including new categories or moving to (partially) new domains.

The two phases of the categorisation process require specific tools, and are described below.

In phase 1, atomic categories are created with COGITO® Studio, a programming environment in which sets of rules are manually written using a declarative language. At the heart of COGITO® Studio is a semantic network (similar to the WordNet hierarchy (Fellbaum, 1998)), which is used to perform word sense disambiguation and entity recognition. Within the same environment, rules are then written to define the conditions that an input text has to satisfy to be placed in a given atomic category. Every time such conditions are verified, a score is assigned to the corresponding category. All created atomic categories are stored in a database.

In phase 2, we developed a Java application that implements the combination of the atomic categories in the database towards the construction of complex cat-