Adaptive Context-based Term (Re)weighting: An Experiment on Single-Word Question Answering

Marco Ernandes, Giovanni Angelini, Marco Gori, Leonardo Rigutini and Franco Scarselli 1

Abstract. Term weighting is a crucial task in many Information Retrieval applications. Common approaches are based either on statistical or on natural language analysis. In this paper, we present a new algorithm that capitalizes on the advantages of both strategies. In the proposed method, the weights are computed by a parametric function, called the Context Function, that models the semantic influence exerted among the terms. The Context Function is learned from examples, so that its implementation is largely automatic. The algorithm was successfully tested on a data set of crossword clues, which represents a case of Single-Word Question Answering.

1 Introduction

Term weighting is an important task in many areas of Text Processing, including Document Retrieval, Text Categorization and Question Answering (QA). The goal of term weighting is to assign to each term w found in a collection of text documents a score s(w) that measures the importance, with respect to a certain goal, of the information represented by the word. Common approaches to term weighting can be divided into two groups: statistical and linguistic techniques. Statistical techniques [1] (e.g. TFIDF) are efficient and easy to develop, but they tend to treat the words of a document as unordered and independent. Techniques inspired by natural language theories [6], such as morphological analysis, naturally exploit the information provided by word contexts. This makes the processing more expressive, but also slower and more complex to design. In this paper, we present a term weighting algorithm that aims to combine the advantages of both statistical and linguistic strategies. The method exploits the relationships among the words of a document.
The intuition is that the relevance of a term can be computed recursively as the combination of its intrinsic relevance and the relevance of the terms that appear within the same context. The influence exercised by one word on another is computed using a parametric function, called the Context Function. This function can use both statistical and linguistic information, and it can be trained from examples. The Context-based algorithm has been evaluated on a specific problem, that of Single-Word Question Answering (QA), where the goal is to find the single correct word that answers a given question. The experimental results show that the approach is viable.

2 Adaptive Context-based term (re)weighting

The proposed method exploits word contexts. A word context primarily consists of the text surrounding a given word, but it could also include other features, e.g. document titles or hyper-linked documents. The basic idea is that, in order to measure the relevance of a word with respect to a certain goal (e.g. a query, a document category), the features of the context in which the term appears are as important as the features of the word itself. In this work we assume that a text document can be represented by a social network [5], where the importance of the words can be computed on the basis of their neighbours. More precisely, the weight s(w) of the word w is computed as

s(w) = (1 − λ) d_w + λ Σ_{u ∈ r(w)} s(u) c_{w,u},    (1)

where d_w is the default score of w, r(w) is the set of words that belong to the context of w, c_{w,u} is a real number measuring the influence exercised by u over w, and λ ∈ [0, 1] is a damping factor. Eq. (1) defines the term weights by a sparse linear system of equations.

1 Dip. di Ingegneria dell'Informazione, Università di Siena, via Roma 56, 53100 - Siena - Italy, email: {ernandes, angelini, marco, rigutini, franco}@dii.unisi.it
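As a minimal sketch of how the fixed point of Eq. (1) can be reached iteratively, the fragment below solves a small dense instance with a Jacobi-style sweep; the concrete values of d, C and λ are hypothetical toy data, not taken from the paper (a real instance would use a sparse matrix, as the text notes):

```python
import numpy as np

def jacobi_term_weights(d, C, lam=0.85, iters=200, tol=1e-9):
    """Iterate s <- (1 - lam) * d + lam * C @ s (Eq. 1) until the
    update is below tol. With lam * rho(C) < 1 the map is a
    contraction, so the sweep converges to the unique fixed point."""
    s = d.copy()
    for _ in range(iters):
        s_new = (1.0 - lam) * d + lam * C @ s
        if np.max(np.abs(s_new - s)) < tol:
            break
        s = s_new
    return s

# hypothetical 3-word example: default scores d and a
# row-normalised influence matrix C (c[w, u] in row w, column u)
d = np.array([0.2, 0.5, 0.3])
C = np.array([[0.0, 0.7, 0.3],
              [0.5, 0.0, 0.5],
              [0.4, 0.6, 0.0]])
weights = jacobi_term_weights(d, C, lam=0.85)
```

The damping factor λ plays the same stabilising role as in PageRank-style computations: it guarantees a unique solution and bounds the contribution of the recursive term.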
In our experiments, the solution of such a system was computed by the Jacobi algorithm, an efficient method that can be applied even to huge problems with billions of variables [3].

Context Functions. In order to define the influence factors, it must be taken into account that words are instantiated multiple times, in different positions of a document, and each instance (a word occurrence) is affected by a different context. Therefore, we distinguish between words, w, and word occurrences, ŵ. We assume that c_{w,u} can be computed as the sum of the contributions of all the occurrences ŵ, û of w and u, respectively, such that û belongs to the context of ŵ:

c_{w,u} = Σ_{ŵ ∈ occ(w)} Σ_{û ∈ ict(ŵ,u)} C_p(ŵ, û).    (2)

Here, occ(w) is the set of instances of word w, and ict(ŵ, u) is the set of the occurrences of u that belong to the context of ŵ, i.e. ict(ŵ, u) = occ(u) ∩ ctxt(ŵ), where ctxt(ŵ) is the context of ŵ; p is a set of parameters, and C_p(ŵ, û) is a parametric function that establishes the strength of the influence between the instances ŵ (the word under evaluation) and û (the context word). In this work we define the context ctxt(ŵ) of a word as the set of words that are contained in the same document and within a fixed surround of ŵ. The function C_p(ŵ, û) establishes how word pairs can influence one another on the basis of features extracted from the words and from the relationships between them. This function can exploit any sort of feature of ŵ and û: info-theoretical, morphological or lexical. The features used in our preliminary experiments are exclusively statistical (Tab. 1). The most general approach to the implementation of C_p(ŵ, û) is a modeling tool with the universal approximation property, e.g. neural networks, polynomials, or rational functions. For the introductory scope of this paper we preferred to adopt a simpler implementation of C_p(ŵ, û).
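The accumulation in Eq. (2) can be sketched as follows, assuming a fixed-radius window as the context ctxt(ŵ); the window radius, the token list and the distance-based influence function are illustrative assumptions, not details fixed by the paper:

```python
from collections import defaultdict

def influence_factors(tokens, context_fn, radius=3):
    """Accumulate c[(w, u)] by summing context_fn over every pair of
    occurrences (ŵ, û) where û falls inside ŵ's window (Eq. 2).
    context_fn(i, j, tokens) plays the role of C_p(ŵ, û), computed
    from the occurrences at positions i and j."""
    c = defaultdict(float)
    for i, w in enumerate(tokens):
        lo, hi = max(0, i - radius), min(len(tokens), i + radius + 1)
        for j in range(lo, hi):
            if j == i:
                continue  # a word occurrence does not influence itself
            c[(w, tokens[j])] += context_fn(i, j, tokens)
    return c

# toy document; a simple inverse-distance influence (hypothetical C_p)
tokens = "the cat sat on the mat".split()
c = influence_factors(tokens, lambda i, j, t: 1.0 / abs(i - j), radius=2)
```

Because the loop iterates over occurrences rather than word types, repeated words (like "the" above) contribute once per instance, exactly as the occ(w) / ict(ŵ, u) decomposition prescribes.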
We defined the influence function as

C_p(ŵ, û) = Π_{i=0}^{n} σ(α_i x_i + β_i),

where x_i is the value associated with the i-th feature, α_i, β_i are the model parameters, and σ is the logistic sigmoid function σ(x) = 1/(1 + e^{−x}). Each factor σ(α_i x_i + β_i) is a sort of soft switch related to the i-th feature, controlled by α_i (steepness and direction) and β_i (mid-point), so that the whole function behaves like a boolean expression composed of AND operators.
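A minimal sketch of this product-of-sigmoids form, with hypothetical feature values and parameters chosen only for illustration:

```python
import math

def sigmoid(x):
    """Logistic sigmoid: sigma(x) = 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))

def context_strength(x, alpha, beta):
    """C_p(ŵ, û) = prod_i sigma(alpha_i * x_i + beta_i).
    Each factor is a soft switch on feature x_i: alpha_i sets the
    steepness and direction, beta_i shifts the switching point.
    Multiplying the factors acts as a soft AND: the strength is
    high only when every switch is 'on'."""
    out = 1.0
    for xi, ai, bi in zip(x, alpha, beta):
        out *= sigmoid(ai * xi + bi)
    return out

# two hypothetical features (e.g. inverse distance, co-occurrence score)
strength = context_strength([0.5, 2.0], alpha=[4.0, 1.5], beta=[0.0, -1.0])
```

Because each sigmoid lies in (0, 1), one feature near its "off" state is enough to drive the whole product towards zero, which is precisely the AND-like behaviour the text describes.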