A General Matrix Framework for Modelling Information Retrieval

Thomas Rölleke, Theodora Tsikrika, Gabriella Kazai
Department of Computer Science
Queen Mary University of London
{thor,theodora,gabs}@dcs.qmul.ac.uk

Abstract

Content-oriented retrieval models are based on a document-term matrix, whereas link-oriented retrieval models are based on an adjacency (parent-child) matrix. Term frequency and inverse document frequency are key concepts in content-oriented retrieval, whereas pagerank, authorities and hubs are key concepts in link-oriented retrieval. We present in this paper a general matrix framework for modelling information retrieval (IR). The framework covers both content-oriented and link-oriented retrieval and, in addition, includes the structure of documents, the retrieval quality and the semantics of indexing terms. The benefit of this framework lies in its high level of reusability and abstraction. The framework improves information retrieval in the sense that system construction becomes significantly more efficient, and thus better and more personalised systems can be built at lower cost.

1 Introduction

With the web and its search engines, ranking of retrieved objects has become a focus in many application areas. More and more people face the task of building complex information systems that provide ranking functionality. The matrix framework presented in this paper contributes to the understanding of retrieval concepts, and it supports the construction of search systems, since the matrix operations provide a high level of reusability and abstraction.

The matrix framework improves retrieval in the sense that system construction becomes more efficient, flexible and robust. For a search system engineer, the flexibility of tools is crucial, since flexible retrieval and indexing functions make it possible to tune the effectiveness and efficiency of a system to the particular needs of an end user.
The literature background of this work includes general IR literature such as [van Rijsbergen, 1979, Grossman and Frieder, 1998, Baeza-Yates and Ribeiro-Neto, 1999, Belew, 2000], and more specific literature such as [Wong et al., 1985, Wong and Yao, 1995, Amati and van Rijsbergen, 1998, Page et al., 1998, Kleinberg, 1999]. [Wong et al., 1985], [Wong and Yao, 1995] and other publications by these authors on the generalised vector-space model and the probabilistic framework for information retrieval are major foundations and motivations for the matrix framework presented in this paper. Furthermore, [Amati and van Rijsbergen, 1998] on the duality of document indexing and relevance feedback, and [Amati and Rijsbergen, 2002] on probability distributions for exploiting term frequencies and capturing normalisation, motivated us to present a general matrix framework in which those methods can be applied to more than “just term frequencies”. The extension of our matrix framework towards a probabilistic framework with probability distributions is one of the next research goals. The results and notations of [Page et al., 1998] and [Kleinberg, 1999] informed our treatment of link-oriented retrieval. All of the above literature addresses the formalisation of either content or links (structure), whereas in this paper we propose a general matrix framework for both content and structure. In addition, we include relevance feedback and evaluation within our framework.

The paper is structured as follows: First, we introduce the matrix spaces in Section 2. We consider a collection, a document and a query space, where we associate several matrices with