In Proceedings of the E-Society 2004 Conference (E-Society2004). © IADIS 2004.
How to Improve Retrieval Effectiveness on the Web?
João Ferreira, Instituto Superior de Engenharia de Lisboa, jferreira@deetc.isel.ipl.pt
Alberto Rodrigues da Silva, INESC-ID, Instituto Superior Técnico, alberto.silva@acm.org
José Delgado, Instituto Superior Técnico, Jose.Delgado@tagus.ist.utl.pt
ABSTRACT
We explore the question of combining link analysis, content analysis and classification-based techniques to improve retrieval performance on the Web. We show the potential of combination, and that even a relatively simple implementation of it improves retrieval performance.
KEYWORDS
Combination, Information Retrieval, Classification
1 Introduction
How do we find information on the Web? It is an old question that is far from solved. Web information is distributed, decentralized and huge in size. The Web can be viewed as one big virtual document collection. The findings of traditional Information Retrieval (IR) research (traditional IR meaning text-based approaches), however, may not always be applicable in the Web setting. The Web document collection, massive in size and diverse in content, context, format, purpose and quality, challenges the validity of previous research findings, which were based on relatively small and homogeneous test collections. Also, some traditional IR approaches may be applicable in theory but impossible or impractical to implement in a Web IR system. For instance, the size, distribution and dynamic nature of information on the Web make it difficult, if not impossible, to construct the complete and up-to-date data representation that an ideal IR system requires. In addition, conventional evaluation measures such as precision and recall, and even the notion of relevance, may no longer be applicable to Web IR, where a test collection representative of dynamic and diverse Web data is all but impossible to construct.
To further complicate the matter, information seeking on the Web is diverse in character and unpredictable in nature. Web searchers come from all kinds of backgrounds and are motivated by all types of information need. Their wide range of experience, knowledge, motivation, need and purpose means that they express many kinds of information need in many different ways, with varying criteria for satisfying those needs. At the same time, the Web is rich in new types of information not present in most previous test collections. Hyperlinks, usage statistics, document markup tags and topic hierarchies such as Yahoo present an opportunity to leverage Web-specific document characteristics in novel approaches that go beyond the term-based retrieval framework of traditional IR.
This paper explores the question of combining link analysis, content analysis and classification-based techniques to improve retrieval performance, and is divided into 7 sections. First, we introduce the problem; second, we describe the individual retrieval systems; third, we discuss combination formulae; fourth, we present our results; fifth, we discuss those results; and finally, we present our conclusions.