Investigating Learning Approaches for Blog Post Opinion Retrieval

Shima Gerani, Mark Carman, Fabio Crestani

Faculty of Informatics, University of Lugano, Lugano, Switzerland
{shima.gerani,markjcarman,fabio.crestani}@lu.unisi.ch

Abstract. Blog post opinion retrieval is the problem of identifying posts which express an opinion about a particular topic. Usually the problem is solved using a three-step process in which relevant posts are first retrieved, then opinion scores are generated for each document, and finally the opinion and relevance scores are combined to produce a single ranking. In this paper, we study the effectiveness of classification and rank learning techniques for solving the blog post opinion retrieval problem. We have chosen not to rely on external lexicons of opinionated terms, but instead investigate to what extent the list of opinionated terms can be mined from the same corpus of relevance/opinion assessments that is used to train the retrieval system. We compare popular feature selection methods, such as the weighted log likelihood ratio and mutual information, for use both in selecting terms for training an opinionated document classifier and as term weights for generating simpler (non-learning-based) aggregate opinion scores for documents. We thereby analyze what performance gains result from learning in the opinion detection phase. Furthermore, we compare different learning-based and non-learning-based methods for combining relevance and opinion information in order to generate a ranked list of opinionated posts, thereby investigating the effect of learning on the ranking phase.

Key words: Opinion Retrieval, Blog Post, Learning Methods

1 Introduction

Unlike traditional Web pages, which tend to contain primarily factual information, blog posts often contain opinionated content expressing the views of blog authors on a variety of topics such as political issues, product launches, fashion trends, health services, and so on.
This wealth of opinionated content can be very useful for those whose job it is to gauge users’ thoughts and perceptions about different concepts, including marketing departments, political pundits, anthropology researchers, etc. Mining this opinionated content so as to separate the opinionated and relevant content from the “background noise” that is present in the blogosphere (the totality of all blogs) constitutes an interesting and challenging research problem. We view this problem from an Information Retrieval (IR) perspective and investigate the specific task of retrieving blog posts that express an opinion about