A Model for Ranking Entities and Its Application to Wikipedia

Gianluca Demartini, Claudiu S. Firan, Tereza Iofciu, Ralf Krestel, and Wolfgang Nejdl
L3S Research Center
Leibniz Universität Hannover
Appelstr 9a, 30167 Hannover, Germany
{demartini,firan,iofciu,krestel,nejdl}@L3S.de

Abstract

Entity Ranking (ER) is a recently emerging search task in Information Retrieval, where the goal is not finding documents matching the query words, but instead finding entities which match types and attributes mentioned in the query. In this paper we propose a formal model to define entities as well as a complete ER system, providing examples of its application to enterprise, Web, and Wikipedia scenarios. Since searching for entities on Web scale repositories is an open challenge as the effectiveness of ranking is usually not satisfactory, we present a set of algorithms based on our model and evaluate their retrieval effectiveness. The results show that combining simple Link Analysis, Natural Language Processing, and Named Entity Recognition methods improves retrieval performance of entity search by over 53% for P@10 and 35% for MAP.

1 Introduction

Finding entities on the Web is a new search task which goes beyond the classic document search. While for informational search tasks (see [7] for a classification of tasks) high precision document search can give satisfying results for the user, a different approach should be followed when the user is looking for specific entities. For example, when the user wants to find a list of “Brazilian female politicians” it is easy for a classical search engine to return documents about politics in Brazil. It is left to the user to extract the information about the requested entities from the provided results. Our goal is to develop a system that can find entities and not just documents on the Web.

Being able to find entities on the Web can become a new important feature of current search engines.
It can allow users to find more than just Web pages: also people, phone numbers, books, movies, cars, or any other kind of item.

Searching for entities in a collection of documents is not an easy task. Currently, we can see the Web as a set of interlinked pages of different types, e.g., describing tasks, answering questions, or describing people. Therefore, in order to find entities, it is necessary to perform a preprocessing step that identifies entities in the documents. Moreover, we need to build descriptions of those entities to enable search engines to rank and find them given a user query.

Applying classical Information Retrieval (IR) methodologies for finding entities can lead to low effectiveness, as seen in previous approaches [3, 9]. This is because entity search is a different task from document search. It is crucial to rely on consolidated information extraction technologies if we do not want to start with an already high error that the ranking algorithms can only increase.

In this paper we first propose a general model for finding entities and we show how it can be applied to different entity search scenarios. We generalize this search task and identify its main actors so that we can optimize solutions for different search contexts such as, for example, the Wikipedia corpus. Building on top of the designed model, we developed search algorithms based on Link Analysis, Natural Language Processing (NLP), and Named Entity Recognition (NER) for finding entities in the Wikipedia corpus. Moreover, we experimentally evaluated the developed techniques using a standard testbed for Entity Ranking (ER). We show that these algorithms improve significantly over the baseline and that the proposed approaches, which incorporate Link Analysis, NLP, and NER methods, can be beneficially used for ER. We evaluated our algorithms for entity ranking only on the Wikipedia scenario. Extending the approach to the entire Web of Entities is left as a future step.
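The retrieval effectiveness reported in this paper is expressed as P@10 and MAP. As a reminder of what these two measures compute, the following is a minimal sketch, assuming binary relevance judgments and a ranked list without duplicate entities. The function names and the toy entity identifiers are illustrative assumptions and are not taken from the evaluation testbed.

```python
from typing import Hashable, Iterable, Sequence, Set, Tuple


def precision_at_k(ranked: Sequence[Hashable], relevant: Set[Hashable], k: int = 10) -> float:
    """Fraction of the top-k ranked entities that are relevant (P@k)."""
    if k <= 0:
        raise ValueError("k must be positive")
    # Positions beyond the end of a short ranking count as non-relevant.
    return sum(1 for entity in ranked[:k] if entity in relevant) / k


def average_precision(ranked: Sequence[Hashable], relevant: Set[Hashable]) -> float:
    """Average of the precision values at each rank where a relevant entity appears.

    Relevant entities that are never retrieved contribute zero, so the sum
    is divided by the total number of relevant entities.
    Assumes `ranked` contains no duplicates.
    """
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, entity in enumerate(ranked, start=1):
        if entity in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(
    runs: Iterable[Tuple[Sequence[Hashable], Set[Hashable]]]
) -> float:
    """Mean of average precision over (ranking, relevant set) pairs, one per query."""
    runs = list(runs)
    if not runs:
        return 0.0
    return sum(average_precision(ranked, relevant) for ranked, relevant in runs) / len(runs)


# Toy usage with hypothetical entity identifiers.
ranked_entities = ["a", "x", "b", "y"]
relevant_entities = {"a", "b", "c"}
p_at_2 = precision_at_k(ranked_entities, relevant_entities, k=2)
ap = average_precision(ranked_entities, relevant_entities)
```

In the toy example, one of the top two results is relevant, so P@2 is 0.5; relevant entities appear at ranks 1 and 3 and one relevant entity is missing, so the average precision is (1/1 + 2/3) / 3.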
The main contributions of this paper are:

• Proposing a general model for Entity Ranking (Section 2);
• Applying the model to enterprise, Web, and Wikipedia scenarios (Section 3);
• Creating a set of algorithms for finding entities in Wikipedia (Section 5);