Retrieval and Ranking of Semantic Entities for Enterprise Knowledge Management Tasks

Chad Cumby
Accenture Technology Labs
161 N Clark St.
Chicago, IL, USA
chad.m.cumby@accenture.com

Katharina Probst
Google, Inc.
Atlanta, GA, USA
katharina.probst@gmail.com

Rayid Ghani
Accenture Technology Labs
161 N Clark St.
Chicago, IL, USA
rayid.ghani@accenture.com

ABSTRACT
We describe a task-sensitive approach to retrieval and ranking of semantic entities, using the domain information available in an enterprise. Our approach utilizes noisy named-entity tagging and document classification, on top of an enterprise search engine, to provide input to a novel ranking metric for each entity retrieved for a task. Retrieval is query-centric, where the user query is the target topic (e.g., a technology needed for a proposal). Named entities are then extracted from the retrieved documents, and ranked according to their similarity to the target topic. We evaluate our approach by comparing to a baseline retrieval and ranking technique that is based on entity occurrence rates, and show encouraging results.

Keywords
Enterprise search, Metadata IR, Information Extraction

1. INTRODUCTION
Current knowledge management systems not only consist of documents, but also of a variety of semantic entities. People, employees, companies, clients, projects, partners, alliances, locations, and competitors are just some examples of such entities. When a worker performs a specific enterprise task, one or more of these semantic entities are often required to fulfill that task. Writing proposals for new contracts with various companies, finding experienced workers within the company to work on new projects, or evaluating different third-party vendor capabilities with respect to various project requirements are just some enterprise tasks that require the use of entities mentioned above.
In general, various knowledge management tasks can be greatly simplified or assisted by delivering relevant semantic entities to the person performing them. Currently, it is very difficult to retrieve such information: while any commercial enterprise search engine will yield a list of relevant documents, none can currently retrieve a reliable ranked list of semantic entities such as companies or experts. Information extraction has focused on extracting certain kinds of entities, whereas search and IR work has typically focused on ranking documents. In the cases where entity search has been studied in the IR community [2, 1], it has been restricted to expert finding with specialized heuristics, without utilizing a general class of semantic information. This is limiting in several ways, and we argue that augmenting search systems to retrieve semantic entities specific to a given task, and to rank them dynamically based on the needs of the user, is essential for many enterprise tasks. To enable more general entity retrieval and ranking, we present here a system that retrieves and ranks semantic entities that can be specific to a user query, i.e., a topic such as ‘Business Intelligence’.

An important difference between content on the Web and content in the enterprise is that in the latter, business processes often produce extensive meta-data about each document in the enterprise knowledge base.
This can include straightforward information such as the creator of a document or the time the document was submitted, but also includes a lot of semantic knowledge about the domain, such as client companies or locations that are relevant to the document. Using this meta-data to incorporate semantics into search engines is at the center of our approach. With it, we create a more representative profile of a semantic entity to be used in ranking, compared to simple occurrence counts of the entity in the corpus.

2. RETRIEVAL AND RANKING
We query the document search engine for the topic $t$ specified by the user and use the retrieved documents to extract candidate entities. Our base document search engine indexes automatically extracted entities such as companies, people, keywords, locations, acronyms, etc., as well as manually given entities such as project client, project contact, etc., from all documents in a specific set, associated with occurrence frequency. The algorithm proceeds by collecting the entities of the desired type, i.e., people in our example, that were extracted for all the returned documents. Each candidate entity $e_i$ is associated with a count $\mathit{cnt}_{e_i}$. This count indicates how many of the returned documents contain the entity. We first order all candidates by $\mathit{cnt}$ and consider only those entities in the top $n$ (100 in our experiments) by occurrence. For example, let $t$ be a topic query of interest to the user, e.g., CRM or BP drilling. Let $\mathit{type}_{e_c}$ be the candidate entity type, e.g., people. The query will result in a document set $R_t$. We then create a set of candidate enti-