A Bi-Dimensional User Profile to Discover Unpopular Web Sources

Romain Noel, LITIS, Normandie Université, INSA de ROUEN, Airbus Defense & Space, Val-de-Reuil, France, romain.noel@cassidian.com
Nicolas Malandain, LITIS, Normandie Université, INSA de ROUEN, St Etienne du Rouvray, France, nicolas.malandain@insa-rouen.fr
Alexandre Pauchet, LITIS, Normandie Université, INSA de ROUEN, St Etienne du Rouvray, France, alexandre.pauchet@insa-rouen.fr
Laurent Vercouter, LITIS, Normandie Université, INSA de ROUEN, St Etienne du Rouvray, France, laurent.vercouter@insa-rouen.fr
Bruno Grilheres, Airbus Defence & Space, Val-de-Reuil, France, bruno.grilheres@cassidian.com
Stephan Brunessaux, Airbus Defence & Space, Val-de-Reuil, France, stephan.brunessaux@cassidian.com

ABSTRACT

The discovery of new sources of information on a given topic is a prominent problem for Experts in Intelligence Analysis (EIA), who must search for pages on specific and sensitive topics. Their information needs are difficult to express with queries, and pages with sensitive content are difficult to find with traditional search engines because they are usually poorly indexed. We propose a double vector to model EIA’s information needs, composed of DBpedia resources [2] and keywords, both extracted from Web pages provided by the user. We also introduce a new similarity measure that is used in a Web source discovery system called DOWSER. DOWSER aims at providing users with new sources of information related to their needs without considering the popularity of a page. A series of experiments provides an empirical evaluation of the whole system.

Categories and Subject Descriptors

H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval

1. INTRODUCTION

The explosive growth of the Web has resulted in a huge amount of information being available on the Internet. Finding relevant sources has become a complex task.
Experts in Intelligence Analysis (EIA) often explore the Web to collect information on specific and sensitive topics such as Web sites selling illegal pharmaceutical products, Jihadist blogs, terrorist forums and so on. Their information need is therefore to discover new information sources on their search topics.

Copyright is held by the International World Wide Web Conference Committee (IW3C2). IW3C2 reserves the right to provide a hyperlink to the author’s site if the Material is used in electronic media. WWW 2015 Companion, May 18–22, 2015, Florence, Italy. ACM 978-1-4503-3473-0/15/05. http://dx.doi.org/10.1145/2740908.2742139

The search activity of EIA has some specific characteristics that make traditional Information Retrieval (IR) tools unsuitable. First, EIA find it difficult to express their needs using traditional queries, as the vocabulary of sensitive pages evolves quickly. For instance, new drug names, often synonyms for the same molecule, appear frequently, and EIA therefore need to discover new sensitive pages to update their vocabulary. Second, sources containing sensitive content are usually poorly indexed by traditional web search engines because of their lack of popularity. Such sources may also not be indexed at all, either in order to stay out of reach of ordinary users or because search engines deprecate their content. Finally, EIA have to combine broad search and deep search, both to explore sources on an identified relevant topic and to consider, and sometimes discover, new related topics.

In this article, we introduce an original approach to user profile modelling that addresses the problem of representing sensitive information needs for EIA. Instead of queries, we propose to describe a user’s information needs with a double vector of DBpedia resources [2] and keywords, which cover respectively the thematic and the specific aspects of her information need. The user profile is constructed semi-automatically, so that EIA do not have to supply their own list of terms.
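
As a rough illustration of the double vector described above, the sketch below stores one weighted vector per dimension and scores a candidate page against the profile. It is only a minimal sketch under stated assumptions: the class `BiDimensionalProfile`, the helpers `build_profile` and `relevance`, the mixing weight `alpha` and the use of cosine similarity are all hypothetical stand-ins, not the similarity measure of DOWSER, which is presented in section 3. The extraction of DBpedia resources and keywords from pages is assumed to have been done beforehand.

```python
from collections import Counter
from dataclasses import dataclass, field
from math import sqrt


@dataclass
class BiDimensionalProfile:
    """Hypothetical container: one weighted vector per dimension."""
    resources: Counter = field(default_factory=Counter)  # DBpedia resource URI -> weight
    keywords: Counter = field(default_factory=Counter)   # keyword -> weight


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as mappings."""
    dot = sum(weight * b.get(key, 0) for key, weight in a.items())
    norm_a = sqrt(sum(w * w for w in a.values()))
    norm_b = sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_profile(pages):
    """Aggregate (resources, keywords) pairs already extracted from the
    pages provided by the user into a single profile."""
    profile = BiDimensionalProfile()
    for resources, keywords in pages:
        profile.resources.update(resources)
        profile.keywords.update(keywords)
    return profile


def relevance(profile, page, alpha=0.5):
    """Assumed scoring rule: weighted mix of the two per-dimension cosines.
    alpha is an illustrative parameter, not taken from the paper."""
    thematic = cosine(profile.resources, page.resources)
    specific = cosine(profile.keywords, page.keywords)
    return alpha * thematic + (1 - alpha) * specific


# Illustrative data only: the URI and the keywords are made up for the example.
drug = "http://dbpedia.org/resource/Pharmaceutical_drug"
profile = build_profile([
    ([drug], ["pills", "no", "prescription"]),
    ([drug], ["pills", "cheap"]),
])
candidate = BiDimensionalProfile(Counter([drug]), Counter(["cheap", "pills"]))
print(round(relevance(profile, candidate), 3))
```

Keeping the two vectors separate, rather than merging them into a single bag of terms, lets the thematic and the specific dimensions be weighted independently.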
To tackle the problem of poorly indexed web sites, we exploit our own focused crawler, DOWSER (Discovery Of Web Sources Evaluating Relevance) [12], which integrates a new similarity measure to index pages regardless of their popularity.

Our approach provides the following main contributions: (i) a semi-automatically constructed user profile based on DBpedia concepts and keywords, both used by DOWSER; (ii) an approach for relevance calculation based on this profile; and (iii) an automatic ranking process to provide relevant sources of information to the user. In section 2, we compare our approach to existing work. In section 3, the user profile representation and our similarity measure are described. A user experiment and the results obtained are presented in section 4. Finally, we conclude by discussing possible extensions in section 5.