2013 IEEE INTERNATIONAL WORKSHOP ON MACHINE LEARNING FOR SIGNAL PROCESSING, SEPT. 22-25, 2013, SOUTHAMPTON, UK

INTER-DOCUMENT REFERENCE DETECTION AS AN ALTERNATIVE TO FULL TEXT SEMANTIC ANALYSIS IN DOCUMENT CLUSTERING

Patrick A. De Mazière (a,b)
(a) Dept. Healthcare & Technology, KHLeuven, Herestraat 49, 3000 Leuven, Belgium
Patrick.DeMaziere@khleuven.be

Marc M. Van Hulle (b)
(b) Lab. Neuro- & Psychofysiologie, KU Leuven, Herestraat 49-1021, 3000 Leuven, Belgium
Marc.VanHulle@med.kuleuven.be

ABSTRACT

We discuss the search for inter-document references as an alternative to grouping document inventories on the basis of a full text semantic analysis. The document inventory used, which is not publicly available, was provided to us by the European Union (EU) in the framework of an EU project whose aim was to analyse, classify, and visualise EU-funded research in social sciences and humanities in EU framework programmes FP5 and FP6. This project, called the SSH project for short, was aimed at evaluating the contributions of research to the development of EU policies. For the semantics-based grouping, we start from a Multi-Dimensional Scaling analysis of the document vectors, which are the result of a prior semantic analysis. As an alternative to a semantic analysis, we searched for inter-document references, or direct references. Direct references are defined as terms that explicitly refer to other documents present in the inventory. We show that the grouping based on references is largely similar to the one based on semantics, but requires considerably less computational effort. In addition, the non-expert can make better use of the results, since the references are displayed as graphical webpages with hyperlinks pointing to both the referenced and the referencing document(s), together with the reason for the linkage.
Finally, we show that the combination of a database, to store the data and the (intermediate) results, and a webserver, to visualise the results, offers a powerful platform to analyse the document inventory and to share the results with all participants and collaborators involved in a data- and computation-intensive EU project, thereby guaranteeing both data and result consistency.

Index Terms: Text Mining, HPC, Semantic Analysis, client-server infrastructure

This work was performed in the framework of the SSH project funded by DG Budget Framework Service Contract BUDG06/PO/01/Lot3. We thank Dr. Vincent Duchêne, Dr. Geert Steurs, and Nathalie Pasquier of IDEA Consult, Brussels, and Dr. Nikolaos Kastrinos of the EU for their valuable contributions. MMVH is supported by research grants received from the Program Financing program (PFV/10/008) and the CREA Financing program (CREA/07/027) of the KU Leuven, the Belgian Fund for Scientific Research - Flanders (G.0588.09), the Interuniversity Attraction Poles Programme – Belgian Science Policy (IUAP P7/21), the Flemish Regional Ministry of Education (Belgium) (GOA 10/019), the Flemish Agency for Innovation by Science and Technology (TETRA project Spellbinder), and by the SWIFT prize of the King Baudouin Foundation of Belgium.

978-1-4673-1026-0/12/$31.00 © 2013 IEEE

1. INTRODUCTION

We focus on the search for inter-document references, or direct references, in documents and on how they can be used as an alternative to grouping documents on the basis of a full text semantic analysis. The latter is done by spotting discourse links. Such links are often used for grouping documents based on their semantic similarity, which is likely to point to similar topics covered in the documents. Direct references are defined as terms that explicitly refer to other documents present in the inventory.

Due to space restrictions, we discuss the semantic analysis of the inventory, and the inventory itself, only briefly here. For a more elaborate overview we refer to [1]. The inventory counts 6,919 useful documents and over 330,000 pages, and was collected by IDEA Consult Brussels with the aid of the Research Directorate General of the European Union (EU). It contains documents that belong to three main categories: research documents (54% of all documents), influential policy documents (28%), and EU policy documents (18%). The EU policy documents are predominantly 'Communications' issued by the European Commission; the EU influential policy documents are analytical, policy-supporting documents produced within or outside the EU institutions; the research documents are delivered by the research consortia that were engaged in EU-funded research and summarise the research performed in these projects. The overall objective was to evaluate the contributions of research supported through the socio-economic key action of research framework programme 5 (FP5), and priorities 7 and 8 of FP6, to the development of EU policies.

In the remainder of this manuscript, we first give an overview of the full text semantic analysis and the subsequent MDS analysis, which are used to visualise the inventory. Next, we review how we detect the direct references, and how