Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages 184–189, Denver, Colorado, June 4-5, 2015. © 2015 Association for Computational Linguistics

ASAP-II: From the Alignment of Phrases to Text Similarity

Ana O. Alves¹˒², David Simões¹
¹ Polytechnic Institute of Coimbra, Portugal
aalves@isec.pt, a21210644@alunos.isec.pt

Hugo Gonçalo Oliveira², Adriana Ferrugento²
² CISUC, University of Coimbra, Portugal
hroliv@dei.uc.pt, aferr@student.dei.uc.pt

Abstract

This paper describes the second version of the ASAP system¹ and its participation in SemEval-2015 Task 2a on Semantic Textual Similarity (STS). Our approach is based on computing the WordNet semantic relatedness and similarity of phrases from distinct sentences. We also apply topic modeling to get topic distributions over a set of sentences, as well as some linguistic heuristics. As a special addition for this task, we retrieve named entities and compound nouns from DBpedia. All these features are used to feed a regression algorithm that learns the STS function.

This work is licensed under a Creative Commons Attribution 4.0 International Licence. Page numbers and proceedings footer are added by the organisers. Licence details: http://creativecommons.org/licenses/by/4.0/

1 Introduction

Semantic Textual Similarity (STS), which is the task of computing the similarity between two sentences, has received an increasing amount of attention in recent years (Agirre et al., 2012; Agirre et al., 2013; Marelli et al., 2014a; Agirre et al., 2014; Agirre et al., 2015). Our contribution to this challenge is to learn the STS function for English texts. ASAP-II is an evolution of the ASAP system (Alves et al., 2014), which participated in SemEval 2014 - Task 1: Evaluation of compositional distributional semantic models on full sentences through semantic relatedness and textual entailment.
Although that task had a different goal from STS, which goes beyond relatedness and entailment, and the datasets differ, with pairs of short texts instead of controlled sentences, we believe that, rather than specifying rules, constraints and lexicons manually, a system can be adapted from one task to the other by automatically acquiring linguistic knowledge through machine learning (ML) methods. For this purpose, we apply some pre-processing techniques to the training set in order to extract different types of features. On the semantic side, we compute the similarity/relatedness between phrases using known measures over WordNet (Miller, 1995).

Considering the problem of modeling a text corpus to find short descriptions of documents, we aim at efficient processing of large collections while preserving the essential statistical relationships that are useful for similarity judgment. Therefore, we also apply topic modeling in order to get the topic distribution over each sentence set. These features are then used to feed an ensemble ML algorithm for learning the STS function. Our system is developed entirely as an independent Java software package, publicly available² for training and testing on given and new datasets containing pairs of texts.

The remainder of this paper comprises four sections. Section 2 introduces the fundamental concepts needed to understand the proposed approach, which is described in Section 3. Section 4 presents some results of our approach, using not only the SemEval-2015 dataset but also datasets from previous tasks. Finally, Section 5 presents some conclusions and complementary work to be done in the near future.

¹ This work was supported by the InfoCrowds project - FCT-PTDC/ECM-TRA/1898/2012
² See https://github.com/examinus-/ASAP