Automatically Constructing a Corpus of Sentential Paraphrases

William B. Dolan and Chris Brockett
Natural Language Processing Group
Microsoft Research
Redmond, WA, 98052, USA
{billdol,chrisbkt}@microsoft.com

Abstract

An obstacle to research in automatic paraphrase identification and generation is the lack of large-scale, publicly-available labeled corpora of sentential paraphrases. This paper describes the creation of the recently-released Microsoft Research Paraphrase Corpus, which contains 5801 sentence pairs, each hand-labeled with a binary judgment as to whether the pair constitutes a paraphrase. The corpus was created using heuristic extraction techniques in conjunction with an SVM-based classifier to select likely sentence-level paraphrases from a large corpus of topic-clustered news data. These pairs were then submitted to human judges, who confirmed that 67% were in fact semantically equivalent. In addition to describing the corpus itself, we explore a number of issues that arose in defining guidelines for the human raters.

1 Introduction

The Microsoft Research Paraphrase Corpus (MSRP), available for download at http://research.microsoft.com/research/nlp/msr_paraphrase.htm, consists of 5801 pairs of sentences, each accompanied by a binary judgment indicating whether human raters considered the pair of sentences to be similar enough in meaning to be considered close paraphrases. This data has been published for the purpose of encouraging research in areas relating to paraphrase and sentential synonymy and inference, and to help establish a discourse on the proper construction of paraphrase corpora for training and evaluation. It is hoped that by releasing this corpus, we will stimulate the publication of similar corpora by others and help move the field toward adoption of a shared dataset that will permit useful comparisons of results across research efforts.

2 Motivation

The success of Statistical Machine Translation (SMT) has sparked a successful line of investigation that treats paraphrase acquisition and generation essentially as a monolingual machine translation problem (e.g., Barzilay & Lee, 2003; Pang et al., 2003; Quirk et al., 2004; Finch et al., 2004). However, a lack of standardly-accepted corpora on which to train and evaluate models is a major stumbling block to the successful application of SMT models or other machine learning algorithms to paraphrase tasks. Since paraphrase is not apparently a common “natural” task (under normal circumstances people do not attempt to create extended paraphrase texts), the field lacks a large readily identifiable dataset comparable to, for example, the Canadian Hansard corpus in SMT that can serve as a standard against which algorithms can be trained and evaluated.

What paraphrase data is currently available is usually too small to be viable for either training or testing, or exhibits narrow topic coverage, limiting its broad-domain applicability. One class of paraphrase data that is relatively widely available is multiple translations of sentences in a second language. These, however, tend to be rather restricted in their domain (e.g. the ATR English-Chinese paraphrase corpus, which con-