Off-topic Spoken Response Detection with Word Embeddings

Su-Youn Yoon, Chong Min Lee, Ikkyu Choi, Xinhao Wang, Matthew Mulholland, Keelan Evanini
Educational Testing Service, 660 Rosedale Road, Princeton, NJ, USA
syoon, clee001, ichoi001, xwang002, mmulholland, kevanini@ets.org

Abstract

In this study, we developed an automated off-topic response detection system as a supplementary module for an automated proficiency scoring system for non-native English speakers' spontaneous speech. Given a spoken response, the system first generates an automated transcription using an ASR system trained on non-native speech, and then generates a set of features to assess the response's similarity to the question. In contrast to previous studies, which required a large set of training responses for each question, the proposed system requires only the question text; this increases the practical impact of the system, since new questions can be added to a test dynamically. However, questions are typically short, and the traditional approach based on exact word matching does not perform well on them. To address this issue, we used a set of features based on neural embeddings and a convolutional neural network (CNN). A system based on the combination of all features achieved an accuracy of 87% on a balanced dataset, substantially higher than the accuracy of a baseline system using question-based vector space models (49%). Moreover, this system nearly matched the accuracy of a vector space model built from a large set of responses to the test questions (93%).

1. Introduction

This study aims to develop an off-topic detection system as part of an automated oral proficiency scoring system. The automated scoring system was designed to score spoken responses to a test of English speaking proficiency. When students are fatigued, unmotivated, or distracted, they may not respond seriously.
For instance, students may recite their response to a previous question (referred to as off-topic responses hereafter). Such responses often have sub-optimal characteristics that make it difficult for the automated scoring system to provide a valid score. To address this issue, the automated scoring system can employ a "filtering model" (hereafter, FM) to filter out off-topic responses. Once such problematic responses are filtered out, the remaining responses can be scored by the automated scoring system without concern about scoring errors resulting from them.

Filtering off-topic responses is concerned with issues related to topicality. However, these issues are found at different ranks of [1]'s hierarchy of five similarity levels (unrelated, on the general topic, on the specific topic, same facts, and copied). In particular, off-topic responses belong to the unrelated group. In this study, we focus on off-topic responses and develop an automated FM that detects them using semantic similarity measures. Notably, we use only the question text and do not use sample responses to the test questions.

With the introduction of the FM, the overall architecture of our automated scoring system is as follows. For a given spoken response, the system performs speech recognition and speech processing. Given the ASR output and the speech signal, it computes a set of linguistic features assessing pronunciation, prosody, vocabulary, and grammar skills. In addition, document similarity features are generated based on the word hypotheses and content models. The FM then uses the similarity features to filter out off-topic responses. Finally, the remaining responses are scored by the automated scoring model. In this study, we focus only on the FM component of the overall architecture.

2. Relevant studies

Previous studies, such as [2, 3, 4], focused on scoring highly restricted speech (e.g., read-aloud tasks) and detected off-topic responses using features derived from the automatic speech recognition (ASR) system. This approach achieved good performance for restricted speech, but it is not appropriate for tasks that elicit unconstrained, spontaneous speech.

[5] applied document similarity features to detect gaming responses for an English speaking proficiency test that elicits spontaneous speech from non-native speakers. They developed a set of similarity features between a test response and a large number of question-specific responses (sample responses provided to the same question as the test response) using a vector space model (VSM) and word overlap. These features were used to identify gaming responses with topic problems (e.g., question repetition and off-topic responses) and showed promising performance.

Approaches like those above require a sizable amount of response data for each question, and collecting question-specific data is expensive and difficult. To address this issue, [6] developed a system for detecting off-topic essays without the need for question-specific responses; the system was based on similarity features between the question text and the test response. The performance of this system was lower than that of the benchmark system trained on question-specific responses, but it achieved a substantial improvement over a majority-class baseline. [7] further improved this system by expanding the question texts to include synonyms, inflected forms, and words distributionally similar to the question content. The system of [7] showed a substantial improvement for questions consisting of only a small amount of text.

More recently, various approaches based on deep neural networks (DNNs) and word embeddings trained on large corpora have shown promising performance in document similarity detection (e.g., [8, 9, 10]).
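The question-based similarity features discussed above can be illustrated with a minimal sketch: cosine similarity between term-frequency vectors of the question text and a response. This is a simplified stand-in for the VSM features in [5, 6], not the authors' implementation, and the question and response strings are invented examples.

```python
from collections import Counter
from math import sqrt

def tf_cosine(doc_a: str, doc_b: str) -> float:
    """Cosine similarity between the term-frequency vectors of two documents."""
    a, b = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

question = "describe your favorite book and explain why you like it"
on_topic = "my favorite book is a detective novel and i like it because of the plot"
off_topic = "i usually take the bus to school every morning"

# An on-topic response shares content words with the question; an off-topic
# response typically shares few or none.
print(tf_cosine(question, on_topic) > tf_cosine(question, off_topic))  # True
```

Because this feature relies on exact word matching, it degrades when the question text is short, which is precisely the limitation the present study targets.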
In contrast to traditional similarity features, which are limited by a reliance on exact word matching (e.g., content vector analysis), these new approaches have the advantage of capturing topically relevant words that

Copyright 2017 ISCA
INTERSPEECH 2017, August 20-24, 2017, Stockholm, Sweden
http://dx.doi.org/10.21437/Interspeech.2017-388
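This advantage of embeddings over exact matching can be sketched with toy word vectors: two texts sharing no words at all can still receive a high similarity score when their words lie close in embedding space. The 3-dimensional vectors below are hypothetical values chosen for illustration; a real system would use embeddings (e.g., word2vec or GloVe) trained on a large corpus.

```python
import numpy as np

# Hypothetical toy embeddings; real systems use vectors trained on large corpora.
EMB = {
    "exam":   np.array([0.90, 0.10, 0.05]),
    "test":   np.array([0.85, 0.15, 0.05]),
    "quiz":   np.array([0.80, 0.20, 0.10]),
    "banana": np.array([0.05, 0.10, 0.95]),
    "fruit":  np.array([0.05, 0.20, 0.90]),
}

def avg_embedding(words):
    """Represent a text as the average of its word embeddings."""
    return np.mean([EMB[w] for w in words if w in EMB], axis=0)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

question = ["exam"]
related = ["test", "quiz"]        # no exact word overlap with the question
unrelated = ["banana", "fruit"]

# Exact word matching would score both responses 0 against the question;
# the embedding representation separates them.
print(cosine(avg_embedding(question), avg_embedding(related)))
print(cosine(avg_embedding(question), avg_embedding(unrelated)))
```

The same idea underlies the neural-embedding features of the proposed system: topically relevant responses score high even when they paraphrase the question rather than repeat its words.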