Understanding Inexplicit Utterances Using Vision for Helper Robots

Zaliyana Mohd Hanafiah, Chizu Yamazaki, Akio Nakamura, and Yoshinori Kuno
Department of Information and Computer Sciences, Saitama University
255 Shimo-Okubo, Sakura-ku, Saitama-shi, Saitama 338-8570 JAPAN
{zaliyana, yamazaki, nakamura, kuno}@cv.ics.saitama-u.ac.jp

Abstract

Speech interfaces should be capable of dealing with inexplicit utterances such as ellipsis and deixis, since these are common phenomena in our daily conversation. Their resolution using context and a priori knowledge has been investigated in the fields of natural language and speech understanding. However, there are utterances that cannot be understood by such symbol processing alone. In this paper, we consider inexplicit utterances caused by the fact that humans have vision. If we are certain that the listener shares some visual information, we often omit that information from our utterances or mention it only ambiguously. We propose a method of understanding speech with such ambiguities using computer vision. The method tracks the human’s gaze direction and detects objects in that direction. It also recognizes the human’s actions. Based on this visual information, it understands the human’s inexplicit utterances. Experimental results show that the method helps to realize human-friendly speech interfaces.

1. Introduction

Speech is a promising means of human interface for helper robots, for which needs are growing in the coming aging society. Thus, robots with speech interfaces have been investigated [1][2]. Speech interfaces should be capable of dealing with inexplicit utterances such as ellipsis and deixis, since these are common phenomena in our daily conversation. Their resolution using context and a priori knowledge has been investigated in the fields of natural language and speech understanding [3][4]. However, in the case of robots, we must also consider inexplicit utterances that cannot be understood by such symbol processing alone. We humans have vision. Thus, in our speech, we may omit, or mention only ambiguously, things that we think the listener already knows through vision. For example, we may say, “Get that for me,” even though the object indicated by “that” was not mentioned before. Since the object stands out in the scene and the listener’s gaze suggests that he/she is looking in its direction, we assume that he/she is aware of the object. To be user-friendly, the robot should be able to behave in the same way. In this paper, we propose a method of understanding speech with such ambiguities using computer vision.

Grice proposed the conversational maxims [5], one of which is that conversation is a collaboration between a speaker and a listener in which both offer necessary and sufficient relevant information briefly and clearly. Based on this, we assume that information about things that are important but not mentioned clearly in speech can be obtained by vision. There are, of course, various kinds of inexplicit utterances other than this vision-derived one. We assume that these are resolved by other research results, and that only vision-derived inexplicitness remains in our speech input.

We are developing a helper robot that brings the object that the user requests through speech [6]. In this paper, we present a speech interface for the robot that allows the user to make utterances with such vision-derived inexplicitness.

2. Inexplicit Utterances

As mentioned in the Introduction, visual information shared by a speaker and a listener may cause inexplicit utterances.
Among such utterances, we deal with the ellipsis and deixis that may appear in the speech interface for the helper robot [6]. In this restricted application domain, human utterances can be regarded mostly as requests that the user makes to the robot. Such a request consists of a verb and an object. (The subject is always the robot.) The verb indicates an action that the human wants the robot to take, and the object indicates the target of that action. The human may state each of the verb and the object definitely, ambiguously (deixis), or not at all (ellipsis). Thus, utterances are classified into nine cases. We give an utterance example for each case below; a brief sketch of this classification follows the case examples. Note that our system uses Japanese. In the following examples, we give direct translations of the Japanese and, where necessary, show in parentheses the English that would be used in such situations.

Case 1. Verb omitted; Object omitted. Examples are greetings such as, “Hello.”
Case 2. Verb omitted; Object ambiguous. “That one.”
Case 3. Verb omitted; Object definite. “That apple.”
Case 4. Verb ambiguous; Object omitted. “Make (it) to four,” said while watching the television. (“Channel four.”)
Case 5. Verb ambiguous; Object ambiguous. “Do that.”
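As a compact illustration of this verb/object classification, the following minimal Python sketch (not part of the original system; all names are hypothetical) maps the status of the verb and object slots onto the case numbering used above:

from enum import Enum

class Slot(Enum):
    # How a slot (verb or object) appears in the user's request.
    OMITTED = 0     # ellipsis: not said at all
    AMBIGUOUS = 1   # deixis, e.g. "that one", "do that"
    DEFINITE = 2    # stated explicitly, e.g. "that apple"

def classify_request(verb: Slot, obj: Slot) -> int:
    # Map the (verb, object) status pair onto Case 1 .. Case 9.
    return 3 * verb.value + obj.value + 1

# "That apple."  -> verb omitted, object definite   -> Case 3
assert classify_request(Slot.OMITTED, Slot.DEFINITE) == 3
# "Do that."     -> verb ambiguous, object ambiguous -> Case 5
assert classify_request(Slot.AMBIGUOUS, Slot.AMBIGUOUS) == 5

The slot ordering (omitted, ambiguous, definite) follows the case numbering in the text, so the remaining cases with a definite verb would fall into Cases 7 through 9 under the same scheme.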