A Similarity Measure for Vision-Based Sign Recognition

Haijing Wang, Alexandra Stefan, and Vassilis Athitsos

Computer Science and Engineering Department, University of Texas at Arlington
Arlington, Texas, USA

Abstract. When we encounter an English word that we do not understand, we can look it up in a dictionary. However, when an American Sign Language (ASL) user encounters an unknown sign, looking up the meaning of that sign is not a straightforward process. It has recently been proposed that this problem can be addressed using a computer vision system that helps users look up the meaning of a sign. In that approach, sign lookup is treated as a video database retrieval problem. When the user encounters an unknown sign, the user provides a video example of that sign as a query, so as to retrieve the most similar signs in the database. A necessary component of such a sign lookup system is a similarity measure for comparing sign videos. Given a query video of a specific sign, the similarity measure should assign high similarity values to videos of the same sign, and low similarity values to videos of other signs. This paper evaluates a state-of-the-art video-based similarity measure called Dynamic Space-Time Warping (DSTW) for the purposes of sign retrieval. The paper also discusses how to adapt DSTW specifically so as to tolerate differences in translation and scale.

Keywords: Gesture recognition, sign language recognition, American Sign Language, Dynamic Space-Time Warping, video databases, similarity-based retrieval.

1 Introduction

When we encounter an English word that we do not understand, we can look it up in a dictionary. However, when an American Sign Language (ASL) user encounters an unknown sign, looking up the meaning of that sign is not a straightforward process. A recent approach for facilitating sign lookup is to develop a computer vision system that, given a sign as a query, computes the similarity between the query sign and every sign in a large database, and outputs the most similar matches to the query [2]. In this paper, as in [2], we use a video database that contains one or more video examples for each sign, for a large number of signs (close to 1000 in our current experiments). When the user encounters an unknown sign, the user provides a video example of that sign as a query, so as to retrieve the most similar signs in the database. The query video can either be extracted from a pre-existing video sequence, or it can be recorded directly by the user, who performs the sign of interest in front of a camera.
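To make the retrieval step concrete, the following is a minimal sketch (not taken from the paper) of similarity-based lookup over a video database. The function name dstw_similarity, the database representation, and the choice of top_k are illustrative assumptions only; the actual feature extraction and the DSTW measure are described later in the paper.

    # Minimal sketch of similarity-based sign lookup, assuming a
    # precomputed database of (sign_label, example_video) pairs and a
    # hypothetical video-to-video similarity function dstw_similarity.
    def rank_signs(query_video, database, dstw_similarity, top_k=10):
        """Return the labels of the top_k database signs most similar
        to the query video."""
        scored = [(dstw_similarity(query_video, example), label)
                  for label, example in database]
        # Higher similarity values should rank first.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [label for _, label in scored[:top_k]]

In this sketch, several example videos of the same sign may appear in the database, in which case the same label can occur more than once in the ranked result list.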