Stochastic Fine-grained Labeling of Multi-state Sign Glosses for Continuous Sign Language Recognition

Zhe Niu [0000-0001-5833-1142] and Brian Mak
Department of Computer Science & Engineering
The Hong Kong University of Science & Technology
{zniu,mak}@cse.ust.hk

Abstract. In this paper, we propose novel stochastic modeling of various components of a continuous sign language recognition (CSLR) system based on the transformer encoder and connectionist temporal classification (CTC). Most importantly, we model each sign gloss with multiple states, where the number of states is a categorical random variable that follows a learned probability distribution, providing stochastic fine-grained labels for training the CTC decoder. We further propose a stochastic frame dropping mechanism and a gradient stopping method to address the severe overfitting problem that arises when training the transformer model with the CTC loss. These two methods also significantly reduce the training computation, in terms of both time and space. We evaluate our model on popular CSLR datasets and show its effectiveness compared to state-of-the-art methods.

1 Introduction

Sign language is the primary communication medium among the deaf. It conveys meaning through gestures, facial expressions, upper-body posture, etc., and has linguistic rules that differ from those of spoken languages. Sign language recognition (SLR) is the task of converting a sign language video into the corresponding sequence of (sign) glosses (i.e., the "words" of a sign language), which are the basic units of sign language semantics. Both isolated sign language recognition (ISLR) [15] and continuous sign language recognition (CSLR) have been attempted. ISLR classifies a gloss-wise segmented video into its corresponding gloss, whereas CSLR transcribes a sentence-level sign video into its corresponding sequence of glosses. The latter task is more difficult and is the focus of this paper.
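The stochastic frame dropping mechanism mentioned in the abstract can be illustrated with a minimal sketch: during training, each input frame is independently dropped with some probability, shrinking the sequence the model must process. The function name, drop probability, and list-of-frames representation below are illustrative assumptions, not the paper's exact formulation:

```python
import random

def stochastic_frame_drop(frames, drop_prob=0.5, rng=random):
    """Randomly drop frames from a video sequence during training.

    frames: a sequence of frame objects (e.g., feature tensors).
    drop_prob: probability that each frame is independently dropped.
    Returns the surviving frames in their original temporal order.
    """
    # Keep each frame independently with probability 1 - drop_prob.
    kept = [f for f in frames if rng.random() >= drop_prob]
    # Guard: always keep at least one frame so the sequence is non-empty.
    if not kept:
        kept = [frames[rng.randrange(len(frames))]]
    return kept
```

Besides acting as a regularizer against overfitting, dropping roughly half the frames also cuts the time and memory cost of the transformer encoder, whose self-attention scales quadratically with sequence length.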
Most modern CSLR architectures contain three components: a visual model, a contextual model, and an alignment model. The visual model first extracts visual features from the input video frames, from which the contextual model further mines the correlations between glosses. Convolutional neural networks (CNNs) and recurrent neural networks (RNNs) are the architectures commonly used for the visual and contextual models, respectively. In CSLR, sign glosses occur (time-wise) monotonically with the corresponding events in the