Supplementary Material: Referring to Objects in Videos using Spatio-Temporal Identifying Description

Peratham Wiriyathammabhum ♠♦, Abhinav Shrivastava ♠♦, Vlad I. Morariu, Larry S. Davis ♠♦
University of Maryland: Department of Computer Science, UMIACS
peratham@cs.umd.edu, abhinav@cs.umd.edu, morariu@umd.edu, lsd@umiacs.umd.edu

Figure 1: The number of words in a sentence in STV-IDL is normally distributed, with an average of 22.65 words.

A Dataset Statistics

The length of referring expressions. We first analyze the length of referring expressions. We split each referring expression into words using the Natural Language Toolkit (NLTK) (Bird et al., 2009); a minimal sketch of this analysis is given at the end of this section. Figure 1 shows the distribution of the number of words in each referring expression. STV-IDL contains referring expressions of varying lengths, and the average length (22.65 words) is longer than that of most existing video referring expression datasets (Gao et al., 2017; Yamaguchi et al., 2017; Hendricks et al., 2017; Krishna et al., 2017) because we enforce the sentence syntax and encourage conjunctions.

Figure 2: The number of target objects in STV-IDL is at least 2, with an average of 2.85 objects per video.

The referring expressions provided by the annotators as speakers are subjective. However, as long as we correctly understand what they refer to, and the process leads to mutual understanding and successful communication, the grounding process is valid. We are aware that giving an example sentence may limit the variation of the referring expressions. However, we want to make sure that our constraint is satisfied by every sentence, and we also collect only a few referring expressions per referred object in a temporal interval. Besides, the collected sentences usually come from different annotators because of the randomization in our web interface, so there is still some constraint-satisfying variation within our few sentences per referred object.

Compared to the ReferIt game (Kazemzadeh et al., 2014), the annotators are the speakers and we are the listeners. However, we do not collect the data in a gamification setting, in which referring expressions are short and concise, similar to verbal utterances. The reason is that it is not clear how a speaker would compare and choose words to contrast the referent from other distractors in the temporal dimension. It is also not clear how the listener would comprehend the referring expression in such a setting for untrimmed video, except by relying on the time stamp to fast-forward to the event itself. Our data collection pipeline is more similar to the Google-Refexp dataset (Mao et al., 2016) which
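The following is a minimal sketch of the length analysis referred to above: tokenizing each referring expression with NLTK and summarizing the word-count distribution shown in Figure 1. The input file name "stv_idl_expressions.txt" (one referring expression per line) is a hypothetical placeholder; the actual STV-IDL annotation format may differ.

```python
# Sketch of the referring-expression length analysis (cf. Figure 1).
# Assumes a hypothetical plain-text file with one expression per line;
# this is not the official STV-IDL release format.
import collections
from statistics import mean

import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)      # tokenizer models for word_tokenize
nltk.download("punkt_tab", quiet=True)  # required by newer NLTK releases

with open("stv_idl_expressions.txt", encoding="utf-8") as f:
    expressions = [line.strip() for line in f if line.strip()]

# Split each referring expression into words with NLTK (Bird et al., 2009).
lengths = [len(word_tokenize(expr)) for expr in expressions]

print(f"expressions: {len(lengths)}")
print(f"average length: {mean(lengths):.2f} words")

# Word-count histogram, the data behind a Figure 1-style plot.
for n_words, count in sorted(collections.Counter(lengths).items()):
    print(f"{n_words:3d} words: {count}")
```

The same pattern (read annotations, aggregate a per-item count, report the mean) would also give the objects-per-video statistic summarized in Figure 2.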