Image Sequence Recognition with Active Learning Using Uncertainty Sampling

Masatoshi Minakawa, Bisser Raytchev, Toru Tamaki and Kazufumi Kaneda

Abstract. In this paper we consider the case when huge datasets need to be labeled efficiently for learning. It is assumed that the data can be naturally organized into many small groups, called chunklets, each of which contains data from a single class, and that many chunklets are available from each class. Each chunklet exhibits some of the variation typical of its class. We investigate how active learning methods based on uncertainty sampling perform in this setting, and whether any gains can be expected in comparison with random sampling. We also propose a novel strategy for selecting which chunklets to label. Experiments with face sequences containing variation in pose, expression and illumination conditions illustrate the proposed method.

I. INTRODUCTION

In pattern recognition, the predominant learning paradigm is still supervised learning, where training data sets are gathered in advance and humans provide class/category information (labels) to be used during the training process. With the advent of the Big Data era, where cameras, microphones, RFID readers and ever more varied types of mobile/ubiquitous sensor devices and networks are gathering information, often 24 hours a day, the labeling process itself might need to be reconsidered: labeling such huge amounts of data is becoming impractical and sometimes even impossible. Also, even with less-than-exabyte datasets, the labeling sometimes requires the costly time and effort of busy experts (e.g. medical doctors labeling images of different categories of tumors), and choosing which data samples to label in an optimal way becomes a necessity.
The authors are with the Department of Information Engineering, Hiroshima University, Japan (email: {minakawa, bisser, tamaki, kin}@hiroshima-u.ac.jp).

Fortunately, the information content in such datasets is often heavily redundant, which makes it possible to drastically reduce the number of required training labels by using active learning algorithms [1], [2], which selectively choose the samples (known as queries) that should be labeled.

In this paper we consider the specific case when the data is organized in groups or sets of samples of the same class/category. For example, consider the following cases: a surveillance camera is monitoring who is entering a building, or a robot is tracking the faces of humans, or trying to learn different categories of objects by observing them from different views. In all these cases it would be much more efficient to consider the sequence of images of the same face (or the same object) taken during a short time interval as the smallest unit of data, rather than treating each image separately. In this way, the variability due to changes in factors which are not directly relevant to category information (like changes in illumination or view/pose in the examples above) can be spread within the sequence, allowing the classifier to concentrate on the variability between the categories. The benefits (in terms of increased recognition rates, robustness, etc.) stemming from this strategy have been consistently confirmed by research in image recognition [3]–[7] and more generally in machine learning [8], [9].

In this paper our aim is to investigate whether active learning methods can still provide benefits, in comparison with random sampling, when the data is organized in groups or sequences, as explained above.
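As a rough illustration of the setting described above, the following Python sketch applies least-confidence uncertainty sampling at the level of whole sequences rather than single images. It is only a minimal sketch under stated assumptions: the function names `chunklet_uncertainty` and `select_query`, the input format (per-image class probabilities from some classifier), and the choice to pool a sequence by averaging its per-image probabilities are all hypothetical and are not taken from the method proposed in this paper.

```python
def chunklet_uncertainty(probs):
    """Least-confidence score for one sequence (chunklet).

    probs: list of per-image class-probability lists, one row per image.
    Assumption: the sequence is summarized by averaging its rows.
    """
    n_images = len(probs)
    n_classes = len(probs[0])
    # Pool the images of the sequence into one class distribution.
    mean_probs = [
        sum(row[c] for row in probs) / n_images for c in range(n_classes)
    ]
    # The score is high when no single class dominates the pooled distribution.
    return 1.0 - max(mean_probs)


def select_query(unlabeled):
    """Return the index of the most uncertain unlabeled sequence."""
    scores = [chunklet_uncertainty(p) for p in unlabeled]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy data: three sequences, two images each, two classes.
unlabeled = [
    [[0.9, 0.1], [0.8, 0.2]],    # classifier is confident
    [[0.55, 0.45], [0.4, 0.6]],  # classifier is ambiguous
    [[0.2, 0.8], [0.1, 0.9]],    # classifier is confident
]
query = select_query(unlabeled)  # index of the sequence to send for labeling
```

Random sampling, the baseline mentioned above, would instead pick the index uniformly at random; the sketch differs only in ranking the sequences by the classifier's pooled uncertainty before choosing.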
It would be interesting to find out how the gains in recognition rate and efficiency, which follow from better use of information through the additional constraints inherent in the grouped organization, would affect active learning methods.

II. ACTIVE LEARNING WITH IMAGE SEQUENCES

Here we illustrate the problem we deal with, using as a concrete example the case when one needs to perform face recognition from image sequences of different people. Such sequences can be easily obtained from surveillance cameras or monitoring cameras in conference rooms, etc., where it is possible to track the faces of different individuals, so that an image sequence obtained from a single track is guaranteed to contain only face images of a single person. We call such sequences "chunklets" (following [8]), and a single chunklet would typically contain the object of interest (a face here) represented under a variety of poses, illumination conditions and facial expressions.

Fig. 1 illustrates the concept, showing two face image sequences from two different people, one changing in pose, the other in facial expression (and both having changes in illumination to some degree). In Fig. 1 (a) all face images are treated as separate samples, which renders the problem quite difficult, since, for example, faces from different classes (people) but with similar pose would be nearer in feature space than faces from the same person but with different pose or expression. Once the faces are organized in chunklets, as Fig. 1 (b) shows, the problem becomes much easier due to the constraint that all images within a chunklet belong to a single class.

The organization of the data in chunklets is also very beneficial for the class labeling process, since labeling a single face now automatically determines the labels of all other faces within the same sequence, due to the chunklet constraint. (Note that in Fig.
1 only one sequence per class is shown for illustration, but typically there are many unlabeled sequences from each class, both in

Appears in Proc. IEEE International Joint Conference on Neural Networks (IJCNN2013), pp. 2531-2536, Dallas, June 10-15, 2013. The final publication is available at http://ieeexplore.ieee.org/ DOI: 10.1109/IJCNN.2013.6707060