IEEE TRANSACTIONS ON IMAGE PROCESSING, 2013 1 Image Set based Face Recognition using Self-regularized Non-negative Coding and Adaptive Distance Metric Learning Ajmal Mian, Yiqun Hu, Richard Hartley, Fellow, IEEE, and Robyn Owens Abstract—Simple nearest neighbor classiﬁcation fails to exploit the additional information in image sets. We propose self- regularized non-negative coding to deﬁne between set distance for robust face recognition. Set distance is measured between the nearest set points (samples) that can be approximated from their orthogonal basis vectors as well as from the set samples under the respective constraints of self-regularization and non-negativity. Self-regularization constrains the orthogonal basis vectors to be similar to the approximated nearest point. The non-negativity constraint ensures that each nearest point is approximated from a positive linear combination of the set samples. Both constraints are formulated as a single convex optimization problem and the accelerated proximal gradient method with linear-time Euclidean projection is adapted to efﬁciently ﬁnd the optimal nearest points between two image sets. Using the nearest points between a query set and all the gallery sets as well as the active samples used to approximate them, we learn a more discriminative Mahalanobis distance for robust face recognition. The proposed algorithm works independently of the chosen features and has been tested on gray pixel values and Local Binary Patterns (LBP). Experiments on three standard datasets show that the proposed method consistently outperforms existing state-of-the-art. Index Terms—Image set classiﬁcation, face recognition, non- negative coding, distance metric learning. I. I NTRODUCTION This paper deals with the problem of face recognition from image sets where classiﬁcation is based on a collection of query images rather than a single one. Each training class is represented by one or more labeled image sets referred to as the gallery. During classiﬁcation, a query (test) image set is assigned the identity of the nearest class using some distance criterion. More often than not, multiple face images of a person are available for training and testing. These images may come from multiple surveillance cameras, personal photo albums or online resources and correspond to different facial appearances under varying pose, illumination and expressions. Within a set, the common semantic relationship is shared across individual face images since they all belong to the Copyright (c) 2013 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to pubs-permissions@ieee.org. A. Mian and R. Owens are with the School of Computer Science and Software Engineering, The University of Western Australia, Crawley, WA 6009, Australia, (e-mails ajmal.mian/robyn.owens@uwa.edu.au). Y. Hu is with the Paypal Innovation Team, Singapore, (e-mail yiqun.hu@gmail.com) R. Hartley is with the Research School of Engineering, Aus- tralian National University, Canberra, ACT 0200, Australia, (e-mail richard.hartley@anu.edu.au) Fig. 1. Illustration of unconstrained and constrained face reconstruction. Top row: Class mean and orthogonal basis form unlikely faces in unconstrained reconstruction and more likely faces when self-regularization constraint is im- posed on the coefﬁcients of the basis vectors. Second row: Linear combination of set samples result is unlikely faces when one of the coefﬁcients is negative but more likely faces when all α coefﬁcients are constrained to be positive. same person. These facial images complement the appearance variations of the person under different conditions. While image sets offer more opportunities for face recognition, they also pose new challenges to the classiﬁcation task. Image sets contain more information that is useful for accurate classiﬁcation. However, they introduce a challenging problem of image set modeling that can exploit their internal semantic relationships. Single sample based classiﬁcation models cannot exploit these semantic relationships. Video based recognition is a special case of image set classi- ﬁcation where a temporal relationship between the consecutive frames is assumed. In set classiﬁcation, temporal relationships between the images of a set may not exist. For example, the images in a photo album are not temporally contiguous. Simple extension of nearest neighbor approaches to set classiﬁcation does not exploit the semantic relationships be- tween the set samples. Therefore, some techniques model sets as linear subspaces or manifolds [37] (approximated as a collection of subspaces) and measure distances between them. While the nearest neighbor approach is too rigid, the subspace distance model is too ﬂexible in the sense that it allows matches between any combination of the subspace basis vectors. However, not all combinations of the basis vectors form a meaningful face. In fact, most unconstrained combinations result in impossible faces that are even outside the human face space. This is illustrated in Fig. 1. To ensure that the distance between two subspaces is measured at points that correspond to possible faces of the respective classes, This is the author's version of an article that has been published in this journal. Changes were made to this version by the publisher prior to publication. The final version of record is available at http://dx.doi.org/10.1109/TIP.2013.2282996 Copyright (c) 2013 IEEE. Personal use is permitted. For any other purposes, permission must be obtained from the IEEE by emailing pubs-permissions@ieee.org.