Unsupervised Learning of Facial Landmarks based on Inter-Intra Subject Consistencies

Weijian Li*, Haofu Liao*, Shun Miao†, Le Lu† and Jiebo Luo*
* Department of Computer Science, University of Rochester, Rochester, NY, USA
† PAII. Inc., Bethesda, MD, USA
Email: * {wli69, hliao6, jluo}@cs.rochester.edu, † {shwinmiao, tiger.lelu}@gmail.com

Abstract: We present a novel unsupervised learning approach to image landmark discovery that incorporates inter-subject landmark consistencies on facial images. This is achieved via an inter-subject mapping module that transforms the original subject's landmarks based on an auxiliary subject-related structure. To recover the original subject from the transformed images, the landmark detector is forced to learn spatial locations that carry consistent semantic meanings both within the paired intra-subject images and between the paired inter-subject images. Our proposed method is extensively evaluated on two public facial image datasets (MAFL, AFLW) under various settings. Experimental results indicate that our method extracts consistent landmarks on both datasets and outperforms previous state-of-the-art methods both quantitatively and qualitatively.

I. INTRODUCTION

Facial landmark localization aims to detect a set of semantic keypoints on objects in images, such as the eyes, nose, and ears of human faces. It is an essential step in many high-level computer vision tasks [1], [2]. The traditional fully supervised approach relies on a set of annotated landmark locations labeled by human experts. These landmarks are subsequently used to train a supervised model before it can be applied to unseen images.
Although much effort has been made in this direction and promising results have been achieved [3], [4], [5], [6], [7], [8], [9], supervised models still require a large amount of human labeling to reach desirable performance. Such labeling is expensive, and the annotation process is subjective.

Another recent approach follows the unsupervised learning strategy to extract keypoints with self-supervision [10], [11], [12], [13]. Many of the existing methods apply a group of random transformations, such as rotations and translations, to the original image to generate transformed, paired images. Machine learning models are then trained to predict landmark locations under the constraint that the paired landmarks should follow the same transformation.

Despite the popularity and success of this approach, training landmark detectors with only paired images from the same subject may be insufficient to discover the inter-subject consistency among different subjects. The trained detector may be biased toward landmark locations that are meaningful for the transformation within same-subject pairs, yet make different predictions for the same landmark across different subjects.

To this end, we propose a novel unsupervised learning method for image landmark discovery that explores and integrates inter-subject consistency. Our method follows the standard equivariance approach by using image reconstruction as the supervision cue, and additionally injects a subject mapping module between the image encoder and decoder to ensure the inter-subject landmark semantics. Specifically, (1) our model first extracts the feature maps from the input image, then computes a landmark heatmap from an auxiliary subject image as the structural guidance. (2) We implement a subject mapping module to perform a structural transformation on the input image according to the structure defined by the extracted landmark heatmap of the auxiliary image.
(3) The transformed image then undergoes a second transformation guided by the landmark heatmap of a paired image of the input subject, producing the final generated image. In this manner, we adopt a cycle-like design to complete the transformation cycle between the paired intra-subject images in both directions.

By modeling an intermediate landmark-based inter-subject transformation, we force the landmark detector to extract semantically consistent facial landmark locations across different subjects in order to produce accurate landmark-based image generation. The cycle-like intra-subject translation provides additional supervision that encourages our network to learn consistent referential keypoints for both forward and backward image translations. Together, these two factors help our network not only extract discriminative landmark locations for each subject in accordance with the provided transformation, but also retain landmark semantics across different subjects.

In summary, our main contributions are as follows:
• We propose an unsupervised learning method for image landmark discovery that focuses on both inter-subject and intra-subject landmark consistencies.
• We construct the inter-subject consistency directly through landmark representations with the use of auxiliary images.
• We model the intra-subject transformation as a cycle and build a two-path, end-to-end trainable structure to improve the intra-subject landmark consistency.
• Comprehensive quantitative and qualitative evaluations on two public facial image datasets demonstrate that our method achieves consistently superior landmark localization performance.

arXiv:2004.07936v2 [cs.CV] 7 Jul 2020
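The equivariance constraint that existing self-supervised methods rely on, namely that paired landmarks should follow the same transformation as the paired images, can be illustrated with a minimal NumPy sketch. The soft-argmax readout, the synthetic Gaussian heatmaps, and the names `soft_argmax`, `gaussian_heatmap`, and `equivariance_loss` are illustrative assumptions here, not the implementation used in this work.

```python
import numpy as np


def soft_argmax(heatmap):
    """Expected (x, y) location under a softmax-normalized heatmap."""
    h, w = heatmap.shape
    probs = np.exp(heatmap - heatmap.max())  # subtract max for numerical stability
    probs /= probs.sum()
    ys, xs = np.mgrid[0:h, 0:w]
    return np.array([(probs * xs).sum(), (probs * ys).sum()])


def gaussian_heatmap(center, shape, sigma=1.5):
    """Hypothetical detector output: a log-Gaussian bump at `center` = (x, y)."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    return -((xs - center[0]) ** 2 + (ys - center[1]) ** 2) / (2.0 * sigma ** 2)


def equivariance_loss(landmarks_orig, landmarks_trans, transform):
    """Mean squared distance between transform(landmarks_orig) and landmarks_trans."""
    expected = transform(landmarks_orig)
    return float(np.mean(np.sum((expected - landmarks_trans) ** 2, axis=-1)))


# Toy setup: two landmarks on a 64x64 image, and a known translation.
shape = (64, 64)
points = np.array([[20.0, 30.0], [40.0, 25.0]])
shift = np.array([5.0, -3.0])


def translate(p):
    return p + shift


# Landmarks read out from the "original" and the "transformed" image.
pred_orig = np.stack([soft_argmax(gaussian_heatmap(p, shape)) for p in points])
pred_trans = np.stack(
    [soft_argmax(gaussian_heatmap(p, shape)) for p in translate(points)]
)

# A detector that follows the transformation incurs (almost) no penalty.
loss_consistent = equivariance_loss(pred_orig, pred_trans, translate)

# A detector whose predictions drift by 2 pixels per axis is penalized.
loss_inconsistent = equivariance_loss(pred_orig, pred_trans + 2.0, translate)
```

In this toy setting the loss only compares two views of the same subject, which is why such a constraint alone says nothing about whether a landmark keeps the same meaning across different subjects.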