Fast and High-Performance Template Matching Method

Alexander Sibiryakov (asibiryakov@asl-vision.co.uk)
ASL Vision, Lewes, United Kingdom

Abstract

This paper proposes a new template matching method that is robust to outliers and fast enough for real-time operation. The template and image are densely transformed into binary code form by projecting and quantizing histograms of oriented gradients. The binary codes are matched by a generic method of robust similarity applicable to additive match measures, such as L_p- and Hamming distances. The robust similarity map is computed efficiently via a proposed Inverted Location Index structure that stores pixel locations indexed by their values. The method is experimentally validated on large image patch datasets. Challenging applications, such as intra-category object detection, object tracking, and multimodal image matching, are demonstrated.

1. Introduction

Designing a template matching method that is simultaneously fast and robust remains a challenging task, despite an extensive body of research in this area [1]. The metric properties of standard non-robust match measures have enabled many fast methods based on bounded partial correlation [2],[7],[8], but these become less effective when the matched signals contain outliers. Recent attempts to speed up robust criteria, such as M-estimators, have shown significant improvement in computation time, but only for small outlier ratios [4]. Another promising fast method, which approximates a robust match measure by correlations [19], relates more closely to image registration techniques.

Standard match measures are often formulated in vector notation, so that both the template and an image region are represented by vectors of pixel values. Such match measures could be applied to arbitrarily shaped templates, but in practice the most effective speedup techniques, including image pyramids [4], projection kernels [8],[18], integral images [20], and the FFT [17],[19], do not generalize well to non-rectangular templates or templates defined on a sparse grid of pixels.
In such cases, the rectangular enclosure would include unrelated background pixels acting as outliers, thus lowering the performance of the match measure.

Another popular direction in similarity estimation is based on local descriptors that are robust to illumination changes and local structural variations [11]-[16]. Recent studies [13],[15] have shown that local descriptors outperform standard match measures such as the L_2-distance and normalized cross-correlation (NCC); therefore, template matching algorithms can benefit from the local descriptor approach. Local descriptors, however, are high-dimensional features extracted at sparse locations, so their dense, correlation-like application is costly in terms of memory and computation time.

In this paper, we address all the issues mentioned above and present a robust method for matching arbitrarily shaped templates that is comparable to the best local descriptors in terms of matching performance. Our implementation uses only integer operations and memory manipulations, and it is simple and fast enough for real-time operation on low-cost CPUs. These usually conflicting properties are achieved through three contributions:

1. We propose a new dense image transform built on the Histogram of Oriented Gradients (HOG) approach [11] that encodes a pixel neighbourhood into a binary code by projecting and quantizing the HOG vectors. A template is matched by the Hamming distance, which is the main factor in the high speed of the algorithm.

2. We represent the class of additive match measures (e.g. L_p- and Hamming distances) in the form of a robust similarity based on bounded M-estimators. Robust similarity accounts only for similar pixels from the image and template. It can be interpreted as a voting process occurring at the pixel level, similar in spirit to Hough transform based object detectors (e.g. [5],[6]), in which local parts vote for the object location.
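To make contribution 1 concrete, the following is a minimal illustrative sketch, not the paper's exact transform: it builds an orientation histogram for a patch, projects it with a hypothetical random matrix, and quantizes the projections by sign into a binary code, so that two codes can be compared by XOR plus popcount. The bin count, bit count, and projection matrix are all assumptions for illustration.

```python
import numpy as np

def binary_code(patch, n_bins=8, n_bits=8, seed=0):
    """Illustrative sketch of a HOG-based binary transform: histogram of
    gradient orientations, random projection, sign quantization.
    The random projection matrix (fixed by `seed`) is an assumption, not
    the paper's learned/designed projection."""
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)            # unsigned orientation
    bins = np.minimum((ang / np.pi * n_bins).astype(int), n_bins - 1)
    hog = np.bincount(bins.ravel(), weights=mag.ravel(), minlength=n_bins)
    proj = np.random.default_rng(seed).standard_normal((n_bits, n_bins)) @ hog
    bits = (proj > 0).astype(np.uint8)                 # sign quantization
    return int(np.packbits(bits)[0])                   # valid for n_bits <= 8

def hamming(a, b):
    """Hamming distance between two integer codes via XOR + popcount."""
    return bin(a ^ b).count("1")
```

Because the codes are small integers, the Hamming distance reduces to a single XOR and a popcount, which is what makes a dense sliding-window evaluation cheap in integer arithmetic.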
By providing robustness to outliers, our extension improves the performance of the baseline match measure.

3. The fact that the proposed robust similarity skips many unrelated pixels from the image and template allows us to use a new image indexing structure, the Inverted Location Index (ILI). Instead of the conventional method of addressing pixels by location, the ILI addresses pixel locations by their values. It enables quick selection of matching pixels for any query pixel, and it requires only a list of template pixels, which can represent a non-rectangular shape or a set of sparse locations.

Contribution 1 is a particular template matching method. Its speedup and robustness are provided by contributions 2 and 3, which are, in fact, more generic and can be used in other contexts. The previous works most relevant to our method are: the fast template matching method based on M-estimators [4]; the HOG approach [11], proposing highly discriminative dense image features; the Census transform approach [3], proposing a dense binary code that can be matched by the Hamming distance; and the HoughTAD algorithm [23], which is dual to our ILI algorithm. We discuss the integration of these ideas into our method in the following sections.

2. Preliminaries

Let us define the template support as a set of pixel coordinates R_T = {(u_i, v_i) | i ∈ [1, N]} that is not required to be a rectangular region. Let R_I = {(x, y) | x ∈ [0, N_x - 1], y ∈ [0, N_y - 1]} be a grid of pixels supporting an image, and P = {0, …, Q-1} a set of k-bit pixel values (k = log_2 Q). Let I: R_I → P be an image, T: R_T → P a template matched to I, and S: R_I → ℤ+ ∪ {0} a match measure of I and T. In this work, we consider only
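The ILI of contribution 3, combined with the notation above, can be sketched as follows. This is a simplified assumption-laden illustration, not the paper's implementation: the template is given as a sparse list of ((u, v), value) pairs over R_T, the index maps each value to the template locations holding it, and the similarity map is filled by exact-value voting (a zero-tolerance special case; a value tolerance would widen each lookup).

```python
from collections import defaultdict
import numpy as np

def build_ili(template_pixels):
    """Inverted Location Index sketch: map each pixel value to the list of
    template locations (u, v) holding that value. `template_pixels` is a
    list of ((u, v), value) pairs, so the support R_T may be any sparse
    or non-rectangular shape."""
    ili = defaultdict(list)
    for (u, v), value in template_pixels:
        ili[value].append((u, v))
    return ili

def robust_similarity_map(image, ili, shape):
    """Voting sketch: each image pixel looks up template pixels with the
    same value and votes for the implied template origin (x-u, y-v).
    Dissimilar pixels contribute nothing, which is the source of the
    robustness to outliers."""
    h, w = image.shape
    votes = np.zeros(shape, dtype=np.int32)
    for y in range(h):
        for x in range(w):
            for (u, v) in ili.get(int(image[y, x]), ()):
                oy, ox = y - v, x - u
                if 0 <= oy < shape[0] and 0 <= ox < shape[1]:
                    votes[oy, ox] += 1
    return votes
```

The peak of the vote map marks the best candidate match: only image pixels whose value actually occurs in the template trigger any work, which is why a non-rectangular support costs nothing extra.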