Unsupervised Appearance Map Abstraction for Indoor Visual Place Recognition with Mobile Robots

Alberto Jaenal, Francisco-Angel Moreno, Javier Gonzalez-Jimenez

Abstract— Visual Place Recognition (VPR), the task of identifying the place where an image was taken, is at the core of important robotic problems such as relocalization, loop-closure detection, and topological navigation. Even indoors, the focus of this work, VPR is challenging for several reasons, including achieving real-time performance when dealing with large image databases (~10^4 images, possibly captured by different robots), and avoiding Perceptual Aliasing in environments with repetitive structures and scenes. In this paper, we tackle these issues by proposing an off-line mapping technique that abstracts a dense database of georeferenced images, given without any particular order, into a Multivariate Gaussian Mixture Model, creating soft clusters according to similarity in both pose and appearance. This abstract representation is obtained through an Expectation-Maximization (EM) algorithm and plays the role of a simplified map. Since querying this map yields a probability of being in each cluster, we exploit this "belief" within a Bayesian filter that incorporates previous query images and a topological map between clusters to perform more robust VPR. We evaluate our proposal on two different indoor datasets, demonstrating VPR precision comparable to querying the full database while incurring shorter query times and handling Perceptual Aliasing during sequential navigation.

Index Terms— Place Recognition, Map Abstraction, Appearance-based localization

I. INTRODUCTION

Visual Place Recognition (VPR) [1], [2] aims to detect the most similar place to a given query image, given a map consisting of a generally large database of georeferenced images.
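As a rough illustration of the abstraction idea outlined in the abstract, the sketch below fits a Gaussian mixture via EM to vectors that concatenate pose and an appearance descriptor, so each component is a soft "place" cluster. This is a toy sketch only, not the paper's implementation: it assumes diagonal covariances and a simple farthest-point initialization, and the function name `em_gmm_diag` and all dimensions are illustrative.

```python
import numpy as np

def em_gmm_diag(X, k, iters=100, seed=0):
    """Fit a k-component diagonal-covariance GMM to the rows of X via EM.

    Returns mixing weights pi, means mu, variances var, and the
    per-sample soft cluster assignments (responsibilities) r.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Farthest-point initialization of the means (keeps the toy EM stable)
    mu = [X[rng.integers(n)]]
    for _ in range(k - 1):
        dist = np.min(((X[:, None, :] - np.array(mu)[None]) ** 2).sum(-1), axis=1)
        mu.append(X[np.argmax(dist)])
    mu = np.array(mu)
    var = np.full((k, d), X.var(axis=0) + 1e-6)  # start from the data variance
    pi = np.full(k, 1.0 / k)                     # uniform mixing weights
    for _ in range(iters):
        # E-step: log N(x | mu_j, var_j) + log pi_j, up to an additive constant
        log_p = (-0.5 * (((X[:, None, :] - mu[None]) ** 2) / var[None]).sum(-1)
                 - 0.5 * np.log(var).sum(-1)[None]
                 + np.log(pi)[None])
        log_p -= log_p.max(axis=1, keepdims=True)  # numerical stabilization
        r = np.exp(log_p)
        r /= r.sum(axis=1, keepdims=True)          # responsibilities (soft labels)
        # M-step: re-estimate the parameters from the soft assignments
        Nk = r.sum(axis=0)
        pi = Nk / n
        mu = (r.T @ X) / Nk[:, None]
        var = (r.T @ X**2) / Nk[:, None] - mu**2 + 1e-6
    return pi, mu, var, r
```

The returned responsibilities `r` are soft cluster assignments; evaluating the same per-component densities for a new query image would yield the kind of per-cluster "belief" that the abstract describes feeding into a Bayesian filter.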
This task has received increasing attention in the robotics community during the last decades, due to its involvement in important areas such as loop-closure detection, re-localization, and topological navigation. For such tasks, the VPR database is built from geo-tagged images collected during several robot navigation sessions and encoded with some global descriptor [3], [4], either as a sequence [5] or as a set of unordered elements [6]. This database is treated as an Appearance Map (AM) of the environment. Indoors, where the robot may revisit some parts of the environment multiple times, the AM will typically include repeated views. These repetitions contribute little meaningful information to the map, yet inflate its size to typically tens of thousands of images.

This work has been funded by the Government of Spain, in part under grant FPU17/04512 and in part under the research project ARPEGGIO (PID2020-117057GB-I00), funded by the European H2020 program. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan X Pascal used for this research. The authors are members of the Machine Perception and Intelligent Robotics Group (MAPIR-UMA), within the Malaga Institute for Mechatronics Engineering & Cyber-Physical Systems (IMECH.UMA), University of Malaga, Spain. {ajaenal, famoreno, javiergonzalez}@uma.es

Fig. 1: Our work aims to abstract unordered georeferenced images (black triangles) into clusters C_j defined as multivariate Gaussian distributions (colored ellipses). These distributions represent spatial regions with visual appearance resemblance that can be interpreted as places.

Commonly, VPR is addressed on such AMs as an Image Retrieval (IR) problem that searches for the Nearest Neighbors (NNs) of the query image according to some appearance similarity measure (e.g., the Euclidean distance between descriptors). The procedure then yields an estimated location from the k most similar elements in the database.
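For concreteness, the IR baseline just described amounts to a brute-force nearest-neighbor search over global descriptors, which can be written in a few lines (a numpy-only sketch; the helper name `knn_query` and the dimensions are illustrative):

```python
import numpy as np

def knn_query(db_desc, db_poses, q_desc, k=3):
    """Return indices, poses and distances of the k database images whose
    global descriptors are closest (Euclidean) to the query descriptor."""
    dists = np.linalg.norm(db_desc - q_desc[None, :], axis=1)
    nn = np.argsort(dists)[:k]            # indices of the k nearest neighbors
    return nn, db_poses[nn], dists[nn]
```

Note that each query touches every database entry, i.e. the cost grows linearly with the map size; this is one of the limitations, discussed next, that motivates abstracting the map.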
This IR approach presents the following limitations:
- It usually follows a similarity criterion based only on appearance, disregarding the spatial aspect of VPR, and hence cannot deal with Perceptual Aliasing (i.e., places distant in pose but sharing a similar appearance). This subsequently leads to incorrect pose estimations. In traditional VPR, this issue is typically addressed by exploiting the additional topology of sequential databases [7], [8], [9], which is unavailable for unordered maps.
- The selection of the NNs follows a hard classification approach, as no information about their confidence is provided. This makes it harder for IR to recover from incorrect query results and prevents its inclusion in probabilistic frameworks.
- Querying large databases becomes highly time-consuming, often hindering the real-time operation required by mobile robotics applications.
- The result is a set of discrete, unrelated candidates over which reliable pose interpolation is not possible, so post-processing [10], [11], [12] is commonly required to obtain a refined estimate of the image pose.

Focusing on performing robust VPR indoors with mobile robots, in this work we propose to abstract off-line the information stored in large databases of geo-tagged images