A Surface and Appearance-based Next Best View System for Active Object Recognition

Pourya Hoseini a, Shuvo Kumar Paul b, Mircea Nicolescu and Monica Nicolescu
Department of Computer Science and Engineering, University of Nevada, Reno, U.S.A.
a https://orcid.org/0000-0003-3473-9906
b https://orcid.org/0000-0003-1791-3925

Keywords: Object Recognition, Active Vision, Next Best View, View Planning, Foreshortening, Classification Dissimilarity, Robotics.

Abstract: Active vision represents a set of techniques that attempt to incorporate new visual data by employing camera motion. Object recognition is one of the main areas where active vision can be particularly beneficial. In cases where recognition is uncertain, new perspectives of an object can help improve the quality of observation and, potentially, the recognition itself. A key question, however, is from where to look at the object. Current approaches mostly consider creating an occupancy grid of known object voxels, or imagining the entire object shape and appearance, to determine the next camera pose. Another current trend is to show every possible object view to the vision system at training time. These methods typically require multiple observations or considerable training data and time to function effectively. In this paper, a next best view system is proposed that takes into account only the initial surface shape and appearance of the object and subsequently determines the next camera pose. It is therefore a single-shot method that does not require any specially made training dataset. Experimental validation demonstrates the feasibility of the proposed method in finding good viewpoints, while showing significant improvements in recognition performance.

1 INTRODUCTION

An intelligent entity must sense its environment in order to act in an informed manner. One of the main perception mediums is vision. Despite being a heavily used sensing modality, a vision mechanism may face difficulties in capturing the most useful views for the specific task at hand. There can be many reasons for such issues, including occlusion, lack of discriminative features due to bad lighting or unfavorable viewpoints of the object, or insufficient image resolution. Active vision is an answer to those situations: it tries to enhance the performance of the vision system by dynamically incorporating new visual sensory sources. Application domains of active vision include three-dimensional (3D) object reconstruction and object recognition, the latter being the focus of our work. Active Object Recognition (AOR) has many uses in robotics (Paul et al., 2020), vision-based surveillance, and related fields. AOR procedures normally involve uncertainty evaluation, camera movement, matching, and information fusion (Hoseini et al., 2019a; Hoseini et al., 2019b). If the current recognition is not certain enough, a camera is moved to observe the object from another viewpoint, and the current and new information, usually classification decisions, from the matched objects in the views is fused to obtain improved results (a minimal sketch of this loop is given below).

Regarding the camera movement, a primary question to answer is where and in what orientation a camera should be placed to fetch the next best view (NBV) of the object. Finding the next best view is an ill-posed task, because the current viewpoint of the object is usually not sufficient to deterministically deduce the object's shape and appearance from its other facets.
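To make the generic AOR loop described above concrete, the following is a minimal Python sketch. The entropy-based uncertainty test and the averaging of class posteriors are illustrative choices, not the method proposed in this paper, and the callables classify, capture, and plan_next_view are hypothetical placeholders for a classifier, a camera interface, and an NBV policy.

```python
import numpy as np

def entropy(posterior):
    """Shannon entropy of a class posterior; higher means more uncertain."""
    p = np.clip(posterior, 1e-12, 1.0)
    return float(-np.sum(p * np.log(p)))

def fuse(posteriors):
    """Fuse per-view classification decisions by averaging posteriors
    (one simple choice; products or weighted voting are alternatives)."""
    fused = np.mean(posteriors, axis=0)
    return fused / fused.sum()

def active_recognition_loop(classify, capture, plan_next_view,
                            initial_pose, uncertainty_threshold=1.0,
                            max_views=3):
    """Generic AOR loop: classify the object and, while the fused decision
    is still uncertain, move the camera to a planned next view and fuse
    the new classification decision with the previous ones."""
    pose = initial_pose
    image = capture(pose)
    posteriors = [classify(image)]
    for _ in range(max_views - 1):
        if entropy(fuse(posteriors)) < uncertainty_threshold:
            break                               # confident enough; stop moving
        pose = plan_next_view(image, pose)      # the NBV question: where next?
        image = capture(pose)
        posteriors.append(classify(image))
    fused = fuse(posteriors)
    return int(np.argmax(fused)), fused         # predicted class and confidence
```

The quality of plan_next_view is precisely what an NBV method determines; the rest of the loop is common to most AOR systems.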
Approaches to NBV are generally shaped by the specific application in which they are employed. In 3D reconstruction applications, an NBV method that plans a chain of views aimed at exploring unobserved voxels of an object may be the ideal option. In contrast, the next views in an object recognition application should present new discriminative features by which recognition performance can be enhanced. The number of planned views for object recognition is also intended to be as low as possible, to reduce the energy and time spent physically moving the cameras. A deep belief network is presented in (Wu et al.,