Surgical Tool Attributes from Monocular Video

Suren Kumar 1, Madusudanan Sathia Narayanan 1, Pankaj Singhal 2, Jason J. Corso 3 and Venkat Krovi 4

Abstract— HD video from the (monocular or binocular) endoscopic camera provides a rich real-time sensing channel from the surgical site to the surgeon console in various Minimally Invasive Surgery (MIS) procedures. A real-time framework for video understanding is therefore critical for tapping into the rich information content provided by this non-invasive and well-established digital video-streaming modality. While contemporary research focuses on enhancing aspects such as tool tracking within these challenging visual scenes, we consider the associated problem of using that rich (but often compromised) streaming visual data to discover the underlying semantic attributes of the tools. Directly analyzing the surgical videos to extract such attributes online can aid decision-making and feedback. We propose a novel probabilistic attribute labelling framework with Bayesian filtering that identifies the associated semantics (open/closed, stained with blood, etc.) in order to ultimately give semantic feedback to the surgeon. Our robust video-understanding framework overcomes many of the challenges (tissue deformations, image specularities, clutter, tool occlusion due to blood and/or organs) encountered under realistic in-vivo surgical conditions. Specifically, this manuscript presents a rigorous experimental analysis of the resulting method with varying parameters and different visual features on a data corpus consisting of real surgical procedures performed on patients with the da Vinci Surgical System [8].

I. INTRODUCTION

Increasingly, surgical procedures are being performed using MIS techniques, which rely on the endoscopic camera to provide a rich real-time sensing channel from the surgical site to the surgeon console.
However, the rich information content (often already in digital form as real-time HD video) is underutilized in current surgical procedures. The "last mile" still remains the analog rendering of the digital image stream back to the eyeballs of the surgical team. A real-time framework for video capture, processing and understanding (building upon the non-invasive and well-established digital endoscopic video modality) is critical to reaching the goal of intelligent intra-operative surgical assistance. Significant research efforts (surveyed later) are already underway that focus on locating and tracking critical elements (e.g., tools) within the visual scene to enhance situational awareness. In contrast, in this manuscript we consider the associated problem of using that rich streaming visual data to discover the underlying semantic attributes of the tools. Semantic attributes (in the domain of minimally invasive surgery) can broadly include information about the state of the surgical tools, the environment, and tool-environment interactions.

*This work is supported in part by Kaleida Health Western New York, the UB Foundation Bruce Holm Fund and NSF IIS-1319084.
1 Suren Kumar and Madusudanan Sathia Narayanan are PhD candidates in the Department of Mechanical Engineering, University at Buffalo, SUNY, Buffalo, NY 14260, USA; email: {surenkum,ms329}@buffalo.edu
2 Pankaj Singhal is the Director of Robotic Surgery, Kaleida Health Western New York, Buffalo, NY 14214, USA; email: psinghal@buffalo.edu
3 Jason J. Corso is an Associate Professor in the Department of Computer Science and Engineering, University at Buffalo, SUNY, Buffalo, NY 14260, USA; email: jcorso@buffalo.edu
4 Venkat Krovi is an Associate Professor in the Department of Mechanical Engineering and an Adjunct Professor in the Department of Obstetrics and Gynecology, University at Buffalo, SUNY, Buffalo, NY 14260, USA; email: vkrovi@buffalo.edu
Knowledge of semantic attributes (such as the tool's open/closed operational state, blood-stained condition, whether it is holding tissue, the state of cauterizing tools, etc.) would be an essential step on the path towards autonomous robotic surgery. Furthermore, this type of attribute labelling can potentially provide an additional layer of safety for state-of-the-art human-in-the-loop surgical robotic systems. Semantic attribute feedback in real time would help avoid critical failures and possible surgical errors due to a surgeon's lack of experience and/or situational awareness, or due to inappropriate operation of, or communication between, master-slave type systems [10], [16].

With the commercial success of the Intuitive robotic surgery platform [8], a variety of surgical robotic systems with widely varying architectures and instrumentation, such as RAVEN [19] and DLR MiroSurge [9], are being developed. However, as teleoperated devices, all of them are fitted with one (or more) camera(s) to provide the surgeon with visual feedback [26]. Hence, our semantic attribute understanding framework (built solely upon sensed visual information) would have wide applicability. Gaining semantic knowledge directly from actual videos in surgical settings is important from multiple perspectives, and more specifically for identifying surgical gestures and providing context-specific surgical feedback.

The principal focus of this work is to estimate two specific (binary) semantic attributes of tools, namely the open/closed state and the blood-stain condition, using only a monocular video stream. Figure 1 summarizes our algorithm. Given a video with bounding boxes for the tools, we first extract visual features and adapt the probabilistic Support Vector Machine (SVM) formulation to learn a visual attribute scoring function.
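The per-frame scoring step can be sketched as follows. This is a minimal illustration of a Platt-scaled (probabilistic) SVM attribute scorer, not the paper's actual pipeline: the RBF kernel, the synthetic 16-dimensional feature vectors, and the class layout are all assumptions standing in for the real visual features extracted from the tool bounding box.

```python
import numpy as np
from sklearn.svm import SVC

# Toy stand-in for visual features of tool bounding boxes:
# class 0 (e.g. "closed") and class 1 (e.g. "open") drawn from
# two Gaussians -- real features would come from the video frames.
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(0.0, 1.0, (50, 16)),
                     rng.normal(2.0, 1.0, (50, 16))])
y_train = np.array([0] * 50 + [1] * 50)

# Probabilistic SVM: with probability=True, sklearn fits Platt
# scaling so decision values map to class probabilities, giving a
# per-frame attribute scoring function rather than a hard label.
scorer = SVC(kernel="rbf", probability=True, random_state=0)
scorer.fit(X_train, y_train)

# Score one (synthetic) frame: P(attribute = open | features).
frame_feature = rng.normal(2.0, 1.0, (1, 16))
p_open = scorer.predict_proba(frame_feature)[0, 1]
print(f"P(open | frame) = {p_open:.2f}")
```

The probabilistic output (rather than a hard SVM decision) is what makes the downstream Bayesian treatment natural: each frame contributes a likelihood instead of a brittle binary vote.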
We feed the output of this function into a novel Bayesian tracking framework to maintain accurate and smooth estimates of the semantic attributes. To the best of our knowledge, this is the first work that performs online probabilistic semantic attribute labelling and tracking from visual data alone.
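To make the filtering idea concrete, here is a minimal sketch (not the paper's actual model) of a two-state Bayes filter that smooths noisy per-frame attribute probabilities for one binary attribute. The transition probability `STAY` and the example score sequence are illustrative assumptions.

```python
STAY = 0.95  # assumed P(attribute unchanged between frames)

def bayes_update(belief, p_obs_open):
    """One predict + update step for P(attribute = open)."""
    # Predict: the attribute persists with probability STAY,
    # and flips otherwise.
    pred = STAY * belief + (1.0 - STAY) * (1.0 - belief)
    # Update: reweight by the per-frame observation probability
    # (e.g. the probabilistic SVM score) and normalize.
    num = p_obs_open * pred
    den = num + (1.0 - p_obs_open) * (1.0 - pred)
    return num / den

belief = 0.5  # uninformative prior over {open, closed}
svm_scores = [0.9, 0.8, 0.2, 0.85, 0.9]  # noisy per-frame P(open)
for p in svm_scores:
    belief = bayes_update(belief, p)
print(f"Filtered P(open) = {belief:.2f}")
```

Note how the single outlier score (0.2) only dents the belief rather than flipping the estimate, which is the smoothing behavior that raw per-frame classification lacks.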