Video2Report: A Video Database for Automatic Reporting of Medical Consultancy Sessions

Laura Schiphorst, Metehan Doyran, Sabine Molenaar, Albert Ali Salah, Sjaak Brinkkemper
Dept. of Information and Computing Sciences, Utrecht University, Utrecht, The Netherlands

Abstract— In some countries, such as the Netherlands, the regulation of medical consultations requires general practitioners to prepare a detailed report of each consultation for accountability purposes. Automatic report generation during medical consultations could simplify this time-consuming procedure. However, action recognition for automatic reporting of medical actions is not a well-researched area, and there are no publicly available medical video databases. In this paper we present Video2Report, the first publicly available medical consultancy video database of interactions between a general practitioner and a patient. After reviewing the standard medical procedures for general practitioners, we select the most important actions to record, and have an actual medical professional perform the actions and train further actors to create the resource. The actions, as well as the area of investigation during each action, are annotated separately. We describe the collection setup, provide several action recognition baselines based on OpenPose feature extraction, and make the database, evaluation protocol and all annotations publicly available. The database contains 192 sessions recorded with up to three cameras, with 332 single-action clips and 119 multiple-action sequences. While the dataset is too small for end-to-end deep learning, we believe it will be useful for developing approaches to investigate doctor-patient interactions and for medical action recognition.

I. INTRODUCTION

In the Dutch healthcare system, care providers (CPs) are obliged to accurately report on their encounters with patients and on the treatments given in an electronic medical record (EMR).
These EMRs are designed to improve communication between CPs and capture previous diseases, treatments, and observations [8], [22]. Moreover, they serve to ensure compliance with guidelines and can support medical decisions [2]. Even though EMRs support medical care for patients, accurately documenting all aspects of healthcare is time-consuming, since it is done manually by the CPs. Administration tasks in healthcare are estimated to consume over 100,000 full-time positions in long-term care in the Netherlands, at a total cost exceeding 5 billion euros per year (https://www.berenschot.nl/actueel/2016/juli/administratieve-taken/). A more efficient and less time-consuming way of reporting medical consultations is therefore necessary. Automatically constructing and storing medical reports in the EMR may be a solution. Recognising actions from videos could aid in automatically constructing these reports, and recent developments in action recognition have shown promising results in other fields.

This study aims to recognise actions from videos of medical consultations. To do so, a suitable dataset of medical actions is required. However, to our knowledge, no datasets of one-on-one interactions between general practitioners (GPs) and their patients are publicly available. Therefore, in this work, we design and collect Video2Report, a database of medical actions recorded under conditions similar to real consultation scenarios.

(This work was supported by the Care2Report project. This is the uncorrected author proof. Copyright with IEEE; please cite as: Schiphorst, L., M. Doyran, A.A. Salah, S. Molenaar, S. Brinkkemper, "Video2Report: A Video Database for Automatic Reporting of Medical Consultancy Sessions," 15th IEEE International Conference on Automatic Face and Gesture Recognition, Buenos Aires, 2020.)
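To make the idea of pose-based recognition of consultation actions concrete, the following is a minimal sketch (our own illustration, not the paper's implementation): a clip is summarised by the average per-frame motion magnitude of each 2D joint, as could be extracted with OpenPose, and classified with a nearest-centroid rule. The feature choice, the classifier, and the two action classes in the demo are all illustrative assumptions.

```python
import numpy as np

def clip_features(keypoints):
    """keypoints: (frames, joints, 2) array of 2D joint positions per frame,
    e.g. stacked OpenPose BODY_25 output for one clip.
    Returns one value per joint: its mean per-frame displacement magnitude."""
    steps = np.diff(keypoints, axis=0)                 # (frames-1, joints, 2)
    return np.linalg.norm(steps, axis=2).mean(axis=0)  # (joints,)

def nearest_centroid(train_feats, train_labels, test_feat):
    """Label the test clip with the class whose mean feature vector is closest."""
    classes = sorted(set(train_labels))
    centroids = {c: np.mean([f for f, l in zip(train_feats, train_labels) if l == c],
                            axis=0)
                 for c in classes}
    return min(classes, key=lambda c: np.linalg.norm(test_feat - centroids[c]))

# Synthetic demo with two hypothetical action classes: near-motionless clips
# versus clips where every joint follows a random walk.
rng = np.random.default_rng(0)
still = [rng.normal(0.0, 0.01, (30, 25, 2)) for _ in range(5)]
moving = [np.cumsum(rng.normal(0.0, 0.1, (30, 25, 2)), axis=0) for _ in range(5)]
feats = [clip_features(c) for c in still + moving]
labels = ["still"] * 5 + ["moving"] * 5
test_clip = np.cumsum(rng.normal(0.0, 0.1, (30, 25, 2)), axis=0)
print(nearest_centroid(feats, labels, clip_features(test_clip)))  # prints "moving"
```

A real pipeline would of course replace the synthetic keypoints with per-frame detector output and the centroid rule with a stronger temporal model; the sketch only shows how pose features turn clips into fixed-length vectors for classification.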
This work is part of the Care2Report project (www.care2report.nl) [16], which aims to automatically generate and store medical reports in the EMR by combining automatic speech recognition and action recognition to identify the relevant medical actions performed during consultancy sessions. We focus here primarily on human-human interactions between GPs and their patients, rather than on teams of specialists operating simultaneously. More precisely, we aim to recognise the most frequently performed medical actions during a consultation, such as blood pressure measurement and auscultation of the heart and lungs. The automatic action recognition results can then be converted into a text-based report draft, which is completed (and corrected) by the practitioner before the relevant information is added to the EMR.

II. RELATED WORK

A. Databases

The action recognition literature has seen rapid progress in recent years. Early databases contained simple, single-person actions against isolated backgrounds [21]; later databases increased the variance in recording conditions and included simple two-person actions, like fighting and meeting [6], [1], or a combination of single- and multi-person actions [18], [13], [17]. As described in a recent survey, the number of classes has grown to hundreds of actions in the larger action recognition databases, and the number of clips can exceed a million [26]. However, these large sets are typically harvested from multimedia websites such as YouTube, where the recording conditions cannot be easily controlled, the actions are not scripted, and the annotations are costly to create. Among the existing interaction datasets, it is rare to have multiple simultaneous viewpoints (many are harvested from movies and thus have a single, often moving camera), and