Challenges in Building of Deep Learning Models for Glioblastoma Segmentation: Evidence from Clinical Data

Anvar KURMUKOV a,b, Aleksandra DALECHINA c,1, Talgat SAPAROV a,d, Mikhail BELYAEV e, Svetlana ZOLOTOVA c, Andrey GOLANOV c and Anna NIKOLAEVA c

a Institute for Information Transmission Problems of the Russian Academy of Sciences (Kharkevich Institute), Moscow, Russia
b Higher School of Economics - National Research University, Moscow, Russia
c N. N. Burdenko National Medical Research Center of Neurosurgery, Moscow, Russia
d Moscow Institute of Physics and Technology, Moscow, Russia
e Skolkovo Institute of Science and Technology, Moscow, Russia

Abstract. In this article, we compare the performance of a state-of-the-art segmentation network (UNet) on two different glioblastoma (GB) segmentation datasets. Our experiments show that the same training procedure yields results almost twice as poor, in terms of Dice score, on retrospective clinical data as on the BraTS challenge data. We discuss possible reasons for this outcome, including inter-rater variability and high variability in magnetic resonance imaging (MRI) scanners and scanner settings. The high performance of segmentation models demonstrated on preselected imaging data does not bring the community closer to using these algorithms in clinical settings. We believe that a clinically applicable deep learning architecture requires a shift from unified datasets to heterogeneous data.

Keywords. Deep learning, segmentation, glioblastoma, clinical data

1. Introduction

Recently, deep learning methods have shown strong results in medical image segmentation. Automatic segmentation based on convolutional neural networks (CNNs) speeds up the delineation of both tumours and organs at risk, improving the efficiency of the contouring process and reducing inter- and intra-rater variability [1], [2], [3]. Automatic segmentation of brain tumours, especially gliomas, is of great research interest.
Many methods for glioma segmentation have been developed for competitions such as the Brain Tumor Segmentation Challenge (BraTS) and on unified prospective datasets [4], [5], some of them even surpassing human-level performance [6]. However, a large amount of retrospective data, for instance data stored in radiation treatment planning systems, remains unused. Recently, Eijgelaar et al. demonstrated that a model trained only on BraTS data reached a median Dice score of 0.81 on BraTS test data

1 Aleksandra Dalechina, Radiosurgery and Radiation Therapy Department, N. N. Burdenko National Medical Research Center of Neurosurgery, E-mail: avdalechina@gmail.com

Public Health and Informatics, J. Mantas et al. (Eds.)
© 2021 European Federation for Medical Informatics (EFMI) and IOS Press.
This article is published online with Open Access by IOS Press and distributed under the terms of the Creative Commons Attribution Non-Commercial License 4.0 (CC BY-NC 4.0).
doi:10.3233/SHTI210168