Challenges in Building of Deep Learning
Models for Glioblastoma Segmentation:
Evidence from Clinical Data
Anvar KURMUKOV a,b, Aleksandra DALECHINA c,1, Talgat SAPAROV a,d, Mikhail BELYAEV e,
Svetlana ZOLOTOVA c, Andrey GOLANOV c and Anna NIKOLAEVA c
a Institute for Information Transmission Problems of the Russian Academy of Sciences
(Kharkevich Institute), Moscow, Russia
b Higher School of Economics - National Research University, Moscow, Russia
c N. N. Burdenko National Medical Research Center of Neurosurgery, Moscow, Russia
d Moscow Institute of Physics and Technology, Moscow, Russia
e Skolkovo Institute of Science and Technology, Moscow, Russia
Abstract. In this article, we compare the performance of a state-of-the-art
segmentation network (UNet) on two different glioblastoma (GB) segmentation
datasets. Our experiments show that the same training procedure yields results almost
twice as poor (in terms of Dice score) on retrospective clinical data as on the BraTS
challenge data. We discuss possible reasons for such an outcome,
including inter-rater variability and high variability in magnetic resonance imaging
(MRI) scanners and scanner settings. The high performance of segmentation
models, demonstrated on preselected imaging data, does not bring the community
closer to using these algorithms in clinical settings. We believe that a clinically
applicable deep learning architecture requires a shift from unified datasets to
heterogeneous data.
Keywords. Deep learning, segmentation, glioblastoma, clinical data
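For reference, the Dice score used to compare the two datasets measures the overlap between a predicted mask A and a reference mask B as 2|A ∩ B| / (|A| + |B|). The snippet below is a minimal sketch of that definition, assuming binary masks flattened into sequences of voxel labels. The function name `dice_score` and the convention for two empty masks are illustrative assumptions, not the evaluation code used in the experiments.

```python
def dice_score(pred, target):
    """Dice = 2|A ∩ B| / (|A| + |B|) for two binary masks given as flat sequences."""
    pred = [bool(v) for v in pred]
    target = [bool(v) for v in target]
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    # Voxels labelled as tumour in both masks.
    intersection = sum(p and t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Assumed convention: two empty masks are treated as perfect agreement.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy one-dimensional "masks" (hypothetical data, for illustration only).
reference = [0, 1, 1, 1, 0, 0]
prediction = [0, 0, 1, 1, 1, 0]
print(dice_score(prediction, reference))
```

A score of 1 means the two masks coincide exactly and 0 means they do not overlap at all, so halving the Dice score corresponds to a substantial loss of overlap with the reference contour.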
1. Introduction
Deep learning methods have recently shown strong results in medical image segmentation.
Automatic segmentation based on convolutional neural networks (CNNs) speeds up the
delineation of both tumours and organs at risk, improving the efficiency of the
contouring process and reducing inter- and intra-rater variability [1], [2], [3].
Automatic segmentation of brain tumours, especially gliomas, is of great research
interest. Many methods for glioma segmentation have been developed within competitions
such as the Brain Tumor Segmentation Challenge (BraTS) and on unified prospective
datasets [4], [5], with some even achieving beyond human-level performance [6].
However, a large amount of retrospective data, for instance data stored in radiation
treatment planning systems, remains unused. Recently, Eijgelaar et al. demonstrated that a model
trained only on BraTS data reached a median Dice score of 0.81 on BraTS test data
1 Aleksandra Dalechina, Radiosurgery and Radiation therapy Department, N. N. Burdenko National
Medical Research Center of Neurosurgery, E-mail: avdalechina@gmail.com
Public Health and Informatics
J. Mantas et al. (Eds.)
© 2021 European Federation for Medical Informatics (EFMI) and IOS Press.
This article is published online with Open Access by IOS Press and distributed under the terms
of the Creative Commons Attribution Non-Commercial License 4.0 (CC BY-NC 4.0).
doi:10.3233/SHTI210168