A NOVEL TRANSFORMER-BASED PIPELINE FOR LUNG CYTOPATHOLOGICAL WHOLE
SLIDE IMAGE CLASSIFICATION
Gaojie Li†, Qing Liu†, Haotian Liu, Yixiong Liang‡
School of Computer Science and Engineering, Central South University, China
ABSTRACT
We propose a novel three-stage Transformer-based method-
ology for entire cytopathological whole slide image (WSI)
classification. The key idea is to leverage Transformer to ex-
tract the fine-grained lesion-level features and then progres-
sively aggregate them into intermediate-grained patch-level
features and coarse-grained WSI-level features for classifi-
cation. Specifically, we first extract multi-scale lesion fea-
tures from each patch image via Transformer-based lesion
detection, and then adaptively aggregate the extracted lesion
features into the corresponding patch feature with an MLP-
Mixer. Finally, we select the most representative patch fea-
tures and feed them into the Vision Transformer (ViT) for the
final WSI-level classification. We collect a dataset consisting
of 961 lung cytopathological WSIs of pleural effusion cytology
specimens and conduct extensive experiments on it. The
experimental results demonstrate that the proposed method
outperforms existing state-of-the-art (SOTA) methods for cy-
topathological WSI classification.
Index Terms— WSI classification, Cytopathology, Lung
Cancer, Lesion Detection, Transformer
1. INTRODUCTION
Lung cancer is one of the most common cancers and causes of
cancer-related death globally, accounting for approximately
one in ten (11.4%) diagnosed cancers and one in five (18.0%)
cancer deaths [1]. Cytopathology offers a rapid, straightforward,
and less invasive means of lung cancer screening and provides a
strong basis for the evaluation and staging of lung cancer.
Diagnosis is typically made by a cytopathologist, who examines
liquid-based preparation slides under a microscope to analyze
cell morphology. Specimens of lung cancer cells are usually
obtained from sources such as exfoliated cells in patients’
sputum, alveolar lavage fluid, or pleural effusions [2].
†Equal contribution.
‡Corresponding author: yxliang@csu.edu.cn
This work is supported in part by the High Performance Computing Center
of Central South University.
Recently, with the widespread use of whole slide imaging
(WSI), pathologists have started the transition from viewing
glass slides under the microscope to viewing them on a computer monitor.
However, manual examination of cytology slides or WSIs is
often tedious, labor-intensive, subjective, and prone to error
[3]. Automated cytopathological WSI analysis is therefore in
high demand. Only a few methods exist for computer-aided
classification of cytopathological WSIs, and most of them focus
on gynecological cervical cytology WSIs [3, 4]. Unlike natural
images, cytopathological WSIs can be as large as 100,000 × 100,000
RGB pixels, so it is impractical to classify a WSI directly.
Most existing methods resort to the following supervised
framework: 1) decomposing the gigapixel image into a set of
patches and performing fine-grained (e.g., lesion-level and
patch-level) predictions, and 2) aggregating the fine-grained
predictions into a WSI-level representation for classification [4].
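The two-step framework above can be illustrated with a minimal sketch. It is not the implementation of any cited method: the patch size of 1024 pixels, the non-overlapping stride, the 0.5 decision threshold, and the names patch_grid and classify_slide are all assumptions made for illustration, and the patch scorer stands in for a trained model. For scale, a 100,000 × 100,000 RGB image occupies 30 GB uncompressed if one assumes 8 bits per channel, which is why the slide is processed patch by patch.

```python
from typing import Callable, Iterator, List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height) in slide pixels


def patch_grid(slide_w: int, slide_h: int, patch: int = 1024,
               stride: int = 1024) -> Iterator[Box]:
    """Yield patch boxes that cover the slide; border patches are clipped.

    The default patch size and stride are illustrative assumptions.
    """
    for y in range(0, slide_h, stride):
        for x in range(0, slide_w, stride):
            yield (x, y, min(patch, slide_w - x), min(patch, slide_h - y))


def classify_slide(boxes: List[Box],
                   score_patch: Callable[[Box], float],
                   aggregate: Callable[[List[float]], float],
                   threshold: float = 0.5) -> bool:
    """Step 1: score every patch. Step 2: aggregate scores into one label."""
    scores = [score_patch(box) for box in boxes]
    return aggregate(scores) >= threshold
```

Under these assumptions, list(patch_grid(100_000, 100_000)) contains 98 × 98 = 9,604 boxes. Passing max as aggregate gives a simple hand-crafted rule, while a learned aggregator would replace that callable with a trained model.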
Early methods often exploit hand-crafted strategies
to combine the fine-grained predictions into WSI-level
results [2, 5, 6, 7]. However, due to the large number of cells,
these methods are often sensitive to unavoidable errors in
lesion prediction, resulting in poor specificity. A more
sophisticated strategy is to learn to aggregate the fine-grained
predictions into a WSI-level prediction [8, 9, 10, 11, 12, 13].
However, these methods extract lesion-level features via
convolutional neural networks (CNNs), which neglect the
relationship between cells and the patch-level context that is helpful
for lesion detection. Moreover, there are significant semantic
gaps between lesion-level and WSI-level features, and
directly combining the lesion-level features into a WSI-level
representation may not be optimal.
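The sensitivity of hand-crafted rules to per-cell errors can be made concrete with a small worked example. The sketch below is an illustration only: it assumes a hypothetical counting rule (call a slide positive once at least min_hits cells are flagged) and independent per-cell false alarms, so the false-alarm count on a normal slide is binomial. Neither the rule, the function name rule_specificity, nor the numbers in the usage note come from the cited methods.

```python
def rule_specificity(n_cells: int, fp_rate: float, min_hits: int) -> float:
    """Specificity of a counting rule on a slide with no cancer cells.

    Returns P(fewer than `min_hits` false alarms), assuming each of the
    `n_cells` normal cells is wrongly flagged independently with
    probability `fp_rate` (a simplifying assumption).
    """
    pmf = (1.0 - fp_rate) ** n_cells  # P(exactly 0 false alarms)
    total = 0.0
    for i in range(min_hits):
        total += pmf
        # Binomial recurrence: P(i + 1) = P(i) * (n - i) / (i + 1) * p / (1 - p)
        pmf *= (n_cells - i) / (i + 1) * fp_rate / (1.0 - fp_rate)
    return min(total, 1.0)
```

With a hypothetical 10,000 cells per slide and a 0.1% per-cell false-alarm rate, rule_specificity(10_000, 0.001, 1) is far below 1%, since a normal slide almost always contains at least one false alarm. Raising min_hits restores specificity but makes slides with only a few true lesions easier to miss, which is the trade-off that learned aggregation tries to avoid.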
In this paper, we present a novel three-stage pipeline for
binary (Normal and Cancer) gigapixel lung cytopathological
WSI classification, which exploits Transformer-based
structures for both lesion-level feature extraction and
multi-level aggregation. In particular, as shown in Fig. 1, we adopt
the Transformer-based Deformable DETR [15] to extract
lesion-level features from patches via supervised lesion
detection. The output tokens of Deformable DETR’s decoder
can be treated as lesion-level features. Instead of directly
aggregating lesion-level features into a WSI-level representation
as in [9, 10, 13], we introduce an auxiliary supervised patch
classification task to combine lesion-level features into a
patch-level feature via an MLP-Mixer [16] and finally combine the
top K patch-level features into WSI-feature for classifica-
ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) | 978-1-7281-6327-7/23/$31.00 ©2023 IEEE | DOI: 10.1109/ICASSP49357.2023.10095365