A NOVEL TRANSFORMER-BASED PIPELINE FOR LUNG CYTOPATHOLOGICAL WHOLE SLIDE IMAGE CLASSIFICATION

Gaojie Li, Qing Liu, Haotian Liu, Yixiong Liang

School of Computer Science and Engineering, Central South University, China

ABSTRACT

We propose a novel three-stage Transformer-based methodology for classifying entire cytopathological whole slide images (WSIs). The key idea is to leverage Transformers to extract fine-grained lesion-level features and then progressively aggregate them into intermediate-grained patch-level features and coarse-grained WSI-level features for classification. Specifically, we first extract multi-scale lesion features from each patch image via Transformer-based lesion detection, and then adaptively aggregate the extracted lesion features into the corresponding patch feature with an MLP-Mixer. Finally, we select the most representative patch features and feed them into a Vision Transformer (ViT) for the final WSI-level classification. We collect a dataset of 961 lung cytopathological WSIs of pleural effusion cytology specimens and conduct extensive experiments on it. The experimental results demonstrate that the proposed method outperforms existing state-of-the-art (SOTA) methods for cytopathological WSI classification.

Index Terms: WSI classification, Cytopathology, Lung Cancer, Lesion Detection, Transformer

1. INTRODUCTION

Lung cancer is one of the most common cancers and causes of cancer-related death globally, accounting for approximately one in ten (11.4%) of diagnosed cancers and one in five (18.0%) of deaths [1]. Cytopathology offers rapid, straightforward, and less invasive lung cancer screening and provides a strong basis for the evaluation and staging of lung cancer. Diagnosis is typically made by a cytopathologist who examines liquid-based preparation slides under a microscope to analyze cell morphology.
Specimens of lung cancer cells are usually obtained from patients' sputum exfoliated cells, alveolar lavage fluid, or pleural effusions [2]. Recently, with the widespread use of whole slide imaging (WSI), pathologists have started the transition from viewing glass slides under the microscope to viewing them on a computer monitor. However, manual examination of cytology slides or WSIs is often tedious, labor-intensive, subjective, and prone to error [3]. Automated cytopathological WSI analysis is therefore in great demand.

†Equal contribution.
‡Corresponding author: yxliang@csu.edu.cn
This work is supported in part by the High Performance Computing Center of Central South University.

Only a few methods exist for computer-aided classification of cytopathological WSIs, and most of them focus on gynecological cervical cytology WSIs [3, 4]. Unlike natural images, cytopathological WSIs can be as large as 100,000 × 100,000 RGB pixels, so it is impractical to classify a WSI directly. Most existing methods resort to the following supervised framework: 1) decomposing the gigapixel image into a set of patches and performing fine-grained (e.g., lesion-level and patch-level) predictions, and 2) aggregating the fine-grained predictions into a WSI-level representation for classification [4].

The early methods often exploit hand-crafted strategies to combine the fine-grained predictions into WSI-level results [2, 5, 6, 7]. However, due to the large number of cells, these methods are often sensitive to unavoidable errors in lesion prediction, resulting in poor specificity. A more sophisticated strategy is to learn to aggregate the fine-grained predictions into a WSI-level prediction [8, 9, 10, 11, 12, 13]. However, these methods extract lesion-level features via convolutional neural networks (CNNs), which neglect the relationships between cells and the patch-level context that is helpful for lesion detection.
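As an illustration of the generic two-step framework described above (tile the gigapixel image, then aggregate fine-grained predictions), the following minimal Python sketch may help. It is not the method proposed in this paper: the patch size of 1,000 pixels, the non-overlapping stride, and the top-k mean aggregation rule are all assumptions chosen for illustration, and the function names are hypothetical.

```python
from typing import Iterator, List, Tuple


def patch_grid(width: int, height: int, patch: int, stride: int) -> Iterator[Tuple[int, int]]:
    """Yield the top-left (x, y) corner of every patch that fits fully inside the image."""
    for y in range(0, height - patch + 1, stride):
        for x in range(0, width - patch + 1, stride):
            yield (x, y)


def count_patches(width: int, height: int, patch: int, stride: int) -> int:
    """Closed-form patch count, so a gigapixel grid need not be enumerated."""
    if width < patch or height < patch:
        return 0
    return ((width - patch) // stride + 1) * ((height - patch) // stride + 1)


def topk_mean(scores: List[float], k: int) -> float:
    """Hand-crafted aggregation (an assumed example rule): mean of the k highest patch scores."""
    if not scores or k < 1:
        raise ValueError("need at least one score and k >= 1")
    top = sorted(scores, reverse=True)[:k]
    return sum(top) / len(top)


if __name__ == "__main__":
    # Assumed settings: 100,000 x 100,000 WSI, 1,000-pixel non-overlapping patches.
    print(count_patches(100_000, 100_000, patch=1_000, stride=1_000))
    # Tiny grid to show the enumeration order (row by row).
    print(list(patch_grid(4, 4, patch=2, stride=2)))
    # Slide-level score from hypothetical per-patch scores.
    print(topk_mean([0.25, 1.0, 0.5], k=2))
```

Under these assumed settings, a 100,000 × 100,000 slide yields 10,000 patches, which shows why per-patch errors accumulate when a fixed rule such as the top-k mean is used for aggregation.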
Moreover, there are significant semantic gaps between lesion-level and WSI-level features, and directly combining the lesion-level features into a WSI-level representation may not be optimal.

In this paper, we present a novel three-stage pipeline for binary (Normal and Cancer) gigapixel lung cytopathological WSI classification, which exploits Transformer-based structures for both lesion-level feature extraction and multi-level aggregation. In particular, as shown in Fig. 1, we adopt the Transformer-based Deformable DETR [15] to extract lesion-level features from patches via supervised lesion detection. The output tokens of Deformable DETR's decoder can be treated as lesion-level features. Instead of directly aggregating lesion-level features into a WSI-level representation as in [9, 10, 13], we introduce an auxiliary supervised patch classification task to combine lesion-level features into a patch-level feature via an MLP-Mixer [16] and finally combine the top K patch-level features into a WSI-level feature for classifica-
ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) | 978-1-7281-6327-7/23/$31.00 ©2023 IEEE | DOI: 10.1109/ICASSP49357.2023.10095365