RESIDUAL SWIN TRANSFORMER UNET WITH CONSISTENCY REGULARIZATION FOR AUTOMATIC BREAST ULTRASOUND TUMOR SEGMENTATION

Xianwei Zhuang 1, Xiner Zhu 1, Haoji Hu *1, Jincao Yao 2, Wei Li 2, Chen Yang 2, Liping Wang 2, Na Feng 2, Dong Xu 2

1 College of Information Science and Electronic Engineering, Zhejiang University, Hangzhou, China
2 Cancer Hospital of the University of Chinese Academy of Sciences (Zhejiang Cancer Hospital), Hangzhou, China

ABSTRACT

Automatic Breast Ultrasound (ABUS) image segmentation is of great significance for breast cancer diagnosis and treatment. However, like most medical datasets, ABUS image datasets are often small-scale and severely imbalanced, which makes ABUS image segmentation challenging. To solve this problem, we propose the Residual Swin Transformer Unet with Consistency Regularization (RSTUnet-CR), which can make full use of non-lesion and unlabeled images for high-precision tumor segmentation on ABUS images. We design a consistency-regularization decoder to reconstruct the input image, which learns well from non-lesion and unlabeled data. The reconstruction task makes the model more suitable for imbalanced medical image datasets. In addition, observing that ABUS images exhibit global semantic correlation, we establish long-distance dependencies in images with the residual Swin Transformer block to improve segmentation performance. We evaluate our method on an ABUS dataset collected from 256 subjects and demonstrate its superiority over other state-of-the-art methods on this imbalanced dataset.

Index Terms— Automatic Breast Ultrasound, Medical Image Segmentation, Consistency Regularization, Transformer, Convolutional Neural Networks

1. INTRODUCTION

Early detection of breast cancer plays an important role in the treatment of breast tumors [1].
Automated Breast Ultrasound (ABUS) has become one of the most important and effective modalities for the early detection of breast cancer [2]. However, due to the limited number and low quality of ultrasound images and the large variations in breast structure among patients [3], segmenting breast lesions in ABUS images is a difficult challenge. Consequently, there is an urgent need for automatic, accurate, and robust methods for breast lesion segmentation from ABUS images.

*Corresponding Author

Based on convolutional neural networks (CNNs), several state-of-the-art algorithms such as UNet and R-CNN have been proposed in recent years to segment breast lesions from ABUS images [4] [5].

Most state-of-the-art methods for ABUS image segmentation are label-based supervised learning approaches, which often struggle to perform well on imbalanced, small-scale medical datasets. Meanwhile, it is often expensive to obtain images with lesion labels in medical image datasets. Like most medical image datasets, our ABUS dataset is small-scale and severely imbalanced, with a ratio of lesion images to non-lesion images of about 1:19. According to existing work [6], unlike supervised learning, which suffers large performance degradation, self-supervised contrastive learning can perform stably and well under severe dataset imbalance. Several works leverage unlabeled data to improve performance, including consistency-based regularization [7], dual-task consistency [8], and adversarial constraints [9]. These works inspire us to make full use of non-lesion images through self-supervised consistency learning to obtain better segmentation performance.

In this paper, we propose a contrastive-learning-based method for ABUS tumor segmentation.
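To make the idea of consistency-based regularization on unlabeled data concrete, the following is a minimal NumPy sketch, not the authors' implementation: the network's prediction on an unlabeled image is encouraged to match its prediction on a perturbed copy. The toy `model`, the additive-noise perturbation, and `noise_scale` are hypothetical placeholders for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def model(x, w):
    # Toy stand-in for a segmentation network: a single linear layer
    # followed by a sigmoid (hypothetical placeholder, not the paper's model).
    return 1.0 / (1.0 + np.exp(-x @ w))

def consistency_loss(x_unlabeled, w, noise_scale=0.1):
    # Mean-squared difference between predictions on an unlabeled image
    # and a noise-perturbed copy -- the generic consistency-regularization
    # idea referenced in the text (e.g. [7]). Requires no labels.
    x_perturbed = x_unlabeled + noise_scale * rng.standard_normal(x_unlabeled.shape)
    p_clean = model(x_unlabeled, w)
    p_noisy = model(x_perturbed, w)
    return float(np.mean((p_clean - p_noisy) ** 2))

x = rng.standard_normal((4, 16))   # 4 unlabeled "images", 16 features each
w = rng.standard_normal((16, 1))
loss = consistency_loss(x, w)
```

Because the loss depends only on the inputs and not on annotations, it can be evaluated on the abundant non-lesion and unlabeled images in an imbalanced dataset.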
We construct a consistency regularization decoder (CRD) to reconstruct images, which combines fully-supervised with self-supervised learning and makes our model more suitable for practical training scenarios. The extra reconstruction task makes full use of imbalanced data to learn prior knowledge in the pre-training stage, and it also serves as a regularization term during formal training to constrain the encoder.

In addition, we note that medical images often exhibit long-distance dependencies that are important for the effective prediction of lesion areas [10]. Several state-of-the-art methods such as TransUnet [11], MedT [10], and Swin-Unet [12] show the great potential of the transformer [13] in image segmentation. These works inspire us to use a transformer to encode the location information of the dependent block, so we construct the residual Swin Transformer block (RSTB) to

978-1-6654-9620-9/22/$31.00 ©2022 IEEE. 2022 IEEE International Conference on Image Processing (ICIP). DOI: 10.1109/ICIP46576.2022.9897941
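The two ingredients described in this introduction, a segmentation loss regularized by a reconstruction term and a residual wrapper around a transformer block, can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the paper's implementation: the soft Dice loss, the MSE reconstruction term, the weighting `lam`, and the placeholder `f` for a Swin Transformer block are all hypothetical choices.

```python
import numpy as np

def dice_loss(pred, target, eps=1e-6):
    # Soft Dice loss, a common segmentation objective (hedged example;
    # the paper does not specify its exact segmentation loss here).
    inter = np.sum(pred * target)
    return 1.0 - (2.0 * inter + eps) / (np.sum(pred) + np.sum(target) + eps)

def reconstruction_loss(recon, image):
    # Pixel-wise MSE between the reconstruction-decoder output and the
    # input image -- the self-supervised term a CRD-style branch could use.
    return float(np.mean((recon - image) ** 2))

def total_loss(pred, target, recon, image, lam=0.5):
    # Supervised segmentation loss plus the reconstruction term acting as
    # a regularizer; `lam` is a hypothetical weighting, not from the paper.
    return dice_loss(pred, target) + lam * reconstruction_loss(recon, image)

def residual_block(x, f):
    # Residual wrapper: output = x + f(x), the generic pattern behind a
    # residual transformer block; `f` stands in for a Swin Transformer block.
    return x + f(x)
```

On unlabeled images, only the reconstruction term is evaluated; on labeled images, both terms contribute, which is one simple way to combine fully-supervised and self-supervised signals in a single objective.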