RESIDUAL SWIN TRANSFORMER UNET WITH CONSISTENCY REGULARIZATION FOR
AUTOMATIC BREAST ULTRASOUND TUMOR SEGMENTATION
Xianwei Zhuang¹, Xiner Zhu¹, Haoji Hu*¹, Jincao Yao², Wei Li², Chen Yang², Liping Wang², Na Feng², Dong Xu²
¹College of Information Science and Electronic Engineering, Zhejiang University, Hangzhou, China
²Cancer Hospital of the University of Chinese Academy of Sciences (Zhejiang Cancer Hospital), Hangzhou, China
ABSTRACT
Automatic Breast Ultrasound (ABUS) image segmentation is of great significance for breast cancer diagnosis and treatment. However, like most medical datasets, ABUS image datasets are often small-scale and severely imbalanced, which makes ABUS image segmentation challenging. To address this problem, we propose the Residual Swin Transformer Unet with Consistency Regularization (RSTUnet-CR), which makes full use of non-lesion and unlabeled images for high-precision tumor segmentation on ABUS images. We design a consistency-regularization decoder to reconstruct the input image, enabling the model to learn effectively from non-lesion and unlabeled data. This reconstruction task makes the model better suited to imbalanced medical image datasets. In addition, observing that ABUS images exhibit global semantic correlation, we establish long-range dependencies across the image with a residual Swin Transformer block to improve segmentation performance. We evaluate our method on an ABUS dataset collected from 256 subjects and demonstrate its superiority over other state-of-the-art methods on this imbalanced dataset.
Index Terms— Automatic Breast Ultrasound, Medical Image Segmentation, Consistency Regularization, Transformer, Convolutional Neural Networks
1. INTRODUCTION
Early detection of breast cancer plays an important role in the treatment of breast tumors [1]. Automated Breast Ultrasound (ABUS) has become one of the most important and effective modalities for the early detection of breast cancer [2]. However, due to the limited number and low quality of ultrasound images and the large variations in breast structure among patients [3], segmenting breast lesions in ABUS images is a difficult challenge. Consequently, there is an urgent need for automatic, accurate, and robust methods for breast lesion segmentation from ABUS images.
*Corresponding Author
Based on convolutional neural networks (CNNs), several state-of-the-art algorithms such as UNet and R-CNN have been proposed in recent years to segment breast lesions from ABUS images [4] [5].
Most state-of-the-art methods for ABUS image segmentation rely on label-based supervised learning, which often performs poorly on imbalanced, small-scale medical datasets. Meanwhile, obtaining images with lesion labels is often expensive in medical imaging. Like most medical image datasets, our ABUS dataset is small-scale and severely imbalanced, with a ratio of lesion to non-lesion images of about 1:19. According to existing work [6], unlike supervised learning, which suffers large performance degradation under such conditions, self-supervised contrastive learning performs stably and well even with severe dataset imbalance. Several works leverage unlabeled data to improve performance, including consistency-based regularization [7], dual-task consistency [8], and adversarial constraints [9]. These works inspire us to make full use of non-lesion images through self-supervised consistency learning to obtain better segmentation performance.
In this paper, we propose a contrastive-learning-based method for ABUS tumor segmentation. We construct a consistency regularization decoder (CRD) to reconstruct images, combining fully supervised with self-supervised learning to make our model better suited to practical training scenarios. The additional reconstruction task makes full use of the imbalanced data to learn prior knowledge in the pre-training stage, and it also serves as a regularization term during formal training to constrain the encoder.
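The paper does not state the exact loss formulation at this point, but the training objective described above can be sketched as a supervised segmentation loss plus a weighted reconstruction regularizer, where unlabeled and non-lesion images contribute only through the reconstruction term. The following minimal numpy sketch is illustrative only: the Dice segmentation loss, MSE reconstruction loss, and the weighting coefficient `lam` are assumptions, not the authors' exact choices.

```python
import numpy as np

def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss between a predicted probability map and a binary mask."""
    inter = np.sum(pred * target)
    return 1.0 - (2.0 * inter + eps) / (np.sum(pred) + np.sum(target) + eps)

def reconstruction_loss(recon, image):
    """Mean-squared error between the reconstructed image and the input."""
    return np.mean((recon - image) ** 2)

def total_loss(seg_pred, seg_mask, recon, image, lam=0.5, labeled=True):
    """Combined objective (illustrative): segmentation + lam * reconstruction.

    For unlabeled or non-lesion images (labeled=False), only the
    reconstruction regularizer is applied, so these images still
    provide a training signal to the shared encoder.
    """
    l_rec = reconstruction_loss(recon, image)
    if not labeled:
        return lam * l_rec
    return dice_loss(seg_pred, seg_mask) + lam * l_rec
```

In this sketch, a perfect reconstruction of an unlabeled image drives its loss to zero, while labeled images are penalized by both terms; the shared encoder therefore receives gradients from the full, imbalanced dataset.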
In addition, we note that medical images often exhibit long-range dependencies that are important for the effective prediction of lesion areas [10]. Several state-of-the-art methods such as TransUnet [11], MedT [10], and Swin-Unet [12] show the great potential of transformers [13] in image segmentation. These works inspire us to use a transformer to encode the location information of dependent blocks, so we construct the residual Swin Transformer block (RSTB) to
2022 IEEE International Conference on Image Processing (ICIP) | 978-1-6654-9620-9/22/$31.00 ©2022 IEEE | DOI: 10.1109/ICIP46576.2022.9897941