Data-Efficient Instance Segmentation with a Single GPU

Pengyu Chen 2,*, Wanhua Li 1,*, Jiwen Lu 1
1 Department of Automation, Tsinghua University, China
2 Beijing University of Posts and Telecommunications
chenpengyu2018@bupt.edu.cn  li-wh17@mails.tsinghua.edu.cn  lujiwen@tsinghua.edu.cn
* Equal contribution

Abstract

Not everyone has access to hundreds of GPUs or TPUs, so a more economical approach is needed. In this paper, we introduce the data-efficient instance segmentation method we used in the 2021 VIPriors Instance Segmentation Challenge. Our solution is a modified version of Swin Transformer, built on the powerful mmdetection toolbox. To cope with the scarcity of data, we train our model with data augmentation, including random flipping and multiscale training. During inference, multiscale fusion is used to boost performance. We use only a single GPU throughout the training and testing stages. In the end, our team, THU IVG 2018, achieved an AP@0.50:0.95 of 0.366 on the test set, which is competitive with other top-ranking methods while using only one GPU. Moreover, our method achieved an AP@0.50:0.95 (medium) of 0.592, ranking second among all contestants. Overall, our team ranked third among all contestants, as announced by the organizers.

1. Introduction

Instance segmentation is a popular research field in machine learning and computer vision due to its broad applications. However, building a large-scale database is costly, since a large number of annotations over many images are needed. Meanwhile, not everyone has access to hundreds of GPUs or TPUs, so a more economical approach is needed. The "2021 VIPriors Instance Segmentation Challenge", hosted at ICCV 2021, encourages researchers of any background to participate: no giant GPU clusters are required, nor is long training time.
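As a concrete illustration of the augmentations above, a minimal mmdetection-style training pipeline with random flipping and multiscale training might look as follows. This is a sketch, not our released configuration; the specific scale values and normalization constants are assumptions.

```python
# Hypothetical mmdetection (v2.x) training pipeline illustrating the
# augmentations described in the text: multiscale training and random
# horizontal flip. The concrete scales are assumptions, not the
# settings used in the challenge.
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True, with_mask=True),
    # Multiscale training: one target scale is sampled per image.
    dict(type='Resize',
         img_scale=[(1333, 640), (1333, 800), (1333, 960)],
         multiscale_mode='value',
         keep_ratio=True),
    # Random horizontal flip with probability 0.5.
    dict(type='RandomFlip', flip_ratio=0.5),
    dict(type='Normalize',
         mean=[123.675, 116.28, 103.53],
         std=[58.395, 57.12, 57.375],
         to_rgb=True),
    dict(type='Pad', size_divisor=32),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels', 'gt_masks']),
]
```

In mmdetection, such a pipeline is attached to the dataset config and applied on the fly, so the 310 training images yield a different view at every epoch.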
The dataset used in the instance segmentation challenge is provided by Synergy Sports, the partner of this challenge. It contains 310 images shot during different basketball games, and the task is to segment the basketball players and the basketball in the images. In the challenge, we do not use any other datasets to augment the training data, nor any pre-trained backbones, which would implicitly introduce external data. On the other hand, state-of-the-art models such as DetectoRS [17], Cascade Eff-B7 NAS-FPN [8], and QueryInst [7] are typically trained on large-scale datasets and may perform poorly with insufficient labeled data. We believe that the recently proposed Swin Transformer [15] can better leverage visual inductive priors. Therefore, we use the Swin Transformer as the backbone to extract image features and Cascade Mask R-CNN [1] as the detection and segmentation head. To train the proposed model with only a few samples, data augmentation is a necessary step. After training the model, we apply multiscale fusion to further boost performance. Note that all training is done with only one GPU to simulate limited computational resources. In the end, our method achieved an AP@0.50:0.95 of 0.366 on the test set, which is competitive with other top-ranking methods while using only one GPU. Moreover, our method achieved an AP@0.50:0.95 (medium) of 0.592, ranking second among all contestants. Overall, our team ranked third among all contestants, as announced by the organizers.

2. Related Work

In this section, we briefly introduce two related topics: instance segmentation and transformers.

2.1. Instance Segmentation

Instance segmentation is essential for a wide variety of applications such as autonomous driving and visual question answering. Liu et al.
[14] proposed the Path Aggregation Network, which boosts information flow through bottom-up path augmentation, and won the COCO 2017 Challenge Object Detection Task. Hybrid Task Cascade [3] was the first model to successfully introduce the cascade into the instance segmentation field and improved training results significantly. Lee et al. [10] presented CenterMask, which is a simple

arXiv:2110.00242v2 [cs.CV] 8 Oct 2021
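The multiscale fusion used at inference can be sketched as follows. This is an assumed, simplified version of the idea (run the detector at several scales, map boxes back to the original resolution, and merge with NMS); `detect_fn` and the thresholds are hypothetical, and our actual fusion is implemented inside the mmdetection test-time augmentation machinery.

```python
# Sketch of multiscale test-time fusion: run inference at several image
# scales, map detections back to the original coordinate frame, then
# merge duplicates with greedy NMS. Not the exact challenge code.
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def nms(dets, thr=0.5):
    """Greedy NMS; dets are (x1, y1, x2, y2, score) tuples."""
    keep = []
    for d in sorted(dets, key=lambda d: d[4], reverse=True):
        if all(iou(d[:4], k[:4]) <= thr for k in keep):
            keep.append(d)
    return keep

def multiscale_fuse(detect_fn, image, scales, nms_thr=0.5):
    """detect_fn(image, scale) returns (x1, y1, x2, y2, score) boxes in
    the *scaled* frame; they are mapped back before merging."""
    merged = []
    for s in scales:
        for x1, y1, x2, y2, score in detect_fn(image, s):
            merged.append((x1 / s, y1 / s, x2 / s, y2 / s, score))
    return nms(merged, thr=nms_thr)
```

Because the same object is detected at every scale, the rescaled boxes overlap heavily and NMS keeps only the most confident one, while objects missed at one scale can still be recovered from another.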