Hypercorrelation Squeeze for Few-Shot Segmentation

Juhong Min, Dahyun Kang, Minsu Cho
Pohang University of Science and Technology (POSTECH), South Korea
http://cvlab.postech.ac.kr/research/HSNet/

Abstract

Few-shot semantic segmentation aims at learning to segment a target object from a query image using only a few annotated support images of the target class. This challenging task requires understanding diverse levels of visual cues and analyzing fine-grained correspondence relations between the query and the support images. To address the problem, we propose Hypercorrelation Squeeze Networks (HSNet) that leverage multi-level feature correlation and efficient 4D convolutions. HSNet extracts diverse features from different levels of intermediate convolutional layers and constructs a collection of 4D correlation tensors, i.e., hypercorrelations. Using efficient center-pivot 4D convolutions in a pyramidal architecture, the method gradually squeezes high-level semantic and low-level geometric cues of the hypercorrelation into precise segmentation masks in a coarse-to-fine manner. The significant performance improvements on the standard few-shot segmentation benchmarks PASCAL-5^i, COCO-20^i, and FSS-1000 verify the efficacy of the proposed method.

1. Introduction

The advent of deep convolutional neural networks [17, 20, 64] has promoted dramatic advances in many computer vision tasks including object tracking [28, 29, 45], visual correspondence [22, 44, 48], and semantic segmentation [7, 47, 62], to name a few. Despite the effectiveness of deep networks, their demand for a large number of annotated examples from large-scale datasets [9, 11, 35] remains a fundamental limitation, since data labeling requires substantial human effort, especially for dense prediction tasks, e.g., semantic segmentation.
To cope with the challenge, there have been various attempts at semi- and weakly-supervised segmentation [6, 26, 39, 66, 72, 77, 88], which in turn effectively alleviated the data-hunger issue. However, given only a few annotated training examples, the poor generalization ability of deep networks remains the primary concern that many few-shot segmentation methods [10, 12, 13, 19, 33, 36, 37, 46, 54, 61, 63, 69, 70, 74, 75, 80, 83, 86, 87, 89] struggle to address.

[Figure 1 panels: visual correspondences at multiple levels (hypercorrelation) between support and query; correlation pattern analysis (hypercorrelation squeeze), from semantic (coarse) to geometric (fine).]
Figure 1: Our model performs visual reasoning in a coarse-to-fine manner by gradually squeezing high-dimensional hypercorrelation to the target segmentation mask with efficient 4D convolutions.

In contrast, the human visual system easily generalizes to the appearances of new objects given extremely limited supervision. The crux of such intelligence lies in the ability to find reliable correspondences across different instances of the same class. Recent work on semantic correspondence shows that leveraging dense intermediate features [38, 42, 44] and processing correlation tensors with high-dimensional convolutions [30, 58, 71] are significantly effective in establishing accurate correspondences. However, while recent few-shot segmentation research has begun to actively explore correlation learning, most of these methods [36, 37, 46, 65, 73, 75, 80] neither exploit diverse levels of feature representations from early to late layers of a CNN nor construct pair-wise feature correlations to capture fine-grained correlation patterns. There have been some attempts [74, 86] at utilizing dense correlations with multi-level features, but they remain limited in that they simply employ the dense correlations for graph attention, using only a small fraction of intermediate conv layers.
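To make the notion of pair-wise feature correlation at multiple levels concrete, the following minimal Python sketch builds one 4D correlation tensor per feature level from toy feature maps. It is an illustration under stated assumptions, not the implementation of the proposed method: the similarity measure (cosine similarity clamped at zero), the function names (cosine, correlation_4d, hypercorrelation), and the toy inputs are all hypothetical choices made here.

```python
import math


def cosine(u, v, eps=1e-8):
    """Cosine similarity between two channel vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / (norm + eps)


def correlation_4d(query, support):
    """Dense pair-wise correlation between two feature maps.

    query[i][j] and support[k][l] are channel vectors. The result is
    indexed as corr[i][j][k][l]. Negative similarities are clamped to
    zero (an assumption of this sketch).
    """
    return [[[[max(0.0, cosine(q, s)) for s in s_row] for s_row in support]
             for q in q_row] for q_row in query]


def hypercorrelation(query_feats, support_feats):
    """One 4D correlation tensor per feature level (levels given as lists)."""
    return [correlation_4d(q, s) for q, s in zip(query_feats, support_feats)]


# Toy input: two feature levels with different spatial sizes and channels.
query_feats = [
    [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [-1.0, 0.0]]],  # 2x2 map, 2 channels
    [[[1.0, 2.0, 3.0]]],                                    # 1x1 map, 3 channels
]
support_feats = [
    [[[1.0, 0.0], [0.0, 2.0]], [[-1.0, 0.0], [1.0, 1.0]]],
    [[[1.0, 2.0, 3.0]]],
]
hyper = hypercorrelation(query_feats, support_feats)
```

Each tensor is indexed by a query position (i, j) and a support position (k, l), so a level with H x W query features and H' x W' support features holds H * W * H' * W' entries. This growth in size is what makes efficient high-dimensional convolutions relevant.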
arXiv:2104.01538v3 [cs.CV] 14 Oct 2021

In this work we combine two of the most influential techniques in recent research on visual correspondence, multi-level features and 4D convolutions, and design a novel