VIbCReg: Variance-Invariance-better-Covariance Regularization for Self-Supervised Learning on Time Series

Daesoo Lee 1 and Erlend Aune 1,2

1 Norwegian University of Science and Technology
2 BI Norwegian Business School

Abstract

Self-supervised learning for image representations has recently had many breakthroughs with respect to linear evaluation and fine-tuning evaluation. These approaches rely on both cleverly crafted loss functions and training setups to avoid the feature collapse problem. In this paper, we improve on the recently proposed VICReg method, which introduced a loss function that does not rely on specialized training loops to converge to useful representations. Our method improves on the covariance term proposed in VICReg, and in addition we augment the head of the architecture with an IterNorm layer that greatly accelerates convergence of the model. Our model achieves superior performance on linear evaluation and fine-tuning evaluation on a subset of the UCR time series classification archive and the PTB-XL ECG dataset. Source code will be made available.

1 Introduction

In the last year, representation learning (RL) has had great success within computer vision, both improving on SOTA for fine-tuned models and achieving close-to-SOTA results on linear evaluation of the learned representations [1, 2, 3, 4, 5, 6, 7], among many others. The main idea in these papers is to train a high-capacity neural network using a self-supervised learning (SSL) loss that is able to produce representations of images that are useful for downstream tasks such as image classification and segmentation. The recent mainstream SSL frameworks can be divided into two main categories: 1) contrastive learning methods, and 2) non-contrastive learning methods.
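To make the first category concrete, the following is a minimal, schematic sketch of an InfoNCE-style contrastive loss of the kind used by methods in this family; the function name, the temperature value, and the implementation details are illustrative assumptions, not code from any of the cited works.

```python
import numpy as np

def info_nce_loss(z1, z2, temperature=0.1):
    """Schematic InfoNCE-style contrastive loss (illustrative sketch only).

    z1, z2: (N, D) arrays of embeddings for two augmented views of a batch;
    row i of z1 and row i of z2 form a positive pair, while all other rows
    in z2 serve as negatives for row i of z1.
    """
    # L2-normalize so that dot products are cosine similarities.
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    sim = z1 @ z2.T / temperature  # (N, N) similarity matrix
    # Cross-entropy with the diagonal (positive pairs) as targets:
    # this pulls positive pairs together and pushes negatives apart.
    logsumexp = np.log(np.exp(sim).sum(axis=1))
    return float(np.mean(logsumexp - np.diag(sim)))
```

The loss per sample is minimized when the positive pair's similarity dominates the row of the similarity matrix, which is why a large pool of negatives per positive pair is needed for the gradient signal to be informative.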
Representative contrastive learning methods such as MoCo [3] and SimCLR [8] use positive and negative pairs: they learn representations by pulling the representations of the positive pairs together and pushing those of the negative pairs apart. However, these methods require a large number of negative pairs per positive pair to learn representations effectively. To eliminate the need for negative pairs, non-contrastive learning methods such as BYOL [4], SimSiam [5], Barlow Twins [6], and VICReg [7] have been proposed. Since the non-contrastive learning methods use positive pairs only, their architectures can be simplified. The non-contrastive learning methods have also been able to outperform the existing contrastive learning methods.

To further improve the quality of learned representations, feature whitening and feature decorrelation have been a main idea behind several recent improvements [9, 6, 10, 7]. Early SSL frameworks such as SimCLR suffer from a problem called feature collapse when there are not enough negative pairs, where collapse means that the features of the representations converge to constants. Collapse occurs because a similarity metric between positive pairs remains high even when all features have converged to constants; this is precisely why negative pairs are used to prevent it. Collapse has been partially resolved by using a momentum encoder [4] and an asymmetric framework with a predictor and stop-gradient [4, 5], which have popularized the non-contrastive learning methods. However, one of the latest SSL frameworks, VICReg, shows

arXiv:2109.00783v1 [cs.LG] 2 Sep 2021
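The decorrelation idea discussed above can be sketched as a minimal variance-invariance-covariance loss in the spirit of VICReg. This is an illustrative sketch under assumed weights and naming, not the implementation from VICReg or from the method proposed here.

```python
import numpy as np

def vic_loss(z1, z2, inv_w=25.0, var_w=25.0, cov_w=1.0, eps=1e-4):
    """Minimal sketch of a variance-invariance-covariance style loss
    (illustrative only; weights and details are assumptions).

    z1, z2: (N, D) embeddings of two augmented views of the same batch.
    """
    n, d = z1.shape
    # Invariance: the two views of each sample should embed similarly.
    inv = np.mean((z1 - z2) ** 2)
    # Variance: a hinge keeps each feature's std above 1, preventing
    # the features from collapsing to constants.
    var = 0.0
    for z in (z1, z2):
        std = np.sqrt(z.var(axis=0) + eps)
        var += np.mean(np.maximum(0.0, 1.0 - std))
    # Covariance: push off-diagonal covariance entries toward zero,
    # decorrelating the feature dimensions.
    cov = 0.0
    for z in (z1, z2):
        zc = z - z.mean(axis=0)
        c = (zc.T @ zc) / (n - 1)
        off_diag = c - np.diag(np.diag(c))
        cov += np.sum(off_diag ** 2) / d
    return inv_w * inv + var_w * var + cov_w * cov
```

Note how collapse is penalized directly: if all embeddings degenerate to a constant, the variance hinge term becomes large even though the invariance term is zero, so no negative pairs are needed to rule the collapsed solution out.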