Published as a conference paper at ICLR 2022

SEMI-RELAXED GROMOV-WASSERSTEIN DIVERGENCE WITH APPLICATIONS ON GRAPHS

Cédric Vincent-Cuaz 1, Rémi Flamary 2, Marco Corneli 1,3, Titouan Vayer 4, Nicolas Courty 5
Univ. Côte d'Azur, Inria, Maasai, CNRS, LJAD 1; IP Paris, CMAP, UMR 7641 2; MSI 3; Univ. Lyon, Inria, CNRS, ENS de Lyon, LIP UMR 5668 4; Univ. Bretagne-Sud, CNRS, IRISA 5.
{cedric.vincent-cuaz; marco.corneli; titouan.vayer}@inria.fr
remi.flamary@polytechnique.edu; nicolas.courty@irisa.fr

ABSTRACT

Comparing structured objects such as graphs is a fundamental operation involved in many learning tasks. To this end, the Gromov-Wasserstein (GW) distance, based on Optimal Transport (OT), has proven to be successful in handling the specific nature of the associated objects. More specifically, through the nodes' connectivity relations, GW operates on graphs, seen as probability measures over specific spaces. At the core of OT is the idea of conservation of mass, which imposes a coupling between all the nodes from the two considered graphs. We argue in this paper that this property can be detrimental for tasks such as graph dictionary or partition learning, and we relax it by proposing a new semi-relaxed Gromov-Wasserstein divergence. Aside from immediate computational benefits, we discuss its properties, and show that it can lead to an efficient graph dictionary learning algorithm. We empirically demonstrate its relevance for complex tasks on graphs such as partitioning, clustering and completion.

1 INTRODUCTION

One of the main challenges in machine learning (ML) is to design efficient algorithms that are able to learn from structured data (Battaglia et al., 2018).
Learning from datasets containing such non-vectorial objects is a difficult task that involves many areas of data analysis such as signal processing (Shuman et al., 2013), Bayesian and kernel methods on graphs (Ng et al., 2018; Kriege et al., 2020) or more recently geometric deep learning (Bronstein et al., 2017; 2021) and graph neural networks (Wu et al., 2020). In terms of applications, building algorithms that go beyond Euclidean data has led to much progress, e.g. in image analysis (Harchaoui & Bach, 2007), brain connectivity (Ktena et al., 2017), social network analysis (Yanardag & Vishwanathan, 2015) or protein structure prediction (Jumper et al., 2021).

Learning from graph data is ubiquitous in a number of ML tasks. A first one is to learn graph representations that can encode the graph structure (a.k.a. graph representation learning). In this domain, advances on graph neural networks led to state-of-the-art end-to-end embeddings, although requiring a sufficiently large amount of labeled data (Ying et al., 2018; Morris et al., 2019; Gao & Ji, 2019; Wu et al., 2020). Another task is to find a meaningful notion of similarity/distance between graphs. A way to address this problem is to leverage geometric or signal properties through the use of graph kernels (Kriege et al., 2020) or other embeddings accounting for graph isomorphisms (Zambon et al., 2020). Finally, it is often of interest either to establish meaningful structural correspondences between the nodes of different graphs, also known as graph matching (Zhou & De la Torre, 2012; Maron & Lipman, 2018; Bernard et al., 2018; Yan et al., 2016), or to find a representative partition of the nodes of a graph, which we refer to as graph partitioning (Chen et al., 2014; Nazi et al., 2019; Kawamoto et al., 2018; Bianchi et al., 2020).

Optimal Transport for structured data.
Based on the theory of Optimal Transport (OT, Peyré & Cuturi, 2019), a novel approach to graph modeling has recently emerged from a series of works. Informally, the goal of OT is to match two probability distributions under the constraint of mass conservation, in order to minimize a given matching cost. OT originally tackles the problem

arXiv:2110.02753v3 [cs.LG] 1 Mar 2022
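The mass-conservation principle described above can be illustrated on a toy example. The following numpy-only sketch (hypothetical data, not the paper's method) computes discrete OT between two 3-point uniform distributions; since with uniform marginals an optimal coupling is attained at a (rescaled) permutation matrix, we brute-force over permutations instead of calling a dedicated solver such as POT's `ot.emd`.

```python
import itertools
import numpy as np

# Toy 1-D supports for two uniform discrete distributions (assumed data).
x = np.array([0.0, 1.0, 2.0])   # source support
y = np.array([0.1, 1.1, 3.0])   # target support
n = len(x)
C = (x[:, None] - y[None, :]) ** 2   # squared-Euclidean matching cost

best_cost, best_T = np.inf, None
for perm in itertools.permutations(range(n)):
    T = np.zeros((n, n))
    T[range(n), perm] = 1.0 / n   # each source point sends all its mass to one target
    cost = (C * T).sum()
    if cost < best_cost:
        best_cost, best_T = cost, T

# Mass conservation: both marginals of the coupling equal the uniform weights,
# i.e. every node on each side is fully matched -- the constraint the paper relaxes.
assert np.allclose(best_T.sum(axis=1), 1.0 / n)
assert np.allclose(best_T.sum(axis=0), 1.0 / n)
print(best_cost)
```

The two assertions make explicit the coupling constraint between all points of both distributions; the semi-relaxed divergence introduced in this paper drops one of these two marginal constraints.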