Means in spaces of tree-like shapes

Aasa Feragen, Søren Hauberg, Mads Nielsen and François Lauze
Department of Computer Science, University of Copenhagen
Universitetsparken 1, DK-2100 Copenhagen, Denmark
{aasa, hauberg, madsn, francois}@diku.dk

Abstract

The mean is often the most important statistic of a dataset, as it provides a single point that summarizes the entire set. While the mean is readily defined and computed in Euclidean spaces, no commonly accepted solutions are currently available in more complicated spaces, such as spaces of tree-structured data. In this paper we study the notion of means, both generally in Gromov's CAT(0)-spaces (metric spaces of non-positive curvature) and specifically in the space of tree-like shapes. We prove local existence and uniqueness of means in such spaces and discuss three different algorithms for computing means.

We evaluate the three algorithms experimentally on three sets of data with tree-like structure: a synthetic dataset, a leaf morphology dataset from images, and a set of human airway subtrees from medical CT scans. This experimental study provides insight into the behavior of the different methods and how they relate to each other. More importantly, it also provides mathematically well-founded, tractable and robust "average trees". This statistic is of utmost importance due to the ubiquity of tree-like structures in human anatomy, e.g., airways and vascularization systems.

1. Notions of means

Centroids, weighted averages, midpoints of a pair of points, and other variations on the sample mean are the basic building blocks of statistical computations. While they are simple to compute when the underlying sample space is Euclidean, they may become much more complex in non-linear sample spaces.
A classical definition of centroids in Euclidean space, dating back to Apollonius of Perga, has a direct extension to general metric spaces [10, 11]: a mean of a finite collection $(x_i)_{i=1}^{n}$ of points in a metric space $(X, d)$ is a minimizer of the function

$$\Phi(x) = \sum_{i=1}^{n} d(x, x_i)^2. \tag{1}$$

Figure 1. (Left: a 2-point dataset A. Right: TED means of A, an infinite family.) The infinite family of trees on the right are TED means for the set of two trees on the left.

A local minimizer of Φ is called a Karcher mean, while a global minimizer is called a Fréchet mean. But when does such a minimizer exist? When is it unique? Although the above definition does not require the existence of geodesics, geodesics are often needed in order to compute a minimizer. This reveals important problems already in the simplest of situations. If geodesics exist in X, a solution to the above problem for a set of two points a and b is the point c on the geodesic segment from a to b such that d(a, c) = d(b, c). But what if there is more than one geodesic segment between a and b? The midpoint of each geodesic segment will minimize eq. 1, so the mean is not unique.

A key example where this problem occurs is the (Tree) Edit Distance (denoted TED in the sequel) used in spaces of attributed graphs and trees, e.g., shock graphs [5, 12, 17]. This metric is problematic because geodesics (edit paths) are not unique even locally, which prevents the existence of well-defined means even in a local context. For the pair of trees on the left in fig. 1 there is an infinite family of geodesics (and hence means), generated by varying the order and amount by which the side branches are shrunk and grown while deforming one tree into the other. A common approach for choosing a typical representative using TED is to choose the simplest possible mean, in this case the one shown in the middle. When iteratively computing means, however, one risks ending up with mean trees that are significantly simpler than the trees in the dataset.
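The non-uniqueness of midpoints described above can be seen in a setting much simpler than tree space. The following is a minimal sketch, not taken from the paper: it evaluates the function Φ of eq. 1 first for three points in the Euclidean plane, where the arithmetic mean is the unique minimizer, and then for two antipodal points on the unit circle with arc-length distance, where two geodesics join the points. The helper names `sum_sq`, `euclid` and `circle_dist`, the sample points, and the 1-degree search grid are all illustrative assumptions.

```python
import math


def sum_sq(x, points, dist):
    """Phi(x) from eq. 1: sum of squared distances from x to the sample points."""
    return sum(dist(x, p) ** 2 for p in points)


def euclid(a, b):
    """Euclidean distance between two points given as coordinate tuples."""
    return math.dist(a, b)


def circle_dist(a, b):
    """Arc-length (geodesic) distance between two angles on the unit circle."""
    d = abs(a - b) % (2 * math.pi)
    return min(d, 2 * math.pi - d)


# Euclidean plane: the arithmetic mean minimizes Phi.
pts = [(0.0, 0.0), (2.0, 0.0), (1.0, 3.0)]
mean = tuple(sum(c) / len(pts) for c in zip(*pts))
phi_at_mean = sum_sq(mean, pts, euclid)
# Small perturbations of the mean (hypothetical step size 0.1) all increase Phi.
perturbed = [(mean[0] + dx, mean[1] + dy)
             for dx, dy in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.0, -0.1)]]
mean_is_local_min = all(sum_sq(q, pts, euclid) > phi_at_mean for q in perturbed)

# Unit circle: two antipodal points are joined by two geodesics (the two
# half-circles), and the midpoint of each one minimizes Phi.
angles = [0.0, math.pi]
grid = [2 * math.pi * k / 360 for k in range(360)]  # candidate means, 1 degree apart
values = [sum_sq(t, angles, circle_dist) for t in grid]
best = min(values)
minimizers = [t for t, v in zip(grid, values) if math.isclose(v, best, abs_tol=1e-9)]
```

On the circle, the grid search returns two minimizers, the midpoints of the two half-circle geodesics, so no single point can be singled out as the mean without an additional selection rule. The infinite family of TED means in fig. 1 is the same phenomenon with infinitely many geodesics.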
This tendency towards overly simple means explains the reduced complexity of the TED means found by Trinh and Kimia [17]. Similarly, in the graph embedding work of Bunke and collaborators [5, 16], severe restrictions on