Minimum Description Length, Graphs and Clustering with Exemplars
Po-Hsiang Lai 1 *, Joseph A. O'Sullivan 1 , and Robert Pless 2
1 Department of Electrical and Systems Engineering, 2 Department of Computer Science and Engineering, Washington University in Saint Louis
*pl1@wustl.edu

Clustering
• Clustering is a type of unsupervised learning in which one seeks to partition data into reasonable groups.
  – Distribution-based clustering
    • Fit the data with a mixture model to partition it.
    • The objective is to maximize the goodness of fit of the model.
    • The measure of relationship between pairs of data points changes with different choices of the mixture parameters.
  – Distance (similarity)-based clustering
    • The distance measure between data points is fixed.
    • View data points as vertices of a graph and distances as edge weights.
    • The objective is to remove a fixed number of edges or to optimize an objective function defined on the graph.

Clustering and Model Selection
• Can the number of clusters and other parameters be determined in a principled way?
• Can a clustering algorithm balance the number of parameters used against the modeling error?
• Distance-based clustering and MDL
  – Distances serve as approximations/estimates of description lengths.
  – Usually distances are defined between pairs of data points:
    • Encoding a data point by itself: $L(x_i)$.
    • Encoding a data point given another: $L(x_i \mid x_j)$.
  – Use available compression algorithms: $L(x_i \mid x_j) = L(x_i, x_j) - L(x_j)$.
  – Quantize the data and use a universal code for integers.
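The identity $L(x_i \mid x_j) = L(x_i, x_j) - L(x_j)$ can be evaluated with an off-the-shelf compressor, in the spirit of Cilibrasi and Vitányi's clustering by compression. A minimal sketch, assuming zlib as the compressor and concatenation as the encoding of the pair; the compressor choice and the helper names are illustrative assumptions, not the poster's exact setup:

```python
import zlib

def L(s: bytes) -> int:
    """Approximate description length L(s): compressed size in bytes."""
    return len(zlib.compress(s, 9))

def L_cond(x: bytes, y: bytes) -> int:
    """Approximate conditional length L(x|y) = L(x, y) - L(y),
    encoding the pair (x, y) as the concatenation y + x."""
    return L(y + x) - L(y)

# Two near-duplicate strings and one unrelated byte pattern.
a = b"the quick brown fox jumps over the lazy dog " * 20
b = b"the quick brown fox jumps over the lazy cat " * 20
c = bytes(range(256)) * 4

# Encoding a given the similar string b should be much cheaper than
# encoding a given the unrelated data c.
print(L_cond(a, b), "vs", L_cond(a, c))
```

Such compression-based conditional lengths give the asymmetric "distances" used as directed edge weights below.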
• Distribution-based clustering and MDL
• Constraints:
  – Let $q_t = \sum_{i=1}^{N} \mathbf{1}\{t_i = t\}$ count the points that choose $t$ as their exemplar; then every used exemplar ($q_t \geq 1$) must itself be encoded, and the resulting assignment graph must contain no cycles (weak exemplar).
  – $t_{t_i} = t_i$ must hold for every $i$, i.e., each exemplar encodes itself (strong exemplar).

Objective Functions and Graphical Optimization
• Objective function:
  $$\min_t \; \sum_{i:\, x_i \neq t_i} L(x_i \mid t_i) \;+\; \sum_{i:\, x_i = t_i} L(x_i) \;+\; L(t),$$
  subject to the constraints above, where $t_i$ is the exemplar assigned to $x_i$.
• Cost of describing the exemplar structure:
  $$L(t) = \log N + \log \binom{N}{K} + (N-K)\log K$$
  (how many exemplars; which ones are exemplars; assignments of the non-exemplar points), where $K$ is the number of exemplars.
• View the data points as vertices of a graph with vertex weights $L(x_i)$ and edge weights $L(x_i \mid x_j)$: the weak exemplar case corresponds to a minimum spanning (arborescence) tree, and the strong exemplar case to a two-level tree.

Weak Exemplar Case
• Observe that the encoding
  – must start with the root node,
  – has one and only one edge pointing to every data point,
  – contains no cycles.
• Minimum spanning tree: undirected graph, symmetric distance.
• Minimum spanning arborescence: directed graph, asymmetric distance.

Strong Exemplar Case
• Objective function: the description length above, subject to the strong exemplar constraint $t_{t_i} = t_i$.
• Relax the search over $t$ by assigning probabilities to cluster/exemplar membership:
  $$\min_t L(x \mid t) \;\longrightarrow\; \min_{P,Q} L(x \mid P, Q) = \min_{P,Q} \; -\sum_{k=1}^{N} \log\!\Big[ P_k\, p(x_k) + (1-P_k) \sum_{m} Q(k,m)\, p(x_k \mid x_m) \Big],$$
  $$p(x_k) = \exp(-L(x_k)), \qquad p(x_k \mid x_m) = \exp(-L(x_k \mid x_m)),$$
  where $P_k$ is the probability that $x_k$ is an exemplar and $Q(k,m)$ the probability that $x_k$ is assigned to exemplar $x_m$.
• Need one more minimization.

Alternating Minimization
• Convex decomposition lemma:
  $$-\log \sum_k \rho_k = \min_{q} \; -\sum_k q_k \log \frac{\rho_k}{q_k},$$
  the minimum taken over probability vectors $q$.
• Use the convex decomposition lemma to decouple the optimization problem:
  $$\min_{P,Q} -\sum_{k=1}^{N} \log\!\Big[ P_k\, p(x_k) + (1-P_k)\sum_m Q(k,m)\, p(x_k \mid x_m)\Big] \;=\; \min_{P,Q}\,\min_{q} \; -\sum_{k=1}^{N}\Big[ q_k \log\frac{P_k\, p(x_k)}{q_k} + (1-q_k)\sum_m q_{m|k}\log\frac{(1-P_k)\,Q(k,m)\,p(x_k \mid x_m)}{(1-q_k)\,q_{m|k}} \Big],$$
  then alternate between updating the auxiliary distributions $q_k$, $q_{m|k}$ and the parameters $P, Q$.

Simulations
• Weak exemplar clustering using the MST, uniform quantization, and Rissanen's universal code for integers. [Figure: two 2-D scatter plots of the resulting clusters.]
• Strong exemplar clustering using the AM algorithm under different signal-to-noise ratios.

References
[1] R. Cilibrasi and P. M. B.
Vitányi, "Clustering by compression," IEEE Trans. Inform. Theory, vol. 51, no. 4, pp. 1523–1545, 2005.
[2] I. Csiszár and G. Tusnády, "Information geometry and alternating minimization procedures," Statistics and Decisions, Supplement Issue, vol. 1, pp. 205–237, 1984.
[3] B. J. Frey and D. Dueck, "Clustering by passing messages between data points," Science, vol. 315, pp. 972–976, 2007.
[4] P. Kontkanen, P. Myllymäki, W. Buntine, J. Rissanen, and H. Tirri, "An MDL framework for data clustering," in Advances in Minimum Description Length, P. D. Grünwald, I. J. Myung, and M. A. Pitt, Eds. Cambridge, MA: MIT Press, 2005, pp. 323–353.
[5] J. Rissanen, "Fisher information and stochastic complexity," IEEE Trans. Inform. Theory, vol. 42, pp. 40–47, 1996.
[6] J. Rissanen, "A universal prior for integers and estimation by minimum description length," Annals of Statistics, vol. 11, pp. 417–431, 1983.
[7] J. Rissanen, "Universal coding, information, prediction and estimation," IEEE Trans. Inform. Theory, vol. 30, pp. 629–636, 1984.
[8] A. Schrijver, Combinatorial Optimization. Berlin: Springer, 2003.

Code Length of Continuous Parameters (Rissanen 96)
• Three-part code: $L(x, \theta, c) = L(x \mid \theta, c) + L(\theta \mid c) + L(c)$, where $c$ is the assignment vector.
• Applying Rissanen's asymptotic code length to the $d_j$-dimensional parameter of cluster $j$ (estimated from its $n_j$ points):
  $$L(x, c, K) = -\sum_{i=1}^{N} \log p(x_i \mid \hat\theta_{c_i}) + \sum_{j=1}^{K} \left[ \frac{d_j}{2} \log \frac{n_j}{2\pi} + \log \int \sqrt{|I(\theta_j)|}\, d\theta_j \right] + L(c) + o(1),$$
  where $I(\theta)$ is the Fisher information matrix and $L(c)$ is the log of the number of valid assignment vectors $c$.
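The weak-exemplar simulation combines a minimum spanning tree with a codelength criterion: an edge from a parent to a point is kept only when encoding the point relative to its parent is cheaper than some stand-alone cost. A minimal sketch of that pipeline, assuming Euclidean distance as a stand-in for $L(x_i \mid x_j)$ and a fixed threshold as a stand-in for $L(x_i)$; Prim's algorithm and all names here are illustrative, not the poster's exact implementation:

```python
import math

def mst_edges(points):
    """Prim's algorithm: edges (parent, child) of a Euclidean MST."""
    n = len(points)
    dist = lambda a, b: math.dist(points[a], points[b])
    in_tree = {0}  # start with an arbitrary root
    edges = []
    while len(in_tree) < n:
        # Cheapest edge from the tree to a vertex outside it.
        j, i = min(((j, i) for j in in_tree for i in range(n) if i not in in_tree),
                   key=lambda e: dist(*e))
        edges.append((j, i))
        in_tree.add(i)
    return edges

def cluster(points, threshold):
    """Weak-exemplar clustering: drop MST edges whose length exceeds the
    threshold (encoding the child from its parent costs more than encoding
    it alone), then label points by connected component."""
    parent = list(range(len(points)))
    def find(i):  # union-find with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for j, i in mst_edges(points):
        if math.dist(points[j], points[i]) <= threshold:
            parent[find(i)] = find(j)
    return [find(i) for i in range(len(points))]

pts = [(0, 0), (0.1, 0.2), (0.2, 0.1), (5, 5), (5.1, 5.2)]
print(cluster(pts, threshold=1.0))  # two well-separated groups
```

With an asymmetric codelength such as the compression-based $L(x_i \mid x_j)$, the undirected MST would be replaced by a minimum spanning arborescence, as noted in the Weak Exemplar Case panel.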