JMLR: Workshop and Conference Proceedings 25:33–48, 2012    Asian Conference on Machine Learning

Learning Latent Variable Models by Pairwise Cluster Comparison

Nuaman Asbeh    asbeh@post.bgu.ac.il
Boaz Lerner    boaz@bgu.ac.il
Dept. of Industrial Engineering and Management, Ben-Gurion University of the Negev, Israel

Editors: Steven C.H. Hoi and Wray Buntine

Abstract

Identification of latent variables that govern a problem and the relationships among them given measurements in the observed world are important for causal discovery. This identification can be made by analyzing constraints imposed by the latents in the measurements. We introduce the concept of pairwise cluster comparison (PCC) to identify causal relationships from clusters and a two-stage algorithm, called LPCC, that learns a latent variable model (LVM) using PCC. First, LPCC learns the exogenous and the collider latents, as well as their observed descendants, by utilizing pairwise comparisons between clusters in the measurement space that may explain latent causes. Second, LPCC learns the non-collider endogenous latents and their children by splitting these latents from their previously learned latent ancestors. LPCC is not limited to linear or latent-tree models and does not make assumptions about the distribution. Using simulated and real-world datasets, we show that LPCC improves accuracy with the sample size, can learn large LVMs, and is accurate in learning compared to state-of-the-art algorithms.

Keywords: learning latent variable models, graphical models, clustering

1. Introduction

Statistical methods, such as factor analysis, are most commonly used to reveal the existence and influence of latent variables. While these methods accomplish effective dimensionality reduction and may fit the data reasonably well, the resulting models might not have any correspondence to real causal mechanisms (Silva et al., 2006).
On the other hand, the focus of learning Bayesian networks (BNs) is on relations among observed variables, while the detection of latent variables and their interrelations has received little attention. Learning latent variable models (LVMs) using Inductive Causation* (IC*) (Pearl, 2000) and Fast Causal Inference (FCI) (Spirtes et al., 2000) returns partial ancestral graphs, which indicate for each link whether it is a (potential) manifestation of a hidden common cause for the two linked variables. The structural EM algorithm (Friedman, 1998) learns a structure using a fixed set of previously given latents. By searching for "structural signatures" of latents, substructures that suggest the presence of latents (in the form of dense sub-networks) can be detected (Elidan et al., 2000). Also, the recovery of latent trees of binary and Gaussian variables has been suggested (Pearl, 2000). Hierarchical latent class (HLC) models of discrete variables, where observed variables are mutually independent given the latents, are learned for clustering (Zhang, 2004).

However, for models that are not tree-constrained, e.g., models where multiple latents may have multiple indicators (observed children), i.e., multiple indicator models (MIM),

© 2012 N. Asbeh & B. Lerner.