Correlations and Anticorrelations in LDA Inference

Alexandre Passos
Department of Computer Science
University of Massachusetts Amherst
apassos@cs.umass.edu

Hanna M. Wallach
Department of Computer Science
University of Massachusetts Amherst
wallach@cs.umass.edu

Andrew McCallum
Department of Computer Science
University of Massachusetts Amherst
mccallum@cs.umass.edu

1 Introduction

Inference of the document-specific topic distributions in latent Dirichlet allocation (LDA) [2] and decoding in compressed sensing [3] exhibit many similarities. Given a matrix and a noisy observed vector, the goal of both tasks is to recover a sparse vector that, when combined with the matrix, provides a good explanation of the noisy observed data. In the case of LDA, the matrix corresponds to the topic-specific distributions over words, the noisy observed vector corresponds to the observed word frequencies for a single document, and the sparse vector corresponds to the document-specific distribution over topics for that document.

In the scenario typically considered in the compressed sensing literature (i.e., a small, dense observed vector and a large, very sparse latent vector), the latent structure can be recovered exactly provided the matrix in question satisfies the restricted isometry property [4]. Satisfying this property means that no row of the matrix can be reconstructed from a sparse linear combination of the other rows. Although it is infeasible to test whether an arbitrary matrix satisfies the restricted isometry property, intuitions about this property, together with theorems about random matrices, can be used to design improved compressed sensing systems and to characterize when and why compressed sensing will succeed. In this paper, we present preliminary work on identifying an analogue of the restricted isometry property for LDA, along with its effect on inference of the document-specific topic distributions.
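To make the compressed sensing side of this analogy concrete, the following sketch recovers a sparse vector from underdetermined linear measurements using orthogonal matching pursuit. A random Gaussian matrix of this kind satisfies the restricted isometry property with high probability; the dimensions, sparsity level, and the `omp` helper are illustrative choices, not part of this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# A random Gaussian matrix of this shape satisfies the restricted isometry
# property with high probability; all dimensions here are illustrative.
n_obs, n_atoms, k = 50, 128, 3
A = rng.normal(size=(n_obs, n_atoms)) / np.sqrt(n_obs)

# Ground-truth k-sparse vector, analogous to a document that uses only a
# handful of topics.
x_true = np.zeros(n_atoms)
support = rng.choice(n_atoms, size=k, replace=False)
x_true[support] = [2.0, -3.0, 1.5]

y = A @ x_true  # noiseless observations, for simplicity

def omp(A, y, k):
    """Orthogonal matching pursuit: greedily pick the column most
    correlated with the residual, then refit by least squares."""
    residual, chosen = y.copy(), []
    for _ in range(k):
        corr = np.abs(A.T @ residual)
        corr[chosen] = -np.inf          # never pick a column twice
        chosen.append(int(np.argmax(corr)))
        coef, *_ = np.linalg.lstsq(A[:, chosen], y, rcond=None)
        residual = y - A[:, chosen] @ coef
    x_hat = np.zeros(A.shape[1])
    x_hat[chosen] = coef
    return x_hat

x_hat = omp(A, y, k)
print("max recovery error:", np.max(np.abs(x_hat - x_true)))
```

Because the columns of a random Gaussian matrix are nearly orthogonal, the greedy procedure identifies the true support and the least-squares refit then recovers the coefficients essentially exactly in this noiseless setting.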
This work is based on the following observation: If a document contains occurrences of some word type that can be well explained by multiple topics (e.g., the word “neural” in a corpus of NIPS papers, which could be explained by either topics on neural networks or topics on neuroscience), the model structure and associated inference procedures will almost always force the inferred topic distribution for that document to exhibit a strong preference for only one of these topics. Not only is this behavior required by the structure of optimal solutions to the LDA inference problem (see lemma 4 of Sontag and Roy [6]) but, intuitively, it is also this behavior that permits inference of the topic-specific word distributions: By explaining all document-specific occurrences of a word type with only one topic and searching for sparse document–topic distributions, the latent topics can “move apart” and eventually assign high probabilities to words that exhibit within-topic, but not across-topic, co-occurrences.

We illustrate our claims using Blei and Lafferty’s correlated topic model [1]. This model is well-known to assign higher probabilities to held-out (i.e., previously unseen) documents than LDA, while simultaneously producing lower-quality topics, as judged by human evaluators [5].
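The sparsity-forcing behavior described above can be sketched numerically. The toy vocabulary, topic probabilities, and helper functions below are invented for illustration; the objective is the usual MAP criterion analyzed by Sontag and Roy, a multinomial log likelihood plus a Dirichlet log prior over the document-topic mixture.

```python
import numpy as np

# Toy vocabulary ["neural", "network", "cortex"] and two topics that both
# explain "neural" equally well (all probabilities are illustrative).
topics = np.array([
    [0.5, 0.5, 0.0],   # a "neural networks" topic
    [0.5, 0.0, 0.5],   # a "neuroscience" topic
])

alpha = 0.1  # Dirichlet hyperparameter; alpha < 1 favors sparse mixtures

def map_objective(theta, counts):
    """Unnormalized log posterior of the document-topic mixture theta:
    multinomial log likelihood plus Dirichlet log prior."""
    word_probs = theta @ topics
    return counts @ np.log(word_probs) + (alpha - 1) * np.log(theta).sum()

def map_weight_on_topic0(counts):
    """Grid search over two-topic mixtures theta = (t, 1 - t)."""
    grid = np.linspace(1e-4, 1 - 1e-4, 10001)
    scores = [map_objective(np.array([t, 1 - t]), counts) for t in grid]
    return grid[int(np.argmax(scores))]

# Eight "neural" tokens plus a single "network" token: almost all of the
# data is equally well explained by either topic, yet the MAP mixture puts
# nearly all of its weight on one topic.
t_star = map_weight_on_topic0(np.array([8, 1, 0]))
print("weight on topic 0:", t_star)

# With only "neural" tokens the likelihood is flat in theta, but the
# alpha < 1 prior still pushes the MAP solution to a corner of the simplex.
t_flat = map_weight_on_topic0(np.array([8, 0, 0]))
print("weight on topic 0 (symmetric document):", t_flat)
```

In the first document the maximizer sits at one corner of the simplex (nearly all weight on the “neural networks” topic); in the symmetric document the likelihood cannot distinguish the topics, yet the sparse prior still selects a corner rather than an even split.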