LETTER Communicated by Liam Paninski

Maximum Likelihood Set for Estimating a Probability Mass Function

Bruno M. Jedynak
bruno.jedynak@jhu.edu
Département de Mathématiques, Université des Sciences et Technologies de Lille, France, and Center for Imaging Science, Johns Hopkins University, Baltimore, MD 21250, U.S.A.

Sanjeev Khudanpur
khudanpur@jhu.edu
Department of Electrical and Computer Engineering, Johns Hopkins University, Baltimore, MD 21250, U.S.A.

We propose a new method for estimating the probability mass function (pmf) of a discrete and finite random variable from a small sample. We focus on the observed counts, that is, the number of times each value appears in the sample, and define the maximum likelihood set (MLS) as the set of pmfs that put more mass on the observed counts than on any other set of counts possible for the same sample size. We characterize the MLS in detail in this article. We show that the MLS is a diamond-shaped subset of the probability simplex [0, 1]^k bounded by at most k × (k - 1) hyperplanes, where k is the number of possible values of the random variable. The MLS always contains the empirical distribution, as well as a family of Bayesian estimators based on a Dirichlet prior, particularly the well-known Laplace estimator. We propose to select from the MLS the pmf that is closest to a fixed pmf that encodes prior knowledge. When using Kullback-Leibler distance for this selection, the optimization problem comprises finding the minimum of a convex function over a domain defined by linear inequalities, for which standard numerical procedures are available. We apply this estimate to language modeling using Zipf’s law to encode prior knowledge and show that this method permits obtaining state-of-the-art results while being conceptually simpler than most competing methods.

1 Introduction

Let p be a probability mass function (pmf) over a set {1, ..., k} of finite cardinality.
This may represent a set of numerical values for a quantitative variable or a set of indices for a qualitative variable.
Neural Computation 17, 1–23 (2005) © 2005 Massachusetts Institute of Technology
The latter situation