A Metric to Assess the Quality of a Data Sample ∗

CJ Hattingh †

October 24, 2016

Abstract

In exploring a methodology for determining the minimum sample size (of a simple random sample) which will guarantee a representative sample, I describe a metric to evaluate the quality of a sample for data with a general frequency distribution. This investigation is limited to discretized variables, and assumes that the population is known.

1 Introduction

In many statistical applications it is necessary to find a sample which is representative of a larger population. Typically the sampling process consists of taking a random sample of representatives of a specified size. A pivotal question in this process can be formulated as: What is the smallest sample size which will guarantee a representative sample?

Implicit in this question is the question of how to assess whether a sample is representative of a known population, or more generally whether two samples are similar for a general frequency distribution (i.e., not necessarily normal). Conversely, we could consider the difference between samples: a type of sampling error. If this sampling error is below a certain threshold, we could consider the two samples to be similar.

This paper investigates these questions, exploring a metric which describes specifically how closely a sample's distribution matches the population's distribution, and which can be applied to data with a general frequency distribution.

∗ A technical paper produced while in the service of Revenue Science, a part of the Cyest Corporation (www.cyestcorp.com).
† e-mail address: krisjan@simulasie.co.za