Bayesian analysis of systems with random chemical composition: Renormalization-group approach to Dirichlet distributions and the statistical theory of dilution

Marcel Ovidiu Vlad (1,2), Masa Tsuchiya (3), Peter Oefner (3), and John Ross (1)

(1) Department of Chemistry, Stanford University, Stanford, California 94305-5080
(2) Center of Mathematical Statistics, Casa Academiei Romane, Calea Septembrie 13, 76100 Bucharest, Romania
(3) Stanford Genome Technology Center, Stanford University School of Medicine, 855 California Avenue, Palo Alto, California 94304

(Received 31 August 2001; published 20 December 2001)

We investigate the statistical properties of systems with random chemical composition and try to obtain a theoretical derivation of the self-similar Dirichlet distribution, which is used empirically in molecular biology, environmental chemistry, and geochemistry. We consider a system made up of many chemical species and assume that the statistical distribution of the abundance of each chemical species in the system is the result of a succession of a variable number of random dilution events, which can be described by using the renormalization-group theory. A Bayesian approach is used for evaluating the probability density of the chemical composition of the system in terms of the probability densities of the abundances of the different chemical species. We show that for large cascades of dilution events, the probability density of the composition vector of the system is given by a self-similar probability density of the Dirichlet type. We also give an alternative formal derivation for the Dirichlet law based on the maximum entropy approach, by assuming that the average values of the chemical potentials of different species, expressed in terms of molar fractions, are constant. Although the maximum entropy approach leads formally to the Dirichlet distribution, it does not clarify the physical origin of the Dirichlet statistics and has serious limitations.
The random theory of dilution provides a physical picture for the emergence of Dirichlet statistics and makes it possible to investigate its validity range. We discuss the implications of our theory in molecular biology, geochemistry, and environmental science.

DOI: 10.1103/PhysRevE.65.011112
PACS number(s): 05.40.-a, 05.10.Cc, 87.15.Cc, 91.65.-n

I. INTRODUCTION

The statistical analysis of various problems of physics, chemistry, and biology involves the consideration of systems with random chemical compositions. Typical examples include statistical studies of the abundances of different chemical species in geochemistry [1], the distribution of pollutants in the environment [2], or the nucleotide frequencies in genomes [3]. For many systems with random composition, the statistics of the fluctuations in composition can be satisfactorily described by means of the Dirichlet probability density [4]

$$
P_N(\boldsymbol{\theta};\boldsymbol{\alpha})\,d\boldsymbol{\theta} = Z^{-1}(\boldsymbol{\alpha}) \prod_{u=1}^{N} \theta_u^{\alpha_u-1} \prod_{v=1}^{N-1} d\theta_v , \tag{1}
$$

where the composition vector $\boldsymbol{\theta}=(\theta_1,\ldots,\theta_N)$ is expressed by the mass, volume, or mole fractions $\theta_1,\ldots,\theta_N$ of the different species present in the system, $\alpha_1>0,\ldots,\alpha_N>0$ are positive integers, and

$$
Z(\boldsymbol{\alpha}) = \int\cdots\int \prod_{u=1}^{N} \theta_u^{\alpha_u-1} \prod_{v=1}^{N-1} d\theta_v = \frac{\prod_{u=1}^{N}\Gamma(\alpha_u)}{\Gamma\!\left(\sum_{u=1}^{N}\alpha_u\right)} \tag{2}
$$

is a partition function. The standard method used in mathematical statistics for the generation of the Dirichlet probability density is to express the fractions $\theta_1,\ldots,\theta_N$ in terms of $N$ random variables $X_1,\ldots,X_N$, as $\theta_u = X_u / \sum_{u'=1}^{N} X_{u'}$, where each random variable $X_u$ is selected from a different Gamma or $\chi^2$ probability density. Under these circumstances, it is easy to show that the vector $\boldsymbol{\theta}=(\theta_1,\ldots,\theta_N)$ obeys a probability law of the type (1). Unfortunately, this is only a formal statistical derivation that does not clarify the meaning of the probability density (1).
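The formal construction described above, in which independent Gamma variables are normalized by their sum, can be sketched numerically. The following is a minimal illustration, not the authors' procedure: the function names `sample_composition` and `log_partition`, the parameter list `alphas`, and the numerical values are hypothetical, and a common unit scale is assumed for all Gamma variables (only the shape parameter differs between species). The second function evaluates the logarithm of the partition function of Eq. (2) from its Gamma-function form.

```python
import math
import random


def sample_composition(alphas, rng):
    """Draw one composition vector by normalizing independent Gamma variables."""
    # X_u ~ Gamma(shape=alpha_u, scale=1). The common unit scale is an assumption
    # of this sketch; the shapes alpha_u play the role of the exponents in Eq. (1).
    xs = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(xs)
    return [x / total for x in xs]


def log_partition(alphas):
    """Return log Z = sum_u log Gamma(alpha_u) - log Gamma(sum_u alpha_u), cf. Eq. (2)."""
    return sum(math.lgamma(a) for a in alphas) - math.lgamma(sum(alphas))


if __name__ == "__main__":
    rng = random.Random(0)
    alphas = [2.0, 3.0, 5.0]  # illustrative values, not taken from the paper
    draws = [sample_composition(alphas, rng) for _ in range(20000)]
    # Sample means of the fractions; for a Dirichlet law they approach
    # alpha_u / sum(alphas).
    means = [sum(d[u] for d in draws) / len(draws) for u in range(len(alphas))]
    print("sample means:", means)
    print("Z:", math.exp(log_partition(alphas)))
```

Each sampled vector sums to one by construction, so only N - 1 of its components are independent, which is why Eq. (1) carries N - 1 differentials. Working with `math.lgamma` rather than `math.gamma` avoids overflow when the exponents are large.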
Recently, the empirical use of the Dirichlet distribution has become popular, especially in molecular biology, where it provides a satisfactory description of nucleotide statistics in DNA strands or amino acid statistics in proteins [4]. Other applications include the description of pollutant distribution in the environment [2], its use in materials science for describing the chemical composition of disordered systems [5], as well as its use in geochemistry [1]. In all of these cases, the Dirichlet distribution is employed merely as an empirical law that manages to give a satisfactory description of the observed data. No simple physical explanation for the occurrence of the Dirichlet law has been given. The purpose of this paper is to present a simple physical explanation for the Dirichlet law (1) for the composition fluctuations. Our main assumption is that the random variations in composition are due to the occurrence of a succession of a random number of dilution events. Such a mechanism seems reasonable not only in environmental chemistry and geochemistry but also in molecular biology, where the process of nucleotide substitution can act as a dilution factor, which tends to destroy the correlations among the different

PHYSICAL REVIEW E, VOLUME 65, 011112. 1063-651X/2001/65(1)/011112(8)/$20.00 ©2001 The American Physical Society