Univariate statistical analysis of environmental (compositional) data:
Problems and possibilities
Peter Filzmoser a,⁎, Karel Hron b, Clemens Reimann c
a Institute of Statistics and Probability Theory, Vienna University of Technology, Wiedner Hauptstr. 8-10, A-1040 Wien, Austria
b Department of Mathematical Analysis and Applications of Mathematics, Faculty of Science, Palacký University Olomouc, 17. listopadu 12, CZ-77100 Olomouc, Czech Republic
c Geological Survey of Norway, N-7491 Trondheim, Norway
Article info
Article history:
Received 8 May 2009
Received in revised form 24 July 2009
Accepted 5 August 2009
Available online xxxx
Keywords:
Compositional data
Closure problem
Univariate statistical analysis
Exploratory data analysis
Log transformation
Abstract
For almost 30 years it has been known that compositional (closed) data have special geometrical properties.
In environmental sciences, where the concentrations of chemical elements in different sample materials are
investigated, almost all datasets are compositional. In general, compositional data are parts of a whole which
only give relative information. Data that sum up to a constant, e.g. 100 wt.% or 1,000,000 mg/kg, are the best
known example. It is widely neglected that the “closure” characteristic remains even if only one of all
possible elements is measured; it is an inherent property of compositional data. No variable is free to vary
independently of all the others.
Existing transformations to “open” closed data are seldom applied. They are more complicated than a log
transformation, and the relationship to the original data unit is lost. Results obtained when using classical
statistical techniques for data analysis appeared reasonable, and the possible consequences of working with
closed data were rarely questioned. Here the simple univariate case of data analysis is investigated. It can be
demonstrated that data closure must be overcome prior to calculating even simple statistical measures such as
the mean or standard deviation, or plotting graphs of the data distribution, e.g. a histogram. Some measures, like
the standard deviation (or the variance), make no statistical sense with closed data, and all statistical tests
building on the standard deviation (or variance) will thus provide erroneous results if used with the original
data.
©2009 Elsevier B.V. All rights reserved.
1. Introduction
A classical example of a closed array, or closed number system, is a
data set in which the individual variables are not independent of each
other but are related by being expressed as a percentage or as parts per
million, as almost all environmental data are. Compositional data
have been historically defined as summing up to a constant, but
nowadays they have a broader definition, as they are considered to be
parts of a whole which only give relative information (see Buccianti
and Pawlowsky-Glahn, 2005, for an example). This definition thus
also includes data that do not sum up to a constant. The problems of
undertaking statistical analyses with “closed number systems” have
been discussed extensively in the specialized literature for more than 30 years,
mostly in connection with multivariate data analysis (e.g. Chayes,
1960; Butler, 1976; Le Maitre, 1982; Woronow and Butler, 1986;
Aitchison, 1986, 2008). However, the mathematical formalism is
difficult and the consequences of using classical statistics for compositional
data have thus never reached the wider environmental
community. Data closure has often been treated as a topic for
mathematical freaks, and intuitively it has been stated that this issue
might have consequences only for multivariate data analysis or if
major elements are considered in data analysis. Using practical
examples, this paper demonstrates what happens if classical statistical
methods are applied indiscriminately to environmental data in
the simple univariate case.
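
The statement that the variables of a closed data set are not independent of each other can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration only: the three components are synthetic, the helper name `close` is hypothetical, and the lognormal parameters are arbitrary. The raw amounts are generated independently; once each sample is expressed in percent, the rows of the covariance matrix sum to zero, so every component is forced to covary negatively with at least one other.

```python
import numpy as np


def close(parts, total=100.0):
    """Rescale each row of positive parts so that the row sums to `total`."""
    parts = np.asarray(parts, dtype=float)
    return total * parts / parts.sum(axis=1, keepdims=True)


rng = np.random.default_rng(0)

# Hypothetical raw amounts of three components, generated independently.
raw = rng.lognormal(mean=[3.0, 2.0, 1.0], sigma=0.5, size=(500, 3))

# Express every sample in percent: each row now sums to 100.
closed = close(raw, total=100.0)

raw_corr = np.corrcoef(raw, rowvar=False)
closed_cov = np.cov(closed, rowvar=False)
closed_corr = np.corrcoef(closed, rowvar=False)

print("correlations of raw amounts:\n", raw_corr.round(2))
print("correlations after closure:\n", closed_corr.round(2))
# Each row of the covariance matrix of closed data sums to zero, so every
# component must covary negatively with at least one other component.
print("row sums of closed covariance:", closed_cov.sum(axis=1).round(8))
```

The negative covariances here are produced by the constant-sum constraint alone, not by any relationship between the simulated components.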
The first step in statistical data analysis of environmental data
should be to “look” at the data with appropriate graphical tools
(Reimann et al., 2008). Typically, a histogram is inspected in order to
obtain an idea about the data distribution, or a boxplot is drawn to
show the median, skewness and tailedness of the distribution and to
identify data outliers. In addition, it will be of interest to estimate the
mean, the variance, and possibly further statistical summary
measures that characterize the observed data.
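
The standard tasks listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the concentrations are synthetic lognormal values, the helper name `summarize` is hypothetical, and outliers are flagged with the usual boxplot rule of 1.5 times the interquartile range. The log transformation appears purely as a familiar example of a transformation; the sketch does not suggest that it is the appropriate way to open closed data.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical right-skewed concentrations in mg/kg (synthetic data).
conc = rng.lognormal(mean=1.0, sigma=1.0, size=1000)


def summarize(x):
    """Return basic summary measures and the number of boxplot outliers."""
    x = np.asarray(x, dtype=float)
    q1, median, q3 = np.percentile(x, [25, 50, 75])
    iqr = q3 - q1
    # Boxplot fences: values beyond 1.5 * IQR from the quartiles are flagged.
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "mean": x.mean(),
        "variance": x.var(ddof=1),  # sample variance
        "median": median,
        "q1": q1,
        "q3": q3,
        "n_outliers": int(((x < lower) | (x > upper)).sum()),
    }


raw_summary = summarize(conc)
log_summary = summarize(np.log(conc))

# Histogram counts on the original scale; most values fall in the first bins.
counts, edges = np.histogram(conc, bins=20)

print("original scale:", raw_summary)
print("log scale:     ", log_summary)
print("histogram counts:", counts)
```

On such a sample the arithmetic mean lies above the median and the boxplot rule flags many values on the original scale, which is why the choice between original and transformed data matters for even these simple summaries.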
One basic question when performing these standard tasks is
whether the original data or transformed data should be used. In
environmental sciences many data are strongly right-skewed, and a
histogram of the original data may be almost uninformative due to
the presence of some extreme outliers. Calculating the arithmetic
mean for right-skewed data will result in a biased (too high) estimate
Science of the Total Environment xxx (2009) xxx–xxx
⁎ Corresponding author. Tel.: +43 1 58801 10733; fax: +43 1 58801 10799.
E-mail addresses: P.Filzmoser@tuwien.ac.at (P. Filzmoser), hronk@seznam.cz
(K. Hron), Clemens.Reimann@ngu.no (C. Reimann).
0048-9697/$ – see front matter ©2009 Elsevier B.V. All rights reserved.
doi:10.1016/j.scitotenv.2009.08.008