Univariate statistical analysis of environmental (compositional) data: Problems and possibilities

Peter Filzmoser a,⁎, Karel Hron b, Clemens Reimann c

a Institute of Statistics and Probability Theory, Vienna University of Technology, Wiedner Hauptstr. 8-10, A-1040 Wien, Austria
b Department of Mathematical Analysis and Applications of Mathematics, Faculty of Science, Palacký University Olomouc, 17. listopadu 12, CZ-77100 Olomouc, Czech Republic
c Geological Survey of Norway, N-7491 Trondheim, Norway

Article history: Received 8 May 2009; Received in revised form 24 July 2009; Accepted 5 August 2009; Available online xxxx

Keywords: Compositional data; Closure problem; Univariate statistical analysis; Exploratory data analysis; Log transformation

Abstract

For almost 30 years it has been known that compositional (closed) data have special geometrical properties. In environmental sciences, where the concentration of chemical elements in different sample materials is investigated, almost all datasets are compositional. In general, compositional data are parts of a whole which only give relative information. Data that sum up to a constant, e.g. 100 wt.% or 1,000,000 mg/kg, are the best-known example. It is widely neglected that the “closure” characteristic remains even if only one of all possible elements is measured; it is an inherent property of compositional data. No variable is free to vary independently of all the others. Existing transformations to “open” closed data are seldom applied. They are more complicated than a log transformation, and the relationship to the original data unit is lost. Results obtained when using classical statistical techniques for data analysis appeared reasonable, and the possible consequences of working with closed data were rarely questioned. Here the simple univariate case of data analysis is investigated.
It can be demonstrated that data closure must be overcome prior to calculating even simple statistical measures like the mean or standard deviation, or plotting graphs of the data distribution, e.g. a histogram. Some measures like the standard deviation (or the variance) make no statistical sense with closed data, and all statistical tests building on the standard deviation (or variance) will thus provide erroneous results if used with the original data.

© 2009 Elsevier B.V. All rights reserved.

1. Introduction

A classical example of a closed array or closed number system is a data set in which the individual variables are not independent of each other but are related by being expressed as a percentage or parts per million, as almost all environmental data are. Compositional data have historically been defined as summing up to a constant, but nowadays they have a broader definition, as they are considered to be parts of a whole which only give relative information (see Buccianti and Pawlowsky-Glahn, 2005, for an example). This definition thus also includes data that do not sum up to a constant.

The problems of undertaking statistical analyses with “closed number systems” have been discussed extensively in the specialized literature for more than 30 years, mostly in connection with multivariate data analysis (e.g. Chayes, 1960; Butler, 1976; Le Maitre, 1982; Woronow and Butler, 1986; Aitchison, 1986, 2008). However, the mathematical formalism is difficult, and the consequences of using classical statistics for compositional data have thus never reached the wider environmental community. Data closure has often been treated as a topic for mathematical freaks, and it has been intuitively assumed that this issue might have consequences only for multivariate data analysis or if major elements are considered in data analysis.
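The dependence imposed by a constant sum can be illustrated with a small simulation. The following is a minimal sketch, not taken from the paper: the data are synthetic, the helper names `close` and `pearson` are hypothetical, and the choice of three log-normally distributed parts is an assumption made only for illustration. Three independent positive amounts are generated per sample and then rescaled to sum to 100, which mimics reporting in wt.%.

```python
import random
import statistics


def close(parts, total=100.0):
    """Rescale positive parts so that they sum to `total` (closure)."""
    s = sum(parts)
    return [total * p / s for p in parts]


def pearson(x, y):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


# Synthetic data (an assumption for illustration): three independent,
# right-skewed amounts per sample, drawn from a log-normal distribution.
rng = random.Random(0)
raw = [[rng.lognormvariate(0.0, 0.5) for _ in range(3)] for _ in range(2000)]

# Closure: express every sample as percentages of its own total.
closed = [close(row) for row in raw]

# Correlation between the first two parts, before and after closure.
r_raw = pearson([r[0] for r in raw], [r[1] for r in raw])
r_closed = pearson([c[0] for c in closed], [c[1] for c in closed])

print(f"correlation of raw amounts:    {r_raw:+.3f}")
print(f"correlation of closed amounts: {r_closed:+.3f}")
```

Although the raw amounts are generated independently, the closed parts are negatively correlated: because each row sums to a constant, an increase in one part must be compensated by the others. For three exchangeable parts the theoretical correlation between any two closed parts is -0.5, irrespective of the distribution of the raw amounts.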
Using practical examples, this paper demonstrates what happens if classical statistical methods are applied indiscriminately to environmental data in the simple univariate case.

The first step in statistical data analysis of environmental data should be to “look” at the data with appropriate graphical tools (Reimann et al., 2008). Typically, a histogram is inspected in order to obtain an idea about the data distribution, or a boxplot is drawn to show the median, skewness and tailedness of the distribution and to identify data outliers. In addition, it will be of interest to estimate the mean, the variance, and probably further statistical summary measures that characterize the observed data.

⁎ Corresponding author. Tel.: +43 1 58801 10733; fax: +43 1 58801 10799.
E-mail addresses: P.Filzmoser@tuwien.ac.at (P. Filzmoser), hronk@seznam.cz (K. Hron), Clemens.Reimann@ngu.no (C. Reimann).
Science of the Total Environment xxx (2009) xxx–xxx. Journal homepage: www.elsevier.com/locate/scitotenv
0048-9697/$ – see front matter. doi:10.1016/j.scitotenv.2009.08.008

One basic question when performing these standard tasks is whether the original data or transformed data should be used. In environmental sciences many data are strongly right-skewed; a histogram of the original data may be almost uninformative due to the presence of some extreme outliers. Calculating the arithmetic mean for right-skewed data will result in a biased (too high) estimate