Vegetatio 68: 139-143, 1987
© Dr W. Junk Publishers, Dordrecht - Printed in the Netherlands
Two-step vegetation analysis based on very large data sets*
Eddy Van der Maarel1, Ileana Espejel2 & Patricia Moreno-Casasola3,**
1Institute of Ecological Botany, Uppsala University, Box 559, S-75122 Uppsala, Sweden
2Permanent address: Centro de Recursos Bioticos de la Peninsula de Yucatan, INIREB, Apto. postal 281,
97000 Merida, Yucatan, Mexico
3Permanent address: Laboratorio de Ecologia, Facultad de Ciencias, UNAM, 04510 Mexico, D.F., Mexico
Keywords: Classification, Composite sample, Dune vegetation, Large data set, Sample stratification,
Synoptic value, Yucatan
Abstract
A two-step method for the classification of very large phytosociological data sets is demonstrated.
Stratification of the set is suggested either by area in the case of a large and geographically heterogeneous region,
or by vegetation type in the case of a set covering all the plant communities of an area. First, cluster analysis
is performed on each subset. The resulting basic clusters are summarized by calculating a 'synoptic
cover-abundance value' for each species in each cluster. All basic clusters are then subjected to the same procedure.
Second order clusters are interpreted as community types. The synoptic value proposed reflects both
frequency and average cover-abundance. It is emphasized that a species should have a high frequency to be used as
a diagnostic species.
The method is demonstrated with a set of 1138 relevés and 250 species of coastal sand dune vegetation
in Yucatan treated with the programs TWINSPAN and TABORD. Some problems and perspectives of the
approach are discussed in the light of hierarchy theory and classification theory.
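The summarizing step between the two clustering rounds can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the abstract says that the synoptic value reflects both frequency and average cover-abundance but does not give its formula here, so the product used below is an assumed stand-in, not the authors' definition. The function name `synoptic_table` and the species labels are hypothetical, and the cluster labels are taken as given rather than produced by TWINSPAN or TABORD.

```python
from collections import defaultdict


def synoptic_table(releves, labels):
    """Summarize each basic cluster by one value per species.

    releves: list of dicts mapping species name -> cover-abundance score (> 0)
    labels:  list of cluster ids, one per releve
    Returns {cluster: {species: (frequency, mean_cover, synoptic)}}.
    """
    # Group the releves by the basic cluster they were assigned to.
    members = defaultdict(list)
    for releve, label in zip(releves, labels):
        members[label].append(releve)

    table = {}
    for label, group in members.items():
        totals = defaultdict(float)  # summed cover per species
        counts = defaultdict(int)    # number of releves containing the species
        for releve in group:
            for species, cover in releve.items():
                totals[species] += cover
                counts[species] += 1

        row = {}
        for species in totals:
            frequency = counts[species] / len(group)
            mean_cover = totals[species] / counts[species]
            # ASSUMPTION: frequency and mean cover are combined by
            # multiplication; the paper's own formula is not shown here.
            row[species] = (frequency, mean_cover, frequency * mean_cover)
        table[label] = row
    return table


# Hypothetical toy data: three releves assigned to two basic clusters.
releves = [{"sp_a": 4, "sp_b": 2}, {"sp_a": 6}, {"sp_c": 8}]
labels = [0, 0, 1]
table = synoptic_table(releves, labels)
```

Each row of the resulting table can then be treated as a composite sample and passed to the same clustering program for the second step.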
Introduction
Multivariate treatment of very large data sets has
become possible since the development of clustering
programs such as CLUSLA (Louppen & Van der
Maarel, 1979), TWINSPAN (Hill, 1979), COMPCLUS
(Gauch, 1980) and FLEXCLUS (Van Tongeren,
in prep.), and the ordination package DECORANA
(Hill, 1979). The main aim of such
treatments is usually to obtain a plant community
classification of the material, and we will therefore
discuss only the clustering of large data sets.
*Nomenclature follows Sosa et al. (1985), Etnoflora Yucatanense.
**We thank Drs Mike Dale, Henk Doing and Colin Prentice for
comments on the manuscript.
Whereas CLUSLA and COMPCLUS start from
a limited number of initial clusters, TWINSPAN
and FLEXCLUS allow a full hierarchical treatment
of all analyses (relevés) involved. Most such programs
make use of so-called condensed storage of
data matrices as devised by Hill (e.g., 1979; see also
Gauch, 1982) and can now deal with thousands of
relevés simultaneously with low CPU demands.
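The idea behind condensed storage can be illustrated with a minimal Python sketch: since most species are absent from most relevés, only the non-zero entries are kept, each as a (species index, value) couplet. The function names `condense` and `expand` are hypothetical, and the sketch does not reproduce the file layout actually used by Hill's programs.

```python
def condense(matrix):
    """Keep only non-zero entries of a releve x species matrix.

    Returns one list of (species_index, value) couplets per releve.
    """
    return [
        [(j, value) for j, value in enumerate(row) if value != 0]
        for row in matrix
    ]


def expand(condensed, n_species):
    """Rebuild the full releve x species matrix from the couplets."""
    dense = []
    for couplets in condensed:
        row = [0] * n_species
        for j, value in couplets:
            row[j] = value
        dense.append(row)
    return dense


# Hypothetical toy matrix: 3 releves x 4 species, mostly zeros.
dense = [
    [0, 3, 0, 0],
    [5, 0, 0, 1],
    [0, 0, 0, 0],
]
condensed = condense(dense)
restored = expand(condensed, n_species=4)
```

In this form the storage needed grows with the number of species occurrences rather than with the product of relevés and species, which is what makes very sparse tables of thousands of relevés manageable.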
Very large data sets have been brought together,
for instance (1) the salt marsh data set of the Working
Group for Data-Processing with nearly 7 000 relevés
and over 9900 species (Van der Maarel et al.,
1976; Kortekaas et al., 1980) and (2) the data set
collected in the British project 'National Vegetation
Classification' with 35 000 relevés and 3 000 species
(see Huntley et al., 1981).
The question discussed in this paper is whether
we should really apply the new powerful programs
directly to very large data sets. Little has been said
about the problems involved. Gauch (1982) men-
tioned the communication and assimilation of the
results of hierarchical classification as a limitation.
Van der Maarel (1982) thought that it might be in-
effective to treat a very heterogeneous data set, even
with very fast programs such as DECORANA and