Vegetatio 68: 139-143, 1987
© Dr W. Junk Publishers, Dordrecht - Printed in the Netherlands

Two-step vegetation analysis based on very large data sets*

Eddy Van der Maarel1, Ileana Espejel2 & Patricia Moreno-Casasola3,**

1Institute of Ecological Botany, Uppsala University, Box 559, S-75122 Uppsala, Sweden
2Permanent address: Centro de Recursos Bioticos de la Peninsula de Yucatan, INIREB, Apto. postal 281, 97000 Merida, Yucatan, Mexico
3Permanent address: Laboratorio de Ecologia, Facultad de Ciencias, UNAM, 04510 Mexico, D.F., Mexico

Keywords: Classification, Composite sample, Dune vegetation, Large data set, Sample stratification, Synoptic value, Yucatan

Abstract

A two-step method for the classification of very large phytosociological data sets is demonstrated. Stratification of the set is suggested either by area in the case of a large and geographically heterogeneous region, or by vegetation type in the case of a set covering all the plant communities of an area. First, cluster analysis is performed on each subset. The resulting basic clusters are summarized by calculating a 'synoptic cover-abundance value' for each species in each cluster. All basic clusters are then subjected to the same procedure. Second order clusters are interpreted as community types. The synoptic value proposed reflects both frequency and average cover-abundance. It is emphasized that a species should have a high frequency to be used as a diagnostic species. The method is demonstrated with a set of 1138 relevés and 250 species of coastal sand dune vegetation in Yucatan treated with the programs TWINSPAN and TABORD. Some problems and perspectives of the approach are discussed in the light of hierarchy theory and classification theory.
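The summarizing step between the two clustering passes can be illustrated with a minimal Python sketch. It is not the procedure used in the paper: the clustering itself (TWINSPAN, TABORD) is left out and cluster labels are taken as given, the function names `synoptic_value` and `summarize_clusters` are hypothetical, the species names and values are made-up example data, and the score (frequency multiplied by mean cover-abundance where present) is only one assumed way of combining the two quantities named above.

```python
from collections import defaultdict


def synoptic_value(values):
    """Illustrative synoptic score for one species in one basic cluster.

    `values` holds the species' cover-abundance in every releve of the
    cluster (0 = absent). ASSUMPTION: score = frequency * mean cover where
    present. This is a stand-in, not the formula defined in the paper.
    """
    present = [v for v in values if v > 0]
    if not present:
        return 0.0
    frequency = len(present) / len(values)
    mean_cover = sum(present) / len(present)
    return frequency * mean_cover


def summarize_clusters(releves, labels):
    """Collapse releves into one composite sample per basic cluster.

    releves: list of dicts mapping species name -> cover-abundance value
    labels:  cluster label of each releve, from any clustering program
    """
    members = defaultdict(list)
    for releve, label in zip(releves, labels):
        members[label].append(releve)

    table = {}
    for label, group in members.items():
        species = set().union(*group)  # every species seen in the cluster
        table[label] = {
            sp: synoptic_value([r.get(sp, 0) for r in group])
            for sp in species
        }
    return table


# Made-up example: three releves already assigned to two basic clusters.
releves = [
    {"species_a": 4, "species_b": 2},
    {"species_a": 2},
    {"species_c": 5},
]
labels = ["A", "A", "B"]
table = summarize_clusters(releves, labels)
```

Each row of `table` is a composite sample. In the second step these rows, rather than the original relevés, would be passed to the same clustering program.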
*Nomenclature follows Sosa et al. (1985), Etnoflora Yucatanense
**We thank Drs Mike Dale, Henk Doing and Colin Prentice for comments on the manuscript.

Introduction

Multivariate treatment of very large data sets has become possible since the development of clustering programs such as CLUSLA (Louppen & Van der Maarel, 1979), TWINSPAN (Hill, 1979), COMPCLUS (Gauch, 1980) and FLEXCLUS (Van Tongeren, in prep.), and the ordination package DECORANA (Hill, 1979). The main aim of such treatments is usually to obtain a plant community classification of the material and we will therefore only discuss clustering of large data sets.

Whereas CLUSLA and COMPCLUS start from a limited number of initial clusters, TWINSPAN and FLEXCLUS allow a full hierarchical treatment of all analyses (relevés) involved. Most of such programs make use of so-called condensed storage of data matrices as devised by Hill (e.g., 1979, see also Gauch, 1982) and can now deal with thousands of relevés simultaneously with low CPU demands.

Very large data sets have been brought together for instance (1) the salt marsh data set of the Working Group for Data-Processing with nearly 7 000 relevés and over 9900 species (Van der Maarel et al., 1976; Kortekaas et al., 1980) and (2) the data set collected in the British project 'National Vegetation Classification' with 35 000 relevés and 3 000 species. (See Huntley et al., 1981).

The question discussed in this paper is whether we should really apply the new powerful programs directly to very large data sets. Little has been said about the problems involved. Gauch (1982) mentioned the communication and assimilation of the results of hierarchical classification as a limitation. Van der Maarel (1982) thought that it might be ineffective to treat a very heterogeneous data set, even with very fast programs such as DECORANA and