Ecology and Evolution 2017; 1–19
www.ecolevol.org
Received: 27 April 2016 | Revised: 23 November 2016 | Accepted: 22 December 2016
DOI: 10.1002/ece3.2760
ORIGINAL RESEARCH
Resemblance profiles as clustering decision criteria: Estimating
statistical power, error, and correspondence for a hypothesis
test for multivariate structure
Joshua P. Kilborn | David L. Jones | Ernst B. Peebles | David F. Naar
This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium,
provided the original work is properly cited.
© 2017 The Authors. Ecology and Evolution published by John Wiley & Sons Ltd.
College of Marine Science, University of South
Florida, Saint Petersburg, FL, USA
Correspondence
Joshua P. Kilborn, College of Marine Science,
University of South Florida, Saint Petersburg,
FL, USA.
Email: jpk@mail.usf.edu
Funding information
National Oceanic and Atmospheric
Administration, Grant/Award Number:
NA10NMF4550468.
Abstract
Clustering data continues to be a highly active area of data analysis, and resemblance
profiles are being incorporated into ecological methodologies as a hypothesis
testing-based approach to clustering multivariate data. However, these new clustering
techniques have not been rigorously tested to determine the performance variability
based on the algorithm’s assumptions or any underlying data structures. Here, we use
simulation studies to estimate the statistical error rates for the hypothesis test for
multivariate structure based on dissimilarity profiles (DISPROF). We concurrently
tested a widely used algorithm that employs the unweighted pair group method with
arithmetic mean (UPGMA) to estimate the proficiency of clustering with DISPROF as a
decision criterion. We simulated unstructured multivariate data from different
probability distributions with increasing numbers of objects and descriptors, and
grouped data with increasing overlap, overdispersion for ecological data, and correlation
among descriptors within groups. Using simulated data, we measured the resolution
and correspondence of clustering solutions achieved by DISPROF with UPGMA against
the reference grouping partitions used to simulate the structured test datasets. Our
results highlight the dynamic interactions between dataset dimensionality, group
overlap, and the properties of the descriptors within a group (i.e., overdispersion or
correlation structure) that are relevant to resemblance profiles as a clustering criterion
for multivariate data. These methods are particularly useful for multivariate ecological
datasets that benefit from distance-based statistical analyses. We propose guidelines
for using DISPROF as a clustering decision tool that will help future users avoid
potential pitfalls during the application of methods and the interpretation of results.
KEYWORDS
constrained clustering, data simulation, Monte Carlo, permutation testing, PRIMER-E, SIMPROF
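To make the idea of a dissimilarity-profile test concrete, the following is a minimal sketch of a SIMPROF-style permutation test, not the authors' DISPROF implementation. It assumes Euclidean distance as the dissimilarity measure, builds the null reference by shuffling each descriptor independently across objects, and uses the summed absolute deviation between the observed and mean permuted profiles as the test statistic. All function names, the permutation counts, and the toy dataset are hypothetical choices made for illustration.

```python
import random

def euclidean(a, b):
    """Euclidean distance between two objects (rows)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def profile(data):
    """Dissimilarity profile: all pairwise distances, sorted ascending."""
    n = len(data)
    return sorted(euclidean(data[i], data[j])
                  for i in range(n) for j in range(i + 1, n))

def permute_columns(data, rng):
    """Shuffle each descriptor independently across objects.

    This destroys any multivariate structure while keeping each
    descriptor's own set of values intact.
    """
    cols = [list(col) for col in zip(*data)]
    for col in cols:
        rng.shuffle(col)
    return [list(row) for row in zip(*cols)]

def mean_profile(profiles):
    """Rank-wise mean of several sorted profiles."""
    return [sum(vals) / len(vals) for vals in zip(*profiles)]

def pi_stat(prof, ref):
    """Summed absolute deviation of a profile from the reference profile."""
    return sum(abs(p - r) for p, r in zip(prof, ref))

def profile_test(data, n_mean=100, n_null=199, seed=0):
    """Permutation test of the null hypothesis 'no multivariate structure'."""
    rng = random.Random(seed)
    # Reference profile expected when the data are unstructured.
    ref = mean_profile([profile(permute_columns(data, rng))
                        for _ in range(n_mean)])
    observed = pi_stat(profile(data), ref)
    # Null distribution of the statistic from a fresh set of permutations.
    null = [pi_stat(profile(permute_columns(data, rng)), ref)
            for _ in range(n_null)]
    p_value = (1 + sum(s >= observed for s in null)) / (n_null + 1)
    return observed, p_value

# Toy dataset (hypothetical): two well-separated groups of 10 objects,
# each described by 4 descriptors.
gen = random.Random(42)
data = ([[gen.gauss(0, 1) for _ in range(4)] for _ in range(10)] +
        [[gen.gauss(10, 1) for _ in range(4)] for _ in range(10)])

stat, p_value = profile_test(data)
print(f"pi = {stat:.2f}, p = {p_value:.3f}")
```

When a test of this kind is used as a clustering decision criterion, it would be applied at each node of a dendrogram (for example, one built with UPGMA), and a node would be split further only when the null hypothesis of no structure is rejected.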
1 | INTRODUCTION
In data-rich scientific studies, it is often necessary to apply a clustering
algorithm to detect groups of homogenous objects with respect to a
set of descriptors (i.e., measured variables). Detection of groups is useful
in ecology, economics, genetics, and other disciplines that analyze
large, multidimensional datasets. Clustering techniques for multivariate
datasets are diverse and can be drawn from methods derived from