Integrated Feature Selection and Clustering for
Taxonomic Problems within Fish Species
Complexes
Huimin Chen
Dept. of Electrical Engineering
University of New Orleans
New Orleans, LA 70148
Email: hchen2@uno.edu

Henry L. Bart, Jr.
Dept. of Ecology & Evolutionary Biology
Tulane University
New Orleans, LA 70118
Email: hank@museum.tulane.edu

Shuqing Huang
General Dynamics Information Technology
1201 Elmwood Park Blvd
New Orleans, LA 70123
Email: sophie.huang@gdit.com
Abstract— As computer and database technologies advance
rapidly, biologists all over the world can share biologically
meaningful data from images of specimens and use the
data to classify the specimens taxonomically. Accurate shape
analysis of a specimen from multiple views of 2D images
is crucial for finding diagnostic features using geometric
morphometric techniques. We propose an integrated fea-
ture selection and clustering framework that automatically
identifies a set of feature variables to group specimens
into a binary cluster tree. The candidate features are
generated from reconstructed 3D shape and local saliency
characteristics from 2D images of the specimens. A Gaussian
mixture model is used to estimate the significance value
of each feature and control the false discovery rate in the
feature selection process so that the clustering algorithm can
efficiently partition the specimen samples into clusters that
may correspond to different species. Experiments on a
taxonomic problem involving species of suckers in the genus
Carpiodes demonstrate promising results using the proposed
framework with only a small number of samples.
Index Terms— feature selection, clustering, taxonomy, shape
analysis, false discovery rate, image fusion
I. INTRODUCTION
Biologists have traditionally consulted field guides and
other published works to identify species that they en-
counter in the field and to summarize what is known
about the biology of those species. However, these guides
rarely contain complete information on species identity,
distribution and biology. Much of this information resides
with specimens in natural history museums, inaccessible
to most biologists. Existing information systems of natural
history museums are mainly taxonomically focused. They
are designed to give the research community global access
to specimen information for various named species or
higher taxonomic groups. However, the names assigned
to specimens are not always the most up-to-date, or the
specimens may belong to groups that have not been
studied and fully resolved taxonomically.

This paper is based on “Integrated Feature Selection and Clustering
from Multiple Views for a Taxonomic Problem,” by H. Chen, H. L. Bart,
and S. Huang, which appeared in the Proceedings IEEE 9th Workshop
on Multimedia Signal Processing (MMSP 2007), Chania, Crete, Greece,
October 2007, © 2007 IEEE.
This work was supported in part by Air Force Research Lab #
FA8650-07-M-1161 and Navy Air through Planning Systems Inc.
Contract # N68335-05-C-0382.

The job of identifying and describing new species
and determining interrelationships among species falls on
taxonomists and systematists. Taxonomy and systematics,
as traditionally practiced, can be painfully slow. The
reason for this is that taxonomists typically have to
examine and gather data from large numbers of specimens
across broad geographical areas in order to identify the
features that uniquely distinguish a new species from related
known species. As a consequence, it is estimated that
only 10% of the world’s species have been discovered
and described. The pace of new species discovery and
description would speed up significantly if multimedia
and machine learning techniques could be developed to
automatically identify diagnostic features of specimens
archived in natural history museums.
Geometric morphometrics [20] is a well-developed
technique that has been widely used in diagnosing fish species
[1]–[3]. The idea is to use landmarks to characterize
shape variation among the specimens of different species.
Computer-based statistical methods such as multivariate
analysis [4] are often applied to various taxonomic
problems, with many success stories [23]. However,
understanding why and how morphological differences
have emerged is challenging since body shape has a
genetic basis but is also subject to epigenetic and
environmental processes. An alternative is to apply outline
analysis [11] or eigenshape analysis [12], which exploit more
information than the homologous landmarks alone to
derive biologically meaningful features. With advances
in efficient machine learning and data mining algorithms
[13], a new computational framework has been developed
[8] to jointly select features and classify fish species.
One interesting question is whether a good clustering
algorithm can automatically select useful features to quan-
titatively compare the similarity among specimens.
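To make the landmark idea above concrete, the following is a minimal sketch of Procrustes superimposition, a standard preprocessing step in landmark-based geometric morphometrics. It is not taken from this paper: the function name `procrustes_align` and the landmark coordinates are hypothetical, and only two specimens with 2D landmarks are assumed. The sketch removes location, size, and orientation so that the residual distance reflects shape difference alone.

```python
import numpy as np


def procrustes_align(reference, shape):
    """Align `shape` to `reference` (both k x 2 landmark arrays).

    Returns the aligned landmarks and the residual Procrustes distance.
    Hypothetical helper for illustration only.
    """
    ref = np.asarray(reference, dtype=float)
    shp = np.asarray(shape, dtype=float)

    # Remove location: move each configuration's centroid to the origin.
    ref = ref - ref.mean(axis=0)
    shp = shp - shp.mean(axis=0)

    # Remove size: scale each configuration to unit centroid size.
    ref = ref / np.linalg.norm(ref)
    shp = shp / np.linalg.norm(shp)

    # Remove orientation: rotation minimizing ||shp @ R - ref|| via SVD.
    u, _, vt = np.linalg.svd(shp.T @ ref)
    rotation = u @ vt
    if np.linalg.det(rotation) < 0:
        # Exclude reflections by flipping the weakest singular direction.
        u[:, -1] *= -1
        rotation = u @ vt

    aligned = shp @ rotation
    distance = np.linalg.norm(aligned - ref)
    return aligned, distance


# Made-up landmarks for two specimens (assumed values, not real data).
specimen_a = np.array([[0.0, 0.0], [4.0, 0.0], [5.0, 1.5], [2.0, 2.5], [-0.5, 1.0]])
specimen_b = np.array([[1.0, 1.0], [8.6, 1.4], [10.9, 4.6], [5.1, 6.3], [0.2, 3.1]])

_, shape_distance = procrustes_align(specimen_a, specimen_b)
print(f"Procrustes distance between specimens: {shape_distance:.4f}")
```

The aligned coordinates, or distances derived from them, are the kind of shape variables that multivariate analysis would then operate on. Generalized Procrustes analysis extends the same idea to many specimens by aligning each to an iteratively updated mean shape.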
Feature selection algorithms for clustering largely fall
into three categories: the filter model [10], the wrapper
model [6], and the hybrid model [9]. The filter model
10 JOURNAL OF MULTIMEDIA, VOL. 3, NO. 3, JULY 2008
© 2008 ACADEMY PUBLISHER