Integrated Feature Selection and Clustering for Taxonomic Problems within Fish Species Complexes

Huimin Chen
Dept. of Electrical Engineering
University of New Orleans
New Orleans, LA 70148
Email: hchen2@uno.edu

Henry L. Bart, Jr.
Dept. of Ecology & Evolutionary Biology
Tulane University
New Orleans, LA 70118
Email: hank@museum.tulane.edu

Shuqing Huang
General Dynamics Information Technology
1201 Elmwood Park Blvd
New Orleans, LA 70123
Email: sophie.huang@gdit.com

Abstract: As computer and database technologies advance rapidly, biologists all over the world can share biologically meaningful data from images of specimens and use the data to classify the specimens taxonomically. Accurate shape analysis of a specimen from multiple views of 2D images is crucial for finding diagnostic features using geometric morphometric techniques. We propose an integrated feature selection and clustering framework that automatically identifies a set of feature variables to group specimens into a binary cluster tree. The candidate features are generated from reconstructed 3D shape and local saliency characteristics from 2D images of the specimens. A Gaussian mixture model is used to estimate the significance value of each feature and control the false discovery rate in the feature selection process so that the clustering algorithm can efficiently partition the specimen samples into clusters that may correspond to different species. Experiments on a taxonomic problem involving species of suckers in the genus Carpiodes demonstrate promising results using the proposed framework with only a small number of samples.

Index Terms: feature selection, clustering, taxonomy, shape analysis, false discovery rate, image fusion

I. INTRODUCTION

Biologists have traditionally consulted field guides and other published works to identify species that they encounter in the field and to summarize what is known about the biology of those species.
However, these guides rarely contain complete information on species identity, distribution and biology. Much of this information resides with specimens in natural history museums, inaccessible to most biologists.

Existing information systems of natural history museums are mainly taxonomically focused. They are designed to give the research community global access to specimen information for various named species or higher taxonomic groups. However, the names assigned to specimens are not always the most up-to-date, or the specimens may belong to groups that have not been studied and fully resolved taxonomically.

This paper is based on “Integrated Feature Selection and Clustering from Multiple Views for a Taxonomic Problem,” by H. Chen, H. L. Bart, and S. Huang, which appeared in the Proceedings IEEE 9th Workshop on Multimedia Signal Processing (MMSP 2007), Chania, Crete, Greece, October 2007, © 2007 IEEE. This work was supported in part by Air Force Research Lab # FA8650-07-M-1161 and Navy Air through Planning Systems Inc. Contract # N68335-05-C-0382.

The job of identifying and describing new species and determining interrelationships among species falls on taxonomists and systematists. Taxonomy and systematics, as traditionally practiced, can be painfully slow, because taxonomists typically have to examine and gather data from large numbers of specimens across broad geographical areas in order to identify the features that uniquely diagnose a new species from related known species. As a consequence, it is estimated that only 10% of the world’s species have been discovered and described. The pace of new species discovery and description would speed up significantly if multimedia and machine learning techniques could be developed to automatically identify diagnostic features of specimens archived in natural history museums.

Geometric morphometrics [20], a well-developed technique, has been widely used in diagnosing fish species [1]–[3].
The idea is to use landmarks to characterize shape variation among the specimens of different species. Computer-based statistical methods such as multivariate analysis [4] are often applied to various taxonomic problems, with many success stories [23]. However, understanding why and how morphological differences have emerged is challenging, since body shape has a genetic basis but is also subject to epigenetic and environmental processes. An alternative is to apply outline analysis [11] or eigenshape analysis [12], which explore more information than the homologous landmarks alone to derive biologically meaningful features. With advances in efficient machine learning and data mining algorithms [13], a new computational framework has been developed [8] to jointly select features and classify fish species.

JOURNAL OF MULTIMEDIA, VOL. 3, NO. 3, JULY 2008. © 2008 ACADEMY PUBLISHER

One interesting question is whether a good clustering algorithm can automatically select useful features to quantitatively compare the similarity among specimens. Feature selection algorithms for clustering largely fall into three categories: the filter model [10], the wrapper model [6], and the hybrid model [9]. The filter model