Comparing and Clustering Flow Cytometry Data *

Lin Liu, Li Xiong, James J. Lu
Department of Math/CS
Emory University
lliu24, lxiong, jlu@mathcs.emory.edu

Kim M. Gernert
BimCore
Emory University
gernert@emory.edu

Vicki Hertzberg
Department of Biostatistics
Emory University
vhertzb@sph.emory.edu

Abstract

The flow cytometry technique produces large, multidimensional datasets of properties of individual cells that are helpful for biomedical science and clinical research. This paper explores an approach for comparing and clustering flow cytometry data. To overcome challenges posed by the irregularities and the high dimensions of the data, we develop a set of data preprocessing techniques to facilitate effective clustering of flow cytometry data files. We present a set of experiments using real data from the Protective Immunity Project (PIP) showing the effectiveness of the approach.

1. Introduction

Flow Cytometry (FCM) [7] is a technique used in clinical research for studying the immunological status of patients with vaccines or other immunotherapies, for characterizing cancer, HIV/AIDS infection and other diseases, as well as for research and therapy involving stem cell manipulations. The technique measures the characteristics of single cells, determined by visible and fluorescent light emissions from the markers on the cells. As the liquid flow moves the suspended, labeled cells past a laser that emits light at a particular wavelength, the specific markers attached to the cell fluoresce. The fluorescence emission from each cell is collected, and subsequent electrical events are analyzed on a computer that assigns a fluorescence intensity value to each signal in Flow Cytometry Standard (FCS) data files. Each FCS data file thus consists of multi-parametric descriptions of thousands to millions of individual cells.

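
To make the shape of such a file concrete, the following minimal sketch builds a synthetic sample as a list of events, one per cell, each holding one intensity value per channel. It is an illustration only and does not parse the FCS format; the channel names, the 0 to 1023 intensity range, and the function name make_synthetic_sample are assumptions introduced here, not taken from the paper or from any real data file.

```python
import random

# Hypothetical channel names; a real FCS file defines its own parameters.
CHANNELS = ["FSC", "SSC", "FL1", "FL2"]


def make_synthetic_sample(n_cells, seed=0):
    """Return a list of events; each event holds one cell's intensity per channel.

    The values are random placeholders in an assumed 0..1023 range,
    not measurements from a cytometer.
    """
    rng = random.Random(seed)
    return [[rng.uniform(0.0, 1023.0) for _ in CHANNELS] for _ in range(n_cells)]


# One sample corresponds to one data file: many cells, a few channels each.
sample = make_synthetic_sample(n_cells=10_000)

# Row view: all channel values for a single cell.
first_cell = dict(zip(CHANNELS, sample[0]))

# Column view: one channel's values across all cells.
fl1_values = [event[CHANNELS.index("FL1")] for event in sample]
```

Viewed this way, a sample is a tall table with one row per cell and one column per channel, which is the layout assumed informally in the rest of this discussion.
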
How such large sets of data points in a highly multidimensional space can be efficiently and systematically analyzed represents a basic yet important challenge. The classical method of analyzing cytometry data files is to plot the different combinations of channels (parameters) two at a time in a 2D scatter plot and then select subgroups of cells using gates. The gates are regions that can have any shape, but usually are rectangular. The cells within the gate are included for further analysis and viewed in another 2D scatter plot with different axes, i.e., other channels. Comparisons between different cytometry data files often depend on human inspection of such 2D visualizations of all the different combinations of channels [4]. The major disadvantage of this method is that it is not only tedious but can also miss potential subgroups of cells, because projecting higher-dimensional data down to two-dimensional spaces can make them indiscernible as separate clusters. In addition, the shape, size and position of the gate are largely dependent on the experience and expectations of the researcher.

* The research is partially supported by the Protective Immunity Project through the NIH grant NO1-AI-50025.

Recently, some clustering algorithms and tools have been developed for FCM data that cluster cells into cell groups based on their intensity patterns using all dimensions (channels) at once [11, 8, 1]. However, comparison between different cytometry data files still remains a challenge, as it is not straightforward how the clusters of cells can be directly compared across different data files.

In this paper, we explore an approach for comparing and clustering FCS data files or samples. Such sample-based clustering presents a number of challenges due to the high dimensions and irregularities of the data.
First, while there may be only tens to hundreds of samples available, the space of potential features consists of thousands to millions of cell intensity data values at multiple channels. This induces an extraordinarily large search space for the parameters of the model. Second, the cells are not recorded in a consistent order across samples; they may appear in any order. This makes feature modeling a challenge, as the data (cell intensity values) are not directly comparable across samples because of the unknown order of the cells.

To address the above challenges, we developed a set of data preprocessing techniques to facilitate effective clustering of FCS data files. The key is to summarize each FCS data file in such a way that the files can be compared and clustered