FM-test: A Fuzzy Set Theory Based Approach for Discovering Diabetes Genes

Yi Lu, Shiyong Lu
Wayne State University
{luyi, shiyong}@wayne.edu

Lily R. Liang, Deepak Kumar
University of the District of Columbia
{lliang, dkumar}@udc.edu

Abstract

Diabetes is a disorder of metabolism that has affected 18.2 million people in the United States. In recent years, researchers have identified many genes that play important roles in the onset, development and progression of diabetes. Identification of these diabetes genes offers better understanding of the molecular mechanisms underlying pathogenesis, which is essential for developing preventative and therapeutic methods. In this paper, we propose an innovative approach, the fuzzy membership test (FM-test), based on fuzzy set theory, to identify diabetes-associated genes from microarray gene expression profiles. A new concept, the FM d-value, is defined to quantify the divergence of two sets of values. Experiments were conducted to study the distribution of d-values and the relationship between the d-value and the significance level of the p-value. We applied FM-test to a gene expression dataset obtained from insulin-sensitive and insulin-resistant people and identified ten significant genes. Six of the ten have been confirmed to be associated with diabetes in the literature and one has been suggested by other researchers. The remaining three genes, , and , are suggested as potential diabetes genes for further biological investigation.

1 Introduction

Diabetes is a group of diseases characterized by high levels of blood glucose resulting from defects in insulin production, insulin action, or both. There are 18.2 million people in the United States, or 6.3% of the population, who have diabetes. Diabetes is also one of the leading causes of death in the U.S. In 2000, it contributed to 213,062 deaths. The risk of death among people with diabetes is about twice that among people without diabetes [1].
The direct and indirect costs of diabetes in the United States for 2002 totaled $132 billion, of which $92 billion were direct medical costs and $40 billion were indirect costs of disability, work loss, premature mortality, etc. [1].

This work was supported by the Agricultural Experiment Station at the University of the District of Columbia (Project No.: DC-0LIANG; Accession No.: 0203877).

Microarray techniques have revolutionized genomic research by making it possible to monitor the expression of thousands of genes in parallel. As microarray data are being produced at an exponential rate, there is a great demand for efficient and effective expression data analysis tools.

The gene expression profile of a cell determines its phenotype and its responses to the environment. These include responses to environmental factors, drugs and therapies. Gene expression patterns can be determined by measuring the quantity of the end product, protein, or of the mRNA template used to synthesize the protein. Comparing the gene expression profiles of diabetes patients with those of their healthy counterparts will enhance our understanding of the disease and identify leads for therapeutic intervention. Several important breakthroughs have been made in the gene expression profiling of diabetes [10, 14, 13]. Patterns of gene expression have been proposed and associated with diabetes [15, 16]. More interestingly, researchers have identified many genes that play important roles in the onset, development, and progression of diabetes. Identification of these diabetes genes offers a route to better understanding of the molecular mechanisms underlying pathogenesis, a necessary prerequisite for the rational development of improved preventative and therapeutic methods.
One effective approach to identifying genes that are associated with diabetes is to measure the divergence of two sets of gene expression values, one from a group of people who are insulin resistant (IR), the other from a group who are insulin sensitive (IS) [17]. A motivating example is shown in Table 1, which records the microarray gene expression values of five genes for two groups of people: five insulin-sensitive humans and five insulin-resistant humans. In order to identify the genes that are associated with diabetes, one needs to determine for each gene whether or not the two sets of expression values are significantly different from each other. One popular method is the t-test [11], which uses the difference of the means of the two sets to measure the divergence. In Table 1, the first four genes are iden-