U Statistics for Microarrays: Normalization, Signal Value Estimation, Gene Expression Profiles

MAYTE SUÁREZ FARIÑAS, ASIFA HAIDER, KNUT M. WITTKOWSKI

Summary

When assessing genomic profiles, it is rare that a single gene is sufficient to represent all aspects of genetic activity. Since the components of complex systems tend to be neither linearly nor hierarchically related, but correlated and of unknown relative importance, the assumptions of traditional multivariate statistical methods often cannot be justified on theoretical grounds. Establishing validity through empirical validation is not only problematic, but also time-consuming. This paper proposes the use of u-statistics for scoring multivariate ordinal data and a family of simple non-parametric tests for analysis. The scoring method is shown to be applicable to scoring the genomic profiles that correlate best with complex responses to an intervention (treatment of psoriasis). Finally, we will demonstrate that the same methodology also leads to less biased estimates for (low-level) signal value estimation in microarray data. We apply this approach to correlating activity of anti-inflammatory drugs along genomic pathways with disease severity of psoriasis based on both clinical and histological parameters.

Key words: multivariate, rank test, ordinal, genomic, profile, risk, microarray

1. INTRODUCTION

When analyzing complex phenomena by means of statistical methods, a single measure often does not adequately reflect all relevant aspects, so that several measures of influences and/or outcomes need to be considered. Sometimes the definitive measure is not easily obtained, so that a set of surrogate measures has to be evaluated. At other times, e.g., when the aim is to ameliorate a complex phenomenon, a definitive measure may not even exist. Such problems may arise in many applications, although here we focus on low- and high-level gene expression analysis.

In our main example, we will focus on the effect of treatment on chronic diseases, in general, and psoriasis, in particular. Psoriasis is a skin disease caused by activation of multiple cell types including keratinocytes, vascular cells, and various types of leukocytes. Treatment efficacy can be measured by histological criteria, by intradermal expression of inflammatory cytokines, or by clinical characteristics, such as redness (vascular response) and scaling (keratinocyte response). Since the advent of microarrays, researchers have become interested in genes whose expression is controlled in a concerted fashion and related to the response.[1]

Most multivariate methods are based on the linear model, either explicitly, as in regression, factor, discriminant, and cluster analysis, or implicitly, as in neural networks. One scores each variable individually on a comparable scale (present/absent, low/intermediate/high, 1 to 10, or z-transformation) and then defines a global score as a weighted average of these scores. In other words, data are interpreted as points in a Euclidean space of (independent) dimensions. The number of dimensions is reduced by assuming the dimensions to be related by a specific function of known type (linear, exponential, etc.), allowing one to determine for each point the Euclidean distance from a hyperspace. While mathematically elegant and computationally efficient, this approach has shortcomings when applied to real-world data. Since the relative importance of the variables, the correlation among them, and the functional relationship of each variable with the immeasurable latent factor ‘efficacy’, ‘safety’, ‘risk’, or ‘overall usefulness’ are typically unknown, construct validity [2] cannot be established on theoretical grounds. Instead, one needs to resort to empirical ‘validation’, choosing weights and functions to provide a reasonable fit with a ‘gold standard’ when applied to a sample.
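
The conventional linear scoring described in this paragraph can be sketched as follows. This is a minimal illustration, assuming Python with NumPy; the function name `linear_global_score`, the four patients, their redness and scaling values, and the two weight vectors are all hypothetical and are not taken from the study.

```python
import numpy as np


def linear_global_score(data, weights):
    """Weighted average of column-wise z-scores (hypothetical helper).

    data    : (n_subjects, n_variables) array of measurements
    weights : one non-negative weight per variable (assumed, not estimated)
    """
    data = np.asarray(data, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # z-transformation: put every variable on a comparable scale
    z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
    # global score: weighted average of the per-variable scores
    return z @ (weights / weights.sum())


# Illustrative values only: columns are redness and scaling.
patients = np.array([
    [1.0, 4.0],  # patient 0: low redness, high scaling
    [4.0, 1.0],  # patient 1: high redness, low scaling
    [2.0, 2.0],  # patient 2: lower than patient 3 on both variables
    [3.0, 3.0],  # patient 3
])

# Two equally defensible (assumed) weightings of the same two variables.
score_a = linear_global_score(patients, weights=[3, 1])
score_b = linear_global_score(patients, weights=[1, 3])

# Subjects ordered from highest to lowest global score under each weighting.
print("ranking, redness weighted 3:1 ->", np.argsort(-score_a))
print("ranking, scaling weighted 3:1 ->", np.argsort(-score_b))
```

With these assumed values, which patient ranks highest depends on the chosen weights, whereas patient 3, who is higher than patient 2 on both variables, outranks patient 2 under either weighting.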
While this allows for a comparison between studies where the researchers agreed to use the same scoring system, the diversity of scoring systems used attests to the subjective nature of this process.

Even when the assumptions of the linear model regarding the contribution to and the relationship with the underlying immeasurable factor are questionable, as in genomics, it is often reasonable to assume that the expression of each gene has at least an ‘orientation’, i.e., that, if all other conditions are held constant, an increase in this gene’s expression is either ‘good’ or ‘bad’. The direction of this orientation can be known (hypothesis testing) or unknown (selection procedures). A higher expression of several related genes may indicate increased disease activity.

When we were faced with the analysis of anal vs. vaginal contacts as risk factors for sexual transmission of HIV,[3] we presented a partial ordering for dealing with graded and ungraded variables, which allowed us to incorporate preexisting knowledge that anal contacts carry more risk without having to ignore the number of vaginal contacts reported. Using the marginal likelihood principle with this partial ordering, we developed a non-