Combining inference and integration for multiple ranked data from genomic studies

Highlights Track Poster W07

Michael G. Schimek (1), Alena Myšičková (2) and Eva Budinská (3)
(1) Institute for Medical Informatics, Statistics and Documentation, Medical University of Graz
(2) Max Planck Institute for Molecular Genetics, Berlin
(3) Swiss Institute of Bioinformatics, Lausanne
E-Mail: mysickov@molgen.mpg.de

Motivation

Lists of common distinct objects in rank order are typical for various omics applications such as the integration of gene expression measurements across experiments and array platforms. The rank of an object indicates its position among all other objects. Our aim is to
• find partial ranked lists of lengths $\hat{k}_j$ which are characterized by high conformity of rankings in their top parts
• integrate the partial ranked lists to obtain a consolidated subset of objects in a new rank order

Notation

$O = \{1, 2, \ldots, N\}$: set of all $N$ objects
$j = 1, 2, \ldots, l$: different experiments or laboratories
$\tau_j$: $j$-th ranked list assigning rank positions to the set $O$
$R_{\tau_j}(i) \in \{1, 2, \ldots, N\}$: rank of object $i$ in list $\tau_j$
$\tau^*$: integrated ranked list of length $k^*$

Distance measures

Kendall's τ
• defined as the number of adjacent pairwise exchanges required to convert one ranking into another
• for two rankings $\tau_1$ and $\tau_2$ of a set $O$ of objects:

$$d_K(\tau_1, \tau_2) = \sum_{\{i,j\} \subset O} \bar{K}_{i,j}(\tau_1, \tau_2)$$

$$\bar{K}_{i,j}(\tau_1, \tau_2) = \begin{cases} 0 & \text{if objects } i \text{ and } j \text{ are ordered the same way in both lists} \\ 1 & \text{otherwise} \end{cases}$$

Modification for incomplete ranked lists: if information about the ordering of objects $i$ and $j$ is missing, $\bar{K}_{i,j}(\tau_1, \tau_2)$ is set to $p$, a penalty parameter with $p \in [0, 1]$.

Spearman's footrule
• defined as the sum of the absolute differences between the ranks of the two lists over all elements
• for two permutations $\tau_1$ and $\tau_2$ of a set $O$ of objects:

$$d_S(\tau_1, \tau_2) = \sum_{i \in O} \left| R_{\tau_1}(i) - R_{\tau_2}(i) \right|,$$

where $R_{\tau_1}(i)$ is the rank of object $i$ in list $\tau_1$.
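The two distance measures can be illustrated with a minimal Python sketch. It is not taken from the TopKLists package [1]; the function names, the dictionary representation of a ranked list (object mapped to rank, 1 = top) and the default penalty value are assumptions made for illustration only. The sketch further assumes that an ordering counts as "missing" whenever one of the two objects has no rank in at least one list, and it does not handle ties.

```python
from itertools import combinations


def kendall_distance(rank1, rank2, p=0.5):
    """Kendall distance between two ranked lists (hypothetical helper).

    rank1, rank2: dicts mapping object -> rank (1 = top position).
    A pair {i, j} contributes 1 if the two lists order it differently,
    0 if they agree, and the penalty p (assumed default 0.5) if the
    ordering is unknown because an object is missing from a list.
    """
    objects = sorted(set(rank1) | set(rank2))
    total = 0.0
    for i, j in combinations(objects, 2):
        known = all(o in r for o in (i, j) for r in (rank1, rank2))
        if not known:
            total += p
        elif (rank1[i] - rank1[j]) * (rank2[i] - rank2[j]) < 0:
            total += 1.0
    return total


def spearman_footrule(rank1, rank2):
    """Spearman's footrule for two complete rankings of the same objects."""
    if set(rank1) != set(rank2):
        raise ValueError("both lists must rank the same set of objects")
    return sum(abs(rank1[i] - rank2[i]) for i in rank1)


# Toy example: tau2 swaps the first two and the last two objects of tau1.
tau1 = {"a": 1, "b": 2, "c": 3, "d": 4}
tau2 = {"a": 2, "b": 1, "c": 4, "d": 3}

print(kendall_distance(tau1, tau2))   # 2.0 (two discordant pairs)
print(spearman_footrule(tau1, tau2))  # 4
```

The pairwise loop is quadratic in the number of objects, which is acceptable for a toy example but would need a faster counting scheme for lists with thousands of genes.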
Modification for incomplete ranked lists: for two ranked lists $\tau_1$ and $\tau_2$ of two sets of objects $O_1$ and $O_2$, simply redefine the rank function so that an object missing from a list receives rank $N + 1$:

$$R_{\tau_1}(i) = R(i)\, I(i \in O_1) + (N + 1)\, I(i \notin O_1)$$
$$R_{\tau_2}(i) = R(i)\, I(i \in O_2) + (N + 1)\, I(i \notin O_2)$$

Step 1: Estimation of k

In applications with large $N$, consensus hardly prevails for the whole list. Here we assume reasonable conformity in the rankings for the first $k$ elements.

Selection of $\hat{k}^*$
• moderate deviation-based inference for random degeneration in pairs of ranked lists, see [3]
• assume Bernoulli random variables $I_1, \ldots, I_N$ defined such that, for the object $i$ at rank position $j$ in $\tau_1$:

$$I_j = \begin{cases} 1 & \text{if } \left| R_{\tau_1}(i) - R_{\tau_2}(i) \right| \le \delta \\ 0 & \text{otherwise} \end{cases}$$

• estimate the point of degeneration into noise, $j_0 \ge 2$, under the following assumptions:

$$P(I_j = 1) = \begin{cases} p_j \ge \tfrac{1}{2} & \text{for } 1 \le j \le j_0 - 2 \\ p_j > \tfrac{1}{2} & \text{for } j = j_0 - 1 \\ p_j = \tfrac{1}{2} & \text{for } j \ge j_0 \end{cases}$$

• a general decrease of the probability $p_j$ for increasing $j$ is assumed
• $\hat{k}^* = \max_l \hat{k}_l$

Step 2: Integration of ranked lists

Cross-entropy Monte Carlo (CEMC) approach for consolidation of top-k objects, see [4]
• assume a random matrix $X$ with elements from $\{0, 1\}$ and a corresponding probability matrix $p$
• given the probability mass function $P_p(x)$, any realization $x$ of $X$ uniquely determines the corresponding top-k list
• stochastic search for an ordering $x^*$ that corresponds to an optimal $\tau^*$ satisfying the optimization criterion

$$\tau^* = \arg\min_{\tau} \left\{ \sum_{j=1}^{l} w_j\, d(\tau_j, \tau) \right\}$$

with prespecified assessors' weights $w_j$ and distance measure $d(\tau_j, \tau)$
• iterative CEMC algorithm with two steps:
1. simulation step, in which random samples from $P_p(x)$ are drawn
2. updating step, which improves the samples so that they increasingly concentrate around $x^*$

References

[1] M. G. Schimek et al. TopKLists. R package, 2011.
[2] V. Popovici et al. Selecting control genes for RT-qPCR using public microarray data. BMC Bioinf., 2008.
[3] P. Hall and M. G. Schimek. Moderate deviation-based inference for random degeneration in paired rank lists.
Revised for J. Amer. Stat. Assoc., 2010.
[4] S. Lin and J. Ding. Integration of ranked lists via CEMC with applications to mRNA and microRNA studies. Biometrics, 65(1):9–18, 2009.
[5] M. G. Schimek, A. Myšičková, and E. Budinská. An inference and integration approach for ranked lists with applications in omics research. J. Stat. Plan. Inf., accepted for publication, 2010.

Simulation study

Gene expression data were simulated with the first k = 10 genes differentially expressed, N = 100.

Fig.1: Frequency of the first 15 objects of the simulated data for Kendall's τ (top) and Spearman's footrule (bottom)

Fig.2: Boxplots of estimated $\hat{k}$ for simulated data (N = 100) for δ = 28
[Figure 2: panel title "delta=28"; x-axis: pilot sample size ν; y-axis: estimated k]

Gene expression data

Differential expression profiles of 3 different cancer types (breast, prostate and colon) from [2] were analyzed with the goal of finding control genes for RT-PCR. The length of the ranked lists is N = 10,000 genes.

Tab.1: Estimated $\hat{k}^*$ values for selected combinations of δ (columns) and ν (rows) for the cancer data.

  ν \ δ      0     10     50    100    150    200    500   1000
     10     34     65    154    259    472    680   1365   2318
     20     68     76    154    631    631    901   2219   2340
     30     79     79    641    641    641   1364   2224   2789
     40     79     79    646    646    797   1366   2230   2795
     50    117    126    782    797    797   1366   3739   4310
    100    269    269    907    907   1183   1367   6222   6229
    150    310   1408   1408   1408   1408   1727   7309   8815
    200   1220   1450   1450   1450   1450   2394   7336   9301
    500   2387   2408   3446   3465   3471   4695   7988   9304

Fig.3: The plot displays the decrease of discordance of two lists as a function of increasing δ, for list L1 (breast) and L3 (colon). The first directional changes of the plotted curve occur for the δ values 10 and 200 (subplot). As the study goal is the identification of control genes, naturally a small set, the choice of δ should not exceed 200.

F1000 Posters. Copyright protected.