Automatic Quality Assessment of Affymetrix GeneChip Data

Steffen Heber
Department of Computer Science
1519 Partners II
NC State University, Raleigh, NC 27695-7566
sheber@ncsu.edu

Beate Sick
DNA Array Facility, Center for Integrative Genomics
University of Lausanne
Lausanne, Switzerland
Beate.Sick@unil.ch

ABSTRACT

Computing reliable gene expression levels from microarray experiments is a sophisticated process with many potential pitfalls. Quality control is one of the most important steps in this process. We present a web-based expert system for automatic quality assessment of Affymetrix GeneChip data. Our approach combines multiple quality metrics with supervised machine learning in order to identify data of low quality. Our system approximates expert opinion as represented in a knowledge base consisting of 41 microarray experiments with 352 CEL files annotated by a domain expert. GeneChips of low quality are detected automatically and can be excluded from subsequent analysis. This is especially important for large experiments and can also assist inexperienced users. Our expert system is fully implemented and integrated into a publicly available remote analysis computation engine for gene expression data.

Categories and Subject Descriptors
J.3 [Life and Medical Sciences]: biology and genetics; G.3 [Mathematics of Computing]: Probability and statistics—statistical computing, statistical software

General Terms
Algorithms, measurement, reliability

Keywords
Affymetrix GeneChip microarray experiments, automated quality assessment, knowledge based expert system

1. INTRODUCTION

Affymetrix GeneChip microarrays are among the most widely used research tools in modern genetics.
Applications include the determination of genome-wide expression levels [4], genomic re-sequencing [13], genotyping [14], and transcript discovery [18]. Generating gene expression measures by using a microarray platform like Affymetrix GeneChips is a sophisticated and time-consuming process with many potential sources of variation. Before applying higher-level analysis, an initial examination can give evidence for the presence of quality problems. Utilizing robust statistical methods often allows users to include chips with small artifacts in the analysis. However, in some cases arrays are beyond correction, and removing the defective arrays from the data set is warranted. Due to the huge amount of data, and the various preprocessing steps required to produce the expression measure for each array, quality assessment is a challenging task. Often the advice of an expert is required. Several excellent statistical software tools like R [10] in conjunction with BioConductor [8] or the Statistical Analysis System (SAS) [17] allow users to investigate different quality parameters. Part of this quality analysis can be performed by web-tools like SmudgeMiner (http://www.discover.nci.nih.gov/affytools), a tool which assesses the extent of regional biases and some other microarray artifacts.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ACM SE'06 March 10-12, 2006, Melbourne, Florida, USA. Copyright 2006 ACM 1-59593-315-8/06/0004 ...$5.00.
In the context of the bioinformatic service within the DNA array facility of the University of Lausanne (DAFL) we have performed quality checks on several hundred Affymetrix arrays and gained extensive experience with many different quality measures. We developed a standardized procedure to assess the quality of individual chips within an experiment. This procedure involves the collection, visualization, and interpretation of a defined set of quality measures (see below for a detailed description). Unfortunately, these quality measures are often not easy to interpret, and at this point many experimentalists request advice, which should be based on expert knowledge. Despite this clearly visible need, to the best of our knowledge no tool exists that provides an experimentalist with an expert-system-based quality judgement of the array data. Our goal is to present a computational model of microarray quality that approximates expert opinion, and a corresponding web tool that performs automated quality judgement for Affymetrix GeneChip data within the context of multi-chip experiments. Our tool is publicly available at http://race.unil.ch/ (tab "QualityJudgment"). It is fully integrated into RACE [16], an existing remote analysis computation engine for gene expression data. Users can sub-