Improved and automated prediction of effective siRNA

Alistair M. Chalk,* Claes Wahlestedt, and Erik L.L. Sonnhammer

Center for Genomics and Bioinformatics, Karolinska Institutet, Berzelius väg 35, S-171 77 Stockholm, Sweden

Received 26 April 2004
Available online

Abstract

Short interfering RNAs are used in functional genomics studies to knock down a single gene in a reversible manner. The results of siRNA experiments are highly dependent on the choice of siRNA sequence. In order to evaluate siRNA design rules, we collected a database of 398 siRNAs of known efficacy from 92 genes. We used this database to evaluate previously proposed rules from smaller datasets, and to find a new set of rules that are optimal for the entire database. We also trained a regression tree with full cross-validation. It was, however, difficult to obtain the same precision as methods previously tested on small datasets from one or two genes. We show that those methods are overfitting, as they work poorly on independent validation datasets from multiple genes. Our new design rules can predict siRNAs with efficacy ≥ 50% in 91% of cases, and with efficacy ≥ 90% in 52% of cases, which is more than a twofold improvement over random selection. Software for designing siRNAs is available online via a web server at http://sisearch.cgb.ki.se/ or as a standalone version for high-throughput applications.
© 2004 Elsevier Inc. All rights reserved.

RNAi is a recently discovered biological phenomenon whereby a single gene may be inhibited at the RNA stage of synthesis. Short interfering RNAs (siRNAs) are duplexes of two RNA molecules, typically 21-mers with a 2-nt 3′ overhang [1]. One strand is loaded into the RISC complex [2], after which a sequence-specific cleavage of the target takes place. The strength of the principle behind any form of RNA targeting (siRNA and antisense) is that the molecule can be used to inhibit expression of any mRNA, and thus the protein it encodes.
This effect can be demonstrated without affecting related proteins, making it an invaluable tool for functional genomics. siRNAs have been found to be effective in Arabidopsis thaliana, Drosophila melanogaster, Caenorhabditis elegans, and mammals [3]. Many reviews of the biological processes behind siRNA inhibition exist; see, for example, [3,4].

Efficient, reliable design of high-efficacy siRNA molecules is essential to meet the needs of cost-effective, high-throughput functional genomics projects. To meet these needs, the siRNAs designed should at least conform to the following criteria: (a) be predicted with high accuracy, (b) be sequence specific, and (c) be produced in a form that facilitates high-throughput production. In this paper we address these criteria, focusing primarily on (a) and (b).

Despite the apparent ease of designing siRNAs (compared to a popular DNA-based knockdown technique, antisense oligonucleotides (AOs)), a number of problems still remain. Randomly selected siRNAs produce knockdown ≥ 50% with a 58–78% success rate, while very effective siRNAs (≥ 90/95%) are found by chance 11–18% of the time [5,6].

Initial design paradigms for siRNAs were based on motif rules, such as AAN(19)TT, with 3′ TT overhangs on both strands. These evolved into motifs to match siRNAs with strong binding at one end (the sense 5′ end). Recent studies have demonstrated that binding energy is a key factor in siRNA design. The more sophisticated methods compare binding energies along the siRNA molecule. In brief, findings indicate that the relative and absolute binding energies of the 5′ antisense and 5′ sense strands determine which strand enters the RISC complex [7]. The energy values for positions 9–15 have also been shown to be important in siRNA design [6]. Other studies have indicated that low GC content is correlated with high efficacy [8,9].

* Corresponding author. E-mail address: alistair.chalk@cgb.ki.se (A.M. Chalk).
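The AAN(19)TT motif rule and the GC-content observation from the introduction can be illustrated with a minimal Python sketch. The function names (`find_motif_sites`, `gc_fraction`) are hypothetical and chosen only for this illustration. The sketch is not the software described in this paper, and it assumes the target is supplied as a plain nucleotide string (U is converted to T before matching).

```python
import re

# AA, then 19 arbitrary bases, then TT: the AAN(19)TT motif from the text.
# The lookahead wrapper lets overlapping candidate sites be reported.
MOTIF = re.compile(r"(?=(AA[ACGT]{19}TT))")


def find_motif_sites(target):
    """Return (start, 23-mer) pairs for every AAN(19)TT match in target.

    Hypothetical helper: start is a 0-based index into the target string.
    """
    seq = target.upper().replace("U", "T")
    return [(m.start(), m.group(1)) for m in MOTIF.finditer(seq)]


def gc_fraction(seq):
    """Fraction of G and C bases in seq (assumes a non-empty string)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


if __name__ == "__main__":
    # Toy target sequence, made up for the example (not real data).
    toy = "CC" + "AA" + "G" * 19 + "TT" + "CC"
    for start, site in find_motif_sites(toy):
        print(start, site, round(gc_fraction(site), 2))
```

A motif scan of this kind only enumerates candidate sites; it does not score binding energies, which the more recent methods discussed above treat as a key factor.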
0006-291X/$ - see front matter © 2004 Elsevier Inc. All rights reserved. doi:10.1016/j.bbrc.2004.04.181
Biochemical and Biophysical Research Communications 319 (2004) 264–274
www.elsevier.com/locate/ybbrc