doi: 10.1111/j.1469-1809.2011.00670.x

Using Penalised Logistic Regression to Fine Map HLA Variants for Rheumatoid Arthritis

Charlotte M. Vignal 1,2, Aruna T. Bansal 3, and David J. Balding 4

1 Department of Epidemiology and Public Health, Imperial College London, UK
2 GlaxoSmithKline, Brentford, Middlesex, TW8 9GS, UK
3 Acclarogen Limited, St John’s Innovation Centre, Cambridge, CB4 0WS, UK
4 Institute of Genetics, University College London, UK

Summary

Rheumatoid arthritis (RA) is strongly associated with the human leukocyte antigen (HLA) genomic region, most notably with a group of HLA-DRB1 alleles termed the shared epitope (SE). There is also substantial evidence of other risk loci in the HLA region, but refinement has been hampered by extensive linkage disequilibrium (LD). Using genotype imputation, we analysed 6575 RA cases and controls with genotypes at 6180 HLA SNPs; about half the subjects had four-digit DRB1 genotypes. Single-SNP tests revealed hundreds of strong associations across the HLA region, even after adjusting for DRB1. We implemented penalised logistic regression in a multi-SNP association analysis, using either the double-exponential (DE) or the normal-exponential-gamma (NEG) penalty term on the regression coefficients. The penalised approaches identified sparse sets of SNPs that could collectively explain most of the association with RA over the whole HLA region. The HLA-DPB1 SNP rs3117225 was consistently identified in our analyses and was confirmed by results from the North American Rheumatoid Arthritis Consortium (NARAC) study. We conclude that SNP selection using penalised regression shows a substantial benefit over single-SNP analyses in identifying risk loci in regions of high LD, and the flexibility of the NEG conveys additional advantages.
Keywords: HLA region, penalised logistic regression, LASSO, hyperlasso, rheumatoid arthritis

Corresponding author: Aruna T. Bansal, Acclarogen Ltd, St John’s Innovation Centre, Cowley Road, Cambridge, CB4 0WS, UK. Tel: (+44) 1223 421 662; Fax: (+44) 1223 420 844; E-mail: aruna.t.bansal@acclarogen.com

Introduction

Rheumatoid arthritis (RA) is a complex autoimmune disorder that affects approximately 1% of the population worldwide (Silman & Pearson, 2002). The cause of RA has not been established, but both environmental and genetic factors appear to play important roles (Firestein, 2003; Oliver & Silman, 2006). The HLA-DRB1 gene within the class II human leukocyte antigen (HLA) region of the genome is strongly associated with RA susceptibility. A group of HLA-DRB1 alleles, known collectively as the shared epitope (SE), encode a similar amino acid sequence located in the peptide-binding groove of the protein (Gregersen et al., 1987). In our previous investigation of the Genetics of Rheumatoid Arthritis (GoRA) study (Vignal et al., 2009), we found that the addition of HLA SNPs in a backwards stepwise regression led to a significantly better model than that based only on the number of SE+ alleles. However, there has also been considerable debate about whether a binary classification SE+/SE- is sufficient to describe the RA risk for all DRB1 alleles. There are many strong HLA associations outside DRB1 (Vignal et al., 2009; Jawaheer et al., 2002; Kilding et al., 2004; Newton et al., 2003; Ota et al., 2001; Singal et al., 1999; Zanelli et al., 2001), but extensive linkage disequilibrium (LD) in the HLA region has retarded progress in fine-mapping the causal variants underlying these signals. Multi-SNP analyses can help to disentangle the effects of LD and identify a set of distinct contributors to disease risk.
Forward and backward stepwise regressions are traditional model selection methods but are unsatisfactory in the presence of many correlated predictors, due to computational demands and instability of the selected model. Penalised regression provides a better approach to overcoming the problems of over-fitting and of multiple correlated predictors (Ayers & Cordell, 2010). The penalty term imposed on the likelihood can have the effect of shrinking the coefficient estimates towards zero, relative to their maximum likelihood values, reducing over-fitting. Moreover, with an appropriate penalty term, penalised regression can reduce the problem of

Annals of Human Genetics (2011) 75, 655–664. © 2011 The Authors. Annals of Human Genetics © 2011 Blackwell Publishing Ltd/University College London