Research Article
Received 2 October 2012, Accepted 26 April 2013 Published online 29 May 2013 in Wiley Online Library
(wileyonlinelibrary.com) DOI: 10.1002/sim.5855
Graphical tools for model selection in generalized linear models
K. Murray (a, b, *, †), S. Heritier (c) and S. Müller (a)
Model selection techniques have existed for many years; however, to date, simple, clear and effective methods
of visualising the model building process are sparse. This article describes graphical methods that assist in the
selection of models and comparison of many different selection criteria. Specifically, we describe, for logistic
regression, how to visualize measures of description loss and of model complexity to help resolve the model
selection dilemma. We advocate the use of the bootstrap to assess the stability of selected models and to enhance our
graphical tools. We demonstrate which variables are important using variable inclusion plots and show that these
can be invaluable plots for the model building process. We show with two case studies how these proposed tools
are useful to learn more about important variables in the data and how these tools can assist the understanding
of the model building process. Copyright © 2013 John Wiley & Sons, Ltd.
Keywords: model selection curves; Akaike information criterion; graphical methods; Bayesian information
criterion; variable selection; model selection; generalized linear models
1. Introduction
Many medical problems involve the collection of data with multiple potential predictor variables. In
analysing the data, one usually engages in a process of model building, of which a crucial part is to
determine one or more appropriate models. For a general introduction to and overview of the topic of
model building, we refer to [1]. One of the most commonly used techniques for model selection, which
is probably the least advocated by statisticians, is a ‘hypothesis test/P-value’ stepwise approach, using
either forward selection or backward selection or a combination of the two. These approaches have
been shown to be inefficient in many situations, and have particular issues such as multiple testing and
localization of solutions. For many models, including the vast array of generalized linear models, the
information theoretic approach and the use of the log-likelihood to compare models are widespread for
model selection purposes. For this reason, our article focuses on measuring the descriptive
ability of a model via the log-likelihood, but our ideas extend directly to using other loss functions.
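As a minimal sketch of how the log-likelihood can be used to compare candidate logistic regression models, the following Python code fits nested models by Newton-Raphson on simulated data and reports a penalised criterion of the form -2 x log-likelihood + penalty x number of parameters. Setting the penalty to 2 gives the AIC, and setting it to log(n) gives the BIC. The simulated data, the variable names x1 and x2, and all function names are illustrative assumptions and are not taken from the case studies in this article.

```python
import numpy as np


def fit_logistic(X, y, n_iter=50, tol=1e-10):
    """Fit a logistic regression by Newton-Raphson.

    X is the design matrix and must already contain an intercept column.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta
        mu = 1.0 / (1.0 + np.exp(-eta))      # fitted probabilities
        w = mu * (1.0 - mu)                  # variance weights
        grad = X.T @ (y - mu)                # score vector
        hess = X.T @ (X * w[:, None])        # observed information
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta


def log_likelihood(X, y, beta):
    """Bernoulli log-likelihood, using logaddexp for numerical stability."""
    eta = X @ beta
    return float(np.sum(y * eta - np.logaddexp(0.0, eta)))


def penalised_criterion(loglik, n_params, penalty):
    """-2 * log-likelihood + penalty * number of parameters."""
    return -2.0 * loglik + penalty * n_params


def simulate(seed=0, n=200):
    """Hypothetical data: x1 affects the outcome, x2 is pure noise."""
    rng = np.random.default_rng(seed)
    x1 = rng.normal(size=n)
    x2 = rng.normal(size=n)
    prob = 1.0 / (1.0 + np.exp(-(-0.5 + 1.0 * x1)))
    y = (rng.uniform(size=n) < prob).astype(float)
    return x1, x2, y


if __name__ == "__main__":
    x1, x2, y = simulate()
    n = len(y)
    ones = np.ones(n)
    candidates = {
        "intercept only": np.column_stack([ones]),
        "x1": np.column_stack([ones, x1]),
        "x1 + x2": np.column_stack([ones, x1, x2]),
    }
    for name, X in candidates.items():
        ll = log_likelihood(X, y, fit_logistic(X, y))
        p = X.shape[1]
        aic = penalised_criterion(ll, p, 2.0)
        bic = penalised_criterion(ll, p, np.log(n))
        print(f"{name:15s} loglik={ll:9.3f} AIC={aic:9.3f} BIC={bic:9.3f}")
```

Because the candidate models are nested, the maximised log-likelihood cannot decrease as variables are added; the penalty term is what allows a smaller model to be preferred.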
To date, in medical research, data analysts have used many different techniques to select models for
prediction purposes. In most instances, only one final model is presented, and how such a model is
reached is typically not sufficiently described in the statistical methods section of research articles. Reasons
for this include space restrictions and a shortage of simple graphical tools that can be shown to
explain the reasoning behind the final model. Consequently, future researchers have difficulty obtaining
a clear understanding of available model selection techniques in current medical research, what these
techniques are really doing, and how to replicate and adapt such techniques.
When using an information theoretic approach to model selection, a somewhat controversial and much
debated question is whether to select a model using the AIC [2] or the BIC [3]. The purpose of the analysis
drives the model selection. Often, a separation is made between the purposes to describe the data
well and to obtain a model that has good predictive qualities. A major difference between AIC and BIC
a School of Mathematics and Statistics, University of Sydney, Carslaw Building (F07), NSW 2006, Australia
b Centre for Applied Statistics (M019), University of Western Australia, 35 Stirling Highway, Crawley, WA 6009, Australia
c The George Institute for Global Health, University of Sydney, Sydney, NSW 2050, Australia
* Correspondence to: K. Murray, School of Mathematics and Statistics, University of Sydney, Carslaw Building (F07), NSW 2006, Australia.
† E-mail: kevin.murray@uwa.edu.au
Copyright © 2013 John Wiley & Sons, Ltd. Statist. Med. 2013, 32 4438–4451