Dimensionality reduction in drought modelling

João Filipe Santos,1* Maria Manuela Portela2 and Inmaculada Pulido-Calvo3

1 Departamento Engenharia, ESTIG, Instituto Politécnico de Beja, Rua Afonso III, 7800-050 Beja, Portugal
2 Departamento Engenharia Civil, SHRH, Instituto Superior Técnico (Lisboa), Portugal, Avda. Rovisco Pais, 1049-001 Lisboa, Portugal
3 Departamento Ciencias Agroforestales, Escuela Técnica Superior de Ingeniería, Campus La Rábida, Universidad de Huelva, 21819 Palos de la Frontera, Huelva, Spain

Abstract: For monitoring hydrological events characterized by high spatial and temporal variability, the number and location of recording stations must be carefully selected to ensure that the necessary information is collected. Depending on the characteristics of each natural process, certain stations may be spurious or redundant, whereas others may provide most of the relevant data. With the objective of reducing the costs of the monitoring system and, at the same time, improving its operational effectiveness, three procedures were applied to identify the minimum network of rain gauge stations able to capture the characteristics of droughts in mainland Portugal. Drought severity is characterized by the standardized precipitation index applied to the timescales of 1, 3, 6 and 12 consecutive months. The three techniques used to reduce the dimensionality of the network of rain gauges were as follows: (i) artificial neural networks with sensitivity analysis, (ii) application of the mutual information criterion and (iii) K-means cluster analysis using Euclidean distances. The results demonstrated that the best dimensionality reduction method was case dependent in the three regions of Portugal (northern, central and southern) previously identified by cluster analysis. All the reduction techniques lead to the selection of a subset of rain gauges capable of reproducing the original temporal patterns of drought.
For specific severe drought events in Portugal in the past, the comparison between drought spatial patterns obtained with the original stations and the selected subset indicated that the subset produced statistically satisfactory results (correlation coefficients higher than 0.6 and efficiency coefficients higher than 0.5).

Copyright © 2012 John Wiley & Sons, Ltd.

KEY WORDS rain gauge network; drought monitoring; standardized precipitation index; mutual information; sensitivity analysis; artificial neural network; Portugal

Received 27 October 2011; Accepted 2 March 2012

INTRODUCTION

Drought is a recurrent natural phenomenon that can lead to a severe reduction in the availability of fresh water for a certain period and can affect wide areas. Although the phenomenon is complex, attempts are made to characterize it, generally by calculating drought indices derived from meteorological and hydrological records. These indices provide information about historical droughts and can therefore also be used to monitor current conditions. According to Tsakiris (2008), such indices are useful for planning and management purposes as they provide standardized measures of the deviation of the water availability from normal conditions. Among these drought indices, the most widely used is the standardized precipitation index (SPI), which is based on precipitation data, as implied by the name (McKee et al. 1993; Santos et al. 2010).

Various authors, such as Bonaccorso et al. (2003), have emphasized that the monitoring of meteorological and hydrological events characterized by high spatial and temporal variability, such as droughts, requires careful selection of the optimal number of gauge stations able to describe the phenomenon within the area under study. One of the most widely used criteria for network design is based on geostatistical techniques (Bastin et al. 1984; Bogardi and Bardossy 1985; Pardo-Igúzquiza 1998, and more recently Chen et al.
2008), with the aim of being able to improve the operational performance of a monitoring network with fewer gauges by selecting just the most important stations. In the case of drought monitoring, to achieve effective modelling of the phenomenon, most common methods still depend on a relatively large network of meteorological stations with long time series.

For forecasting purposes, the size and the quality of the input network of a model are crucial, as reported by many authors including Murphy (1991), Zheng and Billings (1996), Maier and Dandy (2000), Back and Trappenberg (2001) and, more recently, Guest and Smith-Genut (2010). If relevant inputs (independent variables) are omitted, the model cannot fully capture the input–output pattern (i.e. the model is underspecified). On the other hand, if the model includes redundant or unnecessary inputs (i.e. the model is overspecified), one or more of the following may occur: (i) the size, computational complexity and memory requirements of the model increase; (ii) the calibration of the model becomes more difficult due to an increase in the size of the search space and the greater number of local optima; (iii) the interpretation of the physical meaning of results from calibrated models becomes more difficult; and (iv) more data are needed to efficiently estimate the optimal values of the model parameters.

*Correspondence to: João Filipe Santos, Departamento Engenharia, ESTIG, Instituto Politécnico de Beja, Rua Afonso III, 7800-050 Beja, Portugal. E-mail: joaof.santos@estig.ipbeja.pt

HYDROLOGICAL PROCESSES, Hydrol. Process. 27, 1399–1410 (2013). Published online 17 April 2012 in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/hyp.9300
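
The notion of a redundant input in an overspecified model can be illustrated with a minimal sketch. It is not the procedure applied in this paper: the data are synthetic, the three gauges (`station_a`, `station_b`, `station_c`) are hypothetical, and `mutual_information` is a simple histogram (plug-in) estimator written for illustration only. Under those assumptions, a gauge that nearly duplicates another shares far more information with it than an unrelated gauge does.

```python
import math
import random
from collections import Counter


def mutual_information(x, y, bins=8):
    """Plug-in (histogram) estimate of mutual information, in nats."""
    if len(x) != len(y):
        raise ValueError("series must have the same length")

    def discretize(values):
        # Equal-width binning over the observed range of the series.
        lo, hi = min(values), max(values)
        width = (hi - lo) / bins or 1.0  # avoid zero width for constant series
        return [min(int((v - lo) / width), bins - 1) for v in values]

    bx, by = discretize(x), discretize(y)
    n = len(bx)
    joint = Counter(zip(bx, by))      # joint bin counts
    px, py = Counter(bx), Counter(by)  # marginal bin counts
    return sum((c / n) * math.log(c * n / (px[i] * py[j]))
               for (i, j), c in joint.items())


# Synthetic, purely illustrative index series for three hypothetical gauges.
rng = random.Random(42)
n_months = 600
station_a = [rng.gauss(0.0, 1.0) for _ in range(n_months)]
station_b = [a + rng.gauss(0.0, 0.1) for a in station_a]    # near-copy of A
station_c = [rng.gauss(0.0, 1.0) for _ in range(n_months)]  # unrelated to A

mi_ab = mutual_information(station_a, station_b)
mi_ac = mutual_information(station_a, station_c)
print(f"MI(A, B) = {mi_ab:.3f} nats, MI(A, C) = {mi_ac:.3f} nats")
```

In this sketch, gauge B would be a candidate for removal once A is retained, whereas C contributes information that A does not. The plug-in estimator is biased upward for short records, so values close to zero should be read with caution.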