Article

fsdaSAS: A Package for Robust Regression for Very Large Datasets Including the Batch Forward Search

Francesca Torti 1,*, Aldo Corbellini 2 and Anthony C. Atkinson 3

1 European Commission, Joint Research Centre (JRC), 21027 Ispra, Italy
2 Department of Economics and Management, University of Parma, 43125 Parma, Italy; aldo.corbellini@unipr.it
3 Department of Statistics, The London School of Economics, London WC2A 2AE, UK; A.C.Atkinson@lse.ac.uk
* Correspondence: francesca.torti@ec.europa.eu

Citation: Torti, F.; Corbellini, A.; Atkinson, C.A. fsdaSAS: A Package for Robust Regression for Very Large Datasets Including the Batch Forward Search. Stats 2021, 4, 327–347. https://doi.org/10.3390/stats4020022

Academic Editors: Paulo Canas Rodrigues and Wei Zhu

Received: 12 March 2021
Accepted: 14 April 2021
Published: 18 April 2021

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Copyright: © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Abstract: The forward search (FS) is a general method of robust data fitting that moves smoothly from very robust to maximum likelihood estimation. The regression procedures are included in the MATLAB toolbox FSDA. Work on a SAS version of the FS originated from the need, expressed by law enforcement services operating in the European Union, to analyse large datasets; these services use our SAS software to detect data anomalies that may point to fraudulent customs returns. Specific to our SAS implementation, the fsdaSAS package, we describe the approximation used to provide fast analyses of large datasets using an FS that progresses through the inclusion of batches of observations, rather than progressing one observation at a time.
We do, however, test for outliers one observation at a time. We demonstrate that our SAS implementation becomes appreciably faster than the MATLAB version as the sample size increases and is also able to analyse larger datasets. The series of fits provided by the FS leads to the adaptive data-dependent choice of maximally efficient robust estimates. This also allows the monitoring of residuals and parameter estimates for fits of differing robustness levels. The fsdaSAS package also applies the idea of monitoring to several robust estimators for regression for a range of values of breakdown point or nominal efficiency, leading to adaptive values for these parameters. We have also provided a variety of plots linked through brushing. Further programmed analyses include the robust transformations of the response in regression. Our package also provides the SAS community with methods of monitoring robust estimators for multivariate data, including multivariate data transformations.

Keywords: approximate analysis; big data; linked plots; monitoring; robust regression

1. Introduction

Data frequently contain outlying observations, which need to be recognised and perhaps modelled. In regression, recognition can be made difficult when the presence of several outliers leads to “masking”, in which the outliers are not evident from a least squares fit. Robust methods are therefore necessary. This paper is concerned with the robust regression modelling of large datasets; our major example contains 44,140 univariate observations and five explanatory variables. We use the forward search (FS), which provides a general method of robust data fitting that moves smoothly from very robust to maximum likelihood estimation. Many robust procedures using the FS are included in the MATLAB toolbox FSDA [1,2].
The core of the method is a series of fits to the data for subsets of m observations, with m, incremented in steps of one, going from very small to being equal to n, the total number of observations. As we show in Section 6, the procedure becomes appreciably slower as n increases. The performance of the MATLAB version is further slowed by the language’s handling of large files. In this paper, we present two enhancements of FS regression for large datasets:

1. The Batch Forward Search. Instead of incrementing the subset used in fitting by one observation we move from a subset of size m to one of size m + k. In our example,