Spatial Dependency Modeling Using Spatial Auto-Regression *

Mete Celik 1, Baris M. Kazar 2, Shashi Shekhar 1, Daniel Boley 1, David J. Lilja 3

Abstract

Parameter estimation for the spatial auto-regression (SAR) model is important because it lets us model spatial dependency, i.e., the spatial autocorrelation present in geo-spatial data. SAR is a popular data mining technique used in many geo-spatial application domains such as regional economics, ecology, environmental management, public safety, public health, transportation, and business. However, it is computationally expensive because estimation under Maximum Likelihood Theory (ML) requires computing the logarithm of the determinant of a large matrix. Current approaches are computationally expensive, memory-intensive, and not scalable. In this paper, we propose a new ML-based approximate SAR model solution based on the Gauss-Lanczos algorithm and compare it with two other ML-based approximate SAR model solutions, namely Taylor's series and Chebyshev polynomials. We also rank these methods algebraically. Experiments show that the proposed algorithm gives better results than the related approaches when the data is strongly correlated and the problem size is large.

Keywords: Spatial Auto-Regression Model, Spatial Dependency Modeling, Spatial Autocorrelation, Maximum Likelihood Theory, Gauss-Lanczos Method.

1. Introduction

Extracting useful and interesting patterns from massive geo-spatial datasets is important for many application domains, including regional economics, ecology, environmental management, public safety, public health, transportation, and business [3, 15, 17]. Many classical data mining algorithms, such as linear regression, assume that the learning samples are independently and identically distributed (i.i.d.).
This assumption is violated in the case of spatial data due to spatial autocorrelation [15]. In such cases, classical linear regression yields a weak model with low prediction accuracy [17] and residual errors that exhibit spatial dependence. Modeling spatial dependencies improves overall classification and prediction accuracy. The spatial auto-regression (SAR) model is a generalization of linear regression that addresses these concerns.

However, estimating the SAR model parameters is computationally very expensive because of the need to compute the logarithm of the determinant (log-det) of a large matrix. For example, it can take an hour of computation for a spatial dataset with 10K observation points on a single IBM Regatta processor using a 1.3 GHz pSeries 690 Power4 architecture with 3.2 GB memory. This has limited the use of SAR to small problem sizes, despite its promise to improve classification and prediction accuracy.

ML-based SAR model solutions [1, 5] can be classified into exact [6, 8, 11-13] and approximate [7, 10, 16] solutions, based on how they compute the compute-intensive log-det term in the SAR solution procedure. Exact solutions suffer from high computational complexity and memory requirements because they compute all the eigenvalues of a large matrix. Approximate SAR model solutions instead approximate this computationally complex term, which reduces computation time and yields computationally feasible and scalable SAR model solutions. This study covers only ML-based approximate SAR model solutions. However, we also include an exact solution in our experiments for comparison purposes.

In this paper, we propose a new ML-based approximate SAR solution, and we compare and algebraically rank approximate ML-based SAR model solutions.
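To make the exact and approximate routes described above concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes the standard SAR form in which the log-det term is ln|I - rho W|, where W is a row-normalized neighborhood matrix and rho is the spatial dependency parameter; neither symbol has been defined at this point in the text. The helper names (`grid_neighbors`, `logdet_exact`, `logdet_taylor`) and the 4-neighbor grid layout are hypothetical choices for illustration. The exact route uses all eigenvalues of W, and the approximate route is a generic truncated Taylor series of the log-det, which may differ in detail from the formulations in the cited works.

```python
import numpy as np


def grid_neighbors(rows, cols):
    """Row-normalized 4-neighbor matrix W for a rows x cols grid (assumed layout)."""
    n = rows * cols
    A = np.zeros((n, n))
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    A[i, rr * cols + cc] = 1.0
    # Each row sums to 1 after normalization.
    return A / A.sum(axis=1, keepdims=True)


def logdet_exact(W, rho):
    """Exact route: ln|I - rho W| = sum_i ln(1 - rho * lambda_i).

    Requires all n eigenvalues of W, which is the costly step for large n.
    """
    lam = np.linalg.eigvals(W)
    return float(np.sum(np.log(1.0 - rho * lam)).real)


def logdet_taylor(W, rho, terms):
    """Approximate route: truncated series -sum_{k=1..terms} rho^k tr(W^k) / k.

    Valid for |rho| < 1 when W is row-normalized. Traces are computed from
    explicit matrix powers here, which is only sensible for small examples.
    """
    total = 0.0
    P = np.eye(W.shape[0])
    for k in range(1, terms + 1):
        P = P @ W
        total -= rho**k * np.trace(P) / k
    return total


if __name__ == "__main__":
    W = grid_neighbors(4, 4)
    for rho in (0.3, 0.9):
        exact = logdet_exact(W, rho)
        approx = logdet_taylor(W, rho, terms=10)
        print(f"rho={rho}: exact={exact:.6f}, taylor(10)={approx:.6f}")
```

With a fixed number of series terms, the truncation error in this sketch grows as rho approaches 1, which illustrates why strongly correlated data is the harder case for series-based approximations.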
In contrast to the related approximate SAR model solutions, our algorithm provides a better approximation when the data is strongly correlated (i.e., spatial dependency is high) and the problem size is large. The key idea of the proposed algorithm is to find only some of the eigenvalues of a large

* This work was partially supported by the Army High Performance Computing Research Center (AHPCRC) under the auspices of the Department of the Army, Army Research Laboratory (ARL) under contract number DAAD19-01-2-0014, NSF grant IIS-0208621, and NSF grant IIS-0534286. This work received additional support from the University of Minnesota Digital Technology Center and the Minnesota Supercomputing Institute.
1. Computer Science Department, University of Minnesota, MN, USA, {mcelik, shekhar, boley}@cs.umn.edu
2. Oracle Corporation, USA, baris.kazar@oracle.com
3. Electrical and Computer Engineering Department, University of Minnesota, MN, USA, lilja@ece.umn.edu