IEEE WIRELESS COMMUNICATIONS LETTERS, ACCEPTED FOR PUBLICATION

Validation of a Probabilistic Approach to Outdoor Localization

Kejiong Li, John Bigham, and Laurissa Tokarchuk

Abstract—The validation of a probabilistic fingerprinting approach for outdoor location estimation using received signal strength (RSS) from GSM base stations (BSs) is described. The proposed approach is compared with a traditional probabilistic algorithm for three different area partitioning methods. Two contrasting real environments are used for the comparisons: one is a city environment and the other is a rural setting. In each test bed, over 9000 data points were collected, covering areas of 170,000 and 110,000 square meters, respectively. For each environment, principal component analysis (PCA) is first applied globally to remove the least useful transmitters and so avoid unnecessary calculations. Each environment is then partitioned into clusters based on RSS, and PCA is applied again within each cluster. The proposed scheme retains accuracy by preserving the substantial RSS correlations within each cluster, while also accommodating the different RSS distribution of each cluster. The experimental results show that the positioning accuracy is significantly improved and that our clustering scheme gives good support for location estimation.

Index Terms—Clustering, fingerprinting, outdoor localization, principal component analysis, probability.

I. INTRODUCTION

Received signal strength (RSS)-based fingerprinting localization is the most widely used technique for positioning. This is because: a) the data required to create the RSS database is readily collected from indoors; b) it performs relatively well in non-line-of-sight circumstances; c) it requires neither the extra battery resources of GPS nor the computational resources of triangulation methods.
RSS-based fingerprinting localization typically involves two phases: the training phase, where RSS is measured at known locations to form a location fingerprint database (a.k.a. radio map) based on different partitioning models; and the online phase, where the geographical coordinates of an observed RSS tuple are estimated using the radio map. The radio map may be partitioned in different ways. Grid partitioning [1] divides the environment into a regular grid and then attempts to map the location of a mobile station (MS) to a point in a grid element. Reducing the size of a grid element improves the accuracy of the position estimates but increases the site-survey costs. Some location-aware applications [1] (mainly indoor ones) do not use a regular grid but a topographical model, in which the environment is divided into cells corresponding to different office rooms or hallway segments.

In order to automatically link the partitioning with topography, several cluster-based location estimation methods have been proposed recently. Again, most studies obtain positive results only in indoor environments, e.g. IEEE 802.11b wireless LAN networks [2]. All of these clustering tools partition the environment into regions that are more homogeneously covered by radio signals. In our work, we have developed a clustering approach that can represent topographical features, and its improved accuracy has been verified on outdoor data sets [3]. (Similar improvements have been found for indoor locations, but are not discussed here.) Location estimation models are then fitted to each cluster.

Manuscript received September 13, 2012. The associate editor coordinating the review of this letter and approving it for publication was A. Bletsas.
The authors are with the School of Electronic Engineering and Computer Science, Queen Mary, University of London, UK (e-mail: john.bigham@eecs.qmul.ac.uk).
Digital Object Identifier 10.1109/WCL.2012.122612.120666
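The training and online phases with grid partitioning, as described above, can be sketched in a few lines. This is only a minimal illustration under stated assumptions, not the method of this letter: it matches fingerprints deterministically by nearest neighbour, and the function names (`grid_cell`, `build_radio_map`, `locate`), the 10 m cell size, and the survey values are all hypothetical.

```python
from collections import defaultdict
from math import dist, floor


def grid_cell(x, y, cell_size):
    """Map a coordinate in metres to the (column, row) index of its grid element."""
    return (floor(x / cell_size), floor(y / cell_size))


def build_radio_map(samples, cell_size):
    """Training phase: average the RSS tuples observed in each grid element.

    samples: iterable of ((x, y), rss_tuple) pairs measured at known locations.
    Returns a dict {cell: mean RSS tuple}.
    """
    by_cell = defaultdict(list)
    for (x, y), rss in samples:
        by_cell[grid_cell(x, y, cell_size)].append(rss)
    return {
        cell: tuple(sum(col) / len(col) for col in zip(*tuples))
        for cell, tuples in by_cell.items()
    }


def locate(radio_map, observed_rss, cell_size):
    """Online phase: return the centre of the cell with the closest fingerprint."""
    cell = min(radio_map, key=lambda c: dist(radio_map[c], observed_rss))
    return ((cell[0] + 0.5) * cell_size, (cell[1] + 0.5) * cell_size)


# Hypothetical survey: two base stations, RSS in dBm, positions in metres.
survey = [
    ((5.0, 5.0), (-60.0, -90.0)),
    ((8.0, 2.0), (-62.0, -88.0)),
    ((25.0, 5.0), (-80.0, -70.0)),
    ((28.0, 8.0), (-82.0, -68.0)),
]
radio_map = build_radio_map(survey, cell_size=10.0)
estimate = locate(radio_map, (-79.0, -71.0), cell_size=10.0)
```

The estimate can only ever be a cell centre, which makes the trade-off noted above concrete: a smaller `cell_size` gives finer position estimates, but every additional cell needs its own survey measurements.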
Fingerprinting can be implemented either deterministically or probabilistically. This letter uses a probabilistic approach; a deterministic approach is described in [3]. Due to the complex propagation environment, a nonparametric approach, such as the construction of an RSS histogram from the training data or a kernel density estimator (KDE), is used to estimate the RSS probability density, as it avoids assumptions regarding the form of the density. The KDE smooths a discrete histogram into a continuous function, accommodates incomplete RSS data, and requires less data to model the joint distribution than a raw histogram approximation. Although probabilistic techniques are reported to provide higher positioning accuracy than deterministic techniques [4], their higher computational complexity makes them difficult to apply when the observation vector is high-dimensional. In this letter, we investigate the use of PCA [5] to choose subsets of relevant transmitters and to support the construction of joint RSS probability distributions in the areas of interest, in a manner that overcomes the drawbacks of traditional methods. For example, MaxMean [2] chooses the access points (APs) with the strongest average RSS to estimate the user location. However, it unavoidably discards the information carried by the unselected APs. It also requires that at least one AP can communicate with every point on the grid, which makes it suitable only for small areas. In our work, we cluster the RSS tuples based on the deviations of the raw RSS from the estimated path loss model [3], [6]. Our approach then returns to the raw RSS within each cluster and rotates the RSS space into independent principal components (PCs). We can therefore estimate the joint probability density by simply multiplying values from univariate probability density functions (PDFs), similar to [7].
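As a rough illustration of the last step, the sketch below rotates RSS samples into PC space and approximates the joint density as a product of univariate Gaussian KDEs. It is a minimal sketch under stated assumptions, not the implementation used in this letter: the function names, the Gaussian kernel, the rule-of-thumb bandwidth, and the synthetic data are all assumptions. Note also that PCA only guarantees uncorrelated components, so the product form treats them as independent.

```python
import numpy as np


def fit_pca(rss, n_components):
    """Return the mean and leading principal axes of an (n_samples, n_tx) RSS matrix."""
    mean = rss.mean(axis=0)
    cov = np.cov(rss - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)            # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]  # keep the largest ones
    return mean, eigvecs[:, order]


def kde_1d(train, x, bandwidth):
    """Gaussian kernel density estimate at scalar x from 1-D training scores."""
    z = (x - train) / bandwidth
    norm = len(train) * bandwidth * np.sqrt(2.0 * np.pi)
    return float(np.exp(-0.5 * z**2).sum() / norm)


def joint_density(rss_train, observed, n_components=2):
    """Approximate the joint RSS density as a product of per-PC univariate KDEs."""
    mean, axes = fit_pca(rss_train, n_components)
    scores = (rss_train - mean) @ axes   # training samples in PC space
    obs = (observed - mean) @ axes       # observation in PC space
    density = 1.0
    for k in range(n_components):
        col = scores[:, k]
        # Rule-of-thumb bandwidth: an assumption of this sketch, not from the letter.
        h = 1.06 * col.std(ddof=1) * len(col) ** (-0.2)
        density *= kde_1d(col, obs[k], h)
    return density


# Synthetic stand-in for the RSS tuples of one cluster: 4 transmitters, values in dBm.
rng = np.random.default_rng(0)
shared = rng.normal(size=(200, 1))  # common fading term that correlates the transmitters
rss_train = -70.0 + 5.0 * shared + rng.normal(scale=1.0, size=(200, 4))

near = joint_density(rss_train, rss_train.mean(axis=0))
far = joint_density(rss_train, rss_train.mean(axis=0) + 30.0)
```

Here `near` should come out much larger than `far`, since the second observation lies well outside the training data. In a clustered radio map, `joint_density` would be called with the training tuples of each cluster separately, so that the retained PCs and the rotation can differ from cluster to cluster.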
Distinct from [7], which is based on a very small data set in a small indoor area, a) we focus on larger-scale outdoor areas with much larger data sets; and b) we not only use PCA to select an optimal number of transmitters but also apply it within each cluster rather than over the whole area, in order to make the radio map heterogeneous. The number of PCs and the transformations can be markedly different in each cluster.

2162-2337/12$31.00 © 2012 IEEE
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

In this letter, the novel features that contribute to the better accuracy are: (1) the heterogeneity of the environment is modelled through dimensionality reductions and rotations that differ from cluster to cluster. Important correlations are retained in the