Improving electricity non technical losses detection including neighborhood information

Pablo Massaferro, Henry Marichal, Matias Di Martino, Fernando Santomauro, Juan Pablo Kosut and Alicia Fernandez

Abstract— Non-technical losses (NTL) cause significant damage to the economies of power supply companies. Detecting abnormal customer behavior is an important and difficult task. In this paper we analyze the impact of considering customers' geolocation information in automatic NTL detection. We propose a methodology that uses a random search procedure to find optimal grid sizes over which a set of local features is computed. The number and size of the grids, and other parameters of the classification algorithm, are adjusted to maximize the area under the receiver operating characteristic curve (AUC), showing performance improvements on a data set of 6 thousand Uruguayan residential customers. A comparative analysis with different subsets of features, which include the monthly consumption, contractual information and the new local features, is presented. In addition, we show that the raw geographical location of customers, used as an input feature, gives competitive results as well. Finally, we evaluate an entirely new database of 6 thousand Uruguayan customers, who were inspected on site by UTE experts between 2015 and 2017.

I. INTRODUCTION

Since the nineteenth century, access to electricity has strongly influenced the way we live. In particular, in the past forty years electrical consumption has increased dramatically, and access to electricity is a major concern in modern societies. In this context, losses in power grids are a very important problem that generates substantial economic losses every year. These losses can be classified into two categories: technical (TL) and non-technical (NTL). Technical losses are associated with dissipation or failures of the electrical components of the power grid, while NTL are associated with electricity theft, faulty meters or billing errors.
NTLs cause significant harm to economies. For example, in India NTLs are estimated at $4.5 billion, and in countries such as Brazil, Malaysia or Lebanon NTLs can represent up to 40% of the total electricity distributed [1], [2]. In the UK and USA, non-technical losses are estimated at between $1 billion and $6 billion [1], [3], [4]. The present work is developed in Uruguay as part of an existing collaboration between the "Universidad de la Republica" (UdelaR) and UTE (the national company in charge of power distribution for the whole country). In Montevideo, the capital of Uruguay, TL and NTL represented 7% and 13%, respectively, of the total energy distributed during the year 2016.

Related Work

Different machine learning approaches, both supervised and unsupervised, have addressed the detection of non-technical losses. Leon et al. review the main research works in the area between 1990 and 2008 [5], and Glauner et al. more recently surveyed the latest work in the field [1]. Several of these approaches consider unsupervised classification using different techniques such as fuzzy clustering [6] and neural networks [7], [8], among others. Monedero et al. used regression based on the correlation between time and monthly consumption, looking for significant drops in consumption [9]. Supervised approaches, on the other hand, build and learn mathematical models that describe the problem based on labeled datasets provided by power distribution companies. For example, many works explored the use of the Support Vector Machine (SVM) algorithm [2], [3], [10], [11] or combinations of SVM with other methods such as genetic algorithms [12].

From the point of view of the features used to represent each customer profile, different paths have been followed. Some distribution companies have access to real-time energy consumption measurements, or to smart meters that can monitor the energy consumed with a temporal resolution of minutes or hours [13].
However, the most common scenario is to have access to monthly [3], [12], [14]–[16] or bimonthly [17] energy consumption. These consumption profiles are obtained from different customers and are used as input features of machine learning systems. It is also common to enhance this feature vector with additional features extracted from the profiles (such as Fourier coefficients and local averages, among many others) [14]. Information on customer consumption is also sometimes complemented with additional information such as meter type, history of theft, or creditworthiness rating [3], [18], among other data that could be associated with the customer profile.

More recently, Glauner et al. [19] included the use of neighborhood local features by splitting the area in which the customers are located into grids of different sizes. For each grid cell they compute the proportion of inspected customers and the proportion of NTL found among the inspected customers. Other recent research [20] uses a Generalized Additive Model to generate a local estimate of NTL, and Markov chains to estimate its future changes. To that end, that work makes use of complementary socio-economic variables obtained from the latest national (Brazilian) census.

In the present work we improve on our previous works [14], [18], presenting a more effective and robust automatic method for NTL detection. Inspired by the recent work of Glauner et al. [19], we define a new set of features and extend some of the ideas presented there. The main contributions of the present work are: (i) We performed thousands of on-site inspections to