Customer Segmentation Using Unsupervised Learning on Daily Energy Load Profiles

J. du Toit
Eskom, Brackenfell, South Africa
Email: jacowp357@gmail.com

R. Davimes, A. Mohamed, K. Patel, and J.M. Nye
Eskom, Brackenfell, South Africa
Email: {ricdavimes, ashfaaqmohamed22, kavirpatelkp, nyejon}@gmail.com

Abstract—Power utilities collect a large amount of metering data from substations and customers. This data can provide insights for planning outages, making network investment decisions, predicting future load growth and performing predictive maintenance. One of the requirements is the ability to group similarly behaving loads together. This paper compares different similarity measures, used in the k-means clustering algorithm, for grouping daily load profiles based on metering data. The various methods are compared using two well-known cluster evaluation metrics, and the results are then analysed by subject matter experts to determine the validity of the findings. The results, from our particular data set, indicate that various speed improvement techniques can complement the k-means algorithm without sacrificing intra- to inter-cluster accuracy. A small increase in the optimal number of clusters, guided by domain expertise, allowed additional profiles to be extracted that were not explained by algorithmic evaluations. The interplay between theoretical evaluations and domain knowledge led to a preferred number of clusters for practical purposes.

Index Terms—daily load profiles, customer segmentation, k-means clustering, similarity measures, non-uniform binary split

I. INTRODUCTION

Utilities around the world are preparing to use the vast amount of metering and general customer data that will be collected following the rollout of smart metering devices. It is vital that utilities extract value from this data and use it to better understand their customers and make improved investment decisions.
Conventional descriptive statistics have limitations when used to extract information, make predictions, or automate customer monitoring and network anomaly detection. This paper investigates a machine learning approach, using unsupervised learning techniques that allow for more in-depth analytical capability, accurate predictions and processing of high volumes of data.

A well-known unsupervised learning technique, the k-means clustering algorithm, is explored. Four distance metrics for the k-means algorithm are studied, and the differences between the resulting variants are analysed using two different evaluation metrics. The objective is to group similar load profiles, using the different metrics, and then compare the differences between the groups. Principal component analysis (PCA) and a suboptimal cluster boundary technique, the non-uniform binary split (NUBS) algorithm, are both used to improve convergence time. The cluster centroids are interpreted and further explained by visualisation and linkage to geographic locations for practical analysis of the various load types.

The paper is structured as follows: Section II covers the research methodology and data preparation, Section III showcases the results and Section IV concludes the paper.

A. Literature review

The advantages of segmenting daily energy profiles include: achieving more sustainable and efficient urban or rural development, allowing a more flexible electricity market, smoothing of energy peaks and imbalances, and enhancing the exploitation of renewable energy sources [1] [2] [3] [4] [5] [6] [7]. The shapes of daily load profiles can be grouped into three main types: residential, industrial and commercial [8]. Methods that have been applied and modified to segment temporal data include: variations of hierarchical clustering, variations of k-means clustering, self-organising maps, and fuzzy clustering [9] [10].
Examples of work that make use of techniques related to this study can be found in [11] [12] [13] [14] [15] [16] [17] [18]. These studies focused on comparing and optimising similarity measures used in some of the techniques mentioned above. Most of these studies aimed to overcome issues related to clustering temporal data, such as: sensitivity to small changes, sum-of-distance measures not capturing the shape of a curve, and the computational complexity of other similarity measures [19]. Optimisation of k-means with respect to convergence time remains a gap in the literature for the application considered in this study. Ref. [20] introduces a low-complexity pre-clustering technique known as the Non-Uniform Binary Split (NUBS) algorithm, which can be used to initialise the k-means centroids. In addition to exploring the different similarity measures, this paper combines these measures with techniques that can be used to reduce the convergence time of the k-means clustering algorithm.
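To make the ingredients discussed above concrete, the following is a minimal sketch, not the implementation used in this study. It shows a k-means loop that accepts externally supplied initial centroids and a selectable distance metric, applied after a PCA projection. Everything in it is an assumption made for illustration: the function names (`pca_reduce`, `pairwise_distance`, `kmeans`), the synthetic 24-point profiles, the two metrics shown and the choice of three components. The NUBS algorithm of [20] is not reproduced; two hand-picked profiles stand in for its output as initial centroids.

```python
import numpy as np


def pca_reduce(X, n_components):
    """Project the rows of X onto the top principal components (via SVD)."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T


def pairwise_distance(X, C, metric="euclidean"):
    """Distance from every row of X to every row of C."""
    diff = X[:, None, :] - C[None, :, :]
    if metric == "euclidean":
        return np.sqrt((diff ** 2).sum(axis=2))
    if metric == "manhattan":
        return np.abs(diff).sum(axis=2)
    raise ValueError(f"unknown metric: {metric}")


def kmeans(X, init_centroids, metric="euclidean", max_iter=100):
    """Lloyd-style k-means started from the given initial centroids.

    The centroid update is always the arithmetic mean, which is only the
    exact minimiser for squared Euclidean distance; for other metrics it
    is a simplification made in this sketch.
    Returns (labels, centroids, iterations_used).
    """
    centroids = np.array(init_centroids, dtype=float)
    labels = None
    for iteration in range(1, max_iter + 1):
        new_labels = pairwise_distance(X, centroids, metric).argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        for j in range(len(centroids)):
            members = X[labels == j]
            if len(members):  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return labels, centroids, iteration


# Synthetic stand-in data (assumption): 24 hourly readings per daily profile.
rng = np.random.default_rng(0)
hours = np.arange(24)
evening_peak = np.exp(-0.5 * ((hours - 19) / 2.0) ** 2)
daytime_plateau = ((hours >= 8) & (hours <= 17)).astype(float)
profiles = np.vstack([
    evening_peak + 0.05 * rng.standard_normal((50, 24)),
    daytime_plateau + 0.05 * rng.standard_normal((50, 24)),
])

reduced = pca_reduce(profiles, n_components=3)
# Placeholder initialisation: one profile of each shape. A pre-clustering
# step such as NUBS would supply these centroids instead.
init = reduced[[0, 50]]
labels, centroids, n_iter = kmeans(reduced, init, metric="euclidean")
```

In this sketch the convergence cost can be read from `n_iter`, so different initialisations or different numbers of retained components can be compared on the same data by inspecting that value alongside a cluster quality metric.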