Wireless Sensor Network, 2010, 2, 115-122
doi:10.4236/wsn.2010.22016 Published Online February 2010 (http://www.SciRP.org/journal/wsn/).
Copyright © 2010 SciRes. WSN

K-Nearest Neighbor Based Missing Data Estimation Algorithm in Wireless Sensor Networks

Liqiang Pan, Jianzhong Li
School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
Email: {panlq, lijzh}@hit.edu.cn
Received November 21, 2009; revised November 30, 2009; accepted December 4, 2009

Abstract

In wireless sensor networks, missing sensor data is inevitable due to the inherent characteristics of such networks, and it causes many difficulties in various applications. To address this problem, the missing data should be estimated as accurately as possible. In this paper, a k-nearest neighbor based missing data estimation algorithm is proposed that exploits the temporal and spatial correlation of sensor data. It adopts a linear regression model to describe the spatial correlation of sensor data among different sensor nodes, and uses the data of multiple neighbor nodes to estimate the missing data jointly rather than independently, so that stable and reliable estimation performance can be achieved. Experimental results on two real-world datasets show that the proposed algorithm can estimate the missing data accurately.

Keywords: Missing Data, Estimation, Wireless Sensor Networks

1. Introduction

The rapid development of wireless communication, micro-electronics and embedded computation techniques has led to Wireless Sensor Networks (WSNs) being applied in many fields [1–4]. A WSN consists of many sensor nodes deployed in a region of interest to users, and each sensor node has some computing, storage and communication capability. Users issue queries to obtain information about the monitored region.
Given the characteristics of WSNs, many query processing algorithms have been proposed for various applications. However, all of these query processing techniques are hampered by a common problem: missing sensor data. In fact, missing sensor data is inevitable due to the inherent characteristics of WSNs. First, the communication ability of sensor nodes is limited. Some sensor nodes may be isolated from the WSN for a short or long time because of the surrounding environment, such as mountains and obstacles, so the sensor data of these nodes may be lost. In addition, natural conditions such as rain, thunder and lightning also degrade the communication quality of sensor nodes and cause the links between sensor nodes to connect and disconnect frequently. This, too, results in sensor data being lost during transmission. Second, the power of sensor nodes is limited. When a sensor node's power is low, it usually operates in an unstable state. This not only makes communication unstable, which may cause sensor data to be lost, but also means that the sampled sensor data is often useless abnormal data (e.g. the temperature of a room is 300). Abnormal data is treated as missing data since it cannot be used. When the power of a sensor node is exhausted, the node can no longer collect data, and data cached in its storage that has not yet been sent back may also be lost. Furthermore, sensor nodes are small and easily damaged, which may also result in the loss of sensor data. For these reasons, no matter how efficient and robust the query processing algorithms that are developed, missing sensor data is inevitable.

Missing sensor data causes many difficulties in various applications.
For example, in data collection applications, missing data not only decreases the availability of the sensing datasets, but also greatly decreases the efficiency of WSNs. In forest environment research [5], a WSN is deployed in the forest to collect environmental variables such as temperature, humidity, atmospheric pressure and sunlight. Based on the collected sensor data, biologists can study the forest microclimate, dynamic tree respiration, growth models and so on. However, the data collected by sensor nodes is raw data. Biologists need to apply analysis tools to the raw data before they can obtain analysis results and draw conclusions. Unfortunately, the existing analysis tools which are adopted in these fields, such as support vector