Wireless Sensor Network, 2010, 2, 115-122
doi:10.4236/wsn.2010.22016 Published Online February 2010 (http://www.SciRP.org/journal/wsn/).
Copyright © 2010 SciRes. WSN
K-Nearest Neighbor Based Missing Data Estimation
Algorithm in Wireless Sensor Networks
Liqiang Pan, Jianzhong Li
School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
Email: {panlq, lijzh}@hit.edu.cn
Received November 21, 2009; revised November 30, 2009; accepted December 4, 2009
Abstract
In wireless sensor networks, the loss of sensor data is inevitable due to the inherent characteristics of
these networks, and it causes many difficulties in various applications. To solve this problem, the
missing data should be estimated as accurately as possible. In this paper, a k-nearest neighbor based missing
data estimation algorithm is proposed based on the temporal and spatial correlation of sensor data. It adopts
a linear regression model to describe the spatial correlation of sensor data among different sensor nodes,
and uses the data of multiple neighbor nodes to estimate the missing data jointly rather than
independently, so that stable and reliable estimation performance can be achieved. Experimental results on
two real-world datasets show that the proposed algorithm can estimate the missing data accurately.
Keywords: Missing Data, Estimation, Wireless Sensor Networks
1. Introduction
The rapid development of wireless communication
techniques, micro-electronics techniques and embedded
computation techniques has enabled Wireless Sensor
Networks (WSNs) to be applied in many fields [1–4].
WSNs consist of many sensor nodes deployed in a
particular region in which users are interested, and each
sensor node has some computing, storage and
communication ability. Users issue queries to obtain
information about the monitored region. Given the
characteristics of WSNs, many query processing
algorithms have been proposed for various applications.
However, all of these query processing techniques are
hampered by a common problem: the loss of sensor data.
In fact, the loss of sensor data is inevitable due to the
inherent characteristics of WSNs. First, the
communication ability of sensor nodes is limited. Some
sensor nodes may be isolated from the WSN for a short
or long time because of the surrounding environment,
such as mountains and obstacles, so the sensor data of
these nodes may be lost. In addition, natural conditions
such as rain, thunder and lightning also affect the
communication quality of sensor nodes and cause the
communication links between sensor nodes to connect
and disconnect frequently. This also results in sensor data
being lost during transmission. Second, the power of sensor
nodes is limited. When a sensor node’s power is low, it
usually works in an unstable state. This not only causes
unstable communication, which may lead to lost sensor
data, but also means that the sampled sensor data are
often useless abnormal data (e.g., the temperature of a
room is reported as 300℃). Abnormal data are regarded
as missing data since they can never be used. When the
power of a sensor node is exhausted, the sensor node
cannot collect data any more, and the data cached in
storage that have not yet been sent back may also be lost.
In addition, sensor nodes are small and easily damaged,
which may also result in the loss of sensor data. For the
reasons given above, no matter how efficient and robust
the query processing algorithms that are developed, the
loss of sensor data is inevitable.
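The treatment of abnormal readings as missing data, described above, can be illustrated with a minimal sketch. It assumes a plausible value range for the sensed quantity is known in advance; the bounds, the function name `mark_abnormal_as_missing` and the sample values are hypothetical and are not taken from the paper.

```python
import math

# Hypothetical plausible range for a room temperature sensor, in degrees Celsius.
# These bounds are illustrative assumptions, not values from the paper.
TEMP_MIN_C = -40.0
TEMP_MAX_C = 60.0


def mark_abnormal_as_missing(readings, low=TEMP_MIN_C, high=TEMP_MAX_C):
    """Return a copy of readings in which lost or out-of-range values are None."""
    cleaned = []
    for value in readings:
        # A reading is unusable if it never arrived (None), is NaN,
        # or falls outside the assumed plausible range.
        if value is None or math.isnan(value) or not (low <= value <= high):
            cleaned.append(None)  # treated as missing
        else:
            cleaned.append(value)
    return cleaned


# Illustrative samples: one abnormal reading (300.0) and one lost reading (None).
samples = [21.5, 22.0, 300.0, None, 21.8]
cleaned = mark_abnormal_as_missing(samples)
```

After this step, lost and abnormal readings are represented in the same way, so a later estimation step only has to handle one kind of gap.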
The loss of sensor data causes many difficulties in
various applications. For example, in data collection
applications, missing data not only reduce the
availability of sensing datasets, but also greatly reduce
the efficiency of WSNs. In research on forest
environments [5], a WSN is deployed in the forest to
collect environmental variables such as temperature,
humidity, atmospheric pressure and sunlight. Based on
the collected sensor data, biologists can study the forest
microclimate, dynamic tree respiration, growth models,
and so on. However, the data collected by sensor nodes
are raw data. Biologists need to apply analysis tools to the
large amounts of raw data before they can obtain analysis
results and draw conclusions. Unfortunately, the existing analysis tools
which are adopted in these fields, such as support vector