Missing Value Estimation in DNA Microarrays
Using B-Splines
Sujay Saha
Heritage Institute of Technology, Kolkata, India
Email: sujay.saha@heritageit.edu
Kashi Nath Dey
University of Calcutta, Kolkata, India
Email: kndey55@gmail.com
Riddhiman Dasgupta, Anirban Ghose, and Koustav Mullick
Heritage Institute of Technology, Kolkata, India
Email: {riddhiman.dasgupta, anighose25}@gmail.com, koustav.mullick@yahoo.com
Abstract—Gene expression profiles generated by the high-
throughput microarray experiments are usually in the form
of large matrices with high dimensionality. Unfortunately,
microarray experiments can generate data sets with
multiple missing values, which significantly affect the
performance of subsequent statistical analysis and machine
learning algorithms. Numerous imputation algorithms have
been proposed to estimate the missing values. However,
most of these algorithms ignore the fact that gene
expression profiles are continuous time series and
instead treat them as discrete data. In
this paper, we present a new approach, FDVSplineImpute,
for time series gene expression analysis that permits the
estimation of missing observations using B-splines of similar
genes from fuzzy difference vectors. We use smoothing
splines to relax the fit so that the curves are less
prone to overfitting the data. Our algorithm shows
significant improvement over the current state-of-the-art
methods in use.
Index Terms—missing value estimation, DNA microarray,
fuzzy logic, B-Spline, FDVSplineImpute
Manuscript received December 13, 2012; revised February 5, 2013
I. INTRODUCTION
A gene expression microarray is a collection of
microscopic DNA spots attached to a solid surface, which
is used to study the expression levels of thousands of
genes under various conditions simultaneously. Gene
expression microarray experiments generate very large
datasets in the form of matrices of gene expression
levels under various experimental conditions. Each row
of a gene expression matrix corresponds to a gene of the
organism used in the experiment, while each column
refers to a particular experimental condition under
which the corresponding gene was examined. However,
biological experiments tend to generate gene expression
matrices that contain missing values. These missing
values occur due to errors in the experimental process
that lead to corruption or absence of expression
measurements. Many statistical methods used for gene
expression analysis require a complete gene expression
matrix to provide accurate results. Methods such as
hierarchical clustering and K-means clustering are not
robust to missing values. Hence, it is necessary to
devise accurate methods that impute data values when
they are missing.
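As a minimal illustration of the matrix layout described above, the following sketch builds a small hypothetical gene expression matrix (rows are genes, columns are experimental conditions) with missing entries encoded as NaN. It then shows that a plain Euclidean distance, a measure commonly used by K-means and hierarchical clustering, becomes undefined as soon as one of the two profiles has a gap. The numerical values, the NaN encoding, and the helper names `missing_summary` and `euclidean` are assumptions made for illustration only; they are not taken from the data sets or code used in this paper.

```python
import numpy as np

# Hypothetical 4-gene x 5-condition expression matrix (illustrative values only).
# Missing measurements are encoded as NaN, which is an assumed convention here.
expr = np.array([
    [0.8, 1.1, np.nan, 1.9, 2.3],
    [0.7, 1.0, 1.4, 1.8, 2.2],
    [2.1, np.nan, 1.2, 0.9, np.nan],
    [2.0, 1.6, 1.3, 1.0, 0.6],
])

def missing_summary(matrix):
    """Return (count of missing entries, fraction missing, indices of rows with gaps)."""
    mask = np.isnan(matrix)
    rows_with_gaps = np.flatnonzero(mask.any(axis=1))
    return int(mask.sum()), float(mask.mean()), rows_with_gaps

def euclidean(a, b):
    """Plain Euclidean distance; a NaN in either vector propagates to the result."""
    return float(np.sqrt(np.sum((a - b) ** 2)))

n_missing, frac_missing, gap_rows = missing_summary(expr)
d_incomplete = euclidean(expr[0], expr[1])  # gene 0 has one missing value
d_complete = euclidean(expr[1], expr[3])    # both genes are fully observed

print("missing entries:", n_missing, "fraction:", frac_missing)
print("genes with gaps:", gap_rows)
print("distance with a gap:", d_incomplete)
print("distance without gaps:", d_complete)
```

The distance between a gene with a gap and any other gene evaluates to NaN, so a distance-based clustering step cannot proceed until the gaps are either removed or imputed.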
A time series is a sequence of data points sampled
at regular intervals of time. Gene expression time series
data form a special class of microarray data in which
gene expression levels are sampled at regular intervals
of time. Data sets measuring the temporal behavior of
thousands of genes offer rich opportunities for
computational biologists [1]. However, a time-series gene
expression data set is typically sparse, as it contains
only a handful of data points, so an accurate prediction
method must be used for estimation.
II. SPLINES
A spline curve is a sequence of curve segments that are
connected together to form a single continuous curve.
Splines are piecewise polynomials with boundary,
continuity and smoothness constraints. The use of
piecewise low-degree polynomials results in smooth
curves, thereby avoiding the overfitting that would
occur if a single high-degree polynomial were used for
estimation. A cubic polynomial can be written in terms
of a set of four normalized basis functions. A very
popular basis is the B-spline basis. When fitting curves
to gene expression time-series data, the B-spline basis
makes it convenient to obtain approximating or smoothing
splines rather than interpolating splines. Smoothing
splines use fewer basis coefficients than there are
observed data points, which helps to avoid overfitting.
In this regard, the coefficients can be interpreted
geometrically as control
Journal of Medical and Bioengineering Vol. 2, No. 2, June 2013
©2013 Engineering and Technology Publishing
doi: 10.12720/jomb.2.2.88-92