Missing Value Estimation in DNA Microarrays Using B-Splines

Sujay Saha
Heritage Institute of Technology, Kolkata, India
Email: sujay.saha@heritageit.edu

Kashi Nath Dey
University of Calcutta, Kolkata, India
Email: kndey55@gmail.com

Riddhiman Dasgupta, Anirban Ghose, and Koustav Mullick
Heritage Institute of Technology, Kolkata, India
Email: {riddhiman.dasgupta, anighose25}@gmail.com, koustav.mullick@yahoo.com

Abstract: Gene expression profiles generated by high-throughput microarray experiments are usually in the form of large matrices with high dimensionality. Unfortunately, microarray experiments can generate data sets with multiple missing values, which significantly affect the performance of subsequent statistical analysis and machine learning algorithms. Numerous imputation algorithms have been proposed to estimate the missing values. However, most of these algorithms fail to take into account the fact that gene expressions are continuous time series, and instead treat gene expression profiles as discrete data. In this paper, we present a new approach, FDVSplineImpute, for time series gene expression analysis that permits the estimation of missing observations using B-splines of similar genes from fuzzy difference vectors. We have used smoothing splines to relax the fit of the splines so that they are less prone to overfitting the data. Our algorithm shows significant improvement over the current state-of-the-art methods in use.

Index Terms: missing value estimation, DNA microarray, fuzzy logic, B-Spline, FDVSplineImpute

I. INTRODUCTION

A gene expression microarray is a collection of microscopic DNA spots attached to a solid surface, which is used to study the expression levels of thousands of genes under various conditions simultaneously. Gene expression microarray experiments generate very large datasets in the form of matrices of gene expression levels under various experimental conditions.
Each row of a gene expression matrix corresponds to a gene of the organism used in the experiment, while each column refers to a particular experimental condition under which the corresponding gene was examined. However, biological experiments tend to generate gene expression matrices that contain missing values. These missing values occur due to errors in the experimental process that lead to corruption or absence of expression measurements. Various statistical methods used for gene expression analysis require the complete gene expression matrix to provide accurate results. Methods such as hierarchical clustering and K-Means clustering are not robust to missing values. Hence, it is necessary to devise proper and accurate methods that impute data values when they are missing.

Time series data are a sequence of data points sampled at regular intervals of time. Gene expression time series data are a special class of microarray data in which gene expression levels are sampled at regular intervals of time. Data sets measuring the temporal behavior of thousands of genes offer rich opportunities for computational biologists [1]. A time-series gene expression data set is very sparse in nature, as it contains only a handful of data points. A very accurate prediction method must therefore be used for estimation.

Manuscript received December 13, 2012; revised February 5, 2013.

II. SPLINES

A spline curve is a sequence of curve segments that are connected together to form a single continuous curve. Splines are piecewise polynomials with boundary, continuity, and smoothness constraints. The use of piecewise low-degree polynomials results in smooth curves, thereby avoiding the overfitting that would occur if a single high-degree polynomial were used for estimation. One can write a cubic polynomial in terms of a set of four normalized basis functions. A very popular basis is the B-spline basis.
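As an illustration of the statement that a cubic segment is a combination of four normalized basis functions, the following is a minimal sketch that evaluates cubic B-spline basis functions with the standard Cox-de Boor recursion. It is not taken from the paper's implementation: the uniform knot vector, the evaluation point, the coefficient values, and the function names are all assumptions chosen only for illustration.

```python
def bspline_basis(i, k, knots, x):
    """Cox-de Boor recursion: value at x of the i-th B-spline basis
    function of degree k defined on the given knot vector."""
    if k == 0:
        # Degree 0: indicator function of the half-open knot span.
        return 1.0 if knots[i] <= x < knots[i + 1] else 0.0
    left = 0.0
    if knots[i + k] != knots[i]:
        left = ((x - knots[i]) / (knots[i + k] - knots[i])
                * bspline_basis(i, k - 1, knots, x))
    right = 0.0
    if knots[i + k + 1] != knots[i + 1]:
        right = ((knots[i + k + 1] - x) / (knots[i + k + 1] - knots[i + 1])
                 * bspline_basis(i + 1, k - 1, knots, x))
    return left + right


degree = 3                            # cubic
knots = [0, 1, 2, 3, 4, 5, 6, 7]      # assumed uniform knot vector
n_basis = len(knots) - degree - 1     # four basis functions
x = 3.5                               # a point in the span [3, 4)

# Values of the four basis functions that are active at x.
values = [bspline_basis(i, degree, knots, x) for i in range(n_basis)]

# A spline value is the coefficient-weighted sum of the basis values.
coeffs = [0.2, 0.9, 1.4, 0.7]         # hypothetical coefficients
spline_value = sum(c * b for c, b in zip(coeffs, values))
```

On the span [3, 4) all four basis values are positive and sum to 1, which is the sense in which the basis is normalized.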
For the application of fitting curves to gene expression time-series data, it is quite convenient with the B-spline basis to obtain approximating or smoothing splines rather than interpolating splines. Smoothing splines use fewer basis coefficients than there are observed data points, which is helpful in avoiding overfitting. In this regard, the coefficients can be interpreted geometrically as control
Journal of Medical and Bioengineering Vol. 2, No. 2, June 2013 ©2013 Engineering and Technology Publishing doi: 10.12720/jomb.2.2.88-92