Research Article
International Journal of Current Engineering and Technology, Vol.4, No.6 (Dec 2014)
E-ISSN 2277 – 4106, P-ISSN 2347 - 5161
©2014 INPRESSCO®, All Rights Reserved
Available at http://inpressco.com/category/ijcet

Logistic Regression in Data Mining and its Application in Identification of Disease

Dhaval Sanghavi Ȧ*, Hitarth Patel Ȧ and Sindhu Nair Ȧ
Ȧ Computer Science, Dwarkadas J. Sanghvi College of Engineering, Vile Parle (W), Mumbai-400056, India

Accepted 05 Nov 2014, Available online 01 Dec 2014

Abstract

Data mining in clinical medicine deals with learning models to predict patients' health. The models are used to support clinicians in therapeutic or monitoring tasks. Data mining techniques are usually applied in clinical contexts to analyze retrospective data, thus giving professionals the opportunity to examine the large amounts of data routinely collected during their daily activity. Moreover, clinicians can take advantage of data mining techniques to cope with the volume of research results produced by molecular medicine, which may allow a transition from population-based to personalized medicine. Logistic regression is used to analyze relationships between a dichotomous dependent variable and metric or dichotomous independent variables.

Keywords: logistic regression, feature extraction, Data Mining

1. Introduction

Predictions may range from the simple stratification of the patient population on the basis of known risk factors, such as age or lifestyle, to the forecast of the effect that a treatment or drug may have on a single patient. Generally speaking, in a clinical context, predictions may support diagnostic, therapeutic, or monitoring tasks. Diagnosis is related to the classification of patients into disease classes or subclasses on the basis of patients' data.
Logistic regression is used to analyze relationships between a dichotomous dependent variable and metric or dichotomous independent variables. Logistic regression combines the independent variables to estimate the probability that an event will occur. The variate, or value, produced by logistic regression is a probability between 0.0 and 1.0. If the probability of membership in the modeled category is above some cut point (the default is 0.5), the subject is predicted to be a member of the modeled group. If the probability is below the cut point, the subject is predicted to be a member of the other group. For any given case, logistic regression computes the probability that a case with a particular set of values for the independent variables is a member of the modeled category:

$$Y_i = \frac{e^{u}}{1 + e^{u}}$$

where $Y_i$ is defined as the estimated probability that the $i$th case is in a category and $u$ is the regular linear regression equation:

$$u = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_k X_k$$

with constant $b_0$ and coefficients $b_1, \dots, b_k$ for the independent variables $X_1, \dots, X_k$.

*Corresponding author: Dhaval Sanghavi

Feature selection based data mining is one of the most important research directions in machine learning. In recent years especially, with the appearance of many high-dimension, small-sample problems in areas such as natural language processing, biological information, economics and finance, networks and telecommunications, and medical data analysis, the study of feature selection has once again become a focus of attention. Identifying biomarkers with high sensitivity and specificity for diseases with high mortality and morbidity, such as UA, plays a key role in their diagnosis and prognosis. Moreover, in recent years research interest in biomarker identification has shifted from single specific biomarkers to biomarker patterns with interactions. We have found that feature selection based data mining methods are better suited to investigating the biological basis of a syndrome.
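Returning to the logistic model defined earlier in this section, the following minimal sketch shows how an estimated probability is turned into a group prediction. The intercept, coefficients, and the two independent variables (age and a smoker flag) are hypothetical values chosen only for illustration; they are not estimates from this study. The text does not say how a probability exactly equal to the cut point is handled, so the sketch assumes such a case goes to the other group.

```python
import math


def logistic_probability(intercept, coefficients, values):
    """Estimated probability Y_i = e^u / (1 + e^u) for a single case."""
    # u is the linear predictor: the constant plus the weighted sum
    # of the independent variables.
    u = intercept + sum(b * x for b, x in zip(coefficients, values))
    # Algebraically equal to e^u / (1 + e^u); the two branches only
    # avoid floating-point overflow when |u| is large.
    if u >= 0:
        return 1.0 / (1.0 + math.exp(-u))
    e_u = math.exp(u)
    return e_u / (1.0 + e_u)


def predict_group(probability, cut_point=0.5):
    """Return 1 (modeled group) if probability is above the cut point, else 0."""
    # Assumption: a probability exactly at the cut point is assigned to group 0.
    return 1 if probability > cut_point else 0


# Hypothetical model: these numbers are illustrative, not fitted values.
intercept = -4.0
coefficients = [0.05, 1.2]  # per year of age, smoker flag (0 or 1)
case = [60, 1]              # a 60-year-old smoker

p = logistic_probability(intercept, coefficients, case)
group = predict_group(p)
print(f"estimated probability = {p:.4f}, predicted group = {group}")
```

Here u = -4.0 + 0.05 × 60 + 1.2 × 1 = 0.2, so the estimated probability lies slightly above the default cut point and the case is assigned to the modeled group. Raising the cut point makes the model more conservative about assigning cases to that group.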
The definition of “Characteristic pattern” was proposed to reduce the gap between the “golden index” and the biological basis of a syndrome.

2. Material and Methods

2.1 Statistical methods to detect metabolites and proteins with significant change

The independent-samples t test and analysis of variance (ANOVA) were used to detect biomarkers whose concentrations changed significantly between the disease and control samples. A p value was calculated for each biomarker to measure the significance of the difference in mean and variance between the two groups. Samples with a missing value were excluded from the p value calculation, since substituting the arithmetic mean may obscure the significance of a biomarker.

2.2 Feature selection based data mining methods

We use three kinds of feature selection methods: Filter, Wrapper and Embedded, to carry out a comparison study
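As a rough illustration of the Filter family only, the sketch below scores each biomarker independently of any classifier and ranks the biomarkers by that score. It is not the authors' procedure. It assumes the score is the absolute Welch t statistic (one variant of the independent-samples t test), that missing values are encoded as `None` and dropped per biomarker, and that the biomarker names `m1` and `m2` and all concentrations are made-up data.

```python
import math
from statistics import mean, variance


def welch_t(disease, control):
    """Welch t statistic for one biomarker; None entries (missing) are dropped."""
    a = [v for v in disease if v is not None]
    b = [v for v in control if v is not None]
    # Standard error of the difference in means, using sample variances.
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se


def filter_rank(disease_rows, control_rows, names):
    """Rank biomarkers by |t|, largest first (a simple Filter criterion)."""
    scores = {}
    for j, name in enumerate(names):
        disease_col = [row[j] for row in disease_rows]
        control_col = [row[j] for row in control_rows]
        scores[name] = abs(welch_t(disease_col, control_col))
    return sorted(scores, key=scores.get, reverse=True)


# Made-up concentrations: one row per sample, one column per biomarker.
names = ["m1", "m2"]
disease_rows = [[5.1, 1.0], [5.3, 1.2], [5.2, None], [5.4, 0.9]]
control_rows = [[4.0, 1.1], [4.2, 1.0], [4.1, 1.2], [3.9, 0.8]]

ranking = filter_rank(disease_rows, control_rows, names)
print(ranking)
```

In this toy data `m1` differs clearly between the groups while `m2` does not, so `m1` is ranked first. Wrapper and Embedded methods differ in that they evaluate features through a learning algorithm rather than through a standalone score.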