3837 | International Journal of Current Engineering and Technology, Vol.4, No.6 (Dec 2014)
Research Article
International Journal of Current Engineering and Technology
E-ISSN 2277 – 4106, P-ISSN 2347 - 5161
©2014 INPRESSCO®, All Rights Reserved
Available at http://inpressco.com/category/ijcet
Logistic Regression in Data Mining and its Application in Identification of
Disease
Dhaval Sanghavi Ȧ*, Hitarth Patel Ȧ and Sindhu Nair Ȧ
Ȧ Computer Science, Dwarkadas J. Sanghvi College of Engineering, Vile Parle (W), Mumbai-400056, India
Accepted 05 Nov 2014, Available online 01 Dec 2014, Vol.4, No.6 (Dec 2014)
Abstract
Data mining in clinical medicine deals with learning models to predict the health of patients. These models are used to support
clinicians in therapeutic or monitoring tasks. Data mining techniques are usually applied in clinical contexts to analyze
retrospective data, giving professionals the opportunity to review large amounts of data routinely collected during their daily activity.
Moreover, clinicians can take advantage of data mining techniques to deal with the volume of research results produced
by molecular medicine, which may allow a transition from population-based to personalized medicine. Logistic regression
is used to analyze relationships between a dichotomous dependent variable and metric or dichotomous independent
variables.
Keywords: logistic regression, feature extraction, Data Mining
1. Introduction
Predictions may range from the simple stratification of the
patient population on the basis of known risk factors,
such as age or lifestyle, to the forecast of the effect that a
treatment or drug may have on a single patient. Generally
speaking, in a clinical context, predictions may support
diagnostic, therapeutic, or monitoring tasks. Diagnosis is
related to the classification of patients into disease classes
or subclasses on the basis of patients' data.
Logistic regression is used to analyze relationships
between a dichotomous dependent variable and metric or
dichotomous independent variables. Logistic regression
combines the independent variables to estimate the
probability that an event will occur. The variate or value
produced by logistic regression is a probability value
between 0.0 and 1.0. If the probability for group
membership in the modeled category is above some cut
point (the default is 0.5), the subject is predicted to be a
member of the modeled group. If the probability is less
than the cut point, the subject is predicted to be a member
of the other group. For any given case, logistic regression
computes the probability that a case with a particular set
of values for the independent variables is a member of the
modeled category.
Y_i = e^u / (1 + e^u)
where Y_i is defined as the estimated probability that the ith
case is in a category and u is the usual linear regression
equation, that is, a constant plus a weighted sum of the
independent variables.
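As a minimal sketch of the calculation described above, the following Python code evaluates the logistic function and applies the cut point. It assumes the linear predictor has the form u = intercept + sum of coefficient times variable; the function names, the coefficient values and the two predictors (age and a smoker flag) are hypothetical and chosen only for illustration, not taken from this study.

```python
import math

def logistic(u):
    """Map a linear predictor u to a probability, e^u / (1 + e^u)."""
    # Two algebraically equal branches avoid overflow for large |u|.
    if u >= 0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)

def linear_predictor(intercept, coefficients, x):
    """Assumed form of u: intercept plus weighted sum of the variables."""
    return intercept + sum(b * xj for b, xj in zip(coefficients, x))

def predict_group(probability, cut_point=0.5):
    """Return 1 (modeled group) if probability is above the cut point.

    A probability exactly equal to the cut point is assigned to the
    other group here; the text does not specify that case.
    """
    return 1 if probability > cut_point else 0

# Made-up coefficients and a made-up case, for illustration only.
intercept = -3.0
coefficients = [0.05, 1.2]   # hypothetical: age in years, smoker flag
case = [50, 1]

u = linear_predictor(intercept, coefficients, case)
p = logistic(u)
label = predict_group(p)
print(f"u = {u:.2f}, probability = {p:.3f}, predicted group = {label}")
```

Raising or lowering `cut_point` trades sensitivity against specificity, which is why the default of 0.5 is only a starting point in diagnostic settings.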
Feature-selection-based data mining is one of
the most important research directions in the field of
machine learning. In recent years especially, with
the appearance of many high-dimension / small-sample
problems in areas such as natural language processing, biological
information, economics and finance, networking and
telecommunications, and medical data analysis, the study of feature
selection has once again become a focus of attention.
*Corresponding author: Dhaval Sanghavi
Identifying biomarkers with high sensitivity and
specificity for diseases with high mortality and morbidity, such
as UA, plays a key role in their diagnosis and prognosis.
Moreover, in recent years research interest in
biomarker identification has shifted from
a single specific biomarker to biomarker patterns with
interactions. We have found that feature-selection-based
data mining methods are better suited to investigating the
biological basis of a syndrome. The definition of “characteristic
pattern” was proposed to reduce the gap between the “golden
index” and the biological basis of a syndrome.
2. Materials and Methods
2.1 Statistical methods to detect metabolites and proteins
with significant change
An independent-samples t test and analysis of variance
(ANOVA) were used to detect biomarkers with a significant
change in concentration between the disease and control
samples. A p value was calculated to measure the significance
of the difference in mean and variance of each biomarker between the two
groups. Note that a sample with a missing value was
excluded from the p value calculation, since substituting the arithmetic mean
for the missing value may obscure the significance of a biomarker.
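The screening step above can be sketched as follows for a single biomarker. This is a minimal illustration under several assumptions, not the authors' procedure: missing values are marked with `None` and dropped per biomarker, the t test uses the pooled-variance (equal variance) form, and the two-sided p value is obtained by numerically integrating the Student's t density with Simpson's rule so that the example needs only the Python standard library. The concentration values are made up. In practice a statistics library routine such as `scipy.stats.ttest_ind` would normally be used instead.

```python
import math
from statistics import mean, variance

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(t * t / df))

def two_sided_p(t, df, steps=2000):
    """Two-sided p value: 1 - 2 * integral of the density over [0, |t|]."""
    a = abs(t)
    if a == 0:
        return 1.0
    h = a / steps                      # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return max(0.0, 1.0 - 2.0 * area)

def independent_t_test(disease, control):
    """Pooled-variance two-sample t test; None marks a missing value."""
    x = [v for v in disease if v is not None]   # drop missing values
    y = [v for v in control if v is not None]
    nx, ny = len(x), len(y)
    df = nx + ny - 2
    pooled = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / df
    t = (mean(x) - mean(y)) / math.sqrt(pooled * (1 / nx + 1 / ny))
    return t, df, two_sided_p(t, df)

# Made-up concentrations of one hypothetical biomarker.
disease = [5.1, 4.9, None, 5.6, 5.3]
control = [4.2, 4.0, 4.4, None, 4.1]

t_stat, dof, p_value = independent_t_test(disease, control)
print(f"t = {t_stat:.3f}, df = {dof}, p = {p_value:.5f}")
```

With only two groups, a one-way ANOVA gives an F statistic equal to the square of this pooled-variance t statistic, so both tests lead to the same p value; ANOVA becomes the distinct choice when more than two groups are compared.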
2.2 Feature selection based data mining methods
We use three kinds of feature selection methods: Filter,
Wrapper and Embedded, to carry out a comparison study