2018 Proceedings of the Conference on Information Systems Applied Research ISSN: 2167-1508
Norfolk, Virginia v11 n 4813
©2018 ISCAP (Information Systems & Computing Academic Professionals) Page 1
http://iscap.info
Effects of Normalization Techniques
on Logistic Regression in Data Science
Adekunle Adeyemo
Hayden Wimmer
hayden.himmer@gmail.com
Georgia Southern University
Statesboro, GA, 30458
Loreen Powell
lpowell@bloomu.edu
Bloomsburg University
Bloomsburg, PA 17815
Abstract
Advances in data science have allowed a range of mathematical ideas to be applied to social
patterns in data. This research investigated how different normalization techniques affect the
performance of logistic regression. The original dataset was first modeled with the SQL Server
Analysis Services (SSAS) Logistic Regression model, which served as the baseline for the research.
The normalization methods used to transform the original dataset were then described, and a
separate logistic model was built for each of the three normalization techniques discussed. In terms
of accuracy, decimal scaling marginally outperformed min-max and z-score scaling. When Lift was
used to evaluate the models, however, decimal scaling and z-score performed slightly better than
the min-max method. Future work should test the regression model on other datasets, specifically
those whose dependent variable is binary (a two-category problem) or whose independent
attributes vary widely in magnitude.
Keywords: Normalization, Logistic Regression, Z-Score, Min-Max, Decimal Scaling
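For readers unfamiliar with the three techniques named above, the following is a minimal Python sketch of their common textbook definitions. It is an illustration only: the study itself used SSAS, and the function names, the [0, 1] target range for min-max, the use of the population standard deviation for z-score, and the handling of constant columns are all assumptions made here, not details taken from the paper.

```python
import statistics


def min_max(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly into [new_min, new_max] (assumed default: [0, 1])."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: map everything to new_min (a convention assumed here).
        return [new_min for _ in values]
    return [new_min + (v - lo) / (hi - lo) * (new_max - new_min) for v in values]


def z_score(values):
    """Subtract the mean, then divide by the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # assumption: population, not sample, deviation
    if std == 0:
        # Constant column: every value equals the mean.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


def decimal_scaling(values):
    """Divide by the smallest power of ten that makes every |value| less than 1."""
    largest = max(abs(v) for v in values)
    j = 0
    while largest / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]
```

Each function takes a single numeric column as a list, so a dataset would be normalized one attribute at a time before being passed to a classifier.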
1. INTRODUCTION
Advancements in the field of data science have
allowed the application of several mathematical
concepts to behavioral patterns of data.
In particular, different normalization techniques
have been applied to numerous datasets to
solve problems in many domains. Data
normalization is a preprocessing method used in
many data mining systems, particularly for
classification algorithms such as neural networks,
clustering, and nearest-neighbor classification (Evans,
2016). Much work has been published on
data normalization and its application to
different fields of human endeavor, including Statistical
Normalization and Back Propagation for
Classification, Min-Max Normalization based on
Data Perturbation Method for Privacy Protection,
Importance of Data Normalization for the
Application of Neural Networks to Complex
Industrial Problems, and the Impact of
Normalization Methods on RNA-Seq Data
Analysis. In this research, we investigated how
different normalization techniques affect the
performance of a logistic regression classifier.
Logistic regression is an ideal tool for answering