0016-7622/2018-92-3-305/$ 1.00 © GEOL. SOC. INDIA | DOI: 10.1007/s12594-018-1012-9
JOURNAL GEOLOGICAL SOCIETY OF INDIA
Vol.92, September 2018, pp.305-312
Rainfall-Runoff Modeling using Clustering and Regression
Analysis for the River Brahmaputra Basin
Satanand Mishra (a,*), C. Saravanan (b), V. K. Dwivedi (c) and J. P. Shukla (d)
(a,d) Water Resource Management Group, CSIR-Advanced Material Process & Research Institute, Bhopal - 462 064, India
(b) Computer Centre; (c) Department of Civil Engineering, National Institute of Technology, Durgapur - 713 209, India
*E-mail: snmishra07@gmail.com
ABSTRACT
In this research, k-means clustering, agglomerative hierarchical
clustering (AHC) and regression analysis are applied to real
hydrological time series to derive patterns and models that support
data analysis, pattern discovery and forecasting of runoff from the
catchment. The study compares actual field data with predicted
values and validates the statistical results obtained from cluster
analysis and regression analysis against an ARIMA model. Seasonal
autoregressive integrated moving average (SARIMA) and
autoregressive integrated moving average (ARIMA) models are
investigated for monthly runoff forecasting. Different parameters
are analyzed to validate the results against causal effects. The
model results obtained by k-means and AHC are very similar. The
model results are compared with causal effects in the same scenario,
and the developed model is found to be more suitable for runoff
forecasting. The average value of $R^2$ is 0.92 for eight ARIMA
models, which indicates the higher accuracy of the developed ARIMA
model under these processes. The developed rainfall-runoff models
are highly useful for water resources planning and development.
INTRODUCTION
Data mining is a process that uses a variety of data analysis
tools to discover patterns and relationships in data that may be used to
make valid predictions. Data mining, also popularly referred to as
knowledge discovery in databases (KDD), is defined as the discovery
of comprehensible, important and previously unknown rules, or of
anything useful, non-trivial or unexpected, from collected data
(Piatetsky-Shapiro and Frawley, 1991). Data mining is
a multi-disciplinary field, drawing work from areas including
database technology, machine learning, statistics, pattern recognition,
information retrieval, neural networks, knowledge-based systems,
artificial intelligence, high-performance computing, and data
visualization (Han and Kamber, 2001). A large number of data
mining techniques and tools are available for extracting trends,
characteristics or rules from data. A selection of those relevant to
hydrology is covered in this study. Clustering, classification,
association rule extraction and dominant mode analysis techniques
can be used in hydrological modeling. Temporal data mining is
concerned with data mining of large sequential data sets (Laxman and
Sastry, 2006).
Regression analysis is a statistical tool for the investigation of
relationships among variables. It includes many techniques for
modeling and analyzing several variables, when the focus is on the
relationship between a dependent variable and one or more independent
variables. From the hydrological point of view, the researcher seeks
to ascertain the causal effect of one variable upon another, for
example the effect of a rainfall increase upon river stage. To explore
such issues, the hydrologist
assembles data on the underlying variables of interest and employs
regression to estimate the quantitative effect of the causal variables
upon the variable that they influence. The researcher also typically
assesses the “statistical significance” of the estimated relationships,
such as trends or models, that is, the degree of confidence that the
true relationship is close to the estimated relationship (Sykes, 1992).
Regression analysis is used to predict a continuous dependent
variable from a number of independent variables. If the dependent
variable is dichotomous, then logistic regression should be used.
Analyzing stream flow records can give significant insight into both
the past and the future characteristics of stream flows. Recording
and analyzing stream flow measurements therefore play an important
role in the planning, design and management of water resources
(Cohen, 1995).
Simple linear regression is the least-squares estimator of a linear
regression model with a single explanatory variable. It fits a
straight line through a set of n points in such a way that the sum of
squared residuals is as small as possible. The slope of the fitted
line equals the correlation between y and x multiplied by the ratio of
the standard deviations of these variables, $s_y / s_x$. The intercept
is such that the line passes through the centre of mass
$(\bar{x}, \bar{y})$ of the data points. Multiple regression analysis
is a technique used for predicting the unknown value of a variable
from the known values of two or more variables, also called the
predictors. More precisely, multiple regression analysis helps us to
predict the value of Y for given values of $X_1, X_2, \ldots, X_k$
(Salas et al., 2006). Regression analysis, including simple and
multiple regression, is one of the most significant and frequently
used methods in stream flow forecasting. Statistical models are used
for rainfall-runoff modeling. They generally require a data set of
past observations sufficiently large to allow the system to be
adequately parameterized (Morales et al., 2006). Such statistical
models include the autoregressive linear model, the multiple linear
regression model and the moving average method, among others.
Multiple linear regression establishes a quantitative relationship
between a group of predictor variables and an observed response.
Consider the linear regression model with a single independent
variable in equation (1).
The linear model has the form
$$y = X\alpha + \varepsilon \qquad (1)$$
where X refers to the regressor variable, $\alpha$ is the vector of
parameters or coefficients, $\varepsilon$ is the random disturbance
and y is the dependent observation. To solve for $\alpha$ using the
least-squares estimate, equation (2) can be used:
$$\alpha = (X^{T} X)^{-1} X^{T} y \qquad (2)$$
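As a minimal sketch of equation (2), the following Python/NumPy snippet estimates the coefficient vector for a single regressor plus an intercept. The rainfall and runoff numbers are hypothetical values chosen only for illustration, not data from this study, and the function name `ols_normal_equations` is likewise an assumption. The sketch also checks numerically the two properties of simple linear regression stated earlier: the slope equals the correlation times the ratio of standard deviations, and the fitted line passes through the centre of mass of the data.

```python
import numpy as np


def ols_normal_equations(X, y):
    """Least-squares estimate of alpha in y = X alpha + eps (equation 2)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Solve (X^T X) alpha = X^T y instead of forming the inverse explicitly;
    # this is algebraically the same as equation (2) but numerically safer.
    return np.linalg.solve(X.T @ X, X.T @ y)


# Hypothetical rainfall and runoff values, for illustration only.
rainfall = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
runoff = np.array([3.0, 7.5, 10.5, 15.0, 18.0])

# Design matrix: a column of ones (intercept) and the single regressor.
X = np.column_stack([np.ones_like(rainfall), rainfall])
intercept, slope = ols_normal_equations(X, runoff)

# Slope expressed as correlation times the ratio of standard deviations.
r = np.corrcoef(rainfall, runoff)[0, 1]
slope_from_r = r * runoff.std(ddof=1) / rainfall.std(ddof=1)

# Value of the fitted line at the mean rainfall; should equal mean runoff.
fitted_at_mean = intercept + slope * rainfall.mean()

print(f"intercept = {intercept:.4f}, slope = {slope:.4f}")
print(f"slope from r * s_y / s_x = {slope_from_r:.4f}")
```

For more than one predictor, further columns can be appended to `X`; the same function then returns one coefficient per column. With nearly collinear predictors, `np.linalg.lstsq` is usually a more robust choice than solving the normal equations directly.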
If y is a function of more than one independent variable, the matrix
equations that express the relationships among the variables can be