0016-7622/2018-92-3-305/$ 1.00 © GEOL. SOC. INDIA | DOI: 10.1007/s12594-018-1012-9
JOURNAL GEOLOGICAL SOCIETY OF INDIA
Vol.92, September 2018, pp.305-312

Rainfall-Runoff Modeling using Clustering and Regression Analysis for the River Brahmaputra Basin

Satanand Mishra a*, C. Saravanan b, V. K. Dwivedi c and J. P. Shukla d
a,d Water Resource Management Group, CSIR-Advanced Material Process & Research Institute, Bhopal - 462 064, India
b Computer Centre; c Department of Civil Engineering, National Institute of Technology, Durgapur - 713 209, India
*E-mail: snmishra07@gmail.com

ABSTRACT
In this research, k-means clustering, agglomerative hierarchical clustering (AHC) and regression analysis are applied to real hydrological time series to derive patterns and models for data analysis, pattern discovery and forecasting of catchment runoff. The study compares actual field data with predicted values and validates the statistical results obtained from cluster analysis and regression analysis with an ARIMA model. The seasonal autoregressive integrated moving average (SARIMA) and autoregressive integrated moving average (ARIMA) models are investigated for monthly runoff forecasting. Different parameters are analyzed to validate the results against causal effects. The model results obtained by k-means and AHC are very similar. The model results are compared with causal effects under the same scenario, and the developed model is found to be more suitable for runoff forecasting. The average value of $R^2$ for the eight ARIMA models is 0.92, which indicates the accuracy of the developed ARIMA model under these processes. The developed rainfall-runoff models are highly useful for water resources planning and development.

INTRODUCTION
Data mining is a process that uses a variety of data analysis tools to discover patterns and relationships in data that may be used to make valid predictions.
Data mining, also popularly referred to as knowledge discovery in databases (KDD), is defined as the discovery of comprehensible, important and previously unknown rules, or of anything useful and non-trivial or unexpected, from collected data (Piatetsky-Shapiro and Frawley, 1991). Data mining is a multi-disciplinary field, drawing on areas including database technology, machine learning, statistics, pattern recognition, information retrieval, neural networks, knowledge-based systems, artificial intelligence, high-performance computing, and data visualization (Han and Kamber, 2001). A large number of data mining techniques and tools are available for extracting trends, characteristics or rules from data. A selection of those relevant to hydrology is covered in this study. Clustering, classification, association rule extraction and dominant mode analysis techniques can be used in hydrological modeling. Temporal data mining is concerned with data mining of large sequential data sets (Laxman and Sastry, 2006).

Regression analysis is a statistical tool for the investigation of relationships among variables. It includes many techniques for modeling and analyzing several variables when the focus is on the relationship between a dependent variable and one or more independent variables. From the hydrological point of view, the researcher seeks to ascertain the causal effect of one variable upon another, for example the effect of an increase in rainfall upon river stage. To explore such issues, the hydrologist assembles data on the underlying variables of interest and employs regression to estimate the quantitative effect of the causal variables upon the variable that they influence. The researcher also typically assesses the "statistical significance" of the estimated relationships, such as trends or models, that is, the degree of confidence that the true relationship is close to the estimated relationship (Sykes, 1992).
Regression analysis is used to predict a continuous dependent variable from a number of independent variables. If the dependent variable is dichotomous, logistic regression should be used instead. Analyzing stream flow records can give significant insight into both the past and the future characteristics of stream flows. Therefore, recording and analyzing stream flow measurements play an important role in the planning, design and management of water resources (Cohen, 1995).

Simple linear regression is the least squares estimator of a linear regression model with a single explanatory variable. It fits a straight line through a set of n points in such a way that the sum of squared residuals is as small as possible. The slope of the fitted line is equal to the correlation between y and x multiplied by the ratio of the standard deviations of these variables, $s_y / s_x$. The intercept of the fitted line is such that the line passes through the centre of mass $(\bar{x}, \bar{y})$ of the data points.

Multiple regression analysis is a technique for predicting the unknown value of a variable from the known values of two or more other variables, also called the predictors. More precisely, multiple regression analysis helps us to predict the value of Y for given values of $X_1, X_2, \ldots, X_k$ (Salas et al., 2006). Regression analysis, including simple and multiple regression, is one of the most significant and frequently used methods in stream flow forecasting.

Statistical models are used for rainfall-runoff modeling. They generally require a data set of past observations large enough to allow the system to be adequately parameterized (Morales et al., 2006). Such statistical models include the autoregressive linear model, the multiple linear regression model and the moving average method, among others. Multiple linear regression establishes a quantitative relationship between a group of predictor variables and an observed response.
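The properties of simple and multiple linear regression described above can be checked numerically. The following is a minimal sketch in Python with NumPy, not the procedure used in this study: the function names `simple_linear_fit` and `multiple_linear_fit` are hypothetical, and the rainfall, stage and runoff values are invented purely for illustration.

```python
import numpy as np


def simple_linear_fit(x, y):
    """Least-squares line y = a + b*x through the points (x_i, y_i)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    r = np.corrcoef(x, y)[0, 1]                 # sample correlation of x and y
    slope = r * y.std(ddof=1) / x.std(ddof=1)   # correlation scaled by s_y / s_x
    intercept = y.mean() - slope * x.mean()     # line passes through (x_bar, y_bar)
    return intercept, slope


def multiple_linear_fit(X, y):
    """Least-squares coefficients [a0, a1, ..., ak] for y = a0 + a1*X1 + ... + ak*Xk."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(X)), X])   # prepend an intercept column
    coeffs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coeffs


# Hypothetical monthly values, invented for illustration (not data from the study).
rainfall = np.array([120.0, 210.0, 340.0, 410.0, 380.0, 260.0])  # mm
stage = np.array([3.1, 3.9, 5.2, 6.0, 5.6, 4.4])                 # m
runoff = np.array([1.5, 2.8, 4.9, 6.1, 5.5, 3.6])                # arbitrary units

# Simple regression: runoff on rainfall alone.
a, b = simple_linear_fit(rainfall, runoff)
residuals = runoff - (a + b * rainfall)

# Multiple regression: runoff on rainfall and stage together.
coeffs = multiple_linear_fit(np.column_stack([rainfall, stage]), runoff)
fitted_multi = coeffs[0] + coeffs[1] * rainfall + coeffs[2] * stage

# Predict runoff for a new (hypothetical) month with 300 mm rainfall and 4.8 m stage.
predicted = coeffs[0] + coeffs[1] * 300.0 + coeffs[2] * 4.8
```

In this sketch the slope is computed directly from the correlation and the two standard deviations, so it can be compared against any general-purpose least-squares routine. Adding a second predictor can only reduce, or leave unchanged, the sum of squared residuals on the fitting data, which is why in-sample fit alone is a weak criterion for choosing between the two models.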
Consider the linear regression model with a single independent variable in equation (1). The linear model has the form

$$y = X\alpha + \varepsilon \qquad (1)$$

where $X$ is the regressor variable, $\alpha$ is the vector of parameters (coefficients), $\varepsilon$ is the random disturbance and $y$ is the dependent observation. To solve for $\alpha$ using the least squares estimate, equation (2) can be used:

$$\alpha = (X^{T}X)^{-1}X^{T}y \qquad (2)$$

If y is a function of more than one independent variable, the matrix equations that express the relationships among the variables can be