International Journal of Computer Applications (0975 – 8887) Volume 72 – No.4, June 2013

Evaluation of Best First Decision Tree on Categorical Soil Survey Data for Land Capability Classification

Nirmal Kumar
National Bureau of Soil Survey and Land Use Planning, Amravati Road, Nagpur, Maharashtra - 440033

G. P. Obi Reddy
National Bureau of Soil Survey and Land Use Planning, Amravati Road, Nagpur, Maharashtra - 440033

S Chatterji
National Bureau of Soil Survey and Land Use Planning, Amravati Road, Nagpur, Maharashtra - 440033

ABSTRACT
Land capability classification (LCC) of a soil map unit is sought for sustainable use, management and conservation practices. The high speed, high precision and simple rule generation of machine learning algorithms can be utilized to construct predefined rules for LCC of soil map units when developing decision support systems for land use planning of an area. Decision tree (DT) is one of the most popular classification algorithms in machine learning and data mining. The present study demonstrates the generation of a Best First Tree (BF Tree) for LCC from the qualitative reconnaissance soil survey data of Wardha district, Maharashtra, with soil depth, slope, and erosion as attributes. A 10-fold cross validation provided accuracy of 100%. The results indicated that BF Tree algorithms had good potential in automation of LCC of soil survey data, which, in turn, will help to develop decision support systems that suggest suitable land use systems and soil and water conservation practices.

General Terms
Data mining algorithms, Decision Tree

Keywords
Best First Decision Tree, Land Capability Classification, Information gain

1. INTRODUCTION
LCC, a qualitative system developed by the US Department of Agriculture [1], is the most widely used land classification system.
LCC provides information on the kind of soil, its location on the landscape, its extent, and its suitability for various uses, which is needed for conservation planning [2]. LCC includes eight classes, of which the first four are suitable for cropland; the limitations on their use and the need for conservation measures and careful management increase from I through IV. The remaining four classes, V through VIII, are unsuitable for cropland but may be used for pasture, range, woodland, grazing, wildlife, recreation, and esthetic purposes. Within the broad classes are subclasses, which signify special limitations such as (e) erosion, (w) excess wetness, (s) problems in the rooting zone, and (c) climatic limitations.

The task of LCC arises every time a soil surveyor identifies a map unit. A large and diversified dataset has already been generated through soil surveys. A predefined rule set learned from these data, for automatically assigning the LCC of soil units surveyed in the future, will be of great help in developing decision support systems for land use planning and in suggesting conservation and management practices. Machine learning and data mining techniques, which give computers the ability to learn from the inherent characteristics of data without being explicitly programmed [3,4], may be utilized for generating these rule sets.

DT is one of the most popular classification algorithms in machine learning and data mining [5-7]. In their simplest form, DT classifiers successively partition the input training data into more and more homogeneous subsets by producing optimal rules or decisions, also called nodes [7-9]. The rules or splitting criteria at these nodes are the key to successful decision tree creation. The most frequently used splitting criteria are the information gain, the information gain ratio [10], and the Gini index [11]. Some of the most popular DT methods are ID3, C4.5 [10, 12, 13], CART [11] and BF Tree [14].
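
The three splitting criteria named above can be illustrated with a minimal Python sketch. It is not the implementation used in this study or in WEKA; the function names and the four-row toy dataset are hypothetical and serve only to show how information gain, gain ratio, and Gini-based impurity reduction are computed for a categorical attribute. For simplicity each attribute is split on all of its values at once, which coincides with a binary split here because every toy attribute has exactly two values.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def gini(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def partition(values, labels):
    """Group class labels by the value of one categorical attribute."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    return list(groups.values())


def impurity_reduction(values, labels, impurity):
    """Parent impurity minus the size-weighted impurity of the child subsets."""
    total = len(labels)
    children = partition(values, labels)
    weighted = sum(len(child) / total * impurity(child) for child in children)
    return impurity(labels) - weighted


def information_gain(values, labels):
    """Entropy-based impurity reduction."""
    return impurity_reduction(values, labels, entropy)


def gain_ratio(values, labels):
    """Information gain divided by the entropy of the attribute values."""
    split_info = entropy(values)
    return information_gain(values, labels) / split_info if split_info > 0 else 0.0


# Hypothetical toy data, NOT taken from the Wardha soil survey.
slope = ["gentle", "gentle", "steep", "steep"]
depth = ["deep", "shallow", "deep", "shallow"]
lcc = ["II", "II", "IV", "IV"]

for name, attribute in (("slope", slope), ("depth", depth)):
    print(
        name,
        information_gain(attribute, lcc),
        gain_ratio(attribute, lcc),
        impurity_reduction(attribute, lcc, gini),
    )
```

In this toy case slope separates the two classes perfectly, so all three criteria rank it above depth, whose subsets are as mixed as the parent node.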
A detailed review of DT applications in agricultural and biological engineering may be found in [5] and [7]. Among applications of DT algorithms to soil survey data, [15] evaluated the ID3 DT for LCC on 12 simulated samples with soil depth, slope, and texture as attributes; however, the model was not validated. The ID3 DT algorithm was applied to soil survey data with an accuracy of 86.84% on 10-fold cross validation [16].

2. MATERIAL AND METHODS
2.1 Training Data Used
Considering slope, soil depth and erosion as the important attributes, the LCC of 38 soil series of Wardha district, Maharashtra, India, was assessed as per the procedure laid down in the Soil Survey Manual of the All India Soil and Land Use Survey Organization [17]. Waikato Environment for Knowledge Analysis (WEKA), an open source data mining tool, was used to generate BF Tree rules for comparison with the one generated manually.

2.2 Best First Tree Algorithm
In BF Tree learners the “best” node is expanded first, whereas standard DT learners such as C4.5 and CART expand nodes in depth-first order [14]. The “best” node is the node whose split leads to the maximum reduction of impurity (e.g. Gini index or information gain) among all nodes available for splitting. The resulting tree will be the same when fully grown; only the order in which it is built differs. BF Tree constructs binary trees, i.e., each internal node has exactly two outgoing edges. The tree growing method attempts to maximize within-node homogeneity. The extent to which a node does not represent a homogeneous subset of cases is an indication of impurity. For example, a terminal node in which all cases have the same value for the dependent variable is a homogeneous node that requires no further splitting because it is “pure.” The impurity measures for nominal dependent variables are the entropy-based definition of information gain and the Gini index. The measure used in this