AJCS 4(6):402-407 (2010)
ISSN: 1835-2707

Determining the most important features contributing to wheat grain yield using a supervised feature selection model

Ehsan Bijanzadeh, Yahya Emam, Esmaeil Ebrahimie*

Crop Production and Plant Breeding Department, Shiraz University, Shiraz, Iran

*Corresponding author: ebrahimie@shirazu.ac.ir

Abstract

A supervised feature selection algorithm was applied to determine the most important features contributing to wheat grain yield. Four hundred and seventy-two fields (as records) from different parts of Iran, differing in 21 characteristics (features), were selected for the feature selection analysis. The wide range of features, including location, genotype, irrigation regime, fertilizers, soil texture, physiological attributes, and morphological characters, provided the opportunity for a precise simultaneous study of a large number of factors in wheat grain yield by means of data mining. The grain yield of each record was taken as the target variable. The feature selection algorithm selected 14 features as the most effective on grain yield, including culture type, location, soil texture, 1000-kernel weight, nitrogen supply, irrigation regime, biological yield, organic content of the soil, the amount of rainfall, genotype, plant height, and spike number per unit area. Interestingly, growing season length and plant density were the second most important features for wheat grain yield. Based on the feature selection model, culture type, i.e. dryland or irrigated farming, severely affected wheat grain yield, whereas soil pH had only a marginal effect. The results of this investigation demonstrated that feature classification using feature selection algorithms may be a suitable option for determining the important features contributing to wheat grain yield, providing a comprehensive view of these traits.
This is the first report identifying the most important factors affecting wheat grain yield across many fields using a feature selection model.

Keywords: culture type; data mining; plant physiology; wheat genotype; wheat grain yield

Introduction

Data mining is the process of discovering previously unknown and potentially interesting patterns in large datasets (Piatetsky-Shapiro and Frawley, 1991). Nowadays, intelligent data mining and knowledge discovery by artificial neural networks, decision trees, and feature selection algorithms have become important revolutionary issues in prediction and modeling (Roddick et al., 2001; Elson et al., 2004; Schuize et al., 2005). The ‘mined’ information is typically represented as a model of the semantic structure of the dataset, where the model may be used on new data for prediction or classification (Liu and Motoda, 2008).

In data mining, feature selection tools are useful for identifying irrelevant attributes to be excluded from the dataset (Liu and Motoda, 2001). The main idea of feature selection is to choose a subset of all variables by eliminating the large number of features with little discriminative and predictive information (Blum and Langley, 1997; Beltrán et al., 2005). Usually, not all the features in a dataset are important; some are redundant and some are irrelevant. Data with several irrelevant features can mislead clustering results and make them hard to interpret (Liu and Motoda, 2001 and 2008). There are two ways to reduce dimensionality: feature transformation and feature selection. Feature transformation reduces the dimension by applying some type of linear or non-linear function to the original features, whereas feature selection selects a subset of the original features. One may wish to perform feature selection rather than transformation to keep the original meaning of the features. Furthermore, after feature selection, one does not need to measure the features that are not selected.
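The subset-choosing idea can be sketched in a few lines of Python. This is a hypothetical illustration only (the paper reports no code): a supervised score such as information gain ranks each candidate feature against the class labels, and selection then keeps the best-scoring original, interpretable columns. All records and feature names below are made up for the sketch.

```python
import math

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def information_gain(feature, labels):
    """IG(Y; X) = H(Y) - sum_v P(X=v) * H(Y | X=v)."""
    n = len(labels)
    groups = {}
    for x, y in zip(feature, labels):
        groups.setdefault(x, []).append(y)
    return entropy(labels) - sum(len(g) / n * entropy(g)
                                 for g in groups.values())

# Hypothetical toy records: culture type separates the yield classes,
# while soil pH carries no information about them.
yield_class = ["high", "high", "low", "low"]
features = {
    "culture_type": ["irrigated", "irrigated", "dryland", "dryland"],
    "soil_pH":      ["acid", "neutral", "acid", "neutral"],
}

scores = {name: information_gain(col, yield_class)
          for name, col in features.items()}

# Selection keeps the original columns with positive scores; a
# transformation (e.g. PCA) would instead mix all columns into new axes
# with no direct agronomic meaning.
selected = [name for name, s in sorted(scores.items(), key=lambda kv: -kv[1])
            if s > 0]
print(selected)  # ['culture_type']
```

Here the selected column still means "culture type", which is exactly the interpretability advantage of selection over transformation noted in the text.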
Feature transformation, on the other hand, still needs all the features to extract the reduced dimensions (Liu and Motoda, 2008). Recently, there has been great interest in employing feature selection algorithms to find the critical features involved in different phenomena, including enzyme thermostability (Ebrahimi et al., 2009) and pH tolerance (Ebrahimie et al., 2008). Feature selection allows the variable set to be reduced in size, creating a more manageable set of attributes for modeling (Blum and Langley, 1997). Adding feature selection to the analytical process has several benefits: it simplifies and narrows the scope of the features essential in building a predictive model, and it minimizes the computational time and memory requirements for building that model, because focus can be directed to the subset of predictors that is most essential. It also leads to more accurate and/or more parsimonious models (Dash and Liu, 1997; Liu and Motoda, 1998). Furthermore, it reduces the time for generating scores, since the predictive model is based upon only a subset of predictors.

Feature selection algorithms have two main components: feature search and feature subset evaluation, which consists of screening, ranking, and selecting (Liu and Motoda, 2008). There are two types of feature selection algorithms: supervised and unsupervised. Supervised feature selection algorithms rely on measures that take the class information into account. A well-known measure is information gain, which is widely used in both feature selection and decision tree induction (Dash and Liu, 1997). In essence,