ORIGINAL ARTICLE

Free alignment classification of dikarya fungi using some machine learning methods

Abbas Rohani 1 · Mojtaba Mamarabadi 2

Received: 13 July 2017 / Accepted: 11 May 2018
© The Natural Computing Applications Forum 2018

Abstract
Gene clustering based on amino acid sequence similarity has long been one of the most important and challenging problems in molecular biology. The most conventional methods are alignment-based. These methods cannot identify and classify sequences well, especially when the sequences are long and of unequal length. Therefore, to classify fungal hexosaminidase amino acid sequences and assign them to the correct taxonomic group, we evaluate the feasibility of alignment-free computational methods based on machine learning classifiers such as SVM, KNN, SOM and an ensemble technique. The classifiers appropriately categorized a large data set of Dikarya hexosaminidase amino acid sequences according to their taxonomic groups in two phyla, the "Ascomycota" and the "Basidiomycota". Two statistical methods, the paired t test and PCA, were used for feature selection and for reducing the dimensionality of the features, respectively. Seven classifier performance metrics, a randomized complete block design, pairwise Tukey's honestly significant difference tests and the technique for order preference by similarity to ideal solution, combined with modified k-fold cross validation, were used to evaluate and rank the classifiers. The effect of training data size on classifier performance was also investigated. The results showed that the rank and the performance of the classifiers depended on the training data size. The highest average overall accuracies for training data sizes of 80, 60, 40 and 20% were 96.96, 95.81, 94.47 and 92.47%, obtained with the KNN, KNN, ensemble and ensemble classifiers, respectively.
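The alignment-free idea summarized in the abstract can be illustrated with a minimal sketch. It assumes amino acid composition as the feature (the abstract does not state which features the authors extract) and uses a plain k-nearest-neighbor vote. The function names `composition` and `knn_predict`, the value `k=3`, and the toy sequences are all hypothetical; the sequences are made up and are not real hexosaminidase data. The point is only that sequences of unequal length map to vectors of the same fixed length, so no alignment is needed before classification.

```python
from collections import Counter
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition(seq):
    """Map a sequence of any length to a 20-dimensional frequency vector."""
    counts = Counter(ch for ch in seq.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[a] / total for a in AMINO_ACIDS]


def knn_predict(train, query_seq, k=3):
    """Majority vote among the k training sequences nearest in composition space."""
    q = composition(query_seq)
    # Euclidean distance between fixed-length vectors; no alignment involved.
    ranked = sorted(train, key=lambda item: math.dist(composition(item[0]), q))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Made-up toy sequences (assumption, NOT real data), unequal lengths on purpose.
train = [
    ("AAAAGAAALAAAV", "Ascomycota"),
    ("AAGAAAAAAAL", "Ascomycota"),
    ("AAAVAAAAGAAAAAAAA", "Ascomycota"),
    ("KKKKEKKKRKK", "Basidiomycota"),
    ("KKEKKKKKKKKRKKK", "Basidiomycota"),
    ("KKKKRKKKEKKKKKKKKKK", "Basidiomycota"),
]

print(knn_predict(train, "AAAAAGAAAAAAAALAAAAAA"))
print(knn_predict(train, "KKKRKKKKKKE"))
```

In practice the feature vector would be richer than raw composition, and steps such as the paired t test selection and PCA mentioned in the abstract would sit between `composition` and the classifier.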
Keywords Fungal hexosaminidase · Dikarya · Classification · Classifier

Abbreviations
ANN    Artificial neural network
ANOVA  Analysis of variance
ARB    Adaptive rule-based
AUC    Area under an ROC curve
DNA    Deoxyribonucleic acid
Ens    Ensemble classifier
FH     Fungal hexosaminidases
FN     Number of positive samples incorrectly classified as negative (false negatives)
FP     Number of negative samples incorrectly classified as positive (false positives)
HSD    Honestly significant difference
KNN    K-nearest neighbor
MCC    Matthews correlation coefficient
MCDM   Multi-criteria decision-making
MLP    Multilayer perceptron
NB     Naïve Bayes
PC     Principal component
PCA    Principal component analysis
PNN    Probabilistic neural network
Poly2  Polynomial degree 2
Poly3  Polynomial degree 3
PSO    Particle swarm optimization
RBF    Radial basis function
RCBD   Randomized complete block design
RF     Random forest
SOFM   Self-organizing feature map
SOM    Self-organizing map
SST    Total sum of squares
SSW    Within-groups sum of squares
SVM    Support vector machine
TDS    Training data size
TLCF   Two-layer classification framework
TN     Number of negative samples correctly classified (true negatives)

Correspondence: Abbas Rohani, arohani@um.ac.ir

1 Department of Biosystems Engineering, Faculty of Agriculture, Ferdowsi University of Mashhad, Mashhad, Iran
2 Department of Plant Protection, Faculty of Agriculture, Ferdowsi University of Mashhad, Mashhad, Iran

Neural Computing and Applications
https://doi.org/10.1007/s00521-018-3539-5