English abstract of the article |
Variable selection is an important part of contemporary quantitative structure-activity relationship (QSAR) studies, where it is often used as a preprocessing step. Eliminating irrelevant and redundant information often improves the performance of learning algorithms and reduces the computational cost of model building and prediction. Which variable selection method works well in a QSAR study remains a challenging question. To address it, we examined the effects of 23 different variable selection methods on the performance of the SVM, SKN and PLS-DA techniques in classifying Bcl-2 and Bcl-xL selective inhibitors. These methods include Variable importance in projection, Feature selection via concave minimization, ReliefF [1], B2 and B4 algorithms, Non-iterative B2, Particle swarm optimization, Ant colony optimization, Centrality eigenvector feature selection, Infinite feature selection, Distributions of mutual information, Feature selection with adaptive structure learning, Fisher, Infinite latent feature selection, SVM recursive feature elimination, Least absolute shrinkage and selection operator, Simple pairwise correlation method, K Inflation factor (KIF) index and KIF non-iteration index method, First eigenvector and First eigenvector iterative method, Variance inflation factor (VIF) and VIF iterative, Genetic algorithm, Simulated annealing, and Reshaped sequential replacement. The developed classification models were evaluated statistically using parameters derived from the confusion matrix. In addition, the predictive power of the models was assessed by 10-fold venetian blind cross-validation. A classification accuracy of more than 70% was achieved with all three machine learning techniques and with all variable selection methods in the evaluation series. No statistically significant difference was observed among the results.
This demonstrates that good statistical results from model building are not the only main factor in choosing the best variable selection method. Other considerations are also very important, such as the information obtained from the selected variables and their relationship to the structural features and biological activities of the data studied. |
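The evaluation scheme mentioned in the abstract can be illustrated with a minimal sketch. It assumes the usual meaning of venetian blind cross-validation (sample i goes to fold i mod k, so each fold takes every k-th sample) and assumes accuracy, sensitivity and specificity as the confusion-matrix parameters; the abstract does not say which parameters were used. The function names are hypothetical and are not taken from the paper.

```python
def venetian_blind_folds(n_samples, n_folds=10):
    """Return test-index lists; sample i is assigned to fold i % n_folds."""
    folds = [[] for _ in range(n_folds)]
    for i in range(n_samples):
        folds[i % n_folds].append(i)
    return folds


def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def classification_metrics(tp, fp, tn, fn):
    """Derive common summary parameters from the confusion-matrix counts."""
    total = tp + fp + tn + fn
    nan = float("nan")
    return {
        "accuracy": (tp + tn) / total if total else nan,
        "sensitivity": tp / (tp + fn) if (tp + fn) else nan,
        "specificity": tn / (tn + fp) if (tn + fp) else nan,
    }


if __name__ == "__main__":
    # Toy labels only; these are not data from the study.
    y_true = [1, 1, 0, 0, 1]
    y_pred = [1, 0, 0, 1, 1]
    print(venetian_blind_folds(25, 10)[0])
    print(classification_metrics(*confusion_counts(y_true, y_pred)))
```

In a full workflow, each fold in turn would be held out as the test set, the classifier trained on the remaining samples, and the confusion-matrix counts accumulated over all folds before the summary parameters are computed.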