English abstract of the article |
With the advent of big data from a wide variety of processes, variable selection (VS) has become an important research topic in academia and industry. Feature selection techniques are widely applied, but the use of abstract variables may discard some important information. Principal component analysis (PCA) is a linear modeling method commonly used to explore data sets. PCA decomposes the experimental data matrix D into two matrices: scores, which contain the information related to objects, and loadings, which contain the information related to variables (spectral information). Through feature reduction and visual display, PCA reveals the sources of variation in complex data sets. It is, however, possible to extract much more information from a PCA. The principal components (PCs) are called latent variables. The purest variables can be identified from the convex hull of the principal component scores. It has been shown that, by removing all other data points, the data set can be reduced to a very sparse set of essential data points. In this work, data point importance (DPI), a concept recently introduced by our group, is used to sort information and select variables in the data set. DPI assigns an easily calculable value to each row or column of the data matrix (data point) that reflects its impact on preserving the pattern of the data structure. Usually, many data points have DPI values equal or very close to zero, meaning that they carry no useful information for preserving the data pattern. For some data points the DPI values are significant, and these points have been sorted according to their importance. The applicability of DPI-based information sorting of objects and variables is tested here for the exploration and classification of a data set of diabetic and healthy people. 
The data set contains the concentrations of 163 lipids, which differentiate healthy people from diabetics, measured for 30 healthy individuals and 30 diabetics. DPI values were used to reveal the relative importance of the variables via a DPI plot (Figure 1a), and the DPI information sorting was also incorporated in the score plot (Figure 1b) for fast visualization of the relative impact of variables in classification applications. |
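
The decomposition into scores and loadings, and the convex hull of the scores, can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the DPI formula, so no DPI values are computed here. The data matrix is random placeholder data with the same shape as the lipid data set (60 objects by 163 variables), and the names `pca_scores_loadings` and `convex_hull_indices` are hypothetical helpers introduced for illustration only. PCA is computed by singular value decomposition of the mean-centered matrix, and the hull is found with Andrew's monotone chain algorithm in the plane of the first two PCs.

```python
import numpy as np


def pca_scores_loadings(D, n_components=2):
    """PCA by SVD of the mean-centered matrix D (objects x variables)."""
    Dc = D - D.mean(axis=0)                      # center each variable
    U, s, Vt = np.linalg.svd(Dc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]   # object information
    loadings = Vt[:n_components].T                     # variable information
    return scores, loadings


def convex_hull_indices(points):
    """Indices of the convex hull vertices of 2-D points (monotone chain)."""
    order = sorted(range(len(points)),
                   key=lambda i: (points[i][0], points[i][1]))

    def cross(o, a, b):
        # z-component of (a - o) x (b - o); positive for a left turn
        return ((points[a][0] - points[o][0]) * (points[b][1] - points[o][1])
                - (points[a][1] - points[o][1]) * (points[b][0] - points[o][0]))

    def half_hull(seq):
        chain = []
        for i in seq:
            while len(chain) >= 2 and cross(chain[-2], chain[-1], i) <= 0:
                chain.pop()
            chain.append(i)
        return chain

    lower = half_hull(order)
    upper = half_hull(order[::-1])
    return lower[:-1] + upper[:-1]   # drop duplicated end points


# Placeholder data only (assumption): 60 objects x 163 variables.
rng = np.random.default_rng(0)
D = rng.random((60, 163))

scores, loadings = pca_scores_loadings(D, n_components=2)
hull = convex_hull_indices(scores)   # objects on the hull of the score plot
print("objects on the convex hull of the scores:", sorted(hull))
```

Passing `loadings` instead of `scores` to `convex_hull_indices` gives the corresponding hull for the variables. How the hull relates to the DPI ranking is defined in the authors' work and is not reproduced here.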