چکیده انگلیسی مقاله |
The eukaryotic cells are organized into membrane-covered compartments that are characterized by specific sets of proteins and biochemically distinct cellular processes. Identifying the functions of proteins in various cellular organelles and pathways is one of the fundamental goals in proteomics, cell biology, and drug design research. Predicting appropriate protein subcellular localization can provide useful insights for revealing their functions and dysfunctions. The two principal obstacles of this problem are the prediction of multi-location proteins and deficiency of suitable knowledge of proper data modeling. Most of the existing methods were designed to treat the proteins as single-location. Whereas, recent experimental data report that location of the protein in a cell is actually a multi-label system, where some proteins may simultaneously occur in two or more different location regions [1, 2]. In this paper, we presented a robust classifier to predict the multiple-location of proteins. Our method is based on a graphical probabilistic model that combining two sources of information, the features derived from protein families, and protein-protein interactions. It seems that proteins in a subcellular location have interaction with each other, hence, the PPI network would be a beneficial resource for prediction of proteins locations. We benchmark our method using dataset taken from SUBA4 [3], that is a comprehensive data center for Arabidopsis thaliana subcellular proteins. The dataset contains 5331 proteins in 11 subcellular locations. We obtain protein sequences and PPI from [3]. Our algorithm includes the following three steps: (1) each protein sequence was aligned to each HMM profile of the top 20 largest families of Pfam [4], that results in a vector represented similarity indices to major Pfam Protein families, (2) a probabilistic model merged this family-derived features and protein-protein interaction network that relates probability of co-location of a pair of proteins to the features, (3) via maximum likelihood, the method predicts a set of overlapping cluster which is assigned to subcellular locations for each protein. Finally, we compare our algorithm with several recent location predictors [1,5] and results proved the superiority of algorithm in comparison with others. |