The current dataset of mycobacterial proteins has been developed with each protein annotated with its subcellular localization. Out of 1365 proteins, those carrying the non-experimental qualifier "by similarity" were excluded, leaving 882 proteins. Among 13 different subcellular compartments, 4 major sites containing a reasonable number of samples have been selected.
|Subcellular Localization||Sample Number|
|4. Attached to the membrane by lipid anchor||60|
Support Vector Machine (SVM):
SVMlight has been used in the present study in classification mode. Several parameters may be tuned to get optimum results. Among the different inbuilt kernels, three have been used, namely linear, polynomial and RBF. Subcellular localization prediction is a multi-class problem. For a given protein feature, four SVM modules have been developed, each belonging to a specific subcellular localization. The nth SVM model learns from the nth class samples with positive labels and all other samples with negative labels. The prediction for an unknown sample is the class with the maximum of the four scores generated by the four models specific to the four subcellular compartments.
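The one-versus-rest scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the location names and the toy linear scorers are hypothetical stand-ins for trained SVMlight models, not part of TBpred itself.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical type: a trained binary model maps a feature vector to a real score.
Scorer = Callable[[Sequence[float]], float]


def one_vs_rest_labels(labels: Sequence[str], positive_class: str) -> List[int]:
    """Relabel a multi-class dataset for one binary model:
    +1 for samples of the chosen class, -1 for all other samples."""
    return [1 if label == positive_class else -1 for label in labels]


def predict_location(models: Dict[str, Scorer], features: Sequence[float]) -> str:
    """Score the sample with every per-location model and return the
    location whose model gives the maximum score."""
    scores = {location: score(features) for location, score in models.items()}
    return max(scores, key=scores.get)


def linear_scorer(weights: Sequence[float], bias: float) -> Scorer:
    """Toy stand-in for a trained SVM decision function (assumption)."""
    return lambda x: sum(w * v for w, v in zip(weights, x)) + bias


# Four toy models, one per (hypothetically named) location.
toy_models = {
    "location_1": linear_scorer([1.0, 0.0], -0.5),
    "location_2": linear_scorer([0.0, 1.0], -0.5),
    "location_3": linear_scorer([-1.0, 0.0], 0.0),
    "location_4": linear_scorer([0.0, -1.0], 0.0),
}
```

With these toy models, `predict_location(toy_models, [0.9, 0.1])` selects `"location_1"`, because that model produces the largest of the four scores.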
Evaluation of prediction performance of TBpred:
The performance of this method is evaluated by the 5-fold cross-validation technique. The whole dataset is partitioned into 5 sets in such a manner that no two proteins from different sets show sequence similarity greater than 36%. Training is done on four sets and the remaining one is used for testing. In order to test each and every protein, this process is carried out 5 times, each time using a distinct set for testing. The performance of the different SVM modules has been evaluated by calculating accuracy and the Matthews correlation coefficient (MCC) by the following equations:
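Under the usual per-location definitions, and using the symbols defined just below, these two measures take the following form. This is a reconstruction that assumes the standard accuracy and MCC formulas for these quantities.

```latex
\[
\mathrm{Accuracy}(x) = \frac{p(x)}{\mathrm{exp}(x)}
\]
\[
\mathrm{MCC}(x) =
\frac{p(x)\,n(x) - u(x)\,o(x)}
     {\sqrt{\bigl(p(x)+u(x)\bigr)\bigl(p(x)+o(x)\bigr)\bigl(n(x)+u(x)\bigr)\bigl(n(x)+o(x)\bigr)}}
\]
```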
where x can be any of the four subcellular locations, exp(x) is the number of sequences observed in location x, p(x) is the number of correctly predicted sequences of location x, n(x) is the number of correctly predicted sequences not of location x, u(x) is the number of under-predicted sequences and o(x) is the number of over-predicted sequences.
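As an illustration, the quantities defined above can be counted from lists of observed and predicted locations and turned into the two measures. The sketch assumes accuracy is p(x)/exp(x) with exp(x) = p(x) + u(x), and that MCC takes its standard form; the function names are hypothetical.

```python
import math
from typing import Sequence, Tuple


def per_location_counts(
    observed: Sequence[str], predicted: Sequence[str], location: str
) -> Tuple[int, int, int, int]:
    """Return (p, n, u, o) for one location:
    p = correctly predicted sequences of the location,
    n = correctly predicted sequences not of the location,
    u = under-predicted (of the location, predicted elsewhere),
    o = over-predicted (not of the location, predicted as it)."""
    p = n = u = o = 0
    for obs, pred in zip(observed, predicted):
        if obs == location and pred == location:
            p += 1
        elif obs != location and pred != location:
            n += 1
        elif obs == location:
            u += 1
        else:
            o += 1
    return p, n, u, o


def accuracy(p: int, u: int) -> float:
    """Assumed form: p(x) / exp(x), where exp(x) = p(x) + u(x)."""
    observed_total = p + u
    return p / observed_total if observed_total else 0.0


def mcc(p: int, n: int, u: int, o: int) -> float:
    """Standard Matthews correlation coefficient; returns 0.0 when the
    denominator is zero (a common convention, assumed here)."""
    denominator = math.sqrt((p + u) * (p + o) * (n + u) * (n + o))
    return (p * n - u * o) / denominator if denominator else 0.0
```

For example, `per_location_counts(["A", "A", "B", "B", "C"], ["A", "B", "B", "B", "A"], "A")` gives one correct hit, two correct rejections, one under-prediction and one over-prediction for location `"A"`.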
Various Prediction Approaches:
In this study mainly three approaches have been studied, based on different features of proteins.
|Attached to membrane by a lipid anchor||55.00||0.58|
|Attached to membrane by a lipid anchor||50.00||0.57|
|From the PSSM obtained for each protein sequence, an SVM input pattern has been made. The input vector contains 400 dimensions. The overall accuracy achieved by this SVM module (kernel: RBF, g=2, c=50, j=1) was 86.62%.|
|Attached to membrane by a lipid anchor||68.33||0.69|
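The text does not say how the 400-dimensional PSSM input vector is built. One common construction, assumed here purely for illustration, sums the 20 PSSM scores of every position occupied by the same residue type, giving a 20 x 20 matrix that is flattened to 400 values. The function name, the alphabetical residue order and the absence of any score scaling are all assumptions.

```python
from typing import List, Sequence

# Assumed residue order (alphabetical one-letter codes); a real PSSM file
# may list its columns in a different order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def pssm_to_400(sequence: str, pssm: Sequence[Sequence[float]]) -> List[float]:
    """Collapse an L x 20 PSSM into a 400-dimensional vector.

    Row i of the intermediate 20 x 20 matrix holds the column-wise sum of
    all PSSM rows whose sequence position carries residue AMINO_ACIDS[i].
    """
    if len(sequence) != len(pssm):
        raise ValueError("sequence and PSSM must have the same length")
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    matrix = [[0.0] * 20 for _ in range(20)]
    for residue, row in zip(sequence, pssm):
        i = index.get(residue)
        if i is None:
            continue  # skip non-standard residues such as X
        if len(row) != 20:
            raise ValueError("each PSSM row must contain 20 scores")
        for j, score in enumerate(row):
            matrix[i][j] += score
    # Flatten row by row into a single 400-dimensional feature vector.
    return [value for row in matrix for value in row]
```

A vector built this way has a fixed length regardless of protein length, which is what allows it to be fed to an SVM.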