Amino acid composition
Amino acid composition is the fraction of each amino acid in a
protein. The fraction of each of the 20 natural amino acids was
calculated as the number of occurrences of that amino acid divided
by the total number of amino acids in the protein.
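As a minimal sketch of this calculation, the fraction of each residue can be computed as below. The function name, the residue ordering, and the choice to ignore non-standard letters such as X are assumptions made for illustration; the source does not specify them.

```python
from collections import Counter

# The 20 natural amino acids in one-letter code (ordering is an assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return a 20-element list with the fraction of each amino acid."""
    counts = Counter(sequence.upper())
    # Assumption: residues outside the 20 standard letters are ignored.
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]
```

The returned fractions sum to 1, so every protein is mapped to a vector of the same length regardless of its sequence length.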
Composition of physico-chemical properties
Thirty-three physico-chemical properties were used to represent
each protein. The values of each physico-chemical property for the
20 amino acids were normalized between 0 and 1 using the standard
conversion formula. The input vector therefore has 33 scalar
values, each representing the average value of a distinct
physico-chemical property over the protein.
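The sketch below illustrates one property of this kind. It assumes that the "standard conversion formula" is min-max scaling, which the source does not state explicitly, and it uses the Kyte-Doolittle hydropathy scale purely as an example; the source does not list which 33 properties were used.

```python
# Kyte-Doolittle hydropathy values, used only as an illustrative property.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def min_max_normalize(scale):
    """Rescale property values to the range 0..1 (assumed normalization)."""
    lo, hi = min(scale.values()), max(scale.values())
    return {aa: (value - lo) / (hi - lo) for aa, value in scale.items()}


def average_property(sequence, normalized_scale):
    """Average a normalized property over the residues of a sequence."""
    values = [normalized_scale[aa] for aa in sequence.upper()
              if aa in normalized_scale]
    if not values:
        raise ValueError("sequence contains no standard amino acids")
    return sum(values) / len(values)
```

Repeating `average_property` for each of the 33 normalized scales would give the 33-dimensional input vector described above.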
Dipeptide composition
Dipeptide composition was used to encapsulate the global
information about each protein sequence, which gives a fixed
pattern length of 400. This representation encompasses the
information about amino acid composition along with the local
order of amino acids. The fraction of each dipeptide was
calculated as the number of occurrences of that dipeptide divided
by the total number of dipeptides in the protein.
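A minimal sketch of the dipeptide calculation follows. It assumes overlapping residue pairs are counted and that pairs containing a non-standard residue are skipped; both choices, as well as the names used, are illustrative rather than taken from the source.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 20 x 20 = 400 ordered residue pairs.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(sequence):
    """Return a 400-element list with the fraction of each dipeptide."""
    seq = sequence.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    # Slide a window of two residues along the sequence (overlapping pairs).
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # assumption: skip pairs with non-standard residues
            counts[pair] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard dipeptides")
    return [counts[dp] / total for dp in DIPEPTIDES]
```

For the toy sequence `"ACAC"` the pairs are AC, CA and AC, so AC receives a fraction of 2/3 and CA a fraction of 1/3.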
PSI-BLAST
A PSI-BLAST-based module was developed to predict the subcellular
localization of proteins, in which a query sequence was searched
against a database of proteins using PSI-BLAST. The database
consists of 1302 sequences belonging to 5 major subcellular
locations. PSI-BLAST was used instead of standard BLAST because
it can detect remote homologies (Altschul et al., 1990). It
carries out an iterative search in which the sequences found in
one round of searching are used to build a scoring model for the
next round. Three iterations of PSI-BLAST were carried out at a
cut-off E-value of 0.001. This module could predict any of the
five localizations (cytoplasmic, inner-membrane, periplasmic,
outer-membrane, and extracellular) depending on the similarity of
the query protein to the proteins in the database. The module
would return “unknown subcellular localization” if no significant
similarity was obtained.
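The decision rule of such a module might be sketched as follows. This is an assumption-laden illustration: the source does not say which PSI-BLAST executable or options were used, so the command below uses the BLAST+ `psiblast` program with tabular output (`-outfmt 6`), and the rule "take the location of the hit with the lowest E-value" is a guess at how similarity was turned into a prediction. The names `psiblast_command`, `assign_location` and `location_of` are hypothetical.

```python
UNKNOWN = "unknown subcellular localization"


def psiblast_command(query_fasta, database, iterations=3, evalue=0.001):
    """Build a BLAST+ psiblast argument list (suitable for subprocess.run)."""
    return ["psiblast", "-query", query_fasta, "-db", database,
            "-num_iterations", str(iterations),
            "-evalue", str(evalue), "-outfmt", "6"]


def assign_location(tabular_output, location_of, evalue_cutoff=0.001):
    """Pick the location of the most significant database hit.

    tabular_output: text in BLAST tabular format (12 tab-separated columns;
                    column 2 is the subject id, column 11 the E-value).
    location_of:    dict mapping database sequence ids to their locations.
    """
    best = None
    for line in tabular_output.splitlines():
        fields = line.split("\t")
        if len(fields) < 12:
            continue  # skip blank lines and iteration messages
        subject, evalue = fields[1], float(fields[10])
        if evalue > evalue_cutoff or subject not in location_of:
            continue
        if best is None or evalue < best[0]:
            best = (evalue, subject)
    return location_of[best[1]] if best else UNKNOWN
```

When the query itself is part of the database, as in cross-validation, its own entry would have to be left out of `location_of` so that a self-hit cannot decide the prediction.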
Hybrid SVM module
Recently, our group (Bhasin and Raghava, 2001) introduced the
concept of a hybrid SVM module for the prediction of subcellular
localization of eukaryotic proteins and achieved remarkable
success. In the present study, the same approach was used to
construct a hybrid SVM module. The hybrid SVM module combines
several kinds of information about a protein: amino acid
composition, composition of physico-chemical properties,
dipeptide composition, and PSI-BLAST output. The SVM was provided
with an input vector of 459 dimensions, consisting of 20 for
amino acid composition, 33 for physico-chemical properties, 400
for dipeptide composition, and six for PSI-BLAST output.
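The assembly of the 459-dimensional vector can be sketched as a simple concatenation. The encoding of the PSI-BLAST output is an assumption: the source only says it occupies six values, and the sketch interprets these as a one-hot vector over the five locations plus "unknown". All names are illustrative.

```python
LOCATIONS = ("cytoplasmic", "inner-membrane", "periplasmic",
             "outer-membrane", "extracellular")
# Assumption: the six PSI-BLAST values are the five locations plus "unknown".
PSIBLAST_LABELS = LOCATIONS + ("unknown",)


def encode_psiblast(label):
    """One-hot encode the PSI-BLAST prediction as six values (assumed)."""
    if label not in PSIBLAST_LABELS:
        raise ValueError(f"unexpected PSI-BLAST label: {label!r}")
    return [1.0 if label == name else 0.0 for name in PSIBLAST_LABELS]


def hybrid_vector(aa_comp, prop_comp, dipep_comp, psiblast_label):
    """Concatenate the four feature groups into one 459-dimensional vector."""
    vector = []
    for values, expected in ((aa_comp, 20), (prop_comp, 33), (dipep_comp, 400)):
        if len(values) != expected:
            raise ValueError(f"expected {expected} values, got {len(values)}")
        vector.extend(values)
    vector.extend(encode_psiblast(psiblast_label))
    return vector  # 20 + 33 + 400 + 6 = 459 values
```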
Evaluation of PSLpred
In the present study, 5-fold cross-validation was adopted to
evaluate the performance of the various SVM modules constructed.
In this technique, the data set was partitioned randomly into
five equally sized sets. Training and testing were carried out
five times, each time using one distinct set for testing and the
remaining four sets for training. To assess prediction
performance, accuracy and Matthews correlation coefficient (MCC)
were calculated as described by Hua and Sun (2001) using the
following equations:
$$\mathrm{Accuracy}(x) = \frac{p(x)}{\mathrm{exp}(x)}$$
$$\mathrm{MCC}(x) = \frac{p(x)\,n(x) - u(x)\,o(x)}{\sqrt{[p(x)+u(x)]\,[p(x)+o(x)]\,[n(x)+u(x)]\,[n(x)+o(x)]}}$$
where
x can be any subcellular location (cytoplasmic, inner-membrane,
periplasmic, outer-membrane, or extracellular),
exp(x) is the number of sequences observed in location x,
p(x) is the number of correctly predicted sequences of location x,
n(x) is the number of correctly predicted sequences not of
location x,
u(x) is the number of under-predicted sequences, and
o(x) is the number of over-predicted sequences.
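The evaluation procedure can be sketched in a few lines of Python. The sketch assumes that accuracy for a location is p(x)/exp(x) and that MCC takes the usual Matthews form in terms of p, n, u and o, following Hua and Sun (2001). The function names, the fixed random seed and the way the folds are formed are illustrative choices, not taken from the source.

```python
import math
import random


def five_fold_splits(n_items, seed=0):
    """Yield (train_indices, test_indices) for 5-fold cross-validation."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)          # random partition
    folds = [indices[k::5] for k in range(5)]     # five near-equal sets
    for k in range(5):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, folds[k]


def accuracy_and_mcc(observed, predicted, location):
    """Return (accuracy, MCC) for one subcellular location x."""
    p = n = u = o = 0
    for obs, pred in zip(observed, predicted):
        if obs == location and pred == location:
            p += 1   # correctly predicted as x
        elif obs != location and pred != location:
            n += 1   # correctly predicted as not x
        elif obs == location:
            u += 1   # under-predicted: x missed
        else:
            o += 1   # over-predicted: wrongly called x
    exp_x = p + u    # sequences observed in location x
    accuracy = p / exp_x if exp_x else float("nan")
    denom = math.sqrt((p + u) * (p + o) * (n + u) * (n + o))
    mcc = (p * n - u * o) / denom if denom else 0.0
    return accuracy, mcc
```

In use, each module would be trained on the `train` indices of a split, predictions would be collected for the `test` indices, and `accuracy_and_mcc` would then be called once per location on the pooled predictions.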
Reliability Index
The reliability index (RI) measures the level of certainty in the
prediction for a particular sequence, and so helps users judge
how much confidence to place in a prediction. The strategy used
for assigning the RI is similar to that used previously by our
group. The RI was assigned according to the difference between
the highest and second highest SVM output scores. The reliability
index for the hybrid approach based methods was calculated using
the following equation.
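The idea of deriving an RI from the gap between the two best SVM scores can be sketched as follows. Only the use of that gap comes from the source; the mapping from gap to index is not recoverable here, so the `thresholds` parameter and its default values are purely hypothetical, as is the assumption that the index runs from 1 to 5.

```python
def score_margin(svm_scores):
    """Difference between the highest and second highest SVM scores."""
    if len(svm_scores) < 2:
        raise ValueError("need at least two SVM scores")
    top, second = sorted(svm_scores, reverse=True)[:2]
    return top - second


def reliability_index(svm_scores, thresholds=(0.2, 0.4, 0.6, 0.8)):
    """Map the score margin to an index from 1 to 5.

    The threshold values are hypothetical placeholders; the source does
    not give the actual binning.
    """
    margin = score_margin(svm_scores)
    # Each threshold the margin reaches raises the index by one.
    return 1 + sum(margin >= t for t in thresholds)
```

A wide gap between the winning location and the runner-up gives a high index, while nearly tied scores give a low one.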
RESULTS: The detailed results obtained after 5-fold cross-validation for all the SVM modules developed in the present study are as follows:
- Amino acid composition
The SVM module based on the amino acid composition of a protein
achieved its best results with the RBF kernel (g=100, c=2, j=1).
The amino acid composition gives a 20-dimensional input vector
for each protein sequence; these vectors were used to train five
SVM models, one for each of the five subcellular localizations.
The composition-based SVM module achieved an overall accuracy of
86%.
| Subcellular localization | Accuracy (%) | MCC |
| --- | --- | --- |
| Cytoplasmic | 87.1 | 0.80 |
| Extracellular | 77.9 | 0.81 |
| Inner-membrane | 86.9 | 0.87 |
| Outer-membrane | 93.5 | 0.76 |
| Periplasmic | 79.9 | 0.83 |
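To make the kernel setting and the five-model scheme concrete, the sketch below shows an RBF kernel and a highest-score decision over per-location models. It assumes that g is the width parameter of the usual RBF kernel exp(-g * ||x - y||^2) and that the final prediction is the location whose model gives the highest score; the source does not name the SVM package or state the decision rule, so both are assumptions, and the function names are illustrative.

```python
import math


def rbf_kernel(x, y, g=100.0):
    """RBF kernel K(x, y) = exp(-g * ||x - y||^2) (g assumed to be gamma)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-g * squared_distance)


def predict_location(scores_by_location):
    """Return the location whose SVM model gave the highest score (assumed rule)."""
    return max(scores_by_location, key=scores_by_location.get)
```

Because composition vectors hold fractions between 0 and 1, distances between them are small, which is consistent with a large value such as g=100 being useful.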
- Composition of physico-chemical properties
The composition of physico-chemical properties gives an input
vector of 33 dimensions for each sequence. The properties-based
SVM module achieved an overall accuracy of 83%, about 3% lower
than the amino acid composition-based SVM module.
| Subcellular localization | Accuracy (%) | MCC |
| --- | --- | --- |
| Cytoplasmic | 83.5 | 0.77 |
| Extracellular | 75.8 | 0.77 |
| Inner-membrane | 85.8 | 0.83 |
| Outer-membrane | 87.8 | 0.82 |
| Periplasmic | 78.3 | 0.73 |
- Dipeptide composition
Dipeptide composition is considered a better feature than amino
acid composition because it encapsulates global as well as local
information about the sequence. To capture both the frequency and
the local order of residues in proteins, we also constructed an
SVM module based on dipeptide composition, which uses a
fixed-length input vector of 400 dimensions. The dipeptide
composition-based SVM module with the RBF kernel (g=300, C=2)
achieved an overall accuracy of 86%.
| Subcellular localization | Accuracy (%) | MCC |
| --- | --- | --- |
| Cytoplasmic | 87.1 | 0.78 |
| Extracellular | 73.7 | 0.79 |
| Inner-membrane | 85.8 | 0.89 |
| Outer-membrane | 93.8 | 0.88 |
| Periplasmic | 84.0 | 0.77 |
- PSI-BLAST
The performance of the PSI-BLAST-based module was also evaluated
through 5-fold cross-validation. This module performs worse than
the other modules developed in the present study: it predicted
the subcellular localization of the proteins with an overall
accuracy of 68%.
| Subcellular localization | Accuracy (%) |
| --- | --- |
| Cytoplasmic | 34.3 |
| Extracellular | 79.5 |
| Inner-membrane | 59.7 |
| Outer-membrane | 93.8 |
| Periplasmic | 65.5 |
- Hybrid approach
A hybrid module based on all the features of the proteins and the
output of PSI-BLAST was developed. This hybrid module (g=25, C=4)
achieved an overall accuracy of 91.2%, which is 5-8% higher than
the individual composition-based modules. This suggests that the
hybrid module encapsulates more information, which improves
prediction accuracy. These results indicate that predicting the
subcellular localization of proteins requires a wide range of
information about a protein.
| Subcellular localization | Accuracy (%) | MCC |
| --- | --- | --- |
| Cytoplasmic | 90.7 | 0.86 |
| Extracellular | 86.8 | 0.88 |
| Inner-membrane | 90.3 | 0.90 |
| Outer-membrane | 95.2 | 0.95 |
| Periplasmic | 90.6 | 0.89 |
- Reliability Index
To assess prediction reliability, RI assignment was carried out
for the hybrid module. As the RI curve shows, accuracies of 90%
and 98.1% were obtained at RI=4 and RI=5, respectively. It was
also observed that ~74% of the sequences have RI=5. Hence, the
present method can annotate the subcellular localization of
prokaryotic proteins more reliably.

Comparison with existing methods
The performance of the hybrid module developed in the present
study was compared with that of methods such as CELLO and
PSORT-B, which were also developed from the same data set. The
overall performance of the hybrid module is nearly 2% higher than
that of CELLO and 16% higher than that of PSORT-B. Hence, the
present method is more accurate for predicting the subcellular
localization of prokaryotic proteins.