Supplementary Information
This supplementary material describes the ab initio programs and BLAST programs used in developing EGPred. The results obtained by combining similarity information from different sources with ab initio predictions are also given below.
Databases selected:
Sequence comparison against protein databases requires a clean, comprehensive, representative and non-redundant set of protein sequences. The Non-Redundant (NR) and RefSeq databases from NCBI and the SWISS-PROT protein database are three of the most useful databases for the present study. At present, the NR database contains 1,609,203 protein and peptide sequences, SWISS-PROT contains approximately 130 thousand protein sequences, and RefSeq contains more than 61 thousand protein sequences. Of SWISS-PROT and NR, SWISS-PROT is more stringent about which sequences are included and how they are subsequently curated, so for experimental purposes it is the more reliable of the two. RefSeq, on the other hand, contains sequences from Homo sapiens, Drosophila melanogaster, Mus musculus, Rattus norvegicus, Danio rerio and Saccharomyces cerevisiae. It provides a comprehensive, integrated, non-redundant set of sequences, including genomic DNA, transcript (RNA), and protein products, for major research organisms, and offers a stable reference for gene identification, characterization and comparative analyses. For the present study, the SWISS-PROT and RefSeq protein databases were chosen for sequence comparison.
The presence in the databases of a protein product of any of the 195 HMR195 sequences would give the similarity method an advantage over ab initio methods. To remove any such advantage, all translated products coded by any of the HMR195 sequences were removed from the databases. For example, the complete list of the 194 protein sequences that were coded by HMR195 dataset sequences and removed from the RefSeq protein database is given here.
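The removal step described above can be sketched as a simple FASTA filter. This is a minimal illustration, not the procedure actually used for EGPred: the function names `read_fasta` and `remove_excluded` are hypothetical, and it assumes that each record's identifier is the first whitespace-delimited token of its header line and that the identifiers to exclude are already known.

```python
from typing import Iterable, Iterator, Set, Tuple

Record = Tuple[str, str]  # (header without '>', sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def remove_excluded(records: Iterable[Record], excluded_ids: Set[str]) -> Iterator[Record]:
    """Drop records whose identifier (assumed: first header token) is excluded."""
    for header, seq in records:
        tokens = header.split()
        identifier = tokens[0] if tokens else ""
        if identifier not in excluded_ids:
            yield header, seq


if __name__ == "__main__":
    # Toy database; the identifiers are placeholders, not real accessions.
    toy_db = [">P1 toy protein", "MKT", "AAA", ">P2", "MVV", ">P3 another", "MLL"]
    for header, seq in remove_excluded(read_fasta(toy_db), {"P2"}):
        print(f">{header}\n{seq}")
```

Filtering on identifiers rather than on sequence identity is the simpler choice; it would miss identical proteins stored under a different identifier, which a sequence-based comparison could catch.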
Primarily, the predictions from the ab initio programs are used as a template on which 'corrections' are made, based on the level of confidence with which predictions are obtained from the similarity-based approach. Two BLASTX parameters are considered valuable for determining the confidence of predictions: the Expectation value (E-value) and the percent identity (PID) of the alignment between the query and the database sequence. Rules were derived to determine the threshold for each of these two parameters on four different exon types: initial exons, internal exons, terminal exons and single exons. The steps for deriving these rules are as follows.
Predictions for the HMR195 dataset were obtained from each of the two approaches (ab initio and similarity-based).
The predicted exons from the ab initio (ab) programs and the similarity-based (sim) approach were divided into four groups according to the predicted exon type: initial exons (I), internal exons (R), terminal exons (T) and single exons (S).
In each group, for every ab initio predicted exon, the exon type of the corresponding prediction from the similarity-based approach was determined, and the pair was assigned to one of sixteen sub-groups: Iab/Isim, Iab/Rsim, Iab/Tsim, Iab/Ssim, Rab/Isim, Rab/Rsim, Rab/Tsim, Rab/Ssim, Tab/Isim, Tab/Rsim, Tab/Tsim, Tab/Ssim, Sab/Isim, Sab/Rsim, Sab/Tsim, Sab/Ssim.
In each sub-group, thresholds for E-value and PID were derived separately.
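The sub-group bookkeeping in the steps above can be sketched as follows. This is an illustrative sketch only: the names `subgroup_label`, `Thresholds` and `passes` are hypothetical, the threshold values in the usage example are placeholders rather than the values derived for EGPred, and requiring both criteria to hold at once is an assumption, since the text only says the two thresholds were derived separately.

```python
from dataclasses import dataclass
from itertools import product

# Exon types as abbreviated in the text: initial, internal, terminal, single.
EXON_TYPES = ("I", "R", "T", "S")


def subgroup_label(ab_type: str, sim_type: str) -> str:
    """Build a label such as 'Iab/Rsim' for an (ab initio, similarity) pair."""
    if ab_type not in EXON_TYPES or sim_type not in EXON_TYPES:
        raise ValueError(f"unknown exon type: {ab_type!r}, {sim_type!r}")
    return f"{ab_type}ab/{sim_type}sim"


# All sixteen sub-groups, in the same order as listed in the text.
ALL_SUBGROUPS = [subgroup_label(a, s) for a, s in product(EXON_TYPES, repeat=2)]


@dataclass(frozen=True)
class Thresholds:
    max_evalue: float  # hit is accepted if its E-value is at most this
    min_pid: float     # hit is accepted if its percent identity is at least this


def passes(evalue: float, pid: float, t: Thresholds) -> bool:
    """Assumed rule: a hit must satisfy both thresholds to be trusted."""
    return evalue <= t.max_evalue and pid >= t.min_pid


if __name__ == "__main__":
    # Placeholder thresholds for one sub-group (not values from the study).
    table = {"Rab/Rsim": Thresholds(max_evalue=1e-5, min_pid=80.0)}
    label = subgroup_label("R", "R")
    print(label, passes(1e-10, 92.5, table[label]))
```

Keeping one `Thresholds` entry per label mirrors the per-sub-group derivation described in the text and makes it easy to look up the rule that applies to any pair of predictions.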
The following procedure was used to derive the thresholds for the NNSPLICE program. In each sub-group where the BLASTX-predicted exons are NOT of the (S) type, the effect of increasing the score from 0 to 1 on the accuracy of the combination method was studied, and a separate threshold was obtained for each sub-group. In most sub-groups, an optimal threshold of 0.5 was obtained for use in the combination step. The exception is the case where BLASTX predicts an (I) type while the ab initio program predicts an (S) type, for which the optimal NNSPLICE score was observed to be 0.9.
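A threshold scan of this kind can be sketched as a simple grid search. The sketch below is an assumption-laden illustration: `scan_thresholds` is a hypothetical helper, the step size of 0.1 is a guess (the text gives only the range 0 to 1), and `accuracy_at` stands in for whatever accuracy measure of the combination method was evaluated at each score cutoff.

```python
from typing import Callable, Tuple


def scan_thresholds(accuracy_at: Callable[[float], float], steps: int = 10) -> Tuple[float, float]:
    """Evaluate accuracy at evenly spaced cutoffs in [0, 1] and return the best.

    Ties are resolved in favour of the lowest cutoff.
    """
    if steps < 1:
        raise ValueError("steps must be at least 1")
    best_cutoff, best_accuracy = 0.0, float("-inf")
    for i in range(steps + 1):
        cutoff = i / steps
        accuracy = accuracy_at(cutoff)
        if accuracy > best_accuracy:
            best_cutoff, best_accuracy = cutoff, accuracy
    return best_cutoff, best_accuracy


if __name__ == "__main__":
    # Toy accuracy curve that peaks at 0.5; real curves would come from
    # re-running the combination step on one sub-group at each cutoff.
    def toy_accuracy(cutoff: float) -> float:
        return 1.0 - abs(cutoff - 0.5)

    print(scan_thresholds(toy_accuracy))
```

Running the scan once per sub-group, each with its own accuracy function, would give one cutoff per sub-group, as described in the text.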
| DATABASE | # NO PREDICTIONS | NUCLEOTIDE LEVEL | | | | EXON LEVEL | | | | |
| | | SEN | SPE | AC | CC | CR | ME | WE | Esen | Espe |
| SWISS-PROT | 6 | 0.90 | 0.59 | 0.64 | 0.62 | 0.02 | 0.07 | 0.53 | 0.04 | 0.02 |
| RefSeq | 1 | 0.88 | 0.64 | 0.69 | 0.66 | 0.02 | 0.10 | 0.49 | 0.04 | 0.02 |
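For readers unfamiliar with the nucleotide-level columns, the sketch below computes SEN, SPE, AC and CC from per-nucleotide counts. It assumes that the abbreviations follow the definitions conventionally used in gene-prediction benchmarking (sensitivity, specificity in the sense of the fraction of predicted coding nucleotides that are correct, approximate correlation, and correlation coefficient); the source does not define them here, and the function name `nucleotide_metrics` is hypothetical.

```python
import math
from typing import Dict


def nucleotide_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Compute nucleotide-level accuracy measures from confusion counts.

    tp: coding nucleotides predicted as coding
    fp: non-coding nucleotides predicted as coding
    fn: coding nucleotides predicted as non-coding
    tn: non-coding nucleotides predicted as non-coding
    All four marginal totals are assumed to be non-zero.
    """
    sen = tp / (tp + fn)
    spe = tp / (tp + fp)  # conventional gene-prediction "specificity"
    cc = (tp * tn - fn * fp) / math.sqrt(
        (tp + fn) * (tn + fp) * (tp + fp) * (tn + fn)
    )
    # Approximate correlation: mean of four conditional probabilities,
    # rescaled to the range [-1, 1].
    acp = 0.25 * (tp / (tp + fn) + tp / (tp + fp) + tn / (tn + fp) + tn / (tn + fn))
    ac = 2.0 * (acp - 0.5)
    return {"SEN": sen, "SPE": spe, "AC": ac, "CC": cc}


if __name__ == "__main__":
    # Toy counts, not data from the study.
    print(nucleotide_metrics(tp=40, fp=10, fn=10, tn=40))
```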