Supplementary Information


EGPred Predictions on Human Chromosome 13

This supplementary material describes the ab initio programs and BLAST programs used in developing EGPred. It also reports the results obtained by combining similarity information from different sources with ab initio predictions.

  1. Programs included in the combination method:

    Ab initio programs selected:
    1. Genscan (Burge and Karlin, 1997)
    2. HMMgene (Krogh, 1997)
    3. Exon Union-Intersection (EUI) method (Rogic et al., 2002)
    4. Exon Union-Intersection with Reading Frame Consistency (EUI-Frame) method (Rogic et al., 2002)
    5. Gene Intersection (GI) method (Rogic et al., 2002)
    Similarity search program selected:
    Splice site program selected:
    Combination program incorporating similarity information selected:
  2. Databases & Datasets selected for demonstrating combination method:

    Datasets selected:

    1. The HMR195 dataset of 195 mammalian gene sequences (Rogic et al., 2001)
    2. The Burset/Guigo dataset of 570 vertebrate gene sequences (Burset and Guigo, 1996)


    Databases selected:
    Sequence comparison against protein databases requires a clean, comprehensive, representative and non-redundant set of protein sequences. The NCBI Non-Redundant (NR) and RefSeq databases and the SWISS-PROT protein database are three of the most useful databases for the present study. Presently, the NR database contains 1,609,203 protein and peptide sequences, SWISS-PROT contains approximately 130 thousand protein sequences, and RefSeq contains more than 61 thousand protein sequences. Of SWISS-PROT and NR, SWISS-PROT is more stringent about which sequences are included and how they are subsequently curated, so it is the more reliable of the two for experimental purposes. The RefSeq protein database, on the other hand, contains sequences from Homo sapiens, Drosophila melanogaster, Mus musculus, Rattus norvegicus, Danio rerio and Saccharomyces cerevisiae. It provides a comprehensive, integrated, non-redundant set of sequences, including genomic DNA, transcript (RNA) and protein products, for major research organisms, and so offers a stable reference for gene identification, characterization and comparative analyses. For the present study, the SWISS-PROT and RefSeq protein databases were chosen for sequence comparison.

    1. RefSeq Protein Database (Pruitt and Maglott, 2001)
    2. SWISS-PROT Protein Database (Boeckmann et al., 2001)
    3. Intron Sequence Database (Sakharkar et al., 2002)


    The presence in a database of the protein product of any of the 195 HMR195 sequences would give the similarity method an advantage over ab initio methods. To remove any such advantage, all translated products encoded by HMR195 sequences were removed from the databases. For example, the complete list of 194 protein sequences that were encoded by HMR195 dataset sequences and removed from the RefSeq protein database is given here.
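    The removal step described above can be sketched as a simple FASTA filter. This is only an illustration under stated assumptions: the function names, the use of the first header token as the record identifier, and the example identifiers are all hypothetical and are not taken from the EGPred pipeline.

```python
from typing import Iterable, Iterator, Set, Tuple

def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def drop_excluded(records: Iterable[Tuple[str, str]],
                  excluded_ids: Set[str]) -> Iterator[Tuple[str, str]]:
    """Keep only records whose identifier is not in excluded_ids.

    Assumption: the identifier is the first whitespace-separated
    token of the header line.
    """
    for header, seq in records:
        tokens = header.split()
        record_id = tokens[0] if tokens else ""
        if record_id not in excluded_ids:
            yield header, seq

# Made-up records; real identifiers would come from the removal list.
example = [">prot_A example protein", "MKT", "AAA",
           ">prot_B another protein", "MLL"]
kept = list(drop_excluded(read_fasta(example), {"prot_A"}))
print(kept)
```

    In practice the exclusion set would be loaded from the published list of removed sequences, and the filtered records would be written back out before building the BLAST database.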



  3. Rules for merging conflicting evidence from multiple sources:

    The predictions from the ab initio programs serve as a template, to which 'corrections' are made according to the confidence of the predictions obtained from the similarity-based approach. Two BLASTX parameters are considered valuable for determining that confidence: the Expectation value (E-value) and the percent identity (PID) of the aligned query to the database sequence. Rules were derived to set the threshold for each of these two parameters on four different exon types: initial exons, internal exons, terminal exons and single exons. The steps for deriving these rules are as follows.

    1. Predictions for the HMR195 dataset were obtained from each of the two approaches (ab initio and similarity-based).

    2. The predicted exons from the ab initio (ab) programs and the similarity-based (sim) approach were divided into four groups according to the predicted exon type: initial exons (I), internal exons (R), terminal exons (T) and single exons (S).

    3. Within each group, every ab initio predicted exon was paired with the exon type of the corresponding similarity-based prediction, giving sixteen sub-groups: Iab/Isim, Iab/Rsim, Iab/Tsim, Iab/Ssim, Rab/Isim, Rab/Rsim, Rab/Tsim, Rab/Ssim, Tab/Isim, Tab/Rsim, Tab/Tsim, Tab/Ssim, Sab/Isim, Sab/Rsim, Sab/Tsim, Sab/Ssim.

    4. In each sub-group, thresholds for E-value and PID were derived separately.

    5. Thresholds for the NNSPLICE program were derived as follows. In each sub-group where the BLASTX-predicted exons are NOT of the (S) type, the effect on the accuracy of the combination method of increasing the NNSPLICE score from 0 to 1 was studied, and a separate threshold was obtained for each sub-group. In most sub-groups the optimal threshold was 0.5. The exception is the case where BLASTX predicts an (I) type exon while the ab initio program predicts an (S) type exon, for which the optimal NNSPLICE score was 0.9.
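    The sub-group labelling and the NNSPLICE thresholds described in the steps above can be sketched as follows. The sketch makes several assumptions that are not stated in the text: the "corresponding" similarity-based exon is taken to be the one with the largest overlap, exon coordinates are inclusive, and 0.5 is applied to every sub-group other than Sab/Isim, although the text only says it holds for most of them. The E-value and PID thresholds are not given in the text and are therefore left out.

```python
from typing import List, NamedTuple, Optional

class Exon(NamedTuple):
    start: int   # inclusive
    end: int     # inclusive
    kind: str    # "I", "R", "T" or "S"

def overlap(a: Exon, b: Exon) -> int:
    """Number of shared positions between two exons."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)

def corresponding(ab: Exon, sims: List[Exon]) -> Optional[Exon]:
    """Similarity-based exon with the largest overlap (assumed pairing rule)."""
    best = max(sims, key=lambda s: overlap(ab, s), default=None)
    if best is None or overlap(ab, best) == 0:
        return None
    return best

def sub_group(ab: Exon, sim: Exon) -> str:
    """Label such as 'Sab/Isim' for an ab initio / similarity exon pair."""
    return f"{ab.kind}ab/{sim.kind}sim"

def nnsplice_threshold(group: str) -> Optional[float]:
    """NNSPLICE score threshold for a sub-group label.

    Returns None when the BLASTX exon is of the (S) type, because no
    threshold was derived for those sub-groups.
    """
    ab_kind = group[0]
    sim_kind = group.split("/")[1][0]
    if sim_kind == "S":
        return None
    if ab_kind == "S" and sim_kind == "I":
        return 0.9
    return 0.5   # assumed default; the text says "most sub-groups"

ab_exon = Exon(100, 200, "S")
sim_exons = [Exon(10, 50, "R"), Exon(150, 260, "I")]
match = corresponding(ab_exon, sim_exons)
label = sub_group(ab_exon, match)
print(label, nnsplice_threshold(label))
```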



  4. Results of using BLASTX against the SWISS-PROT and RefSeq databases:



    SEN: Sensitivity; SPE: Specificity; AC: Approximate Correlation;
    CC: Correlation coefficient; CR: proportion of correctly predicted exons; ME: proportion of missed exons;
    WE: proportion of wrongly predicted exons; Esen: Exon level sensitivity; Espe: Exon level specificity


                 # NO          NUCLEOTIDE LEVEL          EXON LEVEL
    DATABASE     PREDICTIONS   SEN   SPE   AC    CC      CR    ME    WE    Esen  Espe
    SWISS-PROT   6             0.90  0.59  0.64  0.62    0.02  0.07  0.53  0.04  0.02
    RefSeq       1             0.88  0.64  0.69  0.66    0.02  0.10  0.49  0.04  0.02
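
    For readers unfamiliar with the nucleotide-level measures (SEN, SPE, AC, CC), the sketch below shows how they are commonly computed. It assumes the definitions of Burset and Guigo (1996), in which specificity is TP/(TP+FP); the text above does not give the formulas, so that choice is an assumption. The function name and the set-based input format are hypothetical. As a simplification, AC is returned as NaN whenever any of its four terms is undefined.

```python
import math

def _ratio(num: float, den: float) -> float:
    """Safe division that returns NaN for a zero denominator."""
    return num / den if den else math.nan

def nucleotide_metrics(true_coding: set, pred_coding: set, length: int) -> dict:
    """Nucleotide-level accuracy from sets of coding positions."""
    tp = len(true_coding & pred_coding)   # coding, predicted coding
    fp = len(pred_coding - true_coding)   # non-coding, predicted coding
    fn = len(true_coding - pred_coding)   # coding, predicted non-coding
    tn = length - tp - fp - fn            # non-coding, predicted non-coding
    sen = _ratio(tp, tp + fn)
    spe = _ratio(tp, tp + fp)
    cc = _ratio(tp * tn - fn * fp,
                math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn)))
    # Average conditional probability, rescaled to the range [-1, 1].
    acp = (sen + spe + _ratio(tn, tn + fp) + _ratio(tn, tn + fn)) / 4
    ac = (acp - 0.5) * 2
    return {"SEN": sen, "SPE": spe, "AC": ac, "CC": cc}

# Toy example: a 100 nt sequence, 40 nt truly coding, 40 nt predicted coding.
truth = set(range(10, 50))
prediction = set(range(20, 60))
metrics = nucleotide_metrics(truth, prediction, 100)
print(metrics)
```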


  5. BLASTX performance on HMR195 dataset and Burset/Guigo dataset

    SEN: Sensitivity; SPE: Specificity; AC: Approximate Correlation; CC: Correlation coefficient;
    CR: number of correctly predicted exons; PC: number of partially correct exons; OL: number of overlapping exons; ME: number of missed exons; WE: number of wrong exons;
    ESEN: Exon level sensitivity; ESPE: Exon level specificity; EAVG: Average of Exon Sensitivity and Exon Specificity
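
    The exon-level counts listed above can be illustrated with the following sketch. The classification rules are assumptions, since the text does not define them: an exon is counted as correct when both boundaries match, partially correct when exactly one boundary matches an overlapping true exon, overlapping when it overlaps a true exon without matching a boundary, wrong when it overlaps no true exon, and missed when a true exon overlaps no prediction. The function name and the tuple-based input are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Interval = Tuple[int, int]   # (start, end), both inclusive

def _overlaps(a: Interval, b: Interval) -> bool:
    return a[0] <= b[1] and b[0] <= a[1]

def exon_metrics(true_exons: List[Interval],
                 pred_exons: List[Interval]) -> Dict[str, float]:
    """Exon-level counts and accuracy under the assumed definitions."""
    true_set = set(true_exons)
    counts = {"CR": 0, "PC": 0, "OL": 0, "WE": 0}
    for p in pred_exons:
        hits = [t for t in true_exons if _overlaps(p, t)]
        if p in true_set:
            counts["CR"] += 1          # both boundaries match
        elif not hits:
            counts["WE"] += 1          # no overlap with any true exon
        elif any(p[0] == t[0] or p[1] == t[1] for t in hits):
            counts["PC"] += 1          # one boundary matches
        else:
            counts["OL"] += 1          # overlap only
    missed = sum(1 for t in true_exons
                 if not any(_overlaps(t, p) for p in pred_exons))
    esen = counts["CR"] / len(true_exons) if true_exons else math.nan
    espe = counts["CR"] / len(pred_exons) if pred_exons else math.nan
    return {**counts, "ME": missed, "ESEN": esen, "ESPE": espe,
            "EAVG": (esen + espe) / 2}

# Toy coordinates, not taken from any dataset.
annotated = [(100, 200), (300, 400), (500, 600), (700, 800)]
predicted = [(100, 200), (300, 420), (510, 590), (900, 950)]
result = exon_metrics(annotated, predicted)
print(result)
```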

    BLASTX represents inclusion of information from a BLASTX search against the RefSeq database. BLASTN represents inclusion of information from a BLASTN search against the Intron Sequence Database. NNSPLICE represents inclusion of splice site information from the NNSPLICE program within +/- 50 bp of the exons obtained from similarity search against proteins.
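
    The +/- 50 bp window used for NNSPLICE information can be pictured with the sketch below. It is an assumption-laden illustration: the text states only that splice site information within 50 bp of a similarity-derived exon is included, so choosing the highest-scoring site in the window (with ties broken by distance) and the default score cut-off of 0.5 are choices made here for the example. The function name and input format are hypothetical.

```python
from typing import List, Optional, Tuple

Site = Tuple[int, float]   # (position, NNSPLICE score between 0 and 1)

def best_site_near(boundary: int, sites: List[Site],
                   window: int = 50,
                   min_score: float = 0.5) -> Optional[Site]:
    """Pick the best-scoring splice site within +/- window of a boundary.

    Returns None when no site in the window reaches min_score.
    """
    candidates = [(pos, score) for pos, score in sites
                  if abs(pos - boundary) <= window and score >= min_score]
    if not candidates:
        return None
    # Highest score first; for equal scores prefer the closer site.
    return max(candidates, key=lambda s: (s[1], -abs(s[0] - boundary)))

# Toy scores around an exon boundary at position 1000.
predicted_sites = [(940, 0.99), (980, 0.70), (1010, 0.95), (1049, 0.40)]
chosen = best_site_near(1000, predicted_sites)
print(chosen)
```

    The site at 940 is ignored because it lies 60 bp from the boundary, and the site at 1049 is ignored because its score is below the cut-off.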

    Table 1

  6. Performance of ab initio programs on HMR195 dataset on adding similarity information


    NNSPLICE information is also incorporated into the rows reporting results for ab initio programs combined with BLASTX and BLASTN.
    Table 2

  7. Performance of ab initio programs on Burset/Guigo dataset on adding similarity information


    NNSPLICE information is also incorporated into the rows reporting results for ab initio programs combined with BLASTX and BLASTN.
    Table 3