Supplementary information for EGPred

EGPred Predictions on Human Chromosome 13

INTRODUCTION

This supplementary material describes the ab initio programs and BLAST programs used to develop EGPred. It also reports the results obtained by combining similarity information from different sources with ab initio predictions.

Programs included for using combination method:

Ab initio programs selected:
1. Genscan (Burge and Karlin, 1997)
2. HMMgene (Krogh, 1997)
3. Exon Union-Intersection (EUI) method (Rogic et al., 2002)
4. Exon Union-Intersection with Reading Frame consistency (EUI-Frame) method (Rogic et al., 2002)
5. Gene Intersection (GI) method (Rogic et al., 2002)
Similarity search program selected:
- BLAST family of similarity search programs (Altschul et al., 1997)
Splice site program selected:
- The NNSPLICE splice site prediction program (Reese et al., 1997) was selected for predicting acceptor and donor sites
Combination program incorporating similarity information selected:
- GenomeScan (Yeh et al., 2001), an extension of the Genscan program (Burge and Karlin, 1997).
Databases & Datasets selected for demonstrating combination method:

Datasets selected:
1. The HMR195 dataset of 195 mammalian gene sequences (Rogic et al., 2001)
2. The Burset/Guigo dataset of 570 vertebrate gene sequences (Burset and Guigo, 1996)
Databases selected:
Sequence comparison against protein databases requires a clean, comprehensive, representative and non-redundant set of protein sequences. The Non-Redundant (NR) and RefSeq databases from NCBI and the SWISS-PROT protein database are three of the most useful databases for the present study. At present the NR database contains 1,609,203 protein and peptide sequences, SWISS-PROT contains approximately 130 thousand protein sequences, and the RefSeq database contains more than 61 thousand protein sequences. Of the SWISS-PROT and NR protein databases, SWISS-PROT is more stringent about which sequences are included and how they are subsequently curated, so it is the more reliable of the two for experimental purposes. The RefSeq protein database, on the other hand, contains sequences from Homo sapiens, Drosophila melanogaster, Mus musculus, Rattus norvegicus, Danio rerio and Saccharomyces cerevisiae. It provides a comprehensive, integrated, non-redundant set of sequences, including genomic DNA, transcript (RNA), and protein products, for major research organisms, and so provides a stable reference for gene identification, characterization and comparative analyses. For the present study, the SWISS-PROT and RefSeq protein databases were chosen for sequence comparison.
1. RefSeq Protein Database (Pruitt and Maglott, 2001)
2. SWISS-PROT Protein Database (Boeckmann et al., 2001)
3. Intron Sequence Database (Sakharkar et al., 2002)
If the protein product of any of the 195 HMR195 sequences were present in the databases, the similarity method would have an advantage over the ab initio methods. To remove any such advantage, all translated products encoded by any of the HMR195 sequences were removed from the databases. For example, the complete list of 194 protein sequences that are encoded by HMR195 dataset sequences and were removed from the RefSeq protein database is given here.
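The filtering step described above can be sketched as follows. This is only an illustration, not the procedure actually used for EGPred: the function names, the assumption that the database is available as FASTA text, and the assumption that the identifier is the first whitespace-delimited token of each header are all hypothetical.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def remove_known_products(records, excluded_ids):
    """Drop records whose identifier is in excluded_ids.

    Assumption: the identifier is the first token of the header line.
    """
    return [(h, s) for h, s in records if h.split()[0] not in excluded_ids]
```

Here `excluded_ids` would hold the identifiers of the proteins encoded by the HMR195 sequences; the remaining records form the database used for the similarity search.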
Rules for merging conflicting evidence from multiple sources:

The predictions from the ab initio programs are used as a template, on which 'corrections' are made according to the level of confidence of the predictions obtained from the similarity-based approach. Two BLASTX parameters are considered valuable for determining the confidence of a prediction: the expectation value (E-value) and the percent identity (PID) of the aligned query to the database sequence. Rules were derived to determine the threshold for each of these two parameters on four different exon types: initial exons, internal exons, terminal exons and single exons. The steps for deriving these rules are as follows.
1. Predictions for the HMR195 dataset were obtained from each of the two approaches (ab initio and similarity-based).
2. The predicted exons from the ab initio (ab) programs and the similarity-based (sim) approach were divided into four groups according to the predicted exon type: initial exons (I), internal exons (R), terminal exons (T) and single exons (S).
3. In each group, for every ab initio predicted exon, the exon type of the corresponding prediction from the similarity-based approach was selected, giving sixteen sub-groups: I_ab/I_sim, I_ab/R_sim, I_ab/T_sim, I_ab/S_sim, R_ab/I_sim, R_ab/R_sim, R_ab/T_sim, R_ab/S_sim, T_ab/I_sim, T_ab/R_sim, T_ab/T_sim, T_ab/S_sim, S_ab/I_sim, S_ab/R_sim, S_ab/T_sim, S_ab/S_sim.
4. In each sub-group, thresholds for E-value and PID were derived separately.
5. Thresholds for the NNSPLICE program were derived as follows. In each sub-group where the BLASTX-predicted exons are not of the (S) type, the effect of increasing the NNSPLICE score from 0 to 1 on the accuracy of the combination method was studied, and a separate threshold was obtained for each sub-group. In most sub-groups an optimal threshold of 0.5 was obtained for use in the combination step. The exception is the case where BLASTX predicts an (I) type while the ab initio program predicts an (S) type, for which the optimal NNSPLICE score was 0.9.
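The grouping in steps 2 to 5 can be sketched as below. This is a hedged illustration rather than the EGPred implementation. All function names are hypothetical. The per-sub-group E-value and PID thresholds are not stated in the text, so they are left as parameters. Treating 0.5 as the NNSPLICE threshold for every sub-group other than S_ab/I_sim is a simplifying assumption, since the text only says this holds for most sub-groups.

```python
from itertools import product

EXON_TYPES = ("I", "R", "T", "S")  # initial, internal, terminal, single


def subgroup(ab_type, sim_type):
    """Name the sub-group for one ab initio / similarity exon pair."""
    if ab_type not in EXON_TYPES or sim_type not in EXON_TYPES:
        raise ValueError("exon type must be one of I, R, T, S")
    return f"{ab_type}_ab/{sim_type}_sim"


# The sixteen sub-groups, in the same order as they are listed in step 3.
ALL_SUBGROUPS = [subgroup(a, s) for a, s in product(EXON_TYPES, repeat=2)]


def nnsplice_threshold(ab_type, sim_type):
    """NNSPLICE score threshold for a sub-group, or None if not applicable.

    Assumption: 0.5 is used wherever the text does not single out a value.
    """
    if sim_type == "S":
        return None  # step 5 only covers sub-groups where BLASTX is not (S)
    if ab_type == "S" and sim_type == "I":
        return 0.9   # the exception reported in step 5
    return 0.5


def passes_thresholds(evalue, pid, max_evalue, min_pid):
    """True if a BLASTX hit is confident enough to correct the template.

    max_evalue and min_pid are sub-group specific and must be supplied by
    the caller; no values are given in the text.
    """
    return evalue <= max_evalue and pid >= min_pid
```

A combination step would look up the sub-group of each exon pair and apply a similarity-based correction only when `passes_thresholds` holds for that sub-group's values.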
Results on using BLASTX on SWISS-PROT and RefSeq Database:

SEN: Sensitivity SPE: Specificity AC: Approximate Correlation
CC: Correlation coefficient CR: proportion of correctly predicted exons ME: proportion of missed exons
WE: proportion of wrongly predicted exons Esen: Exon level sensitivity Espe: Exon level specificity

		NUCLEOTIDE LEVEL				EXON LEVEL
DATABASE	# NO PREDICTIONS	SEN	SPE	AC	CC	CR	ME	WE	Esen	Espe
SWISS-PROT	6	0.90	0.59	0.64	0.62	0.02	0.07	0.53	0.04	0.02
RefSeq	1	0.88	0.64	0.69	0.66	0.02	0.10	0.49	0.04	0.02
BLASTX performance on HMR195 dataset and Burset/Guigo dataset

SEN: Sensitivity SPE: Specificity AC: Approximate Correlation CC: Correlation coefficient
CR: number of correctly predicted exons PC: number of partially correct exons OL: number of overlapping exons ME: number of missed exons WE: number of wrong exons
ESEN: Exon level sensitivity ESPE: Exon level specificity EAVG: Average of Exon Sensitivity and Exon Specificity
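The measures in the legend are not defined by formula here. As a sketch, the following assumes the standard gene-prediction definitions of Burset and Guigo (1996), in which specificity is TP / (TP + FP). The function names and the count-based inputs are hypothetical, and no guard is included for zero denominators.

```python
import math


def nucleotide_accuracy(tp, tn, fp, fn):
    """SEN, SPE, AC and CC from nucleotide-level counts.

    tp/fn: coding nucleotides predicted as coding / non-coding.
    tn/fp: non-coding nucleotides predicted as non-coding / coding.
    """
    sen = tp / (tp + fn)
    spe = tp / (tp + fp)
    cc = (tp * tn - fn * fp) / math.sqrt(
        (tp + fn) * (tn + fp) * (tp + fp) * (tn + fn)
    )
    # Average conditional probability, rescaled to the range [-1, 1].
    acp = 0.25 * (
        tp / (tp + fn) + tp / (tp + fp) + tn / (tn + fp) + tn / (tn + fn)
    )
    ac = (acp - 0.5) * 2
    return {"SEN": sen, "SPE": spe, "AC": ac, "CC": cc}


def exon_accuracy(correct, actual, predicted):
    """ESEN, ESPE and EAVG from exon counts.

    correct: predicted exons matching an actual exon exactly.
    actual / predicted: total numbers of actual and predicted exons.
    """
    esen = correct / actual
    espe = correct / predicted
    return {"ESEN": esen, "ESPE": espe, "EAVG": (esen + espe) / 2}
```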

BLASTX represents inclusion of information from the BLASTX search against the RefSeq database. BLASTN represents inclusion of information from the BLASTN search against the Intron Sequence Database. NNSPLICE represents inclusion of splice site information from the NNSPLICE program within +/- 50 bp of the exons obtained from the similarity search against proteins.
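The +/- 50 bp window can be illustrated with a small sketch. The text does not say how a splice site is chosen when several fall inside the window, so picking the highest-scoring site and breaking ties by distance is an assumption. The function name, the (position, score) input format and the default threshold of 0.5 are also hypothetical.

```python
def best_site_near(boundary, sites, window=50, min_score=0.5):
    """Pick a predicted splice site close to an exon boundary.

    boundary: exon boundary position from the similarity search.
    sites: iterable of (position, score) pairs, e.g. parsed NNSPLICE output.
    Returns the (position, score) pair with the highest score within
    `window` bp of the boundary and with score >= min_score, or None.
    Ties on score are broken in favour of the site nearest the boundary.
    """
    candidates = [
        (pos, score)
        for pos, score in sites
        if abs(pos - boundary) <= window and score >= min_score
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda c: (c[1], -abs(c[0] - boundary)))
```

When `None` is returned, the exon boundary from the similarity search would simply be left unchanged.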
Performance of ab initio programs on HMR195 dataset on adding similarity information

NNSPLICE information is also incorporated into the rows for results from ab initio programs combined with BLASTX and BLASTN.
Performance of ab initio programs on Burset/Guigo dataset on adding similarity information

NNSPLICE information is also incorporated into the rows for results from ab initio programs combined with BLASTX and BLASTN.
