Evaluation of The MHC Binding Peptide Prediction Methods

MHCBench
Evaluation of MHC Binding Peptide Prediction Algorithms

DATASETS

Selecting a well-defined, appropriately sized data set is the first step in developing a good algorithm. Without it, development and evaluation lead to misleading results. A number of MHC binding peptide prediction methods are available, but their performance has been tested either on a very small set (<100 peptides) or on a set that is not available. We have therefore assembled the following sets. The results of evaluation using these sets are available on this server, so users can download the sets, evaluate the performance of their own algorithms, and compare it with that of the already available algorithms.

MHC CLASS-II ALLELE
- HLA-DRB1*0401
  - SET-I [Download]
  - SET-II [Download]
  - SET-IIIa [Download]
  - SET-IIIb [Download]
  - SET-IVa [Download]
  - SET-IVb [Download]
  - SET-Va [Download]
  - SET-Vb [Download]

A BRIEF DESCRIPTION OF DATASETS

SET-I: Non-redundant peptides

SET-II: Natural peptides

SET-III: Non-homologous peptides

SET-IIIa: Selected from SET-I
SET-IIIb: Selected from SET-II

SET-IV: Balanced binders and non-binders

SET-IVa: Selected from SET-I
SET-IVb: Selected from SET-II

SET-V: Recent peptides

SET-Va: Selected from SET-I
SET-Vb: Selected from SET-II

The peptides were collected from MHCPEP (504 binders), MHCBN (556 binders and 83 non-binders) and the published literature: Marshal et al., 1995 (48 binders and 7 non-binders), Honeyman et al., 1999 (22 binders and 46 non-binders), Geluk et al., 1998 (49 binders and 71 non-binders), O'Sullivan et al., 1990 (70 non-binders), Texier et al., 2000 (39 binders and 10 non-binders), Harfouch-Hammoud et al., 1999 (7 binders and 42 non-binders), Borras-Cuesta et al., 2000 (8 binders and 27 non-binders), and Fridkis-Hareli et al., 2001 (12 binders and 23 non-binders). Duplicate entries were removed, and the remaining peptides were filtered further to produce the sets described below.

Set1 (Non-Redundant Peptides): From the original collection of peptides, we removed peptides containing undetermined amino acids and peptides shorter than nine residues. This set contains 1017 peptides that have been experimentally verified as binders or non-binders to HLA-DRB1*0401. This is the largest number of peptides used for evaluation of prediction methods for MHC class II binding peptides.
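The Set1 filtering criteria can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the script used to build the set: the function names are hypothetical, and "undetermined amino acids" is taken to mean any character outside the 20 standard one-letter codes.

```python
# Assumption: an "undetermined" residue is anything outside the
# 20 standard one-letter amino acid codes (e.g. X, B, Z).
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def is_valid_peptide(seq, min_length=9):
    """True if seq has at least min_length residues, all of them standard."""
    return len(seq) >= min_length and set(seq) <= STANDARD_AMINO_ACIDS


def filter_set1(peptides, min_length=9):
    """Drop duplicates, short peptides and peptides with undetermined residues.

    The order of first occurrence is preserved.
    """
    seen = set()
    kept = []
    for raw in peptides:
        seq = raw.strip().upper()
        if seq in seen:
            continue  # duplicate entry
        seen.add(seq)
        if is_valid_peptide(seq, min_length):
            kept.append(seq)
    return kept
```

For example, filter_set1(["ACDEFGHIK", "ACDEFGHIK", "ACDXFGHIKL", "ACDEF"]) keeps only "ACDEFGHIK": the second entry is a duplicate, the third contains an X, and the fourth is shorter than nine residues.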

Set2 (Natural Peptides): A number of experimental studies use poly A or poly G backbone peptides to deduce the MHC binding motif (Fridkis-Hareli et al., 2001). Such studies generate peptides that may not exist in nature and so may never be encountered by the MHC molecule. An algorithm that happens to predict these peptides well would be at an advantage, so an evaluation performed using them would be biased towards one algorithm or another. Set 1 was therefore filtered to remove the non-natural peptides, that is, those that do not have a source antigen. Briefly, we removed 121 sequences with poly 'A', poly 'G' and 2 poly 'S' backbones. A further 171 peptides generated from mutational or substitution analysis and 52 phage display peptides were also removed. This set contains 673 peptides.

Set3 (Non-Homologous Peptides): The experimental methods used to study MHC-peptide interactions involve truncation, substitution, or mutation of a base peptide (O'Sullivan et al., 1990). Such studies generate close homologues, that is, peptides that differ by only one or two amino acids. We removed 427 and 178 sequences with more than 80 percent identity (PID) from set 1 and set 2 respectively. The PID was calculated from the number of amino acids that are the same in two given peptides. For example, for the sequences "ACDEFGHIKLMNPQRST" and "DEFGGIKLMN", the longer sequence is divided into windows whose size equals the length of the shorter sequence (if both are of equal length, the full lengths are compared). Here the windows are "ACDEFGHIKL", "CDEFGHIKLM", "DEFGHIKLMN", ..., "IKLMNPQRST". Each identical position adds one to the score, from which the PID is finally calculated. Here, between "DEFGHIKLMN" and "DEFGGIKLMN", nine out of ten residues are identical, so the PID is 90%. The resulting sets 3(a) and 3(b) contain 590 and 495 non-homologous peptides, taken from the non-redundant and natural peptide data sets respectively.
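The windowed PID calculation described above might look as follows in Python. This is a minimal sketch, not the original implementation. Two points are assumptions that the text does not state: the PID of a pair is taken as the best score over all windows, and homologues are removed greedily, keeping the first peptide encountered. The function names are hypothetical.

```python
def window_pid(seq_a, seq_b):
    """Percent identity between two peptides using sliding windows.

    The longer sequence is cut into windows as long as the shorter one,
    each window is compared position by position, and the best window
    (an assumption) determines the PID.
    """
    shorter, longer = sorted((seq_a, seq_b), key=len)
    n = len(shorter)
    if n == 0:
        raise ValueError("peptides must not be empty")
    best = 0
    for start in range(len(longer) - n + 1):
        window = longer[start:start + n]
        matches = sum(x == y for x, y in zip(window, shorter))
        best = max(best, matches)
    return 100.0 * best / n


def remove_homologues(peptides, threshold=80.0):
    """Greedy filter (an assumption): keep a peptide only if its PID to
    every peptide already kept does not exceed the threshold."""
    kept = []
    for seq in peptides:
        if all(window_pid(seq, other) <= threshold for other in kept):
            kept.append(seq)
    return kept
```

With the example sequences from the text, window_pid("ACDEFGHIKLMNPQRST", "DEFGGIKLMN") finds nine matches out of ten in the window "DEFGHIKLMN" and returns 90.0, so the pair would count as homologous at the 80 percent threshold. Note that a greedy filter depends on the order of the input; a different order can keep a different subset.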

Set4 (Balanced Binders and Non-Binders): An ideal test set should contain equal numbers of binders and non-binders. Otherwise, the evaluation parameters will show bias. It is difficult to find MHC binders, and even more difficult to find non-binders. Therefore, the sets used in the evaluation of these algorithms contain more binders than non-binders. We randomly selected 323 and 292 binders from set 1 and set 2 respectively in order to study the effect of data set balancing on predictive performance.
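Random down-sampling of this kind can be sketched as below. The sketch assumes that balancing means drawing as many binders as there are non-binders; the function name and the seed parameter are hypothetical and only serve to make the illustration reproducible.

```python
import random


def balance(binders, non_binders, seed=0):
    """Return equally sized random samples of binders and non-binders.

    The larger class is down-sampled to the size of the smaller one
    (an assumption about how the balanced sets were built).
    """
    rng = random.Random(seed)  # fixed seed so the selection is repeatable
    n = min(len(binders), len(non_binders))
    return rng.sample(binders, n), rng.sample(non_binders, n)
```

Because the selection is random, a different seed gives a different balanced subset, so results on such a set are best reported together with the exact subset used.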

Set5 (Recent Ligands): It is well known that MHC binding peptides contain binding motifs (Rammensee et al., 1995). Each algorithm tries to learn these motifs. Since the algorithms were developed using the information available at the time, it is necessary to test their performance on recent data. We collected 117 and 85 peptides according to the set 1 and set 2 criteria respectively, capable of binding to HLA-DRB1*0401 alleles (Harfouch-Hammoud et al., 1999; Texier et al., 2000; Borras-Cuesta et al., 2000; Fridkis-Hareli et al., 2001). The size of this set is relatively small, so care should be taken in comparing the results.