GWFASTA has options to carry out standard FASTA searches against protein databases such as the non-redundant protein sequence database, the SWISS-PROT protein database, the database of proteins available in PDB, and the database of patented proteins. Among the nucleotide databases, GWFASTA can run FASTA searches against the non-redundant nucleotide database; organism-specific sequence databases such as the E. coli, yeast and Drosophila nucleotide databases; the mitochondrial nucleotide database; sequence tagged sites (STS); the vector nucleotide database; nucleotide sequences in PDB; and the patented nucleotide sequences database.
Genomes of different microbes and eukaryotes are maintained and are available for FASTA searches. Multiple genomes can be selected: users can search all genomes of a group (microbial or eukaryotic) or pick two or more genomes as needed. For simplicity, GWFASTA for genomes is limited to "nucleotide-to-nucleotide" or "blastn" searches.
GWFASTA searches for similar sequences in the protein database of an individual genome or of multiple user-selected genomes. For this purpose, the total proteins of the different microbial and eukaryotic genomes are maintained both separately and together.
Users can extract the top five FASTA hits from the database searched and align them with ClustalW, which is built into the server, for multiple alignment. The server extracts the sequences in FASTA format, which ClustalW recognizes. ClustalW outputs two files: an alignment file, which is displayed on the screen in PIR format, and a tree file, which can also be viewed on-line at the click of a button.
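The extraction step can be pictured with a minimal sketch. It assumes that "top" means the lowest E-value, which this page does not state, and the hit records, field names and `top_hits_to_fasta` function are hypothetical, not part of the server.

```python
def top_hits_to_fasta(hits, n=5, width=60):
    """Return the n best hits (lowest E-value, an assumption) as FASTA text."""
    best = sorted(hits, key=lambda h: h["evalue"])[:n]
    lines = []
    for hit in best:
        lines.append(">" + hit["id"])
        seq = hit["seq"]
        # Wrap the sequence into fixed-width lines, as is usual for FASTA files.
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"

# Hypothetical search results, for illustration only.
example_hits = [
    {"id": "hitA", "evalue": 1e-30, "seq": "MKVLATGG"},
    {"id": "hitB", "evalue": 2e-5, "seq": "MKVLSTGG"},
    {"id": "hitC", "evalue": 1e-50, "seq": "MKILATGG"},
    {"id": "hitD", "evalue": 0.5, "seq": "MRVLATGA"},
    {"id": "hitE", "evalue": 3e-12, "seq": "MKVLATGA"},
    {"id": "hitF", "evalue": 7.0, "seq": "QQVLATGG"},
]

fasta_text = top_hits_to_fasta(example_hits)
print(fasta_text)
```

The resulting text is the kind of multi-sequence FASTA input that ClustalW accepts.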
MView is a tool for converting the results of a sequence database search (BLAST, FASTA, etc.) into the form of a coloured multiple alignment of hits stacked against the query. Alternatively, an existing multiple alignment (MSF, PIR, CLUSTALW, etc.) can be processed. In either case, the output is simply HTML, so the result is platform independent and does not require a separate application or applet to be loaded. MView is NOT a multiple alignment program, nor is it a general purpose alignment editor.
An on-line tree viewer is provided. The GWFASTA server generates a tree file after multiple alignment using CLUSTALW, and this phylogenetic tree file can be viewed on screen. Phylodendron, a web-based phylogenetic tree printer at http://iubio.bio.indiana.edu/, is used to process your tree file and show it on screen. The original software and web server were developed by D.G. Gilbert, Indiana University.
The DNA or protein sequence can be pasted into the text area. Alternatively, a file containing a nucleotide or amino acid sequence can be uploaded using this option.
If the sequence is in one of the standard formats (EMBL, FASTA, GENBANK, etc.), `Format Type' should be set to `Standard Format (Readable by READSEQ)'. The GWFASTA server uses the READSEQ program, developed by D.G. Gilbert, Indiana University, to convert your sequence to FASTA format. If the input sequence is just plain text, set `Format Type' to `Plain Text (Single Letter Code)'. By default the server accepts only the single-letter code for amino acids or nucleotide bases; a three-letter amino acid code will be read as three different residues. The server also ignores non-standard characters such as ,*%!@$% etc.
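As a rough illustration of the plain-text handling described here, the sketch below drops every character that is not a letter. The server's exact filtering rules are not documented on this page, so treating digits and whitespace as "non-standard" is an assumption, and `clean_plain_sequence` is a hypothetical name.

```python
import re

def clean_plain_sequence(raw: str) -> str:
    """Keep only letters from plain-text input and upper-case them.

    Assumption: anything that is not a letter (symbols, digits,
    whitespace) counts as a non-standard character.
    """
    return re.sub(r"[^A-Za-z]", "", raw).upper()

cleaned = clean_plain_sequence("MKV*LA %!t 12\nGG")
print(cleaned)

# A three-letter code is not recognised: "Ala" becomes the three
# residues A, L and A rather than a single alanine.
three_letter = clean_plain_sequence("Ala")
print(three_letter, len(three_letter))
```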
Penalty for the first residue in a gap (-12 by default for proteins, -16 for DNA, -15 for FAST[XY]/TFAST[XY]).
Penalty for additional residues in a gap (-2 by default for proteins, -4 for DNA, -3 for FAST[XY]/TFAST[XY]).
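The two penalties combine into a single cost per gap. A minimal sketch, assuming the first-residue penalty is charged once and the additional-residue penalty is charged for every further residue in the same gap (`gap_cost` is a hypothetical helper, not a FASTA function):

```python
def gap_cost(length: int, first: int = -12, additional: int = -2) -> int:
    """Total penalty for one gap of `length` residues.

    Defaults are the protein values listed above; pass first=-16 and
    additional=-4 for DNA.
    """
    if length < 1:
        return 0
    return first + (length - 1) * additional

print(gap_cost(1))            # one-residue gap in a protein alignment
print(gap_cost(3))            # three-residue gap in a protein alignment
print(gap_cost(3, -16, -4))   # three-residue gap in a DNA alignment
```

Because opening a gap is much more expensive than lengthening it, one long gap is preferred over several short ones.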
fasta34, fasta34_t - scan a protein or DNA sequence library for similar sequences.
tfasta34, tfasta34_t - compare a protein sequence to a DNA sequence library, translating the DNA sequence library `on-the-fly'.
tfastx34, tfastx34_t - compare a protein sequence to a DNA sequence database, calculating similarities with frameshifts to the forward and reverse orientations.
tfasty34, tfasty34_t - compare a protein sequence to a DNA sequence database, calculating similarities with frameshifts to the forward and reverse orientations.
fastx34, fastx34_t - compare a DNA sequence to a protein sequence database, comparing the translated DNA sequence in forward and reverse frames.
fasty34, fasty34_t - compare a DNA sequence to a protein sequence database, comparing the translated DNA sequence in forward and reverse frames.
ssearch34, ssearch34_t - compare a protein or DNA sequence to a sequence database using the Smith-Waterman algorithm.
Limit the number of scores and alignments shown based on the expected number of scores. For protein searches, library sequences with E() values < 0.01 for searches of a 10,000 entry protein database are almost always homologous. Frequently, sequences with E() values from 1 - 10 are related as well, but unrelated sequences (1 - 10 per search) will have scores in this range too. Remember, however, that these E() values also reflect differences between the amino acid composition of the query sequence and that of the "average" library sequence. Thus, when searches are done with query sequences with "biased" amino acid composition, unrelated sequences may have "significant" scores because of sequence bias.
ktup = 1 or 2 for proteins (1 to 6 for DNA sequences); ktup = 2 is about 5 times faster than ktup = 1. Change this value to set the word length that the search should use. A word length of 2 is sensitive enough for most protein database searches. As a rule of thumb, the larger the word length, the less sensitive but faster the search will be. For DNA searches, a ktup of 6 is the default.
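The effect of the cutoff can be sketched as a simple filter. The hit list, its E() values and the `limit_hits` function are hypothetical and only illustrate the idea; FASTA applies the limit internally.

```python
def limit_hits(hits, e_cutoff=10.0):
    """Keep hits whose E() value is at or below the cutoff, best first."""
    kept = [hit for hit in hits if hit[1] <= e_cutoff]
    return sorted(kept, key=lambda hit: hit[1])

# Hypothetical (identifier, E() value) pairs.
example_scores = [("seq3", 3.2), ("seq1", 1e-40), ("seq4", 25.0), ("seq2", 0.004)]

shown = limit_hits(example_scores, e_cutoff=10.0)
confident = limit_hits(example_scores, e_cutoff=0.01)
print(shown)
print(confident)
```

With the looser cutoff, seq3 is still reported, but at E() = 3.2 it may be an unrelated sequence, as noted above.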
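To see why a longer word makes the search faster but less sensitive, the sketch below counts the word matches shared by two short sequences. It is a simplified picture of the word-lookup step only, not the FASTA program itself, and `shared_words` and the sequences are made up for illustration.

```python
def shared_words(query: str, subject: str, ktup: int):
    """Return (query_pos, subject_pos) pairs where a word of length ktup matches."""
    # Index every word of length ktup in the query.
    positions = {}
    for i in range(len(query) - ktup + 1):
        positions.setdefault(query[i:i + ktup], []).append(i)
    # Look up every word of the subject in that index.
    matches = []
    for j in range(len(subject) - ktup + 1):
        for i in positions.get(subject[j:j + ktup], ()):
            matches.append((i, j))
    return matches

matches_k1 = shared_words("MKVLA", "AKVLG", ktup=1)
matches_k2 = shared_words("MKVLA", "AKVLG", ktup=2)
print(len(matches_k1), matches_k1)
print(len(matches_k2), matches_k2)
```

With ktup = 1 every identical residue is a candidate match; with ktup = 2 only identical residue pairs count, so there are fewer candidates to extend and isolated identities are missed.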
Users can select a particular region from the input sequence data for similarity search. This option saves the user the job of continuously trimming and editing their sequence in case they want to restrict their search to a particular region.
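In effect, the option takes a slice of the input. A minimal sketch, assuming positions are counted from 1 and both ends are included (the page does not state the convention, and `select_region` is a hypothetical name):

```python
def select_region(seq: str, start: int, end: int) -> str:
    """Return residues start..end of seq, 1-based and inclusive (assumed)."""
    if not (1 <= start <= end <= len(seq)):
        raise ValueError("region must satisfy 1 <= start <= end <= sequence length")
    return seq[start - 1:end]

region = select_region("MKVLATGG", 3, 6)
print(region)
```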
Fasta3 uses the same scoring matrices as Blast1.4/2.0. Several scoring matrix files are included in the standard distribution. FASTA uses different kinds of Substitution matrices for similarity searches. It is well known that certain amino acids can substitute easily for one another in related proteins, presumably because of their similar physicochemical properties. These can be considered in calculating alignment scores in a flexible manner through the use of a substitution matrix, in which the score for any pair of amino acids can be easily looked up.
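The lookup can be sketched with a toy matrix. The scores below are invented for illustration and are not taken from any published matrix; `pair_score` and `ungapped_score` are hypothetical helpers.

```python
# Toy scores for three residues only (illustrative values, not a real matrix).
TOY_MATRIX = {
    ("L", "L"): 4, ("I", "I"): 4, ("D", "D"): 6,
    ("L", "I"): 2,    # similar hydrophobic residues substitute easily
    ("L", "D"): -4,   # dissimilar residues are penalised
    ("I", "D"): -3,
}

def pair_score(a: str, b: str, matrix=TOY_MATRIX) -> int:
    """Look up the score for a residue pair; the matrix is symmetric."""
    return matrix[(a, b)] if (a, b) in matrix else matrix[(b, a)]

def ungapped_score(s1: str, s2: str) -> int:
    """Sum the pair scores of two aligned sequences of equal length."""
    if len(s1) != len(s2):
        raise ValueError("sequences must have the same length")
    return sum(pair_score(a, b) for a, b in zip(s1, s2))

conservative = ungapped_score("LID", "ILD")
radical = ungapped_score("LLD", "DLD")
print(conservative, radical)
```

Both alignments contain mismatches, but the one with conservative L/I substitutions scores higher than the one with an L/D substitution.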
The two most widely used matrices are the PAM [Point Accepted Mutation] and BLOSUM [BLOcks SUbstitution Matrix] matrices.
PAM substitution matrices-
The first substitution matrices to gain widespread usage (Dayhoff et al., 1978). A PAM is a unit that quantifies the amount of evolutionary change in a protein sequence: 1 PAM unit is the amount of evolution that will change, on average, 1% of the amino acids in a protein sequence. Although PAM250 was the only published PAM matrix, the underlying mutation data can be extrapolated to other PAM distances to produce a family of matrices. When aligning sequences that are highly divergent, the best results are obtained at higher PAM values, such as PAM200 or PAM250. Matrices constructed from lower PAM values can be used if the sequences have a greater degree of similarity.
BLOSUM substitution matrices-
Constructed by Henikoff and Henikoff (1992). The underlying data for BLOSUM are derived from the BLOCKS database (Henikoff and Henikoff, 1991), which contains local multiple alignments ("blocks") involving distantly related sequences. There is a numbered series of BLOSUM matrices, with the number referring to the maximum level of identity that sequences may have before they are merged. For example, with the BLOSUM62 matrix, sequences having at least 62% identity are merged into a single sequence, so that the substitution frequencies are more heavily influenced by sequences that are more divergent than this cutoff.
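The extrapolation to other PAM distances amounts to applying the 1-PAM mutation probabilities repeatedly, that is, raising the 1-PAM probability matrix to a power. The sketch below shows this on a made-up two-letter alphabet; the real PAM1 matrix is 20 x 20 and its values are not reproduced here. `PAM1_TOY`, `mat_mul` and `mat_power` are hypothetical names.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_power(m, n):
    """Raise a square matrix to the n-th power by repeated multiplication."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result

# Toy 1-PAM matrix: each letter has a 1% chance of changing per PAM unit.
PAM1_TOY = [[0.99, 0.01],
            [0.01, 0.99]]

pam250_toy = mat_power(PAM1_TOY, 250)
print([round(p, 3) for p in pam250_toy[0]])
```

In this toy model, after 250 PAM units a letter is still found unchanged in roughly half of the cases, because repeated and reverse changes hit the same position. This is why 250 PAM does not mean that 250% of positions differ.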
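The merging step can be sketched as clustering by percent identity. This is a simplified illustration on made-up, pre-aligned sequences of equal length, not the Henikoffs' actual procedure; `percent_identity` and `cluster_by_identity` are hypothetical helpers.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent of identical positions between two aligned, equal-length sequences."""
    same = sum(x == y for x, y in zip(a, b))
    return 100.0 * same / len(a)

def cluster_by_identity(seqs, threshold=62.0):
    """Group sequences so that any pair at or above the threshold shares a cluster."""
    clusters = []
    for seq in seqs:
        merged, unlinked = [seq], []
        for cluster in clusters:
            if any(percent_identity(seq, member) >= threshold for member in cluster):
                merged.extend(cluster)      # seq links this cluster to the new one
            else:
                unlinked.append(cluster)
        clusters = unlinked + [merged]
    return clusters

# Made-up block of four aligned segments.
block = ["ACDEFGHIKL", "ACDEFGHIKV", "ACDEFGYYYY", "WWWWWWWWWW"]
clusters = cluster_by_identity(block, threshold=62.0)
print(clusters)
```

The first two segments are 90% identical and fall into one cluster, so they are counted as a single sequence; the third is only 60% identical to either and still contributes its own substitutions.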
Results of Genome Wide FASTA against the genomes of various prokaryotic or eukaryotic organisms, or against their proteomes, are summarized in a table. An example of such an output is given. If any particular genome reports a hit, the sequence can be extracted from the database for further processing. The table indicates whether or not the sequence is present in each alphabetically listed genome; if it is reported present, the E-value and the score of the hit are given.
FASTA 2.0 and 2.1 use the dust low-complexity filter for blast
After FASTA searches against protein databases, GWFASTA also offers compositional analysis of the query sequence and the selected FASTA hits. The tabular output gives the percent composition of the different amino acids in the different proteins, together with their maximum, minimum and average across the selected proteins. Polar, nonpolar, positive, negative, hydrophobic and hydrophilic compositional analyses of the proteins are also given.
GWFASTA can also run FASTA with multiple queries. The input needs to be given only once. Individual searches against protein databases can then be done after choosing a particular sequence from the list generated from your multiple-sequence input. This option accepts sequences only in standard formats readable by the READSEQ program. Steps: (i) input the sequences in the field or upload the file containing the sequences, (ii) select the parameters and database for FASTA, (iii) select the Batch Mode option, (iv) run the analysis.
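The per-residue part of this table can be sketched as follows. The sequences are made up, and `percent_composition` and `summarize_residue` are hypothetical names; the residue groupings used by the server (polar, hydrophobic, etc.) are not listed on this page and are left out.

```python
from collections import Counter

def percent_composition(seq: str) -> dict:
    """Percent of each residue type in one sequence."""
    counts = Counter(seq)
    return {residue: 100.0 * n / len(seq) for residue, n in counts.items()}

def summarize_residue(seqs: dict, residue: str):
    """Minimum, maximum and average percent of one residue across sequences."""
    values = [percent_composition(s).get(residue, 0.0) for s in seqs.values()]
    return min(values), max(values), sum(values) / len(values)

# Made-up query and hits.
proteins = {"query": "AAGG", "hit1": "AGGG", "hit2": "GGGG"}

ala_min, ala_max, ala_avg = summarize_residue(proteins, "A")
print(ala_min, ala_max, ala_avg)
```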
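The list of queries is built from the multiple-sequence input. A minimal sketch of that step, assuming the input has already been converted to FASTA format (on the server this conversion is done by READSEQ); `split_fasta` and the sample records are hypothetical.

```python
def split_fasta(text: str):
    """Split multi-sequence FASTA text into (name, sequence) records."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records

batch_input = """>query1 first protein
MKVLA
TGG
>query2
MRVLS
"""

queries = split_fasta(batch_input)
for query_name, query_seq in queries:
    print(query_name, query_seq)
```

Each record can then be offered in the list and searched individually.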
GWFASTA has integrated different post-alignment processing servers so as to help users in further refining their data.
Java-based Alignment analysis server:
Jalview is a Java-based alignment editor developed by Michele Clamp (EBI). It helps in editing and beautifying the alignment from CLUSTALW.
Analysis of Multiple Aligned Sequences:
AMAS, developed by Geoff Barton at EBI, is a program for analysing multiple alignments of protein sequences.
Protein Sequence Alignment:
PSA is a server developed locally by Raghava for presenting the physico-chemical properties of amino acids along a protein's primary structure, in graphical or tabular form. This helps in understanding the function of a protein.