Welcome to the GWFASTA Help and Documentation Page



General Architecture of GWFASTA

Figure 1. General architecture of the GWFASTA web server.



GWFASTA OPTIONS
  1. STANDARD:

    GWFASTA can run standard FASTA searches against protein databases such as the non-redundant protein sequence database, the Swiss-Prot protein database, the database of proteins available in PDB, and the database of patented proteins. Among nucleotide databases, GWFASTA can run FASTA searches against the non-redundant nucleotide database; organism-specific sequence databases such as the E. coli, yeast and Drosophila nucleotide databases; the mitochondrial nucleotide database; sequence tagged sites (STS); the vector nucleotide database; nucleotide sequences in PDB; and the patented nucleotide sequences database.

  2. GENOME WIDE FASTA AGAINST GENOME DATABASES (finished and unfinished):

    Genomes of different microbes and eukaryotes are maintained and available for FASTA searches. Multiple genomes can be selected: users can search all genomes of a group (microbial or eukaryotic) or select two or more genomes as needed. For simplicity, GWFASTA for genomes is limited to nucleotide-to-nucleotide ("blastn"-style) searches.

The currently maintained genomes are:

MICROBIAL GENOMES
1 Aeropyrum_pernix
2 Agrobacterium_tumefaciens_C58_Cereon
3 Agrobacterium_tumefaciens_C58_UWash
4 Aquifex_aeolicus
5 Archaeoglobus_fulgidus
6 Bacillus_halodurans
7 Bacillus_subtilis
8 Borrelia_burgdorferi
9 Brucella_melitensis
10 Buchnera_aphidicola_Sg
11 Buchnera_sp
12 Campylobacter_jejuni
13 Caulobacter_crescentus
14 Chlamydia_muridarum
15 Chlamydia_trachomatis
16 Chlamydophila_pneumoniae_AR39
17 Chlamydophila_pneumoniae_CWL029
18 Chlamydophila_pneumoniae_J138
19 Chlorobium_tepidum_TLS
20 Clostridium_acetobutylicum
21 Clostridium_perfringens
22 Corynebacterium_glutamicum
23 Deinococcus_radiodurans
24 Escherichia_coli_K12
25 Escherichia_coli_O157H7
26 Escherichia_coli_O157H7_EDL933
27 Fusobacterium_nucleatum
28 Haemophilus_influenzae
29 Halobacterium_sp
30 Helicobacter_pylori_26695
31 Helicobacter_pylori_J99
32 Lactococcus_lactis
33 Listeria_innocua
34 Listeria_monocytogenes
35 Mesorhizobium_loti
36 Methanobacterium_thermoautotrophicum
37 Methanococcus_jannaschii
38 Methanopyrus_kandleri
39 Methanosarcina_acetivorans
40 Methanosarcina_mazei
41 Mycobacterium_leprae
42 Mycobacterium_tuberculosis_CDC1551
43 Mycobacterium_tuberculosis_H37Rv
44 Mycoplasma_genitalium
45 Mycoplasma_pneumoniae
46 Mycoplasma_pulmonis
47 Neisseria_meningitidis_MC58
48 Neisseria_meningitidis_Z2491
49 Nostoc_sp
50 Pasteurella_multocida
51 Pseudomonas_aeruginosa
52 Pyrobaculum_aerophilum
53 Pyrococcus_abyssi
54 Pyrococcus_furiosus
55 Pyrococcus_horikoshii
56 Ralstonia_solanacearum
57 Rickettsia_conorii
58 Rickettsia_prowazekii
59 Salmonella_typhi
60 Salmonella_typhimurium_LT2
61 Sinorhizobium_meliloti
62 Staphylococcus_aureus_MW2
63 Staphylococcus_aureus_Mu50
64 Staphylococcus_aureus_N315
65 Streptococcus_agalactiae_2603
66 Streptococcus_pneumoniae_R6
67 Streptococcus_pneumoniae_TIGR4
68 Streptococcus_pyogenes
69 Streptococcus_pyogenes_MGAS315
70 Streptococcus_pyogenes_MGAS8232
71 Streptomyces_coelicolor
72 Sulfolobus_solfataricus
73 Sulfolobus_tokodaii
74 Synechocystis_PCC6803
75 Thermoanaerobacter_tengcongensis
76 Thermoplasma_acidophilum
77 Thermoplasma_volcanium
78 Thermosynechococcus_elongatus
79 Thermotoga_maritima
80 Treponema_pallidum
81 Ureaplasma_urealyticum
82 Vibrio_cholerae
83 Xanthomonas_campestris
84 Xanthomonas_citri
85 Xylella_fastidiosa
86 Yersinia_pestis_CO92
87 Yersinia_pestis_KIM
EUKARYOTIC GENOMES
1 A_thaliana
2 Anopheles_gambiae
3 D_rerio
4 C_elegans
5 Encephalitozoon_cuniculi
6 H_sapiens
7 M_musculus
8 S_pombe
9 Saccharomyces_cerevisiae
10 D_melanogaster
11 P_falciparum
  1. Genomic proteins:

    GWFASTA searches for similar sequences in the protein database of an individual genome or of multiple `user-selected' genomes. For this purpose, the total proteins of the different microbial and eukaryotic genomes are maintained both separately and together.

  2. Multiple Alignment:

    Users can extract the top five FASTA hits from the searched database and align them with ClustalW, which is built into the server. The server extracts the sequences in FASTA format, which ClustalW recognizes. ClustalW outputs two files: an alignment file, which is displayed on screen in PIR format, and a tree file, which can also be viewed on-line with the click of a button.

  3. MView:

    MView is a tool for converting the results of a sequence database search (BLAST, FASTA, etc.) into the form of a coloured multiple alignment of hits stacked against the query. Alternatively, an existing multiple alignment (MSF, PIR, CLUSTALW, etc.) can be processed. In either case, the output is simply HTML, so the result is platform independent and does not require a separate application or applet to be loaded. MView is NOT a multiple alignment program, nor is it a general purpose alignment editor.

  4. Phylogenetic Tree:

    An on-line tree viewer is provided. After multiple alignment with CLUSTALW, the GWFASTA server generates a phylogenetic tree file, which can be viewed on screen. Phylodendron, a web-based phylogenetic tree printer at http://iubio.bio.indiana.edu/, is used to process the tree file and display it. The original software and web server were developed by D.G. Gilbert, Indiana University.

  5. Pasting your sequence or Uploading a sequence file:

    The DNA or protein sequence can be pasted into the text area. Alternatively, a file containing a nucleotide
    or amino acid sequence can be uploaded using this option.

  6. Format Type:

    If the sequence is in one of the standard formats (EMBL, FASTA, GENBANK, etc.), set `Format Type' to `Standard Format (Readable by READSEQ)'. The GWFASTA server uses the READSEQ program, developed by D.G. Gilbert, Indiana University, to convert the sequence to FASTA format. If the input sequence is plain text, set `Format Type' to `Plain Text (Single Letter Code)'. The server accepts only the single-letter code for amino acids and nucleotide bases; a three-letter amino acid code is read as three separate residues. The server also ignores non-standard characters such as ,*%!@$% etc.
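
    The plain-text handling described above can be sketched as follows. This is only an illustration, not the server's actual code: the function name is hypothetical, and the assumption that every non-letter character (including digits and whitespace) is dropped is ours.

```python
import re

def clean_plain_sequence(text):
    """Keep only letters from plain-text input and upper-case them.

    Assumption: anything that is not a letter counts as a
    non-standard character and is ignored.
    """
    return re.sub(r"[^A-Za-z]", "", text).upper()

# Non-standard characters and whitespace are dropped.
print(clean_plain_sequence("mk*t ay,i%ak\n"))   # MKTAYIAK

# Three-letter codes are not interpreted: "MetLysThr" is read
# as nine single-letter residues, not as three amino acids.
print(clean_plain_sequence("MetLysThr"))        # METLYSTHR
```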

  7. Gap Initiation Penalty:

    Penalty for the first residue in a gap (-12 by default for proteins, -16 for DNA, -15 for FAST[XY]/TFAST[XY]).

  8. Gap Extension Penalty:

    Penalty for additional residues in a gap (-2 by default for proteins, -4 for DNA, -3 for FAST[XY]/TFAST[XY]).
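
    As a worked illustration of how the two gap penalties combine, the sketch below scores a single gap using the defaults listed above. The function and dictionary names are hypothetical; only the penalty values come from this page.

```python
# Default (first residue, additional residue) penalties from this page.
DEFAULT_GAP_PENALTIES = {
    "protein": (-12, -2),
    "dna": (-16, -4),
    "translated": (-15, -3),   # FAST[XY]/TFAST[XY]
}

def gap_score(length, kind="protein"):
    """Score of one gap of `length` residues.

    The first residue in the gap costs the initiation penalty and
    every additional residue costs the extension penalty.
    """
    if length < 1:
        raise ValueError("a gap contains at least one residue")
    first, additional = DEFAULT_GAP_PENALTIES[kind]
    return first + (length - 1) * additional

print(gap_score(1))          # -12
print(gap_score(3))          # -16
print(gap_score(3, "dna"))   # -24
```

    With these defaults, one long gap is cheaper than several short ones of the same total length, because the larger initiation penalty is paid only once.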

  9. Choice:

    fasta34, fasta34_t - scan a protein or DNA sequence library for similar sequences.

    tfasta34, tfasta34_t - compare a protein sequence to a DNA sequence library, translating the DNA sequence library `on-the-fly'.

    tfastx34, tfastx34_t - compare a protein sequence to a DNA sequence database, calculating similarities with frameshifts to the forward and reverse orientations.

    tfasty34, tfasty34_t - compare a protein sequence to a DNA sequence database, calculating similarities with frameshifts to the forward and reverse orientations.

    fastx34, fastx34_t - compare a DNA sequence to a protein sequence database, comparing the translated DNA sequence in forward and reverse frames.

    fasty34, fasty34_t - compare a DNA sequence to a protein sequence database, comparing the translated DNA sequence in forward and reverse frames.

    ssearch34, ssearch34_t - compare a protein or DNA sequence to a sequence database using the Smith-Waterman algorithm.
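
    Several of the programs above compare proteins with DNA translated in forward and reverse frames. The sketch below shows plain six-frame translation with the standard genetic code. It is a simplified illustration only: the function names are hypothetical, and the frameshift-aware alignment performed by the FASTX/FASTY and TFASTX/TFASTY programs is not reproduced here.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

def reverse_complement(dna):
    """Reverse-complement an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(dna):
    """Translate complete codons; unknown codons become 'X'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)

def six_frames(dna):
    """Return translations of the three forward and three reverse frames."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames

print(translate("ATGGCCTAA"))        # MA*
print(six_frames("ATGGCC")["-1"])    # GH
```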

  10. Expect value:

    Limit the number of scores and alignments shown based on the expected number of scores. For protein searches, library sequences with E() values < 0.01 for searches of a 10,000 entry protein database are almost always homologous. Sequences with E() values from 1 - 10 are frequently related as well, but unrelated sequences (1 - 10 per search) will also have scores in this range. Remember, however, that these E() values also reflect differences between the amino acid composition of the query sequence and that of the "average" library sequence. Thus, when searches are done with query sequences with "biased" amino-acid composition, unrelated sequences may have "significant" scores because of sequence bias.
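
    The effect of the Expect value cutoff can be sketched as a simple filter over a list of hits. The hit records, their field names and the function name below are hypothetical and serve only to illustrate the thresholding.

```python
def filter_by_expect(hits, max_expect=10.0):
    """Keep hits whose E() value is at or below the cutoff, best first."""
    kept = [hit for hit in hits if hit["expect"] <= max_expect]
    return sorted(kept, key=lambda hit: hit["expect"])

# Made-up hits for illustration only.
hits = [
    {"id": "seqA", "expect": 2.5},
    {"id": "seqB", "expect": 1e-30},
    {"id": "seqC", "expect": 0.004},
    {"id": "seqD", "expect": 48.0},
]

# A strict cutoff keeps only hits that are almost always homologous.
print([hit["id"] for hit in filter_by_expect(hits, max_expect=0.01)])
# ['seqB', 'seqC']

# A permissive cutoff also reports borderline hits (E() of 1 - 10).
print([hit["id"] for hit in filter_by_expect(hits, max_expect=10.0)])
# ['seqB', 'seqC', 'seqA']
```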

  11. Ktup:

    ktup = 1 or 2 for protein sequences, or 1 to 6 for DNA sequences. A search with ktup = 2 is about 5 times faster than one with ktup = 1. Change this value to set the word length the search should use. A word length of 2 is sensitive enough for most protein database searches. As a rule of thumb, the larger the word length, the faster but less sensitive the search. For DNA searches the default ktup is 6.
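
    To illustrate why a larger ktup is faster but less sensitive, the sketch below lists the words of length ktup that two sequences share. It only shows the word-matching idea; it is not the FASTA implementation, and the function names and example sequences are hypothetical.

```python
def words(seq, ktup):
    """Return the set of all overlapping words of length ktup in seq."""
    return {seq[i:i + ktup] for i in range(len(seq) - ktup + 1)}

def shared_words(query, library_seq, ktup=2):
    """Words of length ktup that occur in both sequences."""
    return words(query, ktup) & words(library_seq, ktup)

query = "MKTAYIAK"
library_seq = "MKTAYLAK"   # differs from the query at one position

# Longer words leave fewer candidate matches to examine (faster),
# but a single substitution destroys more of them (less sensitive).
print(sorted(shared_words(query, library_seq, ktup=2)))
# ['AK', 'AY', 'KT', 'MK', 'TA']
print(sorted(shared_words(query, library_seq, ktup=3)))
# ['KTA', 'MKT', 'TAY']
```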

  12. Select Region:

    Users can select a particular region of the input sequence for the similarity search. This option saves users from repeatedly trimming and editing their sequence when they want to restrict the search to a particular region.
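
    Selecting a region amounts to taking a subsequence of the query. The sketch below assumes 1-based, inclusive start and end positions; that coordinate convention and the function name are assumptions, not taken from the server.

```python
def select_region(seq, start, end):
    """Return residues start..end of seq (1-based, inclusive; assumed)."""
    if not 1 <= start <= end <= len(seq):
        raise ValueError("region must lie within the sequence")
    return seq[start - 1:end]

print(select_region("MKTAYIAKQR", 3, 6))   # TAYI
```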

  13. Weight matrix:

    Fasta3 uses the same scoring matrices as Blast1.4/2.0. Several scoring matrix files are included in the standard distribution. FASTA uses different kinds of substitution matrices for similarity searches. It is well known that certain amino acids can substitute easily for one another in related proteins, presumably because of their similar physicochemical properties. These substitutions can be taken into account flexibly when calculating alignment scores by using a substitution matrix, in which the score for any pair of amino acids can be easily looked up.
    The two most widely used families are the PAM [Point Accepted Mutation] and BLOSUM [BLOCKS substitution] matrices.
    PAM substitution matrices-
    These were the first substitution matrices to gain widespread usage (Dayhoff et al., 1978). A PAM is a unit that quantifies the amount of evolutionary change in a protein sequence: 1 PAM unit is the amount of evolution that changes, on average, 1% of the amino acids in a protein sequence. Although PAM250 was the only published PAM matrix, the underlying mutation data can be extrapolated to other PAM distances to produce a family of matrices. When aligning sequences that are highly divergent, the best results are obtained at higher PAM values, such as PAM200 or PAM250. Matrices constructed from lower PAM values can be used if the sequences have a greater degree of similarity.
    BLOSUM substitution matrices-
    These were constructed by Henikoff and Henikoff (1992). The underlying data for BLOSUM are derived from the BLOCKS database (Henikoff and Henikoff, 1991), which contains local multiple alignments ("blocks") involving distantly related sequences. There is a numbered series of BLOSUM matrices, with the number referring to the maximum level of identity that sequences may have before they are merged. For example, with the BLOSUM62 matrix, sequences having at least 62% identity are merged into a single sequence, so that the substitution frequencies are more heavily influenced by sequences that are more divergent than this cutoff.
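
    The merging step behind the BLOSUM number can be sketched as clustering aligned segments by percent identity. This is a simplified illustration under our own assumptions (equal-length, ungapped segments and single-linkage merging); the function names and example segments are hypothetical, and this is not the Henikoff procedure in full.

```python
def percent_identity(a, b):
    """Percent of identical positions between two equal-length segments."""
    if len(a) != len(b):
        raise ValueError("segments must have the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)

def cluster_by_identity(segments, threshold=62.0):
    """Merge segments that reach the identity threshold with any member
    of an existing cluster (single linkage)."""
    clusters = []
    for seg in segments:
        linked, rest = [], []
        for cluster in clusters:
            if any(percent_identity(seg, m) >= threshold for m in cluster):
                linked.append(cluster)
            else:
                rest.append(cluster)
        merged = [seg] + [m for cluster in linked for m in cluster]
        clusters = rest + [merged]
    return clusters

# Made-up segments: the first two are 90% identical and are merged,
# so together they count as one sequence; the third stays separate.
segments = ["ACDEFGHIKL", "ACDEFGHIKV", "WWWWWWWWWW"]
print(percent_identity(segments[0], segments[1]))   # 90.0
print(len(cluster_by_identity(segments)))           # 2
```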

  14. Genome Wise Summary:

    The results of a Genome Wide FASTA search against the genomes or proteomes of various prokaryotic or eukaryotic organisms are summarized in a table. An example of such an output is given. If a particular genome reports a hit, the sequence can be extracted from the database for further processing. The table indicates whether the sequence is present in each of the alphabetically listed genomes; if it is, the E-value and the score of the hit are given.

  15. Complexity:

    FASTA 2.0 and 2.1 use the dust low-complexity filter for blast

  16. Compositional analysis:

    After FASTA searches against protein databases, GWFASTA also offers a compositional analysis of the query sequence and the selected FASTA hits. The tabular output gives the percent composition of each amino acid in each protein, together with the maximum, minimum and average across the selected proteins. Polar, nonpolar, positive, negative, hydrophobic and hydrophilic compositions of the proteins are also given.
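
    The per-residue part of such a table can be sketched as below. The function names and example sequences are hypothetical; the grouping into polar, nonpolar and similar classes is left out because the server's class definitions are not given here.

```python
from collections import Counter

def percent_composition(seq):
    """Percent of each residue in one non-empty sequence."""
    counts = Counter(seq.upper())
    return {aa: 100.0 * n / len(seq) for aa, n in counts.items()}

def composition_table(proteins):
    """Per-residue percent composition for each protein, plus the
    minimum, maximum and mean across all proteins.

    `proteins` maps a name to a sequence.
    """
    comps = {name: percent_composition(seq) for name, seq in proteins.items()}
    residues = sorted({aa for comp in comps.values() for aa in comp})
    table = {}
    for aa in residues:
        values = {name: comps[name].get(aa, 0.0) for name in proteins}
        row = dict(values)
        row["min"] = min(values.values())
        row["max"] = max(values.values())
        row["mean"] = sum(values.values()) / len(values)
        table[aa] = row
    return table

# Made-up sequences for illustration only.
table = composition_table({"query": "AAKK", "hit1": "AKKK"})
print(table["A"])   # {'query': 50.0, 'hit1': 25.0, 'min': 25.0, 'max': 50.0, 'mean': 37.5}
```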

  17. FASTA in Batch Mode:

    GWFASTA can also run FASTA with multiple queries. The input needs to be given only once. The server generates a list from the multiple-sequence input, and individual searches against protein databases can then be run by choosing a particular sequence from that list. This option accepts sequences only in standard formats readable by the READSEQ program. Steps: (i) enter the sequences in the field or upload the file containing the sequences, (ii) select the parameters and database for FASTA, (iii) select the Batch Mode option, (iv) run the analysis.
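
    Building the list of queries from a multiple-sequence input can be sketched as below for FASTA-formatted input. This is a minimal illustration with a hypothetical function name; the server itself relies on READSEQ and accepts other standard formats as well.

```python
def parse_fasta(text):
    """Split FASTA-formatted text into (name, sequence) records."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records

# Made-up input for illustration only.
batch = """>query1 first protein
MKTAYIAK
QRQISFVK
>query2
MSDNELLK
"""
for name, seq in parse_fasta(batch):
    print(name, len(seq))
# query1 first protein 16
# query2 8
```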

  18. Alignment editor and analyses servers:

    GWFASTA has integrated different post-alignment processing servers so as to help users in further refining their data.

    Java-based Alignment analysis server:
    Jalview is a Java-based alignment editor developed by Michele Clamp (EBI). It helps in editing and beautifying the alignment from CLUSTALW.
    Analysis of Multiple Aligned Sequences: AMAS was developed by Geoff Barton at EBI and is a program to analyze multiple alignments of protein sequences.
    Protein Sequence Alignment: PSA is a server developed locally by Raghava for presenting the physico-chemical properties of amino acids graphically or in tabular form along the primary structure of a protein. This helps in understanding the function of a protein.