GWFASTA HELP AND DOCUMENTATION

Welcome to GWFASTA Help and Documentation Page

General Architecture of GWFASTA

Figure 1. General architecture of the GWFASTA web server.

GWFASTA OPTIONS

STANDARD:
GWFASTA can carry out standard FASTA searches against protein databases such as the non-redundant protein sequence database, the Swiss-Prot protein database, the database of proteins available in PDB, and the database of patented proteins. Among the nucleotide databases, GWFASTA can search the non-redundant nucleotide database; organism-specific sequence databases such as the E. coli, yeast and Drosophila nucleotide databases; the mitochondrial nucleotide database; sequence tagged sites (STS); the vector nucleotide database; nucleotide sequences in PDB; and the patented nucleotide sequences database.
GENOME WIDE FASTA AGAINST GENOME DATABASES (finished and unfinished):
Genomes of different microbes and eukaryotes are maintained and available for FASTA searches. Multiple genomes can be selected at once: users can select either all genomes of a group (microbial or eukaryotic) or any two or more genomes as needed. For simplicity, GWFASTA searches against genomes are limited to "nucleotide-to-nucleotide" searches (the counterpart of "blastn").

The currently maintained genomes are:

MICROBIAL GENOMES
1	Aeropyrum_pernix
2	Agrobacterium_tumefaciens_C58_Cereon
3	Agrobacterium_tumefaciens_C58_UWash
4	Aquifex_aeolicus
5	Archaeoglobus_fulgidus
6	Bacillus_halodurans
7	Bacillus_subtilis
8	Borrelia_burgdorferi
9	Brucella_melitensis
10	Buchnera_aphidicola_Sg
11	Buchnera_sp
12	Campylobacter_jejuni
13	Caulobacter_crescentus
14	Chlamydia_muridarum
15	Chlamydia_trachomatis
16	Chlamydophila_pneumoniae_AR39
17	Chlamydophila_pneumoniae_CWL029
18	Chlamydophila_pneumoniae_J138
19	Chlorobium_tepidum_TLS
20	Clostridium_acetobutylicum
21	Clostridium_perfringens
22	Corynebacterium_glutamicum
23	Deinococcus_radiodurans
24	Escherichia_coli_K12
25	Escherichia_coli_O157H7
26	Escherichia_coli_O157H7_EDL933
27	Fusobacterium_nucleatum
28	Haemophilus_influenzae
29	Halobacterium_sp
30	Helicobacter_pylori_26695
31	Helicobacter_pylori_J99
32	Lactococcus_lactis
33	Listeria_innocua
34	Listeria_monocytogenes
35	Mesorhizobium_loti
36	Methanobacterium_thermoautotrophicum
37	Methanococcus_jannaschii
38	Methanopyrus_kandleri
39	Methanosarcina_acetivorans
40	Methanosarcina_mazei
41	Mycobacterium_leprae
42	Mycobacterium_tuberculosis_CDC1551
43	Mycobacterium_tuberculosis_H37Rv
44	Mycoplasma_genitalium
45	Mycoplasma_pneumoniae
46	Mycoplasma_pulmonis
47	Neisseria_meningitidis_MC58
48	Neisseria_meningitidis_Z2491
49	Nostoc_sp
50	Pasteurella_multocida
51	Pseudomonas_aeruginosa
52	Pyrobaculum_aerophilum
53	Pyrococcus_abyssi
54	Pyrococcus_furiosus
55	Pyrococcus_horikoshii
56	Ralstonia_solanacearum
57	Rickettsia_conorii
58	Rickettsia_prowazekii
59	Salmonella_typhi
60	Salmonella_typhimurium_LT2
61	Sinorhizobium_meliloti
62	Staphylococcus_aureus_MW2
63	Staphylococcus_aureus_Mu50
64	Staphylococcus_aureus_N315
65	Streptococcus_agalactiae_2603
66	Streptococcus_pneumoniae_R6
67	Streptococcus_pneumoniae_TIGR4
68	Streptococcus_pyogenes
69	Streptococcus_pyogenes_MGAS315
70	Streptococcus_pyogenes_MGAS8232
71	Streptomyces_coelicolor
72	Sulfolobus_solfataricus
73	Sulfolobus_tokodaii
74	Synechocystis_PCC6803
75	Thermoanaerobacter_tengcongensis
76	Thermoplasma_acidophilum
77	Thermoplasma_volcanium
78	Thermosynechococcus_elongatus
79	Thermotoga_maritima
80	Treponema_pallidum
81	Ureaplasma_urealyticum
82	Vibrio_cholerae
83	Xanthomonas_campestris
84	Xanthomonas_citri
85	Xylella_fastidiosa
86	Yersinia_pestis_CO92
87	Yersinia_pestis_KIM
EUKARYOTIC GENOMES
1	A_thaliana
2	Anopheles_gambiae
3	D_rerio
4	C_elegans
5	Encephalitozoon_cuniculi
6	H_sapiens
7	M_musculus
8	S_pombe
9	Saccharomyces_cerevisiae
10	D_melanogaster
11	P_falciparum

Genomic proteins:
GWFASTA searches for similar sequences in the protein databases of individual or `user-selected' multiple genomes. For this purpose, the complete protein sets of the different microbial and eukaryotic genomes are maintained both separately and together.
Multiple Alignment:
Users can extract the top five FASTA hits from the database searched and align them with ClustalW, which is built into the server. The server extracts the sequences in FASTA format, which ClustalW recognizes. ClustalW outputs two files: an alignment file, which is displayed on screen in PIR format, and a tree file, which can also be viewed on-line at the click of a button.
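The extraction step can be pictured with a minimal Python sketch. It assumes the hits have already been parsed into (identifier, score, sequence) tuples; the function name and the record layout are hypothetical and are not taken from the GWFASTA server itself.

```python
def top_hits_as_fasta(hits, n=5, width=60):
    """Return the n best-scoring hits as FASTA text.

    hits: iterable of (identifier, score, sequence) tuples (assumed layout).
    """
    # Highest score first, then keep only the first n hits.
    best = sorted(hits, key=lambda h: h[1], reverse=True)[:n]
    lines = []
    for identifier, _score, sequence in best:
        lines.append(">" + identifier)
        # Wrap the sequence at a fixed width, as is conventional for FASTA.
        for start in range(0, len(sequence), width):
            lines.append(sequence[start:start + width])
    return "\n".join(lines) + "\n"
```

The resulting text could then be handed to a multiple alignment program that reads FASTA input.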
MView:
MView is a tool for converting the results of a sequence database search (BLAST, FASTA, etc.) into the form of a coloured multiple alignment of hits stacked against the query. Alternatively, an existing multiple alignment (MSF, PIR, CLUSTALW, etc.) can be processed. In either case, the output is simply HTML, so the result is platform independent and does not require a separate application or applet to be loaded. MView is NOT a multiple alignment program, nor is it a general purpose alignment editor.
Phylogenetic Tree:
An on-line tree viewer is provided. The GWFASTA server generates a tree file after multiple alignment with ClustalW, and this phylogenetic tree file can be viewed on screen. Phylodendron, a web-based phylogenetic tree printer at http://iubio.bio.indiana.edu/, is used to process the tree file and display it. The original software and web server were developed by D.G. Gilbert, Indiana University.
Pasting your sequence or Uploading a sequence file:
The DNA or protein sequence can be pasted into the text area. Alternatively, a file containing a nucleotide or amino acid sequence can be uploaded using this option.
Format Type:
If the sequence is in one of the standard formats (EMBL, FASTA, GENBANK, etc.), set `Format Type' to `Standard Format (Readable by READSEQ)'. The GWFASTA server uses the READSEQ program, developed by D.G. Gilbert, Indiana University, to convert the sequence to FASTA format. If the input sequence is plain text, set `Format Type' to `Plain Text (Single Letter Code)'. The server accepts only the single-letter code for amino acids or nucleotide bases; a three-letter amino acid code will be read as three different residues. The server also ignores non-standard characters such as , * % ! @ $ etc.
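The plain-text handling can be approximated by the following Python sketch. The exact set of characters the server accepts is not documented here, so the rule used (keep every ASCII letter, drop everything else) and the function name are assumptions made for illustration.

```python
import re

def clean_plain_sequence(text):
    """Keep only letters from a plain-text sequence and upper-case them.

    Assumption: any ASCII letter counts as a residue; digits, whitespace
    and punctuation are treated as non-standard characters and dropped.
    """
    return re.sub(r"[^A-Za-z]", "", text).upper()
```

For example, the input "Ala-Gly" would come out as the six residues ALAGLY, which shows why three-letter codes should not be submitted as plain text.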
Gap Initiation Penalty:
Penalty for the first residue in a gap (-12 by default for proteins, -16 for DNA, -15 for FAST[XY]/TFAST[XY]).
Gap Extension Penalty:
Penalty for additional residues in a gap (-2 by default for proteins, -4 for DNA, -3 for FAST[XY]/TFAST[XY]).
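Taken together, the two penalties give the score of a gap as a function of its length. The Python sketch below uses the protein defaults quoted above; the function name is hypothetical.

```python
def gap_score(length, initiation=-12, extension=-2):
    """Score of a single gap of the given length.

    The first residue in the gap costs `initiation`; every additional
    residue costs `extension`.
    """
    if length < 1:
        return 0  # no gap, no penalty
    return initiation + (length - 1) * extension
```

With the protein defaults, a gap of three residues scores -12 + 2 * (-2) = -16; with the DNA defaults (-16 and -4) the same gap scores -24.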
Choice:
fasta34, fasta34_t - scan a protein or DNA sequence library for similar sequences.

tfasta34, tfasta34_t - compare a protein sequence to a DNA sequence library, translating the DNA sequence library `on-the-fly'.

tfastx34, tfastx34_t - compare a protein sequence to a DNA sequence database, calculating similarities with frameshifts to the forward and reverse orientations.

tfasty34, tfasty34_t - compare a protein sequence to a DNA sequence database, calculating similarities with frameshifts to the forward and reverse orientations.

fastx34, fastx34_t - compare a DNA sequence to a protein sequence database, comparing the translated DNA sequence in forward and reverse frames.

fasty34, fasty34_t - compare a DNA sequence to a protein sequence database, comparing the translated DNA sequence in forward and reverse frames.

ssearch34, ssearch34_t - compare a protein or DNA sequence to a sequence database using the Smith-Waterman algorithm.
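The list above can be summarized as a lookup from query type and library type to the programs that apply. This Python sketch only restates the descriptions given here; the table and function names are hypothetical, and the threaded `_t' variants are left out for brevity.

```python
# (query type, library type) -> programs, as described in the list above.
PROGRAMS = {
    ("protein", "protein"): ["fasta34", "ssearch34"],
    ("dna", "dna"): ["fasta34", "ssearch34"],
    ("protein", "dna"): ["tfasta34", "tfastx34", "tfasty34"],
    ("dna", "protein"): ["fastx34", "fasty34"],
}

def candidate_programs(query_type, library_type):
    """Return the programs suited to a query/library combination."""
    key = (query_type.lower(), library_type.lower())
    return PROGRAMS.get(key, [])
```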
Expect value:
Limits the number of scores and alignments shown, based on the expected number of scores. For protein searches, library sequences with E() values < 0.01 in searches of a 10,000-entry protein database are almost always homologous. Sequences with E() values from 1 to 10 are frequently related as well, but unrelated sequences (1 to 10 per search) will also have scores in this range. Remember, however, that these E() values also reflect differences between the amino acid composition of the query sequence and that of the "average" library sequence. Thus, when searches are done with query sequences of "biased" amino acid composition, unrelated sequences may have "significant" scores because of sequence bias.
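The effect of the cutoff can be sketched in Python as a simple filter. The hit layout of (identifier, E-value) pairs and the function name are assumptions; whether the server treats the cutoff as inclusive is not stated, so the comparison below is only one possible choice.

```python
def filter_by_expect(hits, cutoff):
    """Keep hits whose E() value does not exceed the cutoff, best first.

    hits: list of (identifier, evalue) pairs (assumed layout).
    """
    kept = [hit for hit in hits if hit[1] <= cutoff]
    return sorted(kept, key=lambda hit: hit[1])
```

A strict cutoff such as 0.01 keeps only hits that are almost certainly homologous, while a cutoff of 10 also lets through the borderline range discussed above.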
Ktup:
ktup = 1 or 2 for protein sequences (1 to 6 for DNA sequences); a ktup of 2 is about 5 times faster than a ktup of 1. Change this value to set the word length that the search should use. A word length of 2 is sensitive enough for most protein database searches. As a rule of thumb, the larger the word length, the less sensitive but faster the search will be. For DNA searches, a ktup of 6 is the default.
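The trade-off can be illustrated with a small Python sketch of word lookup. It shows the general idea of indexing words of length ktup, not the actual FASTA implementation; both function names are hypothetical.

```python
from collections import defaultdict

def ktup_index(sequence, ktup):
    """Map every word of length ktup to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(sequence) - ktup + 1):
        index[sequence[pos:pos + ktup]].append(pos)
    return dict(index)

def shared_word_count(query, library_seq, ktup):
    """Count (query position, library position) pairs that share a word."""
    index = ktup_index(library_seq, ktup)
    return sum(len(index.get(query[i:i + ktup], []))
               for i in range(len(query) - ktup + 1))
```

For the query ACG against the library sequence ACGAC, a ktup of 1 gives 5 word matches and a ktup of 2 gives 3: longer words produce fewer candidate matches to examine, which is faster but can miss weak similarities.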
Select Region:
Users can select a particular region from the input sequence data for similarity search. This option saves the user the job of continuously trimming and editing their sequence in case they want to restrict their search to a particular region.
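A region selection of this kind amounts to taking a subsequence. The Python sketch below assumes 1-based, inclusive coordinates, which is a common convention for sequence positions but is not confirmed for this server; the function name is hypothetical.

```python
def select_region(sequence, start, end):
    """Return residues start..end (assumed 1-based, inclusive)."""
    if not (1 <= start <= end <= len(sequence)):
        raise ValueError("region must satisfy 1 <= start <= end <= sequence length")
    # Convert the 1-based start to Python's 0-based slice index.
    return sequence[start - 1:end]
```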
Weight matrix:
Fasta3 uses the same scoring matrices as Blast1.4/2.0, and several scoring matrix files are included in the standard distribution. FASTA uses different kinds of substitution matrices for similarity searches. It is well known that certain amino acids can substitute easily for one another in related proteins, presumably because of their similar physicochemical properties. A substitution matrix, in which the score for any pair of amino acids can be looked up, allows these substitutions to be taken into account flexibly when calculating alignment scores.
The two most widely used families are the PAM [Point Accepted Mutation] and BLOSUM [BLOCKS Substitution] matrices.
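How a substitution matrix enters an alignment score can be sketched in Python as a pairwise lookup summed over aligned positions. The matrix below is a tiny toy table covering three residues, with values chosen for illustration only; a real search would use one of the matrix files shipped with FASTA. All names are hypothetical.

```python
# Toy scores for three residues, for illustration only.
TOY_MATRIX = {
    ("L", "L"): 4, ("I", "I"): 4, ("L", "I"): 2,
    ("D", "D"): 6, ("L", "D"): -4, ("I", "D"): -3,
}

def pair_score(a, b, matrix=TOY_MATRIX):
    """Look up a symmetric substitution score for residues a and b."""
    if (a, b) in matrix:
        return matrix[(a, b)]
    return matrix[(b, a)]

def ungapped_score(seq1, seq2, matrix=TOY_MATRIX):
    """Sum the substitution scores over an ungapped alignment."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have equal length")
    return sum(pair_score(a, b, matrix) for a, b in zip(seq1, seq2))
```

In this toy table the conservative exchange of L and I still scores positively, while L against D is penalized, mirroring the idea that physicochemically similar residues substitute more easily.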
PAM substitution matrices-
These were the first substitution matrices to gain widespread usage (Dayhoff et al., 1978). PAM is a unit that quantifies the amount of evolutionary change in a protein sequence: 1 PAM unit is the amount of evolution that will change 1% of the amino acids in a protein sequence. Although PAM250 was the only published PAM matrix, the underlying mutation data can be extrapolated to other PAM distances to produce a family of matrices. When aligning sequences that are highly divergent, the best results are obtained at higher PAM values, such as PAM200 or PAM250. Matrices constructed from lower PAM values can be used if the sequences have a greater degree of similarity.
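The extrapolation to other PAM distances is done by repeatedly multiplying the 1-PAM mutation probability matrix by itself. The Python sketch below shows only that step, on a toy three-letter alphabet with made-up probabilities; real PAM matrices use 20 amino acids and are then converted to log-odds scores, which is not shown. All names are hypothetical.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def extrapolate(pam1, distance):
    """Multiply a 1-PAM probability matrix by itself `distance` times."""
    n = len(pam1)
    # Start from the identity matrix (zero evolutionary distance).
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(distance):
        result = mat_mul(result, pam1)
    return result

# Toy alphabet: each letter stays unchanged with probability 0.99 per PAM.
TOY_PAM1 = [
    [0.990, 0.005, 0.005],
    [0.005, 0.990, 0.005],
    [0.005, 0.005, 0.990],
]
```

At a distance of 250 the probability of a letter remaining unchanged in this toy model has dropped far below 0.99, but each row still sums to 1, since the entries remain probabilities.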
BLOSUM substitution matrices-
These were constructed by Henikoff and Henikoff (1992). The underlying data for BLOSUM are derived from the BLOCKS database (Henikoff and Henikoff, 1991), which contains local multiple alignments ("blocks") involving distantly related sequences. There is a numbered series of BLOSUM matrices, with the number referring to the maximum level of identity that sequences may have. For example, with the BLOSUM62 matrix, sequences having at least 62% identity are merged into a single sequence, so that the substitution frequencies are more heavily influenced by sequences that are more divergent than this cutoff.
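The merging of similar sequences can be pictured with a simplified Python sketch. It groups equal-length aligned segments greedily by percent identity; this is an illustration of the idea only, not the procedure used to build the published BLOSUM matrices, and the function names are hypothetical.

```python
def percent_identity(a, b):
    """Percent of identical positions between two equal-length aligned segments."""
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / len(a)

def cluster_segments(segments, threshold=62.0):
    """Greedy grouping of segments by identity.

    A segment joins the first cluster holding any member at or above the
    identity threshold; otherwise it starts a new cluster.
    """
    clusters = []
    for seg in segments:
        for cluster in clusters:
            if any(percent_identity(seg, member) >= threshold for member in cluster):
                cluster.append(seg)
                break
        else:
            clusters.append([seg])
    return clusters
```

Each resulting cluster would then count as a single sequence when substitution frequencies are tallied, so that closely related sequences do not dominate the statistics.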
Genome Wise Summary:
The results of Genome Wide FASTA against the genomes of various prokaryotic or eukaryotic organisms, or their proteomes, are summarized in a table. An example of such an output is given. If any particular genome reports a hit, the sequence can be extracted from the database for further processing. The table indicates whether or not the sequence is present in each of the alphabetically listed genomes; if it is present, the E-value and the score of the hit are given.
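A table of this shape could be assembled as in the following Python sketch. The input layout (a list of genome names plus a mapping from genome name to the E-value and score of its best hit) and the function name are assumptions, not the server's actual data structures.

```python
def genome_summary(genomes, best_hits):
    """Build summary rows, one per genome, in alphabetical order.

    genomes: list of genome names searched.
    best_hits: dict mapping genome name -> (evalue, score) for genomes
               that reported a hit (assumed layout).
    """
    rows = []
    for name in sorted(genomes):
        if name in best_hits:
            evalue, score = best_hits[name]
            rows.append((name, "present", evalue, score))
        else:
            rows.append((name, "absent", None, None))
    return rows
```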
Complexity:
FASTA 2.0 and 2.1 use the dust low-complexity filter for blast.
Compositional analysis:
After FASTA searches against protein databases, GWFASTA also offers compositional analysis of the query sequence and the selected FASTA hits. The tabular output gives the percent composition of the different amino acids in each protein, together with the maximum, minimum and average across the selected proteins. Polar, nonpolar, positive, negative, hydrophobic and hydrophilic compositional analyses of the proteins are also given.
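The per-residue part of such a table can be sketched in Python as follows. The grouping into polar, nonpolar and other classes is omitted because the class definitions used by the server are not given here; the function names and the input layout are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def percent_composition(sequence):
    """Percent of each of the 20 standard amino acids in a sequence."""
    seq = sequence.upper()
    total = sum(seq.count(aa) for aa in AMINO_ACIDS)
    if total == 0:
        return {aa: 0.0 for aa in AMINO_ACIDS}
    return {aa: 100.0 * seq.count(aa) / total for aa in AMINO_ACIDS}

def composition_table(sequences):
    """Per-residue (minimum, maximum, average) across several proteins.

    sequences: dict mapping a protein name to its sequence (must not be empty).
    """
    comps = [percent_composition(s) for s in sequences.values()]
    table = {}
    for aa in AMINO_ACIDS:
        values = [c[aa] for c in comps]
        table[aa] = (min(values), max(values), sum(values) / len(values))
    return table
```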
FASTA in Batch Mode:
GWFASTA can also run FASTA with multiple queries. The input needs to be given only once; a list is generated from the multiple-sequence input, and individual searches against protein databases can then be run by choosing a particular sequence from that list. This option accepts sequences only in standard formats readable by the READSEQ program. Steps: (i) input the sequences in the field or upload the file containing the sequences, (ii) select the parameters and database for FASTA, (iii) select the Batch Mode option, (iv) run the analysis.
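Generating a list of queries from one multiple-sequence input can be sketched in Python as below. The sketch assumes the input has already been converted to FASTA format; the function name is hypothetical and the code is not taken from the server.

```python
def split_fasta(text):
    """Split multi-FASTA text into a list of (header, sequence) records."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records
```

Each record in the returned list corresponds to one entry from which an individual search could be started.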
Alignment editor and analyses servers:
GWFASTA has integrated different post-alignment processing servers so as to help users in further refining their data.

Java-based Alignment analysis server:
Jalview is a Java-based alignment editor developed by Michele Clamp (EBI). It helps in editing and beautifying the alignment from ClustalW.
Analysis of Multiple Aligned Sequences: AMAS, developed by Geoff Barton at EBI, is a program for analyzing multiple alignments of protein sequences.
Protein Sequence Alignment: PSA is a server developed locally by Raghava for presenting the physico-chemical properties of amino acids along a protein's primary structure, in graphical or tabular form. This helps in understanding the function of a protein.