These notes focus on methods to create, evaluate and use protein multiple alignments. They are based on updates of tutorials I presented at the ISMB-97 conference in Greece in June 1997 and in the EMBO practical course on Functional genomics in plants in Israel in February 1999. Several publicly available methods and databases are detailed (all are free and most can be used on the WWW).
Multiply aligning protein sequences is a relatively advanced step in their analysis. The alignment can reveal information not found by methods analyzing single sequences. Nevertheless, one should initially use 'single sequence' analysis methods. It is better to go from the simple to the complex. For example, before trying to identify new family members by querying a database with a multiple alignment of the family, one should try straightforward database searches with the single sequences. The results of the single sequence searches can add sequences to the alignment and help confirm both the multiple alignment and the results of searches made with it.
Do not confuse the terms homology and similarity. Sequences either have or do not have a common ancestor. Thus, sequences can either be homologous or not, but they cannot be "70% homologous". However, sequences can be similar to different degrees and therefore be "70% similar". In addition, a statement like that is not informative unless we know the significance of this similarity, whether it is across the whole sequence/region or just in conserved regions, and by what method (program) it was found. Proteins that have significant sequence similarity are most
Reeck GR, et al.: "'Homology' in proteins and nucleic acids: a terminology muddle and a way out of it." Cell, 50:667 (1987).
How to find sequences of related proteins?
Frequently we only have a single protein sequence. Querying sequence databases with the sequence we have can identify other members of its family. The search should be repeated with every newly found member until no more sequences are found. A keyword search of the database can be done when only the protein name is known and when we wish to verify that all sequences were found. Scientific literature is also a source for identifying related sequences, especially very diverged members that might be missed in a single sequence search.
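The repeat-until-nothing-new search described above can be sketched in a few lines of Python. This is only an illustration: `search` is a hypothetical callable standing in for whatever database search and significance cut-off are actually used, and the hit lists in `TOY_HITS` are made up.

```python
from collections import deque
from typing import Callable, Iterable, Set


def collect_family(seed: str, search: Callable[[str], Iterable[str]]) -> Set[str]:
    """Repeat the search with every newly found member until nothing new turns up.

    `search` is a hypothetical stand-in: it takes a sequence identifier and
    returns the identifiers of the significant hits for that query.
    """
    found = {seed}
    queue = deque([seed])
    while queue:
        query = queue.popleft()
        for hit in search(query):
            if hit not in found:      # only new members are searched again
                found.add(hit)
                queue.append(hit)
    return found


# Toy, made-up hit lists; a real run would query a sequence database.
TOY_HITS = {
    "P1": ["P2", "P3"],
    "P2": ["P1"],
    "P3": ["P4"],      # P4 is only reachable through P3
    "P4": ["P3"],
    "P9": [],          # unrelated, never reached from P1
}

family = collect_family("P1", lambda q: TOY_HITS.get(q, []))
print(sorted(family))
```

In the toy data, P4 is found only because the search is repeated with P3, which is the reason for iterating rather than stopping after the first query.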
BLAST is a good choice for sequence-to-sequence searches. The search can be done through the WWW and e-mail servers on several extensive protein and sequence databases ("http://www.ncbi.nlm.nih.gov/BLAST" and "email@example.com"). The programs are also available for UNIX computers at "ftp://ncbi.nlm.nih.gov/blast". Updated databases for searching are found at "ftp://ncbi.nlm.nih.gov/blast/db".
Deciding if a sequence belongs to a group relies on the significance of its similarity to other members, on whatever is known about the protein, and most importantly on the researcher's knowledge of the protein(s) s/he is studying. There is no substitute for these insights and experience.
How many sequences are needed?
The more sequences we have the better. Multiple alignments of two and three sequences have limited usefulness. Try to use as many sequences as you have (but see below about the redundancy issue). If you think your sequences form sub-groups, then try to align these separately as well.
Excluding redundant sequences
Redundant sequences are separate sequences that are highly similar to each other. The extreme example is a set of identical sequences. Obviously only one sequence of the set should be used; the rest do not contribute to finding the alignment. Worse, the redundant sequences will bias the alignment toward their own features. Ideally all sequences in the aligned group should have a comparable similarity to each other. (This similarity can be assessed from the single sequence database searches.) A good rule of thumb in cases of varied similarities is to leave only a single sequence from each set whose members are more than 70-80% similar to one another. This threshold could be modified for different cases and the results evaluated (see below).
Sequences removed from the alignment step can later be joined to the alignment if they have any difference in the aligned regions and if the resulting alignment can be sequence weighted (see below).
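The rule of thumb for excluding redundant sequences can be sketched as a greedy filter. This is a minimal sketch under stated assumptions: `fraction_identical` compares equal-length, already aligned segments and is only a crude stand-in for the similarity obtained from database searches, and the function names and the 0.8 cut-off are choices made for this example.

```python
from typing import Callable, List


def fraction_identical(a: str, b: str) -> float:
    """Fraction of identical positions between two equal-length segments."""
    if len(a) != len(b):
        raise ValueError("segments must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def drop_redundant(
    seqs: List[str],
    similarity: Callable[[str, str], float] = fraction_identical,
    threshold: float = 0.8,
) -> List[str]:
    """Keep a sequence only if it is not too similar to any sequence already kept."""
    kept: List[str] = []
    for seq in seqs:
        if all(similarity(seq, k) <= threshold for k in kept):
            kept.append(seq)
    return kept


# Three segments taken from the ADH_IRON_1 block shown in these notes.
ADHA = "CHPMEHELSAYYDITHGVGLAI"
ADHB = "VHLMEHELSAYYDITHGVGLAI"   # differs from ADHA at 2 of 22 positions
ADH2 = "VHAMAHQLGGYYNLPHGVCNAV"

kept = drop_redundant([ADHA, ADHB, ADH2])
print(kept)
```

The result depends on input order, since the first member of each redundant set is the one retained. Trying several thresholds and comparing the resulting alignments, as suggested above, is straightforward with the `threshold` parameter.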
This tutorial will concentrate on methods for local multiple alignments.
|WWW location||E-mail server||Program source|
|Global multiple alignment|
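"ftp://ftp.ebi.ac.uk/pub/software/unix/clustalw" (UNIX), "ftp://ftp.ebi.ac.uk/pub/software/vax/clustalw" (VAX), "ftp://ftp.ebi.ac.uk/pub/software/mac/clustalw" (Mac), "ftp://ftp.ebi.ac.uk/pub/software/dos/clustalw" (DOS)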
|Local multiple alignment|
|MACAW||-||-||"ftp://ncbi.nlm.nih.gov/pub/macaw" (Windows and Mac)|
Experimental data can be used in evaluating, and even constructing, multiple alignments. For example, if we know the catalytic site in the aligned proteins we expect the sites to be aligned together and may 'force' that alignment. Such manual alignments can serve as a seed to an alignment with more sequences.
Local multiple alignments (blocks) from different programs can be joined or used together. Another approach is 'divide and conquer'. Blocks present in all sequences divide them into separate parts, in each of which more blocks can be searched for.
Multiple alignments of many sequences and those with different sequence weights are difficult to visualize. Sequence logos are a graphical way of presenting multiple alignments.
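To make the idea of a logo concrete, the sketch below computes the quantity a logo usually displays: the information content of each alignment column in bits, with each residue's letter height proportional to its frequency. It assumes unweighted sequences, a 20-letter alphabet and no small-sample correction, so it is a simplification of what logo programs do; the function names and the toy segments are made up for this example.

```python
import math
from collections import Counter
from typing import Dict, List


def column_information(column: str, alphabet_size: int = 20) -> float:
    """Information content of one column: log2(alphabet size) minus Shannon entropy."""
    n = len(column)
    entropy = -sum((c / n) * math.log2(c / n) for c in Counter(column).values())
    return math.log2(alphabet_size) - entropy


def logo_heights(segments: List[str]) -> List[Dict[str, float]]:
    """For each column, map each residue to its letter height in bits."""
    heights = []
    for column in zip(*segments):
        col = "".join(column)
        info = column_information(col)
        heights.append(
            {res: (cnt / len(col)) * info for res, cnt in Counter(col).items()}
        )
    return heights


# Toy aligned segments (made up): a fully conserved column, a two-residue
# column and a column in which every sequence differs.
toy = ["HGA", "HGC", "HAD", "HAE"]
for position, column in enumerate(logo_heights(toy), start=1):
    print(position, {res: round(h, 2) for res, h in column.items()})
```

A fully conserved column reaches the maximum of log2(20) bits, while a column with many different residues contributes short letters, which is why conserved positions stand out in a logo.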
ID   ADH_IRON_1; BLOCK
AC   BL00913C; distance from previous block=(56,76)
DE   Iron-containing alcohol dehydrogenases proteins.
BL   HHG motif; width=22; seqs=11; 99.5%=492; strength=1428
ADHE_CLOAB ( 720) CHSMAIKLSSEHNIPSGIANAL  66
FUCO_ECOLI ( 262) VHGMAHPLGAFYNTPHGVANAI  44
GLDA_BACST ( 259) HNGFTALEGEIHHLTHGEKVAF 100
GLDA_ECOLI ( 269) VHNGLTAIPDAHHYYHGEKVAF 100
MEDH_BACMT ( 259) VHSISHQVGGVYKLQHGICNSV  78
ADH1_CLOAB ( 258) CHSMAHKTGAVFHIPHGCANAI  47
ADHE_ECOLI ( 721) CHSMAHKLGSQFHIPHGLANAL  47
ADH2_ZYMMO ( 261) VHAMAHQLGGYYNLPHGVCNAV  36
ADH4_YEAST ( 263) VHALAHQLGGFYHLPHGVCNAV  41
ADHA_CLOAB ( 266) CHPMEHELSAYYDITHGVGLAI  50
ADHB_CLOAB ( 266) VHLMEHELSAYYDITHGVGLAI  49
//
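A block record like the one above is easy to read into a program. The sketch below is a hypothetical helper, not part of any Blocks software. It assumes that the number in parentheses is the start position of the segment in its sequence and that the trailing number is the sequence weight, and it checks each segment against the `width=` field. Because it matches segments by pattern rather than by line, it also works if the record has been collapsed onto one line.

```python
import re
from typing import List, NamedTuple


class Segment(NamedTuple):
    name: str       # sequence identifier
    start: int      # number in parentheses, assumed to be the segment start
    residues: str   # aligned residues of this sequence
    weight: int     # trailing number, assumed to be the sequence weight


SEGMENT_RE = re.compile(r"(\S+)\s+\(\s*(\d+)\)\s+([A-Z]+)\s+(\d+)")
WIDTH_RE = re.compile(r"width=(\d+)")


def parse_block(text: str) -> List[Segment]:
    """Extract the aligned segments of one block record."""
    width_match = WIDTH_RE.search(text)
    if width_match is None:
        raise ValueError("no width= field found")
    width = int(width_match.group(1))
    segments = [
        Segment(m.group(1), int(m.group(2)), m.group(3), int(m.group(4)))
        for m in SEGMENT_RE.finditer(text)
    ]
    for seg in segments:
        if len(seg.residues) != width:
            raise ValueError(f"{seg.name}: expected {width} residues")
    return segments


# Shortened copy of the record (first three segments only).
RECORD = """\
ID   ADH_IRON_1; BLOCK
AC   BL00913C; distance from previous block=(56,76)
DE   Iron-containing alcohol dehydrogenases proteins.
BL   HHG motif; width=22; seqs=11; 99.5%=492; strength=1428
ADHE_CLOAB ( 720) CHSMAIKLSSEHNIPSGIANAL  66
FUCO_ECOLI ( 262) VHGMAHPLGAFYNTPHGVANAI  44
GLDA_BACST ( 259) HNGFTALEGEIHHLTHGEKVAF 100
//
"""

segments = parse_block(RECORD)
for seg in segments:
    print(seg.name, seg.start, seg.residues, seg.weight)
```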
A tree made from the three blocks in the iron-containing alcohol dehydrogenases family. Bootstrap values are for 100 trials. The tree was calculated from the blocks with the ClustalW program and drawn with the TreeView program.
Programs for making and drawing trees from multiple alignments:
Multiple alignments are powerful tools for identifying new members of the aligned group. It is possible to query databases of multiple alignments with single sequences and to query sequence databases with multiple alignments. It has been shown that such searches are more sensitive and selective than sequence-to-sequence searches. A simple (but very effective!) 'hybrid' approach is to use a properly made consensus sequence (cobbler, see BlockMaker above).
|WWW location||E-mail server||Program source|
|Blimps||"http://bioinfo.weizmann.ac.il/blocks/blocks_search.html"||"firstname.lastname@example.org"||"ftp://ncbi.nlm.nih.gov/repository/blocks/unix/blimps" (UNIX)|
|MAST||"http://www.sdsc.edu/MEME/meme/website/mast-intro.html"||-||"ftp://ftp.sdsc.edu/pub/sdsc/biology/meme/" (UNIX)|
Blimps - a program to query both protein and nucleotide sequence databases with protein blocks and vice versa. The queries are single sequences or blocks. The program is available on the WWW and by e-mail server (send a message with the word "help" to find out the usage of the e-mail server) for searching multiple alignment databases with single sequences. It is available for installation on UNIX systems.
MAST - a program to query sequence databases with blocks. Protein or nucleotide databases are queried with protein blocks. The query can be obtained from the MEME or BlockMaker programs (see above). The query can be a single block or all the blocks of a protein family. The program receives input through the WWW and returns the results by e-mail. It is available for installation on UNIX systems.
LAMA - a program to search blocks databases with block queries. Queries can be obtained from the Blocks database, BlockMaker program or by reformatting multiple alignments ("http://bioinformatics.weizmann.ac.il/blocks/process_blocks.html"). The program receives input through the WWW and returns the results interactively or by e-mail.
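The general idea behind searching a sequence with a block can be illustrated with a position-specific scoring matrix. The sketch below is a generic illustration and not the scoring used by Blimps, MAST or LAMA: it assumes unweighted sequences, a uniform background frequency of 1/20 and a pseudocount of 1, all of which are simplifications chosen for this example. The target sequence is made up by placing the ADH4_YEAST segment between arbitrary flanking residues.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(segments: List[str], pseudocount: float = 1.0) -> List[Dict[str, float]]:
    """Log-odds score (bits) of every amino acid at every block position."""
    background = 1.0 / len(AMINO_ACIDS)          # assumed uniform background
    total = len(segments) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for column in zip(*segments):
        counts = Counter(column)
        pssm.append(
            {
                aa: math.log2(((counts.get(aa, 0) + pseudocount) / total) / background)
                for aa in AMINO_ACIDS
            }
        )
    return pssm


def best_window(pssm: List[Dict[str, float]], sequence: str) -> Tuple[int, float]:
    """Slide the matrix along the sequence; return (0-based start, score) of the best window."""
    width = len(pssm)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the block")
    best_start, best_score = 0, float("-inf")
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Residues outside the 20-letter alphabet (e.g. X) score zero.
        score = sum(col.get(res, 0.0) for col, res in zip(pssm, window))
        if score > best_score:
            best_start, best_score = start, score
    return best_start, best_score


# Three segments from the ADH_IRON_1 block serve as the query.
query = [
    "CHSMAHKTGAVFHIPHGCANAI",
    "CHSMAHKLGSQFHIPHGLANAL",
    "VHAMAHQLGGYYNLPHGVCNAV",
]
# Made-up target: arbitrary flanks around the ADH4_YEAST segment.
target = "MKT" + "VHALAHQLGGFYHLPHGVCNAV" + "GGS"

start, score = best_window(build_pssm(query), target)
print(start, round(score, 2))
```

Each column contributes its own scores, so a match at a conserved position counts for more than a match at a variable one. This is one reason such searches can be more sensitive than sequence-to-sequence comparison.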
PCR primer design
Design of degenerate PCR primers is emerging as a major use for multiple alignments. PCR can identify the sequence of a gene in genomic or other DNA from two short flanking segments (primers). Conserved sequence regions are (by definition) a good source for primer design. When designing primers, the conservation of the regions, the degeneracy of the genetic code and the parameters of the PCR reaction must be considered. The Blocks WWW server designs PCR primers for each family in the database, for sequence groups submitted to be aligned and for multiple alignments submitted to be reformatted. These primers are degenerate at the 3' end and consensus at the 5' end (codehop: COnsensus DEgenerate Hybrid Oligonucleotide Primers). The design is fully automatic, but the user can set the requested Tm and genetic code and bias the primers toward some of the sequences. codehop primers were shown to be more effective than simple degenerate primers in various cases.
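The role of genetic code degeneracy in primer design can be illustrated with a small calculation: the number of distinct DNA sequences that encode a peptide is the product of the codon counts of its residues. The sketch below uses the standard genetic code and looks for the least degenerate window in a conserved segment. It is not the codehop design procedure and ignores Tm and the other PCR parameters; the function names are made up for this example.

```python
from math import prod
from typing import Tuple

# Number of codons per amino acid in the standard genetic code.
CODONS_PER_RESIDUE = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4,
    "H": 2, "I": 3, "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6,
    "T": 4, "W": 1, "Y": 2, "V": 4,
}


def degeneracy(peptide: str) -> int:
    """Number of different DNA sequences that can encode the peptide."""
    return prod(CODONS_PER_RESIDUE[aa] for aa in peptide)


def least_degenerate_window(segment: str, width: int) -> Tuple[int, str, int]:
    """Return (0-based start, peptide, degeneracy) of the least degenerate window."""
    if width > len(segment):
        raise ValueError("window is wider than the segment")
    best_deg, best_start = min(
        (degeneracy(segment[i:i + width]), i)
        for i in range(len(segment) - width + 1)
    )
    return best_start, segment[best_start:best_start + width], best_deg


# A conserved segment from the ADH_IRON_1 block shown in these notes.
segment = "CHSMAHKLGSQFHIPHGLANAL"
print(least_degenerate_window(segment, 4))
```

Windows rich in Met and Trp (one codon each) give the least degenerate primers, while Leu, Ser and Arg (six codons each) are the most costly.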