Introduction to making and using protein multiple alignments

(tutorial notes, March 1999)

Preface

Multiple alignments of protein sequences are important tools in studying proteins. The basic information they provide is identification of conserved sequence regions. This is very useful in designing experiments to test and modify the function of specific proteins, in predicting the function and structure of proteins, and in identifying new members of protein families.

These notes focus on methods to create, evaluate and use protein multiple alignments. They are based on updates of tutorials I presented at the ISMB-97 conference in Greece in June 1997 and at the EMBO practical course on Functional genomics in plants in Israel in February 1999. Several publicly available methods and databases are detailed (all are free and most can be used on the WWW).


Page last modified April 1999
Shmuel Pietrokovski <pietro@weizmann.ac.il>

Before starting

Multiple alignment should be tried for any group of related proteins we are studying. There are currently several good automatic and freely available tools that simplify the procedure and the initial alignment is relatively fast. An initial alignment will inform us if the procedure is relevant for our purpose. The two extreme cases are non-informative:
  1. sequences that have only recently diverged from each other and did not change much. These sequences will be very similar across their entire length. We cannot tell apart regions that are conserved due to their importance from regions that did not have time to diverge.
  2. sequences that are very diverged from each other (or perhaps are not related to begin with) and have no similar sequence regions. It still is possible that the proteins are related and have corresponding regions, but we cannot identify these by their sequence. This can change if we have more data (more sequences and/or experimental data) or use better alignment methods.

Multiply aligning protein sequences is a relatively advanced step in their analysis. The alignment can reveal information not found by methods analyzing single sequences. Nevertheless, one should initially use 'single sequence' analysis methods. It is better to go from the simple to the complex. For example, before trying to identify new family members by querying a database with a multiple alignment of the family, one should try straightforward database searches with the single sequences. The results of the single sequence searches can provide more sequences for the alignment, and can also help confirm the multiple alignment and the results of searches made with it.

Preparing the data

What is a group of related proteins?
Protein sequences can be related by homology or convergence. In both cases multiple alignment can be attempted. Homologous proteins have a common ancestor and typically a common function. Converged proteins independently evolve to have common sequence features that typically perform a common function or have a common structure. Examples of convergence are the various hydrolase enzymes that cleave ester and polypeptide bonds by a common catalytic triad structure and the different archaeal, eubacterial and eukaryotic proteins that have a helix-turn-helix DNA binding domain.

Do not confuse the terms homology and similarity. Sequences either have or do not have a common ancestor. Thus, sequences can either be homologous or not, but they cannot be "70% homologous". However, sequences can be similar to different degrees and therefore be "70% similar". In addition, such a statement is not informative unless we know the significance of the similarity, whether it is across the whole sequence/region or just in conserved regions, and by what method (program) it was found. Proteins that have significant sequence similarity are most often homologous.
Reeck GR, et al.: "'Homology' in proteins and nucleic acids: a terminology muddle and a way out of it." Cell, 50:667 (1987).

How to find sequences of related proteins?
Frequently we only have a single protein sequence. Querying sequence databases with the sequence we have can identify other members. The search should be repeated with every found member until no more sequences are found. A keyword search of the database can be done when only the protein name is known and when we wish to verify that all sequences were found. Scientific literature is also a source for identifying related sequences, especially very diverged members that might be missed in a single sequence search.
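
The "repeat the search with every found member" step can be sketched as a simple loop. This is only an illustration: collect_family, toy_search and the TOY_HITS table are hypothetical names, and the search argument stands in for whatever database search is actually used (for example a wrapper around a BLAST run that returns the identifiers of the significant hits).

```python
from collections import deque
from typing import Callable, Iterable, Set


def collect_family(seed: str, search: Callable[[str], Iterable[str]]) -> Set[str]:
    """Repeat the search with every newly found member until nothing new turns up."""
    found = {seed}
    queue = deque([seed])
    while queue:
        query = queue.popleft()
        for hit in search(query):
            if hit not in found:  # only new members trigger another search
                found.add(hit)
                queue.append(hit)
    return found


# Toy stand-in for a database search (hypothetical data):
# each identifier maps to the identifiers it "hits".
TOY_HITS = {
    "A": ["B"],
    "B": ["A", "C"],
    "C": ["B"],
    "D": ["D"],  # unrelated sequence, never reached from "A"
}


def toy_search(query: str) -> Iterable[str]:
    return TOY_HITS.get(query, [])


if __name__ == "__main__":
    print(sorted(collect_family("A", toy_search)))
```

Note that "C" is found only through "B", which is the reason for repeating the search with each new member. Deciding which hits are significant enough to be returned by the search is left to the researcher.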

BLAST is a good choice for sequence-to-sequence searches. The search can be done through the WWW and e-mail servers on several extensive protein and sequence databases ("http://www.ncbi.nlm.nih.gov/BLAST" and "blast@ncbi.nlm.nih.gov"). The programs are also available for UNIX computers at "ftp://ncbi.nlm.nih.gov/blast". Updated databases for searching are found at "ftp://ncbi.nlm.nih.gov/blast/db".

Deciding if a sequence belongs to a group relies on the significance of its similarity to other members, on whatever is known about the protein, and most importantly on the researcher's knowledge of the protein(s) s/he is studying. There is no substitute for these insights and experience.

How many sequences are needed?
The more sequences we have the better. Multiple alignments of two and three sequences have limited usefulness. Try to use as many sequences as you have (but see below about the redundancy issue). If you think your sequences form sub-groups then try to also align these separately.

Excluding redundant sequences
Redundant sequences are separate sequences that are highly similar to each other. The extreme example is a set of identical sequences. Obviously only one sequence of the set should be used, since the rest do not contribute to finding the alignment. Worse, the redundant sequences will bias the alignment toward their own features. Ideally all sequences in the aligned group should have a comparable similarity to each other. (This similarity can be assessed from the single sequence database searches.) A good rule of thumb in cases of varied similarities is to leave only a single sequence from each set whose members are more than 70-80% similar to each other. This threshold could be modified for different cases and the results evaluated (see below).

Sequences removed from the alignment step can later be joined to the alignment if they have any difference in the aligned regions and if the resulting alignment can be sequence weighted (see below).
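
The rule of thumb for excluding redundant sequences can be sketched as a greedy filter. The sketch makes two simplifying assumptions that are not part of the notes: similarity is measured as plain percent identity, and the segments compared are already aligned to the same length (here, five segments taken from the alcohol dehydrogenase block shown later in these notes). The function names and the 0.8 default are illustrative only.

```python
from typing import Dict, List


def percent_identity(a: str, b: str) -> float:
    """Fraction of identical positions between two equal-length aligned segments."""
    if len(a) != len(b):
        raise ValueError("segments must be aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def drop_redundant(segments: Dict[str, str], threshold: float = 0.8) -> List[str]:
    """Greedily keep a sequence only if it is not too similar to any kept one."""
    kept: List[str] = []
    for name, seg in segments.items():
        if all(percent_identity(seg, segments[k]) <= threshold for k in kept):
            kept.append(name)
    return kept


# Five segments from block BL00913C (iron-containing alcohol dehydrogenases).
SEGMENTS = {
    "ADH2_ZYMMO": "VHAMAHQLGGYYNLPHGVCNAV",
    "ADH4_YEAST": "VHALAHQLGGFYHLPHGVCNAV",
    "ADHA_CLOAB": "CHPMEHELSAYYDITHGVGLAI",
    "ADHB_CLOAB": "VHLMEHELSAYYDITHGVGLAI",
    "GLDA_BACST": "HNGFTALEGEIHHLTHGEKVAF",
}

if __name__ == "__main__":
    print(drop_redundant(SEGMENTS))
```

The result of a greedy filter depends on the order of the input, so which member of a redundant set is kept is arbitrary here. In practice one may prefer to keep the best characterized sequence of each set.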

Multiply aligning protein sequences

Sequences can be aligned across their entire length (global alignment) or only in certain regions (local alignment). This is true for pairwise and multiple alignments. Global alignments need to use gaps (representing insertions/deletions) while local alignments can avoid them, aligning regions between gaps.

Global vs. Local Multiple alignments

This tutorial will concentrate on methods for local multiple alignments.

Different available (1) multiple alignment tools

Global multiple alignment

ClustalW
  WWW location: "http://www2.ebi.ac.uk/clustalw/"
  E-mail server (2): -
  Program source (3):
    "ftp://ftp.ebi.ac.uk/pub/software/unix/clustalw" (UNIX)
    "ftp://ftp.ebi.ac.uk/pub/software/vax/clustalw" (VAX)
    "ftp://ftp.ebi.ac.uk/pub/software/mac/clustalw" (Mac)
    "ftp://ftp.ebi.ac.uk/pub/software/dos/clustalw" (DOS)

Local multiple alignment

BlockMaker
  WWW location: "http://bioinfo.weizmann.ac.il/blocks"
  E-mail server (2): "Blockmaker@blocks.fhcrc.org"
  Program source (3): "ftp://ncbi.nlm.nih.gov/repository/blocks/unix/protomat" (UNIX)

MEME
  WWW location: "http://www.sdsc.edu/MEME"
  E-mail server (2): -
  Program source (3): "ftp://ftp.sdsc.edu/pub/sdsc/biology/meme/" (UNIX)

MACAW
  WWW location: -
  E-mail server (2): -
  Program source (3): "ftp://ncbi.nlm.nih.gov/pub/macaw" (Windows and Mac)

  1. The internet is dynamic by nature and the availability and addresses of these tools are likely to change over time.
  2. There may be e-mail servers for these programs that I am not aware of.
  3. There are additional internet sites with program sources for most programs.

Evaluating local multiple alignments

Some programs give quantitative measures for the significance of the alignment. These are usually based on the chance occurrence of such alignments and depend on the size and composition of the aligned sequences. Empirical measures are also extremely useful for deciding the 'correctness' of the multiple alignment. Consistency is a powerful measure for correct multiple alignments. If the same alignment is found in the sequence-to-sequence searches and by various multiple alignment methods it is most probably correct. One pitfall to avoid is biased sequence composition that may lead to trivial alignments.

Experimental data can be used in evaluating, and even constructing, multiple alignments. For example, if we know the catalytic site in the aligned proteins we expect the sites to be aligned together and may 'force' that alignment. Such manual alignments can serve as a seed to an alignment with more sequences.

Local multiple alignments (blocks) from different programs can be joined or used together. Another approach is 'divide and conquer'. Blocks present in all sequences divide them into separate parts, in each of which more blocks can be searched for.

[Figure: multiple alignment by divide and conquer]
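
The 'divide and conquer' step can be sketched as follows, assuming a block has already been found in every sequence by one of the programs above. The helper name split_at_block, the 0-based start positions and the toy sequences are assumptions made for this illustration.

```python
from typing import Dict, Tuple


def split_at_block(
    sequences: Dict[str, str], starts: Dict[str, int], width: int
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Split every sequence into the parts before and after a shared block.

    starts holds the 0-based position of the block in each sequence and
    width is the block width (blocks are ungapped, so it is the same for all).
    """
    left: Dict[str, str] = {}
    right: Dict[str, str] = {}
    for name, seq in sequences.items():
        start = starts[name]
        left[name] = seq[:start]            # region N-terminal to the block
        right[name] = seq[start + width:]   # region C-terminal to the block
    return left, right


# Toy sequences (hypothetical) sharing the block "AYHG".
TOY_SEQS = {"s1": "MKTAYHGDLLV", "s2": "MAYHGEEIV"}
TOY_STARTS = {"s1": 3, "s2": 1}

if __name__ == "__main__":
    before, after = split_at_block(TOY_SEQS, TOY_STARTS, width=4)
    print(before, after)
```

Each of the two resulting sets of sub-sequences can then be submitted separately to a local alignment program to look for further blocks.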

Uses of multiple alignment

The basic information from a multiple alignment of protein sequences is the position and nature of the conserved regions in each member of the group. Conserved sequence regions correspond to functionally and structurally important parts of the protein. We often only know the sequence-to-function relation for one or two members of the group. Multiple alignments let us transfer that knowledge to the other members in the group. Hypotheses about functional importance or specific roles can then be directly tested by mutagenesis and truncation experiments.

Viewing
Multiple alignments of many sequences and those with different sequence weights are difficult to visualize. Sequence logos are a graphical way of presenting multiple alignments.
ID   ADH_IRON_1; BLOCK
AC   BL00913C; distance from previous block=(56,76)
DE   Iron-containing alcohol dehydrogenases proteins.
BL   HHG motif; width=22; seqs=11; 99.5%=492; strength=1428
ADHE_CLOAB ( 720) CHSMAIKLSSEHNIPSGIANAL  66

FUCO_ECOLI ( 262) VHGMAHPLGAFYNTPHGVANAI  44

GLDA_BACST ( 259) HNGFTALEGEIHHLTHGEKVAF 100

GLDA_ECOLI ( 269) VHNGLTAIPDAHHYYHGEKVAF 100

MEDH_BACMT ( 259) VHSISHQVGGVYKLQHGICNSV  78

ADH1_CLOAB ( 258) CHSMAHKTGAVFHIPHGCANAI  47
ADHE_ECOLI ( 721) CHSMAHKLGSQFHIPHGLANAL  47

ADH2_ZYMMO ( 261) VHAMAHQLGGYYNLPHGVCNAV  36
ADH4_YEAST ( 263) VHALAHQLGGFYHLPHGVCNAV  41

ADHA_CLOAB ( 266) CHPMEHELSAYYDITHGVGLAI  50
ADHB_CLOAB ( 266) VHLMEHELSAYYDITHGVGLAI  49
//
[Figure: logo of block BL00913C]
Block and logo of a conserved region in iron-containing alcohol dehydrogenases. The block is first transformed into a position specific scoring matrix (PSSM) that allows for the sequence weights and expected frequencies of different amino acids (aa). The logo shows the aa present in each alignment position. The taller an aa and its stack, the more conserved they are. The conservation is shown in bits and the aa are shaded according to their properties. The conserved histidines probably bind the ferrous ion(s) required for these enzymes' activity.
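
To give a feeling for what "conservation in bits" means, here is a minimal sketch that computes the information content of each column of the block above. It is deliberately simpler than the logo described in the caption: every sequence gets the same weight, all 20 aa are assumed equally expected, and no small-sample correction is applied, so the values will differ from those of the actual logo.

```python
import math
from collections import Counter
from typing import List

# The eleven segments of block BL00913C, in the order shown above.
BLOCK = [
    "CHSMAIKLSSEHNIPSGIANAL",  # ADHE_CLOAB
    "VHGMAHPLGAFYNTPHGVANAI",  # FUCO_ECOLI
    "HNGFTALEGEIHHLTHGEKVAF",  # GLDA_BACST
    "VHNGLTAIPDAHHYYHGEKVAF",  # GLDA_ECOLI
    "VHSISHQVGGVYKLQHGICNSV",  # MEDH_BACMT
    "CHSMAHKTGAVFHIPHGCANAI",  # ADH1_CLOAB
    "CHSMAHKLGSQFHIPHGLANAL",  # ADHE_ECOLI
    "VHAMAHQLGGYYNLPHGVCNAV",  # ADH2_ZYMMO
    "VHALAHQLGGFYHLPHGVCNAV",  # ADH4_YEAST
    "CHPMEHELSAYYDITHGVGLAI",  # ADHA_CLOAB
    "VHLMEHELSAYYDITHGVGLAI",  # ADHB_CLOAB
]


def column_information(block: List[str]) -> List[float]:
    """Information content (bits) of each column: log2(20) minus the column entropy."""
    info = []
    for i in range(len(block[0])):
        counts = Counter(seq[i] for seq in block)
        total = sum(counts.values())
        entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
        info.append(math.log2(20) - entropy)
    return info


if __name__ == "__main__":
    for position, bits in enumerate(column_information(BLOCK), start=1):
        print(f"{position:2d} {bits:.2f}")
```

In this unweighted view the glycine at position 17 of the block is identical in all eleven segments and therefore reaches the maximum of log2(20) bits, while the histidines at positions 2 and 16 each have a single exception.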


A different graphical view of multiply aligned sequences is a tree relating them by their sequence similarity. This is very useful when the aligned sequences are of several functional subtypes and we wish to know to which one our sequence(s) belong. A way to estimate the significance of a tree is by bootstrap values. Simply put, these values show how many times each bifurcation (branching point) was observed with different resamplings of the input data. The higher the bootstrap fraction (number of observations/number of trials), the more confident we can be that the sequences emerging from that branch point cluster together.

A tree made from the three blocks in the iron-containing alcohol dehydrogenases family. Bootstrap values are for 100 trials. The tree was calculated from the blocks with the ClustalW program and drawn with the TreeView program.

[Figure: BL00913C tree]
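
The idea behind bootstrap values can be sketched without building a full tree. The example below resamples the columns of block BL00913C with replacement and counts in what fraction of the trials one sequence remains the closest neighbour of another. This is a simplification made for illustration: a real bootstrap analysis rebuilds the whole tree for every resampled alignment and counts each bifurcation, and the mismatch count used here as a distance is an assumption, not the measure used by ClustalW.

```python
import random
from typing import Dict

BLOCK = {
    "ADHE_CLOAB": "CHSMAIKLSSEHNIPSGIANAL",
    "FUCO_ECOLI": "VHGMAHPLGAFYNTPHGVANAI",
    "GLDA_BACST": "HNGFTALEGEIHHLTHGEKVAF",
    "GLDA_ECOLI": "VHNGLTAIPDAHHYYHGEKVAF",
    "MEDH_BACMT": "VHSISHQVGGVYKLQHGICNSV",
    "ADH1_CLOAB": "CHSMAHKTGAVFHIPHGCANAI",
    "ADHE_ECOLI": "CHSMAHKLGSQFHIPHGLANAL",
    "ADH2_ZYMMO": "VHAMAHQLGGYYNLPHGVCNAV",
    "ADH4_YEAST": "VHALAHQLGGFYHLPHGVCNAV",
    "ADHA_CLOAB": "CHPMEHELSAYYDITHGVGLAI",
    "ADHB_CLOAB": "VHLMEHELSAYYDITHGVGLAI",
}


def mismatches(a: str, b: str) -> int:
    """Number of positions at which two aligned segments differ."""
    return sum(x != y for x, y in zip(a, b))


def bootstrap_pair_support(
    block: Dict[str, str], a: str, b: str, trials: int = 100, seed: int = 0
) -> float:
    """Fraction of resampled alignments in which b is the closest sequence to a."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    width = len(block[a])
    others = [name for name in block if name != a]
    hits = 0
    for _ in range(trials):
        # Sample alignment columns with replacement.
        cols = [rng.randrange(width) for _ in range(width)]
        resampled = {n: "".join(s[c] for c in cols) for n, s in block.items()}
        nearest = min(others, key=lambda n: mismatches(resampled[a], resampled[n]))
        hits += nearest == b
    return hits / trials


if __name__ == "__main__":
    print(bootstrap_pair_support(BLOCK, "ADHA_CLOAB", "ADHB_CLOAB"))
```

A pair that differs at only a couple of positions, such as ADHA_CLOAB and ADHB_CLOAB, is expected to stay together in nearly all trials, whereas a grouping that rests on a few columns will be lost in many of them.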

Programs for making and drawing trees from multiple alignments:


Searching
Multiple alignments are powerful tools for identifying new members of the aligned group. It is possible to query databases of multiple alignments with single sequences and to query sequence databases with multiple alignments. It has been shown that such searches are more sensitive and selective than sequence-to-sequence searches. A simple (but very effective!) 'hybrid' approach is to use a properly made consensus sequence (cobbler, see BlockMaker above).
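
Querying a sequence with a multiple alignment can be sketched as follows: the block is converted to a simple log-odds scoring matrix, which is then slid along the sequence to find the best scoring position. This is not the scoring scheme of any of the programs listed below; the uniform background, the pseudocount of 1 and the absence of sequence weights are assumptions chosen to keep the example short.

```python
import math
from collections import Counter
from typing import Dict, List, Optional, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# The eleven segments of block BL00913C.
BLOCK = [
    "CHSMAIKLSSEHNIPSGIANAL", "VHGMAHPLGAFYNTPHGVANAI",
    "HNGFTALEGEIHHLTHGEKVAF", "VHNGLTAIPDAHHYYHGEKVAF",
    "VHSISHQVGGVYKLQHGICNSV", "CHSMAHKTGAVFHIPHGCANAI",
    "CHSMAHKLGSQFHIPHGLANAL", "VHAMAHQLGGYYNLPHGVCNAV",
    "VHALAHQLGGFYHLPHGVCNAV", "CHPMEHELSAYYDITHGVGLAI",
    "VHLMEHELSAYYDITHGVGLAI",
]


def build_pssm(block: List[str], pseudocount: float = 1.0) -> List[Dict[str, float]]:
    """One dict per column, mapping each aa to log2(observed / expected frequency)."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: all aa equally expected
    denom = len(block) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for i in range(len(block[0])):
        counts = Counter(seq[i] for seq in block)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / denom) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def best_hit(pssm: List[Dict[str, float]], sequence: str) -> Optional[Tuple[int, float]]:
    """Return (0-based offset, score) of the best ungapped match, or None if too short."""
    width = len(pssm)
    best = None
    for offset in range(len(sequence) - width + 1):
        window = sequence[offset:offset + width]
        # Unknown residues (e.g. X) contribute a neutral score of 0.
        score = sum(col.get(aa, 0.0) for col, aa in zip(pssm, window))
        if best is None or score > best[1]:
            best = (offset, score)
    return best


if __name__ == "__main__":
    # The ADH2_ZYMMO segment with arbitrary flanking residues added for the demo.
    query = "MKT" + "VHAMAHQLGGYYNLPHGVCNAV" + "LLK"
    print(best_hit(build_pssm(BLOCK), query))
```

Searching a database amounts to repeating best_hit for every sequence and ranking the scores. Judging which scores are significant is the hard part, and it is what the dedicated programs provide.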

Different available multiple alignment search programs

Blimps
  WWW location: "http://bioinfo.weizmann.ac.il/blocks/blocks_search.html"
  E-mail server: "blocks@howard.fhcrc.org"
  Program source: "ftp://ncbi.nlm.nih.gov/repository/blocks/unix/blimps" (UNIX)

MAST
  WWW location: "http://www.sdsc.edu/MEME/meme/website/mast-intro.html"
  E-mail server: -
  Program source: "ftp://ftp.sdsc.edu/pub/sdsc/biology/meme/" (UNIX)

LAMA
  WWW location: "http://bioinfo.weizmann.ac.il/blocks"
  E-mail server: -
  Program source: E-mail "mailto:pietro@weizmann.ac.il"

Blimps - a program to query both protein and nucleotide sequence databases with protein blocks and vice versa. The queries are single sequences or blocks. The program is available on the WWW and by e-mail server (send a message with the word "help" to find out the usage of the e-mail server) for searching multiple alignment databases with single sequences. It is available for installation on UNIX systems.

MAST - a program to query sequence databases with blocks. Protein or nucleotide databases are queried with protein blocks. The query can be obtained from the MEME or BlockMaker programs (see above). The query can be a single block or all the blocks of a protein family. The program receives input through the WWW and returns the results by e-mail. It is available for installation on UNIX systems.

LAMA - a program to search blocks databases with block queries. Queries can be obtained from the Blocks database, BlockMaker program or by reformatting multiple alignments ("http://bioinformatics.weizmann.ac.il/blocks/process_blocks.html"). The program receives input through the WWW and returns the results interactively or by e-mail.

PCR primer design
Design of degenerate PCR primers is emerging as a major use for multiple alignments. PCR can identify the sequence of a gene in genomic or other DNA from two short flanking segments (primers). Conserved sequence regions are (by definition) a good source for primer design. When designing primers, the conservation of the regions, the degeneracy of the genetic code and the parameters of the PCR reaction must be considered. The Blocks WWW server designs PCR primers for each family in the database, for sequence groups submitted to be aligned and for multiple alignments submitted to be reformatted. These primers are degenerate at the 3' end and consensus at the 5' end (CODEHOP: COnsensus DEgenerate Hybrid Oligonucleotide Primers). The design is fully automatic, but the user can set the requested Tm and genetic code, and can bias the primers toward some of the sequences. CODEHOP primers were shown to be more effective than simple degenerate primers in various cases.
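
The effect of genetic code degeneracy on primer design can be illustrated with a short calculation: the number of different DNA sequences that can encode a stretch of aa is the product of the number of codons of each aa. The sketch below uses the standard genetic code and is not the CODEHOP algorithm; it ignores Tm, codon usage and the conservation of each position, and the function names are illustrative.

```python
from math import prod
from typing import Tuple

# Number of codons per amino acid in the standard genetic code (61 sense codons).
CODONS_PER_AA = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4,
    "H": 2, "I": 3, "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6,
    "T": 4, "W": 1, "Y": 2, "V": 4,
}


def degeneracy(peptide: str) -> int:
    """Number of distinct DNA sequences that encode the given peptide."""
    return prod(CODONS_PER_AA[aa] for aa in peptide)


def least_degenerate_window(segment: str, width: int) -> Tuple[int, int]:
    """Return (degeneracy, 0-based offset) of the least degenerate window."""
    windows = [
        (degeneracy(segment[i:i + width]), i)
        for i in range(len(segment) - width + 1)
    ]
    return min(windows)


if __name__ == "__main__":
    print(degeneracy("HG"))
    print(least_degenerate_window("LSMWHG", 2))
```

Stretches rich in Met and Trp (one codon each) give the least degenerate primers, while Leu, Ser and Arg (six codons each) quickly inflate the number of primer species. Keeping only a short degenerate core at the 3' end, as described above, is one way to limit this.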


Selected articles

BLAST sequence alignment programs
ClustalW global multiple sequence alignment
BlockMaker local multiple protein sequences alignment
The Blocks database and its uses
COBBLER consensus sequences and evaluation of search methods
Codehop PCR primer design
LAMA block to block comparison