Inteins - protein introns

Last update - June, 2004.

Hint domains superfamily

Bacterial intein-like (BIL) domains

Inteins are selfish DNA elements inserted in-frame and translated together with their host proteins. This precursor protein undergoes an autocatalytic protein splicing reaction resulting in two products: the host protein and the intein.

Protein-splicing scheme. A precursor protein is shown on the right, with the intein protein-splicing domain shown in red and the host protein flanks (exteins) shown in blue. The intein protein-splicing domain autocatalyzes its own excision and the ligation of its two flanks.
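
The outcome of the splicing reaction can be sketched at the sequence level as a simple string operation. This is only an illustration of the two products, not a model of the chemistry; the function name `splice` and the toy sequences are assumptions made for the example.

```python
from typing import NamedTuple


class SplicingProducts(NamedTuple):
    host_protein: str  # ligated N- and C-terminal exteins
    intein: str        # excised intein


def splice(precursor: str, intein_start: int, intein_end: int) -> SplicingProducts:
    """Excise precursor[intein_start:intein_end] and join the two flanks."""
    if not 0 < intein_start < intein_end < len(precursor):
        raise ValueError("the intein must lie between two non-empty exteins")
    n_extein = precursor[:intein_start]
    intein = precursor[intein_start:intein_end]
    c_extein = precursor[intein_end:]
    return SplicingProducts(host_protein=n_extein + c_extein, intein=intein)


# Hypothetical toy sequences, not real protein sequences.
n_extein, intein, c_extein = "MKTAY", "CLSGDTHN", "SEEQR"
products = splice(n_extein + intein + c_extein,
                  len(n_extein), len(n_extein) + len(intein))
print(products.host_protein)
print(products.intein)
```

The precursor yields exactly two products: the host protein with its flanks joined, and the free intein.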

Protein splicing was shown to occur in heterologous organisms, in different in-vitro systems and with various natural and engineered host flanks. Hence, protein splicing is probably independent of specific host cell factors.

Most inteins have an endonuclease domain inserted in the protein splicing domain. The endonuclease activity of these inteins can mediate the specific transposition of their gene into unoccupied integration sites of intein-less homologs (homing).
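
Homing can be sketched on toy DNA strings as follows. The sketch assumes, beyond what is stated above, that the endonuclease recognition site spans the integration point, so that an inserted intein gene splits the site and an occupied allele is no longer a target. The function `home`, the site sequence, and the cut offset are all hypothetical.

```python
def home(allele: str, recognition_site: str, cut_offset: int, intein_gene: str) -> str:
    """Insert intein_gene at the cut point inside the recognition site.

    If the recognition site is absent (for example because the allele
    already carries the intein), the allele is returned unchanged.
    """
    pos = allele.find(recognition_site)
    if pos == -1:
        return allele
    cut = pos + cut_offset
    return allele[:cut] + intein_gene + allele[cut:]


# Hypothetical toy sequences; the intein gene length is a multiple of 3
# so that the insertion stays in-frame.
intein_less = "AAATTTGGGCCCAAA"
site = "TTTGGG"
intein_gene = "ACGTACGTA"

occupied = home(intein_less, site, cut_offset=3, intein_gene=intein_gene)
print(occupied)
print(home(occupied, site, cut_offset=3, intein_gene=intein_gene) == occupied)
```

In this toy model the second call leaves the allele untouched, mirroring the restriction of homing to unoccupied integration sites.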

Inteins have a diverse and sporadic distribution across species and proteins. They occur in all three domains of life but so far have been found in relatively few species. Inteins are currently known in more than 50 types of proteins with diverse functions. These proteins include metabolic enzymes, DNA and RNA polymerases, proteases, ribonucleotide reductases, and vacuolar-type ATPases. Intein integration points also vary in structure and function. Their only apparent common feature is that they lie in highly conserved protein motifs.

The intein protein domain family is part of the Hint superfamily, named after the characteristic structural fold first identified in Hedgehog and intein protein domains (Hall et al. '97). Four characterized Hint domain families are currently known: Hog Hints, inteins, and two types of Bacterial intein-like (BIL) domains. In addition to sharing the same structural fold and common sequence features, Hint domains have similar biochemical activities. The domains post-translationally process the proteins in which they are present by protein-splicing, self-cleavage or ligation activities.

This site mainly introduces inteins and some of their sister families. It explores the relation between the activities of these domains, their sequence motifs, and protein structure. We also show how these are related to the different biological roles and evolution modes of inteins and intein-like domains. A database of inteins is maintained at New England Biolabs. An intein registry, publications, sequence search and information on the intein mechanism can be found there.

Intein proteins contain a number of conserved sequence motifs (blocks). The motifs can be grouped into three domains according to their location and inferred function. Intein structures show that the intein protein-splicing and endonuclease active sites are formed from conserved motifs. The intein domain organization deduced by sequence analysis corresponds exactly to the structural domains.

Domain structure of a typical intein with a LAGLIDADG-type endonuclease domain, in order: N domain, EN domain (optional), C domain. Scale used in the diagram: each "-" is 8 amino acids, "=" marks a motif region.

Protein-splicing motifs. Positions of the protein-splicing motifs (in red) in an intein structure (Mxe GyrA). The N2 and N4 structural motifs are in green. A single amino acid N-terminal to the intein is shown in blue. This intein does not have an endonuclease domain.

The protein-splicing N-terminal (N) domain spans about 100-150 aa. Mutations in this domain affect the first step in protein splicing, an N-S/O acyl shift in the peptide bond connecting the N terminus of the intein to the N-terminal flank (intein flanks are termed exteins). The N domain motifs are similar in sequence and function to motifs in the Hint domains of the Hog regions found in the Hedgehog animal developmental proteins and in a few related nematode protein families. Hog Hint domains self-process their precursor proteins, cleaving themselves off the N-terminal parts of the proteins. Cleavage was shown to use a cholesterol molecule, which consequently becomes covalently attached at the cleavage site to the N-terminal part and modulates the activity of that part (Porter et al. '96). The relation between inteins and Hog Hint domains was first inferred from their similarity in sequence and function. Determination of protein structures from both families provided final proof for the common origin of inteins and Hog Hint domains.
More on the structure and relation of inteins and Hog Hint domains.

The intein protein-splicing C-terminal (C) domain is composed of two adjacent motifs in the C-terminal 25-40 aa (including the conserved aa immediately C-terminal to the intein). Residues in these motifs are necessary for catalyzing the next steps of protein splicing: branch formation and its resolution.

Most (but not all) inteins also include a central endonuclease (EN) domain. The EN domain is usually of the LAGLIDADG (dodecapeptide) homing endonuclease type. Intein LAGLIDADG EN domains are characterized by 4 motifs that probably form the endonuclease active site (Duan et al. '97). An intein from the cyanobacterium Synechocystis sp. PCC6803 (Ssp gyrB) has a different type of endonuclease domain. In this intein the endonuclease domain contains an HNH motif. This motif is found in various homing and other endonucleases (Shub et al. '94, Gorbalenya '94).

The endonuclease domain is optional in inteins. Three observations support this: mutations in it affect the intein endonuclease activity but not the protein-splicing activity, some inteins lack this domain, and inteins were shown to protein splice with this domain removed (Chong and Xu '97, Derbyshire et al. '97).
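
The three-domain organization, with its optional EN domain, can be represented as a small data structure. This is a minimal sketch; the class `Intein`, its field names, and the placeholder sequences are assumptions for illustration and do not reflect any real intein.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Intein:
    n_domain: str                    # N-terminal protein-splicing domain
    c_domain: str                    # C-terminal protein-splicing domain
    en_domain: Optional[str] = None  # central endonuclease domain, if present

    @property
    def sequence(self) -> str:
        """Full intein sequence in N, EN (if any), C order."""
        return self.n_domain + (self.en_domain or "") + self.c_domain

    @property
    def is_minimal(self) -> bool:
        """True for an intein with no endonuclease domain."""
        return self.en_domain is None

    def without_endonuclease(self) -> "Intein":
        """Return a copy with the EN domain removed."""
        return replace(self, en_domain=None)


# Placeholder sequences (hypothetical), far shorter than real domains.
full = Intein(n_domain="N" * 12, en_domain="E" * 20, c_domain="C" * 4)
mini = full.without_endonuclease()
print(len(full.sequence), len(mini.sequence), mini.is_minimal)
```

Removing the EN domain leaves the two protein-splicing domains unchanged, which is the sense in which the domain is optional.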

Functional inteins with no EN domain (minimal inteins), the relation of the protein-splicing domain to other Hint domains, and the presence of different EN domains in inteins all indicate that the primeval inteins had no EN domains. Different EN domains, perhaps from homing endonucleases, and DNA-binding domains invaded intein genes to form the typical present-day intein. Some present-day minimal inteins clearly lost their EN domain (such as Mxe_gyrA, see Telenti et al. '97 and Klabunde et al. '98) and some may never have acquired one.

Evolution of intein genes

Inteins are found in all three domains of life: Archaea, Bacteria, and Eukaryotes. However, their distribution is sporadic in species and in hosts. Some species have no inteins, some just one and Methanococcus jannaschii has nineteen. For species with completely sequenced genomes like E. coli, M. jannaschii and S. cerevisiae we know the total number of inteins in the strain sequenced. For other species we can only estimate their number. Intein distribution seems most varied in archaea. This table compares the inteins found in archaea with fully sequenced genomes.

One major group of organisms in which inteins are not known is multicellular eukaryotes, both metazoa and plants. The multicellular red alga Porphyra does contain an intein, but in its chloroplast genome. The reasons for this absence are not clear. Inteins may yet be found in these organisms and turn out to be only scarcer or perhaps difficult to detect. It is interesting to note that some intein-containing organisms, such as Mycobacterium tuberculosis and the CIV virus, are intracellular pathogens of metazoa. Thus, the opportunity for intein invasion into animal genomes does exist (more details).

Intein distribution, May 2001. Red: species with inteins. Black: species with fully sequenced genomes and no identified inteins. Gray: species with partially sequenced genomes and no identified inteins. The tree is based on phylogenetic data of Baldauf et al. '00 and Nelson et al. '00. Species abbreviations:

Bacteria- Aae: Aquifex aeolicus, Tma: Thermotoga maritima, Dvu: Desulfovibrio vulgaris, Hpy: Helicobacter pylori, Cje: Campylobacter jejuni, Ccr: Caulobacter crescentus, Rpr: Rickettsia prowazekii, Nme: Neisseria meningitidis, Ngo: Neisseria gonorrhoeae, Xfa: Xylella fastidiosa, Vch: Vibrio cholerae, Hin: Haemophilus influenzae, Eco: Escherichia coli, Pae: Pseudomonas aeruginosa, Bsu: Bacillus subtilis, Bha: Bacillus halodurans, Mpn: Mycoplasma pneumoniae, Mge: Mycoplasma genitalium, Uur: Ureaplasma urealyticum, Sau: Staphylococcus aureus, Cac: Clostridium acetobutylicum, Mtu: Mycobacterium tuberculosis, Mle: Mycobacterium leprae, Mav: Mycobacterium avium, Sco: Streptomyces coelicolor, Cpn: Chlamydia pneumoniae, Ctr: Chlamydia trachomatis, Ssp: Synechocystis sp. PCC6803, Pma: Prochlorococcus marinus, Pgi: Porphyromonas gingivalis, Cte: Chlorobium tepidum, Tde: Treponema denticola, Tpa: Treponema pallidum, Bbu: Borrelia burgdorferi, Det: Dehalococcoides ethenogenes, Dra: Deinococcus radiodurans.

Archaea- Pae: Pyrobaculum aerophilum, Ape: Aeropyrum pernix, Sso: Sulfolobus solfataricus, Tac: Thermoplasma acidophilum, Hsp: Halobacterium sp. NRC-1, Pfu: Pyrococcus furiosus, Pho: Pyrococcus horikoshii, Pab: Pyrococcus abyssi, Afu: Archaeoglobus fulgidus, Mth: Methanobacterium thermoautotrophicum, Mja: Methanococcus jannaschii, Mba: Methanosarcina barkeri.

Eukaryotes- Cel: Caenorhabditis elegans, Dme: Drosophila melanogaster, Hsa: Homo sapiens, Cal: Candida albicans, Sce: Saccharomyces cerevisiae, Spo: Schizosaccharomyces pombe, cpPpu: Porphyra purpurea chloroplast, Ath: Arabidopsis thaliana, Pfa: Plasmodium falciparum, Tbr: Trypanosoma brucei.

Some protein families, such as ribonucleotide reductases and archaeal DNA polymerase type B, are more prone to contain inteins. These proteins contain inteins in different organisms and in different integration sites. Some of the ribonucleotide reductases and most of the DNA polymerases with inteins contain more than one intein.

Currently (June 2004) about 200 inteins have been identified in more than 100 different species and strains, in more than 50 families of protein hosts (details here).

Inteins found at homologous integration sites are most probably homologous too. However, it is not clear in which cases this relation between the inteins is due to vertical transfer (the usual inheritance, from an organism to its progeny) or horizontal transfer (movement of DNA across species). Inteins at homologous integration sites are termed intein alleles.

Inteins are known to integrate at more than 65 different sites. About half of these have two or more alleles (details here).
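
Counting integration sites and their alleles amounts to grouping inteins by host family and integration site. The sketch below shows that grouping on made-up records; the record layout, the labels, and the function `group_alleles` are hypothetical and carry no real data.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical records for illustration only:
# (intein, host protein family, integration site).
RECORDS: List[Tuple[str, str, str]] = [
    ("intein_1", "family_A", "site_a"),
    ("intein_2", "family_A", "site_a"),
    ("intein_3", "family_A", "site_b"),
    ("intein_4", "family_B", "site_c"),
    ("intein_5", "family_B", "site_c"),
    ("intein_6", "family_B", "site_c"),
]


def group_alleles(
    records: List[Tuple[str, str, str]],
) -> Dict[Tuple[str, str], List[str]]:
    """Group inteins by (host family, integration site).

    Inteins that share a key occupy homologous integration sites
    and are therefore alleles of each other.
    """
    sites = defaultdict(list)
    for intein, family, site in records:
        sites[(family, site)].append(intein)
    return dict(sites)


sites = group_alleles(RECORDS)
multi_allele_sites = [key for key, alleles in sites.items() if len(alleles) >= 2]
print(len(sites), "integration sites,",
      len(multi_allele_sites), "with two or more alleles")
```

Applied to a real intein registry, the same grouping would give the number of distinct integration sites and the fraction of them with two or more alleles.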

Additional information on inteins can be found at the pages listed at the top of this intein home page.

Page last modified June 2004
Shmuel Pietrokovski