Alternative Splicing Provides a Broad Menu of Proteins for Cells
Seventeen years ago, the completion of the Human Genome Project revealed that there are around 20,000 protein-coding genes in the human genome, a puzzling result given our intricate biology. Thanks to the advancement of large-scale proteomic studies over the decade following that milestone, researchers realized that some human cells contain billions of different polypeptides and that each gene can encode an array of proteins. The process of alternative splicing, which had first been observed 26 years before the Human Genome Project was finished, allows a cell to generate different RNAs, and ultimately different proteins, from the same gene. Since its discovery, it has become clear that alternative splicing is common and that the phenomenon helps explain how limited numbers of genes can encode organisms of staggering complexity. While fewer than 40 percent of the genes in a fruit fly undergo alternative splicing, more than 90 percent of genes are alternatively spliced in humans.
Astoundingly, some genes can be alternatively spliced to generate up to 38,000 different transcript isoforms, and each of the proteins they produce has a unique function. Like the chapters of a book, coding segments of the genome, known as exons, appear in series, and alternative splicing works by including or leaving out some of these genomic passages. Some chapters are required—that is, they are found in every transcript—and some are optional, so-called alternative exons. The differential splicing of these regions from an RNA transcript creates customized and condensed genetic messages. Molecular editors control the complicated flurry of exon selection by recognizing the chapters needed for a given protein and discarding the others. The final arrangement of exons in a spliced RNA molecule shapes the resulting protein’s structure and function.
Although much remains to be learned about how these molecular editors work, it is now clear that they can have serious consequences for a protein’s story, influencing whether it leads to healthy development or to disease.
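The chapter analogy above can be made concrete with a little counting. The sketch below is only an illustration: the five-exon gene and the function names are hypothetical. It enumerates every transcript obtainable by keeping the required exons and including or skipping each optional one. The second function counts isoforms for a gene built from clusters of mutually exclusive exons, using cluster sizes of 12, 48, 33, and 2. Those sizes are an assumption taken from the commonly cited fruit fly gene Dscam, which the text above does not name, and they multiply out to a figure close to the 38,000 quoted.

```python
from itertools import product
from math import prod


def enumerate_isoforms(exons):
    """List every isoform for exons given as (name, required) pairs in genomic order.

    Required exons appear in every isoform; each optional exon is
    independently included or skipped.
    """
    optional = [name for name, required in exons if not required]
    isoforms = []
    for choices in product([True, False], repeat=len(optional)):
        included = dict(zip(optional, choices))
        isoforms.append(
            tuple(name for name, required in exons if required or included[name])
        )
    return isoforms


def count_mutually_exclusive(cluster_sizes):
    """Count isoforms when exactly one exon is chosen from each cluster."""
    return prod(cluster_sizes)


# Hypothetical gene: E1, E3, E5 are required; E2 and E4 are alternative exons.
toy_gene = [("E1", True), ("E2", False), ("E3", True), ("E4", False), ("E5", True)]
isoforms = enumerate_isoforms(toy_gene)
for isoform in isoforms:
    print("-".join(isoform))

# Assumed cluster sizes (commonly cited for Drosophila Dscam).
print(count_mutually_exclusive([12, 48, 33, 2]))
```

Two optional exons already give four transcripts, and the count doubles with each additional optional exon, which is why a modest number of genes can yield a very large menu of proteins.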
The discovery of RNA splicing
In 1941, George Beadle and Edward Tatum established the field of molecular biology with their one gene–one enzyme hypothesis, which was later refined to one gene–one polypeptide. Yet exactly how a gene encoded a protein was still unclear. In the late 1950s, Francis Crick presented his central dogma of molecular biology, a unifying paradigm in which genetic information flows from DNA to RNA to protein. According to this model, RNA serves as an intermediate, suggesting that the molecule is simply a disposable DNA copy. Yet RNA’s role would turn out to be far more complex and important than that of a middleman.
In a series of experiments in 1977, Sue Berget, then a postdoc in Phil Sharp’s lab at MIT, demonstrated that viral messenger RNA (mRNA) is split—that is, it’s discontinuous relative to the original DNA sequence. Berget garnered this insight by isolating a viral gene and its corresponding mRNA and then combining the two molecules so that, with some chemical encouragement, the complementary sequences would base pair. Any noncomplementary sequences would be excluded, forming loops of single-stranded DNA that protruded from the double-stranded molecule. Berget, Sharp, and their colleagues used electron microscopy, the highest-resolution technique at the time, to visualize the RNA-DNA hybrid, and observed many such loops.
That same year, Rich Roberts and colleagues at Cold Spring Harbor Laboratory independently made the same finding. Sharp and Roberts would later be jointly awarded the Nobel Prize in Physiology or Medicine for the discovery of split genes. In 1978, Wally Gilbert, a colleague of Sharp, coined the terms intron (intragenic region) and exon (expressed region) to describe this novel concept of “genes in pieces.” This was not exclusive to viruses, either. The process of removing introns and joining coding regions together appeared to be conserved in virtually all organisms in the animal kingdom. The discovery of this basic mechanism, known as RNA splicing, introduced an important additional step to the central dogma and raised questions about how cells coordinate this process.
Biochemists in the 1980s tried to tackle this question. Using gradient sedimentation and chromatography techniques, they purified large splicing complexes and combined them in vitro to reconstitute the RNA-snipping process. The burgeoning popularity of mass spectrometry throughout the 1990s, paired with the growing number of genomes uploaded in sequence repositories, enabled the identification of individual splicing components. These days, we know that the assembled complex, the spliceosome, is a massive molecular machine composed of five small nuclear RNAs (snRNAs) at the core, which may be aided by an array of more than 80 accessory proteins. Together, these snRNA-protein complexes form small nuclear ribonucleoproteins (snRNPs, pronounced “snurps”) that make up the spliceosome. As an mRNA’s molecular editor, the spliceosome discriminates introns from exons and catalyzes the removal of introns, linking the exons into a mature mRNA that can be translated into a protein. (See illustration below.)
Still, from an evolutionary perspective, the idea of RNA splicing seemed bizarre to some researchers. In September of 2003, the Encyclopedia of DNA Elements (ENCODE) project was launched to identify the functional elements in the human genome, and the effort ignited controversies as to whether introns were genetic “junk” that the cell invested precious energy and resources to transcribe only to trash prior to translation. Alternative splicing gave these seemingly nonfunctional elements an essential role in gene expression, as evidence emerged over the next few years that there are sequences housed within introns that can help or hinder splicing activity. These enhancer and silencer sequences are recognized by RNA-binding proteins (RBPs) whose presence affects spliceosome docking and assembly. The RBPs allow exons or portions of exons to be combined or skipped in unique patterns, such that a single transcript can be spliced into several possible mature mRNA isoforms, or splice variants, each translated into proteins with potentially diverse functions. This overturned Beadle and Tatum’s hypothesis and illustrated that there was perhaps much more to the splicing story than had thus far been discovered.
How Alternative Splicing Works
While some details of the mechanisms of splicing remain to be worked out, it’s known that mature, edited mRNAs result from an interplay between multiple factors within and outside the transcript itself. Among these is the spliceosome, the machinery that carries out the splicing.
Each splicing event requires three components: the splice donor, a GU nucleotide sequence at the 5′ end of the intron; a splice acceptor, an AG nucleotide sequence at the opposite (3′) end; and a branch point, an A approximately 20–40 nucleotides upstream of the splice acceptor. These three “splice sites” are recognized by two core small nuclear RNAs (snRNAs) of the spliceosome, U1 and U2, along with a protein, U2AF. The binding of these molecules to a transcript recruits a complex of three more snRNAs (U4, U5, and U6), which facilitates the splicing reaction.
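The three sequence landmarks just described can be searched for in a transcript with a few lines of code. The sketch below is a deliberately naive scan, not how the spliceosome or any real splice-site predictor works: it reports every GU...AG pair that has at least one A lying 20 to 40 nucleotides upstream of the AG, and it ignores the surrounding sequence context that real recognition depends on. The function names and the toy transcript are hypothetical.

```python
def candidate_introns(rna, min_gap=20, max_gap=40):
    """Return (donor, acceptor, branch_points) index triples for a transcript.

    donor is the index of a GU, acceptor the index of an AG, and
    branch_points the indices of every A inside the intron that sits
    min_gap..max_gap nucleotides upstream of the AG.
    """
    rna = rna.upper().replace("T", "U")
    donors = [i for i in range(len(rna) - 1) if rna[i:i + 2] == "GU"]
    acceptors = [j for j in range(len(rna) - 1) if rna[j:j + 2] == "AG"]
    hits = []
    for d in donors:
        for a in acceptors:
            first = max(d + 2, a - max_gap)  # stay inside the intron
            last = a - min_gap
            branch_points = [b for b in range(first, last + 1) if rna[b] == "A"]
            if branch_points:
                hits.append((d, a, branch_points))
    return hits


def splice(rna, donor, acceptor):
    """Remove the intron running from the GU through the AG and join the exons."""
    rna = rna.upper().replace("T", "U")
    return rna[:donor] + rna[acceptor + 2:]


# Hypothetical transcript: 5-nt exon, GU, filler, branch-point A, filler, AG, 5-nt exon.
toy = "CCCCC" + "GU" + "C" * 10 + "A" + "C" * 24 + "AG" + "CCCCC"
hits = candidate_introns(toy)
print(hits)
donor, acceptor, _ = hits[0]
print(splice(toy, donor, acceptor))
```

On a real transcript this scan would flag far more GU...AG pairs than are ever used, which is one way to see why additional signals are needed to decide which sites the spliceosome actually recognizes.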
A variety of factors affect how transcripts from a particular gene are spliced. Exon recognition by the spliceosome can be influenced by RNA-binding proteins (RBPs), which bind to enhancer and silencer motifs within the pre-mRNA and help or hinder spliceosome recognition of the splice sites. And because pre-mRNAs are frequently spliced as they’re transcribed, the speed of transcription by RNA polymerase II further tunes the window of opportunity for splice site recognition by the spliceosome.
Not long after the biochemical mechanism underlying RNA splicing was pieced together, more scientists jumped onto the splicing bandwagon and set out to study its functional consequences. Some of the earliest accounts came in the late 1980s, when several groups studying Drosophila melanogaster development independently noted that the genes involved in the fly’s sex determination cascade have female- and male-specific splice isoforms that determine the fly’s sexual fate. The field then began to recognize that alternative splicing wields extraordinary power in shaping development and tissue identity. Over the following decade, researchers published isolated examples featuring the functional roles of splice isoforms in other model organisms, from yeast and worms to mice and rats.
Then, the race was on to study splicing regulation in humans. In late 2008, three separate teams led by Tom Cooper at Baylor College of Medicine, Chris Burge at MIT, and Ben Blencowe at the University of Toronto published landmark papers on genome-wide splicing patterns across a host of human tissues and cell lines. Collectively, their studies revealed that every tissue in the body is characterized by a unique set of splicing events. Four years later, the Burge lab took an evolutionary approach to compare alternative splicing among higher-order vertebrate species, including the rhesus macaque and cow. They found that brain, heart, and skeletal muscle show the most highly conserved and tissue-specific alternative splicing patterns, further underscoring the functional importance of tissue-specific alternative splicing.
New developments
In general, splicing patterns change during development. Intriguingly, genes that are alternatively spliced are, more often than not, expressed at similar levels in all organs and across all developmental stages. This suggests that splicing can tune the production of proteins from these uniformly expressed genes to different contexts, using regulators that modulate splicing depending on tissue type and stage of development. Indeed, RNA-binding proteins come and go as development unfolds, and they assume the role of molecular switches of alternative splicing events. The vast number of potential interaction combinations between enhancer and silencer sequences and the RBPs that recognize them inspired the field to adopt the idea of a splicing code: that certain RBPs bind to certain RNA motifs to produce a given edit. Current efforts are focused on cracking that code. But defining a set of RBP targets is exceedingly complex, as RBPs can recognize multiple motifs depending on the biological context.
The intricate and precise action of RBPs controls alternative splicing networks, groups of transcripts from different genes that are each targeted by one or more of the same RBPs. A network can coordinate a specific cellular function that contributes to development or to tissue homeostasis. In recent years, groups of researchers have concentrated on unraveling these splicing networks. Among other researchers, the Burge and Cooper labs continued their long-standing collaboration to tackle this task in mice. The two groups sequenced RNA to track gene expression and the abundance of the various transcript isoforms during cardiac muscle development, and they observed that the conversion from fetal to adult heart cell function parallels a transition from fetal to adult splicing profiles. As a postdoc in the Cooper lab, one of us, Jimena Giudice, found that numerous differentially spliced genes encode proteins involved in intracellular trafficking, and these splicing events are controlled by two RBPs: CELF and MBNL. All signs pointed to a splicing network. Follow-up work revealed that the expression levels of CELF and MBNL are inversely tied to one another during muscle development, and that they antagonistically regulate more than 1,000 pre-mRNA transcripts, some of which are translated into proteins critical for muscle contraction.
Since the early efforts to describe splicing, the textbook view of the process has been that it occurs post-transcriptionally. However, researchers are challenging this view by demonstrating that RNA polymerase II (RNAPII) dynamics have the potential to influence spliceosome assembly, perhaps coupling transcription to splicing. Karla Neugebauer and her lab at Yale University champion this model and use biochemical and computational approaches to study the phenomenon. Recently, they developed a single-molecule intron tracking (SMIT) technique to measure splicing kinetics and found that introns are spliced as soon as they emerge from RNAPII. Last year, an international team of researchers published on the in vivo consequences of such co-transcriptional splicing, showing that mouse embryonic stem cells with a knocked-in gene for a slow-transcribing version of RNAPII exhibit neuronal differentiation defects due to the failure to properly splice genes involved in synapse signaling. This suggested that the rate at which RNAPII transcribes RNA affects how that RNA is spliced. Researchers are also exploring the possibility that chromatin architecture and epigenetics serve as another layer of splicing regulation by modulating the rate of RNAPII transcription.
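The link between transcription speed and splicing described above comes down to simple arithmetic. If an upstream splice site emerges from RNAPII before a competing downstream site, its head start is the distance between the two sites divided by the elongation rate. The sketch below works through that calculation. The distance and both rates are illustrative numbers chosen for the example, not measurements from the studies mentioned, and the function name is hypothetical.

```python
def window_of_opportunity(distance_nt, elongation_nt_per_s):
    """Seconds an upstream site is exposed before a downstream competitor is transcribed."""
    if elongation_nt_per_s <= 0:
        raise ValueError("elongation rate must be positive")
    return distance_nt / elongation_nt_per_s


# Illustrative values only: 3,000 nt between competing sites,
# a faster polymerase (50 nt/s) versus a slower one (12.5 nt/s).
fast = window_of_opportunity(3000, 50.0)
slow = window_of_opportunity(3000, 12.5)
print(f"fast polymerase: {fast:.0f} s, slow polymerase: {slow:.0f} s")
```

In this toy calculation, a polymerase that moves four times more slowly gives the upstream site four times as long to be recognized, which is the sense in which elongation rate can shift the balance between competing splice sites.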
Despite a collection of cases teasing apart the mechanism of alternative splicing and highlighting its functional consequences, the number of uncharacterized splicing events is immense, and the pages documenting the physiological importance of alternative splicing largely remain blank.
Splicing Matters
Titin, which codes for a protein in muscle, is one example of a gene whose pre-mRNA transcript can be spliced in multiple ways to yield different protein isoforms. During development of the fetal heart, more exons are left in during splicing, which produces a relatively long, springy protein. In adult hearts, an RNA-binding protein called RBM20 associates with long stretches of the pre-mRNA transcript during splicing, forcing the spliceosome to cut out those bits of RNA. The result is a relatively short, stiff protein. If RBM20 is missing or defective in adult hearts, these hearts will produce more fetal, springy titin protein relative to the stiff adult version. This is thought to reduce the capacity of the heart to contract, contributing to a condition known as dilated cardiomyopathy.
Mis-splicing in disease
More than one-third of disease-causing mutations map to sites bound by the spliceosome or RBPs, or to RBP-encoding gene regions. Therefore, mis-splicing has a strong potential to be implicated in disease. Parkinson’s, progeria, cardiomyopathy, spinal muscular atrophy, myotonic dystrophy, breast cancer, ovarian cancer—these devastating illnesses and many more appear to be caused partly by defects in RNA splicing, emphasizing the range of crippling effects that can stem from even slightly tipping the balance of protein isoform expression.
One scenario involves the titin (TTN) gene, which holds the record for the highest number of exons—a whopping 363—among all mammalian genes and encodes the largest known protein in the human body, weighing in at 4.2 megadaltons. The TTN protein is a molecular spring that contributes to the elasticity of heart muscle. Over the course of cardiac development, there is a gradual increase in the frequency of TTN exon skipping by the spliceosome, and these exons are thus spliced out from the mRNA. (See illustration on opposite page.) The transition from the fetal to the adult cardiac titin isoform is part of the normal developmental program and is orchestrated by an RBP called RBM20. That RBP promotes skipping where it’s found, thus inducing a shift in protein expression from a long, elastic TTN isoform to a short, stiff isoform. Using a rodent model, an international cohort of researchers and physicians demonstrated that the absence of RBM20 causes TTN mis-splicing, leading to the buildup of long, elastic TTN and phenotypes resembling the decreased heart contractility seen in humans with dilated cardiomyopathy induced by mutations in RBM20. The results strongly suggest that TTN mis-splicing contributes to RBM20-linked cardiomyopathy.
Another example is DMD, the gene that encodes the dystrophin protein, which is important for muscle integrity and force transmission. Mutational variants in DMD are notoriously associated with Duchenne muscular dystrophy, a disease that severely impairs muscle function. (See “Mending Muscle,” September 2018.) One disease-causing DMD mutation is a multiexon deletion that commonly results in a frameshift starting at exon 51. Splicing the remaining exons together results in a shortened dystrophin protein with compromised function. The lack of fully functioning dystrophin protein causes muscle weakness and atrophy, which drastically limit the physical abilities of people suffering from this disease.
Some therapies currently in development for Duchenne muscular dystrophy and other diseases aim to correct defects in splicing to alleviate symptoms. For example, researchers have been able to partially recover dystrophin function by using antisense oligonucleotides that prevent the spliceosome from recognizing exons downstream of the deletion. Hiding these regions from the spliceosome causes the exons to be skipped, which can restore the reading frame and produce a near-full-length protein. Two years ago, scientists at Japan’s National Center of Neurology and Psychiatry reported results from a Phase 1 trial hinting at the oligonucleotides’ safety and capacity to induce DMD exon-skipping in patients with Duchenne muscular dystrophy.
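The logic of exon skipping rests on the fact that codons are three nucleotides long. A deletion preserves the reading frame only if the total number of nucleotides removed is a multiple of three, and skipping one more exon can bring an out-of-frame total back to such a multiple. The sketch below checks this for a deletion of three consecutive exons. The exon numbers and lengths are illustrative placeholders chosen so that the arithmetic works out; they should not be read as measured values for DMD, and the function names are hypothetical.

```python
def in_frame(removed_lengths):
    """True if removing exons of these lengths (in nucleotides) keeps the reading frame."""
    return sum(removed_lengths) % 3 == 0


def exons_restoring_frame(exon_lengths, deleted, candidates):
    """Return the candidate exons whose additional skipping restores the frame."""
    deleted_lengths = [exon_lengths[e] for e in deleted]
    return [
        c for c in candidates
        if in_frame(deleted_lengths + [exon_lengths[c]])
    ]


# Illustrative exon lengths in nucleotides (placeholders, not measured values).
exon_lengths = {47: 150, 48: 186, 49: 102, 50: 109, 51: 233}
deleted = [48, 49, 50]

print(in_frame([exon_lengths[e] for e in deleted]))
print(exons_restoring_frame(exon_lengths, deleted, candidates=[47, 51]))
```

With these numbers the deletion removes 397 nucleotides, which is not a multiple of three, so everything downstream is read out of frame. Also skipping the exon labeled 51 brings the total to 630, a multiple of three, so the downstream exons are read correctly again, at the cost of a somewhat shorter protein.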
Understanding the story behind each protein in our bodies has turned out to be far more complex than reading our DNA. Although the basic splicing mechanism was uncovered more than 40 years ago, working out the interplay between splicing and physiology continues to fascinate us. We hope that advanced knowledge of how alternative splicing is regulated and the functional role of each protein isoform during development and disease will lay the groundwork for the success of future translational therapies.