Shapiro–Senapathy algorithm

The different types of splicing mutations in genes. Mutations within the splicing regions of genes can lead to a defective transcript and protein. Depending on where exactly the mutation occurs and which "cryptic" splice site near the original site is chosen for splicing, the specific defect in the transcript and protein will vary. Frequently, splicing mutations will lead to exon skipping, intron inclusion, exon extension/truncation, and premature termination in the resulting transcript. The various defects in the transcript will in turn result in different kinds of disruption in the amino acid sequence of the protein.

The Shapiro–Senapathy algorithm (S&S) is an algorithm for predicting splice junctions in genes of animals and plants. This algorithm has been used to discover disease-causing splice site mutations and cryptic splice sites.

The algorithm

A splice site is the border between an exon and intron in a gene. These sites contain a particular sequence motif, which is necessary for recognition and processing by the RNA splicing machinery.

The S&S algorithm uses sliding windows of eight nucleotides, corresponding to the length of the splice site sequence motif, to identify these conserved sequences and thus potential splice sites. Using a weighted table of nucleotide frequencies, the S&S algorithm outputs a consensus-based percentage score that indicates how likely the window is to contain a splice site.
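The sliding-window scoring described above can be sketched as follows. This is a minimal illustration, not the published method: the frequency table FREQ is a hypothetical stand-in for the S&S weight table, the function names window_score and scan are invented for this example, and the min/max normalization to a 0-100 percentage is one plausible way to turn summed frequencies into a consensus-based score.

```python
# Hypothetical 8-position table: percentage of each base at each window
# position. Illustrative values only, NOT the published S&S weights.
FREQ = [
    {"A": 30, "C": 40, "G": 20, "T": 10},
    {"A": 60, "C": 10, "G": 15, "T": 15},
    {"A": 10, "C": 5, "G": 80, "T": 5},
    {"A": 0, "C": 0, "G": 100, "T": 0},
    {"A": 0, "C": 0, "G": 0, "T": 100},
    {"A": 60, "C": 5, "G": 30, "T": 5},
    {"A": 70, "C": 10, "G": 10, "T": 10},
    {"A": 10, "C": 5, "G": 80, "T": 5},
]


def window_score(window, table=FREQ):
    """Score one window as a percentage between the worst and best possible sums."""
    # Unknown symbols (for example N) contribute nothing to the sum.
    total = sum(col.get(base, 0) for col, base in zip(table, window))
    lowest = sum(min(col.values()) for col in table)
    highest = sum(max(col.values()) for col in table)
    return 100.0 * (total - lowest) / (highest - lowest)


def scan(sequence, table=FREQ, threshold=80.0):
    """Slide a window along the sequence and keep windows scoring at or above threshold."""
    width = len(table)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = window_score(window, table)
        if score >= threshold:
            hits.append((start, window, round(score, 1)))
    return hits


if __name__ == "__main__":
    print(scan("TTTCAGGTAAGTTT", threshold=90.0))
```

Each hit is reported as (start position, window, score), so that a caller can inspect candidate sites and choose a threshold suited to the organism group being studied.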

The S&S algorithm serves as the basis of other software tools, such as Human Splicing Finder, Splice-site Analyzer Tool, dbass (Ensembl), Alamut, and SROOGLE.

Cancer gene discovery using S&S

Using the S&S algorithm, researchers have discovered mutations and genes that cause many different forms of cancer. For example, genes causing commonly occurring cancers including breast cancer, ovarian cancer, colorectal cancer, leukemia, head and neck cancers, prostate cancer, retinoblastoma, squamous cell carcinoma, gastrointestinal cancer, melanoma, liver cancer, Lynch syndrome, skin cancer, and neurofibromatosis have been found. In addition, splicing mutations in genes causing less common cancers including gastric cancer, gangliogliomas, Li-Fraumeni syndrome, Loeys–Dietz syndrome, osteochondromas (bone tumor), nevoid basal cell carcinoma syndrome, and pheochromocytomas have been identified.

Specific mutations in different splice sites in various genes causing breast cancer (e.g., BRCA1, PALB2), ovarian cancer (e.g., SLC9A3R1, COL7A1, HSD17B7), colon cancer (e.g., APC, MLH1, DPYD), colorectal cancer (e.g., COL3A1, APC, HLA-A), skin cancer (e.g., COL17A1, XPA, POLH), and Fanconi anemia (e.g., FANC, FANA) have been uncovered. The mutations in the donor and acceptor splice sites in different genes causing a variety of cancers that have been identified by S&S are shown in Table 1.

Table 1. Mutations in the donor and acceptor splice sites in different genes

Discovery of genes causing inherited disorders using S&S

Specific mutations in different splice sites in various genes that cause inherited disorders have been uncovered. Examples include Type 1 diabetes (e.g., PTPN22, TCF1 (HCF-1A)), hypertension (e.g., LDL, LDLR, LPL), Marfan syndrome (e.g., FBN1, TGFBR2, FBN2), cardiac diseases (e.g., COL1A2, MYBPC3, ACTC1), and eye disorders (e.g., EVC, VSX1). A few example mutations in the donor and acceptor splice sites in different genes causing a variety of inherited disorders identified using S&S are shown in Table 2.

Table 2. Mutations in the donor and acceptor splice sites in different genes causing inherited disorders

Genes causing immune system disorders

More than 100 immune system disorders affect humans, including inflammatory bowel diseases, multiple sclerosis, systemic lupus erythematosus, Bloom syndrome, familial cold autoinflammatory syndrome, and dyskeratosis congenita. The Shapiro–Senapathy algorithm has been used to discover genes and mutations involved in many immune disorders, including ataxia telangiectasia, B-cell defects, epidermolysis bullosa, and X-linked agammaglobulinemia.

Xeroderma pigmentosum, an autosomal recessive disorder, is caused by faulty proteins that result in defective nucleotide excision repair. The S&S algorithm was used to identify a new preferred splice donor site that gives rise to such faulty proteins.

Type I Bartter syndrome (BS) is caused by mutations in the gene SLC12A1. The S&S algorithm helped reveal two novel heterozygous mutations, c.724+4A>G in intron 5 and c.2095delG in intron 16, leading to complete skipping of exon 5.

Mutations in the MYH gene, which is responsible for removing oxidatively damaged DNA lesions, confer cancer susceptibility. The IVS1+5C variant plays a causative role in the activation of a cryptic splice donor site and in the alternative splicing of intron 1. The S&S algorithm shows that guanine (G) at position IVS+5 is well conserved (at a frequency of 84%) among primates. This supports the finding that the G/C SNP in the conserved splice junction of the MYH gene causes the alternative splicing of intron 1 of the β type transcript.

Splice site scores were calculated according to S&S in a study of EBV infection in X-linked lymphoproliferative disease. S&S was also applied to familial tumoral calcinosis (FTC), an autosomal recessive disorder characterized by ectopic calcifications and elevated serum phosphate levels, which results from aberrant splicing.

Application of S&S in hospitals for clinical practice and research

Applying the S&S technology platform in modern clinical genomics research has advanced the diagnosis and treatment of human diseases.

In the modern era of Next Generation Sequencing (NGS) technology, S&S is applied extensively in clinical practice. Clinicians and molecular diagnostic laboratories apply S&S using various computational tools including HSF, SSF, and Alamut. It aids in the discovery of genes and mutations in patients whose diseases are stratified, or when the disease in a patient remains unknown after clinical investigations.

In this context, S&S has been applied on cohorts of patients in different ethnic groups with various cancers and inherited disorders. A few examples are given below.

Cancers

Inherited disorders

S&S - the first algorithm for identifying splice sites, exons and split genes

Dr. Senapathy's original objective in developing a method for identifying splice sites was to find complete genes in raw, uncharacterized genomic sequence that could be used in the Human Genome Project. In the paper written with this objective, he described for the first time the basic method for identifying splice sites within a given sequence, based on the position weight matrix (PWM) of the splicing sequences in different groups of eukaryotic organisms. He also created the first exon detection method by defining the basic characteristics of an exon: a sequence bounded by an acceptor and a donor splice site with S&S scores above a threshold, and containing an open reading frame (ORF), which was mandatory for an exon. An algorithm for finding complete genes based on the identified exons was also described by Dr. Senapathy for the first time.
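The exon definition above (an acceptor and a donor site scoring above a threshold, enclosing an open reading frame) can be sketched as follows. This is a simplified illustration under stated assumptions: the splice site scores are supplied by the caller rather than computed, the coordinate convention and the names has_open_frame and candidate_exons are hypothetical, and the threshold of 70 is arbitrary.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def has_open_frame(segment):
    """Return True if at least one of the three reading frames has no stop codon."""
    for frame in range(3):
        codons = (segment[i:i + 3] for i in range(frame, len(segment) - 2, 3))
        if not any(codon in STOP_CODONS for codon in codons):
            return True
    return False


def candidate_exons(sequence, acceptors, donors, threshold=70.0):
    """Pair acceptor and donor sites into candidate exons.

    acceptors: list of (index of first exon base, score)
    donors:    list of (index one past the last exon base, score)
    Both conventions are assumptions made for this sketch.
    """
    exons = []
    for a_pos, a_score in acceptors:
        if a_score < threshold:
            continue
        for d_pos, d_score in donors:
            if d_score < threshold or d_pos <= a_pos:
                continue
            # The enclosed sequence must be readable in some frame.
            if has_open_frame(sequence[a_pos:d_pos]):
                exons.append((a_pos, d_pos))
    return exons


if __name__ == "__main__":
    toy = "CCCAG" + "ATGGCCAAA" + "GTAAGT"  # intron end, exon, intron start
    print(candidate_exons(toy, acceptors=[(5, 85.0)], donors=[(14, 90.0)]))
```

A real gene finder would also need to chain compatible exons into a gene and keep reading frames consistent across introns, which this sketch leaves out.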

Dr. Senapathy demonstrated that only deleterious mutations in the donor or acceptor splice sites, those that make the protein drastically defective, reduce the splice site score (later known as the Shapiro–Senapathy score), whereas non-deleterious variations do not. The S&S method was adapted for researching the cryptic splice sites caused by mutations leading to diseases. This method for detecting deleterious splicing mutations in eukaryotic genes has been used extensively in disease research in humans, animals and plants over the past three decades, as described above.

The basic method for splice site identification and for defining exons and genes was subsequently used by researchers to find splice sites, exons and eukaryotic genes in a variety of organisms. These methods also formed the basis of all subsequent tool development for discovering genes in uncharacterized genomic sequences. They were also used in different computational approaches, including machine learning and neural networks, and in alternative splicing research.

Discovering the mechanisms of aberrant splicing in diseases

The Shapiro–Senapathy algorithm has been used to determine the various aberrant splicing mechanisms in genes due to deleterious mutations in the splice sites, which cause numerous diseases. Deleterious splice site mutations impair the normal splicing of the gene transcripts, and thereby make the encoded protein defective. A mutant splice site can become "weak" compared to the original site, so that the mutated splice junction is no longer recognized by the spliceosomal machinery. This can lead to the skipping of the exon in the splicing reaction, resulting in the loss of that exon in the spliced mRNA (exon skipping). Alternatively, a partial or complete intron can be included in the mRNA when a mutation makes a splice site unrecognizable (intron inclusion). Partial exon skipping or intron inclusion can lead to premature termination of the protein translated from the mRNA, and the resulting defective protein leads to disease. S&S has thus paved the way to determining the mechanisms by which a deleterious mutation can lead to a defective protein, resulting in different diseases depending on which gene is affected.

Examples of splicing aberrations

An example of a splicing aberration (exon skipping), caused by a mutation in the donor splice site of exon 8 of the MLH1 gene that led to colorectal cancer, is given below. This example shows that a mutation in a splice site within a gene can have a profound effect on the sequence and structure of the mRNA, and on the sequence, structure and function of the encoded protein, leading to disease.

Exon skipping caused by a donor mutation in the gene MLH1 leading to colorectal cancer. The generation of an mRNA from a split gene involves the transcription of the gene into the primary RNA transcript, the precise removal of the introns, and the joining of the exons from the primary RNA transcript. A deleterious mutation within the splicing signals (donor or acceptor splice sites) can affect the recognition of the correct splice junction and lead to an aberration in the joining of the authentic exons. Depending on whether the mutation occurs within the donor or the acceptor site, and on the particular base that is mutated within the splice sequence, the aberration could lead to the skipping of a complete or partial exon, or the inclusion of a partial intron or a cryptic exon in the mRNA produced by the splicing process. Any of these situations will usually lead to a premature stop codon in the mRNA and result in a completely defective protein. The S&S algorithm aids in determining which splice site and exon in a gene are mutated, and the S&S score of the mutated splice site aids in determining the type of splicing aberration and the resulting mRNA structure and sequence. The example gene MLH1 affected in colorectal cancer is shown in the figure. It was found using the S&S algorithm that a mutation in the donor splice site of exon 8 led to the skipping of exon 8. The mRNA thus lacks the sequence corresponding to exon 8 (sequence positions are shown in the figure). This causes a frameshift in the mRNA coding sequence at amino acid position 226, leading to premature protein truncation at amino acid position 233. This mutated protein is completely defective, which led to colorectal cancer in the patient.
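The way exon skipping produces a frameshift and a premature stop codon can be illustrated with a toy example. The three exon sequences below are invented for illustration and are not MLH1 sequence; the function names splice and first_stop are likewise hypothetical. The middle exon has a length that is not a multiple of three, so leaving it out shifts the reading frame of everything downstream.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def splice(exons, skipped=None):
    """Join exon sequences in order, optionally leaving one out (0-based index)."""
    return "".join(seq for i, seq in enumerate(exons) if i != skipped)


def first_stop(mrna):
    """Return the 1-based codon number of the first in-frame stop codon, or None."""
    for number, i in enumerate(range(0, len(mrna) - 2, 3), start=1):
        if mrna[i:i + 3] in STOP_CODONS:
            return number
    return None


if __name__ == "__main__":
    # Toy exons (hypothetical). The second exon is 4 nt long, so skipping
    # it shifts the downstream reading frame.
    exons = ["ATGGCC", "AAAG", "TGATTGCATAA"]
    normal = splice(exons)
    skipped = splice(exons, skipped=1)
    print("normal stop at codon", first_stop(normal))
    print("exon-skipped stop at codon", first_stop(skipped))
```

In the toy data the normal transcript reads through to its final codon, while the exon-skipped transcript meets a stop codon much earlier, which is the same kind of truncation the figure describes for MLH1.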

S&S in cryptic splice sites research and medical applications

The identification of splice sites has to be highly precise because the consensus splice sequences are very short and gene sequences contain many other sequences similar to the authentic splice sites, which are known as cryptic, non-canonical, or pseudo splice sites. When an authentic splice site is mutated, a cryptic splice site close to the original site may be erroneously used as the authentic site, resulting in an aberrant mRNA. The erroneous mRNA may include a partial sequence from the neighboring intron or lose part of an exon, which may result in a premature stop codon. The result may be a truncated protein that has lost its function completely.

The Shapiro–Senapathy algorithm can identify cryptic splice sites in addition to authentic splice sites. Cryptic sites can often be stronger than the authentic sites, with a higher S&S score. However, lacking an accompanying complementary donor or acceptor site, such a cryptic site will not be active or used in a splicing reaction. When a neighboring real site is mutated to become weaker than the cryptic site, the cryptic site may be used instead of the real site, resulting in a cryptic exon and an aberrant transcript.
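The comparison between a mutated authentic site and a nearby cryptic site can be sketched as follows. This is a deliberately simplified stand-in, not S&S scoring: CONSENSUS and WEIGHTS are hypothetical, the score is a weighted percentage of consensus matches, and the requirement for a complementary partner site mentioned above is ignored.

```python
CONSENSUS = "CAGGTAAG"            # hypothetical 8-nt donor consensus
WEIGHTS = [1, 1, 1, 4, 4, 1, 1, 1]  # hypothetical: the central GT counts most


def site_score(window):
    """Weighted percentage of window positions that match the consensus."""
    matched = sum(w for w, a, b in zip(WEIGHTS, window, CONSENSUS) if a == b)
    return 100.0 * matched / sum(WEIGHTS)


def site_used(sequence, authentic_start, cryptic_start):
    """Return which site scores higher; the authentic site wins ties."""
    width = len(CONSENSUS)
    authentic = site_score(sequence[authentic_start:authentic_start + width])
    cryptic = site_score(sequence[cryptic_start:cryptic_start + width])
    if cryptic > authentic:
        return "cryptic", cryptic
    return "authentic", authentic


if __name__ == "__main__":
    # Toy sequence: authentic site at index 2, weaker cryptic site at index 14.
    wild_type = "AACAGGTAAGTTTTTAGGTAACAA"
    # G-to-A change in the GT of the authentic site (index 5).
    mutant = wild_type[:5] + "A" + wild_type[6:]
    print(site_used(wild_type, 2, 14))
    print(site_used(mutant, 2, 14))
```

In this toy case the wild-type sequence favors the authentic site, while the single-base change in its GT dinucleotide drops its score below that of the cryptic site.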

Numerous diseases are caused by cryptic splice site mutations or by the use of cryptic splice sites following mutations in authentic splice sites.

S&S in animal and plant genomics research

S&S has also been used in RNA splicing research in many animals and plants.

mRNA splicing plays a fundamental role in the regulation of gene function. It has been shown that A to G conversions at splice sites can lead to mRNA mis-splicing in Arabidopsis. In the molecular characterization and evolution of the class V β-1,3-glucanase of the carnivorous sundew (Drosera rotundifolia L.), the splicing and exon–intron junction predictions coincided with the GT/AG rule (S&S). Unspliced (LSDH) and spliced (SSDH) transcripts of NAD+ dependent sorbitol dehydrogenase (NADSDH) of strawberry (Fragaria ananassa Duch., cv. Nyoho) were investigated under phytohormonal treatments.

Ambra1 is a positive regulator of autophagy, a lysosome-mediated degradative process involved in both physiological and pathological conditions. To date, this function of Ambra1 has been characterized only in mammals and zebrafish. Diminution of rbm24a or rbm24b gene products by morpholino knockdown resulted in significant disruption of somite formation in mouse and zebrafish. The S&S algorithm has been used extensively to study the intron-exon organization of fut8 genes. Using S&S, the intron-exon boundaries of Sf9 fut8 were found to agree with the consensus sequence for the splicing donor and acceptor sites.

References

Uses material from the Wikipedia article Shapiro–Senapathy algorithm, released under the CC BY-SA 4.0 license.