Personal tools
Home Rätsch Lab Supplements PALMA: mRNA to Genome Alignments using Large Margin Algorithms PALMA: Perfect Alignments using Large Margin Algorithms

PALMA: Perfect Alignments using Large Margin Algorithms

by Gunnar Rätsch, Bettina Hepp, Uta Schulze and Cheng Soon Ong.

The paper (pdf) was accepted at GCB 2006. Please also check out our extended work under consideration for publication in Bioinformatics.

Software

PALMA aligns two genetic sequences 'the best way' according to its underlying algorithm and trained parameters. The main python script takes two FASTA files, the "target" (e.g. a DNA sequence, part of the genome) and "query" (e.g. a cDNA or EST sequence), creates an alignment basically using dynamic programming (written in c++), and gives back the result (after backtracking) in psl like format. The alignments are local ones, so if no alignment has been found it's because no alignment got a positive alignment score.

Licensing Information

PALMA is licensed under the GPL version 2 or any later version (cf. COPYING).

Releases

Download the source code of PALMA Version 0.2 here (Mac OS X and Linux supported):

Contact

In case of comments, problems, questions etc. feel free to contact any of the authors:

Abstract

Despite many years of research on how to properly align sequences in the presence of sequencing errors, alternative splicing and micro-exons, the correct alignment of mRNA sequences to genomic DNA is still a challenging task. We present a novel approach based on large margin learning that combines kernel based splice site predictions with common sequence alignment techniques. By solving a convex optimization problem, our algorithm – called PALMA – tunes the parameters of the model such that the true alignment scores higher than all other alignments. In an experimental study on the alignments of mRNAs containing artificially generated micro-exons, we show that our algorithm drastically outperforms all other methods: It perfectly aligns all 4358 sequences on an hold-out set, while the best other method misaligns at least 90 of them. Moreover, our algorithm is very robust against noise in the query sequence: when deleting, inserting, or mutating up to 50% of the query sequence, it still aligns 95% of all sequences correctly, while other methods achieve less than 36% accuracy.

Results

Comparison of different methods for aligning mRNAs to genomic DNA. |result|

We considered the particularly difficult task of aligning exon triples with short middle exons (2-50nt) in the presence of noise. Almost all methods perform reasonably well when the query perfectly matches the target. For blat and sim4 the error rates drastically increase when adding noise to the query sequence. Only exalin and PALMA (with and without splice site information) have low error rates for noise levels of at most 20%. However, only PALMA (with splice sites) achieves 0% error rate for aligning queries with up to 10% noise. When deleting, inserting or mutating up to 50% of the query sequence, PALMA (with splice sites) still aligns 95% of all sequences correctly, while the other methods achieve less than 36% accuracy. For these large noise levels the splice site information helps to reduce the error rate considerably. But also in the low noise cases the splice site predictions help to accurately identify very short exons that can be found ambiguously in the intronic regions (0.4% of the test sequences).

Supplementary material

Training the splice site model and also the large margin combination requires separate sequence data sets. For the splice site model, we used genes that were EST confirmed but without full length cDNA support (set 1). We consider a random subset of 40% of all cDNA confirmed genes without evidence for alternative splicing for training the large margin combination (set 2). The remaining 20% and 40% were used for hyper-parameter tuning (set 3) and final evaluation (set 4) respectively.

For details see below.

Processing of Sequence Databases

We collected all known C. elegans ESTs from Wormbase [HCC04] (release WS120; 236,893 sequences) and dbEST [BT93] (as of February 22, 2004; 231,096 sequences). Using blat [Ken02] we aligned them against the genomic DNA (release WS120). The alignment was used to confirm exons and introns. We refined the alignment by correcting typical sequencing errors, for instance by removing minor insertions and deletions. If an intron did not exhibit the consensus GT/AG or GC/AG at the 5' and 3' ends, then we tried to achieve this by shifting the boundaries up to 2 base pairs (bp). If this still did not lead to the consensus, we split the sequence into two parts and considered each subsequence separately. In a next step we merged consistent alignments, if they shared at least one complete exon or intron. This lead to a set of 124,442 unique EST-based sequences.

We repeated the above procedure with all known cDNAs from Wormbase (release WS120; 4,855 sequences). These sequences only contain the coding part of the mRNA. We used their ends as annotation for start and stop codons.

We clustered the sequences in order to obtain independent training, validation and test sets. In the beginning each of the above EST and cDNA sequences were in a separate cluster. We iteratively joined clusters, if any two sequences from distinct clusters match to the same genomic location (this includes many forms of alternative splicing). We obtained 21,086 clusters, while 4072 clusters contained at least one cDNA.

For set 1 we chose all clusters not containing a cDNA (17215), for set 2 we chose 40% of the clusters containing at least one cDNA (1536). For set 3 we used 20% of clusters with cDNA (775). The remaining 40% of clusters with at least one cDNA (1,560) were used as set 4. Sets 2-4 were filtered to remove confirmed alternative splice forms. This left 1,177 cDNA sequences for testing in set 4 with an average of 4.8 exons per gene and 2,313bp from the 5' to the 3' end.

Artificial Microexon Dataset

Based on sets 2-4 described above we created sets of consecutive exon triples from the confirmed transcripts in these sets. This lead to 4604, 2257 and 4358 triples. In a first processing step we shortened the middle exons to a random length between 2nt and 50nt (uniformly distributed). To do so, we removed the correct number of nucleotides from the center of the middle exon - from the query as well as the DNA. This leaves the splice sites mostly functional. In a second step we added varying amounts of noise. For a given noise level p and a query sequence of length L, we first replaced p*L/3 positions with a random letter (Sigma={A,C,G,T,N}). Then we deleted the same number of non-overlapping positions in the sequence and added the same number of random nucleotides at random positions. We used p=0%, 1%, 10%, 20%, 50%.

Datasets

These archives contain fasta-files including the DNA sequences and query sequences (EST/cDNA) for all noise levels. The true exon boundaries are provided. See the README files (example) for details.

References

[BT93]M.S. Boguski and T.M. Lowe C.M. Tolstoshev. dbEST–Database for ”Expressed Sequence Tags”. Nat Genet., 4(4):332–3, 1993.
[HCC04]T.W. Harris, N. Chen, F. Cunningham, et al. WormBase: a multi-species resource for nematode biology and genomics. Nucl. Acids Res., 32, 2004. Database issue:D411-7.
[Ken02]W.J. Kent. BLAT–the BLAST-like alignment tool. Genome Res, 12(4):656–664, April 2002.

System Message: ERROR/3 (<string>, line 106)

Error in "image" directive: "right" is not a valid value for the "align" option within a substitution definition. Valid values for "align" are: "top", "middle", "bottom".

image:: palma_result
   :height: 522
   :width: 540
   :scale: 50
   :alt: Comparison of different methods for aligning mRNAs to genomic DNA
   :align: right

System Message: WARNING/2 (<string>, line 106)

Substitution definition "result" empty or invalid.

.. |result| image:: palma_result
   :height: 522
   :width: 540
   :scale: 50
   :alt: Comparison of different methods for aligning mRNAs to genomic DNA
   :align: right

Docutils System Messages

System Message: ERROR/3 (<string>, line 45); backlink

Undefined substitution referenced: "result".
Document Actions