Accurate Splice Site Detection in C. elegans (Supplementary Material)

Supplementary Material to the paper "Accurate Splice Site Detection in C. elegans" by Gunnar Rätsch and Sören Sonnenburg.

Download paper: pdf gz
  • Gunnar Rätsch (homepage) (contact me in case of trouble with this page)
  • Sören Sonnenburg (homepage)
Appeared in Kernel Methods in Computational Biology, B. Schölkopf, K. Tsuda and J.-P. Vert (editors), MIT Press.
This page contains additional material for the paper above. It documents
  1. which data sets were used,
  2. what the model selection results were, and
  3. an implementation of the Weighted Degree Kernel.

Section 1 provides the list of virtual genes from which the acceptor and donor sites were derived; the derived splice site data are in Section 2. Model selection results for splice site recognition are given in Section 3. Section 4 provides the data used to evaluate complete splice forms, and the corresponding model selection results are in Section 5. The Weighted Degree Kernel implementation is in Section 6.

  1. Training, Validation and Test sets of "virtual genes"

    These genes were used to generate the splice data set and to perform the comparison with genscan. Each file contains a gene string on one line, followed by two lines of positions:
    gene_start     intron_end+1   intron_end+1
    intron_start+1 intron_start+1 gene_end+2
    
    That is, gene_start is on atg, intron_start on gt, intron_end on agx and gene_end on tagxx, so the data looks like this:
    tccgaatatcaatgtga...
    571 738 1287 2018 
    683 939 1449 2144 
    tccgaatatcaatgtg...
    571 695 868 
    648 818 1031
    ...
    
    Download:
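
A minimal sketch of a reader for the three-line record layout described above (one gene string, then two lines of positions). The names `VirtualGene`, `parse_virtual_genes` and `segments` are hypothetical, and pairing the two position lines column by column is an assumption based on the header shown above. Whether positions are 0-based or 1-based is not stated here, so the values are kept exactly as read.

```python
from typing import Iterable, List, NamedTuple, Tuple


class VirtualGene(NamedTuple):
    sequence: str
    starts: List[int]  # first position line: gene_start, intron_end+1, ...
    ends: List[int]    # second position line: intron_start+1, ..., gene_end+2


def parse_virtual_genes(lines: Iterable[str]) -> List[VirtualGene]:
    """Group the non-empty lines into (sequence, starts, ends) records."""
    rows = [ln.strip() for ln in lines if ln.strip()]
    if len(rows) % 3 != 0:
        raise ValueError("expected a multiple of three non-empty lines")
    genes = []
    for i in range(0, len(rows), 3):
        starts = [int(tok) for tok in rows[i + 1].split()]
        ends = [int(tok) for tok in rows[i + 2].split()]
        if len(starts) != len(ends):
            raise ValueError(f"record {i // 3}: position lines differ in length")
        genes.append(VirtualGene(rows[i], starts, ends))
    return genes


def segments(gene: VirtualGene) -> List[Tuple[int, int]]:
    """Pair the two position lines column by column (assumed pairing)."""
    return list(zip(gene.starts, gene.ends))
```

For a gzip-compressed file, the lines could be supplied by `gzip.open(path, "rt")` from the Python standard library.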
  2. Training, Validation and Test data for Acceptor and Donor splice sites

    The data looks like this
    -1 TTCTGAAGAAGACGATGACGAAGACGAAGGAGAAGCCGTTGCAGAACTTGTCACAAAGTG
    -1 CCAACCTAATCGTTATACATATGTATTTACAGTCGCAAATGACAATTGAACAAATAAATG
    	....
    +1 AATGTTTCAATTATAAAAATTGTTAATTACAGGGGGACACCTGTATCAGTGTGACATTTC
    	....
    
    Here -1 means no splice site and +1 means splice site; the sequence follows after a space. Download:
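
The labelled format above could be read with a few lines of Python. This is only a sketch: the function name `parse_labeled_sequences` is hypothetical, and it assumes every non-empty line holds exactly one label and one sequence (the `....` lines in the sample are taken to be elisions, not file content).

```python
from typing import Iterable, List, Tuple


def parse_labeled_sequences(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (label, sequence) pairs; label is -1 or +1."""
    examples = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split on the first run of whitespace: label, then sequence.
        label_text, sequence = line.split(None, 1)
        label = int(label_text)  # int() accepts both "-1" and "+1"
        if label not in (-1, 1):
            raise ValueError(f"unexpected label {label_text!r}")
        examples.append((label, sequence))
    return examples
```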
  3. Model Selection Results for Splice Site Prediction

    (Selected for largest validation ROC.) All result files named *.{tst|dat} contain a line stating the validation or test error, followed by the classifier outputs:
    validation error = 0.014181
    
    -12.143139
    -10.286769
    ...
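
A sketch of a reader for this result-file layout, assuming the first non-empty line always has the form `<name> = <value>` and every following non-empty line is one classifier output. The function name `parse_result_file` is hypothetical.

```python
from typing import Iterable, List, Tuple


def parse_result_file(lines: Iterable[str]) -> Tuple[str, float, List[float]]:
    """Return (error name, error value, classifier outputs)."""
    rows = [ln.strip() for ln in lines if ln.strip()]
    if not rows:
        raise ValueError("empty result file")
    name, sep, value = rows[0].partition("=")
    if not sep:
        raise ValueError("first line should look like 'validation error = 0.014181'")
    outputs = [float(r) for r in rows[1:]]
    return name.strip(), float(value), outputs
```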
    
    Trained SVMs are saved in the following format:
    b=-3.577909
    alphas=[
    	     2 -1.000000
    	    13 +0.373805
    	    57 +1.000000
    	    68 -0.332549
    	    85 -1.000000
    			...
    ]
    
    Here b is the bias term and alphas contains (index, value) pairs, where index is the index of a support vector (one with a nonzero coefficient) and value is the product of that support vector's Lagrange multiplier and label. Results:
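
The saved-SVM format above can be parsed and evaluated roughly as follows. This is a sketch under stated assumptions: `parse_svm` and `decision_value` are hypothetical names, the decision function is the standard SVM form f(x) = b + sum_i value_i * k(x_i, x), and `training_inputs` is a caller-supplied mapping from the file's indices to training sequences, since it is not stated here whether the indices are 0-based or 1-based.

```python
from typing import Callable, Dict, Iterable, Mapping, Tuple


def parse_svm(lines: Iterable[str]) -> Tuple[float, Dict[int, float]]:
    """Return (bias, {index: value}) from the 'b=' / 'alphas=[ ... ]' layout."""
    bias = None
    coeffs: Dict[int, float] = {}
    in_alphas = False
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("b="):
            bias = float(line[2:])
        elif line.startswith("alphas=["):
            in_alphas = True
        elif line == "]":
            in_alphas = False
        elif in_alphas:
            index_text, value_text = line.split()
            coeffs[int(index_text)] = float(value_text)
    if bias is None:
        raise ValueError("no 'b=' line found")
    return bias, coeffs


def decision_value(
    x: str,
    bias: float,
    coeffs: Mapping[int, float],
    training_inputs: Mapping[int, str],  # assumption: keyed by the file's indices
    kernel: Callable[[str, str], float],
) -> float:
    """Standard SVM output: bias plus the weighted kernel sum over support vectors."""
    return bias + sum(v * kernel(training_inputs[i], x) for i, v in coeffs.items())
```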
  4. Sequences used for Evaluation of Splice Form Prediction

    • Alignment-Reference-Examples

      (250 of each class):
      Files contain plain sequences:
      AATGTTTCAATTATAAAAATTGTTAATTACAGGGGGACACCTGTATCAGTGTGACATTTC
      TTTTGTGGACAAGTTAGAGCAAACGATTATAGATGCAGCGACAGAGGGATTTGGAATCAA
      TGAGGTAAAAATTTAAACTGTGAAAATTTCAGCGTATCTTCGAAATCTAGTGGAAAGCGC
      

      Download

      These examples were used in testing; the format is as in Section 1. The genscan_exon_no_pred_test.asc.gz file contains two columns with as many rows as there are test genes. A one in the first column means that genscan correctly found the gene start and end (zero otherwise). The number in the second column is the predicted number of exons, e.g.
      0   3
      1   3
      ...
      
    • Download test data sets
      test_genes.asc.gz (all sequences including start and end of exons)
      genscan_exon_no_pred_test.asc.gz (the number of exons genscan predicted and a bit vector indicating whether genscan found the right start and end)

      We used the following constants:

      min_exon_len    8
      min_intron_len  35
      max_pos         1000
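
The two-column genscan_exon_no_pred_test.asc.gz layout described above could be read as sketched below. The function name `parse_genscan_summary` is hypothetical, and the sketch assumes whitespace-separated columns with a 0/1 flag first and an integer exon count second.

```python
from typing import Iterable, List, Tuple


def parse_genscan_summary(lines: Iterable[str]) -> List[Tuple[bool, int]]:
    """Return (boundaries_correct, predicted_exon_count) per test gene."""
    rows = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if len(fields) != 2 or fields[0] not in ("0", "1"):
            raise ValueError(f"unexpected row: {line!r}")
        rows.append((fields[0] == "1", int(fields[1])))
    return rows
```

Summing the first element of each pair then gives the number of test genes for which genscan found the right start and end.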

  5. Model-Selection for alpha, sigma_a, sigma_b:

    • Positional Weight Matrices

      sigmoid_a  0.45
      sigmoid_b  -0.9
      alpha      -3.75
      used model parameters (may differ from above)
      order pseudo_p pseudo_n
      acceptor311e-6
      donor310100
    • Weighted Degree Kernel

      sigmoid_a  0.75
      sigmoid_b  -0.9375
      alpha      1.7
      used model parameters (may differ from above)
                C  degree
      acceptor  2  3
      donor     1  3
    • Locality Improved Kernel

      sigmoid_a  0.75
      sigmoid_b  -0.75
      alpha      1.0
      used model parameters (may differ from above!)
      degree width C
      acceptor4152
      donor3105
  6. Implementation of the WD Kernel

    Download Implementation wd_kernel.cpp
    Please note that the Shogun toolbox contains an easy-to-use version of this kernel.
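
For orientation, a naive Python sketch of the Weighted Degree kernel as it is commonly defined: for two sequences of equal length, it counts the substrings of length d = 1..D that match at the same position, weighting order d by beta_d = 2(D - d + 1) / (D(D + 1)). This is not taken from wd_kernel.cpp, which may differ in weighting, normalization and efficiency; the function name `wd_kernel` is hypothetical.

```python
def wd_kernel(x: str, y: str, degree: int) -> float:
    """Naive Weighted Degree kernel between two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("sequences must have equal length")
    if degree < 1:
        raise ValueError("degree must be at least 1")
    length = len(x)
    norm = degree * (degree + 1)
    total = 0.0
    for d in range(1, degree + 1):
        # Weight for substrings of length d (assumed standard weighting).
        beta = 2.0 * (degree - d + 1) / norm
        # Count positions where the length-d substrings of x and y agree.
        matches = sum(
            1 for pos in range(length - d + 1) if x[pos:pos + d] == y[pos:pos + d]
        )
        total += beta * matches
    return total
```

This version costs on the order of L * D^2 character comparisons per kernel evaluation, which is fine for illustration but far slower than an optimized implementation.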
