Margin-based Prediction of Polymorphic Regions

Article and citation

The paper is available from the journal website. The supplementary material is available for download here.

To cite polymorphic region data please refer to:

Georg Zeller, Richard M Clark, Korbinian Schneeberger, Anja Bohlen, Detlef Weigel and Gunnar Rätsch (2008) Detecting Polymorphic Regions in the Arabidopsis thaliana Genome with Resequencing Microarrays. Genome Res. 18: 918-929.

Summary

In a previous project, we identified single nucleotide polymorphisms (SNPs) as the most common form of natural sequence variation in the model plant Arabidopsis thaliana, using whole-genome resequencing with high-density oligonucleotide arrays [1]. On these arrays, hybridization signals from nearly one billion features were measured for each of 20 wild strains (accessions) of A. thaliana, including the reference accession Col-0 (whose known genome sequence is about 125 Mb).

Recall for SNPs is typically high in regions of low to moderate polymorphism density. In regions of clustered SNPs, however, which are often accompanied by indels, neighboring polymorphisms (at a distance of <25 bp) disrupt the signal used for SNP detection. Regions with very few SNP calls can therefore indicate either high similarity to the reference or densely clustered polymorphisms that went mostly undetected [1].
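To illustrate the proximity criterion, the following minimal sketch flags polymorphisms that have a neighbor closer than 25 bp. The function name flag_clustered and the example positions are hypothetical and are not part of the mPPR software or of the data in [1].

```python
def flag_clustered(positions, min_dist=25):
    """Pair each polymorphism position with True if a neighbor lies closer than min_dist bp."""
    pos = sorted(set(positions))
    flagged = []
    for i, p in enumerate(pos):
        # Compare with the nearest polymorphism on each side, if there is one.
        near_left = i > 0 and p - pos[i - 1] < min_dist
        near_right = i + 1 < len(pos) and pos[i + 1] - p < min_dist
        flagged.append((p, near_left or near_right))
    return flagged


# Hypothetical positions (bp) on one chromosome.
example = flag_clustered([100, 110, 200, 225, 400])
```

In this example only the positions 100 and 110 are flagged. The pair 200 and 225 is exactly 25 bp apart and therefore does not meet the <25 bp criterion.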

We developed a machine learning method, margin-based prediction of polymorphic regions (mPPR), to reliably recognize the pattern of suppressed intensity that results from clustered polymorphisms or deletions. Our label sequence learning algorithm is an extension of Hidden Markov Support Vector Machines (HM SVMs) [2], which are conceptually similar to Hidden Markov Models but are trained with discriminative learning techniques inspired by SVMs.
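Label sequence models of this kind predict by finding the highest-scoring label path with dynamic programming, as Hidden Markov Models do. The sketch below is a generic Viterbi decoder over two labels and is not the mPPR implementation. The state names, the integer scores, the threshold of 5 and the switching penalty of -2 are all assumptions made for illustration. In a trained model the scores would be learned from data.

```python
def viterbi(emission, transition, states):
    """Return the highest-scoring label path.

    emission:   one dict per position, mapping state -> score
    transition: dict mapping (previous_state, current_state) -> score
    """
    best = {s: emission[0][s] for s in states}
    back = []
    for scores in emission[1:]:
        new_best, pointers = {}, {}
        for cur in states:
            # Best predecessor for reaching `cur` at this position.
            prev = max(states, key=lambda p: best[p] + transition[(p, cur)])
            new_best[cur] = best[prev] + transition[(prev, cur)] + scores[cur]
            pointers[cur] = prev
        best = new_best
        back.append(pointers)
    # Trace the best path back from the best final state.
    path = [max(states, key=lambda s: best[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


STATES = ("conserved", "polymorphic")
# Hypothetical probe intensities; low values mimic suppressed hybridization.
intensities = [10, 9, 2, 1, 3, 10]
# Hypothetical scoring: above 5 favors "conserved", below favors "polymorphic".
emission = [{"conserved": x - 5, "polymorphic": 5 - x} for x in intensities]
# Hypothetical penalty of -2 for switching labels, 0 for staying.
transition = {(a, b): (0 if a == b else -2) for a in STATES for b in STATES}

labels = viterbi(emission, transition, STATES)
```

With these scores, the three low-intensity positions in the middle are labeled polymorphic and the flanking positions conserved.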

On the genomic scale we detected between 240,000 and 361,000 polymorphic regions per accession comprising between 5.3% and 8.5% of the genome. For these predictions we estimated a false discovery rate of <10% and a sensitivity of 55%.
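For reference, the two quality measures quoted above follow the standard definitions sketched below. The counts are hypothetical and were chosen only so that the arithmetic is easy to follow. They are not the evaluation counts from the paper, and the evaluation unit (region or nucleotide) is left open here.

```python
def false_discovery_rate(true_pos, false_pos):
    """Fraction of predictions that are wrong: FP / (TP + FP)."""
    return false_pos / (true_pos + false_pos)


def sensitivity(true_pos, false_neg):
    """Fraction of true cases that are predicted: TP / (TP + FN)."""
    return true_pos / (true_pos + false_neg)


# Hypothetical counts, for illustration only.
tp, fp, fn = 55, 5, 45
fdr = false_discovery_rate(tp, fp)   # 5 / 60
sens = sensitivity(tp, fn)           # 55 / 100
```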

Polymorphism Levels:

Accession	Number of PRs	% genome in PRs
Bay-0	271644	6.3
Bor-4	276256	6.1
Br-0	276913	6.5
Bur-0	284143	6.6
C24	293558	6.7
Cvi-0	361184	8.5
Est-1	240538	5.3
Fei-0	277788	6.4
GOT-7	284596	6.5
Ler-1	302450	7.0
Lov-5	320648	7.3
NFA-8	283544	6.5
RRS-10	260721	5.9
RRS-7	275700	6.3
Shakhdara	304471	7.4
TAMM-2	307564	7.2
Ts-1	303340	7.0
Tsu-1	272438	6.2
Van-0	281600	6.6
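The table can be summarized with a few lines of code. The sketch below copies the values from the table into a string, finds the accessions with the fewest and the most polymorphic regions (PRs), and estimates the sequence covered by PRs. The estimate assumes a genome size of 125 Mb, which is the approximate figure given in the summary, so the results are rough.

```python
TABLE = """\
Bay-0 271644 6.3
Bor-4 276256 6.1
Br-0 276913 6.5
Bur-0 284143 6.6
C24 293558 6.7
Cvi-0 361184 8.5
Est-1 240538 5.3
Fei-0 277788 6.4
GOT-7 284596 6.5
Ler-1 302450 7.0
Lov-5 320648 7.3
NFA-8 283544 6.5
RRS-10 260721 5.9
RRS-7 275700 6.3
Shakhdara 304471 7.4
TAMM-2 307564 7.2
Ts-1 303340 7.0
Tsu-1 272438 6.2
Van-0 281600 6.6
"""

GENOME_BP = 125_000_000  # assumption: approximate Col-0 genome size

# Each row: (accession, number of PRs, % of genome in PRs).
rows = [
    (name, int(count), float(pct))
    for name, count, pct in (line.split() for line in TABLE.splitlines())
]

fewest = min(rows, key=lambda r: r[1])
most = max(rows, key=lambda r: r[1])

# Rough estimate of the sequence covered by PRs, in bp.
approx_bp_in_prs = {name: GENOME_BP * pct / 100 for name, _, pct in rows}

print(f"fewest PRs: {fewest[0]} ({fewest[1]}), most PRs: {most[0]} ({most[1]})")
```

Est-1 and Cvi-0 mark the two ends of the range, in agreement with the figures quoted in the summary.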

Visualization

Polymorphic region predictions are visualized in a Generic Genome Browser that also displays SNP data from [1].

Thanks to NASC, polymorphic region predictions are also available as DAS tracks for the AtEnsemble genome browser.

Available data

Polymorphic region predictions are available for download from the TAIR FTP server.

Available software

The software is available in two versions:

The version that was used to produce the results in the paper. It can be downloaded here: http://raetschlab.org/suppl/mppr/mppr-0.1.tar.gz. This version is available for academic use only.
An open source toolbox called HMSVM, with an improved and easier-to-use implementation of the algorithm, available at: http://raetschlab.org/suppl/mppr/hmsvm-0.1.tar.gz.

Contact

Please contact either Georg Zeller or Gunnar Rätsch if you have questions about this software.

References

[1]	Clark, Schweikert, Toomajian, Ossowski, Zeller, Shinn, Warthmann, Hu, Fu, Hinds, Chen, Frazer, Huson, Schoelkopf, Nordborg, Raetsch, Ecker, Weigel. Common sequence polymorphisms shaping genetic diversity in Arabidopsis thaliana. Science, 317(5836):338-342, 2007.

[2]	Tsochantaridis, Joachims, Hofmann, and Altun. Large margin methods for structured and interdependent output variables. J. Mach. Learn. Res., 6:1453-1484, 2005.