MiTie: Simultaneous RNA-Seq-based Transcript Identification and Quantification in Multiple Samples
High-throughput sequencing of mRNA (RNA-Seq) has led to tremendous improvements in detection of expressed genes and reconstruction of RNA transcripts. However, the extensive dynamic range of gene expression, technical limitations and biases, as well as the observed complexity of the transcriptional landscape pose profound computational challenges.
We propose a novel framework (MiTie) for simultaneous transcript reconstruction and quantification. We define a likelihood function based on the negative binomial distribution, use a regularization approach to select a few transcripts collectively explaining the observed read data, and show how to find the optimal solution using Mixed Integer Programming. MiTie can a) take advantage of known transcripts, b) reconstruct and quantify transcripts simultaneously in multiple samples, as well as c) resolve the location of multi-mapping reads. It is designed for genome- and assembly-based transcriptome reconstruction.
We present an extensive study based on realistic, simulated RNA-Seq data and compare MiTie with state-of-the-art approaches: it proves to be significantly more sensitive and overall more accurate. Moreover, MiTie yields substantial performance gains when used with multiple samples. We applied our system to 38 fruit fly modENCODE RNA-Seq libraries and estimated the sensitivity of reconstructing omitted transcript annotations and the specificity w.r.t. annotated transcripts. Our results corroborate that a probabilistically motivated objective paired with appropriate optimization techniques leads to significant improvements over the state-of-the-art in transcriptome reconstruction.
- The system has now been ported to C++, although not all features are available yet.
- MiTie now supports the glpk solver. This is significantly slower than CPLEX, but we are currently working on a speedup, which should soon solve this issue.
- We now provide a configure script to make the setup more convenient.
You can check out the code from GitHub: git@github.com:ratschlab/MiTie.git
The code is now written in C++ with shell script wrappers. We convert the transcript prediction task into a general mixed integer optimization problem and use third-party solvers to solve it. We currently support CPLEX and glpk. Because of its significantly better runtime, we recommend CPLEX if you have many samples or many complex splice structures. CPLEX is proprietary software, but free academic licenses can be obtained.
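To give a feeling for what such an optimization problem looks like, here is a minimal Python sketch. It is not MiTie's actual objective or solver: the segment counts, the candidate transcripts, the abundance grid, the dispersion `r`, and the penalty `lam` are all made-up assumptions, and an exhaustive search over a tiny grid stands in for a real mixed integer solver such as CPLEX or glpk. The sketch scores each combination of transcript abundances by a negative binomial negative log-likelihood plus a penalty for every transcript that is used, and keeps the best one.

```python
import itertools
import math

def nb_log_pmf(k, mu, r):
    """Log-probability of count k under a negative binomial with mean mu and dispersion r."""
    return (math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
            + r * math.log(r / (r + mu)) + k * math.log(mu / (r + mu)))

def objective(counts, transcripts, abundances, r=100.0, lam=1.0, eps=1e-3):
    """Negative log-likelihood of the segment counts plus a per-transcript penalty."""
    nll = 0.0
    for s, k in enumerate(counts):
        # Expected coverage of segment s: sum of abundances of transcripts covering it.
        # eps is an assumed small background level that keeps the mean positive.
        mu = eps + sum(a for t, a in zip(transcripts, abundances) if t[s])
        nll -= nb_log_pmf(k, mu, r)
    used = sum(1 for a in abundances if a > 0)  # number of selected transcripts
    return nll + lam * used

def best_selection(counts, transcripts, grid=(0, 10, 20, 30, 40)):
    """Exhaustive search over a small abundance grid (stand-in for a MIP solver)."""
    return min(itertools.product(grid, repeat=len(transcripts)),
               key=lambda ab: objective(counts, transcripts, ab))

# Toy example (all values are illustrative): three segments, three candidates.
transcripts = [(1, 1, 1),   # full-length transcript
               (1, 0, 1),   # skips the middle segment
               (0, 1, 1)]   # starts at the middle segment
counts = [40, 20, 40]
best = best_selection(counts, transcripts)
```

In this toy case the search selects the first two candidates with abundance 20 each and drops the third: that combination reproduces the observed counts exactly, and the improvement in likelihood over any single-transcript explanation outweighs the extra penalty. A real solver handles far larger candidate sets than enumeration can.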
Change into the src directory and type
mitie_example.sh demonstrates two strategies (with and without a genome annotation) to run MiTie on a tiny toy data set.
MiTie stores splice graphs in an HDF5 file. You can run the transcript_prediction for each graph in that file in parallel.
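Because each graph is processed independently, the per-graph runs can be dispatched by any job scheduler or worker pool. The following Python sketch only illustrates that pattern: `predict_graph` is a hypothetical placeholder rather than part of MiTie, and the assumption that graphs are numbered from 1 is ours. A real worker would launch the prediction for one graph at that point.

```python
from concurrent.futures import ThreadPoolExecutor

def predict_graph(graph_id):
    # Hypothetical placeholder: a real worker would start the transcript
    # prediction for this one splice graph here (e.g. via a shell wrapper).
    return f"graph {graph_id}: done"

def predict_all(num_graphs, workers=4):
    # Assumption: graphs are identified by the integers 1..num_graphs.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so results line up with graph IDs.
        return list(pool.map(predict_graph, range(1, num_graphs + 1)))

results = predict_all(3)
```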
We are happy to hear about your suggestions to improve the software. If you encounter any difficulties please let us know. We are also happy to implement additional features, if they are of general interest.
Alignment files in BAM format for all simulated samples and the transcript annotation we used for prediction (in GTF format) can be downloaded from: http://cbio.mskcc.org/public/raetschlab/software/mitie