On 11/27/2011 09:08 PM, Bastien Chevreux wrote:
On Nov 27, 2011, at 19:56 , Christoph Hahn wrote:Actually it worked nicely, but my problem is that there are still lots of gaps and the mt genome is far from complete. As the reference is not closely related I think I would need to allow more missmatches in the mapping assembly. Is there an easy way of doing this? Maybe by not just using the standard SOLEXA_SETTINGS?Ouch. I have the feeling that - due to the "not closely related" - this is not as trivial as it sounds. Difference on the nucleotide level is sometimes tricky to get right on NGS data: either one is thorough, then things take ages, or one is quick an one loses things. The standard mapping settings of MIRA for Solexa are set to be quite diligent (it allows 15% errors), but your application needs a somewhat different approach I think. The following approach is based on grabbing reads with "mirabait" which have a 31 nucleotide stretch identical to something from your MT. 1. MIRA will have cleaned the Solexa reads as well as it could, it is advisable to take these instead of the raw Illumina data. For this, in the directory with your mapping assembly, grab the file *_assembly/*_chkpt/readpool.maf and convert it back to FASTQ, performing a hard clip: convert_project -f maf -t FASTQ -C readpool.maf mynewl8data 2. Grab the result from your mapping and convert the consensus from your strain to FASTA. That is extract your strain specific sequence from the resulting assembly (convert_project -f maf -t fasta -A "SOLEXA_SETTINGS -CO:fnicpst=yes" salarismtlane8_out.maf iteration1) 3. pick the iteration1_salarismtlane8.fasta, that is the current reference 4. each stretch of "X" or "@" in the FASTA above should be replace by a stretch of, say, 300 "N". Do that with a script. Name it whatever you like, I'll use "ref_iter1_backbone_in.fasta" 5. now bait out all reads from the totality of your Illumina reads which seem to match that consensus: mirabait -k 31 -n 1 ref_iter1_backbone_in.fasta mynewl8data.fastq mymtreads_iteration1.fastq 5a. bonus step: if you have paired end: write a script which looks through "mymtreads_iteration1.fastq" and grabs all pairs from "mynewl8data.fastq" to form a new "mymtreads_iteration1.fastq" 6. map the reads "mymtreads_iteration1.fastq" to "ref_iter1_backbone_in.fasta" like you did initially for all reads. To save some time, you can switch off all MIRA clipping (these reads were already clipped!) via "--noclipping -CL:pec=no" 7. rinse, repeat, go back to step 2 (Two! not one) until you are satisfied. Satisfied means: the number of reads chosen by mirabait in a given iteration is not significantly higher than in the previous iteration. The above works because MIRA, in each mapping, extend a bit the ends of the reference into stretches containing Ns with reads mapping to the refernce. The result should be a FASTQ file containing only reads from the mitochondrium. Those you can then assemble de-novo (or map with very weak alignment parameters) using MIRA or any other assembler. IMPORTANT NOTE: this approach will fail in cases where parts of the MT DNA have been integrated into the genome. More specifically: any 31mer identical in the MT DNA and genome DNA will break the approach and one will have to resort to more complex filtering. Hope that helps, Bastien
Dear Bastien and Mira users,I am using the brilliant approach described above and I think I am making quite some progress. If you dont mind I have a few more questions, nevertheless:
-) After your step 2 (convert_project -f maf -t fasta -A "SOLEXA_SETTINGS -CO:fnicpst=yes" derjavinoidesmtlane8_out.maf iteration1) I get a whole bunch of files:
iteration1_AllStrains.padded.fasta iteration1_AllStrains.padded.fasta.qual iteration1_AllStrains.unpadded.fasta iteration1_AllStrains.unpadded.fasta.qual iteration1_default.padded.fasta iteration1_default.padded.fasta.qual iteration1_default.unpadded.fasta iteration1_default.unpadded.fasta.qual iteration1_derjavinoidesmtlane8.padded.fasta iteration1_derjavinoidesmtlane8.padded.fasta.qual iteration1_derjavinoidesmtlane8.unpadded.fasta iteration1_derjavinoidesmtlane8.unpadded.fasta.qual iteration1_derjavinoides_mt.padded.fasta iteration1_derjavinoides_mt.padded.fasta.qual iteration1_derjavinoides_mt.unpadded.fasta iteration1_derjavinoides_mt.unpadded.fasta.qualI was continuing with the iteration1_derjavinoidesmtlane8.padded.fasta, although I am not sure if it would maybe be safer to continue with the unpadded file. What confuses me are all the other files. What are they? THey are obviously all some variants of the reference..
-) the file with the trimmed reads that I obtained from the first mapping attempt with Mira (mynewl8data.fastq) as well as the file I get from mirabait (mymtreads_iteration1.fastq) both start with the reference and are then followed by several sequences (header e.g. @rr_####50####) before the actual reads. Apparently these @rr_#### sequences are all part of the reference.. what exactly is it?
-) Also I tried to use mirabait to identify reads that map onto the the sequence of the host organism, but unfortunately it seems as if the reference sequences are too long. Is there a way of dealing with this, apart from cutting the reference in smaller bits? This is the error message I get: "Read gi|354459049|gb|AGKD01000001.1| is 194200 bp long and thus longer than MAXREADSIZEALLOWED (29900) bases. Skim cannot handle than, sorry."
Thanks for your help! much obliged! Christoph -- You have received this mail because you are subscribed to the mira_talk mailing list. For information on how to subscribe or unsubscribe, please visit http://www.chevreux.org/mira_mailinglists.html