On Nov 27, 2011, at 19:56 , Christoph Hahn wrote: > Actually it worked nicely, but my problem is that there are still lots of > gaps and the mt genome is far from complete. As the reference is not closely > related I think I would need to allow more missmatches in the mapping > assembly. Is there an easy way of doing this? Maybe by not just using the > standard SOLEXA_SETTINGS? Ouch. I have the feeling that - due to the "not closely related" - this is not as trivial as it sounds. Difference on the nucleotide level is sometimes tricky to get right on NGS data: either one is thorough, then things take ages, or one is quick an one loses things. The standard mapping settings of MIRA for Solexa are set to be quite diligent (it allows 15% errors), but your application needs a somewhat different approach I think. The following approach is based on grabbing reads with "mirabait" which have a 31 nucleotide stretch identical to something from your MT. 1. MIRA will have cleaned the Solexa reads as well as it could, it is advisable to take these instead of the raw Illumina data. For this, in the directory with your mapping assembly, grab the file *_assembly/*_chkpt/readpool.maf and convert it back to FASTQ, performing a hard clip: convert_project -f maf -t FASTQ -C readpool.maf mynewl8data 2. Grab the result from your mapping and convert the consensus from your strain to FASTA. That is extract your strain specific sequence from the resulting assembly (convert_project -f maf -t fasta -A "SOLEXA_SETTINGS -CO:fnicpst=yes" salarismtlane8_out.maf iteration1) 3. pick the iteration1_salarismtlane8.fasta, that is the current reference 4. each stretch of "X" or "@" in the FASTA above should be replace by a stretch of, say, 300 "N". Do that with a script. Name it whatever you like, I'll use "ref_iter1_backbone_in.fasta" 5. now bait out all reads from the totality of your Illumina reads which seem to match that consensus: mirabait -k 31 -n 1 ref_iter1_backbone_in.fasta mynewl8data.fastq mymtreads_iteration1.fastq 5a. bonus step: if you have paired end: write a script which looks through "mymtreads_iteration1.fastq" and grabs all pairs from "mynewl8data.fastq" to form a new "mymtreads_iteration1.fastq" 6. map the reads "mymtreads_iteration1.fastq" to "ref_iter1_backbone_in.fasta" like you did initially for all reads. To save some time, you can switch off all MIRA clipping (these reads were already clipped!) via "--noclipping -CL:pec=no" 7. rinse, repeat, go back to step 2 (Two! not one) until you are satisfied. Satisfied means: the number of reads chosen by mirabait in a given iteration is not significantly higher than in the previous iteration. The above works because MIRA, in each mapping, extend a bit the ends of the reference into stretches containing Ns with reads mapping to the refernce. The result should be a FASTQ file containing only reads from the mitochondrium. Those you can then assemble de-novo (or map with very weak alignment parameters) using MIRA or any other assembler. IMPORTANT NOTE: this approach will fail in cases where parts of the MT DNA have been integrated into the genome. More specifically: any 31mer identical in the MT DNA and genome DNA will break the approach and one will have to resort to more complex filtering. Hope that helps, Bastien -- You have received this mail because you are subscribed to the mira_talk mailing list. For information on how to subscribe or unsubscribe, please visit http://www.chevreux.org/mira_mailinglists.html