[mira_talk] Re: change stringency in Mira mapping assembly

On 11/27/2011 09:08 PM, Bastien Chevreux wrote:
On Nov 27, 2011, at 19:56 , Christoph Hahn wrote:
Actually it worked nicely, but my problem is that there are still lots of gaps 
and the mt genome is far from complete. As the reference is not closely related 
I think I would need to allow more missmatches in the mapping assembly. Is 
there an easy way of doing this? Maybe by not just using the standard 
SOLEXA_SETTINGS?
Ouch. I have the feeling that - due to the "not closely related" - this is not 
as trivial as it sounds. Difference on the nucleotide level is sometimes tricky to get 
right on NGS data: either one is thorough, then things take ages, or one is quick an one 
loses things.

The standard mapping settings of MIRA for Solexa are set to be quite diligent 
(it allows 15% errors), but your application needs a somewhat different 
approach I think.

The following approach is based on grabbing reads with "mirabait" which have a 
31 nucleotide stretch identical to something from your MT.

1. MIRA will have cleaned the Solexa reads as well as it could, it is advisable 
to take these instead of the raw Illumina data. For this, in the directory with 
your mapping assembly, grab the file *_assembly/*_chkpt/readpool.maf and 
convert it back to FASTQ, performing a hard clip:  convert_project -f maf -t 
FASTQ -C readpool.maf mynewl8data

2. Grab the result from your mapping and convert the consensus from your strain to FASTA. 
That is extract your strain specific sequence from the resulting assembly 
(convert_project -f maf -t fasta -A "SOLEXA_SETTINGS -CO:fnicpst=yes" 
salarismtlane8_out.maf iteration1)

3. pick the iteration1_salarismtlane8.fasta, that is the current reference

4. each stretch of "X" or "@" in the FASTA above should be replace by a stretch of, say, 300 
"N". Do that with a script. Name it whatever you like, I'll use "ref_iter1_backbone_in.fasta"

5. now bait out all reads from the totality of your Illumina reads which seem 
to match that consensus:
        mirabait -k 31 -n 1 ref_iter1_backbone_in.fasta mynewl8data.fastq 
mymtreads_iteration1.fastq

5a. bonus step: if you have paired end: write a script which looks through 
"mymtreads_iteration1.fastq" and grabs all pairs from "mynewl8data.fastq" to form a new 
"mymtreads_iteration1.fastq"

6. map the reads "mymtreads_iteration1.fastq" to "ref_iter1_backbone_in.fasta" like you 
did initially for all reads. To save some time, you can switch off all MIRA clipping (these reads were 
already clipped!) via "--noclipping -CL:pec=no"

7. rinse, repeat, go back to step 2 (Two! not one) until you are satisfied. 
Satisfied means: the number of reads chosen by mirabait in a given iteration is 
not significantly higher than in the previous iteration.

The above works because MIRA, in each mapping, extend a bit the ends of the 
reference into stretches containing Ns with reads mapping to the refernce. The 
result should be a FASTQ file containing only reads from the mitochondrium.

Those you can then assemble de-novo (or map with very weak alignment 
parameters) using MIRA or any other assembler.

IMPORTANT NOTE: this approach will fail in cases where parts of the MT DNA have 
been integrated into the genome. More specifically: any 31mer identical in the 
MT DNA and genome DNA will break the approach and one will have to resort to 
more complex filtering.

Hope that helps,
   Bastien

Dear Bastien and Mira users,

I am using the brilliant approach described above and I think I am making quite some progress. If you dont mind I have a few more questions, nevertheless:

-) After your step 2 (convert_project -f maf -t fasta -A "SOLEXA_SETTINGS -CO:fnicpst=yes" derjavinoidesmtlane8_out.maf iteration1) I get a whole bunch of files:

iteration1_AllStrains.padded.fasta
iteration1_AllStrains.padded.fasta.qual
iteration1_AllStrains.unpadded.fasta
iteration1_AllStrains.unpadded.fasta.qual
iteration1_default.padded.fasta
iteration1_default.padded.fasta.qual
iteration1_default.unpadded.fasta
iteration1_default.unpadded.fasta.qual
iteration1_derjavinoidesmtlane8.padded.fasta
iteration1_derjavinoidesmtlane8.padded.fasta.qual
iteration1_derjavinoidesmtlane8.unpadded.fasta
iteration1_derjavinoidesmtlane8.unpadded.fasta.qual
iteration1_derjavinoides_mt.padded.fasta
iteration1_derjavinoides_mt.padded.fasta.qual
iteration1_derjavinoides_mt.unpadded.fasta
iteration1_derjavinoides_mt.unpadded.fasta.qual

I was continuing with the iteration1_derjavinoidesmtlane8.padded.fasta, although I am not sure if it would maybe be safer to continue with the unpadded file. What confuses me are all the other files. What are they? THey are obviously all some variants of the reference..

-) the file with the trimmed reads that I obtained from the first mapping attempt with Mira (mynewl8data.fastq) as well as the file I get from mirabait (mymtreads_iteration1.fastq) both start with the reference and are then followed by several sequences (header e.g. @rr_####50####) before the actual reads. Apparently these @rr_#### sequences are all part of the reference.. what exactly is it?

-) Also I tried to use mirabait to identify reads that map onto the the sequence of the host organism, but unfortunately it seems as if the reference sequences are too long. Is there a way of dealing with this, apart from cutting the reference in smaller bits? This is the error message I get: "Read gi|354459049|gb|AGKD01000001.1| is 194200 bp long and thus longer than MAXREADSIZEALLOWED (29900) bases. Skim cannot handle than, sorry."

Thanks for your help!

much obliged!
Christoph

--
You have received this mail because you are subscribed to the mira_talk mailing 
list. For information on how to subscribe or unsubscribe, please visit 
http://www.chevreux.org/mira_mailinglists.html

Other related posts: