[mira_talk] Re: change stringency in Mira mapping assembly

  • From: Bastien Chevreux <bach@xxxxxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Sun, 27 Nov 2011 21:08:27 +0100

On Nov 27, 2011, at 19:56 , Christoph Hahn wrote:
> Actually it worked nicely, but my problem is that there are still lots of 
> gaps and the mt genome is far from complete. As the reference is not closely 
> related I think I would need to allow more missmatches in the mapping 
> assembly. Is there an easy way of doing this? Maybe by not just using the 
> standard SOLEXA_SETTINGS?

Ouch. I have the feeling that - due to the "not closely related" - this is not 
as trivial as it sounds. Difference on the nucleotide level is sometimes tricky 
to get right on NGS data: either one is thorough, then things take ages, or one 
is quick an one loses things.

The standard mapping settings of MIRA for Solexa are set to be quite diligent 
(it allows 15% errors), but your application needs a somewhat different 
approach I think.

The following approach is based on grabbing reads with "mirabait" which have a 
31 nucleotide stretch identical to something from your MT.

1. MIRA will have cleaned the Solexa reads as well as it could, it is advisable 
to take these instead of the raw Illumina data. For this, in the directory with 
your mapping assembly, grab the file *_assembly/*_chkpt/readpool.maf and 
convert it back to FASTQ, performing a hard clip:  convert_project -f maf -t 
FASTQ -C readpool.maf mynewl8data

2. Grab the result from your mapping and convert the consensus from your strain 
to FASTA. That is extract your strain specific sequence from the resulting 
assembly (convert_project -f maf -t fasta -A "SOLEXA_SETTINGS -CO:fnicpst=yes" 
salarismtlane8_out.maf iteration1)

3. pick the iteration1_salarismtlane8.fasta, that is the current reference

4. each stretch of "X" or "@" in the FASTA above should be replace by a stretch 
of, say, 300 "N". Do that with a script. Name it whatever you like, I'll use 
"ref_iter1_backbone_in.fasta"

5. now bait out all reads from the totality of your Illumina reads which seem 
to match that consensus: 
       mirabait -k 31 -n 1 ref_iter1_backbone_in.fasta mynewl8data.fastq 
mymtreads_iteration1.fastq

5a. bonus step: if you have paired end: write a script which looks through 
"mymtreads_iteration1.fastq" and grabs all pairs from "mynewl8data.fastq" to 
form a new "mymtreads_iteration1.fastq"

6. map the reads "mymtreads_iteration1.fastq" to "ref_iter1_backbone_in.fasta" 
like you did initially for all reads. To save some time, you can switch off all 
MIRA clipping (these reads were already clipped!) via "--noclipping -CL:pec=no"

7. rinse, repeat, go back to step 2 (Two! not one) until you are satisfied. 
Satisfied means: the number of reads chosen by mirabait in a given iteration is 
not significantly higher than in the previous iteration.

The above works because MIRA, in each mapping, extend a bit the ends of the 
reference into stretches containing Ns with reads mapping to the refernce. The 
result should be a FASTQ file containing only reads from the mitochondrium.

Those you can then assemble de-novo (or map with very weak alignment 
parameters) using MIRA or any other assembler.

IMPORTANT NOTE: this approach will fail in cases where parts of the MT DNA have 
been integrated into the genome. More specifically: any 31mer identical in the 
MT DNA and genome DNA will break the approach and one will have to resort to 
more complex filtering.

Hope that helps,
  Bastien
--
You have received this mail because you are subscribed to the mira_talk mailing 
list. For information on how to subscribe or unsubscribe, please visit 
http://www.chevreux.org/mira_mailinglists.html

Other related posts: