[mira_talk] Re: MIRA and gap penalty

From: Laurent Abi-Rached <Laurent.Abi-Rached@xxxxxxxxxxx>
To: mira_talk@xxxxxxxxxxxxx
Date: Thu, 08 Nov 2012 11:48:38 +0100

Hello Bastien,

Thank you for the detailed answer, I understand now why MIRA generatesthese alignments!

This 'Sanger legacy' is making it a little more tricky to call SNP inthis case however as the signal for the two polymorphisms will be'diluted': in this example, the individual is heterozygous for these twoSNP and only a handful of reads are well aligned for the non-referenceallele. So the two SNP may be missed but thanks to your explanations Ihave some ideas now on how to avoid the problem ('tweaking' thereference sequence is probably the easiest approach).


Best regards,
Laurent


On 11/1/2012 1:08 PM, Bastien Chevreux wrote:

On Oct 31, 2012, at 10:31 , Laurent Abi-Rached<Laurent.Abi-Rached@xxxxxxxxxxx<mailto:Laurent.Abi-Rached@xxxxxxxxxxx>> wrote:
I have a question regarding how gap penalties are set in MIRA:indeed, while conducting a batch of tests, I noticed that MIRA wouldsometimes favor gaps over mismatches; for example, instead ofgenerating the alignment #1 below with two mismatches and no gaps,MIRA would create alignment #2 with no mismatches but two gaps.
Alignment #1 (expected)
AGCCCCACGTGC CA TGGAGGGACATGGAA
AGCCCCACGTGC AG TGGAGGGACATGGAA

Alignment #2 (generated by MIRA)
AGCCCCACGTG CCA* TGGAGGGACATGGAA
AGCCCCACGTG *CAG TGGAGGGACATGGAA
Hi Laurent,
later on in the mail you also mention that this is a mapping assemblyand this is also important to know for the rest of this answer.
First of all: the match/mismatch/gap scores in MIRA are 10 / -10 /-20. There are also a couple of other scores, but these are notimportant for what we're looking at now. These score cannot be changedby the user. Nowhere, and I don't plan to implement that either. Also,unlike BLAST, gap penalties are at the base quite linear: n gaps meansa score of n*(-20). The egp modifier makes the score or longer gapsworse to discourage alignments of repeats which have as differenceonly a gap stretch..
The way the alignment is implemented means that - in principle, andlike foreseen by standard Smith-Waterman alignments - mismatches arefavoured over gaps. There is one very slight change to this ruleimplemented in MIRA: single gaps at the very end of sequences getpreferred over mismatches. E.g.: MIRA creates
...TTTTGCCAG...
...TTTTGC*A

instead of

...TTTTGCCAG...
...TTTTGCA
The reason for this behaviour were observations in Sanger times thatespecially toward the ends of sequences, the base callers often misseda base in the traces. The same applied later on for 454 sequences. So,instead of creating a mismatch, I wrote the aligner to prefer a singlegap if it is at the very end of a sequence.
The above considerations are valid both for de-novo and mappingassemblies and for de-novo assemblies they work extremely well. Formapping assemblies with Illumina sequences, this can create someunexpected results which may at first look funny but are nonethelessvalid from an alignment perspective.
When mapping Illumina sequences, MIRA takes advantage of the fact thatthese sequences are *extremely* and *incredibly* reliable (at least ifthe GC is <= 55 to 60%). Normally, after a basic clipping, between 92%to 95% percent of the clipped sequences have no sequencing error atall. None! And so the mapping algorithm will
- first map all reads with 100% match
- then map all reads with 1 mismatch
- then reads with one gap
- then reads with two mismatches
- then reads with 1 gap and one mismatch
- then reads with 2 gaps
- then all reads with n gaps / mismatches, starting with n=3 and goingup to the n=solexa_hack_max_errors parameter
And now you can probably piece together yourself what happens.

Assume your reference and mapped strains are like this:

ref  ..TTTGCC AG CCC..
mut  ..TTTGCC CA CCC..
Invariably, at higher coverages, the following will happen once all100% matching reads were mapped: the read one error will be mapped.And here's what happens for read "r1":
ref  ..TTTG*CCAGCCC..
r1   ..TTTGCCGA
You see that a gap is inserted in the reference and - due toparticularly bad luck - the Smith-Waterman inserted the gap not "whereyou expected because of the double mutation, but at a mathematicallyequivalent position somewhat up-/downstream a homopolymer. The damageis now done as subsequently added reads will have to somehowcounterbalance this gap. They will and the resulting alignment willlook "funny", but in the end the resulting consensus for the mappeddata will be correct nonetheless.
This particular alignment happens only in cases like the one youdepicted above: two neighbouring mismatches, where at least one of thebases is the same, but shifted by one position, and the other base ispart of a homopolymer.
At the moment there is nothing a user can do to prevent this kind ofproblem. I have a couple of ideas on how to prevent thatprogrammatically, but did not have time to implement it as I shiftedthis task down the priority list. This is also because once MIRAcomputes the alignment of the mapped data, things pan out exactly asexpected.
Hope this explanation helps you to understand how things work and whyimplementing different scoring schemes would not help.
Best,
  Bastien

References:
- [mira_talk] Re: MIRA and gap penalty
  - From: Bastien Chevreux

[mira_talk] Re: MIRA and gap penalty

Other related posts: