[mira_talk] Re: MIRA and gap penalty

  • From: Laurent Abi-Rached <Laurent.Abi-Rached@xxxxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Thu, 08 Nov 2012 11:48:38 +0100

Hello Bastien,

Thank you for the detailed answer, I understand now why MIRA generates these alignments!

This 'Sanger legacy' is making it a little more tricky to call SNP in this case however as the signal for the two polymorphisms will be 'diluted': in this example, the individual is heterozygous for these two SNP and only a handful of reads are well aligned for the non-reference allele. So the two SNP may be missed but thanks to your explanations I have some ideas now on how to avoid the problem ('tweaking' the reference sequence is probably the easiest approach).

Best regards,
Laurent


On 11/1/2012 1:08 PM, Bastien Chevreux wrote:
On Oct 31, 2012, at 10:31 , Laurent Abi-Rached <Laurent.Abi-Rached@xxxxxxxxxxx <mailto:Laurent.Abi-Rached@xxxxxxxxxxx>> wrote:
I have a question regarding how gap penalties are set in MIRA: indeed, while conducting a batch of tests, I noticed that MIRA would sometimes favor gaps over mismatches; for example, instead of generating the alignment #1 below with two mismatches and no gaps, MIRA would create alignment #2 with no mismatches but two gaps.

Alignment #1 (expected)
AGCCCCACGTGC CA TGGAGGGACATGGAA
AGCCCCACGTGC AG TGGAGGGACATGGAA

Alignment #2 (generated by MIRA)
AGCCCCACGTG CCA* TGGAGGGACATGGAA
AGCCCCACGTG *CAG TGGAGGGACATGGAA

Hi Laurent,

later on in the mail you also mention that this is a mapping assembly and this is also important to know for the rest of this answer.

First of all: the match/mismatch/gap scores in MIRA are 10 / -10 / -20. There are also a couple of other scores, but these are not important for what we're looking at now. These score cannot be changed by the user. Nowhere, and I don't plan to implement that either. Also, unlike BLAST, gap penalties are at the base quite linear: n gaps means a score of n*(-20). The egp modifier makes the score or longer gaps worse to discourage alignments of repeats which have as difference only a gap stretch..

The way the alignment is implemented means that - in principle, and like foreseen by standard Smith-Waterman alignments - mismatches are favoured over gaps. There is one very slight change to this rule implemented in MIRA: single gaps at the very end of sequences get preferred over mismatches. E.g.: MIRA creates

...TTTTGCCAG...
...TTTTGC*A

instead of

...TTTTGCCAG...
...TTTTGCA

The reason for this behaviour were observations in Sanger times that especially toward the ends of sequences, the base callers often missed a base in the traces. The same applied later on for 454 sequences. So, instead of creating a mismatch, I wrote the aligner to prefer a single gap if it is at the very end of a sequence.

The above considerations are valid both for de-novo and mapping assemblies and for de-novo assemblies they work extremely well. For mapping assemblies with Illumina sequences, this can create some unexpected results which may at first look funny but are nonetheless valid from an alignment perspective.

When mapping Illumina sequences, MIRA takes advantage of the fact that these sequences are *extremely* and *incredibly* reliable (at least if the GC is <= 55 to 60%). Normally, after a basic clipping, between 92% to 95% percent of the clipped sequences have no sequencing error at all. None! And so the mapping algorithm will
- first map all reads with 100% match
- then map all reads with 1 mismatch
- then reads with one gap
- then reads with two mismatches
- then reads with 1 gap and one mismatch
- then reads with 2 gaps
- then all reads with n gaps / mismatches, starting with n=3 and going up to the n=solexa_hack_max_errors parameter

And now you can probably piece together yourself what happens.

Assume your reference and mapped strains are like this:

ref  ..TTTGCC AG CCC..
mut  ..TTTGCC CA CCC..

Invariably, at higher coverages, the following will happen once all 100% matching reads were mapped: the read one error will be mapped. And here's what happens for read "r1":

ref  ..TTTG*CCAGCCC..
r1   ..TTTGCCGA

You see that a gap is inserted in the reference and - due to particularly bad luck - the Smith-Waterman inserted the gap not "where you expected because of the double mutation, but at a mathematically equivalent position somewhat up-/downstream a homopolymer. The damage is now done as subsequently added reads will have to somehow counterbalance this gap. They will and the resulting alignment will look "funny", but in the end the resulting consensus for the mapped data will be correct nonetheless.

This particular alignment happens only in cases like the one you depicted above: two neighbouring mismatches, where at least one of the bases is the same, but shifted by one position, and the other base is part of a homopolymer.


At the moment there is nothing a user can do to prevent this kind of problem. I have a couple of ideas on how to prevent that programmatically, but did not have time to implement it as I shifted this task down the priority list. This is also because once MIRA computes the alignment of the mapped data, things pan out exactly as expected.

Hope this explanation helps you to understand how things work and why implementing different scoring schemes would not help.

Best,
  Bastien

Other related posts: