[mira_talk] Re: bug report for -CO:fnicpst

On Friday 22 May 2009 Byron Knoll wrote:
> There appears to be a bug when forcing non IUPAC tags (-CO:fnicpst). When
> examining assembled contigs, there are several cases where the consensus
> base is clearly not set to the majority vote. For example, I have a column
> with 973Gs, 3As, and 2Cs and the consensus base is A. I am running
> mira_2.9.45_dev_linux-gnu_i686_32.

Hello Byron,

hmmmm, not sure whether it's a bug per se or more a case of "something 
unexpected".

As the man page describes, -CO:fnicpst is not a majority vote per se, but a 
flag for using majority vote when a conflict arises. Now, apparently MIRA saw 
no problem in calling an A at this place, which overturns the G.

There may be several reasons for this. Without seeing the data, I'll just give 
a few off the top of my head:

- the reads with 'G' have no quality values and the default quality of 10 has 
not been changed: this would lead to a consensus quality of 11 or 12 for 'G'. 
Now assume the reads with 'A' do have qualities a lot higher than 10, say 22, 
27 and 29. The 'A' consensus then gets a quality so much higher (around 30 or 
31) than the 'G' that MIRA does not even consider the 'G' a viable call.

- no read has qualities attached (and therefore all bases have the same 
quality), but the reads with 'G' are all in the same direction (either forward 
or reverse). If the reads with 'A' include two in one direction (say, forward) 
and one in the other direction (reverse), then MIRA will give a consensus 
quality of ~11 for the 'G', but a consensus quality of ~22 for the 'A'. Here 
too, it's clear to MIRA that it must be 'A', and the 'G' is not considered.
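
To make the two scenarios above concrete, here is a small Python sketch. It is 
a toy model and not MIRA's actual code: the scoring rule (best quality per 
strand, plus a small capped bonus for additional reads, summed over both 
strands) is an assumption chosen only so that the numbers land roughly in the 
ranges quoted above. The function names are hypothetical, and for the first 
scenario the sketch additionally assumes that all 'G' reads and all 'A' reads 
lie on one strand.

```python
from collections import Counter, defaultdict


def majority_vote(column):
    """Plain majority vote. column is a list of (base, quality, strand)."""
    counts = Counter(base for base, _qual, _strand in column)
    return counts.most_common(1)[0][0]


def toy_group_quality(column):
    """Toy score per base (an assumption, NOT MIRA's real formula):
    per strand, take the best quality plus a bonus of at most 2 for
    additional reads, then add the strand scores together."""
    per_strand = defaultdict(list)
    for base, qual, strand in column:
        per_strand[(base, strand)].append(qual)

    scores = defaultdict(int)
    for (base, _strand), quals in per_strand.items():
        bonus = min(len(quals) - 1, 2)
        scores[base] += max(quals) + bonus
    return dict(scores)


def toy_consensus(column):
    """Call the base with the highest toy score."""
    scores = toy_group_quality(column)
    return max(scores, key=scores.get)


# Scenario 1: 'G' reads carry the default quality 10, 'A' reads have
# real qualities; every read is assumed to be on the forward strand.
case1 = [("G", 10, "+")] * 973 + [("A", 22, "+"), ("A", 27, "+"), ("A", 29, "+")]

# Scenario 2: every base has quality 10, all 'G' reads are forward,
# while the 'A' reads cover both directions.
case2 = [("G", 10, "+")] * 973 + [("A", 10, "+"), ("A", 10, "+"), ("A", 10, "-")]

for name, column in (("case1", case1), ("case2", case2)):
    print(name, majority_vote(column), toy_group_quality(column), toy_consensus(column))
```

In both toy columns the plain majority vote says 'G', while the strand-aware 
score favours 'A', which is the kind of outcome described in the report.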

Please also have a look at 
  http://www.freelists.org/post/mira_talk/Quality-Values,4
where I gave a short roundup on how MIRA currently calculates qualities.

Now, the reason MIRA uses this approach is that for every sequencing 
technology I've worked with so far (Sanger, 454 and Solexa), there are 
sequencing artefacts that can be overcome only when looking at quality values 
and read orientation in an alignment. Looking at coverage alone or quality 
values alone would not be enough to call the "real" base. This strategy fails 
in some cases, most notably when working with sequences without quality 
values.

Does this answer your question or do you think that you have a different case? 
If yes, please tell :-)

Regards,
  Bastien


