[mira_talk] Re: How does Mira determine quality scores?

From: Bastien Chevreux <bach@xxxxxxxxxxxx>
To: mira_talk@xxxxxxxxxxxxx
Date: Wed, 29 Jul 2009 22:34:02 +0200
On Mittwoch 29 Juli 2009 David Hesselbom wrote:
> I've run some tests on 454 assemlies from both Mira and Newbler and have
> concluded that the quality scores attributed to homopolymers are very
> different depending on the source, even within the same genome. For
> example, in a homopolymer in a consensus sequence, Newbler quality scores
> are nearly always the same in the neighboring bases and throughout the
> homopolymer itself, except for its last base, which has a very low score
> compared to the rest of the bases in the homopolymer. Supposedly, this is
> because the length of the homopolymer is not certain (the reads do not
> agree), but it's only the last of the bases that is uncertain whether it
> should be there or not.

Hello David,

I actually have no idea exactly how Newbler calculates homopolymer score, but 
from observation I think that once assembled (and when SFFs are available), 
Newbler looks again at the raw intensities (adding them?) to reasses the 
length of the homopolymer. It then probably assigns a specially calculated 
probability to the last base in the run.

> In Mira assemblies, however, all bases in a homopolymer have varying
> quality scores, none of which are very low, and typically, bases in (at
> least) long homopolymers have a lower average score than those surrounding
> the homopolymer, meaning it constitutes a considerable "drop" in the
> quality scores. To me, the Newbler quality scores in homopolymers seem to
> make more sense than the Mira ones, since what we're uncertain about is the
> number of bases in the homopolymer. Since it doesn't matter which base we
> remove within the homopolymer, the low quality score might as well be
> attributed to the last one. Mira  seems to spread out the quality score
> penalty over each base in the homopolymer, though I do not believe this is
> what's actually happening. :)
> I'd like to know why the quality scores are determined so differently by
> Mira and Newbler, and also the details on how Mira does it. For example,
> does it take homopolymers into special consideration?

First, please have a look at //www.freelists.org/post/mira_talk/Quality-
Values,4 which describes the current algorithm for calculating the quality of 
a single consensus base. It's a faster but almost equivalent one to the one 
described in the MIRA thesis (pp. 74 "Consensus and consensus quality 
computation") and also takes into account different sequencing types.

Back to 454. Please remember that phred quality scores are defined to be log 
transformed probability scores for each and every base ... on it's own! Alas, 
the quality scores calculated by the 454 software for homopolymers are, 
especially after the 3rd base in a homopolymer run, not really 2phred" quality 
bases. You probably have seen that they typically start in the higher 30s but 
quickly come down single digit numbers.

Question is: what should an assembler do with that? I think that the easiest 
way to show what happens is to give an example. I'll take two reads for that.

One central pillar of the quality calculation in MIRA is the rule that 
independent observations of a base confirm this base better that non-
independent observations. When a base was read from both directions, one can 
assume independence of observations: it's not the whole truth, but close 
enough. As a side note: observing a base with different sequencing 
technologies also constitutes independent observations.

In 454 data, we're almost always able to work (at each consensus base) with at 
least one forward and one reverse strand, so I'll just show this.

When independence of the two observations is given, the error probabilities 
for these two observations can be multiplied to give an error probability for 
the two observations combined.

Assume two perfectly aligning reads with a homopolymer of length 7:

-> ...  C  G  T  A  A  A  A  A  A  A  C  G  T ...
<- ...  C  G  T  A  A  A  A  A  A  A  C  G  T ...

Here are typical 454 qualities for the reads shown above:

-> ... 36 32 35 37 32 17 12  5  3  2 40 38 39 ...
<- ... 36 32 39  1  3  4 13 27 36 38 38 36 37 ...

The consensus qualities which are now calculated by MIRA look like this (the 
homopolymer bases are underlined)

-- ... 72 64 74 38 35 21 25 32 39 40 78 74 76 ...
                ********************

This is the 'drop' you observe in the quality data of the consensus at 454 
homopolymer sites: from the high 60s and 70s in normal sequencing data to high 
20s and 30s in the homopolymer region.

I know the situation is not ideal but it's the best I could come up with the 
quality scores 454 provides in homopolymers.

If you have a better idea, please feel free to discuss. I'm always on the 
lookout for new things to try :-)

Regards,
  Bastien


-- 
You have received this mail because you are subscribed to the mira_talk mailing 
list. For information on how to subscribe or unsubscribe, please visit 
http://www.chevreux.org/mira_mailinglists.html
References:
- [mira_talk] How does Mira determine quality scores?
  - From: David Hesselbom
[mira_talk] Re: How does Mira determine quality scores?

Other related posts: