[mira_talk] How does Mira determine quality scores?

Bastien,

I've run some tests on 454 assemblies from both Mira and Newbler and have
concluded that the quality scores attributed to homopolymers are very
different depending on the source, even within the same genome. For example,
in a homopolymer in a consensus sequence, Newbler quality scores are nearly
always the same in the neighboring bases and throughout the homopolymer
itself, except for its last base, which has a very low score compared to the
rest of the bases in the homopolymer. Presumably, this is because the length
of the homopolymer is not certain (the reads do not agree), so only the
last base is in doubt as to whether it should be there at all.

In Mira assemblies, however, all bases in a homopolymer have varying quality
scores, none of which are very low, and typically, bases in (at least) long
homopolymers have a lower average score than those surrounding the
homopolymer, meaning it constitutes a considerable "drop" in the quality
scores. To me, the Newbler quality scores in homopolymers seem to make more
sense than the Mira ones, since what we're uncertain about is the number of
bases in the homopolymer. Since it doesn't matter which base we remove
within the homopolymer, the low quality score might as well be attributed to
the last one. Mira seems to spread out the quality score penalty over each
base in the homopolymer, though I do not believe this is what's actually
happening. :)
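
To make the two patterns concrete, here is a minimal Python sketch of how
the comparison could be summarized per homopolymer. It is not how either
assembler computes its scores. The function names (homopolymer_runs,
summarize_run), the min_len and flank parameters, and the quality values
in the example are all made up for illustration; real values would come
from the consensus quality files of each assembly.

```python
from itertools import groupby


def homopolymer_runs(seq, min_len=3):
    """Yield (start, end) half-open index pairs for every run of a single
    base that is at least min_len long (min_len is an arbitrary cutoff)."""
    pos = 0
    for _, group in groupby(seq):
        length = len(list(group))
        if length >= min_len:
            yield pos, pos + length
        pos += length


def summarize_run(quals, start, end, flank=3):
    """Summarize quality scores inside one run and in the bases next to it."""
    run = quals[start:end]
    flanks = quals[max(0, start - flank):start] + quals[end:end + flank]
    return {
        "run_mean": sum(run) / len(run),
        "run_min": min(run),
        "last_base": run[-1],
        "flank_mean": sum(flanks) / len(flanks) if flanks else None,
    }


# Hypothetical consensus with one 5-base homopolymer (positions 4..8).
consensus = "ACGTAAAAAGCT"

# Invented scores mimicking the two patterns described above.
newbler_like = [40, 40, 40, 40, 40, 40, 40, 40, 12, 40, 40, 40]
mira_like = [40, 40, 40, 40, 31, 28, 30, 27, 29, 40, 40, 40]

for label, quals in (("newbler-like", newbler_like), ("mira-like", mira_like)):
    for start, end in homopolymer_runs(consensus):
        print(label, start, end, summarize_run(quals, start, end))
```

With a summary like this, the Newbler-style pattern shows up as a last_base
value far below run_mean, while the Mira-style pattern shows up as run_mean
below flank_mean with no single very low base.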

I'd like to know why the quality scores are determined so differently by
Mira and Newbler, and also the details on how Mira does it. For example,
does it take homopolymers into special consideration?

Thanks,

David Hesselbom
Research assistant
Molecular Evolution
EBC, Uppsala University
