[mira_talk] Re: How does Mira determine quality scores?
- From: Davide Sassera <davide.sassera@xxxxxxxx>
- To: mira_talk@xxxxxxxxxxxxx
- Date: Wed, 29 Jul 2009 15:08:56 +0200
I would like to add something on this topic.
I have found that, in long homopolymers, a few reads containing
"1 more base" often outweigh many more reads with "1 less base" when
the consensus is called.
Manual correction shows that the majority "1 less base" reads are
right, so I have to correct the consensus each time this happens.
Could the problem brought up by David Hesselbom be the reason for this
"bug"?
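
To make the check I do by hand concrete, here is a minimal sketch of a
plain majority vote on homopolymer length. This is only my own
illustration and not how Mira calls the consensus; the function name
and the example read counts are made up.

```python
from collections import Counter

def majority_length(run_lengths):
    """Return the homopolymer length supported by the most reads.

    run_lengths: one integer per read, the run length that read shows.
    Ties are resolved in favour of the shorter length (an arbitrary
    choice for this sketch).
    """
    votes = Counter(run_lengths)
    best = max(votes.values())
    return min(length for length, n in votes.items() if n == best)

# Hypothetical column: 9 reads show 7 bases, 3 reads show 8 bases.
observed = [7] * 9 + [8] * 3
print(majority_length(observed))  # prints 7
```

With counts like these I would expect the consensus to carry 7 bases,
which is what I end up with after manual correction.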
Thanks
Davide
Bastien,
I've run some tests on 454 assemblies from both Mira and Newbler and
have concluded that the quality scores attributed to homopolymers are
very different depending on the source, even within the same genome.
For example, in a homopolymer in a consensus sequence, Newbler quality
scores are nearly always the same in the neighboring bases and
throughout the homopolymer itself, except for its last base, which has
a very low score compared to the rest of the bases in the homopolymer.
Presumably, this is because the length of the homopolymer is not
certain (the reads do not agree), so the only open question is
whether the last base should be there or not.
In Mira assemblies, however, all bases in a homopolymer have varying
quality scores, none of which are very low, and typically, bases in
(at least) long homopolymers have a lower average score than those
surrounding the homopolymer, meaning it constitutes a considerable
"drop" in the quality scores. To me, the Newbler quality scores in
homopolymers seem to make more sense than the Mira ones, since what
we're uncertain about is the number of bases in the homopolymer. Since
it doesn't matter which base we remove within the homopolymer, the low
quality score might as well be attributed to the last one. Mira seems
to spread out the quality score penalty over each base in the
homopolymer, though I do not believe this is what's actually happening. :)
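
The comparison described above can be sketched roughly as follows.
This is only an illustration of how one might summarise consensus
qualities inside and around a homopolymer; it says nothing about how
Mira or Newbler compute them. The function names, the minimum run
length of 4, the flank width of 3 and the example values are all
assumptions.

```python
from itertools import groupby

def homopolymer_runs(seq, min_len=4):
    """Yield (start, end) half-open index pairs of runs of one base."""
    pos = 0
    for _base, group in groupby(seq):
        n = len(list(group))
        if n >= min_len:
            yield pos, pos + n
        pos += n

def summarize_run(quals, start, end, flank=3):
    """Compare qualities inside a run with those of its flanks.

    quals is a list of integer quality scores, one per consensus base.
    """
    inside = quals[start:end]
    flanks = quals[max(0, start - flank):start] + quals[end:end + flank]
    return {
        "mean_inside": sum(inside) / len(inside),
        "mean_flank": sum(flanks) / len(flanks) if flanks else None,
        "min_inside": min(inside),
        "last_base": inside[-1],
    }

# Made-up consensus showing the "only the last base is low" pattern.
seq = "ACGTAAAAAAGCT"
quals = [40, 40, 40, 40, 40, 40, 40, 40, 40, 8, 40, 40, 40]
for start, end in homopolymer_runs(seq):
    print(start, end, summarize_run(quals, start, end))
```

A run where last_base is far below mean_flank but the other bases are
not would match the Newbler-like pattern, while a run where
mean_inside is lower than mean_flank without any single very low base
would match what I see in the Mira assemblies.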
I'd like to know why the quality scores are determined so differently
by Mira and Newbler, and also the details on how Mira does it. For
example, does it take homopolymers into special consideration?
Thanks,
David Hesselbom
Research assistant
Molecular Evolution
EBC, Uppsala University
--
Davide Sassera
Sezione di Patologia Generale e Parassitologia
Dipartimento di Patologia Animale,
Igiene e Sanità Pubblica Veterinaria
Facoltà di Veterinaria
Università degli Studi di Milano
Via Celoria 10, 20133, Milano, ITALY
Tel: +39 0250318094
Fax: +39 0250318095