On Mittwoch 29 Juli 2009 David Hesselbom wrote: > I've run some tests on 454 assemlies from both Mira and Newbler and have > concluded that the quality scores attributed to homopolymers are very > different depending on the source, even within the same genome. For > example, in a homopolymer in a consensus sequence, Newbler quality scores > are nearly always the same in the neighboring bases and throughout the > homopolymer itself, except for its last base, which has a very low score > compared to the rest of the bases in the homopolymer. Supposedly, this is > because the length of the homopolymer is not certain (the reads do not > agree), but it's only the last of the bases that is uncertain whether it > should be there or not. Hello David, I actually have no idea exactly how Newbler calculates homopolymer score, but from observation I think that once assembled (and when SFFs are available), Newbler looks again at the raw intensities (adding them?) to reasses the length of the homopolymer. It then probably assigns a specially calculated probability to the last base in the run. > In Mira assemblies, however, all bases in a homopolymer have varying > quality scores, none of which are very low, and typically, bases in (at > least) long homopolymers have a lower average score than those surrounding > the homopolymer, meaning it constitutes a considerable "drop" in the > quality scores. To me, the Newbler quality scores in homopolymers seem to > make more sense than the Mira ones, since what we're uncertain about is the > number of bases in the homopolymer. Since it doesn't matter which base we > remove within the homopolymer, the low quality score might as well be > attributed to the last one. Mira seems to spread out the quality score > penalty over each base in the homopolymer, though I do not believe this is > what's actually happening. :) > I'd like to know why the quality scores are determined so differently by > Mira and Newbler, and also the details on how Mira does it. For example, > does it take homopolymers into special consideration? First, please have a look at //www.freelists.org/post/mira_talk/Quality- Values,4 which describes the current algorithm for calculating the quality of a single consensus base. It's a faster but almost equivalent one to the one described in the MIRA thesis (pp. 74 "Consensus and consensus quality computation") and also takes into account different sequencing types. Back to 454. Please remember that phred quality scores are defined to be log transformed probability scores for each and every base ... on it's own! Alas, the quality scores calculated by the 454 software for homopolymers are, especially after the 3rd base in a homopolymer run, not really 2phred" quality bases. You probably have seen that they typically start in the higher 30s but quickly come down single digit numbers. Question is: what should an assembler do with that? I think that the easiest way to show what happens is to give an example. I'll take two reads for that. One central pillar of the quality calculation in MIRA is the rule that independent observations of a base confirm this base better that non- independent observations. When a base was read from both directions, one can assume independence of observations: it's not the whole truth, but close enough. As a side note: observing a base with different sequencing technologies also constitutes independent observations. In 454 data, we're almost always able to work (at each consensus base) with at least one forward and one reverse strand, so I'll just show this. When independence of the two observations is given, the error probabilities for these two observations can be multiplied to give an error probability for the two observations combined. Assume two perfectly aligning reads with a homopolymer of length 7: -> ... C G T A A A A A A A C G T ... <- ... C G T A A A A A A A C G T ... Here are typical 454 qualities for the reads shown above: -> ... 36 32 35 37 32 17 12 5 3 2 40 38 39 ... <- ... 36 32 39 1 3 4 13 27 36 38 38 36 37 ... The consensus qualities which are now calculated by MIRA look like this (the homopolymer bases are underlined) -- ... 72 64 74 38 35 21 25 32 39 40 78 74 76 ... ******************** This is the 'drop' you observe in the quality data of the consensus at 454 homopolymer sites: from the high 60s and 70s in normal sequencing data to high 20s and 30s in the homopolymer region. I know the situation is not ideal but it's the best I could come up with the quality scores 454 provides in homopolymers. If you have a better idea, please feel free to discuss. I'm always on the lookout for new things to try :-) Regards, Bastien -- You have received this mail because you are subscribed to the mira_talk mailing list. For information on how to subscribe or unsubscribe, please visit http://www.chevreux.org/mira_mailinglists.html