Hi Bastien,

On 26 Sep 2009, at 17:20, Bastien Chevreux wrote:
> hrm, let me guess: Newbler 2.x?
Yes, Newbler 2.0.0.20
> What I'd be interested in would be this: do you have some statistics which show the errors broken down by homopolymer length? I strongly suspect that longer homopolymers are more prone to the base calling error, so shifting calling weights according to homopolymer length is probably one possible solution.
It doesn't seem so. Actually, it seems that the absolute number of overcalling errors is constant over hp length, but that the number of undercalls goes down with hp length. This is only for mira; the table gives the number of homopolymers by length (rows) and by length difference with the reference (columns):
length    -2    -1       0     1     2   All
     4     2    80   62502     7     1
     5     2    93   22854     5     2
     6     2   156    7819     7     0
     7     2   110    2596     6     0
     8     1    21     621     7     0
     9     0     0      49     4     0
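To compare lengths on an equal footing, the counts above can be turned into per-length error rates. The following is only a minimal sketch: it is not one of the scripts used for the analysis, the names `DIFFS`, `COUNTS` and `error_rates` are made up for illustration, and it assumes that each row lists the counts for length differences -2 to +2 in that order.

```python
# Counts copied from the figures above: homopolymer length -> number of
# homopolymers with length difference -2, -1, 0, +1, +2 versus the reference.
# Assumption: the five values per row map onto these five columns in order.
DIFFS = (-2, -1, 0, 1, 2)
COUNTS = {
    4: (2, 80, 62502, 7, 1),
    5: (2, 93, 22854, 5, 2),
    6: (2, 156, 7819, 7, 0),
    7: (2, 110, 2596, 6, 0),
    8: (1, 21, 621, 7, 0),
    9: (0, 0, 49, 4, 0),
}


def error_rates(counts):
    """Return {length: (undercall_rate, overcall_rate)} as fractions of the row total."""
    rates = {}
    for length, row in counts.items():
        total = sum(row)
        # Negative differences are undercalls, positive ones are overcalls.
        under = sum(n for d, n in zip(DIFFS, row) if d < 0)
        over = sum(n for d, n in zip(DIFFS, row) if d > 0)
        rates[length] = (under / total, over / total)
    return rates


if __name__ == "__main__":
    print("length\tundercalls\tovercalls")
    for length, (under, over) in error_rates(COUNTS).items():
        print(f"{length}\t{under:.4%}\t{over:.4%}")
```

Dividing by the row total matters here because the number of homopolymers drops sharply with length, so a constant absolute number of errors does not mean a constant error rate.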
> Ideally there would also be statistics which show how many gaps/bases were at each erroneous site, but that might be a bit too much to ask.
I attach a file to this email which contains, for the mira assembly of one of the genomes, each "incorrectly" called HP and every read as it appears in the assembly file. Hope that helps... If you want more, or even the full dataset (including correct calls), or the scripts, let me know...
Lionel

============================================
Lionel Guy
Thunmansgatan 25, SE-75421 Uppsala
phone: +46 (0)18 245596
mobile: +46 (0)73 9760618
email: guy.lionel@xxxxxxxxx
============================================