On Tuesday 23 March 2010 Andy wrote:
> [...]
> Error! The length of read SOLEXA1_0001:1:74:16647:1029#0/1 (101) does not
> match the length given in the SSAHA2 file (101)
> SSAHA2 line: ALIGNMENT::08 27 SOLEXA1_0001:1:74:16647:1029#0/1 pFLC-I 29 2
> 735 762 C 28 100 101
>
> Error! The length of read SOLEXA1_0001:1:74:17068:1020#0/1 (101) does not
> match the length given in the SSAHA2 file (101)
> SSAHA2 line: ALIGNMENT::50 100 SOLEXA1_0001:1:74:17068:1020#0/1 pCMVSPORT6
> 101 2 811 910 C 100 100 101
>
> Error! The length of read SOLEXA1_0001:1:74:18037:1025#0/1 (101) does not
> match the length given in the SSAHA2 file (101)
> SSAHA2 line: ALIGNMENT::14 32 SOLEXA1_0001:1:74:18037:1025#0/1 pFLC-I 33 2
> 739 770 C 32 100 101
>
> Error! The length of read SOLEXA1_0001:1:74:18037:1025#0/1 (101) does not
> match the length given in the SSAHA2 file (101)
> SSAHA2 line: ALIGNMENT::14 27 SOLEXA1_0001:1:74:18037:1025#0/1 pFLC-I 29 2
> 735 762 C 28 100 101
>
> Error! The length of read SOLEXA1_0001:1:74:18070:1023#0/1 (101) does not
> match the length given in the SSAHA2 file (101)
> SSAHA2 line: ALIGNMENT::50 84 SOLEXA1_0001:1:74:18070:1023#0/1 pCMVSPORT6
> 18 101 797 880 F 84 100 101

Uh oh ... I have a bad feeling that something is broken in the logic I
implemented. Just to be sure: can you please look at the reads in question and
tell me whether they start with an 'N'?

> I'm guessing that SSAHA2 thinks that the reads are 101bp long but mira
> thinks that they're 100bp?

Long story short: for aesthetic reasons, mira adds an 'n' in front of most
Solexa reads ... except when there's already an 'n'. And the clipping routine
doesn't account for this special case ... yet. I just need to be sure before I
start fixing this.

> I noticed also that mira is doing some filtering of the Solexa reads, what
> do these mean?
> Solexa: Filter out T (hard) SOLEXA1_0001:1:15:12586:2781#0/1
> Solexa: Filter out T (hard) SOLEXA1_0001:1:15:12588:8448#0/1
> Solexa: Filter out (A hard) SOLEXA1_0001:1:15:12595:11576#0/1
> Solexa: Filter out (A hard) SOLEXA1_0001:1:15:12611:16949#0/1
> Solexa: Filter out T (hard) SOLEXA1_0001:1:15:12628:14575#0/1
> Solexa: Filter out (A hard) SOLEXA1_0001:1:15:12634:12867#0/1

I need to document that.

Hard: a run of 20 consecutive A or 20 consecutive T leads to the read being
discarded. You see this a lot with bad / low-quality Solexa reads.

Soft: same as above, but with a run of 12 bases and the total percentage of
that same base in the complete read >= 80%.

Yes, I know, this sometimes discards good reads, especially at poly-A /
poly-T sites. Then again: not clipping creates all sorts of very interesting
problems I prefer not to have :-)

Regards,
  Bastien

-- 
You have received this mail because you are subscribed to the mira_talk
mailing list. For information on how to subscribe or unsubscribe, please visit
http://www.chevreux.org/mira_mailinglists.html
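
For illustration only, the hard and soft filter rules described above can be
sketched in a few lines of Python. This is not mira's actual code: the
function names (longest_run, poly_filter), the order in which the checks are
applied, and the choice to compute the 80% over the read exactly as passed in
are all assumptions made for this example.

```python
from typing import Optional, Tuple


def longest_run(seq: str, base: str) -> int:
    """Return the length of the longest uninterrupted run of `base` in `seq`."""
    best = current = 0
    for ch in seq.upper():
        current = current + 1 if ch == base else 0
        best = max(best, current)
    return best


def poly_filter(
    seq: str,
    hard_run: int = 20,           # from the mail: 20 consecutive A or T
    soft_run: int = 12,           # from the mail: 12 consecutive bases ...
    soft_fraction: float = 0.80,  # ... and >= 80% of the read is that base
) -> Optional[Tuple[str, str]]:
    """Return (base, "hard" | "soft") if the read would be discarded, else None.

    Hypothetical helper: checking "hard" before "soft", and A before T,
    is an assumption, not something stated in the mail.
    """
    if not seq:
        return None
    upper = seq.upper()

    # Hard rule: a long homopolymer run alone is enough to discard the read.
    for base in ("A", "T"):
        if longest_run(upper, base) >= hard_run:
            return (base, "hard")

    # Soft rule: a shorter run, but the read as a whole is dominated by that base.
    for base in ("A", "T"):
        dominated = upper.count(base) / len(upper) >= soft_fraction
        if longest_run(upper, base) >= soft_run and dominated:
            return (base, "soft")

    return None
```

A read such as 20 A followed by mixed sequence would be reported as
("A", "hard"), while a read with a 12-base A run but only a small overall A
fraction passes the filter, which matches the behaviour described above.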