[mira_talk] Re: SSAHA2 vector screen

  • From: Bastien Chevreux <bach@xxxxxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Tue, 23 Mar 2010 22:00:27 +0100

On Tuesday 23 March 2010 Andy wrote:
> [...]
> Error! The length of read SOLEXA1_0001:1:74:16647:1029#0/1 (101) does not
> match the length given in the SSAHA2 file (101)
> SSAHA2 line: ALIGNMENT::08 27 SOLEXA1_0001:1:74:16647:1029#0/1 pFLC-I 29 2
> 735 762 C 28 100 101
> 
> Error! The length of read SOLEXA1_0001:1:74:17068:1020#0/1 (101) does not
> match the length given in the SSAHA2 file (101)
> SSAHA2 line: ALIGNMENT::50 100 SOLEXA1_0001:1:74:17068:1020#0/1 pCMVSPORT6
> 101 2 811 910 C 100 100 101
> 
> Error! The length of read SOLEXA1_0001:1:74:18037:1025#0/1 (101) does not
> match the length given in the SSAHA2 file (101)
> SSAHA2 line: ALIGNMENT::14 32 SOLEXA1_0001:1:74:18037:1025#0/1 pFLC-I 33 2
> 739 770 C 32 100 101
> 
> Error! The length of read SOLEXA1_0001:1:74:18037:1025#0/1 (101) does not
> match the length given in the SSAHA2 file (101)
> SSAHA2 line: ALIGNMENT::14 27 SOLEXA1_0001:1:74:18037:1025#0/1 pFLC-I 29 2
> 735 762 C 28 100 101
> 
> Error! The length of read SOLEXA1_0001:1:74:18070:1023#0/1 (101) does not
> match the length given in the SSAHA2 file (101)
> SSAHA2 line: ALIGNMENT::50 84 SOLEXA1_0001:1:74:18070:1023#0/1 pCMVSPORT6
>  18 101 797 880 F 84 100 101

Uh oh ... I have a bad feeling that something is broken in the logic I 
implemented. Just to be sure: can you please look at the reads in question and 
tell me whether they start with an 'N'?

> I'm guessing that SSAHA2 thinks that the reads are 101bp long but mira
> thinks that they're 100bp?

Long story short: for some aesthetic reasons, mira adds an 'n' in front of 
most Solexa reads ... except when there's already an 'n'. And the clipping 
routine doesn't account for this special case ... yet. Just need to be sure 
before I start fixing this.
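
To illustrate the special case, here is a minimal sketch in Python. This is 
not mira's actual code: the function names are made up for illustration, and 
the coordinate shift is only an assumption about how positions reported by 
SSAHA2 on the original read would map onto the stored read.

```python
def stored_read(seq: str) -> tuple[str, int]:
    """Return (sequence as stored, number of bases prepended).

    Hypothetical helper: an 'n' is prepended unless the read
    already starts with 'n' or 'N'.
    """
    if seq[:1].lower() == "n":
        return seq, 0          # special case: nothing is added
    return "n" + seq, 1        # usual case: one extra leading base


def shift_hit(start: int, end: int, offset: int) -> tuple[int, int]:
    """Shift read coordinates of a hit by the number of prepended bases
    (assumed mapping, for illustration only)."""
    return start + offset, end + offset


if __name__ == "__main__":
    for original in ("ACGTACGT", "NCGTACGT"):
        stored, offset = stored_read(original)
        print(original, len(original), stored, len(stored), offset)
```

A routine that always assumes one prepended base gets the length and the 
clip positions wrong by one for exactly those reads that already start with 
an 'N', which is why the first base of the affected reads matters here.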

> I noticed also that mira is doing some filtering of the Solexa reads, what
> do these mean?
> Solexa: Filter out T (hard) SOLEXA1_0001:1:15:12586:2781#0/1
> Solexa: Filter out T (hard) SOLEXA1_0001:1:15:12588:8448#0/1
> Solexa: Filter out (A hard) SOLEXA1_0001:1:15:12595:11576#0/1
> Solexa: Filter out (A hard) SOLEXA1_0001:1:15:12611:16949#0/1
> Solexa: Filter out T (hard) SOLEXA1_0001:1:15:12628:14575#0/1
> Solexa: Filter out (A hard) SOLEXA1_0001:1:15:12634:12867#0/1

Need to document that.

Hard: a run of 20 consecutive A or 20 consecutive T leads to discarding the 
read. You see this a lot with bad / low-quality Solexa reads. 

Soft: same as above, but with a run of only 12 bases, and only if that same 
base makes up >= 80% of the complete read.
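
The two rules can be sketched in Python roughly like this. It is only an 
illustration of the description above, not mira's implementation: the 
function names and the returned labels are invented, and treating the 
thresholds as "at least 20" / "at least 12" is an assumption.

```python
def longest_run(seq: str, base: str) -> int:
    """Length of the longest stretch of consecutive `base` in `seq`."""
    best = current = 0
    for ch in seq.upper():
        current = current + 1 if ch == base else 0
        best = max(best, current)
    return best


def solexa_filter(seq: str, hard_run: int = 20, soft_run: int = 12,
                  soft_fraction: float = 0.80):
    """Return a label such as 'A hard' or 'T soft' if the read would be
    filtered out under the rules described above, else None."""
    s = seq.upper()
    # Hard rule: a long enough run of A or of T is sufficient on its own.
    for base in "AT":
        if longest_run(s, base) >= hard_run:
            return f"{base} hard"
    # Soft rule: a shorter run, but the base must also dominate the read.
    for base in "AT":
        if (s and longest_run(s, base) >= soft_run
                and s.count(base) / len(s) >= soft_fraction):
            return f"{base} soft"
    return None


if __name__ == "__main__":
    print(solexa_filter("A" * 20 + "C" * 80))   # hard rule applies
    print(solexa_filter(("T" * 12 + "C") * 4))  # soft rule applies
    print(solexa_filter("ACGT" * 25))           # read is kept
```

Note that the hard rule looks only at the run length, so a read with a 
genuine 20-base poly-A stretch is discarded regardless of how good the rest 
of the read is.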

Yes I know, this sometimes discards good reads, especially at poly-A / poly-T 
sites. Then again: not clipping creates all sorts of very interesting problems 
I prefer not to have :-)

Regards,
  Bastien

--
You have received this mail because you are subscribed to the mira_talk mailing 
list. For information on how to subscribe or unsubscribe, please visit 
http://www.chevreux.org/mira_mailinglists.html
