On 2012-04-04, at 3:53 PM, Peter Cock wrote:

> On Wed, Apr 4, 2012 at 8:45 PM, John Nash <john.he.nash@xxxxxxxxx> wrote:
>> On 2012-04-04, at 3:27 PM, Peter Cock wrote:
>>
>>> What were the failing sequences? Perhaps there is something we
>>> can suggest after seeing them and in what way they are bad.
>>>
>>
>> Hi Peter,
>>
>> I thought the same thing but the sequence looks like this (in fastq format):
>>
>> @HK6K99I01A94O4
>>
>>
>> +
>> !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
>> !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
>> !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
>> !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
>> !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
>> !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
>> !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
>>
>> I thought that 454 was configured not to output this sort of thing…
>
> That is very strange - did you remove the sequence or was it
> missing? If it was missing it was not valid FASTQ.

The sequence was directly converted from four sff files (which I got today) to fastq using

sff_extract -Q -s PROJ_in.454.fastq -x PROJ_traceinfo_in.454.xml sff1 sff2 etc...

I didn't do any other manipulations.

>
> Also the quality string is odd, an exclamation mark is ASCII
> 33 which would be PHRED 0 on the Sanger encoding. This

Exactly

> suggests it is either a very bad read which should have
> failed QC and never made it to your SFF file, or something
> has gone very wrong in the SFF to FASTQ conversion.
>

Again, exactly. I was quite surprised by this.

> Which version of sff_extract did you use?

It was downloaded from the MIRA sourceforge repository in February 2012, and 'ls' tells me:

-rwxr-xr-x. 1 jnash jnash 52453 Sep 9 2011 sff_extract

>
> It would be worth testing the SFF file in other tools (e.g. the
> Roche applications or Biopython) to isolate the problem.
> If all the problem reads show this pattern (all their read
> qualities are PHRED 0) that should be easy to filter out.

I was just downloading some other converters to do that as your email came in :)

Thanks
John
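
For anyone following the thread, here is a minimal sketch of the filtering idea discussed above. It is not taken from sff_extract, Biopython, or the Roche tools. It assumes unwrapped four-line FASTQ records and the Sanger offset of 33, under which '!' (ASCII 33) decodes to PHRED 0. The helper names (decode_sanger, read_fastq, is_suspect) and the demo read names are hypothetical.

```python
from typing import Iterator, List, Tuple

SANGER_OFFSET = 33  # assumption: Sanger FASTQ stores PHRED + 33 as one ASCII character


def decode_sanger(quality: str) -> List[int]:
    """Convert a Sanger-encoded quality string to PHRED scores."""
    return [ord(ch) - SANGER_OFFSET for ch in quality]


def read_fastq(lines: List[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, sequence, quality) from unwrapped four-line records."""
    stripped = [line.rstrip("\n") for line in lines]
    for i in range(0, len(stripped) - 3, 4):
        header, seq, plus, qual = stripped[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record at line {i + 1}")
        yield header[1:], seq, qual


def is_suspect(seq: str, qual: str) -> bool:
    """Flag reads with no sequence, mismatched lengths, or all-minimum quality."""
    if not seq or len(seq) != len(qual):
        return True
    return all(ch == "!" for ch in qual)


# Hypothetical demo records, not data from the SFF files in this thread.
demo = [
    "@good_read", "ACGT", "+", "IIII",
    "@empty_read", "", "+", "!!!!",
    "@all_min", "ACGT", "+", "!!!!",
]
kept = [name for name, seq, qual in read_fastq(demo) if not is_suspect(seq, qual)]
print(kept)
```

With the demo records only good_read is kept. A wrapped record like the one quoted above would need its quality lines joined before this check applies.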