[mira_talk] Re: Filtering out crappy sequence

On 2012-04-04, at 3:53 PM, Peter Cock wrote:

> On Wed, Apr 4, 2012 at 8:45 PM, John Nash <john.he.nash@xxxxxxxxx> wrote:
>> On 2012-04-04, at 3:27 PM, Peter Cock wrote:
>> 
>>> What were the failing sequences? Perhaps there is something we
>>> can suggest after seeing them and in what way they are bad.
>>> 
>> 
>> Hi Peter,
>> 
>> I thought the same thing but the sequence looks like this (in fastq format):
>> 
>> @HK6K99I01A94O4
>> 
>> 
>> +
>> !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
>> !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
>> !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
>> !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
>> !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
>> !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
>> !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
>> 
>> I thought that 454 was configured not to output this sort of thing…
> 
> That is very strange - did you remove the sequence or was it
> missing? If it was missing it was not valid FASTQ.

The sequence was directly converted from four sff files (which I got today) to 
fastq using 

sff_extract -Q -s PROJ_in.454.fastq -x PROJ_traceinfo_in.454.xml sff1 sff2 etc...

I didn't do any other manipulations.

> 
> Also the quality string is odd, an exclamation mark is ASCII
> 33 which would be PHRED 0 on the Sanger encoding. This

Exactly

> suggests it is either a very bad read which should have
> failed QC and never made it to your SFF file, or something
> has gone very wrong in the SFF to FASTQ conversion.
> 

Again, exactly. I was quite surprised by this.
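
For anyone following along, here is a minimal sketch of the decoding under discussion. It assumes the Sanger convention (quality character = PHRED score + 33); the function name is made up for illustration and is not part of sff_extract or MIRA. Under that convention a '!' decodes to the lowest possible score, 0.

```python
# Sanger FASTQ stores each quality score as chr(PHRED + 33).
SANGER_OFFSET = 33


def sanger_to_phred(quality_string):
    """Decode a Sanger-encoded quality string into a list of PHRED scores."""
    return [ord(ch) - SANGER_OFFSET for ch in quality_string]


print(sanger_to_phred("!!!"))   # [0, 0, 0]
print(sanger_to_phred("I5+!"))  # [40, 20, 10, 0]
```

A read whose quality string is nothing but '!' therefore carries no usable quality information at all.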

> Which version of sff_extract did you use?

It was downloaded from the MIRA sourceforge repository in February 2012, and 
'ls' tells me:
-rwxr-xr-x. 1 jnash jnash 52453 Sep  9  2011 sff_extract

> 
> It would be worth testing the SFF file in other tools (e.g. the
> Roche applications or Biopython) to isolate the problem.
> If all the problem reads show this pattern (all their read
> qualities are PHRED 0) that should be easy to filter out.

I was just downloading some other converters to do that as your email came in :)
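
In case it is useful while the other converters download, a rough sketch of the kind of filter suggested above. It is only a sketch under assumptions: it expects strict four-line FASTQ records (no wrapped sequence or quality lines) in Sanger encoding, uses only the Python standard library, and all function names are hypothetical. It drops any read with an empty sequence or a quality string made up entirely of '!'.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality) from strict four-line FASTQ records."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        if not header.strip():
            continue  # skip stray blank lines between records
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line, ignored
        quality = handle.readline().rstrip("\n")
        yield header.rstrip("\n")[1:], sequence, quality


def is_junk(sequence, quality):
    """True for reads with no bases or with every quality equal to '!'."""
    return not sequence or set(quality) == {"!"}


def filter_fastq(in_handle, out_handle):
    """Copy good reads to out_handle and return the names of dropped reads."""
    dropped = []
    for name, sequence, quality in read_fastq(in_handle):
        if is_junk(sequence, quality):
            dropped.append(name)
        else:
            out_handle.write(f"@{name}\n{sequence}\n+\n{quality}\n")
    return dropped


# Small in-memory demonstration: one good read, one empty all-'!' read.
demo_in = io.StringIO("@good\nACGT\n+\nIIII\n@HK6K99I01A94O4\n\n+\n!!!!\n")
demo_out = io.StringIO()
print(filter_fastq(demo_in, demo_out))  # ['HK6K99I01A94O4']
```

On real data the two handles would be opened files rather than StringIO objects, and the list of dropped names is handy for cross-checking against the SFF files in another tool.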

Thanks
John




