[mira_talk] 454 trimming, sff_extract and SFF to traceinfo.xml

  • From: Peter <peter@xxxxxxxxxxxxxxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Wed, 5 Jan 2011 14:18:40 +0000

Hi all,

The SFF file format allows for two sets of left/right clipping points
- quality based and adapter based:

http://eutils.ncbi.nih.gov/Traces/trace.fcgi?cmd=show&f=formats&m=main&s=formats

In practice, SFF files straight from the Roche 454 sequencer always
seem to have just quality based trimming (the adapter clipping entries
are zero, meaning no trimming). Perhaps some pipelines add adapter (or
vector or barcode) based trimming values to the SFF file - but I
suspect they are generally left unused (by Roche), and the quality
clipping values serve double duty. After all, the left "quality"
clipping point always seems to account for the tcag key sequence at
the start of every 454 read. Furthermore, in a raw MID barcoded SFF
file the barcodes are not considered in the clipping (i.e. they are
still part of the trimmed read), but after splitting an SFF file by
MID, the "quality" left clipping values ARE changed to trim off the
barcode (and the adapter clipping values remain unused).

In other words, as far as I know, Roche don't use the adapter clipping
values defined in the SFF spec; instead they use the "quality" clipping
values for both kinds of clipping.
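
To make the two sets of clipping points concrete, here is a minimal
Python sketch of how they combine into one trimmed region, following
the rule given in the SFF format description (1-based coordinates, a
value of zero means "not set"). The function name and arguments are
made up for illustration and are not taken from any Roche or MIRA tool.

```python
def effective_clip(num_bases, qual_left, qual_right,
                   adapter_left=0, adapter_right=0):
    """Return (first, last), the 1-based inclusive trimmed region.

    A clip value of 0 means "not set", so it imposes no trimming.
    """
    # Left side: the innermost (largest) of the two left clip points.
    first = max(1, qual_left, adapter_left)
    # Right side: the innermost (smallest) of the two right clip points,
    # treating an unset value as the full read length.
    last = min(qual_right or num_bases, adapter_right or num_bases)
    return first, last


# Typical raw Roche read: only the "quality" values are set.
print(effective_clip(100, 5, 90))
# Hypothetical pipeline that also records an adapter at the 5' end.
print(effective_clip(100, 5, 90, adapter_left=15))
```

With the adapter values left at zero, the "quality" values alone decide
the trimmed region, which is the situation described above.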

This fits with what Bastien wrote on the list back on 18 May 2010:
>
> ...  the software from
> Roche still (after 5 years) is not able to make the distinction between
> clipping by quality and clipping by adaptor, although they did think of it
> when implementing data structures.

http://www.freelists.org/post/mira_talk/Mixed-454-shotgun-and-paired-end-assembly-run-time,1




The NCBI traceinfo.xml also allows for two sets of left/right clipping
points - this time quality based and vector based: CLIP_QUALITY_LEFT,
CLIP_QUALITY_RIGHT and CLIP_VECTOR_LEFT, CLIP_VECTOR_RIGHT.

http://eutils.ncbi.nih.gov/Traces/trace.fcgi?cmd=show&f=rfc&m=main&s=rfc

What puzzles me is why using sff_extract on a typical SFF file (with
"quality" clipping points but no adapter clipping points) produces a
traceinfo.xml file with vector clipping entries and NOT quality
clipping entries. Is this just a practical solution to the fact that
SFF files from Roche seem to have a single set of values covering both
quality and adapter clipping, so these cannot simply be mapped to
separate quality and vector clipping values?
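
For comparison, a literal one-to-one mapping might look like the
Python sketch below: SFF quality clips go to the quality fields and
SFF adapter clips go to the vector fields. This is NOT how sff_extract
works; the function name is hypothetical, and the lowercase element
names (trace, trace_name, clip_quality_left, and so on) are my
assumption about how the CLIP_* fields are spelled in the XML.

```python
import xml.etree.ElementTree as ET


def sff_clips_to_trace(name, clip_qual_left, clip_qual_right,
                       clip_adapter_left=0, clip_adapter_right=0):
    """Build one <trace> element, keeping the two clip types separate."""
    trace = ET.Element("trace")
    ET.SubElement(trace, "trace_name").text = name
    fields = [
        ("clip_quality_left", clip_qual_left),
        ("clip_quality_right", clip_qual_right),
        ("clip_vector_left", clip_adapter_left),
        ("clip_vector_right", clip_adapter_right),
    ]
    for tag, value in fields:
        if value:  # 0 means "not set" in SFF, so omit the element
            ET.SubElement(trace, tag).text = str(value)
    return trace


# A typical Roche read: quality clips only, so no vector entries appear.
example = ET.tostring(sff_clips_to_trace("READ0001", 5, 90),
                      encoding="unicode")
print(example)
```

With a mapping like this, a typical Roche SFF file would yield only
quality entries, even though the left value also covers the key
sequence (and any MID), which should not be treated as extendable.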



The reason I am asking is that MIRA can "unclip" or "untrim" reads to
try to use the ends of a read which are labelled as poor quality (MIRA
option -DP:ure, for "use read extension"). To do this, you really need
to know whether the clipping information is quality clipping (when it
is safe to extend) or adapter/vector clipping (when you should not
extend the reads). If these two types of information were in the
traceinfo.xml file given to MIRA, would it take advantage of this
distinction?

From looking at the manual, by default ure is on for Sanger but off
for 454 and the other sequencing technologies. Is there any way to
apply ure to only the start or only the end of reads? I'm thinking
that for most 454 reads applying ure to the end (3') only might be
safe: left clipping will normally be for the key sequence (tcag) and
any barcode (MID), which should be respected, but right clipping will
usually be quality clipping and can be unclipped. Except of course
where there is a 3' MID or primer sequence ;)
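
As a rough illustration of the per-end behaviour I have in mind (this
is not an existing MIRA option, and the function and its flags are
purely hypothetical), the outermost region a read could be extended to
might be computed like this in Python:

```python
def extendable_range(num_bases, clip_left, clip_right,
                     extend_left=False, extend_right=True):
    """Return (start, end), the 1-based inclusive limits for extension.

    An end that may not be extended stays at its clip point; an end
    that may be extended can grow out to the end of the raw read.
    A clip value of 0 means "not set".
    """
    left = max(1, clip_left)
    right = clip_right or num_bases
    start = 1 if extend_left else left
    end = num_bases if extend_right else right
    return start, end


# 3' only: keep the key/MID clipped, allow the tail to be unclipped.
print(extendable_range(100, 5, 90))
# Read with a 3' MID or primer: do not extend either end.
print(extendable_range(100, 5, 90, extend_right=False))
```

The second call is the exception mentioned above, where the right clip
marks a 3' MID or primer rather than poor quality.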

Peter

