[mira_talk] Re: sff_extract question

  • From: Jeremy Volkening <volkening@xxxxxxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Mon, 03 Jan 2011 17:04:17 -0600

On Mon, 2011-01-03 at 21:21 +0000, Peter wrote:
> On Mon, Jan 3, 2011 at 6:57 PM, Jeremy Volkening
> <volkening@xxxxxxxxxxxxx> wrote:
> >
> > Hi,
> >
> > When I ran sff_extract on my latest data files in preparation for input
> > into mira, the 5' cutoffs in the traceinfo xml file were off by one
> > basepair (these are MID-tagged reads). I imagine this is set by the
> > sequencing lab at the time of the run, but the strange thing is that the
> > cutoffs in the fasta files themselves (shown as lowercase characters)
> > are correct.
> >
> > It's easy enough to do a global replace in the xml file, but I thought I
> > should mention it. I tried to look into the code to see what was going
> > on, by I don't know much python. Is this a bug or an incorrect setting
> > by the sequencing lab?
> >
> > Jeremy
> 
> Are you sure they are off? I don't recall off hand if the NCBI traceinfo
> XML file uses one based on zero based counting but I'd check that
> first.

After looking at the NCBI docs for their TRACEINFO format, it looks like
the output of sff_extract is okay. For example, here are the entries for
a read from the XML and FASTA output:

<trace>
    <trace_name>GS2U6XA01AYYW4</trace_name>
    <clip_vector_left>16</clip_vector_left>
    <clip_vector_right>264</clip_vector_right>
</trace>

>GS2U6XA01AYYW4
gactacgagtagactTCTCCAGGTTGG...

The experiment uses 11 bp MID tags , and the 4 bp key sequence is
present also. The 5' clip <should> be 15 bp. According to NCBI's site:

"The CLIP_VECTOR_LEFT  field indicates the base at the beginning of the
sequence at which the read should be clipped due to vector sequence. The
given value would be the first base of non-vector sequence".

So you are right, everything looks okay so far. But there is definitely
a problem when I run the data through mira and then open it up in gap4.
There is an extra base clipped on the 5' of every read, which I can see
when I make the clipped ends visible. At first I thought this might be a
result of one of the -CL options in mira and I played around with
different settings. Finally I changed all of the CLIP_VECTOR_LEFT values
in the XML file to 15 and re-ran the assembly. This fixed the problem,
although after reading the NCBI docs this is clearly not a correct or
desirable thing to do.

I don't know if the problem is in mira or caf2gap or gap4, but somewhere
along the line an extra 5' base is consistently being clipped. Normally
I would never have noticed, but I happened to be looking at the reads
from the end of a viral genome, where the physical end is clearly
evident in the assembly (~100 reads agreeing to a single bp), and
realized that I was missing a base.

It's not an urgent problem from my point of view, since most people
wouldn't miss this extra base, but I was just wondering if anyone else
could confirm this and if it is a bug?

Jeremy

> 
> Can you use the Roche off instrument apps to check? I think sffinfo
> will show you the read trim points, you could also output the full
> untrimmed sequence as FASTA to check (this will use lower case
> for the bits which are marked for trimming).
> 
> You could also use the Roche tools to make a test SFF file with
> just a couple of problematic reads for testing sff_extract.
> 
> Peter
> 




-- 
You have received this mail because you are subscribed to the mira_talk mailing 
list. For information on how to subscribe or unsubscribe, please visit 
http://www.chevreux.org/mira_mailinglists.html

Other related posts: