[mira_talk] Re: 5' trimming of partial adapters

  • From: Bastien Chevreux <bach@xxxxxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Wed, 20 Jul 2011 20:15:35 +0200

On Wednesday 20 July 2011 00:00:11 Robert Bruccoleri wrote:
> In some of the genome assembly projects that I'm working on, I see an
> uneven GC content at the beginning (first 10 bases) of my reads. Since
> the library preparation is expected to be unbiased, uneven GC content
> suggests that there is a contaminant sequence at the beginning of some
> of my reads.

Yep, these kinds of plots are always used to show what different tools can
detect. Saw one just yesterday (or the day before) in Vienna.

> Let's assume for the sake of argument that the contaminant sequence is a
> short subsequence of an adapter, but it's too short to identify by
> sequence similarity. Does anyone have any ideas about how to handle the
> problem besides trimming the 5' end?

Have you looked at the data once it has passed the clipping stage of MIRA? You
can get it in the checkpoint directory once the assembler reaches pass 1, or
simply start MIRA >= 3.4rc1 with the additional parameter -AS:nop=0.
I wonder whether you'd still find artefacts there; can you please check?
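
One way to run that check is to recompute the per-position GC fraction on the clipped reads and compare it with the plot from the raw reads. The following is only a minimal Python sketch: the function name gc_by_position is made up for illustration, and it assumes the clipped reads have already been loaded as plain sequence strings (how to export them from the checkpoint directory is not covered here).

```python
def gc_by_position(reads, n_positions=10):
    """Return the GC fraction at each of the first n_positions bases.

    `reads` is any iterable of sequence strings (an assumption; parsing the
    actual read files is left out). Non-ACGT characters such as N are skipped.
    """
    gc = [0] * n_positions
    total = [0] * n_positions
    for seq in reads:
        for i, base in enumerate(seq[:n_positions].upper()):
            if base in "ACGT":
                total[i] += 1
                if base in "GC":
                    gc[i] += 1
    # Positions with no usable bases yield NaN instead of dividing by zero.
    return [g / t if t else float("nan") for g, t in zip(gc, total)]


if __name__ == "__main__":
    toy_reads = ["GCATTA", "ACATGG", "GGATCA"]
    for pos, frac in enumerate(gc_by_position(toy_reads, 6), start=1):
        print(f"position {pos}: GC fraction {frac:.2f}")
```

If the first ten positions still deviate clearly from the rest of the read after clipping, that would suggest some contaminant survives the clipping stage.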

If yes, then I'd like to know what the sequence is. Writing a couple of 
additional regex rules to clip 5' should not be very difficult.
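
To illustrate what such a 5' regex rule could look like, here is a hedged Python sketch. It is not MIRA's implementation: the adapter string, the minimum match length and the function names are all hypothetical placeholders, since the actual contaminant sequence is not known yet. The idea is to anchor a pattern at the read start that matches any suffix of the adapter, longest first.

```python
import re


def build_5p_clip_regex(adapter, min_len=3):
    """Compile a regex matching any adapter suffix of >= min_len bases at the read start.

    Suffixes are listed longest first so the alternation prefers the longest
    leftover. `adapter` and `min_len` are illustrative assumptions.
    """
    suffixes = [adapter[i:] for i in range(len(adapter) - min_len + 1)]
    return re.compile("^(?:" + "|".join(re.escape(s) for s in suffixes) + ")")


def clip_5p(read, pattern):
    """Return the read with a matched 5' adapter leftover removed, else unchanged."""
    m = pattern.match(read)
    return read[m.end():] if m else read


if __name__ == "__main__":
    # Hypothetical adapter; replace with the real sequence once it is known.
    pattern = build_5p_clip_regex("ACGTTGCA", min_len=3)
    for read in ["TGCATTTTGGA", "AAAATTTTGGA"]:
        print(read, "->", clip_5p(read, pattern))
```

Note that very short leftovers also occur by chance in genuine sequence (a fixed 3-base motif is expected at the start of roughly 1 in 64 random reads), so a low min_len will clip some good bases too.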

> Does the option
> -CL:possible_vector_leftover_clip handle this type of problem?

Hmmm ... that beast was conceived for Sanger. Uses a hell of a lot more 
memory, so I never really tried it on anything but that.

-CL:pec, on the other hand, should get rid of your problems on the 5' end as well :-)

B.
