[mira_talk] Re: prescreening illumina sequences

  • From: "Reith, Michael" <Michael.Reith@xxxxxxxxxxxxxx>
  • To: <mira_talk@xxxxxxxxxxxxx>
  • Date: Tue, 1 Dec 2009 10:49:33 -0500

Thanks for the comments Bastien.  It turns out that I have 36% N's (from
paired end, 76 bp reads from a lower eukaryote (genome ~35 Mb)).  Two
questions:

1. Since filling up the computer memory with a ton of N's seems like a
waste, I'm going to remove all sequences that don't have at least 10
bases of Phred > 15.  If I also chop off the runs of N's at the 3' end
of the remaining sequences, is Mira OK with having different length
Solexa sequences, or should I just let Mira do the chopping?  I'm a bit
worried about bumping up against the memory limit of our machine
(32 GB).

2. Should I be complaining to my sequencing service?  Has anyone had
experience with these read lengths from paired ends?  Is this typical or
indicative of problems with the sequencing and/or base calling?
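
The prescreen described in question 1 could be sketched roughly as
follows. This is only an illustration, not anything Mira itself does:
the function names are made up, and the quality offset is an assumption
(33 here; older Illumina pipelines wrote Phred+64, so check your files).
It trims the run of N's at the 3' end, then keeps a read only if at
least 10 called bases have Phred > 15.

```python
def trim_trailing_n(seq, qual):
    """Remove the run of N's at the 3' end, trimming qualities in step."""
    kept = len(seq.rstrip("Nn"))
    return seq[:kept], qual[:kept]


def count_good_bases(seq, qual, min_q=15, offset=33):
    """Count called bases (not N) whose Phred score is strictly above min_q."""
    return sum(
        1
        for base, q in zip(seq, qual)
        if base not in "Nn" and ord(q) - offset > min_q
    )


def prescreen(seq, qual, min_good=10, min_q=15, offset=33):
    """Return the trimmed (seq, qual) pair, or None if the read is dropped.

    offset=33 is an assumption; pass offset=64 for Phred+64 files.
    """
    seq, qual = trim_trailing_n(seq, qual)
    if count_good_bases(seq, qual, min_q, offset) < min_good:
        return None
    return seq, qual
```

Note that with paired ends, dropping one mate leaves the other read
unpaired, so any real filter would have to handle the two files
together.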

Thanks for your help.

Mike

-----Original Message-----
From: mira_talk-bounce@xxxxxxxxxxxxx
[mailto:mira_talk-bounce@xxxxxxxxxxxxx] On Behalf Of Bastien Chevreux
Sent: November 30, 2009 2:24 PM
To: mira_talk@xxxxxxxxxxxxx
Subject: [mira_talk] Re: prescreening illumina sequences

On Sunday 29 November 2009 Reith, Michael wrote:
> Just a quick question about prescreening my Illumina sequences.  From a
>  quick look, it appears that my sequencing service has sent everything,
>  including sequences that are entirely N's. It's obvious that these should
>  be removed before starting the Mira assembly, as well as those with only a
>  few good bases.  My question is how far do I go - do I keep sequences that
>  have 10 or 15 good bases or should I get rid of everything with more than
>  a few N's?  This is for 72 bp paired reads.

Hello Michael,

if you don't have too many Ns (like, >=20% of your total bases), then you
don't need to bother with clipping them (be it by quality or by character)
before going into assembly. The proposed-end-clip (-CL:pec) takes care of that
and will make sure that only nice sequences are used. It's extremely effective
for high-throughput data like Solexa or 454.

The only reason one would like to clip N's before loading is to reduce the
memory footprint of the sequences. But as I wrote, I wouldn't bother for less
than 20% of total seq (and I've never seen a Solexa data set with more than a
percent or two of N-sequences).
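
To see which side of that 20% rule of thumb a data set falls on, the
fraction of N's over all bases can be counted before deciding whether
to clip. A minimal sketch, assuming the reads are already available as
plain strings; the function name and the threshold constant are
illustrative only and not part of Mira.

```python
N_THRESHOLD = 0.20  # rule of thumb from this thread, not a Mira setting


def n_fraction(sequences):
    """Return the fraction of all bases that are N (0.0 for empty input)."""
    total = 0
    n_count = 0
    for seq in sequences:
        total += len(seq)
        n_count += seq.count("N") + seq.count("n")
    return n_count / total if total else 0.0


def worth_preclipping(sequences, threshold=N_THRESHOLD):
    """True if the N fraction reaches the threshold, so clipping may save memory."""
    return n_fraction(sequences) >= threshold
```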

Regards,
  Bastien

-- 
You have received this mail because you are subscribed to the mira_talk
mailing list. For information on how to subscribe or unsubscribe, please
visit http://www.chevreux.org/mira_mailinglists.html
