[mira_talk] Re: prescreening illumina sequences

  • From: "Reith, Michael" <Michael.Reith@xxxxxxxxxxxxxx>
  • To: <mira_talk@xxxxxxxxxxxxx>
  • Date: Wed, 2 Dec 2009 09:07:52 -0500

Thanks again for the comments Bastien.  It turns out that I got myself
confused - the actual %N is 4.7%.  The 36% is the number of bases with a
quality score of 2 (which all the N's have as well as many bases near
the 3' end of the sequences), which I'm just going to ignore for now as
you suggest.
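
For anyone wanting to keep the two numbers apart, here is a minimal sketch that reports the fraction of N bases and the fraction of quality-2 bases separately. The function name, the (sequence, quality string) input format and the Phred offset of 64 are assumptions for illustration, not something taken from the GERALD output itself; adjust the offset to whatever encoding your files actually use.

```python
def n_and_q2_fractions(reads, phred_offset=64):
    """Return (fraction of N bases, fraction of bases with quality 2).

    reads: iterable of (sequence, quality_string) pairs.
    phred_offset: ASCII offset of the quality encoding (assumed 64 here;
    Sanger-style FASTQ uses 33).
    """
    total = n_count = q2_count = 0
    for seq, qual in reads:
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        total += len(seq)
        n_count += seq.upper().count("N")
        # Count bases whose decoded quality score is exactly 2.
        q2_count += sum(1 for c in qual if ord(c) - phred_offset == 2)
    if total == 0:
        return 0.0, 0.0
    return n_count / total, q2_count / total


# Toy data: at offset 64, 'B' decodes to quality 2 and 'h' to quality 40.
example = [("ACGTN", "hhhBB"), ("ACGTA", "hhhhB")]
frac_n, frac_q2 = n_and_q2_fractions(example)
print(f"N: {frac_n:.1%}  Q2: {frac_q2:.1%}")
```

Every N carries quality 2 here, but not every quality-2 base is an N, which is how the two percentages can end up far apart.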

Cheers,
Mike

-----Original Message-----
From: mira_talk-bounce@xxxxxxxxxxxxx
[mailto:mira_talk-bounce@xxxxxxxxxxxxx] On Behalf Of Bastien Chevreux
Sent: December 1, 2009 3:47 PM
To: mira_talk@xxxxxxxxxxxxx
Subject: [mira_talk] Re: prescreening illumina sequences

On Dienstag 01 Dezember 2009 Reith, Michael wrote:
> Thanks for the comments Bastien.  It turns out that I have 36% N's
> (from paired end, 76 bp reads from a lower eukaryote (genome ~35 Mb)).
> Two questions:
> 
> 1. Since filling up the computer memory with a ton of N's seems like a
> waste, I'm going to remove all sequences that don't have at least 10
> bases of Phred > 15.

I'd suggest not doing that at first. The quality values written by the
GERALD pipeline are, well, not the most trustworthy ones. I've seen a lot
of cases of perfectly valid sequence with low quality values and, vice
versa, sequence with good quality values that was entirely unusable.

> If I also chop off the runs of N's at the 3' end of
> the remaining sequences, 

Yes, chop off long runs of N's. I might even implement such a clipping in
MIRA one day, but not before releasing 3.0.
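
Until then, the chopping can be done before the reads ever reach MIRA. A minimal sketch, assuming reads are available as a sequence string plus a quality string of the same length; the function name and the min_run parameter are hypothetical and not part of MIRA:

```python
def trim_trailing_n(seq, qual, min_run=1):
    """Remove the run of N's at the 3' end of a read.

    The run is only removed if it is at least min_run bases long; the
    quality string is shortened to match. N's inside the read are kept.
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality lengths differ")
    trimmed = seq.rstrip("Nn")
    run_length = len(seq) - len(trimmed)
    if run_length < min_run:
        return seq, qual
    return trimmed, qual[:len(trimmed)]


# The internal N stays, the 3' run of N's goes.
print(trim_trailing_n("ACNGTNNNN", "hhBhhBBBB"))
```

A read consisting only of N's comes back empty, so such reads should be dropped afterwards (together with their mates, if the pairing has to stay consistent).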

> is Mira OK with having different length Solexa
> sequences

MIRA does not care about different lengths :-)

> , or should I just let Mira do the chopping?  

There is no "memory-saving chopping" in MIRA. -CL:pec will handle things
well from a clipping point of view, but the clipped sequence is merely
hidden and thus still eats memory even though it is not used.

> 2. Should I be complaining to my sequencing service?  Has anyone had
> experience with these read lengths from paired ends?  

I would. Here are some numbers I saw in the last few months.
- in the very worst project (paired-end, 36bp, 68% GC), MIRA had to clip
  away a total of 17.5% of all the bases, killing 1 million reads
  completely out of 8 million (12.5%). I don't have separate numbers for
  'N', but these are generally only a tiny fraction of the clipped
  sequence.
- in 'normal' projects (76bp, quite neutral GC), MIRA clips away between
  8 and 12% of all the bases.

Note that % GC has an influence on the sequencing quality, with higher GC
being potentially quite detrimental to sequence and quality values. See
also the GGCxG problem I describe in
  http://chevreux.org/GGCxG_problem.html

The clipping works really, really well. After that, 80% of the (possibly
clipped) Solexa 76bp reads contain no error, 10% one error, 5% two errors,
~2% three errors and the rest more errors.
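
As a rough back-of-the-envelope reading of those percentages, the sketch below works out what "the rest" amounts to and a lower bound on the mean number of errors per read. Counting every read in "the rest" as having exactly four errors is an assumption made only to get that lower bound; the variable names are illustrative.

```python
# Fraction of reads by number of errors, as quoted above (the 2% is approximate).
read_fraction = {0: 0.80, 1: 0.10, 2: 0.05, 3: 0.02}

# Whatever is left over are the reads with more than three errors.
rest = 1.0 - sum(read_fraction.values())

# Lower bound on mean errors per read: treat each read in "the rest"
# as having exactly 4 errors (assumption; the real value is >= 4).
min_mean_errors = sum(k * f for k, f in read_fraction.items()) + 4 * rest

print(f"rest: {rest:.0%}, mean errors per read >= {min_mean_errors:.2f}")
```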

> Is this typical or
> indicative of problems with the sequencing and/or base calling?

Talk to your provider, asking them to put the number of N's (36%,
sheeeesh) in relation to other projects they delivered. If they tell you
"that's normal" with a straight face, ask around other sequencing
providers about their numbers :-)

Regards,
  Bastien

-- 
You have received this mail because you are subscribed to the mira_talk
mailing list. For information on how to subscribe or unsubscribe, please
visit http://www.chevreux.org/mira_mailinglists.html
