[mira_talk] Re: prescreening illumina sequences

From: Bastien Chevreux <bach@xxxxxxxxxxxx>
To: mira_talk@xxxxxxxxxxxxx
Date: Tue, 1 Dec 2009 20:47:06 +0100

On Tuesday 01 December 2009 Reith, Michael wrote:
> Thanks for the comments Bastien.  It turns out that I have 36% N's (from
> paired end, 76 bp reads from a lower eukaryote (genome ~35 Mb)).  Two
> questions:
> 
> 1. Since filling up the computer memory with a ton of N's seems like a
> waste, I'm going to remove all sequences that don't have at least 10
> bases of Phred > 15.

I'd suggest not doing that at first. The quality values written by the 
GERALD pipeline are, well, not the most trustworthy ones. I've seen a lot of 
cases of perfectly valid sequence with low quality values and, vice versa, 
sequence with good quality values that was entirely unusable.

> If I also chop off the runs of N's at the 3' end
> the remaining sequences, 

Yes, chop off long runs of N's. I might even implement such a clipping in MIRA 
one day, but not before releasing 3.0.
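
For the chopping itself, here is a minimal Python sketch of trimming a run of
N's from the 3' end of a read before it goes into the assembler. It is not
part of MIRA; the function name trim_3prime_n and the min_run parameter are
made up for illustration, and it assumes the sequence and quality strings have
the same length (as in a FASTQ record).

```python
import re

# Hypothetical helper, not a MIRA function: matches a run of N's that
# reaches the very end (3' end) of the read.
_TRAILING_N = re.compile(r"[Nn]+$")


def trim_3prime_n(seq, qual, min_run=1):
    """Return (seq, qual) with a trailing run of N's removed.

    The run is only removed if it is at least min_run bases long.
    N's inside the read are left untouched.
    """
    match = _TRAILING_N.search(seq)
    if match is None or len(match.group()) < min_run:
        return seq, qual
    keep = match.start()
    # Cut the quality string at the same position so both stay in sync.
    return seq[:keep], qual[:keep]


if __name__ == "__main__":
    print(trim_3prime_n("ACGTACGTNNNNNN", "IIIIIIIIBBBBBB"))
```

Reads that consist only of N's come back empty, so they can be dropped in the
same pass instead of taking up memory later.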

> is Mira OK with having different length Solexa
> sequences

MIRA does not care about different lengths :-)

> , or should I just let Mira do the chopping?  

There is no "memory-saving chopping" in MIRA. -CL:pec will handle things well 
from a clipping point of view, but the clipped sequence is only hidden and 
thus still eats away memory even though it is not used.

> 2. Should I be complaining to my sequencing service?  Has anyone had
> experience with these read lengths from paired ends?  

I would. Here are some numbers I saw in the last few months.
- in the very worst project (paired-end, 36bp, 68% GC), MIRA had to clip away
  a total of 17.5% of all the bases, killing 1 million reads completely out
  of 8 million (12.5%). I don't have separate numbers for 'N', but these are
  generally only a tiny fraction of the clipped sequence.
- in 'normal' projects (76bp, quite neutral GC), MIRA clips away between 8 and
  12% of all the bases.

Note that % GC has an influence on the sequencing quality, with higher GC 
being potentially quite detrimental to sequence and quality values. See also 
the GGCxG problem I describe in 
  http://chevreux.org/GGCxG_problem.html

The clipping works really, really well. After that, 80% of the (possibly 
clipped) Solexa 76bp reads contain no error, 10% one error, 5% two errors, ~2% 
three errors and the rest more errors.

> Is this typical or
> indicative of problems with the sequencing and/or base calling?

Talk to your provider, asking them to put the number of N's (36%, sheeeesh) in 
relation to other projects they delivered. If they tell you "that's normal" 
with a straight face, ask around other sequencing providers about their 
numbers :-)
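
To have a comparable number in hand for that conversation, the fraction of N's
can be computed directly from the read sequences. A minimal sketch, assuming
the reads are already available as plain strings; n_fraction is a hypothetical
helper name, not something MIRA or the GERALD pipeline provides.

```python
def n_fraction(seqs):
    """Return the fraction of bases that are N over all given sequences.

    seqs is any iterable of sequence strings. Returns 0.0 if there are
    no bases at all.
    """
    total = 0
    n_count = 0
    for seq in seqs:
        total += len(seq)
        # Count both upper- and lower-case N.
        n_count += seq.count("N") + seq.count("n")
    return n_count / total if total else 0.0


if __name__ == "__main__":
    reads = ["ACGTACGT", "ACGTNNNN", "NNNNNNNN"]
    print("N fraction: {:.1%}".format(n_fraction(reads)))
```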

Regards,
  Bastien

