On Tuesday 01 December 2009 Reith, Michael wrote:
> Thanks for the comments Bastien. It turns out that I have 36% N's (from
> paired end, 76 bp reads from a lower eukaryote (genome ~35 Mb)). Two
> questions:
>
> 1. Since filling up the computer memory with a ton of N's seems like a
> waste, I'm going to remove all sequences that don't have at least 10
> bases of Phred > 15.

I'd suggest not doing that at first. The quality values written by the
GERALD pipeline are, well, not the most trustworthy ones. I've seen a lot
of cases of perfectly valid sequence with low quality values and, vice
versa, sequence with good quality values that was entirely unusable.

> If I also chop off the runs of N's at the 3' end
> the remaining sequences,

Yes, chop off long runs of N's. I might even implement such a clipping in
MIRA one day, but not before releasing 3.0.

> is Mira OK with having different length Solexa
> sequences

MIRA does not care about different lengths :-)

> , or should I just let Mira do the chopping?

There is no "memory-saving chopping" in MIRA. -CL:pec will handle things
well from a clipping point of view, but the clipped sequence is only
hidden and thus still eats away memory even if it is not used.

> 2. Should I be complaining to my sequencing service? Has anyone had
> experience with these read lengths from paired ends?

I would. Here are some numbers I saw in the last few months.

- in the very worst project (paired-end, 36bp, 68% GC), MIRA had to clip
  away a total of 17.5% of all the bases, completely killing 1 million
  reads out of 8 million (12.5%). I don't have separate numbers for 'N',
  but these are generally only a tiny fraction of the clipped sequence.
- in 'normal' projects (76bp, quite neutral GC), MIRA clips away between
  8 and 12% of all the bases.

Note that % GC has an influence on the sequencing quality, with higher GC
being potentially quite detrimental to sequence and quality values.
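
As a rough illustration of the 3' N-run chopping discussed above, here is a minimal Python sketch that could be run on the reads before they are handed to the assembler. It is not part of MIRA: the function names and the `min_run` parameter are made up for this example, and reads are assumed to be plain sequence and quality strings of equal length.

```python
def clip_trailing_ns(seq, qual, min_run=1):
    """Chop the run of N's at the 3' end of a read.

    seq and qual are the base and quality strings of one read.
    min_run is a hypothetical threshold: runs shorter than this are kept.
    Returns the (possibly shortened) sequence and quality strings.
    """
    kept = len(seq.rstrip("Nn"))      # length without the trailing N run
    if len(seq) - kept < min_run:     # run too short (or absent): keep read as is
        return seq, qual
    return seq[:kept], qual[:kept]    # trim the qualities to the same length


def n_fraction(seqs):
    """Fraction of all bases in seqs that are N (0.0 for empty input)."""
    total = sum(len(s) for s in seqs)
    if total == 0:
        return 0.0
    return sum(s.upper().count("N") for s in seqs) / total


if __name__ == "__main__":
    reads = [("ACGTACGTNNNN", "IIIIIIIIBBBB"), ("ACGNTACG", "IIIIIIII")]
    clipped = [clip_trailing_ns(s, q, min_run=3) for s, q in reads]
    print(clipped)
    print(n_fraction([s for s, _ in reads]), n_fraction([s for s, _ in clipped]))
```

Reads that consist only of N's come back as empty strings, so they would need to be filtered out in a separate step. N's inside a read are deliberately left alone.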
On the GC effect, see also the GGCxG problem I describe at
http://chevreux.org/GGCxG_problem.html

The clipping works really, really well. After it, 80% of the (possibly
clipped) Solexa 76bp reads contain no error, 10% one error, 5% two errors,
~2% three errors and the rest more than three.

> Is this typical or
> indicative of problems with the sequencing and/or base calling?

Talk to your provider and ask them to put the number of N's (36%,
sheeeesh) in relation to other projects they delivered. If they tell you
"that's normal" with a straight face, ask other sequencing providers
about their numbers :-)

Regards,
  Bastien