Thanks for the comments, Bastien. It turns out that I have 36% N's (from paired-end, 76 bp reads from a lower eukaryote with a ~35 Mb genome).

Two questions:

1. Since filling up the computer memory with a ton of N's seems like a waste, I'm going to remove all sequences that don't have at least 10 bases of Phred > 15. If I also chop off the runs of N's at the 3' end of the remaining sequences, is Mira OK with having Solexa sequences of different lengths, or should I just let Mira do the chopping? I'm a bit worried about bumping up against the memory limit of our machine (32 GB).

2. Should I be complaining to my sequencing service? Has anyone had experience with these read lengths from paired ends? Is this typical, or is it indicative of problems with the sequencing and/or base calling?

Thanks for your help.

Mike

-----Original Message-----
From: mira_talk-bounce@xxxxxxxxxxxxx [mailto:mira_talk-bounce@xxxxxxxxxxxxx] On Behalf Of Bastien Chevreux
Sent: November 30, 2009 2:24 PM
To: mira_talk@xxxxxxxxxxxxx
Subject: [mira_talk] Re: prescreening illumina sequences

On Sunday 29 November 2009 Reith, Michael wrote:
> Just a quick question about prescreening my Illumina sequences. From a
> quick look, it appears that my sequencing service has sent everything,
> including sequences that are entirely N's. It's obvious that these should
> be removed before starting the Mira assembly, as well as those with only a
> few good bases. My question is how far do I go - do I keep sequences that
> have 10 or 15 good bases or should I get rid of everything with more than
> a few N's? This is for 72 bp paired reads.

Hello Michael,

if you don't have too many Ns (like, >=20% of your total bases), then you don't need to bother with clipping them (be it by quality or by character) before going into assembly. The proposed-end-clip (-CL:pec) takes care of that and will make sure that only nice sequences are used. It's extremely effective for high-throughput data like Solexa or 454.
The only reason one would want to clip N's before loading is to reduce the memory footprint of the sequences. But as I wrote, I wouldn't bother for less than 20% of total seq (and I've never seen a Solexa data set with more than a percent or two of N-sequences).

Regards,
Bastien

--
You have received this mail because you are subscribed to the mira_talk mailing list. For information on how to subscribe or unsubscribe, please visit http://www.chevreux.org/mira_mailinglists.html
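
For anyone wanting to try the prescreening discussed in this thread, here is a minimal sketch in Python. It is not part of Mira, and all function names are hypothetical. It assumes the reads have already been parsed into (name, sequence, quality string) tuples and that qualities are ASCII-encoded with an offset of 33; older Illumina pipelines often used an offset of 64, so check your data and adjust. It measures the fraction of N's (to compare against the 20% rule of thumb above), trims runs of N's at the 3' end, and keeps only reads with at least 10 non-N bases of Phred > 15.

```python
def n_fraction(seqs):
    """Fraction of all bases that are N, over an iterable of sequences."""
    total = 0
    n_count = 0
    for s in seqs:
        total += len(s)
        n_count += s.upper().count("N")
    return n_count / total if total else 0.0


def trim_trailing_ns(seq, qual):
    """Remove the run of N's at the 3' end, keeping qualities in step."""
    trimmed = seq.rstrip("Nn")
    return trimmed, qual[:len(trimmed)]


def count_good_bases(seq, qual, min_phred=15, offset=33):
    """Count non-N bases whose Phred score is strictly above min_phred.

    offset=33 is an assumption about the quality encoding.
    """
    return sum(
        1
        for base, q in zip(seq, qual)
        if base not in "Nn" and ord(q) - offset > min_phred
    )


def prescreen(records, min_good=10, min_phred=15, offset=33):
    """Yield trimmed (name, seq, qual) records that pass the filter."""
    for name, seq, qual in records:
        seq, qual = trim_trailing_ns(seq, qual)
        if count_good_bases(seq, qual, min_phred, offset) >= min_good:
            yield name, seq, qual
```

Note that this filters each read on its own. With paired-end data, dropping one read of a pair leaves its mate orphaned, so the pairing would need to be handled separately before assembly.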