Thanks for the credit, John, but that wasn't quite what I was getting at. Pyrosequencing is error-prone, as are most comparable technologies. The 454 software makes some compensation for this and allows something like a 2 bp mismatch when detecting MID tag sequences. This might also be the case for detecting the B adaptor sequence if the data was generated with older chemistry (i.e. prior to the Rapid Library MID tags). My understanding is that they specifically chose the RL MID sequences so that even with a 2 bp mismatch they could still be unambiguously assigned.

However, we've been messing around with protocol modifications and have come across instances where the ligation messes up the end of the adaptor sequence, so that it gets truncated and ends up with more than the 2 bp mismatch that the filtering tolerates. Consequently, these don't get trimmed automatically and need some tender loving care to make things right.

That said, your solution is ultimately the best one to take. If the first or last few bases of the sequences look like they are crap, they probably are, so just get rid of them.

Bob - when it comes to Illumina data we're in the same boat. We just had ours installed and are working up our first run. Until I've had a chance to work with the data I really can't say much. The rest of the group have far more experience in this area than I do.

Shaun

From: Robert Bruccoleri <bruc@xxxxxxxxxxxxxxxxxxxxx>
To: mira_talk@xxxxxxxxxxxxx
Date: 2011-07-19 05:26 PM
Subject: [mira_talk] Re: 5' trimming of partial adapters
Sent by: mira_talk-bounce@xxxxxxxxxxxxx

Dear John,

Any suggestions with regard to Illumina reads?

Regards,
Bob

John Nash wrote:

My colleague, Shaun Tyler (also on this list), tells me that with 454 sequencing, there can be concatenation of the end adaptors to make dimers. In my hands, the second mer is often missing a base or two, and it's not removed by the primary clipping.
sff_extract usually screams at me when that happens, and so I re-invoke it with "--min_left_clip=16" or somesuch.

John

On 2011-07-19, at 6:00 PM, Robert Bruccoleri wrote:

In some of the genome assembly projects that I'm working on, I see an uneven GC content at the beginning (first 10 bases) of my reads. Since the library preparation is expected to be unbiased, uneven GC content suggests that there is a contaminant sequence at the beginning of some of my reads. Let's assume for the sake of argument that the contaminant sequence is a short subsequence of an adapter, but it's too short to identify by sequence similarity.

Does anyone have any ideas about how to handle the problem besides trimming the 5' end? Does the option -CL:possible_vector_leftover_clip handle this type of problem?

Thanks.
--Bob
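
For anyone wanting to put a number on the uneven GC content Bob describes, here is a rough Python sketch, not anything taken from MIRA or sff_extract. It tallies the GC fraction at each of the first few read positions and then suggests how many leading bases look suspicious. The function names, the `expected_gc` and `tolerance` parameters, and the "clip up to the last deviating position" rule are all assumptions made up for illustration; reads are assumed to be plain strings already parsed from your FASTA/FASTQ.

```python
def gc_fraction_by_position(reads, n_positions=10):
    """GC fraction at each of the first n_positions, pooled over all reads.

    Positions with no A/C/G/T calls (e.g. no reads, or only N) yield None.
    """
    gc = [0] * n_positions
    total = [0] * n_positions
    for read in reads:
        for i, base in enumerate(read[:n_positions].upper()):
            if base in "ACGT":          # ignore N and other ambiguity codes
                total[i] += 1
                if base in "GC":
                    gc[i] += 1
    return [g / t if t else None for g, t in zip(gc, total)]


def suggest_left_clip(fractions, expected_gc, tolerance):
    """Hypothetical heuristic: clip through the last position in the window
    whose GC fraction differs from expected_gc by more than tolerance."""
    clip = 0
    for i, frac in enumerate(fractions):
        if frac is not None and abs(frac - expected_gc) > tolerance:
            clip = i + 1
    return clip


if __name__ == "__main__":
    # Toy reads: a GC-rich 4-base leftover followed by balanced sequence.
    reads = ["GCGGAC", "GCGGTG", "GCGCCA", "GCGGGT"]
    fractions = gc_fraction_by_position(reads, n_positions=6)
    print(fractions)
    print(suggest_left_clip(fractions, expected_gc=0.5, tolerance=0.2))
```

If the suggested value looks sensible against the real data, it could serve as a starting point for the kind of fixed left clip John mentions (the value he passes to "--min_left_clip"), though the expected GC and tolerance would need to be chosen per genome.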