On Tuesday 15 March 2011 18:04:43 hikaru wrote:
> >SMALT_result
> alignment:S:00 70 GJMJG5A02F1HEX pCC1Fos 16 85
> 8070 8139 F 70 100.00 524
> (SMALT certainly recognizes the vector region in GJMJG5A02F1HEX read.)

Yep, it does.

> >GJMJG5A02F1HEX_sequence (a part of full length)
> gactacactactcgtTGAACAATGGAAGTCCGAGCTCATCGCTAATAACTTCGTATAGCATACATTATACGAAGT
> TATATTCGATGCGGCCGCAAGGGGTTCGCGTCAGCGGGTGTTGG

Exactly how long is that full length sequence? The SMALT hit denotes
positions 16 to 85 to be vector, so if there are more than a couple of
bases afterwards which are NOT vector (and not clipped away otherwise),
then the behaviour of MIRA is "normal" with respect to how it is
configured by default.

You called MIRA like this:

  mira -project=bchoc -job=denovo,genome,normal,sanger -fasta 454_SETTINGS -CL:msvs=yes

If you now look at how this is configured, MIRA will tell you this:

  Merge with SSAHA2/SMALT vector screen (msvs): [san] no   [454] yes
      Gap size (msvsgs)                       : [san] 10   [454] 8
      Max front gap (msvsmfg)                 : [san] 60   [454] 8
      Max end gap (msvsmeg)                   : [san] 120  [454] 12
      Strict front clip (msvssfc)             : [san] 0    [454] 0
      Strict end clip (msvssec)               : [san] 0    [454] 0

Important for the case you showed is this. Your read

  >GJMJG5A02F1HEX
  gact acactactcgt TGAACAATGGAAGTCCGAGCTCATCGCTAATAACTTCGTATAGCATACATTATACGAAGTTATATTCGATGCGGCCGCAAGGGGTTCGCGTCAGCGGGTGTTGG...

is probably clipped by the SFF after the first "gact" at position 0 to 3.
After that come 11 bases which are not vector ... and here the standard
parameter -CL:msvsmfg (being "8" for 454 technology for the above command
line) stops MIRA from considering that a valid hit to mask.

Now, you wrote that you posted only part of the sequence. Well, same thing
for the end of the sequence: if there are more bases between the end of
the recognised vector and the end of the read than -CL:msvsmeg allows (12),
then the hit will not be considered there either.

Therefore, MIRA does not clip.

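To make the arithmetic above concrete, here is a small Python sketch of the
gap check. It is only a simplified model of the rule described in this mail,
not MIRA's actual code: the function and class names are made up for
illustration, the comparison being inclusive ("<=") is an assumption, and
the msvsgs and strict clip parameters are ignored. Hit coordinates are taken
as 1-based and inclusive, as SMALT reports them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GapLimits:
    """Hypothetical container for the two limits discussed in the mail."""
    max_front_gap: int  # value of -CL:msvsmfg
    max_end_gap: int    # value of -CL:msvsmeg


# Values taken from the parameter listing quoted in the mail.
LIMITS_454 = GapLimits(max_front_gap=8, max_end_gap=12)
LIMITS_SANGER = GapLimits(max_front_gap=60, max_end_gap=120)


def front_gap(left_clip_len: int, hit_start: int) -> int:
    """Unclipped, non-vector bases between the front clip and the hit.

    left_clip_len: number of bases already clipped at the read start.
    hit_start: 1-based position of the first vector base.
    """
    return max(0, (hit_start - 1) - left_clip_len)


def end_gap(read_len: int, hit_end: int, right_clip_len: int = 0) -> int:
    """Unclipped, non-vector bases between the hit and the end clip.

    hit_end: 1-based position of the last vector base.
    """
    return max(0, read_len - right_clip_len - hit_end)


def merges_at_front(left_clip_len: int, hit_start: int,
                    limits: GapLimits) -> bool:
    """True if the hit is close enough to the front clip to be merged.

    Assumption: a gap equal to the limit still counts as close enough.
    """
    return front_gap(left_clip_len, hit_start) <= limits.max_front_gap


def merges_at_end(read_len: int, hit_end: int, limits: GapLimits,
                  right_clip_len: int = 0) -> bool:
    """True if the hit is close enough to the end clip to be merged."""
    return end_gap(read_len, hit_end, right_clip_len) <= limits.max_end_gap


if __name__ == "__main__":
    # Read GJMJG5A02F1HEX: "gact" (4 bases) clipped, vector hit starts at 16.
    print(front_gap(4, 16))                       # 11
    print(merges_at_front(4, 16, LIMITS_454))     # False: 11 > 8
    print(merges_at_front(4, 16, LIMITS_SANGER))  # True: 11 <= 60
```

With the 454 default of 8 the 11-base gap is too large, which matches the
behaviour seen for this read; any msvsmfg of 11 or more would let this hit
through under the model above.
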
You probably want to consider increasing both -CL:msvsmfg:msvsmeg.

The question you might now have: why does MIRA work like that? Answer:
experience. I've seen too many people configuring their vector searches in
a way that chance "hits" amidst a sequence were more a regular happening
than the exception. E.g. (X showing a false vector hit amidst some
sequence):

  <-- 200bases -->......XXXXXXXXXXXXXXX.......<-- 300bases -->

Comments on whether the strategy of MIRA is good, or suggestions for
improvements, are welcome.

Best,
  Bastien

PS: note to myself: document -CL:msvssfc:msvssec