[mira_talk] Re: Vector screen

  • From: Bastien Chevreux <bach@xxxxxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Tue, 15 Mar 2011 20:52:38 +0100

On Tuesday 15 March 2011 18:04:43 hikaru wrote:
> >SMALT_result
> alignment:S:00 70    GJMJG5A02F1HEX pCC1Fos       16       85
> 8070      8139   F      70 100.00 524
> (SMALT certainly recognizes the vector region in GJMJG5A02F1HEX read.)

Yep, it does.

> >GJMJG5A02F1HEX_sequence (a part of full length)
> gactacactactcgtTGAACAATGGAAGTCCGAGCTCATCGCTAATAACTTCGTATAGCATACATTATACGAAGT
> TATATTCGATGCGGCCGCAAGGGGTTCGCGTCAGCGGGTGTTGG 

How long exactly is the full-length sequence? The SMALT hit marks positions 
16 to 85 as vector, so if more than a couple of bases follow which are NOT 
vector (and are not clipped away otherwise), then the behaviour of MIRA is 
"normal" with respect to its default configuration.

You called MIRA like this:

mira -project=bchoc -job=denovo,genome,normal,sanger -fasta 454_SETTINGS 
      -CL:msvs=yes

If you now look at how this is configured, MIRA will tell you this:

        Merge with SSAHA2/SMALT vector screen (msvs):  [san]  no
                                                       [454]  yes
            Gap size (msvsgs)                       :  [san]  10
                                                       [454]  8
            Max front gap (msvsmfg)                 :  [san]  60
                                                       [454]  8
            Max end gap (msvsmeg)                   :  [san]  120
                                                       [454]  12
            Strict front clip (msvssfc)             :  [san]  0
                                                       [454]  0
            Strict end clip (msvssec)               :  [san]  0
                                                       [454]  0

Important for the case you showed is this: Your read

>GJMJG5A02F1HEX
gact
acactactcgt
TGAACAATGGAAGTCCGAGCTCATCGCTAATAACTTCGTATAGCATACATTATACGAAGTTATATTCGATGCGGCCGCAAGGGGTTCGCGTCAGCGGGTGTTGG...

is probably clipped by the SFF after the first "gact", i.e. at positions 0 to 3. 
After that come 11 bases which are not vector ... and here the standard 
parameter -CL:msvsmfg (which is "8" for 454 technology with the above command 
line) stops MIRA from considering this a valid hit to mask. Now, you wrote 
that you posted only part of the sequence. Well, same thing for the end of the 
sequence: if there are more bases between the end of the recognised vector and 
the end of the read than -CL:msvsmeg allows (12), then the hit will not be 
considered there either. Therefore, MIRA does not clip.
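
As a rough sketch of the arithmetic above (this is not MIRA's actual code): 
the function names are made up for illustration, the left clip of 4 bases is 
the assumed SFF clip of "gact", the comparison with "<=" is an assumption 
about how the limit is applied, and the value 20 is just an arbitrary larger 
setting.

```python
def front_gap(left_clip_len: int, hit_start: int) -> int:
    """Unclipped, non-vector bases in front of a vector hit.

    left_clip_len: bases already clipped at the read start (assumed: "gact" = 4)
    hit_start:     1-based start of the vector hit as reported by SMALT
    """
    return (hit_start - 1) - left_clip_len


def front_hit_accepted(left_clip_len: int, hit_start: int, msvsmfg: int) -> bool:
    # Assumption: a hit counts only if the gap does not exceed msvsmfg.
    return front_gap(left_clip_len, hit_start) <= msvsmfg


print(front_gap(left_clip_len=4, hit_start=16))   # 11
print(front_hit_accepted(4, 16, msvsmfg=8))       # False (454 default)
print(front_hit_accepted(4, 16, msvsmfg=20))      # True  (hypothetical larger value)
```

With the 454 default of 8 the 11-base gap is too large, which matches what 
you observed.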

You probably want to consider increasing both -CL:msvsmfg and -CL:msvsmeg.

The question you might now have: why does MIRA work like that? Answer: 
experience. I've seen too many people configure their vector searches in a 
way that chance "hits" amidst a sequence were more the rule than the 
exception. E.g. (X showing a false vector hit amidst some sequence)

<-- 200bases -->......XXXXXXXXXXXXXXX.......<-- 300bases -->
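
To put numbers on that picture, here is a small sketch (again not MIRA's real 
logic, just the rule as I described it): the limits are the defaults from the 
parameter listing above, the function name is invented, and the "usable if 
close enough to at least one end" test is an assumption.

```python
# Default gap limits as printed by MIRA in the parameter listing.
DEFAULTS = {
    "san": {"msvsmfg": 60, "msvsmeg": 120},
    "454": {"msvsmfg": 8, "msvsmeg": 12},
}


def hit_is_usable(front_gap: int, end_gap: int, tech: str) -> bool:
    """True if the hit lies close enough to at least one read end (assumed rule)."""
    limits = DEFAULTS[tech]
    return front_gap <= limits["msvsmfg"] or end_gap <= limits["msvsmeg"]


# The false hit in the picture: at least 200 bases before, 300 bases after.
for tech in DEFAULTS:
    print(tech, hit_is_usable(200, 300, tech))
```

Under both default sets such a mid-read hit is left alone.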

Comments on whether the strategy of MIRA is good, or suggestions for 
improvements, are welcome.

Best,
  Bastien

PS: note to myself: document -CL:msvssfc:msvssec
