[mira_talk] Re: polyA masking using mira internal routines

  • From: Bastien Chevreux <bach@xxxxxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Sat, 5 Mar 2011 16:31:23 +0100

On Tuesday 01 March 2011 12:09:24 Martin Mokrejs wrote:
> [...]

First: please do not CC me when posting to MIRA talk, it clutters my inbox. If 
there is something I read, it is that mailing list. Some would say even more 
attentively than my private inbox :-)

> So far I was able to come up with the above sed matching part which inserts
> the two minus signs at the end of the matching region. The problem is
> that I cannot find how to replace that matching region of variable size
> with 'X' characters.
> 
> Why mira does not accept 'N' characters as masking as well? mdust masks
> with 'N' so I had to convert them to 'X'.

Because MIRA makes a slight semantic distinction between "N" and "X". "N" is a 
base where the base caller thinks there should be a base, but absolutely 
cannot find out what it is. It's a valid base, not something which was masked 
by any other program. "X" on the other hand is a masked base.
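
On the variable-length replacement asked about above: this is awkward in plain 
sed but short in a scripting language. A minimal sketch in Python, assuming 
that a polyA stretch means ten or more consecutive "A" (that threshold, the 
function name and the default mask character are choices made for this 
example, not anything MIRA prescribes):

```python
import re

def mask_runs(seq, pattern=r"A{10,}", mask_char="x"):
    """Replace every match of `pattern` with mask characters of equal length.

    The default pattern (10 or more A's) is an assumption for this sketch;
    adapt it to whatever region you want masked.
    """
    return re.sub(
        pattern,
        lambda m: mask_char * len(m.group(0)),  # same length as the match
        seq,
        flags=re.IGNORECASE,
    )

if __name__ == "__main__":
    print(mask_runs("ACGT" + "A" * 12 + "CG"))
```

The replacement is computed per match, so the read keeps its length and any 
position-based information stays aligned with the sequence.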

> [...]
> While reading the definitive guide to mira I wonder whether it is expected
> that I provided mira with traceinfo file as created by sff_extract
> while concurrently I zapped the nasty sequences in *.fasta with 'X' chars
> and forbid mira to do vector clipping, polyA trimming, nasty repeats
> filtering (because this is non-normalized library). Is it fine that after
> clipping using traceinfo positions mira yields masked sequence which is
> should further shrink down?
> [...]
> If mira would accept 'x' or 'n' as the masking character I could
> depend on -CL:lcc=yes ?

I think MIRA already accepts lowercase "x" for masking, doesn't it?

Without further user intervention, the TRACEINFO normally contains at least 
the left and right clipping points from the Roche software. In the 454 
sequence output the clipped parts are lowercase and good parts are uppercase. 
For quite a while now, MIRA has had a clipping mode (lowercaseclip) which 
enables it to work with the sequence alone, so one does not need TRACEINFO.

But sometimes this distinction gets lost somewhere, and if people then use 
all-uppercase sequences (still with bad quality and adaptors in them), hilarity 
ensues. Or rather not.

Therefore I keep the TRACEINFO requirement to be sure people do not throw 
sequences at MIRA where adaptor trimming has not been performed in one way or 
another. 

But your inquiry led me to check another thing: the "lowercaseclip" mode will 
fail if your masking was done in uppercase. E.g.

>someread
tgatgtgctgactgtgactgcAAATGCATGACTGATGCTGACTAAAtgcatcagttgcatgactgactgtgac

The lowercaseclip will make the following sequence out of the above:

AAATGCATGACTGATGCTGACTAAA
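
The idea behind that clip can be modelled in a few lines. This is only a toy 
sketch in Python, not MIRA's actual code; the function name is made up, and it 
assumes the "good" part is simply everything from the first to the last 
uppercase letter:

```python
import re

def lowercase_clip(seq):
    """Toy model: keep the span from the first to the last uppercase letter.

    Returns an empty string when the read has no uppercase letter at all.
    """
    m = re.search(r"[A-Z].*[A-Z]|[A-Z]", seq)
    return m.group(0) if m else ""

if __name__ == "__main__":
    read = ("tgatgtgctgactgtgactgcAAATGCATGACTGATGCTGACTAAA"
            "tgcatcagttgcatgactgactgtgac")
    print(lowercase_clip(read))
```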

However, if you masked with uppercase "X", things will fail:

>someread
tgatgtgctgactgtgactgcAAATGCATGACTGATGCTGACTAAAAgcatcagtXXXXXXXXXXXXX

That might currently lead to:

AAATGCATGACTGATGCTGACTAAAAgcatcagt

I've corrected that in the development tree, but up to version 3.2.1.8 you 
will need to mask with lowercase "x" :-)
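
If your masking tool only writes uppercase "X", converting afterwards is 
simple. A minimal sketch in Python, assuming ordinary FASTA input where header 
lines start with ">" (the function name is made up for the example):

```python
def lowercase_masks(fasta_text):
    """Turn uppercase 'X' into lowercase 'x' on sequence lines only.

    Header lines (starting with '>') are left untouched so that read
    names containing an 'X' are not altered.
    """
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            out.append(line)
        else:
            out.append(line.replace("X", "x"))
    return "\n".join(out)

if __name__ == "__main__":
    print(lowercase_masks(">someread\nAAATGCATGACTXXXXXXXX"))
```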

> What happens if the masked sequence is in the middle of the sequence or at
> its right end?

The search behavior for masked bases is governed by -CL:mbcgs:mbcmfg:mbcmeg. 
Internal gaps via *gs, max distance from front via *mfg and from end via *meg. 
If masked bases are too far inside the sequence to be reached by the search, 
they simply stay in the sequence.
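
To make the "reachable by the search" part more concrete, here is a toy model 
of the end-of-read case in Python. It is not MIRA's code: the function name is 
invented, max_end_gap and gap_size merely stand in for *meg and *gs, and the 
way the two limits interact is an assumption made for the illustration.

```python
MASK = "Xx"

def clip_masked_end(seq, max_end_gap, gap_size):
    """Toy model of clipping masked bases at the right end of a read.

    Starting at the end, up to `max_end_gap` unmasked bases may be skipped
    to reach a masked stretch. After a stretch was found, further stretches
    are merged if at most `gap_size` unmasked bases separate them.
    """
    end = len(seq)
    allowed = max_end_gap
    while True:
        j = end - 1
        skipped = 0
        # Step over unmasked bases, but only as many as currently allowed.
        while j >= 0 and seq[j] not in MASK and skipped < allowed:
            j -= 1
            skipped += 1
        if j < 0 or seq[j] not in MASK:
            break  # no masked base within reach: stop searching
        # Swallow the whole masked stretch.
        while j >= 0 and seq[j] in MASK:
            j -= 1
        end = j + 1
        allowed = gap_size  # from now on only small internal gaps count
    return seq[:end]

if __name__ == "__main__":
    print(clip_masked_end("ACGTACGTXXXXac", max_end_gap=5, gap_size=2))
    print(clip_masked_end("ACXXGTACGTACGTACGT", max_end_gap=5, gap_size=2))
```

In the second call the masked bases sit more than five bases away from the 
end, so the toy search never reaches them and the read is returned unchanged.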

Internally, MIRA treats them almost like "N", but still tries to handle them a 
bit differently. E.g., they're not counted in different coverage statistics.

B.

-- 
You have received this mail because you are subscribed to the mira_talk mailing 
list. For information on how to subscribe or unsubscribe, please visit 
http://www.chevreux.org/mira_mailinglists.html
