On Tuesday 01 March 2011 12:09:24 Martin Mokrejs wrote:
> [...]

First: please do not CC me when posting to MIRA talk, it clutters my inbox. If
there is one thing I do read, it is that mailing list. Some would say even more
attentively than my private inbox :-)

> So far I was able to come up with the above sed matching part which inserts
> the two minus signs at the end of the matching region. The problem is
> that I cannot find how to replace that matching region of variable size
> with 'X' characters.
>
> Why mira does not accept 'N' characters as masking as well? mdust masks
> with 'N' so I had to convert them to 'X'.

Because MIRA makes a slight semantic distinction between "N" and "X". "N" is a
base where the base caller thinks there should be a base, but absolutely cannot
find out what it is. It's a valid base, not something which was masked by
another program. "X", on the other hand, is a masked base.

> [...]
> While reading the definitive guide to mira I wonder whether it is expected
> that I provided mira with traceinfo file as created by sff_extract
> while concurrently I zapped the nasty sequences in *.fasta with 'X' chars
> and forbid mira to do vector clipping, polyA trimming, nasty repeats
> filtering (because this is non-normalized library). Is it fine that after
> clipping using traceinfo positions mira yields masked sequence which
> should further shrink down?
> [...]
> If mira would accept 'x' or 'n' as the masking character I could
> depend on -CL:lcc=yes ?

I think MIRA already accepts lowercase "x" for masking, doesn't it?

Without further user intervention, the TRACEINFO normally contains at least the
left and right clipping points from the Roche software. In the 454 sequence
output the clipped parts are lowercase and the good parts are uppercase. For
quite a while now MIRA has had a clipping mode (lowercaseclip) which enables it
to work with the sequence alone, so one does not need TRACEINFO.
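The lowercase/uppercase convention just described can be illustrated with a minimal Python sketch. This is not MIRA's code: the function name `lowercase_clip` and the toy read are made up for illustration, and the only assumption is that clipped ends are lowercase while the good part starts and ends in uppercase.

```python
def lowercase_clip(seq: str) -> str:
    """Return seq without its leading and trailing lowercase runs.

    Illustrative model only (hypothetical helper, not MIRA's implementation):
    lowercase characters at either end are treated as clipped.
    """
    left = 0
    while left < len(seq) and seq[left].islower():
        left += 1
    right = len(seq)
    while right > left and seq[right - 1].islower():
        right -= 1
    return seq[left:right]


# Toy read: lowercase ends are clipped, the uppercase core is kept.
print(lowercase_clip("acgtACGTTGCAtgca"))  # ACGTTGCA
```

Note that this model only looks at the two ends: anything lowercase between two uppercase stretches stays in the read.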
But sometimes this distinction gets lost somewhere, and if people then use
all-uppercase sequences (with bad quality and adaptors still in them), hilarity
ensues. Or rather not. Therefore I keep the TRACEINFO requirement to be sure
people do not throw sequences at MIRA where adaptor trimming has not been
performed in one way or another.

But your inquiry led me to check another thing: the "lowercaseclip" will fail
if your masking was done in uppercase. E.g.

>someread
tgatgtgctgactgtgactgcAAATGCATGACTGATGCTGACTAAAtgcatcagttgcatgactgactgtgac

The lowercaseclip will make the following sequence out of the above:

AAATGCATGACTGATGCTGACTAAA

However, if you masked with uppercase "X", things will fail:

>someread
tgatgtgctgactgtgactgcAAATGCATGACTGATGCTGACTAAAAgcatcagtXXXXXXXXXXXXX

That might currently lead to:

AAATGCATGACTGATGCTGACTAAAAgcatcagt

I've corrected that in the development tree, but up to version 3.2.1.8 you will
need to mask with lowercase "x" :-)

> What happens if the masked sequence is in the middle of the sequence or at
> its right end?

The search behavior for masked bases is governed by -CL:mbcgs:mbcmfg:mbcmeg:
internal gaps via *gs, maximum distance from the front via *mfg and from the
end via *meg. If masked bases are too far inside the sequence and cannot be
reached by the search, they simply stay in the sequence. Internally, MIRA
treats them almost like "N", but still tries to handle them a bit differently.
E.g., they're not counted in various coverage statistics.

B.
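For the conversion step asked about in the quoted question (replacing a masked region of variable length character by character), here is a minimal Python sketch that rewrites runs of uppercase masking characters as lowercase "x" of the same length. It is not part of MIRA; the function name `remask_lowercase` and its parameters `mask_chars` and `min_run` are hypothetical. Converting "N" is only sensible if every "N" in the input comes from the masking program, since a genuine "N" from the base caller is a valid base and would be masked as well.

```python
import re


def remask_lowercase(seq: str, mask_chars: str = "XN", min_run: int = 1) -> str:
    """Replace each run of masking characters with lowercase 'x'.

    Hypothetical helper: a run of at least `min_run` characters from
    `mask_chars` is replaced by the same number of 'x', so the read
    length and all coordinates stay unchanged.
    """
    pattern = re.compile("[" + re.escape(mask_chars) + "]{%d,}" % min_run)
    return pattern.sub(lambda m: "x" * len(m.group(0)), seq)


# The uppercase-masked read from the example above.
read = "tgatgtgctgactgtgactgcAAATGCATGACTGATGCTGACTAAAAgcatcagtXXXXXXXXXXXXX"
print(remask_lowercase(read))

# Restrict to 'N' runs of length >= 3 so isolated N base calls are left alone.
print(remask_lowercase("ACNGNNNNA", mask_chars="N", min_run=3))  # ACNGxxxxA
```

Raising `min_run` is a crude way to leave single "N" base calls untouched; whether that threshold suits a given data set is a judgment call.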