[mira_talk] Re: polyA masking using mira internal routines

  • From: Martin Mokrejs <mmokrejs@xxxxxxxxxxxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Sun, 06 Mar 2011 15:15:03 +0100

Hi Bastien,

Bastien Chevreux wrote:
> On Tuesday 01 March 2011 12:09:24 Martin Mokrejs wrote:
>> [...]
> 
> First: please do not CC me when posting to MIRA talk, it clutters my inbox. 
> If 
> there is something I read, then that mailing list. Some wpuld say even more 
> attentively than my private inbox :-)

I am sorry, some people prefer that, probably because direct email do get 
delivered
often faster than through the list. Will try to remember.

> 
>> So far I was able to come up with the above sed matching part which inserts
>> the two minus seigns at the aned of teh matching region. The problem is
>> that I cannot find how to replace that matching region of variable size
>> with 'X' characters.
>>
>> Why mira does not accept 'N' characters as masking as well? mdust masks
>> with 'N' so I had to convert them to 'X'.

Well, would I realize "mdust -m X"

> 
> Because MIRA makes a slight semantic distinction between "N" and "X". "N" is 
> a 
> base where the base caller thinks there should be a base, but absolutely 
> cannot find out what it is.. It's a valid base, not something which was 
> masked 
> by any other program. "X" on the other hand is a masked base.

From my inspection of 454 sequences I see two types of errors in Roche's
basecaller:
1. If it is uncertain how long a oligomer stretch was, it appends 'n' but after
that, a single nucleotide from the oligomer. I suspect that has to do with
probabilities reflecting the neighbouring bases. It is weird that you can see
in your sequences stretches like aaggng, right?

2. I sometimes see that actually a nucleotide from a homopolymer is moved into
+2 position, like: GGGGGA becomes GGGGAG. Again, I suspect this is because the
basecaller tries to reflect signal from previous flows.

I am thinking of just dropping all n and N's from my 454 data and see what
happens with the assembly. ;-)

> 
>> [...]
>> While reading the definitive guide to mira I wonder whether it is expected
>> that I provided mira with traceinfofo file as created by sff_extract
>> while concurrently I zapped the nasty sequences in *.fasta with 'X' chars
>> and forbid mira to do vector clipping, polyA trimming, nasty repeats
>> filtering (because this is non-normalized library). Is it fine that after
>> clipping using traceinfo positions mira yields masked sequence which is
>> should further shrink down?
>> [...]
>> If mira would accept 'x' or 'n' as the masking character I could
>> depend on -CL:lcc=yes ?
> 
> I think MIRA already accepts lowercase "x" for masking, doesn't it?
> 
> Without further user intervention, the TRACEINFO normally contains at least 
> the left and right clipping points from the Roche software. In the 454 
> sequence output the clipped parts are lowercase and good parts are uppercase. 
> Since quite a while MIRA has a clipping mode (lowercaseclip) which enables it 
> to work only with the sequence and one does not need TRACEINFO.

I know, and Roche only uses the quality clipping values, because nothing from
their software look for adaptors, so the pair of values in sff files reserved
for adaptor clip points are always zero.


> But sometimes this distinction gets lost somewhere and if then people use all 
> uppercase sequences (then with bad quality and adaptors in them), hilarity 
> ensues. Or rather not.
> 
> Therefore I keep the TRACEINFO requirement to be sure people do not throw 
> sequences at MIRA where adaptor trimming has not been performed in one way or 
> another.

What happens if user provides xml info as exported by e.g. sff_extract but
provides fasta sequences subsequently changed (converted more to lowercase
or vice versa). Is it better to ignore the xml file or after interpreting the
xml clip points also try to extend the clippings based on the lower-casing in
fasta sequence files? (I suppose if I want to just use lower-case clipping
I would NOT provide any xml traceinfo).

 
> But your inquiry led me to check another thing: the "lowercaseclip) will fail 
> if your masking was done in uppercase. E.g.
> 
>> someread
> tgatgtgctgactgtgactgcAAATGCATGACTGATGCTGACTAAAtgcatcagttgcatgactgactgtgac
> 
> The lowercaseclip wil make the following sequence out of the avbove:
> 
> AAATGCATGACTGATGCTGACTAAA
> 
> However, if you masked with uppercase "X", things will fail:
> 
>> someread
> tgatgtgctgactgtgactgcAAATGCATGACTGATGCTGACTAAAAgcatcagtXXXXXXXXXXXXX
> 
> That might currently lead to:
> 
> AAATGCATGACTGATGCTGACTAAAAgcatcagt

That is bad, even worse because after me running mdust I have sequences like:

tgatgtgctgactgtgactgcAAATGCXXXXGATGCTGACTAAAtgcatcagXXXXXactgactgtgac

> 
> I've corrected corrected that in the development tree, but up to version 
> 3.2.1.8 you will need to mask with lowercase "x" :-)

I wonder if mira could print a note into its logfiles that the input sequences
contain N and X in upper/lower case and that the casing does make a difference.
Could save us a bit.

Similarly, if the clip positions seen in xml traceinfo do not match lowercasing
positions in fasta files ... Again, a good sanity check is always helpful.

> 
>> What happens if the masked sequence is in the middle of the sequence or at
>> its right end?
> 
> The search behavior for masked bases is governed by -CL:mbcgs:mbcmfg:mbcmeg. 
> Internal gaps via *gs, max distance from front via *mfg and from end via 
> *meg. 
> If now masked bases are too far within the sequences and cannot be reached by 
> the search, they simply stay in the sequence.

I think that did not answer my question. But from you example above, it looks
the internal, low-quality region is excised and the flanking sequences are 
joined.
Wow. Or is the sequence within the masked region and everything downstream
clipped away? Or is the whole read discarded? These were my questions. ;)

> 
> Internally, MIRA treats them almsot like "N", but still tries to handle them 
> a 
> bit differently. E.g., they're not counted in different coverage statistics.

M.

-- 
You have received this mail because you are subscribed to the mira_talk mailing 
list. For information on how to subscribe or unsubscribe, please visit 
http://www.chevreux.org/mira_mailinglists.html

Other related posts: