[mira_talk] Re: polyA masking using mira internal routines

  • From: Martin Mokrejs <mmokrejs@xxxxxxxxxxxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Tue, 01 Mar 2011 12:09:24 +0100


Bastien Chevreux wrote:
> On Tuesday 15 February 2011 15:02:32 Martin Mokrejs wrote:
>> [...]
>> The mira options do not allow me to specify 454-specific problems
>> with homopolymers. I will easily run out of
>> -CLIPPING:cp_min_signal_len=20,cp_max_errors_allowed=5.
>> I think the routine should force that the polyT starts with 'T'
>> immediately after the 'gact' key, there should be at least say 10 T
>> out of 16 and then it should expand the matching region to the right
>> unless it hits a stretch where no consequence of `TT' is present.
> 
> One can indeed think of a multitude of specialised algorithms for poly-AT 
> clipping. Feel free to implement any which comes to your mind :-)
> 
>> My questions, what mira could do in masking these polyA/T from 454 reads.
> 
> Whatever anyone is willing to implement. I am currently fully busy at a couple
> of other places in the MIRA code.
> 
>> The simple statistics how to detect polyA/T regions in mira is currently
>> not much helpful IMHO. I will try to come up with some regexp to do this
>> myself and let you know. ;-)

sed 's/^gact\(T\{8,\}[ATGCN]\{0,3\}T\{8,\}[ATGCN]\{0,3\}T\{7,\}[ATGCN]\{0,3\}T\{7,\}[ATGCN]\{0,3\}T\{6,\}[ATGCN]\{0,1\}T\{5,\}\)\(.*\)$/gact\1--\2/'

So far I have been able to come up with the sed expression above, which inserts
two minus signs at the end of the matching region. The problem is that
I cannot find a way to replace that variable-length matching region with 'X'
characters.
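
One way around the variable-length replacement is a replacement callback,
which sed lacks but most scripting languages offer. The following is only
a minimal sketch in Python, not anything mira provides: it reuses the
pattern from the sed expression and overwrites the matched polyT region
with the same number of 'X' characters, so read length and coordinates
stay unchanged. The function name mask_polyt and the mask_char parameter
are made up for this illustration, and the sketch assumes each read is
available as one string that starts with the lowercase 'gact' key.

```python
import re

# Same pattern as the sed expression, written in Python regex syntax.
POLYT_AFTER_KEY = re.compile(
    r"^gact("
    r"T{8,}[ATGCN]{0,3}"
    r"T{8,}[ATGCN]{0,3}"
    r"T{7,}[ATGCN]{0,3}"
    r"T{7,}[ATGCN]{0,3}"
    r"T{6,}[ATGCN]{0,1}"
    r"T{5,})"
)


def mask_polyt(seq, mask_char="X"):
    """Overwrite the polyT region after the 'gact' key with mask_char.

    The replacement has the same length as the matched region.
    Reads that do not match the pattern are returned unchanged.
    """
    return POLYT_AFTER_KEY.sub(
        lambda m: "gact" + mask_char * len(m.group(1)), seq, count=1
    )
```

Calling mask_polyt("gact" + "T" * 45 + "ACGTACGT") returns the key,
followed by 45 'X' characters, followed by ACGTACGT.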

Why does mira not accept 'N' characters as masking as well? mdust masks with
'N', so I had to convert them to 'X'.
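
For reference, the N-to-X conversion can be sketched as below. This is
just an illustration in Python with a made-up function name (n_to_x);
it assumes plain FASTA input where header lines start with '>'. Note
that it also turns genuine ambiguous 'N' base calls into 'X', because
they cannot be told apart from the mdust masking.

```python
def n_to_x(fasta_lines):
    """Yield FASTA lines with N/n replaced by X/x in sequence lines only."""
    table = str.maketrans("Nn", "Xx")
    for line in fasta_lines:
        if line.startswith(">"):
            # Header line: leave read names untouched.
            yield line
        else:
            yield line.translate(table)
```

It takes any iterable of lines, for example an open file handle, and
the result can be written out with writelines().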

In the end I went with mdust, which masked even some internal polyA/T regions,
but that is fine for a first approach. However, it seems that I mixed
up something with my flags, because I now have twice as many contigs (the run
has not finished yet) as I used to have, even though I enforced more stringent
'mrl' and 'mo' values:

--job=denovo,est,accurate,454 454_SETTINGS -ASSEMBLY:ardct=15,urdcm=5,mrl=80 
-CLIPPING:clip_polyat=off -AL:mo=40 -DP:ure=yes,rewl=5,leip=10 COMMON_SETTINGS 
-SKIM:number_of_threads=8,bases_per_hash=10,hss=2,mnr=no,nrr=500,mmhr=2 
-CLIPPING:apply_skim_chimeradetectionclip=no,pec=no 
-ASSEMBLY:skim_each_pass=on,nop=10,uniform_read_distribution=no,use_genomic_pathfinder=no,rmb_break_loops=10,sd=no,esspd=1000
 -GE:not=8 -OUT:rld=no,orw=yes

While reading the Definitive Guide to mira, I wonder whether it is OK
that I provided mira with the traceinfo file created by sff_extract
while at the same time I masked the nasty sequences in *.fasta with 'X' chars
and forbade mira to do vector clipping, polyA trimming and nasty repeat
filtering (because this is a non-normalized library). Is it fine that, after
clipping at the traceinfo positions, mira is left with a masked sequence which
should be shrunk further?

Currently I am thinking of these options as well:
-AS:urdcm=500,urdsip=8,ard=no,sd=no -CL:pvlc=no,lcc=yes,mbc=yes

If mira accepted 'x' or 'n' as the masking character, could I
rely on -CL:lcc=yes?

This part of the reference manual makes me unsure about what is going on:

<quote>
[maskedbases_clip(mbc)=on|yes|1, off|no|0]

    Default is dependent of the sequencing technology used. This will let mira 
perform a 'clipping' of bases that were masked out (replaced with the character 
X). It is generally not a good idea to use mask bases to remove unwanted 
portions of a sequence, the EXP file format and the NCBI traceinfo format have 
excellent possibilities to circumvent this. But because a lot of preprocessing 
software are built around cross_match, scylla- and phrap-style of base masking, 
the need arose for mira to be able to handle this, too. mira will look at the 
start and end of each sequence to see whether there are masked bases that 
should be 'clipped'. 
</quote>

What happens if the masked sequence is in the middle of the sequence or at its
right end?
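
To see how many of my reads are affected by each case, the masked
stretches can be located with something like the sketch below. It says
nothing about what mira itself does with them; it only reports where
runs of 'X' lie in a read. The function name masked_runs is made up,
and the sketch does not account for the lowercase key at the start of
a read, so a run that directly follows 'gact' is reported as internal.

```python
import re


def masked_runs(seq, mask_char="X"):
    """Return (start, end, where) for each run of mask_char in seq.

    start/end are 0-based, end-exclusive; where is 'start', 'end'
    or 'internal' depending on whether the run touches a read end.
    """
    runs = []
    for m in re.finditer(re.escape(mask_char) + "+", seq):
        if m.start() == 0:
            where = "start"
        elif m.end() == len(seq):
            where = "end"
        else:
            where = "internal"
        runs.append((m.start(), m.end(), where))
    return runs
```

For "XXACGTXXXACGTXX" this gives one run at the start, one internal
run and one run at the end.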

Thanks,
Martin





