Bastien Chevreux wrote:
> On Tuesday 15 February 2011 15:02:32 Martin Mokrejs wrote:
>> [...]
>> The mira options do not allow me to specify 454-specific problems
>> with homopolymers. I will easily run out of
>> -CLIPPING:cp_min_signal_len=20,cp_max_errors_allowed=5.
>> I think the routine should force that the polyT starts with 'T'
>> immediately after the 'gact' key, there should be at least say 10 T
>> out of 16 and then it should expand the matching region to the right
>> unless it hits a stretch where no consequence of `TT' is present.
>
> One can indeed think of a multitude of specialised algorithms for poly-AT
> clipping. Feel free to implement any which comes to your mind :-)
>
>> My questions, what mira could do in masking these polyA/T from 454 reads.
>
> Whatever anyone is willing to implement. I am currently fully busy at a
> couple of other places in the MIRA code.
>
>> The simple statistics how to detect polyA/T regions in mira is currently
>> not much helpful IMHO. I will try to come up with some regexp to do this
>> myself and let you know. ;-)

sed 's/^gact\(T\{8,\}[ATGCN]\{0,3\}T\{8,\}[ATGCN]\{0,3\}T\{7,\}[ATGCN]\{0,3\}T\{7,\}[ATGCN]\{0,3\}T\{6,\}[ATGCN]\{0,1\}T\{5,\}\)\(.*\)$/gact\1--\2/'

So far I was able to come up with the sed expression above, which
inserts two minus signs at the end of the matching region. The problem
is that I cannot find out how to replace that matching region, which has
a variable size, with 'X' characters.

Why does mira not accept 'N' characters as masking as well? mdust masks
with 'N', so I had to convert them to 'X'. In the end I went with mdust,
which masked even some internal polyA/T regions, but that is fine for a
first approach.
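For the variable-length replacement, a substitution callback is one way
around the sed limitation. The following is only a minimal sketch in
Python, not anything mira provides. It assumes each read sits on a
single line, as the sed expression does. The names POLY_T, mask_poly_t
and mask_fasta_lines are made up for illustration. Python's regex engine
backtracks instead of taking the POSIX leftmost-longest match, so in
edge cases the matched region may end at a different position than with
sed.

```python
import re

# The sed pattern from above, translated to Python regex syntax.
# Group 1 is the 'gact' key, group 2 the poly-T region to be masked.
POLY_T = re.compile(
    r"^(gact)"
    r"(T{8,}[ATGCN]{0,3}T{8,}[ATGCN]{0,3}T{7,}[ATGCN]{0,3}"
    r"T{7,}[ATGCN]{0,3}T{6,}[ATGCN]{0,1}T{5,})"
)


def mask_poly_t(seq, mask_char="X"):
    """Overwrite the poly-T region after the key with mask_char.

    The replacement has the same length as the matched region, so
    all coordinates in the read stay valid. Reads without a match
    are returned unchanged.
    """
    def repl(match):
        return match.group(1) + mask_char * len(match.group(2))

    return POLY_T.sub(repl, seq, count=1)


def mask_fasta_lines(lines):
    """Mask sequence lines; FASTA header lines ('>') pass through."""
    for line in lines:
        line = line.rstrip("\n")
        yield line if line.startswith(">") else mask_poly_t(line)
```

Run over the lines of a FASTA file, mask_fasta_lines would give the same
reads with the leading poly-T stretch replaced by 'X' and everything
else untouched.
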
But it seems to me that I mixed something up with my flags, and that I
now have twice as many contigs (the run has not finished yet) as I used
to have before, although I enforced more stringent 'mrl' and 'mo'
values:

--job=denovo,est,accurate,454
454_SETTINGS
-ASSEMBLY:ardct=15,urdcm=5,mrl=80
-CLIPPING:clip_polyat=off
-AL:mo=40
-DP:ure=yes,rewl=5,leip=10
COMMON_SETTINGS
-SKIM:number_of_threads=8,bases_per_hash=10,hss=2,mnr=no,nrr=500,mmhr=2
-CLIPPING:apply_skim_chimeradetectionclip=no,pec=no
-ASSEMBLY:skim_each_pass=on,nop=10,uniform_read_distribution=no,use_genomic_pathfinder=no,rmb_break_loops=10,sd=no,esspd=1000
-GE:not=8
-OUT:rld=no,orw=yes

While reading the definitive guide to mira, I wonder whether it is
expected that I provide mira with the traceinfo file as created by
sff_extract while at the same time I have zapped the nasty sequences in
*.fasta with 'X' chars and forbidden mira to do vector clipping, polyA
trimming and nasty repeats filtering (because this is a non-normalized
library). Is it fine that, after clipping using the traceinfo positions,
mira is left with a masked sequence which it should shrink down further?

Currently I am thinking of these options as well:

-AS:urdcm=500,urdsip=8,ard=no,sd=no
-CL:pvlc=no,lcc=yes,mbc=yes

If mira accepted 'x' or 'n' as the masking character, could I depend on
-CL:lcc=yes?

This part of the reference manual makes me unsure what is going on:

<quote>
[maskedbases_clip(mbc)=on|yes|1, off|no|0]
Default is dependent of the sequencing technology used.

This will let mira perform a 'clipping' of bases that were masked out
(replaced with the character X). It is generally not a good idea to use
mask bases to remove unwanted portions of a sequence, the EXP file
format and the NCBI traceinfo format have excellent possibilities to
circumvent this. But because a lot of preprocessing software are built
around cross_match, scylla- and phrap-style of base masking, the need
arose for mira to be able to handle this, too.
mira will look at the start and end of each sequence to see whether
there are masked bases that should be 'clipped'.
</quote>

What happens if the masked sequence is in the middle of the sequence or
at its right end?

Thanks,
Martin

-- 
You have received this mail because you are subscribed to the mira_talk mailing list.
For information on how to subscribe or unsubscribe, please visit http://www.chevreux.org/mira_mailinglists.html