[mira_talk] Re: dependencies between X masked data and traceinfo.xml data in mira clipping behaviour
- From: Bastien Chevreux <bach@xxxxxxxxxxxx>
- To: mira_talk@xxxxxxxxxxxxx
- Date: Wed, 8 Oct 2008 22:02:04 +0200
On Wednesday 08 October 2008 12:04, Jorge.DUARTE@xxxxxxxxxxxx wrote:
> Does anyone have experience in using together traceinfo.xml data and X
> masked sequences with mira ?
I think I do :-)
Hello Jorge,
> I mean, looking at the doc of mira :
> [...]
> It looks like Xs are clipped at some point, but it doesn't say when, and I
> wonder if there could be a problem when clipping with the data contained
> in the xml file in case this clipping is done after the other... (assuming
> that clipping means actually removing a region)
Nope, it doesn't: clipping never removes anything. As with the sequence vector
"clip" and the quality "clip", MIRA just moves the pointers that delimit the
"unclipped" part of the sequence (i.e., the part that is officially "good").
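To make the pointer idea concrete, here is a minimal Python sketch. It is not MIRA code: the class name ClippedRead, its fields and the convention that the left value counts hidden bases while the right value is the 1-based position of the last good base are all assumptions chosen to match the example read in this thread.

```python
from dataclasses import dataclass


@dataclass
class ClippedRead:
    """Hypothetical model of a read: bases stay put, only clip points move."""
    name: str
    sequence: str    # full sequence, never shortened
    left_clip: int   # number of bases hidden at the left end (assumed convention)
    right_clip: int  # 1-based position of the last "good" base (assumed convention)

    def good_part(self) -> str:
        # The "unclipped" region is only a view onto the full sequence.
        return self.sequence[self.left_clip:self.right_clip]


read = ClippedRead(
    name="FAINC1H01EKDXQ",
    sequence=(
        "TCAGACGAGTGCGT" + "X" * 30 + "AGCAGA"
        "GATGATGTGGGCAAGTTCCTTCCCACATACTTGGCGCAGGGAATCCTTCA"
        "GAGCGCTGAGCGGGCTGGCAAGGC"
    ),
    left_clip=0,
    right_clip=124,
)

# Applying the traceinfo.xml clip points changes two integers, nothing else.
read.left_clip, read.right_clip = 15, 105
print(len(read.sequence))     # length of the stored read, unchanged by clipping
print(len(read.good_part()))  # length of the region currently marked as good
```

Because the stored sequence keeps its full length, a right clip at base 105 stays valid no matter which other clips were set before it.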
> Indeed, if i have a sequence like this one:
> >FAINC1H01EKDXQ
>
> TCAGACGAGTGCGTXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXAGCAGA
> GATGATGTGGGCAAGTTCCTTCCCACATACTTGGCGCAGGGAATCCTTCA
> GAGCGCTGAGCGGGCTGGCAAGGC
>
> and traceinfo data like this:
>
> <trace>
> <trace_name>FAINC1H01EKDXQ</trace_name>
> <trace_type_code>454</trace_type_code>
> <program_id>454Basecaller</program_id>
> <clip_quality_left>15</clip_quality_left>
> <clip_quality_right>105</clip_quality_right>
> </trace>
>
> if mira first clips the Xs, and then tries to clip the sequence using the
> traceinfo data,
> will the sequence not be too short to be clipped at base 105 ?
As written above, none of the clipping routines physically removes sequence.
While there is an order of clipping, in the end MIRA will just use the most
conservative clipping points: the rightmost left clip and the leftmost right
clip.
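The "most conservative clipping points" rule can be sketched in a few lines of Python. The function name most_conservative and the (left, right) tuple layout are assumptions for illustration only, and the two clip pairs are what the XML data and an X-based clip would suggest for the read in this thread.

```python
def most_conservative(clips):
    """Combine (left, right) clip pairs: rightmost left clip, leftmost right clip."""
    lefts, rights = zip(*clips)
    return max(lefts), min(rights)


# Assumed clip pairs for the 124-base example read.
xml_clip = (15, 105)     # from traceinfo.xml
masked_clip = (44, 124)  # left end hidden up to the end of the X stretch

print(most_conservative([xml_clip, masked_clip]))
print(most_conservative([masked_clip, xml_clip]))  # order does not change the result
```

Since the combination is a plain maximum and minimum, the result in this sketch does not depend on the order in which the individual clips were computed.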
The general order of clipping is:
0) use clip points read from file (EXP, CAF or XML)
1) quality clip
2) masked base clip
3) merging of SSAHA data
4) clipping of poly-AT
5) minimum left clip
6) minimum right clip
If the bad sequence search clip is switched on, then step 2 above changes to:
2a) masked base clip
2b) minimum left clip
2c) bad sequence search clip
In your example above, the read looks like this right after the FASTA has been
read (I use lower case here to show the clipped parts; nothing is clipped yet):
TCAGACGAGTGCGTXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXAGCAGA
GATGATGTGGGCAAGTTCCTTCCCACATACTTGGCGCAGGGAATCCTTCA
GAGCGCTGAGCGGGCTGGCAAGGC
Then the (EXP / CAF / XML) clip points are applied:
tcagacgagtgcgtxXXXXXXXXXXXXXXXXXXXXXXXXXXXXXAGCAGA
GATGATGTGGGCAAGTTCCTTCCCACATACTTGGCGCAGGGAATCCTTCA
GAGCGctgagcgggctggcaaggc
The masked base clip (if used) then moves the left clip past the stretch of X:
tcagacgagtgcgtxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxAGCAGA
GATGATGTGGGCAAGTTCCTTCCCACATACTTGGCGCAGGGAATCCTTCA
GAGCGctgagcgggctggcaaggc
Then the rest of the clips are applied (if used).
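The walk-through above can be reproduced with a short Python sketch. Both helper names (show, masked_left_clip) are made up for this illustration, and the masked base rule used here (push the left clip forward while it sits on an X) is a deliberate simplification, not a description of MIRA's actual routine or its parameters.

```python
def show(sequence, left_clip, right_clip, width=50):
    """Lower-case the clipped ends, upper-case the good part, wrap like FASTA."""
    marked = (sequence[:left_clip].lower()
              + sequence[left_clip:right_clip].upper()
              + sequence[right_clip:].lower())
    return [marked[i:i + width] for i in range(0, len(marked), width)]


def masked_left_clip(sequence, left_clip, mask="X"):
    """Simplified assumption: advance the left clip while it points at a masked base."""
    pos = left_clip
    while pos < len(sequence) and sequence[pos] == mask:
        pos += 1
    return pos


seq = ("TCAGACGAGTGCGT" + "X" * 30 + "AGCAGA"
       "GATGATGTGGGCAAGTTCCTTCCCACATACTTGGCGCAGGGAATCCTTCA"
       "GAGCGCTGAGCGGGCTGGCAAGGC")

left, right = 15, 105                 # clip points from traceinfo.xml
print("\n".join(show(seq, left, right)))

left = masked_left_clip(seq, left)    # masked base clip moves only the pointer
print("\n".join(show(seq, left, right)))
```

The sequence string itself is identical before and after both steps; only the integers left and right differ.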
> I apologize if this is a stupid question, as I'm not familiar with the
> general clipping behaviour of bioinformatics tools.
> But if someone could tell me that mira will have no trouble handling this,
> that would be great !
MIRA should indeed have no problem in handling that. If you spot something
erroneous ... well, drop the author a note :-)
Regards,
Bastien
PS: and thanks for using the talk list for this kind of question