[mira_talk] Re: ignoring quality and vector trim information from XML files with a paired-end assembly

  • From: Bastien Chevreux <bach@xxxxxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Thu, 30 Apr 2009 20:17:59 +0200

On Wednesday 29 April 2009 Eric Cabot wrote:
> I am trying to assemble a bacterial genome based on Sanger paired-end reads
> but have found that XML and ancillary files obtained from NCBI contain
> completely untrustworthy  (i.e. nearly all bogus)  values of the quality
> and vector trimming coordinates. 

Hello Eric,

so I'm not the only one with that impression. A few weeks ago I searched 
for some test data at the NCBI trace archive and spent some time figuring 
out why MIRA produced bogus results. As I was under some time pressure, I 
did not follow it up thoroughly, but as I wrote, my last working hypothesis 
was that the vector clippings were ... somewhat sub-optimal.

Question: could you document a few test cases and make the NCBI aware of the 
problem? I think it's pretty important: if the data is more often bogus than 
right, all the nice XML standardisation work they did is for nothing.

> Given that I know the sequences of the
> vectors and adapters what are good approaches to use these data with mira?
> My reading of the available documentation is that the XML file contains the
> information needed to treat pairs as pairs.  Assuming that I can clip away
> vectors (e.g. with SSAHA) and low quality ends (somehow) prior to running
> mira, is there a parameter that will allow me to use the template
> information from the XML but ignore vector and quality trim coordinates?

This is not possible. I quickly thought about it, but the number of options 
that you'd need (take this info, ignore that other, correct a third, etc.) 
would seriously increase the total number of parameters ... as if MIRA didn't 
have enough yet :-)

> An alternate that comes to mind is to modify the XML file itself.
> If that is feasible, could I simply remove CLIP_VECTOR and CLIP_QUALITY
> blocks?

Removing this kind of info from the standard XML files should be quite easy. If 
you don't have some silly case where entries are split across several 
lines, a simple "grep -v clip_quality_left" would, for example, remove all 
left quality clippings.
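
For the case where entries do span several lines, a small script that parses 
the XML is safer than grep. The following is only a minimal sketch, not 
something MIRA ships: the function name strip_clips is made up for 
illustration, and the four tag names are my assumption about what the NCBI 
trace info XML uses, so check them against your own files first.

```python
import xml.etree.ElementTree as ET

# Assumed clip tag names of the NCBI trace info XML; verify against your files.
CLIP_TAGS = {
    "clip_quality_left",
    "clip_quality_right",
    "clip_vector_left",
    "clip_vector_right",
}


def strip_clips(xml_text, tags=CLIP_TAGS):
    """Return xml_text with every element whose tag is in `tags` removed.

    All other elements (template, insert size, etc.) are left untouched.
    """
    root = ET.fromstring(xml_text)
    # Take a snapshot of all elements first so that removing children
    # does not interfere with the traversal.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag.lower() in tags:
                parent.remove(child)
    return ET.tostring(root, encoding="unicode")
```

Reading the original file, passing its content through strip_clips() and 
writing the result to a new file keeps the template information while 
dropping the clips.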

Now, to get new clips back into the XML, you'd need to write an XML file 
yourself (which is not too difficult).
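
As a rough sketch of what such a rewrite could look like: assume you have 
computed new clip positions yourself (e.g. from SSAHA hits) and hold them in 
a dictionary keyed by read name. The function name set_clips, the dictionary 
layout and the tag names (trace, trace_name, clip_vector_left, ...) are 
assumptions for illustration, not a MIRA or NCBI API.

```python
import xml.etree.ElementTree as ET


def set_clips(xml_text, new_clips):
    """Write new clip values into a trace info XML document.

    new_clips maps a read name to a dict of {clip_tag: position}, e.g.
    {"read1": {"clip_vector_left": 30, "clip_vector_right": 700}}.
    Existing clip elements are overwritten, missing ones are appended.
    Reads not present in new_clips are left unchanged.
    """
    root = ET.fromstring(xml_text)
    for trace in root.iter("trace"):
        name = trace.findtext("trace_name", default="").strip()
        for tag, value in new_clips.get(name, {}).items():
            elem = trace.find(tag)
            if elem is None:
                elem = ET.SubElement(trace, tag)
            elem.text = str(value)
    return ET.tostring(root, encoding="unicode")
```

Used after the old clips have been stripped, this yields an XML file that 
still carries the template information but only the clips you trust.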


Actually, I've had the idea for quite some time now to introduce a kind of 
"final commands after loading" file for cases like this. It would be a kind of 
"EXP-light" format which would be dead simple to write for anyone. 
Unfortunately, a few other things were more important to me than this 
hypothetical use case ... which is not hypothetical anymore.

I'll think about it (but no promises).

> Does any one have any suggestions as to how I might proceed?

I'd go for the "rewrite XML" solution at the moment ... it's the cleanest ... 
and you could gain fame, gold and admiration of groupies[1] by making your 
script available :-)

Regards,
  Bastien

[1] actually, I'm not sure about the validity of this statement. I'm still 
pondering the reason why at least the latter two didn't materialise for me.

-- 
You have received this mail because you are subscribed to the mira_talk mailing 
list. For information on how to subscribe or unsubscribe, please visit 
http://www.chevreux.org/mira_mailinglists.html
