On Wednesday 29 April 2009 Eric Cabot wrote:
> I am trying to assemble a bacterial genome based on Sanger paired-end reads
> but have found that XML and ancillary files obtained from NCBI contain
> completely untrustworthy (i.e. nearly all bogus) values of the quality
> and vector trimming coordinates.

Hello Eric,

so I was not the only one with that impression. A few weeks ago I searched for some test data at the NCBI trace archive and spent some time figuring out why MIRA produced bogus results. As I was under some time constraints, I did not follow it up thoroughly, but as I wrote, my last working hypothesis was that the vector clippings were ... somewhat sub-optimal.

Question: could you document a few test cases and make the NCBI aware of the problem? I think it's pretty important: if the data is more bogus than right, all the nice XML standardisation work they did is for nothing.

> Given that I know the sequences of the
> vectors and adapters what are good approaches to use these data with mira?
> My reading of the available documentation is that the XML file contains the
> information needed to treat pairs as pairs. Assuming that I can clip away
> vectors (e.g. with SSAHA) and low quality ends (somehow) prior to running
> mira, is there a parameter that will allow me to use the template
> information from the XML but ignore vector and quality trim coordinates?

This is not possible. I quickly thought about it, but the number of options that you'd need (take this info, ignore that other, correct a third, etc.) would seriously increase the total number of parameters ... as if MIRA didn't have enough yet :-)

> An alternate that comes to mind is to modify the XML file itself.
> If that is feasible, could I simply remove CLIP_VECTOR and CLIP_QUALITY
> blocks?

Removing this kind of info from the standard XML files should be quite easy.
If you don't have some silly case where entries are split across several lines, a simple "grep -v clip_quality_left" would, for example, remove all left quality clippings. Now, to get new clips back into an XML file, you'd need to write the XML yourself (which is not too difficult).

Actually, I've had the idea for quite some time now to introduce a kind of "final commands after loading" file for cases like this. It would be a kind of "EXP-light" format which would be dead simple to write for anyone. Unfortunately, a few other things were more important to me than this hypothetical use case ... which is not hypothetical anymore. I'll think about it (but no promise).

> Does any one have any suggestions as to how I might proceed?

I'd go for the "rewrite XML" solution at the moment ... it's the cleanest ... and you could gain fame, gold and admiration of groupies[1] by making your script available :-)

Regards,
  Bastien

[1] Actually, I'm not sure about the validity of this statement. I'm still pondering the reason why at least the latter two didn't materialise for me.
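
As a rough illustration of the "remove the clip entries" idea discussed in this thread, here is a minimal Python sketch. It drops every element whose name starts with clip_quality_ or clip_vector_ and leaves the template information alone. The element names, the sample record and the function name strip_clips are assumptions made for the example, not something taken from MIRA or verified against a real NCBI file, so check them against your own XML first. Because it parses the XML instead of filtering lines, it should not care whether an entry is split across several lines.

```python
import xml.etree.ElementTree as ET

# Assumed tag-name prefixes of the clip entries to drop (compared in lower case).
CLIP_PREFIXES = ("clip_quality_", "clip_vector_")

# Made-up sample record for illustration only; real files may differ.
SAMPLE = """<trace_volume>
  <trace>
    <trace_name>read1.f</trace_name>
    <template_id>tmpl1</template_id>
    <trace_end>F</trace_end>
    <clip_quality_left>12</clip_quality_left>
    <clip_quality_right>650</clip_quality_right>
    <clip_vector_left>30</clip_vector_left>
  </trace>
</trace_volume>"""


def strip_clips(xml_text: str) -> str:
    """Return xml_text with all clip_quality_* / clip_vector_* elements removed."""
    root = ET.fromstring(xml_text)
    # Snapshot the element list first so removals do not disturb the iteration.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag.lower().startswith(CLIP_PREFIXES):
                parent.remove(child)
    return ET.tostring(root, encoding="unicode")


cleaned = strip_clips(SAMPLE)
print(cleaned)
```

New clip values obtained elsewhere (e.g. from your own vector screening) could be written back in a similar way by appending elements to each trace record, but that part is left out here.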