[mira_talk] ignoring quality and vector trim information from XML files with a paired-end assembly

  • From: Eric Cabot <ecabot@xxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Wed, 29 Apr 2009 14:43:52 -0500

I am trying to assemble a bacterial genome from Sanger paired-end reads,
but have found that the XML and ancillary files obtained from NCBI contain
completely untrustworthy (i.e. nearly all bogus) quality and vector
trimming coordinates. Given that I know the sequences of the vectors and
adapters, what are good approaches for using these data with mira?

My reading of the available documentation is that the XML file contains the
information needed to treat pairs as pairs.  Assuming that I can clip away
vectors (e.g. with SSAHA) and low quality ends (somehow) prior to running
mira, is there a parameter that will allow me to use the template
information from the XML but ignore vector and quality trim coordinates?

An alternative that comes to mind is to modify the XML file itself.
If that is feasible, could I simply remove the CLIP_VECTOR and CLIP_QUALITY
blocks?
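
To make that XML edit concrete, here is a minimal sketch in Python. It
assumes the clip values are stored as child elements whose tag names begin
with "clip_vector" or "clip_quality" (in either case), and that the file
uses no XML namespace. The function names and file names are hypothetical,
and whether mira accepts the stripped file as hoped is untested.

```python
import xml.etree.ElementTree as ET

# Assumption: clip fields are elements whose tag starts with one of these
# prefixes (compared case-insensitively), e.g. clip_quality_left.
CLIP_PREFIXES = ("clip_vector", "clip_quality")


def strip_clips(root):
    """Remove clip elements from the tree in place; return how many were removed."""
    removed = 0
    # Snapshot the element list first so removal does not disturb iteration.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag.lower().startswith(CLIP_PREFIXES):
                parent.remove(child)
                removed += 1
    return removed


def strip_clips_file(src, dst):
    """Read src, drop the clip elements, and write the result to dst."""
    tree = ET.parse(src)
    count = strip_clips(tree.getroot())
    tree.write(dst, encoding="utf-8", xml_declaration=True)
    return count
```

Calling strip_clips_file("traceinfo.xml", "traceinfo_noclip.xml") (both
file names hypothetical) would leave the template and read-pair fields
untouched and drop only the clip elements, so the original file stays
available for comparison.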

Does anyone have any suggestions as to how I might proceed?


...


For the record, here is the last command that I used, which produced an
assembly with 1687 large contigs:

>mira -project=e_hyb -fasta -job=denovo,genome,normal,sanger,454 \
-highlyrepetitive -DP:ure=yes -CL:pvlc=yes \
454_SETTINGS -CL:emrc=yes:qc=no   SANGER_SETTINGS -CL:qc=yes

(The project also contains non-paired-end 454 reads and, yes, lots of plasmids with diverse copy numbers.)

Thanks mira talkers,

Eric C.

