[mira_talk] Re: Beginner Question RE sanger data in hybrid assembly

From: Bastien Chevreux <bach@xxxxxxxxxxxx>
To: mira_talk@xxxxxxxxxxxxx
Date: Mon, 23 Nov 2009 22:04:52 +0100

On Monday 23 November 2009 Jeremiah Davie wrote:
> Hi All,
>      I am learning to use MIRA and wanted to incorporate Sanger data
> along with a 454 run. The problem is that the sanger data is saved as
> *.ab1 files, as the sanger sequencing facility at our university uses
> an older ABI sequencer.

Hello Jeremiah,

first question is: does your provider just provide the ab1 files or do they 
also have a service where they preprocess the data? If yes, ask them to do 
that. There's a ton of stuff one should be aware of and it's by no means 
trivial (quality clipping, sequencing vector trimming etc.).

Most providers should still have pretty good pipelines from the heyday of 
Sanger sequencing.

If they do, they can give you the data in almost any format and MIRA should be 
able to use it: FASTA + XML, EXP or even masked FASTA if there's no other 
possibility.

> I can use Sequencher to convert those files to
> fasta/fastq files, but cannot generate the traceinfo.xml files that
> MIRA expects. Is there a way to avoid using the traceinfo.xml files?

Yes, using EXP files or masked FASTA files.

> Conversely, does anyone know of a program that will convert an .ab1
> file to fasta/fastq/traceinfo.xml collection? 

Nothing public I know of (I know at least two companies have an internal 
pipeline for that).

But there's still GAP4 and the pregap4 pipeline. Comes with a pretty robust 
ab1 -> EXP conversion pipeline. And MIRA can then read the Sanger reads in EXP 
and 454 reads in FASTA + XML.

Have a look at it: http://staden.sourceforge.net/

> If not, can someone
> guide me to an easy to follow guide for writing a traceinfo.xml file?

The canonical source would be the NCBI (they standardized the format):
http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?cmd=show&f=rfc_b&m=doc&s=rfc_b

Have a look at the example a bit down on the page, it's not really difficult. 
But please note that MIRA currently does not parse "<common_fields>" (it's 
somewhat recent), so all these fields need to be placed per read into the file.
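
If a file already uses "<common_fields>", the shared values could be copied
into every read with a few lines of Python. This is only a rough sketch: it
assumes a <trace_volume> root holding one <common_fields> element and the
<trace> elements as direct children (as in the NCBI example), and the function
name expand_common_fields is made up for illustration, it is not part of MIRA.

```python
import xml.etree.ElementTree as ET


def expand_common_fields(xml_text):
    """Copy each child of <common_fields> into every <trace> lacking that tag."""
    root = ET.fromstring(xml_text)
    common = root.find("common_fields")  # assumed to be a direct child of the root
    if common is None:
        return xml_text  # nothing to expand, hand the input back unchanged
    shared = list(common)
    for trace in root.iter("trace"):
        present = {child.tag for child in trace}
        for field in shared:
            # A value given on the read itself wins over the shared one.
            if field.tag not in present:
                ET.SubElement(trace, field.tag).text = field.text
    root.remove(common)
    return ET.tostring(root, encoding="unicode")
```

Per-read values are left untouched on purpose, so a read that carries its own
insert_size keeps it.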

Here's a minimal entry per read I would generate:

   <trace>
      <trace_name>HBBAA1U0001</trace_name>
      <trace_file>HBBAA1U0001.scf</trace_file>
      <clip_vector_left>56</clip_vector_left>
      <clip_vector_right>737</clip_vector_right>
      <clip_quality_left>80</clip_quality_left>
      <clip_quality_right>700</clip_quality_right>
      <template_id>HBBAA1U0001</template_id>
      <insert_size>1500</insert_size>
      <insert_stdev>450</insert_stdev>
   </trace>

Leave out "template_id" and the "insert_*" if you don't work with templates.
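
Entries like the one above could be generated with a short script. The
following is a minimal sketch, not a tested pipeline: the helper names
make_trace and make_volume are hypothetical, the clip values have to come from
your own preprocessing, and wrapping the reads in a <trace_volume> root is an
assumption based on the NCBI example.

```python
import xml.etree.ElementTree as ET


def make_trace(name, trace_file, vector_clip, quality_clip, template=None):
    """Build one <trace> element.

    vector_clip and quality_clip are (left, right) pairs; template is an
    optional (template_id, insert_size, insert_stdev) triple.
    """
    fields = [
        ("trace_name", name),
        ("trace_file", trace_file),
        ("clip_vector_left", vector_clip[0]),
        ("clip_vector_right", vector_clip[1]),
        ("clip_quality_left", quality_clip[0]),
        ("clip_quality_right", quality_clip[1]),
    ]
    if template is not None:
        template_id, insert_size, insert_stdev = template
        fields += [
            ("template_id", template_id),
            ("insert_size", insert_size),
            ("insert_stdev", insert_stdev),
        ]
    trace = ET.Element("trace")
    for tag, value in fields:
        ET.SubElement(trace, tag).text = str(value)
    return trace


def make_volume(traces):
    """Wrap <trace> elements in a <trace_volume> root and serialize to text."""
    root = ET.Element("trace_volume")
    root.extend(traces)
    return ET.tostring(root, encoding="unicode")
```

For the read shown above one would call
make_trace("HBBAA1U0001", "HBBAA1U0001.scf", (56, 737), (80, 700),
template=("HBBAA1U0001", 1500, 450)) and pass a list of such elements to
make_volume.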

> Any help would be greatly appreciated; I'm pulling my hair out on
> this. Sincerely, Jeremiah

Don't! It's not worth it. Besides: they'll disappear in the coming years 
faster than you'd wish for :-)

Regards,
  Bastien

-- 
You have received this mail because you are subscribed to the mira_talk mailing 
list. For information on how to subscribe or unsubscribe, please visit 
http://www.chevreux.org/mira_mailinglists.html

References:
- [mira_talk] Beginner Question RE sanger data in hybrid assembly
  - From: Jeremiah Davie
