[mira_talk] Re: MIRA - Baccardi - ace file

  • From: John Nash <john.he.nash@xxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Mon, 12 Mar 2012 16:12:34 -0400

On 2012-03-12, at 3:18 PM, Shankar Manoharan wrote:

> Thank you Andre...
>       Is it absolutely essential to validate an assembly before proceeding 
> with annotation ? I'm guessing the answer is 'yes'. So, what other tools are 
> available for the same other than Amos validate and Baccardi ? Any help would 
> be appreciated :)

Assemblers are not perfect, and neither is NGS sequencing.  Until read lengths 
are longer than repeat sizes, there will be assembly issues which have to be 
dealt with manually. Reads longer than the repeats will solve the former. The 
latter is usually dealt with by using appropriate coverage and more than one 
sequencing technology.

After MIRA (or any other assembler) has finished, it is rare that you will get 
back one contig per linkage group (i.e. per chromosome or large plasmid). What 
you will get back is a set of unordered contigs which (typically) end at 
repeats. If your sequence coverage is not good, you will have even more contigs.
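
A quick way to see how fragmented an assembly is: count the contigs and look at 
their lengths. The following is only a minimal sketch in Python. It assumes the 
contigs have been exported as FASTA text, and the helper names (contig_lengths, 
n50) and the toy sequences are made up for illustration; none of this is part 
of MIRA.

```python
def contig_lengths(fasta_text):
    """Map each FASTA record name to its sequence length."""
    lengths = {}
    name = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the contig name.
            fields = line[1:].split()
            name = fields[0] if fields else "unnamed"
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def n50(lengths):
    """Length of the contig at which half of the total bases is reached."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


# Toy input: four contigs of 8, 6, 4 and 2 nt.
fasta = ">c1\nACGTACGT\n>c2 some description\nACG\nTAC\n>c3\nACGT\n>c4\nAC\n"
lengths = contig_lengths(fasta)
print(len(lengths), "contigs, N50 =", n50(lengths.values()))
```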

Many researchers like to transform their set of contigs to one contig per 
linkage group.  Even if you don't want to do that, the assembly must be 
examined to find misassemblies and errors caused by NGS technology problems 
(e.g. 454 sequencing gives lots of frameshifts because of its problems handling 
polynucleotide repeats).

I take my MIRA output (the CAF file), and convert it to a set of contigs with > 
1/3 overall coverage and with a minimum contig size of 500 nt, e.g.

  $ convert_project -f caf -t caf -x 500 -y 40 PROJ_out.caf PROJ_largecontigs_out
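
The 40 given to -y is presumably about one third of that project's overall 
coverage, so it has to be adjusted for each project. The same selection logic 
can be sketched in Python as below. This is only an illustration: the overall 
coverage of 120x, the contig records and the function name keep_large_contigs 
are assumptions, and the sketch makes no claim about how convert_project treats 
values exactly at the threshold.

```python
def keep_large_contigs(contigs, min_length, min_coverage):
    """Keep (name, length, avg_coverage) records that meet both minimums."""
    return [
        (name, length, cov)
        for name, length, cov in contigs
        if length >= min_length and cov >= min_coverage
    ]


# Assumed overall average coverage of the project (hypothetical value).
overall_coverage = 120.0
min_coverage = overall_coverage / 3  # one third of the overall coverage

# Hypothetical (name, length in nt, average coverage) records.
contigs = [
    ("c1", 152000, 118.0),
    ("c2", 480, 95.0),    # dropped: shorter than 500 nt
    ("c3", 2300, 12.5),   # dropped: coverage below the threshold
    ("c4", 900, 41.0),
]

kept = keep_large_contigs(contigs, min_length=500, min_coverage=min_coverage)
print([name for name, _, _ in kept])
```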

Then I import it into gap5 (part of the Staden package) and go through MIRA's 
tags to fix errors that MIRA has flagged. Bastien has denoted the tags worth 
chasing in the manual - usually I chase anything indicative of a misassembly or 
a 454 polynucleotide problem. I guess this is what people call "validating the 
assembly".

If I need to circularize the genome, I make sure that the assembly has decent 
overlaps at the region where I want to circularize, and then I "top-and-tail" 
the assembly.
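
As a rough illustration of the overlap check, the sketch below looks for the 
longest stretch at the end of a contig that exactly repeats its start, and then 
trims that duplicate off one end. That trimming is an assumed reading of 
"top-and-tail", not a description of any particular tool. Exact matching is a 
simplification: with real data the two ends will differ by sequencing errors, 
so the join should still be inspected in gap5 or checked with an aligner. The 
function names and the toy sequence are made up.

```python
def end_overlap(seq, min_overlap=20):
    """Length of the longest suffix of seq that exactly equals its prefix.

    Only overlaps up to half the sequence length are considered; returns 0
    when no overlap of at least min_overlap bases is found.
    """
    seq = seq.upper()
    for k in range(len(seq) // 2, min_overlap - 1, -1):
        if seq[-k:] == seq[:k]:
            return k
    return 0


def trim_circular(seq, min_overlap=20):
    """Remove the duplicated end so each base of the circle appears once."""
    k = end_overlap(seq, min_overlap)
    return seq[:-k] if k else seq


# Toy circular "genome" of 45 nt; the contig runs 25 nt past the start again.
genome = "GGGGG" + "ACGTACCTAGCTTAGCATCGATTCAGCTAGCATCGTACAT"
contig = genome + genome[:25]

print(end_overlap(contig), len(trim_circular(contig)))
```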

If I need to close the genome completely, I get an OpGen map done, order the 
contigs, and use blast and directed PCR to close the gaps.

gap5 has this neat feature called "Find next problem" :)

Then I dump the FASTA file out of Staden, and feed it into an annotator 
(usually RAST). Occasionally, the annotation will point out issues that need me 
to go back and check the original sequence - these are usually around 
454-induced frameshifts. With a hybrid assembly, this can be reduced but not 
eliminated.
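
Since the 454 frameshifts cluster at runs of a single base, one way to get a 
list of places worth a second look is to scan the consensus for long runs. The 
sketch below is a minimal example using Python's re module. The cut-off of 5 
bases, the function name homopolymer_runs and the test sequence are arbitrary 
choices for illustration, and a long run is only a candidate for checking, not 
proof of an error.

```python
import re


def homopolymer_runs(seq, min_run=5):
    """Return (start, base, length) for each single-base run >= min_run.

    Start positions are 0-based.
    """
    seq = seq.upper()
    # One base followed by at least (min_run - 1) copies of the same base.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_run - 1))
    return [
        (m.start(), m.group(1), m.end() - m.start())
        for m in pattern.finditer(seq)
    ]


# Toy consensus with runs of 6 A, 5 T and 7 C.
consensus = "ACGT" + "A" * 6 + "CG" + "T" * 5 + "GG" + "C" * 7 + "AT"

for start, base, length in homopolymer_runs(consensus):
    print(start, base, length)
```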

HTH




