[mira_talk] Re: Denovo versus mappings assembly with Mira

  • From: Bastien Chevreux <bach@xxxxxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Sat, 11 May 2013 23:22:16 +0200

On May 11, 2013, at 23:02 , David Coil <coil.david@xxxxxxxxx> wrote:
> But I'm wondering if the results of a mapping assembly (using the 
> -SB:abnc-yes flag) count as a "genome sequence".

*sigh* The -SB:abnc flag. Combines the worst and the best of the worlds of 
de-novo and mapping assemblies. Looks like way more people are using it than 
I'd initially thought, maybe I should reconsider its removal in the 3.9 line.

>  Basically you get one big contig (the mapping) and then a number of small 
> ones assembled from the leftovers.   But the big mapped contig has gaps even 
> though it's called a "contig".   You'd have to break the contig on those gaps 
> to say, submit the genome to NCBI even though many of them are very small.   
> One could of course use the unpadded result, but then you may be joining 
> things together over large gaps that really should be considered separate 
> contigs.

Careful there. Uncovered areas of a backbone (reference sequence) are NOT 
deleted from the "unpadded" results, only "gap" columns are (i.e., columns 
created by a spurious insertion base in one (or very few) reads at a given 
place).

If you take the current default FASTA output from a mapping assembly in MIRA 
you get something many people do not expect: an amalgam of the data from your 
mapped strain and, in coverage holes, the data from the reference. I thought 
this to be a good idea, but I'm not so sure anymore.

What one should do with the results from mapping: use convert_project to 
extract the clean "by strain" data. Like this:

  convert_project -f maf -t fasta mira_out.maf mynewresults

Uncovered areas of the backbone are then represented by a string of "N" 
characters in these new results.
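
As a rough illustration of breaking such a result on its "N" stretches (for 
example before a submission), here is a minimal Python sketch. It is not part 
of MIRA: the function names and the min_gap threshold of 10 are assumptions 
made for this example, and the only thing it assumes about the input is that 
it is plain FASTA text.

```python
import re


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA text lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].strip(), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def split_on_n_runs(sequence, min_gap=10):
    """Split a sequence at every run of at least min_gap 'N' characters.

    min_gap is a hypothetical threshold chosen for illustration; shorter
    runs of N are kept inside the pieces. Empty pieces are dropped.
    """
    if min_gap < 1:
        raise ValueError("min_gap must be at least 1")
    pattern = re.compile(r"[Nn]{%d,}" % min_gap)
    return [piece for piece in pattern.split(sequence) if piece]
```

Each record from read_fasta() can be passed to split_on_n_runs() and the 
pieces written out as separate contigs. Which gap length justifies a break is 
a judgement call that depends on the data and on the submission rules.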

> So it seems to me that mapping assemblies are great for answering biological 
> questions... but insufficient to say publish the genome sequence of an 
> isolate.    Would people agree or disagree with that idea?  Am I thinking 
> about the mapping assemblies incorrectly?

Agree, you are thinking about these assemblies totally correctly. I wrote the 
mapping modes of MIRA to answer biological questions and that's what it does :-)

The following is just my 2 cents, other feedback welcome.

For *very* related strains however one can think about reworking the mapping 
output slightly in an assembly editor (gap4/gap5) and then using this for 
publishing. Very related means in this case: it depends on whatever you feel 
comfortable finishing by hand … and on the importance you attach to having 
the genome "completed." For strains having just a couple of SNPs, short indels 
and maybe two or three larger indels or genome reorganisation breakpoints it's 
a no-brainer, for more you have to decide.

Hope that helps,
  Bastien


--
You have received this mail because you are subscribed to the mira_talk mailing 
list. For information on how to subscribe or unsubscribe, please visit 
http://www.chevreux.org/mira_mailinglists.html
