[mira_talk] Re: High Coverage Mapping Assembly

  • From: Adrian Pelin <apelin20@xxxxxxxxx>
  • To: "mira_talk@xxxxxxxxxxxxx" <mira_talk@xxxxxxxxxxxxx>
  • Date: Thu, 25 Sep 2014 20:43:17 -0400

What is the purpose of your endeavor?

Sincerely,
Adrian

> On Sep 25, 2014, at 8:39 PM, Said Muñoz Montero <said3427@xxxxxxxxx> wrote:
> 
> Hello Mira experts,
> 
> I am running a mapping assembly of 9 different parasite isolates 
> simultaneously, using a reference genome from the same species. Genome 
> variability between samples is low, except for copy number variation. The 
> total coverage after combining all samples is 360X, so I changed the 
> -NW:cac=stop parameter.
> 
> I have read the warnings about similar tasks on the MIRA mailing list, but 
> they refer to de novo assembly. Leaving aside the computational resources 
> needed, what do you think about this strategy?
> 
> I would really appreciate any advice!
> 
> Here are the warnings given by Bastien in the Mira Guide:
> 
> "With todays' sequencing technologies (especially Illumina, but also Ion
> Torrent and 454), many people simply take everything they get and throw it
> into an assembly. Which, in the case of Illumina and Ion, can mean they try
> to assemble their organism with a coverage of 100x, 200x and more (I've
> seen trials with more than 1000x).
> 
> This is not good. Not. At. All! For two reasons (well, three to be precise).
> The first reason is that, usually, one does not sequence a single cell but
> a population of cells. If this population is not clonal (i.e., it contains
> subpopulations with genomic differences with each other), assemblers will
> be able to pick up these differences in the DNA once a certain sequence
> count is reached and they will try reconstruct a genome containing all
> clonal variations, treating these variations as potential repeats with
> slightly different sequences. Which, of course, will be wrong and I am
> pretty sure you do not want that.
> 
> The second and way more important reason is that none of the current
> sequencing technologies is completely error free. Even more problematic,
> they contain both random and non-random sequencing errors. Especially the
> latter can become a big hurdle if these non-random errors are so prevalent
> that they suddenly appear to be valid sequence to an assembler. This in
> turn leads to false repeat detection, hence possibly contig breaks or even
> wrong consensus sequence. You don't want that, do you?
> 
> The last reason is that overlap based assemblers (like MIRA is) need
> *exponentially* more time and memory when the coverage increases. So
> keeping the coverage comparatively low helps you there."
> 
> THANKS!!!
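
For reference, the coverage figures in the question can be put into numbers
with a short sketch. It assumes the nine isolates contribute roughly equally
to the 360X total (the post does not say so) and uses a hypothetical target
of 100X purely for illustration. The function names are made up for this
example and are not part of MIRA.

```python
def per_sample_coverage(total_coverage, n_samples):
    """Average coverage per sample, ASSUMING every sample contributes equally."""
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    return total_coverage / n_samples


def keep_fraction(current_coverage, target_coverage):
    """Fraction of reads to keep so that coverage drops to the target.

    Returns 1.0 when the data set is already at or below the target.
    """
    if current_coverage <= 0:
        raise ValueError("current_coverage must be positive")
    return min(1.0, target_coverage / current_coverage)


if __name__ == "__main__":
    total = 360.0    # combined coverage stated in the post
    isolates = 9     # number of isolates stated in the post
    target = 100.0   # hypothetical target, not a MIRA recommendation

    print(f"per-isolate coverage: {per_sample_coverage(total, isolates):.1f}X")
    print(f"keep {keep_fraction(total, target):.1%} of reads to reach {target:.0f}X")
```

Under the equal-contribution assumption each isolate sits at about 40X, so
the high total comes from pooling the samples rather than from any single
isolate.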
