[mira_talk] Re: High Coverage Mapping Assembly

  • From: Chris Hoefler <hoeflerb@xxxxxxxxx>
  • To: "mira_talk@xxxxxxxxxxxxx" <mira_talk@xxxxxxxxxxxxx>
  • Date: Thu, 25 Sep 2014 21:32:05 -0500

You didn't say how large the genome is, so I'll guess high at ~18Mb? I've 
played with mapping multiple strains simultaneously, but I have concluded that 
it usually isn't very helpful for a few reasons.

1) While getting a memory usage estimate is fairly straightforward, CPU time is 
harder to predict. In my experience, mapping two strains separately in 
different processes is faster than mapping two strains simultaneously. While 
that may be counter-intuitive, it actually makes sense when you consider that 
Mira does some things multi-threaded and other things single-threaded. So with 
twice the work in a single thread, a batch job will take longer than single 
jobs each in their own process. Even so, you might be ok with the longer 
execution time for the convenience of one manifest file and one output assembly 
that you can then analyze (more on this below). For two strains that might be 
fine, but for nine strains at 18Mb per strain, this will take a very long time 
and is probably not worth it.

2) Consider Bastien's first point from the manual. Only here you are not 
dealing with cell-to-cell variability, you are dealing with strain-to-strain 
variability. In other words, the problem is amplified. You won't get contig 
breaking in the same way as you would with a denovo assembly, but Mira uses 
sequence context from neighboring reads to better deal with sequencing errors, 
so when you have subsets of reads with SNPs in different places, this may be 
problematic for the read mapping and/or feature tagging. The more strains the 
more problematic this can be. If you have larger differences--like deletions or 
rearrangements--you may miss those entirely.

3) Best case scenario, all the reads map to the right places, and you can 
conceivably detect SNPs. How will you analyze this? The SNP feature tags that 
Mira uses will apply to the entire assembly, not to individual strains. So 
while you can look at a single strain's consensus, you will have to find the 
SNPs manually. In some cases you might find special tags, such as IUPc or STMU, 
when two strains disagree about a SNP. You can quickly find these areas and 
then figure out what is going on by manual inspection. But this will be a much 
more complicated problem with nine strains. And again, if you have larger 
deletions or rearrangements in some of these strains, you may not see them 
without going through the debris file and trying to piece together what 
happened.

Bottom line: I would just do each mapping separately. If you have a cluster 
that you can submit jobs on, just write a quick bash script to submit them one 
at a time in succession. You can analyze the jobs that finish first while the 
later jobs are still running.


> On Sep 25, 2014, at 7:39 PM, Said Muñoz Montero <said3427@xxxxxxxxx> wrote:
> 
> Hello Mira experts,
> 
> I am doing a mapping assembly with 9 different parasite isolates 
> simultaneously using a reference genome from the same specie. The genome 
> variability between samples is low, except for copy number variation.  The 
> total coverage after combining all samples is 360X so I changed the 
> -NW:cac=stop parameter.
> 
> I have read the warnings about similar tasks in MIRA mailing list but these 
> are referred to a denovo assembly. Despite the computational resources 
> needed. What do you think about these strategy?
> 
> I would really appreciate any advice!
> 
> Here are the warnings given by Bastien in the Mira Guide:
> 
> "With todays' sequencing technologies (especially Illumina, but also Ion
> Torrent and 454), many people simply take everything they get and throw it
> into an assembly. Which, in the case of Illumina and Ion, can mean they try
> to assemble their organism with a coverage of 100x, 200x and more (I've
> seen trials with more than 1000x).
> 
> This is not good. Not. At. All! For two reasons (well, three to be precise).
> The first reason is that, usually, one does not sequence a single cell but
> a population of cells. If this population is not clonal (i.e., it contains
> subpopulations with genomic differences with each other), assemblers will
> be able to pick up these differences in the DNA once a certain sequence
> count is reached and they will try reconstruct a genome containing all
> clonal variations, treating these variations as potential repeats with
> slightly different sequences. Which, of course, will be wrong and I am
> pretty sure you do not want that.
> 
> The second and way more important reason is that none of the current
> sequencing technologies is completely error free. Even more problematic,
> they contain both random and non-random sequencing errors. Especially the
> latter can become a big hurdle if these non-random errors are so prevalent
> that they suddenly appear to be valid sequence to an assembler. This in
> turn leads to false repeat detection, hence possibly contig breaks or even
> wrong consensus sequence. You don't want that, do you?
> 
> The last reason is that overlap based assemblers (like MIRA is) need
> *exponentially* more time and memory when the coverage increases. So
> keeping the coverage comparatively low helps you there."
> 
> THANKS!!!

Other related posts: