[mira_talk] Re: MIRA v. 3.9.15 vs MIRA v. 4.0.2 performance degradation issues when using Illumina 54-bp paired-end reads in hybrid assembly

  • From: Bastien Chevreux <bach@xxxxxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Thu, 22 May 2014 22:21:33 +0200

On 22 May 2014, at 21:41 , Bayles, Darrell <Darrell.Bayles@xxxxxxxxxxxx> wrote:
> The bacterium does have a large number of large repetitive elements, and yes 
> most are transposons.  We have considered PacBio and that would be the 
> simplest way to work through the big repeats; however, I’d still like to get 
> some clarification regarding the questions of performance differences between 
> MIRA v. 3.9.15 and MIRA v. 4.0.2, and clarification about the questions 
> resulting from your comments about short reads.  While I didn’t expect there 
> to be a big improvement in assembling with v. 4.0.2, I certainly didn’t 
> expect a substantial decline in the goodness of assembly either.

There are a couple of things to consider when trying to explain the differences 
you see. Remember, I do not know the data set, I’m just guessing.

1. No guess here: 3.9.15 belongs to those development versions where more 
misassemblies happened than in later versions, as people had given me a lot of 
tough data to optimise MIRA for.
2. The “default” settings changed all along the 3.9.x development as I 
adjusted MIRA to more current data sets. This means that some heuristics 
are now (much) better adapted to “short” reads in the 100+ bp range and 
will utterly fail for smaller reads. I did not bother to investigate the “why”: 
100bp Illumina reads have been around since 2010 or so, and why should I spend 
time optimising for data sets no one has been generating for 3+ years or so?

The failing heuristics in particular probably lead to the very long assembly 
times you’re seeing: MIRA is spending way more time in Smith-Waterman alignments 
than I’d expect. That means too many hits from the SKIM phase are not 
pruned out, which in turn leads to sub-optimal overlap graphs and that leads to 
… well, in the end, worse assemblies with shorter reads when you’re using the 
4.x series of MIRA.

Maybe I should add another “Nag and Warn” flag which stops the assembly if it 
detects a readgroup with reads distinctly smaller than, say, 80bp.
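
To make the idea concrete, here is a rough Python sketch of such a pre-flight 
check. It is not MIRA code and does not use any MIRA option: the names 
fastq_read_lengths, check_readgroup and ShortReadsError are made up for 
illustration, the 80bp threshold is just the "say, 80bp" from above, and 
reading "distinctly smaller" as "median read length below the threshold" is 
an assumption.

```python
import statistics
from typing import Iterable, Iterator, TextIO

# Threshold taken from the "say, 80bp" suggestion; the constant name is hypothetical.
MIN_READ_LENGTH = 80


class ShortReadsError(RuntimeError):
    """Raised when a readgroup looks too short for the current heuristics."""


def fastq_read_lengths(handle: TextIO) -> Iterator[int]:
    """Yield the sequence length of each record in a plain FASTQ stream.

    Assumes the common layout of exactly four lines per record
    (header, sequence, '+', qualities), without wrapped sequence lines.
    """
    for line_no, line in enumerate(handle):
        if line_no % 4 == 1:  # second line of each record is the sequence
            yield len(line.strip())


def check_readgroup(name: str, lengths: Iterable[int],
                    min_length: int = MIN_READ_LENGTH) -> float:
    """Return the median read length, or stop if it is below min_length."""
    lengths = list(lengths)
    if not lengths:
        raise ValueError(f"readgroup {name!r} contains no reads")
    median = statistics.median(lengths)
    if median < min_length:
        raise ShortReadsError(
            f"readgroup {name!r}: median read length {median} bp "
            f"is below {min_length} bp"
        )
    return median
```

It could be run on each input file before starting an assembly, for example 
with check_readgroup("illumina_pe", fastq_read_lengths(open("reads.fastq"))), 
where the readgroup name and file name are placeholders. A set of 54bp reads 
would raise ShortReadsError instead of going into a very long assembly.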

B.


