[mira_talk] MIRA v. 3.9.15 vs MIRA v. 4.0.2 performance degradation issues when using Illumina 54-bp paired-end reads in hybrid assembly

From: "Bayles, Darrell" <Darrell.Bayles@xxxxxxxxxxxx>
To: "mira_talk@xxxxxxxxxxxxx" <mira_talk@xxxxxxxxxxxxx>
Date: Thu, 22 May 2014 17:08:57 +0000
I'm looking for some additional information regarding a comment Bastien made in 
a recent post 
(//www.freelists.org/post/mira_talk/Assembly-in-versions-4-05-is-worse-than-previous-versions,5).
  In that post was the following: "On another note: the assembly engine of the 
late 3.9.x and the MIRA 4 series has been changed and does not work well 
anymore with very short reads. I think the cutoff is somewhere around 60 to 70 
bp (I never bothered checking), but for reads below this cutoff things tend to 
fall apart. This might explain in part the worse N50 you are seeing. BTW, this 
is an "expect-no-fix" problem as I see no point in continuing to support this 
kind of data atm."

Does that hold true for paired-end reads where each member of the pair is 
shorter than 60 bp, but where data from a longer-read technology are being 
assembled too?  If reads shorter than 60 bp are a problem, are there any 
parameter tweaks that would improve an assembly that includes that type of 
data?  Also, would shorter-read Illumina paired-end data cause dramatically 
longer assembly times?

Here's the back story that engendered the questions.  I'm assembling a 4 Mb 
bacterial genome that has approximately a 40 % G+C content.  The input data 
consists of three libraries, two Roche FLX and one Illumina.  More 
specifically, there is one FLX titanium mate-pair library (average read length 
of 165 bp after splitting with sff_extract, average insert size 2.4 Kb), one 
FLXplus shotgun library (average read length 1300 bp), and one Illumina TruSeq 
paired-end library (mechanical shearing i.e. NOT Nextera, 54-bp reads, average 
template size 500 bp). The data being assembled is about 45X coverage FLX and 
34X coverage Illumina for about 79X total coverage.  I've used the exact same 
read input files and assembled the genome with MIRA v. 3.9.15 and MIRA v. 4.0.2 
and observed that the MIRA v. 4.0.2 assemblies are taking inordinately long to 
complete (8 h with 3.9.15 and 3 d with 4.0.2) and yielding poorer assemblies 
compared to v. 3.9.15 (see summary metrics below).  Assembling only the Roche 
FLX data portion with MIRA v. 4.0.2 results in a better assembly than is 
achieved when the Illumina data is added.  After a lot of underlying checks, 
I'm now thinking that it may simply be the shorter read size in the Illumina 
paired-end data, coupled with changes in the MIRA v. 4.0.x line, that is 
causing the difference.  An additional observation is that most of the extra 
assembly time with MIRA v. 4.0.2 seems to be due to the program spending much 
longer in the "Aligning possible forward matches" and "Aligning possible 
complement matches" phases.
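
As a rough sanity check on the coverage figures above, the number of reads 
implied by a given coverage can be estimated as coverage x genome size / read 
length. A minimal Python sketch, assuming the genome is exactly 4 Mb and every 
Illumina read is exactly 54 bp (both approximations; the function name is 
hypothetical):

```python
def expected_reads(coverage, genome_size_bp, read_len_bp):
    """Reads needed to reach `coverage` with reads of uniform length."""
    return coverage * genome_size_bp / read_len_bp

GENOME_SIZE_BP = 4_000_000   # assumption: "4 Mb" taken literally
ILLUMINA_READ_LEN = 54       # bp, from the library description
ILLUMINA_COVERAGE = 34       # X, as stated in the post
FLX_COVERAGE = 45            # X, as stated in the post

illumina_reads = expected_reads(ILLUMINA_COVERAGE, GENOME_SIZE_BP, ILLUMINA_READ_LEN)
total_coverage = FLX_COVERAGE + ILLUMINA_COVERAGE

print(f"Illumina reads implied by {ILLUMINA_COVERAGE}X: ~{illumina_reads:,.0f}")
print(f"Total coverage: {total_coverage}X")
```

Under these assumptions roughly 2.5 million of the ~3.1 million assembled 
reads would be Illumina reads, so the short reads dominate the read count even 
though they contribute less than half of the coverage.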

Assembling the genome (~8 h) with MIRA v. 3.9.15 yields the following:
---------------------------------------------------------
Num. reads assembled: 3120516
Num. singlets: 0

Coverage assessment (calculated from contigs >= 5000):
=========================================================
  Avg. total coverage: 78.25
  Avg. coverage per sequencing technology
        Sanger: 0.00
        454:    44.77
        IonTor: 0.00
        PcBioHQ:        0.00
        PcBioLQ:        0.00
        Text:   0.00
        Solexa: 33.87
        Solid:  0.00


Large contigs (makes less sense for EST assemblies):
====================================================
With    Contig size             >= 500
        AND (Total avg. Cov     >= 26
             OR Cov(san)        >= 0
             OR Cov(454)        >= 15
             OR Cov(ion)        >= 0
             OR Cov(pbh)        >= 0
             OR Cov(pbl)        >= 0
             OR Cov(txt)        >= 0
             OR Cov(sxa)        >= 11
             OR Cov(sid)        >= 0
            )

  Length assessment:
  ------------------
  Number of contigs:    183
  Total consensus:      3930971
  Largest contig:       143998
  N50 contig size:      47182
  N90 contig size:      11779
  N95 contig size:      7356

  Coverage assessment:
  --------------------
  Max coverage (total): 586
  Max coverage per sequencing technology
        Sanger: 0
        454:    1554
        IonTor: 0
        PcBioHQ:        0
        PcBioLQ:        0
        Text:   0
        Solexa: 675
        Solid:  0

  Quality assessment:
  -------------------
  Average consensus quality:                    88
  Consensus bases with IUPAC:                   76      (you might want to check these)
  Strong unresolved repeat positions (SRMc):    0       (excellent)
  Weak unresolved repeat positions (WRMc):      24      (you might want to check these)
  Sequencing Type Mismatch Unsolved (STMU):     0       (excellent)
  Contigs having only reads wo qual:            0       (excellent)
  Contigs with reads wo qual values:            0       (excellent)



Assembling the genome (~3 d) with MIRA v. 4.0.2 yields the following:
---------------------------------------------------------
Num. reads assembled: 3213968
Num. singlets: 15

Coverage assessment (calculated from contigs >= 5000 with coverage >= 19):
=========================================================
  Avg. total coverage: 79.11
  Avg. coverage per sequencing technology
        Sanger: 0.00
        454:    45.57
        IonTor: 0.00
        PcBioHQ:        0.00
        PcBioLQ:        0.00
        Text:   0.00
        Solexa: 33.84
        Solid:  0.00


Large contigs (makes less sense for EST assemblies):
====================================================
With    Contig size             >= 500
        AND (Total avg. Cov     >= 40
             OR Cov(san)        >= 0
             OR Cov(454)        >= 23
             OR Cov(ion)        >= 0
             OR Cov(pbh)        >= 0
             OR Cov(pbl)        >= 0
             OR Cov(txt)        >= 0
             OR Cov(sxa)        >= 17
             OR Cov(sid)        >= 0
            )

  Length assessment:
  ------------------
  Number of contigs:    345
  Total consensus:      4026035
  Largest contig:       91188
  N50 contig size:      31538
  N90 contig size:      5884
  N95 contig size:      2534

  Coverage assessment:
  --------------------
  Max coverage (total): 4417
  Max coverage per sequencing technology
        Sanger: 0
        454:    448
        IonTor: 0
        PcBioHQ:        0
        PcBioLQ:        0
        Text:   0
        Solexa: 3980
        Solid:  0

  Quality assessment:
  -------------------
  Average consensus quality:                    88
  Consensus bases with IUPAC:                   50      (you might want to check these)
  Strong unresolved repeat positions (SRMc):    0       (excellent)
  Weak unresolved repeat positions (WRMc):      18      (you might want to check these)
  Sequencing Type Mismatch Unsolved (STMU):     0       (excellent)
  Contigs having only reads wo qual:            0       (excellent)
  Contigs with reads wo qual values:            0       (excellent)
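
The differences between the two summaries above can be put side by side as 
relative changes. A minimal Python sketch using only the numbers reported 
above; the run times are the approximate ~8 h and ~3 d figures, with 3 d taken 
as 72 h (an assumption), and the helper name is hypothetical:

```python
# (MIRA 3.9.15 value, MIRA 4.0.2 value), copied from the two summaries
metrics = {
    "Run time (h, approx.)": (8, 72),   # assumption: ~3 d taken as 72 h
    "Number of contigs":     (183, 345),
    "Total consensus":       (3930971, 4026035),
    "Largest contig":        (143998, 91188),
    "N50 contig size":       (47182, 31538),
    "N90 contig size":       (11779, 5884),
    "N95 contig size":       (7356, 2534),
    "Max coverage (total)":  (586, 4417),
}

def pct_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0

changes = {name: pct_change(old, new) for name, (old, new) in metrics.items()}

for name, (old, new) in metrics.items():
    print(f"{name:<24}{old:>10}{new:>10}{changes[name]:>+10.1f}%")
```

With these inputs the contig count rises by roughly 89 %, the N50 drops by 
about a third, and the run time is about nine times longer.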



Any additional thoughts or suggestions are appreciated!

Regards,

Darrell




Follow-Ups:
- [mira_talk] Re: MIRA v. 3.9.15 vs MIRA v. 4.0.2 performance degradation issues when using Illumina 54-bp paired-end reads in hybrid assembly
  - From: Bastien Chevreux