[mira_talk] Mira results much worse than newbler

  • From: Lionel Guy <guy.lionel@xxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Thu, 14 Jul 2011 16:51:40 +0200

Hi Bastien and Miraistas,

I got some strange results when comparing Newbler and MIRA assemblies. I
assembled, de novo, 620000 unpaired 454 Titanium reads from a bacterium
whose genome is ~50% GC and 5.3 Mb long.

I was expecting an average coverage of ~45X and roughly 200 long contigs,
and that's more or less what I got with Newbler (2.5.3) using standard
settings (except the large-contig threshold, which I set to 500 bp):
- almost all reads aligned
- 217 contigs > 500 bp
- N50 at 80kb, largest contig 225kb
- total size about 5.3 Mb
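
For anyone who wants to check the ~45X expectation, here is a rough sketch of the arithmetic, using the base counts from the Newbler details at the end of this post. The 5.3 Mb genome size is the approximate figure given above, and reading the second numberOfBases value as the post-trimming count is an assumption on my part.

```python
# Expected average coverage = sequenced bases / genome size.
# Base counts are copied from the Newbler metrics in this post.

def expected_coverage(total_bases: int, genome_size: int) -> float:
    """Average depth if every base were placed exactly once on the genome."""
    return total_bases / genome_size

GENOME_SIZE = 5_300_000  # approximate genome length (5.3 Mb)

base_counts = {
    "raw": 258_203_857,      # numberOfBases, first value
    "trimmed": 249_060_819,  # numberOfBases, second value (assumed post-trimming)
    "aligned": 239_458_446,  # numAlignedBases
}

for label, bases in base_counts.items():
    print(f"{label:8s} {expected_coverage(bases, GENOME_SIZE):5.1f}X")
```

All three estimates land between 45X and 49X, so ~45X is the right ballpark.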

When I ran mira 3.4rc2:

mira --project=$ASS_ID --job=denovo,genome,normal,454 -OUT:ora=yes
-DI:trt=/scratch/tmp -GE:not=12 > assembly_log.txt &

The result was... not that good. In short:
- only about 67% of reads assembled
- average coverage 21 (about half of what I expected)
- 5577 contigs > 500 bp
- N50 at 2 kb, largest contig at 18 kb
- total size about 8.6 Mb (roughly 60% more than the true length)
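
The two headline ratios can be recomputed from the raw numbers in the reports at the end of this post. This is only a sketch: it takes Newbler's large-contig total as a stand-in for the true genome length, which is an assumption, and uses the first numberOfReads value as the input read count.

```python
# Figures copied from the Newbler and MIRA reports in this post.
reads_input = 614_615      # Newbler numberOfReads (first value)
reads_in_mira = 414_749    # MIRA "Num. reads assembled"
newbler_total = 5_330_283  # Newbler large-contig numberOfBases (proxy for genome size)
mira_total = 8_561_693     # MIRA "Total consensus"

def percent(part: float, whole: float) -> float:
    """Return part as a percentage of whole."""
    return 100.0 * part / whole

assembled_pct = percent(reads_in_mira, reads_input)
excess_pct = percent(mira_total - newbler_total, newbler_total)

print(f"reads assembled by MIRA: {assembled_pct:.1f}%")
print(f"consensus in excess of the Newbler total: {excess_pct:.1f}%")
```

With these inputs the assembled fraction comes out between 67% and 68%, and the MIRA consensus is a bit over 60% larger than the Newbler total.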

I've attached the length vs. coverage plot of the MIRA assembly. The
assembly log is 16 MB once bzipped, so I can't send it, but I can put it
online if necessary.

I used exactly the same starting material for both assemblers. I extracted
the sequences from the SFF file using sff_extract with the -c (or -C, can't
remember) option to hard-clip the sequences. I'm now running the same
assembly with MIRA 3.2.1. It's in pass 2, but judging from the results of
pass 1 it doesn't look very different...

Has anyone seen this before? MIRA used to perform better than Newbler,
but here I'm a bit speechless...

Thanks for any ideas or opinions!

Cheers,

Lionel


In detail:

Newbler:
numberOfReads = 614615, 614189;
numberOfBases = 258203857, 249060819;

readStatus
numAlignedReads     = 601557, 97.94%;
numAlignedBases     = 239458446, 96.14%;
inferredReadError  = 0.41%, 985238;

numberAssembled = 306880;
numberPartial   = 294629;
numberSingleton = 1675;
numberRepeat    = 169;
numberOutlier   = 4549;
numberTooShort  = 6287;

largeContigMetrics
numberOfContigs   = 217;
numberOfBases     = 5330283;

avgContigSize     = 24563;
N50ContigSize     = 80240;
largestContigSize = 224866;

Q40PlusBases      = 5328574, 99.97%;
Q39MinusBases     = 1709, 0.03%;


MIRA:
Assembly information:
=====================
Num. reads assembled: 414749
Num. singlets: 0

Coverage assessment (calculated from contigs >= 5000):
=========================================================
  Avg. total coverage: 20.20
  Avg. coverage per sequencing technology
        454:    21.00

Large contigs (makes less sense for EST assemblies):
====================================================
With    Contig size             >= 500
        AND (Total avg. Cov     >= 7

  Length assessment:
  ------------------
  Number of contigs: 5577
  Total consensus: 8561693
  Largest contig: 18136
  N50 contig size: 2088
  N90 contig size: 691
  N95 contig size: 604

  Coverage assessment:
  --------------------
  Max coverage (total): 56
  Max coverage per sequencing technology
        454:    71

  Quality assessment:
  -------------------
  Average consensus quality: 76
  Consensus bases with IUPAC: 17498 (you might want to check these)
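
For comparing the N50 values in the two reports above, here is a minimal sketch of the usual N50/N90 definition: the contig length at which the running sum over contigs, sorted longest first, reaches the given fraction of the total. The contig lengths in the example are made up for illustration, and either assembler may filter contigs (for instance by the 500 bp threshold) before computing its own figure.

```python
from typing import Iterable

def nxx(lengths: Iterable[int], fraction: float = 0.5) -> int:
    """Length of the contig at which the cumulative sum, taken over
    contigs sorted longest first, reaches `fraction` of the total bases."""
    ordered = sorted(lengths, reverse=True)
    target = fraction * sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if running >= target:
            return length
    raise ValueError("no contigs given")

# Hypothetical contig lengths, not taken from either assembly.
example = [100, 80, 50, 30, 20, 10]
print("N50:", nxx(example, 0.5))
print("N90:", nxx(example, 0.9))
```

A fragmented assembly with many short contigs drags this value down even when the total consensus is large, which is the pattern in the MIRA numbers above.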

Attachment: E112_11NrM01_length_vs_coverage.png
Description: PNG image
