[mira_talk] Re: Mira results much worse than newbler

  • From: Robin Kramer <kodream@xxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Thu, 14 Jul 2011 09:11:51 -0600

You should really use sffinfo; sff_extract doesn't clean the sequences.

Sincerely yours,

Robin

On Thu, Jul 14, 2011 at 8:51 AM, Lionel Guy <guy.lionel@xxxxxxxxx> wrote:

> Hi Bastien and Miraistas,
>
> I got some strange results comparing assemblies from Newbler and MIRA. I
> assembled de novo 620000 unpaired 454 Titanium reads from a bacterium
> with ~50% GC content and a 5.3 Mb genome.
>
> I was expecting an average coverage of ~45X, roughly 200 long contigs,
> and that's more or less what I got with Newbler (2.5.3) using standard
> settings (except long contigs, which I set to 500 bp):
> - almost all reads aligned
> - 217 contigs > 500 bp
> - N50 at 80kb, largest contig 225kb
> - total size about 5.3 Mb
>
> When I ran mira 3.4rc2:
>
> mira --project=$ASS_ID --job=denovo,genome,normal,454 -OUT:ora=yes
> -DI:trt=/scratch/tmp -GE:not=12 > assembly_log.txt &
>
> Result was... not that good. In short:
> - about 67% of reads assembled (only)
> - average coverage 21 (half of what was expected)
> - 5577 contigs > 500 bp
> - N50 at 2 kb, largest contig at 18kb
> - total size about 8.6 Mb (roughly 60% more than the true length)
>
> I attach the length vs coverage plot of the Mira assembly. The assembly
> log is 16Mb once bzipped, so I can't send it, but I can put it somewhere
> on the net if necessary.
>
> I used exactly the same starting material for both assemblies. I
> extracted the sequences from the SFF file using sff_extract with the -c
> (or -C, can't remember) option to hard-clip them. I'm now running the
> same assembly with MIRA 3.2.1. It's in pass 2 now, but judging from the
> results of pass 1, it doesn't look very different...
>
> Has anyone seen this before? MIRA used to perform better than Newbler,
> but here I'm a bit speechless...
>
> Thanks for any ideas or opinions!
>
> Cheers,
>
> Lionel
>
>
> In detail:
>
> Newbler:
> numberOfReads = 614615, 614189;
> numberOfBases = 258203857, 249060819;
>
> readStatus
> numAlignedReads     = 601557, 97.94%;
> numAlignedBases     = 239458446, 96.14%;
> inferredReadError  = 0.41%, 985238;
>
> numberAssembled = 306880;
> numberPartial   = 294629;
> numberSingleton = 1675;
> numberRepeat    = 169;
> numberOutlier   = 4549;
> numberTooShort  = 6287;
>
> largeContigMetrics
> numberOfContigs   = 217;
> numberOfBases     = 5330283;
>
> avgContigSize     = 24563;
> N50ContigSize     = 80240;
> largestContigSize = 224866;
>
> Q40PlusBases      = 5328574, 99.97%;
> Q39MinusBases     = 1709, 0.03%;
>
>
> MIRA:
> Assembly information:
> =====================
> Num. reads assembled: 414749
> Num. singlets: 0
>
> Coverage assessment (calculated from contigs >= 5000):
> =========================================================
>  Avg. total coverage: 20.20
>  Avg. coverage per sequencing technology
>        454:    21.00
>
> Large contigs (makes less sense for EST assemblies):
> ====================================================
> With    Contig size             >= 500
>        AND (Total avg. Cov     >= 7
>
>  Length assessment:
>  ------------------
>  Number of contigs: 5577
>  Total consensus: 8561693
>  Largest contig: 18136
>  N50 contig size: 2088
>  N90 contig size: 691
>  N95 contig size: 604
>
>  Coverage assessment:
>  --------------------
>  Max coverage (total): 56
>  Max coverage per sequencing technology
>        454:    71
>
>  Quality assessment:
>  -------------------
>  Average consensus quality: 76
>  Consensus bases with IUPAC: 17498 (you might want to check these)
>
>
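
As a quick sanity check on the numbers quoted above, the expected coverage and the size inflation of the MIRA assembly can be recomputed from the two reports. This is only a rough sketch: the constant and function names are made up for illustration, and it assumes that the second value on Newbler's numberOfBases line is the base count after trimming.

```python
# Figures copied from the Newbler and MIRA reports quoted in this thread.
GENOME_SIZE_BP = 5_300_000             # ~5.3 Mb, as stated in the post
NEWBLER_TRIMMED_BASES = 249_060_819    # assumed: 2nd numberOfBases value = trimmed bases
NEWBLER_ALIGNED_BASES = 239_458_446    # numAlignedBases
MIRA_CONSENSUS_BP = 8_561_693          # "Total consensus" for contigs >= 500 bp


def coverage(bases: int, genome_size: int) -> float:
    """Average depth if `bases` were spread evenly over the genome."""
    return bases / genome_size


def inflation_percent(assembled: int, expected: int) -> float:
    """How much larger (in percent) the assembly is than the expected size."""
    return 100.0 * (assembled - expected) / expected


if __name__ == "__main__":
    print(f"coverage from trimmed bases: {coverage(NEWBLER_TRIMMED_BASES, GENOME_SIZE_BP):.1f}x")
    print(f"coverage from aligned bases: {coverage(NEWBLER_ALIGNED_BASES, GENOME_SIZE_BP):.1f}x")
    print(f"MIRA consensus inflation:    {inflation_percent(MIRA_CONSENSUS_BP, GENOME_SIZE_BP):.1f}%")
```

With these inputs, the aligned bases give roughly 45X, which matches the coverage Lionel expected, while the MIRA consensus comes out about 60% larger than the 5.3 Mb genome. Together with the halved coverage, that would be consistent with parts of the genome being assembled more than once, although the figures alone cannot confirm the cause.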
