Hi Bastien and Miraistas,

I got some strange results comparing Newbler and MIRA assemblies.

I assembled de novo 620000 unpaired 454 Titanium reads from a bacterium with ~50% GC and a 5.3 Mb genome. I was expecting an average coverage of ~45X and roughly 200 long contigs, and that is more or less what I got with Newbler (2.5.3) using standard settings (except the large-contig threshold, which I set to 500 bp):

- almost all reads aligned
- 217 contigs > 500 bp
- N50 at 80 kb, largest contig 225 kb
- total size about 5.3 Mb

Then I ran MIRA 3.4rc2:

    mira --project=$ASS_ID --job=denovo,genome,normal,454 -OUT:ora=yes -DI:trt=/scratch/tmp -GE:not=12 > assembly_log.txt &

The result was... not that good. In short:

- only about 67% of reads assembled
- average coverage 21 (about half of what I expected)
- 5577 contigs > 500 bp
- N50 at 2 kb, largest contig 18 kb
- total size about 8.6 Mb (roughly 60% more than the true length)

I attach the length vs coverage plot of the MIRA assembly. The assembly log is 16 MB once bzipped, so I can't send it, but I can put it somewhere on the net if necessary.

I used exactly the same input data for both assemblers. I extracted the sequences from the SFF file using sff_extract with the -c (or -C, can't remember) option to hard-clip the sequences.

I'm now running the same assembly with MIRA 3.2.1. It's in pass 2, but from the results of pass 1 it doesn't look very different...

Has anyone seen this before? MIRA used to perform better than Newbler, but here I'm a bit speechless...

Thanks for any ideas or opinions!
Cheers,
Lionel

In detail:

Newbler:

    numberOfReads = 614615, 614189;
    numberOfBases = 258203857, 249060819;

    readStatus
        numAlignedReads = 601557, 97.94%;
        numAlignedBases = 239458446, 96.14%;
        inferredReadError = 0.41%, 985238;

        numberAssembled = 306880;
        numberPartial = 294629;
        numberSingleton = 1675;
        numberRepeat = 169;
        numberOutlier = 4549;
        numberTooShort = 6287;

    largeContigMetrics
        numberOfContigs = 217;
        numberOfBases = 5330283;
        avgContigSize = 24563;
        N50ContigSize = 80240;
        largestContigSize = 224866;
        Q40PlusBases = 5328574, 99.97%;
        Q39MinusBases = 1709, 0.03%;

MIRA:

    Assembly information:
    =====================

    Num. reads assembled: 414749
    Num. singlets: 0

    Coverage assessment (calculated from contigs >= 5000):
    =========================================================
      Avg. total coverage: 20.20
      Avg. coverage per sequencing technology
        454: 21.00

    Large contigs (makes less sense for EST assemblies):
    ====================================================
    With Contig size >= 500 AND (Total avg. Cov >= 7

      Length assessment:
      ------------------
      Number of contigs: 5577
      Total consensus: 8561693
      Largest contig: 18136
      N50 contig size: 2088
      N90 contig size: 691
      N95 contig size: 604

      Coverage assessment:
      --------------------
      Max coverage (total): 56
      Max coverage per sequencing technology
        454: 71

      Quality assessment:
      -------------------
      Average consensus quality: 76
      Consensus bases with IUPAC: 17498 (you might want to check these)
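
For anyone who wants to cross-check the summary figures, here is a minimal Python sketch of the arithmetic behind them: expected coverage, fraction of reads assembled, and how much larger the MIRA consensus is than the Newbler one. The numbers are copied from the two reports above. Treating the second numberOfBases value as the bases left after trimming is an assumption, and the contig lengths passed to `n50` are made-up toy values for illustration only, not data from this assembly.

```python
# Figures copied from the Newbler and MIRA summaries in this post.
GENOME_SIZE = 5_300_000        # ~5.3 Mb, as stated in the post
NEWBLER_BASES = 249_060_819    # second numberOfBases value (assumed: after trimming)
TOTAL_READS = 614_615          # first numberOfReads value
MIRA_READS_ASSEMBLED = 414_749
NEWBLER_TOTAL = 5_330_283      # largeContigMetrics numberOfBases
MIRA_TOTAL = 8_561_693         # MIRA "Total consensus"


def n50(lengths):
    """Return the contig length at which half of the total assembly size is reached."""
    half = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length
    return 0  # empty input


expected_coverage = NEWBLER_BASES / GENOME_SIZE
fraction_assembled = MIRA_READS_ASSEMBLED / TOTAL_READS
size_ratio = MIRA_TOTAL / NEWBLER_TOTAL

print(f"expected coverage:      {expected_coverage:.1f}X")
print(f"MIRA reads assembled:   {fraction_assembled:.1%}")
print(f"MIRA / Newbler size:    {size_ratio:.2f}")

# Toy contig lengths (hypothetical), only to show how N50 is derived.
print("toy N50:", n50([100, 80, 60, 40, 20]))
```

With real data, the list given to `n50` would be the contig lengths read from each assembler's FASTA output, which makes it possible to compare both assemblies with the same length cutoff.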
Attachment: E112_11NrM01_length_vs_coverage.png (PNG image)