I'm looking for some additional information regarding a comment Bastien made in a recent post (//www.freelists.org/post/mira_talk/Assembly-in-versions-4-05-is-worse-than-previous-versions,5). In that post was the following:

"On another note: the assembly engine of the late 3.9.x and the MIRA 4 series has been changed and does not work well anymore with very short reads. I think the cutoff is somewhere around 60 to 70 bp (I never bothered checking), but for reads below this cutoff things tend to fall apart. This might explain in part the worse N50 you are seeing. BTW, this is an "expect-no-fix" problem as I see no point in continuing to support this kind of data atm."

Does that hold true for paired-end reads where each member of the pair is shorter than 60 bp, but where data from a longer-read technology is being assembled too? If reads shorter than 60 bp are a problem, are there any parameter tweaks that would improve the assembly when that type of data is included? Also, would shorter-read Illumina paired-end data cause dramatically longer assembly times?

Here's the back story that engendered the questions. I'm assembling a 4 Mb bacterial genome that has approximately a 40 % G+C content. The input data consists of three libraries, two Roche FLX and one Illumina. More specifically, there is one FLX Titanium mate-pair library (average read length of 165 bp after splitting with sff_extract, average insert size 2.4 Kb), one FLXplus shotgun library (average read length 1300 bp), and one Illumina TruSeq paired-end library (mechanical shearing, i.e. NOT Nextera; 54-bp reads; average template size 500 bp). The data being assembled is about 45X coverage FLX and 34X coverage Illumina, for about 79X total coverage.

I've used the exact same read input files and assembled the genome with MIRA v. 3.9.15 and MIRA v. 4.0.2 and observed that the MIRA v.
4.0.2 assemblies are taking inordinately long to complete (8 h with 3.9.15 and 3 d with 4.0.2) and yielding poorer assemblies compared to v. 3.9.15 (see summary metrics below). Assembling only the Roche FLX portion of the data with MIRA v. 4.0.2 results in a better assembly than is achieved when the Illumina data is added. After a lot of underlying checks, I'm now thinking that it may simply be the shorter read size in the Illumina paired-end data, coupled with changes in the MIRA v. 4.0.X line, that is causing the difference. An additional observation is that most of the additional assembly time with MIRA v. 4.0.2 seems to be due to the program spending much longer in the "Aligning possible forward matches" and "Aligning possible complement matches" phases.

Assembling the genome (~8 h) with MIRA v. 3.9.15 yields the following:

---------------------------------------------------------
Num. reads assembled: 3120516
Num. singlets: 0

Coverage assessment (calculated from contigs >= 5000):
=========================================================
  Avg. total coverage: 78.25
  Avg. coverage per sequencing technology
    Sanger:  0.00
    454:     44.77
    IonTor:  0.00
    PcBioHQ: 0.00
    PcBioLQ: 0.00
    Text:    0.00
    Solexa:  33.87
    Solid:   0.00

Large contigs (makes less sense for EST assemblies):
====================================================
With Contig size >= 500 AND (Total avg.
Cov >= 26 OR Cov(san) >= 0 OR Cov(454) >= 15 OR Cov(ion) >= 0 OR Cov(pbh) >= 0 OR Cov(pbl) >= 0 OR Cov(txt) >= 0 OR Cov(sxa) >= 11 OR Cov(sid) >= 0 )

  Length assessment:
  ------------------
  Number of contigs: 183
  Total consensus:   3930971
  Largest contig:    143998
  N50 contig size:   47182
  N90 contig size:   11779
  N95 contig size:   7356

  Coverage assessment:
  --------------------
  Max coverage (total): 586
  Max coverage per sequencing technology
    Sanger:  0
    454:     1554
    IonTor:  0
    PcBioHQ: 0
    PcBioLQ: 0
    Text:    0
    Solexa:  675
    Solid:   0

  Quality assessment:
  -------------------
  Average consensus quality: 88
  Consensus bases with IUPAC: 76 (you might want to check these)
  Strong unresolved repeat positions (SRMc): 0 (excellent)
  Weak unresolved repeat positions (WRMc): 24 (you might want to check these)
  Sequencing Type Mismatch Unsolved (STMU): 0 (excellent)
  Contigs having only reads wo qual: 0 (excellent)
  Contigs with reads wo qual values: 0 (excellent)

Assembling the genome (~3 d) with MIRA v. 4.0.2 yields the following:

---------------------------------------------------------
Num. reads assembled: 3213968
Num. singlets: 15

Coverage assessment (calculated from contigs >= 5000 with coverage >= 19):
=========================================================
  Avg. total coverage: 79.11
  Avg. coverage per sequencing technology
    Sanger:  0.00
    454:     45.57
    IonTor:  0.00
    PcBioHQ: 0.00
    PcBioLQ: 0.00
    Text:    0.00
    Solexa:  33.84
    Solid:   0.00

Large contigs (makes less sense for EST assemblies):
====================================================
With Contig size >= 500 AND (Total avg.
Cov >= 40 OR Cov(san) >= 0 OR Cov(454) >= 23 OR Cov(ion) >= 0 OR Cov(pbh) >= 0 OR Cov(pbl) >= 0 OR Cov(txt) >= 0 OR Cov(sxa) >= 17 OR Cov(sid) >= 0 )

  Length assessment:
  ------------------
  Number of contigs: 345
  Total consensus:   4026035
  Largest contig:    91188
  N50 contig size:   31538
  N90 contig size:   5884
  N95 contig size:   2534

  Coverage assessment:
  --------------------
  Max coverage (total): 4417
  Max coverage per sequencing technology
    Sanger:  0
    454:     448
    IonTor:  0
    PcBioHQ: 0
    PcBioLQ: 0
    Text:    0
    Solexa:  3980
    Solid:   0

  Quality assessment:
  -------------------
  Average consensus quality: 88
  Consensus bases with IUPAC: 50 (you might want to check these)
  Strong unresolved repeat positions (SRMc): 0 (excellent)
  Weak unresolved repeat positions (WRMc): 18 (you might want to check these)
  Sequencing Type Mismatch Unsolved (STMU): 0 (excellent)
  Contigs having only reads wo qual: 0 (excellent)
  Contigs with reads wo qual values: 0 (excellent)

Any additional thoughts or suggestions are appreciated!

Regards,
Darrell
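
P.S. For anyone who wants to check their own read set against the 60 to 70 bp range Bastien mentioned, here is a minimal Python sketch that reports what fraction of reads falls below a chosen length. It is not part of MIRA. The function names and the default cutoff of 60 are my own choices for illustration, and it assumes plain, unwrapped four-line FASTQ records.

```python
from itertools import islice


def read_lengths(fastq_lines):
    """Yield the sequence length of each record in four-line FASTQ text."""
    # The sequence is the 2nd line of every 4-line record
    # (assumption: sequences are not wrapped over several lines).
    for seq in islice(fastq_lines, 1, None, 4):
        yield len(seq.strip())


def fraction_below(lengths, cutoff=60):
    """Return the fraction of reads strictly shorter than `cutoff` bp."""
    lengths = list(lengths)
    if not lengths:
        return 0.0
    return sum(1 for n in lengths if n < cutoff) / len(lengths)
```

To use it on a file, pass an open handle, for example `fraction_below(read_lengths(open("illumina_reads.fastq")))`, where the file name is hypothetical. For a library like the 54-bp one described above, essentially every read would be expected to fall below the cutoff.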