On 5/16/2011 1:17 PM, Bastien Chevreux wrote:
> On May 16, 2011, at 15:16, Phillip San Miguel wrote:
>> I am now using MIRA V3.2.1.17 to de novo assemble 13 million solexa reads (101 base PE reads). That is 1.3 billion bases of sequence. The genome size is about 4.5 million bases (Salmonella). So that is 200x-300x coverage--more than I intended.
>
> Do yourself a favour: go with 6m reads, that should be plenty enough.
>
>> Anyone want to predict the N50 contig length?
>
> Depends on the genome itself, how repetitive it is. With PE reads I would hope for N50 >20kb though.

We assembled two strains that appear to differ by only a handful of SNPs. One produced an N50 of ~150Kb and the other ~40Kb. (Details below.)
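For reference, the coverage figures in the quoted exchange follow from simple arithmetic. A minimal sketch, assuming average coverage is just total sequenced bases divided by genome size (the function and constant names here are illustrative, not anything from MIRA):

```python
def coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average depth of coverage: total sequenced bases / genome size."""
    return num_reads * read_length / genome_size


# Figures taken from the thread: 101-base reads, ~4.5 Mb Salmonella genome.
GENOME_SIZE = 4_500_000
READ_LENGTH = 101

full = coverage(13_000_000, READ_LENGTH, GENOME_SIZE)    # all 13 million reads
reduced = coverage(6_000_000, READ_LENGTH, GENOME_SIZE)  # the suggested 6 million

print(f"13M reads: {full:.0f}x")
print(f" 6M reads: {reduced:.0f}x")
```

With these inputs the full data set works out to roughly 292x and the 6 million read subset to roughly 135x.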
The main difference between the two samples was in a detail of TruSeq library construction. One isolate had size selection done using a Pippin Prep system. The other isolate used E-gels, with 2 size fractions combined. The actual size-selection systems are probably unimportant. However, the Pippin Prep size selection produced a library with "inserts" ranging from 340-440 bp with a mode at 381 bp, whereas the E-gel library, built from the combination of 2 size fractions, had a bimodal insert size distribution, with one mode at 320 bp and the other at 366 bp.
Histogram (from ELAND2 alignment of the reads against the nearly identical reference sequence):
length bin (bp)     Pippin prep     E-gel
0-140                        1%        1%
141-160                      0%        1%
161-180                      0%        2%
181-200                      0%        2%
201-220                      0%        2%
221-240                      0%        3%
241-260                      0%        4%
261-280                      0%        5%
281-300                      0%        8%
301-320                      1%       22%
321-340                      2%       20%
341-360                     11%       10%
361-380                     32%       13%
381-400                     34%        5%
401-420                     13%        2%
421-999                      3%        1%

Mira de novo assembly results:

For "Large" contigs
Number of contigs           145       371
Total consensus:        4916075   4919599
Largest contig:          311963    118228
N50 contig size:         149609     39379
N90 contig size:          41722      7578
N95 contig size:          18547      3588

The assemblies were done on different computers, but I think the salient parameters were the same:
For the "Pippin prep" assembly:
--job=denovo,genome,accurate,solexa SOLEXA_SETTINGS -GE:tismin=200:tismax=700
For the "E-gel" assembly:
--job=genome,accurate,solexa SOLEXA_SETTINGS -GE:tismin=200:tismax=700

--
Phillip
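For anyone comparing the N50/N90/N95 numbers above against their own assemblies, here is a minimal sketch of how such statistics are commonly computed from a list of contig lengths. It assumes the usual definition (the length of the contig at which the running total, taken from longest to shortest, first reaches the given fraction of all assembled bases). The function name `nxx` and the toy lengths are hypothetical, and MIRA's own calculation over its "large" contigs may differ in detail.

```python
def nxx(lengths, fraction=0.5):
    """Return the contig length L such that contigs of length >= L
    together hold at least `fraction` of the total assembled bases."""
    if not lengths:
        raise ValueError("need at least one contig length")
    threshold = fraction * sum(lengths)
    running = 0
    # Walk from the longest contig down, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= threshold:
            return length


# Toy contig lengths (made up for illustration, not from the assemblies above).
contigs = [100, 80, 60, 40, 20]
n50 = nxx(contigs, 0.5)
n90 = nxx(contigs, 0.9)
print("N50:", n50)
print("N90:", n90)
```

For the toy list the total is 300 bases, so the N50 is 80 (100 + 80 reaches 150) and the N90 is 40 (100 + 80 + 60 + 40 reaches 270).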