[mira_talk] Library insert size distribution effects on MIRA 3.2.1.17

  • From: Phillip San Miguel <pmiguel@xxxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Thu, 19 May 2011 11:18:29 -0400

On 5/16/2011 1:17 PM, Bastien Chevreux wrote:
> On May 16, 2011, at 15:16, Phillip San Miguel wrote:
>> I am now using MIRA V3.2.1.17 to de novo assemble 13 million solexa reads (101 base PE reads). That is 1.3 billion bases of sequence. The genome size is about 4.5 million bases (Salmonella). So that is 200x-300x coverage--more than I intended.
>
> Do yourself a favour: go with 6m reads, that should be plenty enough.
>
>> Anyone want to predict the N50 contig length?
>
> Depends on the genome itself, how repetitive it is. With PE reads I would hope for N50 >20kb though.
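
As a quick check of the coverage figures in the exchange above, here is a minimal Python sketch. It assumes that "13 million reads" counts individual 101-base reads (which matches the stated 1.3 billion bases); the function name fold_coverage is illustrative only.

```python
def fold_coverage(num_reads, read_length, genome_size):
    """Average fold coverage: total sequenced bases divided by genome size."""
    return num_reads * read_length / genome_size


GENOME_SIZE = 4_500_000  # approximate Salmonella genome size quoted above
READ_LENGTH = 101        # bases per read

# Full data set versus the suggested subsample of 6 million reads.
for num_reads in (13_000_000, 6_000_000):
    cov = fold_coverage(num_reads, READ_LENGTH, GENOME_SIZE)
    print(f"{num_reads:>10,d} reads -> about {cov:.0f}x coverage")
```

With these inputs the full data set comes out near 292x and the 6 million read subsample near 135x.
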
We assembled two strains that appear to differ by only a handful of SNPs. One produced an N50 of ~150 kb and the other ~40 kb. (Details below.)

The main difference between the two samples was in a detail of TruSeq library construction. One isolate had size selection done using a Pippin Prep system. The other isolate used E-gels, and 2 size fractions were combined. The actual size-selection systems are probably unimportant. However, the Pippin Prep size selection produced a library with "inserts" ranging from 340-440 bp with a mode at 381 bp, whereas the E-gel library, being a combination of 2 size fractions, had a bimodal insert size distribution, with one mode at 320 bp and the other at 366 bp.

Histogram (from ELAND2 alignment of the reads against the nearly identical reference sequence):
length bin (bp)    Pippin prep    E-gel
  0-140             1%             1%
141-160             0%             1%
161-180             0%             2%
181-200             0%             2%
201-220             0%             2%
221-240             0%             3%
241-260             0%             4%
261-280             0%             5%
281-300             0%             8%
301-320             1%            22%
321-340             2%            20%
341-360            11%            10%
361-380            32%            13%
381-400            34%             5%
401-420            13%             2%
421-999             3%             1%
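
To put numbers on how much tighter one library is than the other, the following Python sketch summarises the histogram above. The percentages are copied from the table; the helper names (modal_bin, share_between) and the choice of the 341-420 bp window are illustrative assumptions, not part of ELAND2 or MIRA.

```python
# (bin_low, bin_high, pippin_pct, egel_pct), copied from the histogram above.
HISTOGRAM = [
    (0, 140, 1, 1),
    (141, 160, 0, 1),
    (161, 180, 0, 2),
    (181, 200, 0, 2),
    (201, 220, 0, 2),
    (221, 240, 0, 3),
    (241, 260, 0, 4),
    (261, 280, 0, 5),
    (281, 300, 0, 8),
    (301, 320, 1, 22),
    (321, 340, 2, 20),
    (341, 360, 11, 10),
    (361, 380, 32, 13),
    (381, 400, 34, 5),
    (401, 420, 13, 2),
    (421, 999, 3, 1),
]
PIPPIN, EGEL = 2, 3  # column indices into each HISTOGRAM row


def modal_bin(column):
    """Return (low, high) of the bin holding the largest share of pairs."""
    low, high = max(HISTOGRAM, key=lambda row: row[column])[:2]
    return low, high


def share_between(column, low, high):
    """Fraction of pairs in bins lying fully inside [low, high].

    Renormalised by the column total, because the rounded percentages
    in the table do not sum to exactly 100.
    """
    total = sum(row[column] for row in HISTOGRAM)
    inside = sum(row[column] for row in HISTOGRAM
                 if row[0] >= low and row[1] <= high)
    return inside / total


for name, col in (("Pippin prep", PIPPIN), ("E-gel", EGEL)):
    print(name, "modal bin:", modal_bin(col),
          "share in 341-420 bp:", round(share_between(col, 341, 420), 2))
```

Note that this only reports the single largest bin per library, so the second E-gel mode (in the 361-380 bin) has to be read off the table itself.
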

MIRA de novo assembly results:
For "Large" contigs:
Number of contigs:       145       371
Total consensus:     4916075   4919599
Largest contig:       311963    118228
N50 contig size:      149609     39379
N90 contig size:       41722      7578
N95 contig size:       18547      3588
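
For readers unfamiliar with the N50/N90/N95 figures, here is a minimal Python sketch of the commonly used definition: the length of the contig at which the running total of contigs, sorted longest first, reaches the given percentage of the total assembled bases. This is an assumption about the general convention; MIRA's own report may differ in details such as which contigs count as "large". The function name nx and the toy contig lengths are hypothetical.

```python
def nx(contig_lengths, percent):
    """Return the Nx contig size (e.g. percent=50 for N50).

    Contigs are taken longest first; the result is the length of the
    contig at which the cumulative length first reaches `percent`
    percent of the total assembled length.
    """
    if not contig_lengths:
        raise ValueError("no contigs given")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= total * percent:
            return length


# Toy example with made-up contig lengths (not taken from the assemblies above).
toy_contigs = [10, 8, 6, 4, 2]
print("N50:", nx(toy_contigs, 50))
print("N90:", nx(toy_contigs, 90))
```

In the toy example the total is 30, so N50 is reached at the second contig (10 + 8 = 18, at least 15) and N90 at the fourth (28, at least 27).
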


The assemblies were done on different computers, but I think the salient parameters were the same:
For the "Pippin prep" assembly:
--job=denovo,genome,accurate,solexa SOLEXA_SETTINGS -GE:tismin=200:tismax=700

For the "E-gel" assembly:
--job=genome,accurate,solexa SOLEXA_SETTINGS  -GE:tismin=200:tismax=700

--
Phillip

