On May 16, 2011, at 15:16, Phillip San Miguel wrote:
> I am now using MIRA V3.2.1.17 to de novo assemble 13 million solexa reads
> (101 base PE reads). That is 1.3 billion bases of sequence. The genome size
> is about 4.5 million bases (Salmonella). So that is 200x-300x coverage--more
> than I intended.

Do yourself a favour: go with 6 million reads, that should be plenty.

> Anyone want to predict the N50 contig length?

That depends on the genome itself and how repetitive it is. With PE reads I would hope for an N50 >20kb though.

> I tried MIRA V3.2.1.15 on a 70% GC bacterial genome (Deinococcus) at around
> 100x coverage with solexa PE 101 base reads. My N50 contig size was 4630
> bases. That seems short to me, but it might be a result of the 70% GC. So I
> decided to de novo assemble a 50% GC data set from the same run.

That's bad, really bad. Yours is the second report I have received that MIRA apparently has problems with high-GC Solexa data sets. The first concerns a supersecret bug at a big company, so I cannot get the data to see what's causing havoc.

Would it be possible for me to have a look at that thing? No promises, but it might help.

B.
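
For reference, the two quantities discussed in this thread can be sketched in a few lines of Python. This is only an illustration and not part of MIRA: the function names `coverage` and `n50` are made up here, and `n50` assumes the common definition (the length of the contig at which the running sum of contig lengths, taken from longest to shortest, first reaches half of the total assembly length).

```python
def coverage(num_reads, read_length, genome_size):
    """Average fold coverage: total sequenced bases divided by genome size."""
    return num_reads * read_length / genome_size


def n50(contig_lengths):
    """N50 under the common definition (an assumption, see the text above):
    walk the contigs from longest to shortest and return the length of the
    contig at which the running sum reaches half of the total length."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("n50() needs at least one contig of non-zero length")


if __name__ == "__main__":
    genome_size = 4_500_000  # approximate Salmonella genome size from the thread

    # 13 million reads of 101 bases, as in the original post
    print(f"{coverage(13_000_000, 101, genome_size):.0f}x")  # 292x

    # 6 million reads of 101 bases, as suggested in the reply
    print(f"{coverage(6_000_000, 101, genome_size):.0f}x")   # 135x

    # Toy contig lengths, invented purely to show the N50 calculation
    print(n50([10, 20, 30, 40]))  # 30
```

With the numbers from the thread, 13 million reads come to roughly 290x coverage, and 6 million reads still give about 135x.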