On Friday 24 July 2009 Rafał Woycicki wrote:
> I am doing hybrid assembly using MIRA 2.9.46x1 on the plant genome (~400
> Mbp) with 8x unpaired 454 Titanium, 4x paired 454 Titanium (3kbp) and BAC
> ends.
> [...]
> It was feeded with ~ 14 Millions Titanium reads and 65000 Sanger reads.

Hello Rafal,

400Mb and 14 million reads ... you're afraid of nothing. Are you aware that,
in terms of read numbers, this represents ~40% of the data the Human Genome
Project or Celera worked with (afair they each had 30 to 36 million Sanger
reads)? And that they had whole data centers at their disposal to crunch that
data?

> The program is working now for 1 week using at most 150 GB of RAM and
> (sometimes) 10 cores on IA64.
> I suppose that everythink is going right cause it is putting new files in
> log directory, but my question is: Do you know how more long it could take
> it to finish?

As you did not write which parameters you started MIRA with, I can only
guess. But even in the most favourable circumstances you're in for weeks ...
as in, on the order of at least 6 to 8.

You might want to test version 2.9.48, which is due to come out this weekend.
It contains improvements specifically for very large data sets (those on the
order of millions of reads) and has brought down assembly times by a factor
of two for me.

> Thank you for any thoughts.

1) Should the GC content of your organism be relatively high and your
Titanium data have been generated without the new kits from Roche, then
trouble might be heading your way. Roche/454 has been pretty quiet on the
subject, but their Titanium had pretty big problems with secondary structure
formation. I've had a project with a simple bacterium which yielded 800
contigs in Titanium data because of this. Roche has been shipping new kits
with special chemistry since the beginning of July, and the first rumours I've
heard are that it works now. You might want to check with your sequence
provider.

2) I guess your Titanium reads will be ~380 bases long on average, which
brings the coverage of a 400Mb genome to ~13-14x. This is too low for 454
data and your genome will be quite fragmented.

3) If you're trying out 2.9.48, I'd be interested in the log of the current
run and the log of the 2.9.48 run to see what the effects of some changes
are. I can test this large number of reads only with Solexa data at the
moment.

Regards,
  Bastien

-- 
You have received this mail because you are subscribed to the mira_talk
mailing list. For information on how to subscribe or unsubscribe, please
visit http://www.chevreux.org/mira_mailinglists.html
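
The coverage estimate in point 2 can be reproduced with a quick calculation. This is only a minimal sketch: it assumes exactly 14 million Titanium reads, the guessed average read length of 380 bases, and a 400 Mb genome, and it ignores the 65000 Sanger reads, which barely change the result. The function name `estimated_coverage` is made up for this illustration and is not part of MIRA.

```python
# Back-of-the-envelope coverage: reads * mean read length / genome size.
# All inputs are the rough figures quoted in this thread, not measured values.

def estimated_coverage(num_reads: int, mean_read_len: float, genome_size: int) -> float:
    """Return the expected average sequencing depth (fold coverage)."""
    return num_reads * mean_read_len / genome_size

titanium_reads = 14_000_000   # "~14 million Titanium reads" (assumed exact here)
mean_read_len = 380           # guessed average Titanium read length in bases
genome_size = 400_000_000     # "~400 Mbp" plant genome

coverage = estimated_coverage(titanium_reads, mean_read_len, genome_size)
print(f"estimated 454 coverage: {coverage:.1f}x")  # prints "estimated 454 coverage: 13.3x"
```

A shorter real average read length, or reads lost to quality clipping, would push the figure lower still.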