[mira_talk] Re: Fwd: Asking about hybrid assembly

  • From: Bastien Chevreux <bach@xxxxxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Sat, 25 Jul 2009 17:06:52 +0200

On Friday, 24 July 2009, Rafał Woycicki wrote:
> I am doing hybrid assembly using MIRA 2.9.46x1 on the plant genome (~400
> Mbp) with 8x unpaired 454 Titanium, 4x paired 454 Titanium (3kbp) and BAC
> ends.
> [...]
> It was fed ~14 million Titanium reads and 65,000 Sanger reads.

Hello Rafal,

400Mb and 14 million reads ... you're afraid of nothing.

Are you aware that, in terms of read numbers, this represents ~40% of the data 
the human genome project or Celera worked with (as far as I recall, each had 
30 to 36 million Sanger reads)? And that they had whole data centers at their 
disposal to crunch that data?
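
A quick sanity check of that ratio, as a rough sketch only: the read counts are the approximate figures quoted in this thread, and the 30 to 36 million range is recalled from memory rather than looked up.

```python
# Rough comparison of read counts; all inputs are approximate figures
# from this thread, not verified project statistics.
titanium_reads = 14_000_000        # "~14 million Titanium reads"
sanger_reads = 65_000              # "65000 Sanger reads"
reference_reads_low = 30_000_000   # recalled lower bound for HGP / Celera
reference_reads_high = 36_000_000  # recalled upper bound for HGP / Celera

total_reads = titanium_reads + sanger_reads

# Dividing by the larger reference gives the smaller fraction, and vice versa.
fraction_low = total_reads / reference_reads_high
fraction_high = total_reads / reference_reads_low

print(f"Read count is {fraction_low:.0%} to {fraction_high:.0%} of the reference projects")
```

Note that this compares read counts only; Sanger reads are considerably longer than 454 reads, so the ratio in sequenced bases would be lower.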

> The program has now been running for 1 week, using at most 150 GB of RAM and
> (sometimes) 10 cores on IA64.
> I suppose that everything is going right because it is putting new files in
> the log directory, but my question is: do you know how much longer it could
> take to finish?

As you did not say which parameters you started MIRA with, I can only guess. 
But even in the most favourable circumstances you're in for weeks ... as in, 
on the order of at least 6 to 8.

You might want to test version 2.9.48, which is due to come out this weekend. 
It contains improvements specifically for very large data sets (those on the 
order of millions of reads) and has brought down assembly times by a factor of 
two for me.

> Thank you for any thoughts.

1) Should the GC content of your organism be relatively high and your Titanium 
data have been generated without the new kits from Roche, then trouble might 
be heading your way. Roche/454 has been pretty quiet on the subject, but their 
Titanium chemistry had pretty big problems with secondary structure formation. 
I've had a project with a simple bacterium which yielded 800 contigs from 
Titanium data because of this. Roche has been shipping new kits with special 
chemistry since the beginning of July, and the first rumours I have heard are 
that it works now. You might want to check with your sequence provider.

2) I guess your Titanium reads will be ~380 bases long on average, which brings 
the coverage of a 400Mb genome to ~13-14x. This is too low for 454 data and 
your genome will be quite fragmented.
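
The coverage estimate in point 2 is just total sequenced bases divided by genome size. A back-of-the-envelope sketch, assuming the guessed 380-base average (not a measured value) and a hypothetical helper name `estimated_coverage`:

```python
def estimated_coverage(num_reads, mean_read_length, genome_size):
    """Average sequencing depth: total sequenced bases / genome size."""
    return num_reads * mean_read_length / genome_size

titanium_reads = 14_000_000   # approximate read count from this thread
mean_read_length = 380        # guessed average Titanium read length (bases)
genome_size = 400_000_000     # ~400 Mbp plant genome

coverage = estimated_coverage(titanium_reads, mean_read_length, genome_size)
print(f"Estimated 454 coverage: ~{coverage:.1f}x")  # prints ~13.3x
```

Quality clipping and the real read-length distribution will shift this number, so it is only an upper-level estimate.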

3) If you're trying out 2.9.48, I'd be interested in the log of the current 
run and the log of the 2.9.48 run to see what the effects of some changes are. 
I can test this large number of reads only with Solexa data at the moment.


Regards,
  Bastien

