[mira_talk] Re: 500Mb assembly

  • From: Bastien Chevreux <bach@xxxxxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Thu, 10 Dec 2009 19:46:58 +0100

On Thursday 10 December 2009 Filip Van Nieuwerburgh wrote:
> Thanks for your encouraging insights ;-(

Sorry, I've acquired a reputation for having changed from "diplomatic" to "a 
bit direct" :-)

But you are aware that the Human Genome Project employed perhaps hundreds of 
people and had 35 million reads (Sanger, a bit longer but still) to assemble 
the human genome, right?

Celera also had around 35 million reads, fewer people, but a huge server farm. 
And they made almost nightly downloads of the public HGP data to perform 
comparisons and reconciliations (some called this 'cheating', but hey, the data 
*was* public after all).

As for the workload: last I read, the Beijing Center has 7 
bioinformaticians per Illumina GA to get the analysis work done ... and they 
have at least 30(!) of these babies. Source:
http://www.genomeweb.com/informatics/bioinformatics-job-market-tug-war-heavy-demand-data-analysis-vs-tightening-budge

So, excuse me if I'm being blunt, but ... I wouldn't attempt this assembly all 
by myself, and certainly not in a month or two :-)

> I am curious: If MIRA could
> run on multiple processors (I also have access to a 128-core system),
> would it be able to manage this project?

Yes and no. Let me start with answering this question:

> I then of course have a second
> question: Are there any concrete plans to develop MIRA so that it can be
> run on multiple processors?

MIRA already uses multiple processors in part (in the SKIM step, the first 
all-vs-all comparison). I've had plans to implement multi-threading in the 
Smith-Waterman part as well for quite some time now, but never really got 
around to it due to lack of time (if anyone's willing to implement, I'll coach :-)
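
To illustrate why the Smith-Waterman part lends itself to this: each candidate 
read pair can be aligned independently of all others, so the work splits 
cleanly across processors. Below is a minimal Python sketch of that idea. It 
is not MIRA code (MIRA is written in C++), and the names sw_score and 
score_pairs as well as the scoring values are assumptions made up for the 
example.

```python
from concurrent.futures import ProcessPoolExecutor


def sw_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score (Smith-Waterman, linear gap penalty).

    The scoring values are illustrative assumptions, not MIRA's parameters.
    """
    prev = [0] * (len(b) + 1)  # previous row of the DP matrix
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0]  # column 0 is always 0 for local alignment
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            cur.append(score)
            best = max(best, score)
        prev = cur
    return best


def _score_pair(pair):
    # Module-level helper so it can be pickled and sent to worker processes.
    return sw_score(*pair)


def score_pairs(pairs, workers=2):
    """Score independent read pairs in parallel (hypothetical helper).

    Every pair is aligned on its own; no state is shared between workers.
    """
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(_score_pair, pairs, chunksize=64))
```

When run as a script, score_pairs should be called under an 
`if __name__ == "__main__":` guard so that worker processes can start cleanly 
on all platforms.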

But that's not the biggest problem. The real bottleneck comes afterwards, 
during contig pathfinding in the overlap graph and contig building. This 
cannot easily be parallelised except by taking repeats out of the assembly 
process. And keeping good track of repeats is actually what makes MIRA pretty 
competitive against other assemblers, so at the moment it's a no-go for me.

Last but not least: memory. MIRA keeps tons of information in memory to get 
things assembled right, but this has been killing me ever since Solexas came 
on the market. I still have ideas on how to bring down memory requirements 
further, but this takes time to implement. At the moment, you'd need at least 
~250 GB RAM to even think of running MIRA with 100 million reads.
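
As a back-of-the-envelope check, ~250 GB for 100 million reads works out to 
roughly 2.5 KB of memory per read. The small Python sketch below does that 
arithmetic. The function names are made up for the example, and scaling the 
figure linearly to other read counts is an assumption of the sketch, not 
something MIRA guarantees (the number above is a lower bound).

```python
GB = 10**9  # decimal gigabytes, an assumption for this estimate


def bytes_per_read(total_ram_gb, n_reads):
    """Average memory per read implied by a total RAM figure."""
    return total_ram_gb * GB / n_reads


def estimated_ram_gb(n_reads, per_read_bytes):
    """Rough RAM estimate, ASSUMING memory grows linearly with read count."""
    return n_reads * per_read_bytes / GB


# Figures from the text above: ~250 GB for 100 million reads.
per_read = bytes_per_read(250, 100_000_000)
print(f"{per_read:.0f} bytes per read")
```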

Regards,
  Bastien
