[mira_talk] Re: megahubs ?

  • From: Bastien Chevreux <bach@xxxxxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Mon, 2 Mar 2009 13:56:37 +0100

On Monday 02 March 2009 Jan van Haarst wrote:
> [...]
> Is there a setting I can use, so I can assemble this thing ?

Hi Jan,

there is: reduce the number of SKIM hits (-SK:mhpr) to something lower, say 
100 and try again ... but this might not be enough.

I'm in the midst of a larger algorithm replacement, but if you feel 
adventurous you can try out the current head of my development branch: 
http://www.chevreux.org/tmp/mira_2.9.41x4_dev_linux-gnu_x86_64.tar.bz2

(Note: it should work as expected but I don't give any guarantee, it does have 
a few new algorithms that passed just a few tests)

As a few things have changed, I'll give you a short walkthrough on how to use 
it, as some things are still a bit bumpy.

Step 1: estimating some memory parameters. Run "miramem" like in the 
transcript shown below, but entering your "correct" values. The transcript 
below simulates 900 thousand paired-end FLX reads (which are approximately the 
same size as the old GS20) and 5 million FLX reads. Additionally, I guessed a 
50m genome (take here the avg. of Newbler and Celera) and the biggest 
chromosome/contig to be 5 megabases (take the largest value from Newbler / 
Celera):

-------------------------------------------------------------------------------
Is it a genome or transcript (EST/tag/etc.) project? (g/e/) [g] 
g                                                               
Size of genome? [4.5m] 50m                                      
50000000                                                        
Looks like a larger eukaryote, guessing largest chromosome size: 30m
Change if needed!                                                   
Size of largest chromosome? [30000000] 5m                           
5000000                                                             
Is it a denovo or mapping assembly? (d/m/) [d]                      
d                                                                   
Number of Sanger reads? [40k] 0                                     
0                                                                   
Are there 454 reads? (y/n/) [n] y                                   
y
Number of 454 GS20 reads? [0] 900k
900000
Number of 454 FLX reads? [0] 5m
5000000
Number of 454 Titanium reads? [0]
0
Are there Solexa reads? (y/n/) [n]
n


************************* Estimates *************************

The contigs will have an average coverage of ~ 25.2 (+/- 10%)

RAM estimates:
           reads+contigs (unavoidable): 22.2 GiB
                large tables (tunable): 1.1 GiB
                                        ---------
                          total (peak): 23.3 GiB

       add if using -CL:pvlc (tunable): 10.8 GiB

*************************************************************
-------------------------------------------------------------------------------

Now, mira estimated it would need 23.3 GiB with standard parameters (plus 
another 10.8 GiB if -CL:pvlc is used, which might be standard in some 
"--job=" configurations; check that for your call and turn it off if needed).

The important number is the "total (peak)" (plus -CL:pvlc if used): assuming 
your machine has 32 GiB, this would leave ~8 GiB (8192 MiB) unused when 
-CL:pvlc is off. Take half of the unused memory in MiB (4096) and add this 
number to the -SK:mchr number (which should be 1024 by default), leading to a 
parameter "-SK:mchr=5120".
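
If you want to redo that arithmetic for your own machine, here is a rough 
sketch in Python. The function name and arguments are made up for 
illustration and are not part of MIRA; it also assumes the -SK:mchr value is 
given in MiB and that you plug in the "total (peak)" number miramem printed 
for your data.

```python
def suggested_mchr(machine_ram_gib, peak_estimate_gib, default_mchr_mib=1024):
    """Rule of thumb from above: default -SK:mchr plus half the unused RAM (MiB).

    Hypothetical helper, not a MIRA tool. peak_estimate_gib is the
    "total (peak)" value from miramem (add the -CL:pvlc amount if you use it).
    """
    unused_mib = (machine_ram_gib - peak_estimate_gib) * 1024
    if unused_mib <= 0:
        raise ValueError("estimated peak already exceeds the available RAM")
    # Give half of the unused memory to the skim hit reduction.
    return default_mchr_mib + int(unused_mib // 2)


# Rounded numbers from the example: 32 GiB machine, ~8 GiB left unused.
print("-SK:mchr=%d" % suggested_mchr(32, 24))
```

With the unrounded 23.3 GiB estimate the result comes out a bit higher, which 
should be fine as long as you leave some headroom for the rest of the system.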

Step 2: call mira like this:

  mira --project=yournamehere \
       --job=yourjobdefaultshere \
       -CL:pvlc=no \
       -SK:mhpr=100:mchr=5120 \
       >log_assembly.txt

and see whether this helps.

Regards,
  Bastien


