[mira_talk] Re: not=4 does not appear to be working on my Mac OS; system freezes after long read name lengths

From: John DeFilippo <defilippo.john@xxxxxxxxx>
To: mira_talk@xxxxxxxxxxxxx
Date: Sun, 31 Aug 2014 15:01:40 -0400

Hi Rick,

>  run miramem?  

I had no idea of chromosome sizes, but:
RAM estimates:
reads+contigs (unavoidable): 686.7 GiB
large tables (tunable): 3.8 GiB
---------
total (peak): 690.5 GiB
add if using -CL:pvlc=yes : 306.4 GiB

> That should give you a *rough* idea of how much memory you will need, and 
> thus whether you should even attempt to use your HPCC resource.

I think one of the systems they have is a 512-core SGI box (I don’t know how 
many CPUs) with 4 TB of RAM. 
They also have ICC Intel and GCC AMD boxes, but I don’t know their specs. 
They have many bioinformatics packages, not MIRA, but I imagine I could get 
them to install it.
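
As a rough sanity check, the miramem figures above can be compared against that 
4 TB box. This is only a sketch: it assumes "4 TB" means decimal terabytes, that 
a single job could use all of that RAM, and that the miramem estimate (which is 
itself only rough) holds.

```python
# Figures copied from the miramem estimate quoted above, in GiB.
READS_AND_CONTIGS_GIB = 686.7
LARGE_TABLES_GIB = 3.8
PVLC_EXTRA_GIB = 306.4

# Assumption: the "4 TB" of the SGI box is decimal (4 * 10**12 bytes).
MACHINE_BYTES = 4 * 10**12
GIB = 2**30


def peak_gib(with_pvlc: bool = False) -> float:
    """Peak RAM estimate in GiB, optionally adding the -CL:pvlc=yes extra."""
    total = READS_AND_CONTIGS_GIB + LARGE_TABLES_GIB
    if with_pvlc:
        total += PVLC_EXTRA_GIB
    return total


machine_gib = MACHINE_BYTES / GIB

for label, need in (("peak", peak_gib()), ("peak + pvlc", peak_gib(True))):
    verdict = "fits" if need < machine_gib else "does not fit"
    print(f"{label}: {need:.1f} GiB of {machine_gib:.0f} GiB available ({verdict})")
```

Under those assumptions both cases fit on paper, but a shared cluster may cap 
the memory available to one job well below the machine total.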

> it is unclear to me if your sizes refer to the number of bases or the actual 
> file size.  In other words when you say, “Ion Torrent PGM (FASTQ ~ 3 GB), 
> Proton (FASTQ ~ 9 GB), and PacBio (FASTQ ~ 78 MB) reads” does the 3 GB mean 3 
> billion bases or 3 billion bytes in the file?

These are the #bytes in the FASTQ files. 

> if the latter, is the file zipped (compressed) or not? 

These are uncompressed file sizes.

> I am trying to figure out if you have enough depth of coverage to assemble a 
> 0.8 Gbase genome.

The PGM run was about 6 million reads with a mean length of about 230 bp, and 
the Proton run was about 34 million reads with a mean length of 118 bp, so only 
roughly 1.7x and 5x of calculated coverage for a 0.8 Gbase genome (I don’t know 
how many of these reads might be duplicates). 
PacBio was about 134,000 reads with about a 6.3 kb read length. 
I realize this is not very deep coverage at all for a de novo assembly, but we 
were hoping that using the Ion Torrent short reads with the PacBio long reads 
would give us something. We’re not looking for a high quality draft assembly.
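
The coverage arithmetic can be checked with a short sketch. The read counts and 
mean lengths are the approximate figures given above, and the genome size is 
assumed to be 0.8 Gbase; duplicates are not accounted for. With these inputs 
the PGM run works out to about 1.7x, the Proton run to about 5x, and PacBio to 
about 1x.

```python
# Assumed genome size: 0.8 Gbase, as discussed in this thread.
GENOME_BP = 800_000_000

# Approximate (read count, mean read length in bp) per run, from the text above.
RUNS = {
    "PGM": (6_000_000, 230),
    "Proton": (34_000_000, 118),
    "PacBio": (134_000, 6_300),
}


def coverage(n_reads: int, mean_len_bp: float, genome_bp: int = GENOME_BP) -> float:
    """Nominal depth of coverage: total sequenced bases / genome size."""
    return n_reads * mean_len_bp / genome_bp


total = 0.0
for name, (n_reads, mean_len) in RUNS.items():
    depth = coverage(n_reads, mean_len)
    total += depth
    print(f"{name}: {depth:.2f}x")
print(f"combined: {total:.2f}x")
```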

Thanks.

John

On Aug 31, 2014, at 1:35 PM, Rick Westerman <westerman@xxxxxxxxxx> wrote:

> Did you read chapter “3.14.1.  Estimating needed memory for an assembly 
> project” and then run miramem?  That should give you a *rough* idea of how 
> much memory you will need, and thus whether you should even attempt to use 
> your HPCC resource.
> 
> Also it is unclear to me if your sizes refer to the number of bases or the 
> actual file size.  In other words when you say, “Ion Torrent PGM (FASTQ ~ 3 
> GB), Proton (FASTQ ~ 9 GB), and PacBio (FASTQ ~ 78 MB) reads” does the 3 GB 
> mean 3 billion bases or 3 billion bytes in the file?  And if the latter, is 
> the file zipped (compressed) or not?  I am trying to figure out if you have 
> enough depth of coverage to assemble a 0.8 Gbase genome.
> 
> --
> Rick Westerman
> westerman@xxxxxxxxxx
> 
> 
> 
> 
> On Aug 31, 2014, at 1:21 PM, John DeFilippo <defilippo.john@xxxxxxxxx> wrote:
> 
>> Hi Bastien,
>> 
>>> Huh … 800? 8-0-0?
>> 
>> yup, a sea urchin, about 1/4 the human genome
>> 
>>> I’m not sure whether you should try to assemble such a large genome with 
>>> MIRA.
>> 
>> A bioinformatician at Ion Torrent who was familiar with our PGM and Proton 
>> sequencing results had suggested either MIRA or Newbler as Ion 
>> Torrent-friendly commercial assembly tools. Since I’m attempting a hybrid 
>> de novo assembly using long PacBio reads to supplement the short Ion Torrent 
>> reads, some research I did indicated MIRA was a good candidate for such an 
>> assembly. I hoped the size of the genome would be more of a time-to-run 
>> issue, not a make-or-break issue for the assembler.
>> 
>>> I know I wouldn’t.
>> 
>> Keeping in mind that I’m a biologist, not a bioinformatician or computer 
>> scientist, whose sole bioinformatics experience is limited to running 
>> command line BLAST, but who doesn’t mind devoting the time to teach myself 
>> new skills, what would you recommend? (BTW, I am the entire 'bioinformatics 
>> department' in our tiny underfunded university lab).
>> 
>>> You’d probably need a couple of dozen GiB (if not in the hundreds) to 
>>> assemble such a genome with MIRA. 
>> 
>> I do have access to a group HPCC that our university is part of. I’ve been 
>> working on my Mac because being such a newbie at all of this I like to work 
>> at home, as it takes me all day to figure out how to do things, and they 
>> don’t like to hand out VPNs to access it from home. But I can access it from 
>> our lab. So on a high performance computing cluster, is MIRA a viable choice 
>> for doing the kind of large-genome hybrid de novo assembly I’m attempting?
>> 
>> Thanks.
>> 
>> JD
>> 
>> On Aug 31, 2014, at 2:52 AM, Bastien Chevreux <bach@xxxxxxxxxxxx> wrote:
>> 
>>> On 31 Aug 2014, at 4:56 , John DeFilippo <defilippo.john@xxxxxxxxx> wrote:
>>>> This is my first time using MIRA, and my first attempt at an assembly.
>>>> It’s an ~ 800 Mb genome, and I’m attempting a de novo assembly using Ion 
>>>> Torrent PGM (FASTQ ~ 3 GB), Proton (FASTQ ~ 9 GB), and PacBio (FASTQ ~ 78 
>>>> MB) reads.
>>> 
>>> Huh … 800? 8-0-0? I’m not sure whether you should try to assemble such a 
>>> large genome with MIRA. I know I wouldn’t.
>>> 
>>>> 1. parameter set to not=4, but CPU usage shows only using 1 thread
>>> 
>>> Not all parts of MIRA are multithreaded: for some it is not worth it, and 
>>> others cannot be multithreaded.
>>> 
>>>> 2. After about 10-20 minutes of CPU time my system freezes and I have to 
>>>> reboot.
>>> 
>>> I suspect a RAM problem coupled with an OS X memory management weirdness. 
>>> You’d probably need a couple of dozen GiB (if not in the hundreds) to 
>>> assemble such a genome with MIRA. There’s no way your Mac has that. 
>>> Normally the OS should, at some point, simply return a memory allocation 
>>> failure and that would be the end of the story … I have no idea why it 
>>> decides to freeze instead.
>>> 
>>> B.
>>> 
>>> 
>>> 
>>> --
>>> You have received this mail because you are subscribed to the mira_talk 
>>> mailing list. For information on how to subscribe or unsubscribe, please 
>>> visit http://www.chevreux.org/mira_mailinglists.html

Follow-Ups:
- [mira_talk] Re: not=4 does not appear to be working on my Mac OS; system freezes after long read name lengths
  - From: Chris Hoefler

References:
- [mira_talk] not=4 does not appear to be working on my Mac OS; system freezes after long read lengths
  - From: John DeFilippo
- [mira_talk] Re: not=4 does not appear to be working on my Mac OS; system freezes after long read lengths
  - From: Bastien Chevreux
- [mira_talk] Re: not=4 does not appear to be working on my Mac OS; system freezes after long read name lengths
  - From: John DeFilippo
- [mira_talk] Re: not=4 does not appear to be working on my Mac OS; system freezes after long read name lengths
  - From: Rick Westerman
