[mira_talk] Re: not=4 does not appear to be working on my Mac OS; system freezes after long read name lengths

From: Chris Hoefler <hoeflerb@xxxxxxxxx>
To: "mira_talk@xxxxxxxxxxxxx" <mira_talk@xxxxxxxxxxxxx>
Date: Sun, 31 Aug 2014 14:38:43 -0500

Great, this certainly answers some questions. Unfortunately, at these coverage 
levels, there is not much you can do with the PacBio data. I wouldn't even try. 
With the short read data, you might have enough to get some contigs, but you 
will have a lot of coverage gaps and unresolved repeats. MIRA is pretty good at 
avoiding misassemblies, but your miramem calculation is not encouraging. You 
will be pushing 1 TB of memory usage (remember, miramem is just an estimate) and 
probably a lot of CPU time as well. Other assemblers like Velvet are faster and 
use less memory, but misassemblies will be present and you will have to look 
out for them. If you have the resources to get some paired-read data, that will 
probably help a lot.
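
For reference, the coverage levels discussed in this thread can be reproduced with a back-of-the-envelope calculation: total sequenced bases divided by genome size. The snippet below is only a rough sketch. It assumes the rounded read counts and mean read lengths quoted further down in this thread and a genome size of 0.8 Gbase, and it ignores duplicates, read quality, and length distribution. The function and variable names are made up for illustration.

```python
# Rough depth-of-coverage estimate: (number of reads * mean read length) / genome size.
# Inputs are the approximate figures quoted in this thread, not measured values.

GENOME_SIZE_BP = 800_000_000  # ~0.8 Gbase, as assumed in the thread


def coverage(num_reads, mean_read_len_bp, genome_size_bp=GENOME_SIZE_BP):
    """Return expected fold coverage, ignoring duplicates and read quality."""
    return num_reads * mean_read_len_bp / genome_size_bp


# (number of reads, mean read length in bp) per data set
datasets = {
    "Ion Torrent PGM": (6_000_000, 230),
    "Ion Torrent Proton": (34_000_000, 118),
    "PacBio": (134_000, 6_300),
}

total = 0.0
for name, (num_reads, mean_len) in datasets.items():
    depth = coverage(num_reads, mean_len)
    total += depth
    print(f"{name}: ~{depth:.1f}x")

print(f"All data combined: ~{total:.1f}x")
```

With these rounded inputs, each data set comes out in the low single digits and the combined depth stays below 10x, which is why the short reads are expected to leave many coverage gaps and the PacBio reads alone are not enough to work with.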


> On Aug 31, 2014, at 2:01 PM, John DeFilippo <defilippo.john@xxxxxxxxx> wrote:
> 
> 
> 
> PGM run was about 6 million reads with mean length of about 230 bp, Proton 
> run was about 34 million reads with mean length of 118 bp, so only about 1x 
> and 5x of calculated coverage (don’t know how many of these reads might be 
> duplicates). PacBio was about 134,000 reads with about a 6.3 kb read length. 
> I realize this is not very deep coverage at all for a de novo assembly, but we 
> were hoping that using the IonTorrent short reads with the PacBio long reads 
> would give us something. We’re not looking for a high quality draft assembly.
> 
> Thanks.
> 
> John
> 
>> On Aug 31, 2014, at 1:35 PM, Rick Westerman <westerman@xxxxxxxxxx> wrote:
>> 
>> Did you read chapter “3.14.1. Estimating needed memory for an assembly 
>> project” and then run miramem?  That should give you a *rough* idea of how 
>> much memory you will need, and thus whether you should even attempt to use 
>> your HPCC resource.
>> 
>> Also, it is unclear to me whether your sizes refer to the number of bases or 
>> the actual file size.  In other words, when you say “Ion Torrent PGM (FASTQ 
>> ~ 3 GB), Proton (FASTQ ~ 9 GB), and PacBio (FASTQ ~ 78 MB) reads”, does the 
>> 3 GB mean 3 billion bases or 3 billion bytes in the file?  And if the 
>> latter, is the file zipped (compressed) or not?  I am trying to figure out 
>> whether you have enough depth of coverage to assemble a 0.8 Gbase genome.
>> 
>> --
>> Rick Westerman
>> westerman@xxxxxxxxxx
>> 
>> 
>> 
>> 
>>> On Aug 31, 2014, at 1:21 PM, John DeFilippo <defilippo.john@xxxxxxxxx> 
>>> wrote:
>>> 
>>> Hi Bastien,
>>> 
>>>> Huh … 800? 8-0-0?
>>> 
>>> yup, a sea urchin, about 1/4 the human genome
>>> 
>>>> I’m not sure whether you should try to assemble such a large genome with 
>>>> MIRA.
>>> 
>>> A bioinformatician at IonTorrent who was familiar with our PGM and Proton 
>>> sequencing results had suggested either MIRA or Newbler as 
>>> IonTorrent-friendly commercial assembly tools. Since I’m attempting a 
>>> hybrid de novo assembly using long PacBio reads to supplement the short 
>>> IonTorrent reads, some research I did indicated MIRA was a good candidate 
>>> for such an assembly. I hoped the size of the genome would be more of a 
>>> time-to-run issue, not a make-or-break issue for the assembler.
>>> 
>>>> I know I wouldn’t.
>>> 
>>> Keeping in mind that I’m a biologist, not a bioinformatician or computer 
>>> scientist, whose sole bioinformatics experience is limited to running 
>>> command line BLAST, but who doesn’t mind devoting the time to teach myself 
>>> new skills, what would you recommend? (BTW, I am the entire 'bioinformatics 
>>> department' in our tiny underfunded university lab).
>>> 
>>>> You’d probably need a couple of dozen GiB (if not in the hundreds) to 
>>>> assemble such a genome with MIRA.
>>> 
>>> I do have access to a group HPCC that our university is part of. I’ve been 
>>> working on my Mac because being such a newbie at all of this I like to work 
>>> at home, as it takes me all day to figure out how to do things, and they 
>>> don’t like to hand out VPNs to access it from home. But I can access it 
>>> from our lab. So on a high performance computing cluster, is MIRA a viable 
>>> choice for doing the kind of large-genome hybrid de novo assembly I’m 
>>> attempting?
>>> 
>>> Thanks.
>>> 
>>> JD
>>> 
>>>> On Aug 31, 2014, at 2:52 AM, Bastien Chevreux <bach@xxxxxxxxxxxx> wrote:
>>>> 
>>>>> On 31 Aug 2014, at 4:56 , John DeFilippo <defilippo.john@xxxxxxxxx> wrote:
>>>>> This is my first time using MIRA, and my first attempt at an assembly.
>>>>> It’s an ~800 Mb genome, and I’m attempting a de novo assembly using Ion 
>>>>> Torrent PGM (FASTQ ~ 3 GB), Proton (FASTQ ~ 9 GB), and PacBio (FASTQ ~ 78 
>>>>> MB) reads.
>>>> 
>>>> Huh … 800? 8-0-0? I’m not sure whether you should try to assemble such a 
>>>> large genome with MIRA. I know I wouldn’t.
>>>> 
>>>>> 1. parameter set to not=4, but CPU usage shows only using 1 thread
>>>> 
>>>> Not all parts of MIRA are multithreaded: for some it is not worth it, 
>>>> others cannot be multithreaded.
>>>> 
>>>>> 2. After about 10-20 minutes of CPU time my system freezes and I have to 
>>>>> reboot.
>>>> 
>>>> I suspect a RAM problem coupled with an OS X memory management weirdness. 
>>>> You’d probably need a couple of dozen GiB (if not in the hundreds) to 
>>>> assemble such a genome with MIRA. There’s no way your Mac has that. 
>>>> Normally the OS should, at one point, simply return a memory allocation 
>>>> failure and that would be the end of the story … I have no idea why it 
>>>> decides to freeze instead.
>>>> 
>>>> B.
>>>> 
>>>> 
>>>> 
>>>> --
>>>> You have received this mail because you are subscribed to the mira_talk 
>>>> mailing list. For information on how to subscribe or unsubscribe, please 
>>>> visit http://www.chevreux.org/mira_mailinglists.html
