[mira_talk] Re: not=4 does not appear to be working on my Mac OS; system freezes after long read name lengths

From: Chris Hoefler <hoeflerb@xxxxxxxxx>
To: "mira_talk@xxxxxxxxxxxxx" <mira_talk@xxxxxxxxxxxxx>
Date: Sun, 31 Aug 2014 14:23:19 -0500

I have never worked with a genome of this size or complexity, so take whatever 
I say with the requisite salt, but just a few comments in no particular order.

1) Assembling a genome of this size is hard. It requires a fair amount of time 
and expertise to do correctly. Just as a point of reference, have a look at the 
number of authors on this (old but not ancient) paper.
http://www.sciencemag.org/content/314/5801/941
So as a beginning bioinformatician, this will likely be a quite difficult task.

2) As per Rick's comment, I don't think your data set is really up to the task. 
A reasonable approach to a de novo assembly is to use long reads (aka PacBio) 
to put together initial large contigs, and then polish things off by mapping 
short reads over them. The problem is that your short read coverage might be 
adequate, but your PacBio coverage is definitely not. Compressed or not, your 
PacBio data represents less than 1X coverage of the genome. And these are 
probably not error-corrected, so you have that problem as well. In addition, 
the majority of your short reads are likely <200 bp with some longer 400 bp 
reads. Without any pairing info (you didn't say whether you had any), the 
ability to resolve repeats will be severely limited.

3) So what are you left with? Well you can try a short read assembly using 
something like Ray or Velvet that can handle the large genome size. But ploidy 
and repeats will be a significant problem. Mira handles those two things quite 
well, but the memory requirement will be challenging. Once you have some short 
reads, you can try your luck scaffolding with the PacBio reads, but I wouldn't 
expect a great result from that. In the end you will be left with a highly 
fragmented genome with mostly unresolved repeats. This may be good enough, but 
it depends on what you are planning to use the data for.

4) Alternatively, you can try to get more data. Other people on this list can 
tell you about BAC libraries and such. Personally, I think PacBio is rapidly 
becoming the future, especially with the amount of work going into using it for 
large genome assembly. But, this is still largely new territory for PacBio, and 
the data and compute requirements are tremendous. Last year, PacBio published a 
de novo human genome assembly using just PacBio data. The results are quite 
good, but they ended up using 405,000 CPU hours on the Google Compute Cloud to 
do the error-correction and assembly. And this was a haploid assembly at that. 
There is a lot of new and interesting work on improving performance and 
handling ploidy, but this is really at the cutting edge right now and I would 
give it a year or so before it really becomes mainstream.

So what to do? Well, start with a few questions. What do you want out of the 
data? Is it something you can do with a reference-guided assembly using only 
short reads, or absent that possibility a highly fragmented de novo assembly of 
dubious quality using only short reads? If so, make do with what you have, work 
on your cluster, and you can try Mira, but it may give you some serious 
problems. If not, what is realistic in terms of getting more data? And are you 
prepared for the task? Do you have other bioinformatics resources elsewhere at 
your university to turn to?

> On Aug 31, 2014, at 12:21 PM, John DeFilippo <defilippo.john@xxxxxxxxx> wrote:
> 
> Hi Bastien,
> 
>> Huh … 800? 8-0-0?
> 
> yup, a sea urchin, about 1/4 the human genome
> 
>> I’m not sure whether you should try to assembly such a large genome with 
>> MIRA.
> 
> A bioinformatician at IonTorrent who was familiar with our PGM and Proton 
> sequencing results had suggested either MIRA or Newbler as 
> IonTorrent-friendly commercial assembly tools. Since I’m attempting a hybrid 
> denovo assembly using long PacBio reads to supplement the short IonTorrent 
> reads, some research I did indicated MIRA was a good candidate for such an 
> assembly. I hoped the size of the genome would be more of a time-to-run 
> issue, not a make or break issue for the assembler.
> 
>> I know I wouldn’t.
> 
> Keeping in mind that I’m a biologist, not a bioinformatician or computer 
> scientist, whose sole bioinformatics experience is limited to running command 
> line BLAST, but who doesn’t mind devoting the time to teach myself new 
> skills, what would you recommend? (BTW, I am the entire 'bioinformatics 
> department' in our tiny underfunded university lab).
> 
>> You’d probably need a couple of dozen GiB (if not in the hundreds) to 
>> assemble such a genome with MIRA.
> 
> I do have access to a group HPCC that our university is part of. I’ve been 
> working on my Mac because being such a newbie at all of this I like to work 
> at home, as it takes me all day to figure out how to do things, and they 
> don’t like to hand out VPNs to access it from home. But I can access it from 
> our lab. So on a high performance computing cluster, is MIRA a viable choice 
> for doing the kind of large genome hybrid denovo assembly I’m attempting?
> 
> Thanks.
> 
> JD
> 
>> On Aug 31, 2014, at 2:52 AM, Bastien Chevreux <bach@xxxxxxxxxxxx> wrote:
>> 
>>> On 31 Aug 2014, at 4:56 , John DeFilippo <defilippo.john@xxxxxxxxx> wrote:
>>> This is my first time using MIRA, and my first attempt at an assembly.
>>> It’s an ~ 800 MB genome, and I’m attempting a denovo assembly using Ion 
>>> Torrent PGM (FASTQ ~ 3 GB), Proton (FASTQ ~ 9 GB), and PacBIo (FASTQ ~ 78 
>>> MB) reads.
>> 
>> Huh … 800? 8-0-0? I’m not sure whether you should try to assembly such a 
>> large genome with MIRA. I know I wouldn’t.
>> 
>>> 1. parameter set to not=4, but CPU usage shows only using 1 thread
>> 
>> Not all parts of MIRA run in multithread: some are not worth it, others 
>> cannot be multithreaded.
>> 
>>> 2. After about 10-20 minutes of CPU time my system freezes and I have to 
>>> reboot.
>> 
>> I suspect a RAM problem coupled with an OSX memory management weirdness. 
>> You’d probably need a couple of dozen GiB (if not in the hundreds) to 
>> assemble such a genome with MIRA. There’s no way your Mac has that. Normally 
>> the OS should, at one point, simply return a memory allocation failure and 
>> that would be the end of the story … I have no idea why it decides to freeze 
>> instead.
>> 
>> B.
>> 
>> 
>> 
>> --
>> You have received this mail because you are subscribed to the mira_talk 
>> mailing list. For information on how to subscribe or unsubscribe, please 
>> visit http://www.chevreux.org/mira_mailinglists.html
> 
> 
> -- 
> You have received this mail because you are subscribed to the mira_talk 
> mailing list. For information on how to subscribe or unsubscribe, please 
> visit http://www.chevreux.org/mira_mailinglists.html

--
You have received this mail because you are subscribed to the mira_talk mailing 
list. For information on how to subscribe or unsubscribe, please visit 
http://www.chevreux.org/mira_mailinglists.html

Follow-Ups:
- [mira_talk] Re: not=4 does not appear to be working on my Mac OS; system freezes after long read name lengths
  - From: John DeFilippo

References:
- [mira_talk] not=4 does not appear to be working on my Mac OS; system freezes after long read lengths
  - From: John DeFilippo
- [mira_talk] Re: not=4 does not appear to be working on my Mac OS; system freezes after long read lengths
  - From: Bastien Chevreux
- [mira_talk] Re: not=4 does not appear to be working on my Mac OS; system freezes after long read name lengths
  - From: John DeFilippo

[mira_talk] Re: not=4 does not appear to be working on my Mac OS; system freezes after long read name lengths

Other related posts: