[mira_talk] Re: not=4 does not appear to be working on my Mac OS; system freezes after long read name lengths

From: John DeFilippo <defilippo.john@xxxxxxxxx>
To: mira_talk@xxxxxxxxxxxxx
Date: Sun, 31 Aug 2014 16:14:24 -0400

Hi Chris,

Thank you for your thorough and thoughtful response. 

> Is it something you can do with a reference-guided assembly using only short 
> reads,


We actually had our IonTorrent Proton reads mapped to the urchin genome from 
the paper you referenced, but ours is a much more ancient species, and we 
only got 16% alignment. Hence the plan to go de novo by adding long reads.

> The problem is that your short read coverage might be adequate, but your 
> PacBio coverage is definitely not.


My understanding is that one PacBio library is good for 6-8 flow cells, and 
we’ve done 3, so we could at least double our PacBio data using the same 
library, if that starts to take us closer to adequate.
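
A back-of-the-envelope check of what doubling would buy, using only the file 
sizes and genome size quoted further down in this thread. This is a rough 
sketch, not a measurement: it assumes the FASTQ sizes are uncompressed and that 
about half of each file's bytes are base calls (the rest being quality strings 
and headers), so it gives an upper-bound estimate. The function name and the 
seq_fraction parameter are made up for illustration.

```python
def estimated_coverage(fastq_bytes, genome_bases, seq_fraction=0.5):
    """Rough upper bound on sequencing depth from uncompressed FASTQ size.

    seq_fraction is an ASSUMED share of the file that is sequence; quality
    strings take about as many bytes as the bases, and headers take the rest,
    so the true fraction is somewhat below 0.5.
    """
    return fastq_bytes * seq_fraction / genome_bases


MB = 1e6
GB = 1e9
GENOME_BASES = 800e6  # ~800 Mb genome, as stated in the thread

# PacBio: ~78 MB of FASTQ from 3 flow cells
pacbio_now = estimated_coverage(78 * MB, GENOME_BASES)
# Same library run on 3 more flow cells, assuming similar yield per cell
pacbio_doubled = estimated_coverage(2 * 78 * MB, GENOME_BASES)
# Ion Torrent PGM (~3 GB) plus Proton (~9 GB)
ion_torrent = estimated_coverage((3 + 9) * GB, GENOME_BASES)

print(f"PacBio now:     {pacbio_now:.2f}X")
print(f"PacBio doubled: {pacbio_doubled:.2f}X")
print(f"Ion Torrent:    {ion_torrent:.2f}X")
```

Under those assumptions the PacBio data comes out at roughly 0.05X now and 
roughly 0.1X after doubling, so still well under 1X, while the Ion Torrent 
reads land in the single-digit X range.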

> are you prepared for the task?

Ignorance is bliss, and I don’t know enough to know what I can’t do. This is 
all self-taught, and I’m game for trying.

> Do you have other bioinformatics resources elsewhere at your university to 
> turn to?

Unfortunately the internet is my only other resource.

> What do you want out of the data?

We’re not looking for a high-quality draft assembly, more of a starting point. 
We just feel like we should be able to get something from what we have, even 
if it’s not the best quality.

John


On Aug 31, 2014, at 3:23 PM, Chris Hoefler <hoeflerb@xxxxxxxxx> wrote:

> I have never worked with a genome of this size or complexity, so take 
> whatever I say with the requisite salt, but just a few comments in no 
> particular order.
> 
> 1) Assembling a genome of this size is hard. It requires a fair amount of 
> time and expertise to do correctly. Just as a point of reference, have a look 
> at the number of authors on this (old but not ancient) paper.
> http://www.sciencemag.org/content/314/5801/941
> So as a beginning bioinformatician, this will likely be a quite difficult 
> task.
> 
> 2) As per Rick's comment, I don't think your data set is really up to the 
> task. A reasonable approach to a de novo assembly is to use long reads (aka 
> PacBio) to put together initial large contigs, and then polish things off by 
> mapping short reads over them. The problem is that your short read coverage 
> might be adequate, but your PacBio coverage is definitely not. Compressed or 
> not, your PacBio data represents less than 1X coverage of the genome. And 
> these are probably not error-corrected, so you have that problem as well. In 
> addition, the majority of your short reads are likely <200 bp with some 
> longer 400 bp reads. Without any pairing info (you didn't say whether you had 
> any), the ability to resolve repeats will be severely limited.
> 
> 3) So what are you left with? Well you can try a short read assembly using 
> something like Ray or Velvet that can handle the large genome size. But 
> ploidy and repeats will be a significant problem. Mira handles those two 
> things quite well, but the memory requirement will be challenging. Once you 
> have some short-read contigs, you can try your luck scaffolding with the PacBio 
> reads, but I wouldn't expect a great result from that. In the end you will be 
> left with a highly fragmented genome with mostly unresolved repeats. This may 
> be good enough, but it depends on what you are planning to use the data for.
> 
> 4) Alternatively, you can try to get more data. Other people on this list can 
> tell you about BAC libraries and such. Personally, I think PacBio is rapidly 
> becoming the future, especially with the amount of work going into using it 
> for large genome assembly. But, this is still largely new territory for 
> PacBio, and the data and compute requirements are tremendous. Last year, 
> PacBio published a de novo human genome assembly using just PacBio data. The 
> results are quite good, but they ended up using 405,000 CPU hours on the 
> Google Compute Cloud to do the error-correction and assembly. And this was a 
> haploid assembly at that. There is a lot of new and interesting work on 
> improving performance and handling ploidy, but this is really at the cutting 
> edge right now and I would give it a year or so before it really becomes 
> mainstream.
> 
> So what to do? Well, start with a few questions. What do you want out of the 
> data? Is it something you can do with a reference-guided assembly using only 
> short reads, or absent that possibility a highly fragmented de novo assembly 
> of dubious quality using only short reads? If so, make do with what you have, 
> work on your cluster, and you can try Mira, but it may give you some serious 
> problems. If not, what is realistic in terms of getting more data? And are 
> you prepared for the task? Do you have other bioinformatics resources 
> elsewhere at your university to turn to?
> 
> 
> 
>> On Aug 31, 2014, at 12:21 PM, John DeFilippo <defilippo.john@xxxxxxxxx> 
>> wrote:
>> 
>> Hi Bastien,
>> 
>>> Huh … 800? 8-0-0?
>> 
>> yup, a sea urchin, about 1/4 the human genome
>> 
>>> I’m not sure whether you should try to assemble such a large genome with 
>>> MIRA.
>> 
>> A bioinformatician at IonTorrent who was familiar with our PGM and Proton 
>> sequencing results had suggested either MIRA or Newbler as 
>> IonTorrent-friendly commercial assembly tools. Since I’m attempting a hybrid 
>> de novo assembly using long PacBio reads to supplement the short IonTorrent 
>> reads, some research I did indicated MIRA was a good candidate for such an 
>> assembly. I hoped the size of the genome would be more of a time-to-run 
>> issue, not a make or break issue for the assembler.
>> 
>>> I know I wouldn’t.
>> 
>> Keeping in mind that I’m a biologist, not a bioinformatician or computer 
>> scientist, whose sole bioinformatics experience is limited to running 
>> command line BLAST, but who doesn’t mind devoting the time to teach myself 
>> new skills, what would you recommend? (BTW, I am the entire 'bioinformatics 
>> department' in our tiny underfunded university lab).
>> 
>>> You’d probably need a couple of dozen GiB (if not in the hundreds) to 
>>> assemble such a genome with MIRA.
>> 
>> I do have access to a group HPCC that our university is part of. I’ve been 
>> working on my Mac because being such a newbie at all of this I like to work 
>> at home, as it takes me all day to figure out how to do things, and they 
>> don’t like to hand out VPNs to access it from home. But I can access it from 
>> our lab. So on a high performance computing cluster, is MIRA a viable choice 
>> for doing the kind of large genome hybrid de novo assembly I’m attempting?
>> 
>> Thanks.
>> 
>> JD
>> 
>>> On Aug 31, 2014, at 2:52 AM, Bastien Chevreux <bach@xxxxxxxxxxxx> wrote:
>>> 
>>>> On 31 Aug 2014, at 4:56 , John DeFilippo <defilippo.john@xxxxxxxxx> wrote:
>>>> This is my first time using MIRA, and my first attempt at an assembly.
>>>> It’s an ~ 800 MB genome, and I’m attempting a de novo assembly using Ion 
>>>> Torrent PGM (FASTQ ~ 3 GB), Proton (FASTQ ~ 9 GB), and PacBio (FASTQ ~ 78 
>>>> MB) reads.
>>> 
>>> Huh … 800? 8-0-0? I’m not sure whether you should try to assemble such a 
>>> large genome with MIRA. I know I wouldn’t.
>>> 
>>>> 1. parameter set to not=4, but CPU usage shows only using 1 thread
>>> 
>>> Not all parts of MIRA run multithreaded: some are not worth it, others 
>>> cannot be multithreaded.
>>> 
>>>> 2. After about 10-20 minutes of CPU time my system freezes and I have to 
>>>> reboot.
>>> 
>>> I suspect a RAM problem coupled with an OSX memory management weirdness. 
>>> You’d probably need a couple of dozen GiB (if not in the hundreds) to 
>>> assemble such a genome with MIRA. There’s no way your Mac has that. 
>>> Normally the OS should, at one point, simply return a memory allocation 
>>> failure and that would be the end of the story … I have no idea why it 
>>> decides to freeze instead.
>>> 
>>> B.
>>> 
>>> 
>>> 
>>> --
>>> You have received this mail because you are subscribed to the mira_talk 
>>> mailing list. For information on how to subscribe or unsubscribe, please 
>>> visit http://www.chevreux.org/mira_mailinglists.html

