[mira_talk] Re: not=4 does not appear to be working on my Mac OS; system freezes after long read name lengths

  • From: Chris Hoefler <hoeflerb@xxxxxxxxx>
  • To: "mira_talk@xxxxxxxxxxxxx" <mira_talk@xxxxxxxxxxxxx>
  • Date: Sun, 31 Aug 2014 22:32:53 -0500

> My understanding is that one PacBio library is good for 6-8 flow cells, and 
> we’ve done 3, so we could at least double our PacBio data using the same 
> library, if that starts to take us closer to adequate.
> 

As others on this list can tell you, I'm usually a PacBio advocate, but for 
what you intend I think it is going to be beyond your reach at this moment. 
This is primarily for two reasons. The first is that you can't do much with 
PacBio data without error-correction (scaffolding might be possible in some 
cases, but it is rarely the best option). There have been a few published 
methods for error-correcting PacBio data. The most robust method at this point 
is known as preassembly and is often associated with the hierarchical genome 
assembly pipeline (HGAP) that PacBio makes available in their software. The 
downside is that you need a lot more data than you otherwise would. The idea is 
that if you are aiming for 25-30X coverage for an assembly, you would need 3-4x 
that for the error-correction step (i.e. 75-120X). Just for a fun math 
exercise, a 150k ZMW cell running a Pippin size-selected 20kb library with the 
P5-C3 chemistry can give you around 500 Mb of data (albeit lower quality than 
P4-C2). So you would need 2 SMRT cells to get a little over 1X coverage of your 
genome. To get the requisite 80X needed for preassembly, you would need roughly 
130 SMRT cells of data.
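
If you want to sanity-check those numbers yourself, the arithmetic is easy to 
script. A rough sketch in Python, using only the figures quoted above:

genome_size    = 800e6   # ~800 Mb genome
yield_per_cell = 500e6   # ~500 Mb per P5-C3 SMRT cell (20kb Pippin library)
target_cov     = 80      # raw coverage aimed at for the preassembly step

print(target_cov * genome_size / yield_per_cell)  # ~128 SMRT cells needed
print(2 * yield_per_cell / genome_size)           # ~1.25X coverage from 2 cells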

That leads to the second issue, which is computational resources. The 
preassembler has received some significant algorithmic improvements since its 
first release, making it tractable for larger datasets, but the fact remains 
that it is still a compute-intensive process. It seems that for 100 Mb-ish genomes 
the compute time is manageable (Drosophila and Arabidopsis genomes have been 
reported), but yours is almost an order of magnitude larger. 

Finally, heterozygosity is a problem. All of these issues are currently being 
worked on by a number of groups. The new Dazzler assembler is attempting to 
deal with the coverage requirement and computational burden, as is MHAP. The 
Falcon assembler from PacBio is attempting to deal with heterozygosity. All of 
this is really cool, and very much beta, so I wouldn't go there just yet.


> We’re not looking for a high-quality draft assembly, more of a starting 
> point. We just feel like we should be able to get something from what we 
> have, even if it’s not the best quality.

The thing is, genome assembly is a computational jigsaw puzzle. The larger (and 
more complex) the genome, the shorter the reads, the higher the error rate, the 
lower the coverage, and the weaker the long-range information, the harder the 
puzzle is to solve, assuming it can be solved at all. It is like the CSI 
episode where a stack of 100 shredded documents is spread out on a table, 
someone takes a picture and pushes a button on the computer, and it magically 
puts them all back together. Only it isn't like that, because that would 
actually be impossible in real life.

There is a lot that can be done short of a full genome assembly; it all depends 
on what you are looking for. A method that comes up periodically on this list 
is to fish out reads (using mirabait or similar) associated with specific 
features that you are interested in (e.g. genes, metabolic pathways, ribosomal 
sequence, viral insertions, etc.) and assemble them independently for 
comparison with a reference or whatever else you are trying to do.
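
To make the baiting idea concrete, here is a minimal Python sketch of the 
principle. This is not mirabait's actual interface; the filenames and k-mer 
size are made-up examples, and a real tool would also check the 
reverse-complement strand:

K = 31  # k-mer length (an assumed value for this example)

def kmers(seq, k=K):
    # All k-mers of a sequence (empty set if the read is shorter than k).
    return {seq[i:i+k] for i in range(len(seq) - k + 1)}

def fasta(path):
    # Minimal FASTA reader yielding (name, sequence) pairs.
    name, chunks = None, []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    yield name, "".join(chunks)
                name, chunks = line[1:], []
            else:
                chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)

# Collect k-mers from the features you are fishing for.
bait = set()
for _, seq in fasta("baits.fasta"):
    bait.update(kmers(seq))

# Keep any read sharing at least one k-mer with the baits.
for name, seq in fasta("reads.fasta"):
    if not kmers(seq).isdisjoint(bait):
        print(">%s\n%s" % (name, seq))

The baited reads can then be assembled on their own and compared against 
whatever reference or annotation you have.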

The plant genome people here have been using a lot of RAD-seq to look at 
chromosome structure and recombination frequency. You can also do a lot by 
focusing on transcript data, which will lower the size requirement by quite a 
bit. And PacBio might be a good choice for that because the read length allows 
you to sequence full length transcripts, which makes identifying alternative 
splice variants much easier.

It doesn't hurt to throw your data at an assembler and see what comes out, but 
I would be formulating a backup plan while that is going on.


> 
> 
>> On Aug 31, 2014, at 3:23 PM, Chris Hoefler <hoeflerb@xxxxxxxxx> wrote:
>> 
>> I have never worked with a genome of this size or complexity, so take 
>> whatever I say with the requisite salt, but just a few comments in no 
>> particular order.
>> 
>> 1) Assembling a genome of this size is hard. It requires a fair amount of 
>> time and expertise to do correctly. Just as a point of reference, have a 
>> look at the number of authors on this (old but not ancient) paper.
>> http://www.sciencemag.org/content/314/5801/941
>> So for a beginning bioinformatician, this will likely be quite a difficult 
>> task.
>> 
>> 2) As per Rick's comment, I don't think your data set is really up to the 
>> task. A reasonable approach to a de novo assembly is to use long reads (aka 
>> PacBio) to put together initial large contigs, and then polish things off by 
>> mapping short reads over them. The problem is that your short read coverage 
>> might be adequate, but your PacBio coverage is definitely not. Compressed or 
>> not, your PacBio data represents less than 1X coverage of the genome. And 
>> these are probably not error-corrected, so you have that problem as well. In 
>> addition, the majority of your short reads are likely <200 bp with some 
>> longer 400 bp reads. Without any pairing info (you didn't say whether you 
>> had any), the ability to resolve repeats will be severely limited.
>> 
>> 3) So what are you left with? Well you can try a short read assembly using 
>> something like Ray or Velvet that can handle the large genome size. But 
>> ploidy and repeats will be a significant problem. MIRA handles those two 
>> things quite well, but the memory requirement will be challenging. Once you 
>> have some contigs, you can try your luck scaffolding with the PacBio 
>> reads, but I wouldn't expect a great result from that. In the end you will 
>> be left with a highly fragmented genome with mostly unresolved repeats. This 
>> may be good enough, but it depends on what you are planning to use the data 
>> for.
>> 
>> 4) Alternatively, you can try to get more data. Other people on this list 
>> can tell you about BAC libraries and such. Personally, I think PacBio is 
>> rapidly becoming the future, especially with the amount of work going into 
>> using it for large genome assembly. But, this is still largely new territory 
>> for PacBio, and the data and compute requirements are tremendous. Last year, 
>> PacBio published a de novo human genome assembly using just PacBio data. The 
>> results are quite good, but they ended up using 405,000 CPU hours on Google 
>> Compute Engine to do the error-correction and assembly. And this was a 
>> haploid assembly at that. There is a lot of new and interesting work on 
>> improving performance and handling ploidy, but this is really at the cutting 
>> edge right now and I would give it a year or so before it really becomes 
>> mainstream.
>> 
>> So what to do? Well, start with a few questions. What do you want out of the 
>> data? Is it something you can do with a reference-guided assembly using only 
>> short reads, or absent that possibility a highly fragmented de novo assembly 
>> of dubious quality using only short reads? If so, make do with what you 
>> have, work on your cluster, and you can try MIRA, but it may give you some 
>> serious problems. If not, what is realistic in terms of getting more data? 
>> And are you prepared for the task? Do you have other bioinformatics 
>> resources elsewhere at your university to turn to?
>> 
>> 
>> 
>>> On Aug 31, 2014, at 12:21 PM, John DeFilippo <defilippo.john@xxxxxxxxx> 
>>> wrote:
>>> 
>>> Hi Bastien,
>>> 
>>>> Huh … 800? 8-0-0?
>>> 
>>> yup, a sea urchin, about 1/4 the human genome
>>> 
>>>> I’m not sure whether you should try to assemble such a large genome with 
>>>> MIRA.
>>> 
>>> A bioinformatician at IonTorrent who was familiar with our PGM and Proton 
>>> sequencing results had suggested either MIRA or Newbler as 
>>> IonTorrent-friendly commercial assembly tools. Since I’m attempting a 
>>> hybrid de novo assembly using long PacBio reads to supplement the short 
>>> IonTorrent reads, some research I did indicated MIRA was a good candidate 
>>> for such an assembly. I hoped the size of the genome would be more of a 
>>> time-to-run issue, not a make-or-break issue for the assembler.
>>> 
>>>> I know I wouldn’t.
>>> 
>>> Keeping in mind that I’m a biologist, not a bioinformatician or computer 
>>> scientist, whose sole bioinformatics experience is limited to running 
>>> command line BLAST, but who doesn’t mind devoting the time to teach myself 
>>> new skills, what would you recommend? (BTW, I am the entire 'bioinformatics 
>>> department' in our tiny underfunded university lab).
>>> 
>>>> You’d probably need a couple of dozen GiB (if not in the hundreds) to 
>>>> assemble such a genome with MIRA.
>>> 
>>> I do have access to a group HPCC that our university is part of. I’ve been 
>>> working on my Mac because being such a newbie at all of this I like to work 
>>> at home, as it takes me all day to figure out how to do things, and they 
>>> don’t like to hand out VPNs to access it from home. But I can access it 
>>> from our lab. So on a high performance computing cluster, is MIRA a viable 
>>> choice for doing the kind of large genome hybrid de novo assembly I’m 
>>> attempting?
>>> 
>>> Thanks.
>>> 
>>> JD
>>> 
>>>>> On Aug 31, 2014, at 2:52 AM, Bastien Chevreux <bach@xxxxxxxxxxxx> wrote:
>>>>> 
>>>>> On 31 Aug 2014, at 4:56 , John DeFilippo <defilippo.john@xxxxxxxxx> wrote:
>>>>> This is my first time using MIRA, and my first attempt at an assembly.
>>>>> It’s an ~ 800 Mb genome, and I’m attempting a de novo assembly using Ion 
>>>>> Torrent PGM (FASTQ ~ 3 GB), Proton (FASTQ ~ 9 GB), and PacBio (FASTQ ~ 78 
>>>>> MB) reads.
>>>> 
>>>> Huh … 800? 8-0-0? I’m not sure whether you should try to assemble such a 
>>>> large genome with MIRA. I know I wouldn’t.
>>>> 
>>>>> 1. parameter set to not=4, but CPU usage shows only using 1 thread
>>>> 
>>>> Not all parts of MIRA run multithreaded: for some parts it is not worth 
>>>> it, others cannot be multithreaded.
>>>> 
>>>>> 2. After about 10-20 minutes of CPU time my system freezes and I have to 
>>>>> reboot.
>>>> 
>>>> I suspect a RAM problem coupled with an OSX memory management weirdness. 
>>>> You’d probably need a couple of dozen GiB (if not in the hundreds) to 
>>>> assemble such a genome with MIRA. There’s no way your Mac has that. 
>>>> Normally the OS should, at some point, simply return a memory allocation 
>>>> failure and that would be the end of the story … I have no idea why it 
>>>> decides to freeze instead.
>>>> 
>>>> B.

--
You have received this mail because you are subscribed to the mira_talk mailing 
list. For information on how to subscribe or unsubscribe, please visit 
http://www.chevreux.org/mira_mailinglists.html
