I have never worked with a genome of this size or complexity, so take whatever I say with the requisite salt, but here are a few comments in no particular order.

1) Assembling a genome of this size is hard. It requires a fair amount of time and expertise to do correctly. Just as a point of reference, have a look at the number of authors on this (old but not ancient) paper: http://www.sciencemag.org/content/314/5801/941 So for a beginning bioinformatician, this will likely be quite a difficult task.

2) As per Rick's comment, I don't think your data set is really up to the task. A reasonable approach to a de novo assembly is to use long reads (e.g. PacBio) to put together initial large contigs, and then polish things off by mapping short reads over them. The problem is that your short read coverage might be adequate, but your PacBio coverage is definitely not. Compressed or not, your PacBio data represents less than 1X coverage of the genome. And these reads are probably not error-corrected, so you have that problem as well. In addition, the majority of your short reads are likely <200 bp, with some longer 400 bp reads. Without any pairing info (you didn't say whether you had any), the ability to resolve repeats will be severely limited.

3) So what are you left with? Well, you can try a short read assembly using something like Ray or Velvet that can handle the large genome size. But ploidy and repeats will be a significant problem. Mira handles those two things quite well, but the memory requirement will be challenging. Once you have some short-read contigs, you can try your luck scaffolding them with the PacBio reads, but I wouldn't expect a great result from that. In the end you will be left with a highly fragmented genome with mostly unresolved repeats. This may be good enough, but it depends on what you are planning to use the data for.

4) Alternatively, you can try to get more data. Other people on this list can tell you about BAC libraries and such.
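To put rough numbers on the coverage estimate in point 2, here is a back-of-the-envelope sketch in Python. It is only an illustration: the file sizes are the ones quoted in the original post, while the function name `estimate_coverage` and the "half of an uncompressed FASTQ file is bases" ratio are my own assumptions, not anything measured on this data set.

```python
# Rough fold-coverage estimate from FASTQ file sizes.
# Sketch only: the bases-per-byte ratio below is an assumption.

GENOME_SIZE_BP = 800e6  # ~800 Mb genome, as stated in the thread

# Assumption: in an uncompressed FASTQ file, roughly half of the bytes
# are bases; the rest is quality strings, headers and separators.
BASES_PER_BYTE = 0.5


def estimate_coverage(fastq_bytes, genome_size_bp=GENOME_SIZE_BP,
                      bases_per_byte=BASES_PER_BYTE):
    """Return the approximate fold coverage for a FASTQ file of this size."""
    return fastq_bytes * bases_per_byte / genome_size_bp


# File sizes as quoted in the original post (treated as uncompressed).
datasets = {
    "Ion Torrent PGM": 3e9,
    "Ion Torrent Proton": 9e9,
    "PacBio": 78e6,
}

coverage = {name: estimate_coverage(size) for name, size in datasets.items()}

for name, cov in coverage.items():
    print(f"{name}: ~{cov:.2f}X")
```

Under these assumptions the PacBio file works out to roughly 0.05X. If the files are actually compressed, the real figures would be several times higher, but the PacBio number would still be well below 1X.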
Personally, I think PacBio is rapidly becoming the future, especially with the amount of work going into using it for large genome assembly. But this is still largely new territory for PacBio, and the data and compute requirements are tremendous. Last year, PacBio published a de novo human genome assembly using just PacBio data. The results are quite good, but they ended up using 405,000 CPU hours on the Google Compute Cloud to do the error-correction and assembly. And this was a haploid assembly at that. There is a lot of new and interesting work on improving performance and handling ploidy, but this is really at the cutting edge right now and I would give it a year or so before it really becomes mainstream.

So what to do? Well, start with a few questions. What do you want out of the data? Is it something you can do with a reference-guided assembly using only short reads, or, absent that possibility, a highly fragmented de novo assembly of dubious quality using only short reads? If so, make do with what you have, work on your cluster, and you can try Mira, but it may give you some serious problems. If not, what is realistic in terms of getting more data? And are you prepared for the task? Do you have other bioinformatics resources elsewhere at your university to turn to?

> On Aug 31, 2014, at 12:21 PM, John DeFilippo <defilippo.john@xxxxxxxxx> wrote:
>
> Hi Bastien,
>
>> Huh … 800? 8-0-0?
>
> yup, a sea urchin, about 1/4 the human genome
>
>> I’m not sure whether you should try to assemble such a large genome with
>> MIRA.
>
> A bioinformatician at IonTorrent who was familiar with our PGM and Proton
> sequencing results had suggested either MIRA or Newbler as
> IonTorrent-friendly commercial assembly tools. Since I’m attempting a hybrid
> de novo assembly using long PacBio reads to supplement the short IonTorrent
> reads, some research I did indicated MIRA was a good candidate for such an
> assembly.
> I hoped the size of the genome would be more of a time-to-run
> issue, not a make or break issue for the assembler.
>
>> I know I wouldn’t.
>
> Keeping in mind that I’m a biologist, not a bioinformatician or computer
> scientist, whose sole bioinformatics experience is limited to running command
> line BLAST, but who doesn’t mind devoting the time to teach myself new
> skills, what would you recommend? (BTW, I am the entire 'bioinformatics
> department' in our tiny underfunded university lab).
>
>> You’d probably need a couple of dozen GiB (if not in the hundreds) to
>> assemble such a genome with MIRA.
>
> I do have access to a group HPCC that our university is part of. I’ve been
> working on my Mac because being such a newbie at all of this I like to work
> at home, as it takes me all day to figure out how to do things, and they
> don’t like to hand out VPNs to access it from home. But I can access it from
> our lab. So on a high performance computing cluster, is MIRA a viable choice
> for doing the kind of large genome hybrid de novo assembly I’m attempting?
>
> Thanks.
>
> JD
>
>> On Aug 31, 2014, at 2:52 AM, Bastien Chevreux <bach@xxxxxxxxxxxx> wrote:
>>
>>> On 31 Aug 2014, at 4:56, John DeFilippo <defilippo.john@xxxxxxxxx> wrote:
>>> This is my first time using MIRA, and my first attempt at an assembly.
>>> It’s an ~ 800 MB genome, and I’m attempting a de novo assembly using Ion
>>> Torrent PGM (FASTQ ~ 3 GB), Proton (FASTQ ~ 9 GB), and PacBio (FASTQ ~ 78
>>> MB) reads.
>>
>> Huh … 800? 8-0-0? I’m not sure whether you should try to assemble such a
>> large genome with MIRA. I know I wouldn’t.
>>
>>> 1. parameter set to not=4, but CPU usage shows only using 1 thread
>>
>> Not all parts of MIRA run in multithread: some are not worth it, others
>> cannot be multithreaded.
>>
>>> 2. After about 10-20 minutes of CPU time my system freezes and I have to
>>> reboot.
>>
>> I suspect a RAM problem coupled with an OSX memory management weirdness.
>> You’d probably need a couple of dozen GiB (if not in the hundreds) to
>> assemble such a genome with MIRA. There’s no way your Mac has that. Normally
>> the OS should, at one point, simply return a memory allocation failure and
>> that would be the end of the story … I have no idea why it decides to freeze
>> instead.
>>
>> B.

--
You have received this mail because you are subscribed to the mira_talk
mailing list. For information on how to subscribe or unsubscribe, please
visit http://www.chevreux.org/mira_mailinglists.html