Hi Chris, Thank you for your thorough and thoughtful response. > Is it something you can do with a reference-guided assembly using only short > reads, We actually had our IonTorrent Proton reads mapped to the urchin genome from the paper you referenced, but our’s is a much more ancient species, and we only got 16% alignment. Hence the plan to go denovo by adding long reads. > The problem is that your short read coverage might be adequate, but your > PacBio coverage is definitely not. My understanding is that one PacBio library is good for 6-8 flow cells, and we’ve done 3, so we could at least double our PacBio data using the same library, if that starts to take us closer to adequate. > are you prepared for the task? Ignorance is bliss, and I don’t know enough to know what I can’t do. This is all self-taught, and I’m game for trying. > Do you have other bioinformatics resources elsewhere at your university to > turn to? Unfortunately the internet is my only other resource. > What do you want out of the data? We’re not looking for a high-quality draft assembly, more of a starting point. We just feel like we should be able to get something from what he have, even it’s not the best quality. John On Aug 31, 2014, at 3:23 PM, Chris Hoefler <hoeflerb@xxxxxxxxx> wrote: > I have never worked with a genome of this size or complexity, so take > whatever I say with the requisite salt, but just a few comments in no > particular order. > > 1) Assembling a genome of this size is hard. It requires a fair amount of > time and expertise to do correctly. Just as a point of reference, have a look > at the number of authors on this (old but not ancient) paper. > http://www.sciencemag.org/content/314/5801/941 > So as a beginning bioinformatician, this will likely be a quite difficult > task. > > 2) As per Rick's comment, I don't think your data set is really up to the > task. A reasonable approach to a de novo assembly is to use long reads (aka > PacBio) to put together initial large contigs, and then polish things off by > mapping short reads over them. The problem is that your short read coverage > might be adequate, but your PacBio coverage is definitely not. Compressed or > not, your PacBio data represents less than 1X coverage of the genome. And > these are probably not error-corrected, so you have that problem as well. In > addition, the majority of your short reads are likely <200 bp with some > longer 400 bp reads. Without any pairing info (you didn't say whether you had > any), the ability to resolve repeats will be severely limited. > > 3) So what are you left with? Well you can try a short read assembly using > something like Ray or Velvet that can handle the large genome size. But > ploidy and repeats will be a significant problem. Mira handles those two > things quite well, but the memory requirement will be challenging. Once you > have some short reads, you can try your luck scaffolding with the PacBio > reads, but I wouldn't expect a great result from that. In the end you will be > left with a highly fragmented genome with mostly unresolved repeats. This may > be good enough, but it depends on what you are planning to use the data for. > > 4) Alternatively, you can try to get more data. Other people on this list can > tell you about BAC libraries and such. Personally, I think PacBio is rapidly > becoming the future, especially with the amount of work going into using it > for large genome assembly. But, this is still largely new territory for > PacBio, and the data and compute requirements are tremendous. Last year, > PacBio published a de novo human genome assembly using just PacBio data. The > results are quite good, but they ended up using 405,000 CPU hours on the > Google Compute Cloud to do the error-correction and assembly. And this was a > haploid assembly at that. There is a lot of new and interesting work on > improving performance and handling ploidy, but this is really at the cutting > edge right now and I would give it a year or so before it really becomes > mainstream. > > So what to do? Well, start with a few questions. What do you want out of the > data? Is it something you can do with a reference-guided assembly using only > short reads, or absent that possibility a highly fragmented de novo assembly > of dubious quality using only short reads? If so, make do with what you have, > work on your cluster, and you can try Mira, but it may give you some serious > problems. If not, what is realistic in terms of getting more data? And are > you prepared for the task? Do you have other bioinformatics resources > elsewhere at your university to turn to? > > > >> On Aug 31, 2014, at 12:21 PM, John DeFilippo <defilippo.john@xxxxxxxxx> >> wrote: >> >> Hi Bastien, >> >>> Huh … 800? 8-0-0? >> >> yup, a sea urchin, about 1/4 the human genome >> >>> I’m not sure whether you should try to assembly such a large genome with >>> MIRA. >> >> A bioinformatician at IonTorrent who was familiar with our PGM and Proton >> sequencing results had suggested either MIRA or Newbler as >> IonTorrent-friendly commercial assembly tools. Since I’m attempting a hybrid >> denovo assembly using long PacBio reads to supplement the short IonTorrent >> reads, some research I did indicated MIRA was a good candidate for such an >> assembly. I hoped the size of the genome would be more of a time-to-run >> issue, not a make or break issue for the assembler. >> >>> I know I wouldn’t. >> >> Keeping in mind that I’m a biologist, not a bioinformatician or computer >> scientist, whose sole bioinformatics experience is limited to running >> command line BLAST, but who doesn’t mind devoting the time to teach myself >> new skills, what would you recommend? (BTW, I am the entire 'bioinformatics >> department' in our tiny underfunded university lab). >> >>> You’d probably need a couple of dozen GiB (if not in the hundreds) to >>> assemble such a genome with MIRA. >> >> I do have access to a group HPCC that our university is part of. I’ve been >> working on my Mac because being such a newbie at all of this I like to work >> at home, as it takes me all day to figure out how to do things, and they >> don’t like to hand out VPNs to access it from home. But I can access it from >> our lab. So on a high performance computing cluster, is MIRA a viable choice >> for doing the kind of large genome hybrid denovo assembly I’m attempting? >> >> Thanks. >> >> JD >> >>> On Aug 31, 2014, at 2:52 AM, Bastien Chevreux <bach@xxxxxxxxxxxx> wrote: >>> >>>> On 31 Aug 2014, at 4:56 , John DeFilippo <defilippo.john@xxxxxxxxx> wrote: >>>> This is my first time using MIRA, and my first attempt at an assembly. >>>> It’s an ~ 800 MB genome, and I’m attempting a denovo assembly using Ion >>>> Torrent PGM (FASTQ ~ 3 GB), Proton (FASTQ ~ 9 GB), and PacBIo (FASTQ ~ 78 >>>> MB) reads. >>> >>> Huh … 800? 8-0-0? I’m not sure whether you should try to assembly such a >>> large genome with MIRA. I know I wouldn’t. >>> >>>> 1. parameter set to not=4, but CPU usage shows only using 1 thread >>> >>> Not all parts of MIRA run in multithread: some are not worth it, others >>> cannot be multithreaded. >>> >>>> 2. After about 10-20 minutes of CPU time my system freezes and I have to >>>> reboot. >>> >>> I suspect a RAM problem coupled with an OSX memory management weirdness. >>> You’d probably need a couple of dozen GiB (if not in the hundreds) to >>> assemble such a genome with MIRA. There’s no way your Mac has that. >>> Normally the OS should, at one point, simply return a memory allocation >>> failure and that would be the end of the story … I have no idea why it >>> decides to freeze instead. >>> >>> B. >>> >>> >>> >>> -- >>> You have received this mail because you are subscribed to the mira_talk >>> mailing list. For information on how to subscribe or unsubscribe, please >>> visit http://www.chevreux.org/mira_mailinglists.html >> >> >> -- >> You have received this mail because you are subscribed to the mira_talk >> mailing list. For information on how to subscribe or unsubscribe, please >> visit http://www.chevreux.org/mira_mailinglists.html > > -- > You have received this mail because you are subscribed to the mira_talk > mailing list. For information on how to subscribe or unsubscribe, please > visit http://www.chevreux.org/mira_mailinglists.html -- You have received this mail because you are subscribed to the mira_talk mailing list. For information on how to subscribe or unsubscribe, please visit http://www.chevreux.org/mira_mailinglists.html