> My genome is about 1.4 Mb, and I have 28245 reads with median length
> ~2000. I am really inexperienced with pacbio, so could you clarify what you
> mean by filter?

Your provider should have sent a report with some basic info about the sequencing run and the quality of the reads. Usually it has pre-filter and post-filter statistics. Filtering is based on two parameters: read length and read quality. A fair number of reads can abort prematurely, giving you a large population of <50 bp reads. Usually you just want to throw those out. Some reads can also have abnormally low quality, due to bad fluorescence signal or photobleaching, so you want to throw those out too.

For the yield report, the filtering parameters are typically min length=50 bp and min quality=0.75. Depending on your intended downstream use, you may want more stringent filtering. The preassembler (used for error correction), for example, prefers min length=500 bp and min quality=0.80.

Your median read length will also be calculated post-filtering, and for P4-C2 it should be around 4000-5000 bp. Your 2000 bp median read length sounds like a pre-filter calculation. If it is the post-filter result, you should speak to your sequencing provider, because that is below spec.

> (I am assembling a parasite genome from sequencing results from whole
> organism by pulling out reads with homology with a closely related genome
> and then extracting those reads.)

Based on my back-of-the-envelope calculation (28245 reads x ~2000 bp / 1.4 Mb), you have about 40X coverage. It will probably be less after filtering. This is not very much data. You should be getting about 75k reads (pre-filter) or 25k reads (post-filter) from a 75k ZMW smrtcell (this is old technology). With the new (~1 yr old) 150k ZMW smrtcells, you should be getting 150k reads (pre-filter) or 50k reads (post-filter). So there are multiple problems here. I think you should look at the filtering parameters and speak with your sequencing provider about the quality of the run.
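
For illustration only, the length/quality filter and the coverage estimate described above can be sketched in a few lines of Python. This is not SMRT Analysis code: the function names `filter_reads` and `estimate_coverage` and the `(length, quality)` tuple layout are hypothetical. Multiplying the read count by the median length is only a rough stand-in for the total number of sequenced bases.

```python
from typing import Iterable, List, Tuple

# Hypothetical read representation: (length in bp, read quality in [0, 1]).
Read = Tuple[int, float]


def filter_reads(reads: Iterable[Read], min_length: int = 50,
                 min_quality: float = 0.75) -> List[Read]:
    """Keep only reads that pass both the length and the quality cutoff."""
    return [(length, quality) for length, quality in reads
            if length >= min_length and quality >= min_quality]


def estimate_coverage(n_reads: int, typical_read_length: float,
                      genome_size: float) -> float:
    """Rough fold coverage: approximate total bases divided by genome size."""
    return n_reads * typical_read_length / genome_size


if __name__ == "__main__":
    # Numbers quoted in this thread: 28245 reads, ~2000 bp median, 1.4 Mb genome.
    print(f"{estimate_coverage(28245, 2000, 1.4e6):.1f}X")

    # Made-up reads, filtered with the yield-report defaults (50 bp, 0.75).
    example_reads = [(30, 0.90), (600, 0.70), (3000, 0.85)]
    print(filter_reads(example_reads))
```

Passing `min_length=500, min_quality=0.80` would mimic the stricter cutoffs mentioned for the preassembler.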
A single 150k ZMW smrtcell should be giving you closer to 170X coverage post-filter. If you were using the older smrtcells, I recommend switching to the newer ones and using the P4-C2 enzyme.

You could try working with the data, but I think you will end up needlessly banging your head against the wall. With such low coverage you would not be able to use self-correction, which gives the best results. Correcting with other technologies (e.g., Illumina) might work, but it will most likely give back a lower median read length than you already have, which diminishes the usefulness of PacBio data.

On Thu, May 15, 2014 at 1:53 PM, Chenling Antelope <chenlingantelope@xxxxxxxxx> wrote:

> Hi Chris,
> My genome is about 1.4 Mb, and I have 28245 reads with median length
> ~2000. I am really inexperienced with pacbio, so could you clarify what you
> mean by filter? (I am assembling a parasite genome from sequencing results
> from whole organism by pulling out reads with homology with a closely
> related genome and then extracting those reads.)
> Best,
> Chenling
>
>
> 2014-05-15 11:32 GMT-07:00 Chris Hoefler <hoeflerb@xxxxxxxxx>:
>
>> Just FYI, you will need a lot more than 16 GB to error-correct and
>> assemble your reads. If you can get access to a high memory machine (or a
>> cluster) that would be best. What is your expected genome size? What is
>> your post-filter median read length and yield?
>>
>>
>> Best,
>> Chris
>>
>>
>> On Thu, May 15, 2014 at 11:04 AM, Chenling Antelope <
>> chenlingantelope@xxxxxxxxx> wrote:
>>
>>> THANKS Bastien and Andrej :)
>>>
>>>
>>> 2014-05-15 0:26 GMT-07:00 Andrej Benjak <abenjak@xxxxxxxxx>:
>>>
>>>> Hi Chenling,
>>>>
>>>> For correcting PacBio reads and/or de novo assemblies you can use the
>>>> SMRT portal.
>>>> As an alternative to the PIA local installation, you can
>>>> download the PacBio virtual machine with the SMRT portal installed and
>>>> configured (not the latest version, but almost):
>>>>
>>>> https://github.com/PacificBiosciences/Bioinformatics-Training/wiki/SMRT-Analysis-Virtual-Machine-Install
>>>>
>>>>
>>>> Cheers,
>>>> Andrej
>>>>
>>>>
>>>>
>>>> On 05/15/2014 09:02 AM, Bastien Chevreux wrote:
>>>>
>>>> On 15 May 2014, at 2:55, Chenling Antelope <chenlingantelope@xxxxxxxxx>
>>>> wrote:
>>>>
>>>> Thanks Bastien for the answer!
>>>> However I am currently unable to correct my reads because I lack the glib
>>>> version required by celera.
>>>>
>>>> Then you should get that from somewhere :-)
>>>>
>>>>
>>>> Also, I used miramem to estimate the RAM required, which is a lot smaller
>>>> than my actual RAM 16G
>>>>
>>>> miramem does not know about PacBio reads yet, especially not about the
>>>> worst memory eater for that scenario: the Smith-Waterman overlapper.
>>>>
>>>>
>>>> Is there something else I can do to trouble shoot?
>>>>
>>>> You could try to remove all reads >= 10kb (or 9kb, 8kb, etc.) to save
>>>> memory at the overlap stage.
>>>>
>>>> But again: it makes absolutely no sense to currently use MIRA with
>>>> non-corrected PacBio reads. These simply contain too much crap which MIRA
>>>> is not prepared for. You will get “something” as result, but it will be
>>>> total nonsense.
>>>>
>>>> B.
>>>>
>>>
>>
>>
>> --
>> Chris Hoefler, PhD
>> Postdoctoral Research Associate
>> Straight Lab
>> Texas A&M University
>> 2128 TAMU
>> College Station, TX 77843-2128
>>
>
>

--
Chris Hoefler, PhD
Postdoctoral Research Associate
Straight Lab
Texas A&M University
2128 TAMU
College Station, TX 77843-2128