Oh, and see if your sequencing provider will run the self-correction for you, as Bastien suggested. It will definitely save you some time.

On Thu, May 15, 2014 at 5:21 PM, Chris Hoefler <hoeflerb@xxxxxxxxx> wrote:

>> My genome is about 1.4 Mb, and I have 28245 reads with median length ~2000. I am really inexperienced with pacbio, so could you clarify what you mean by filter?
>
> Your provider should have sent a report with some basic info about the sequencing run and quality of the reads. Usually it has pre-filter and post-filter statistics. Filtering is based on two parameters: read length and quality. There can be a fair number of reads that abort prematurely, giving you a large population of <50 bp reads. Usually you want to just throw those out. Some reads can also have an abnormally low quality, due to bad fluorescence signal or photobleaching, so you want to throw those out too. For the yield report, the filtering parameters are typically min length=50 bp and min quality=0.75. Depending on your intended downstream use, you may want more stringent filtering. The preassembler, for error correction for example, prefers a min length=500 bp and min quality=0.80. Your median read length will also be calculated post-filtering, and for P4-C2 should be around 4000-5000 bp. Your 2000 bp median read length sounds like a pre-filter calculation. If it is the post-filter result, you should speak to your sequencing provider because that is less than spec.
>
>> (I am assembling a parasite genome from sequencing results from whole organism by pulling out reads with homology with a closely related genome and then extracting those reads.)
>
> Based on my back of the envelope calculation, you have about 40X coverage. It will probably be less after filtering. This is not very much data. You should be getting about 75k reads (pre-filter) or 25k reads (post-filter) from a 75k ZMW smrtcell (this is old technology).
> With the new (~1 yr old) 150k ZMW smrtcells, you should be getting 150k reads (pre-filter) or 50k reads (post-filter). So multiple problems here. I think you should look at the filtering parameters and speak with your sequencing provider about the quality of the run. A single 150k ZMW smrtcell should be giving you closer to 170X coverage post-filter. If you were using the older smrtcells, I recommend switching to the newer and using the P4-C2 enzyme.
>
> You could try working with the data, but I think you will end up needlessly banging your head against the wall. With such a low coverage you would not be able to use self-correction, which gives you best results. Correcting with other technologies (ex: Illumina) might work, but it will most likely give back a lower median read length than you already have which diminishes the usefulness of PacBio data.
>
> On Thu, May 15, 2014 at 1:53 PM, Chenling Antelope <chenlingantelope@xxxxxxxxx> wrote:
>
>> Hi Chris,
>> My genome is about 1.4 Mb, and I have 28245 reads with median length ~2000. I am really inexperienced with pacbio, so could you clarify what you mean by filter? (I am assembling a parasite genome from sequencing results from whole organism by pulling out reads with homology with a closely related genome and then extracting those reads.)
>> Best,
>> Chenling
>>
>> 2014-05-15 11:32 GMT-07:00 Chris Hoefler <hoeflerb@xxxxxxxxx>:
>>
>>> Just FYI, you will need a lot more than 16 GB to error-correct and assemble your reads. If you can get access to a high memory machine (or a cluster) that would be best. What is your expected genome size? What is your post-filter median read length and yield?
>>>
>>> Best,
>>> Chris
>>>
>>> On Thu, May 15, 2014 at 11:04 AM, Chenling Antelope <chenlingantelope@xxxxxxxxx> wrote:
>>>
>>>> THANKS Bastien and Andrej :)
>>>>
>>>> 2014-05-15 0:26 GMT-07:00 Andrej Benjak <abenjak@xxxxxxxxx>:
>>>>
>>>>> Hi Chenling,
>>>>>
>>>>> For correcting PacBio reads and/or de novo assemblies you can use the SMRT portal. As an alternative to the PIA local installation, you can download the PacBio virtual machine with the SMRT portal installed and configured (not the latest version, but almost):
>>>>>
>>>>> https://github.com/PacificBiosciences/Bioinformatics-Training/wiki/SMRT-Analysis-Virtual-Machine-Install
>>>>>
>>>>> Cheers,
>>>>> Andrej
>>>>>
>>>>> On 05/15/2014 09:02 AM, Bastien Chevreux wrote:
>>>>>
>>>>> On 15 May 2014, at 2:55 , Chenling Antelope <chenlingantelope@xxxxxxxxx> wrote:
>>>>>
>>>>> Thanks Bastien for the answer!
>>>>> However I am currently unable to correct my reads because I lack the glib version required by celera.
>>>>>
>>>>> Then you should get that from somewhere :-)
>>>>>
>>>>> Also, I used miramem to estimate the RAM required, which is a lot smaller than my actual RAM (16 GB)
>>>>>
>>>>> miramem does not know about PacBio reads yet, especially not about the worst memory eater for that scenario: the Smith-Waterman overlapper.
>>>>>
>>>>> Is there something else I can do to troubleshoot?
>>>>>
>>>>> You could try to remove all reads >= 10kb (or 9kb, 8kb, etc.) to save memory at the overlap stage.
>>>>>
>>>>> But again: it makes absolutely no sense to currently use MIRA with non-corrected PacBio reads. These simply contain too much crap which MIRA is not prepared for. You will get “something” as result, but it will be total nonsense.
>>>>>
>>>>> B.

-- 
Chris Hoefler, PhD
Postdoctoral Research Associate
Straight Lab
Texas A&M University
2128 TAMU
College Station, TX 77843-2128
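
For anyone who wants to redo the back of the envelope numbers from this thread, here is a minimal Python sketch. It assumes coverage can be approximated as (number of reads x typical read length) / genome size, and that each read is represented as a hypothetical (length, quality) pair; the function names and the toy reads are illustrative and do not come from SMRT portal or any PacBio tool. The 4750 bp figure is just an assumed midpoint of the 4000-5000 bp range mentioned above.

```python
from typing import Iterable, List, Tuple

# Hypothetical read representation: (length in bp, read quality between 0 and 1).
Read = Tuple[int, float]


def estimate_coverage(n_reads: int, typical_read_length: float, genome_size: float) -> float:
    """Rough fold coverage: approximate total bases divided by genome size."""
    return n_reads * typical_read_length / genome_size


def filter_reads(reads: Iterable[Read], min_length: int = 50, min_quality: float = 0.75) -> List[Read]:
    """Keep reads that meet both the minimum length and the minimum quality."""
    return [(length, quality) for length, quality in reads
            if length >= min_length and quality >= min_quality]


def coverage_from_reads(reads: Iterable[Read], genome_size: float) -> float:
    """Fold coverage computed from the actual read lengths rather than a median."""
    return sum(length for length, _ in reads) / genome_size


if __name__ == "__main__":
    genome_size = 1.4e6  # 1.4 Mb genome, as stated in the thread

    # 28245 reads with a median length of ~2000 bp
    print(f"current data: {estimate_coverage(28245, 2000, genome_size):.1f}X")

    # 50k post-filter reads at an assumed ~4750 bp (midpoint of 4000-5000 bp)
    print(f"150k ZMW cell: {estimate_coverage(50_000, 4750, genome_size):.1f}X")

    # Toy reads (made up) to show the two filter settings mentioned above
    toy_reads = [(30, 0.90), (600, 0.70), (2500, 0.85), (400, 0.82)]
    print("yield-report filter:", filter_reads(toy_reads))
    print("preassembler filter:", filter_reads(toy_reads, min_length=500, min_quality=0.80))
```

Using the median as the typical read length only gives a rough estimate; if the per-read lengths are available, summing them as in coverage_from_reads is the more direct calculation.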