[mira_talk] Re: exiting mira without error message

  • From: Chris Hoefler <hoeflerb@xxxxxxxxx>
  • To: "mira_talk@xxxxxxxxxxxxx" <mira_talk@xxxxxxxxxxxxx>
  • Date: Thu, 15 May 2014 17:26:45 -0500

Oh, and see if your sequencing provider will run the self-correction for
you, as Bastien suggested. It will definitely save you some time.


On Thu, May 15, 2014 at 5:21 PM, Chris Hoefler <hoeflerb@xxxxxxxxx> wrote:

>
>> My genome is about 1.4 Mb, and I have 28245 reads with median length
>> ~2000. I am really inexperienced with pacbio, so could you clarify what you
>> mean by filter?
>>
>
> Your provider should have sent a report with some basic info about the
> sequencing run and quality of the reads. Usually it has pre-filter and
> post-filter statistics. Filtering is based on two parameters: read length
> and quality. There can be a fair number of reads that abort prematurely,
> giving you a large population of <50 bp reads. Usually you want to just
> throw those out. Some reads can also have an abnormally low quality, due to
> bad fluorescence signal or photobleaching, so you want to throw those out
> too. For the yield report, the filtering parameters are typically min
> length=50 bp and min quality=0.75. Depending on your intended downstream
> use, you may want more stringent filtering. The preassembler used for error
> correction, for example, prefers a min length=500 bp and min quality=0.80.
> Your median read length will also be calculated post-filtering, and for
> P4-C2 should be around 4000-5000 bp. Your 2000 bp median read length sounds
> like a pre-filter calculation. If it is the post-filter result, you should
> speak to your sequencing provider because that is less than spec.
>
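A minimal sketch of the two-parameter filter described above, assuming each read is available as a (name, length, read quality) tuple. The function name `filter_reads` and the example reads are hypothetical and only illustrate the thresholds quoted in the message; this is not the provider's or SMRT portal's actual filtering code.

```python
from statistics import median


def filter_reads(reads, min_length=50, min_quality=0.75):
    """Keep reads that meet both the length and the quality threshold.

    reads: iterable of (name, length_bp, read_quality) tuples (assumed layout).
    """
    return [r for r in reads if r[1] >= min_length and r[2] >= min_quality]


# Hypothetical reads, made up purely for illustration.
reads = [
    ("r1", 35, 0.82),    # aborted early: too short
    ("r2", 4200, 0.86),
    ("r3", 5100, 0.60),  # low quality
    ("r4", 480, 0.78),
    ("r5", 6100, 0.88),
]

# Yield-report style thresholds (min length 50 bp, min quality 0.75).
yield_filtered = filter_reads(reads)

# Stricter thresholds quoted for the preassembler (500 bp, 0.80).
preassembler_filtered = filter_reads(reads, min_length=500, min_quality=0.80)

# The median read length should be computed on the filtered set.
print(len(yield_filtered), median([r[1] for r in yield_filtered]))
print(len(preassembler_filtered), median([r[1] for r in preassembler_filtered]))
```

Computing the median only after filtering is what separates a post-filter figure from a pre-filter one, which is the distinction raised above.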
>> (I am assembling a parasite genome from sequencing results from whole
>> organism by pulling out reads with homology with a closely related genome
>> and then extracting those reads.)
>
>
> Based on my back of the envelope calculation, you have about 40X coverage.
> It will probably be less after filtering. This is not very much data. You
> should be getting about 75k reads (pre-filter) or 25k reads (post-filter)
> from a 75k ZMW smrtcell (this is old technology). With the new (~1 yr old)
> 150k ZMW smrtcells, you should be getting 150k reads (pre-filter) or 50k
> reads (post-filter). So multiple problems here. I think you should look at
> the filtering parameters and speak with your sequencing provider about the
> quality of the run. A single 150k ZMW smrtcell should be giving you closer
> to 170X coverage post-filter. If you were using the older smrtcells, I
> recommend switching to the newer and using the P4-C2 enzyme.
>
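The back-of-the-envelope coverage numbers above can be reproduced with a few lines, as a rough sketch. It assumes the median read length is an acceptable stand-in for the mean, and it uses 4750 bp as the midpoint of the 4000-5000 bp range quoted for P4-C2; both are simplifying assumptions, and `estimate_coverage` is a hypothetical helper name.

```python
def estimate_coverage(n_reads, read_length_bp, genome_size_bp):
    """Approximate fold coverage as total sequenced bases / genome size."""
    return n_reads * read_length_bp / genome_size_bp


GENOME_SIZE_BP = 1_400_000  # 1.4 Mb, from the thread

# Current data set: 28245 reads with a median length of about 2000 bp.
current = estimate_coverage(28245, 2000, GENOME_SIZE_BP)

# One 150k ZMW smrtcell post-filter: about 50k reads; 4750 bp is an
# assumed midpoint of the 4000-5000 bp range, not a measured value.
expected = estimate_coverage(50_000, 4750, GENOME_SIZE_BP)

print(f"current:  ~{current:.0f}X")
print(f"expected: ~{expected:.0f}X")
```

With these inputs the first estimate lands near 40X and the second near 170X, matching the figures in the message.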
> You could try working with the data, but I think you will end up
> needlessly banging your head against the wall. With such low coverage you
> would not be able to use self-correction, which gives the best results.
> Correcting with other technologies (e.g., Illumina) might work, but it will
> most likely give back a lower median read length than you already have,
> which diminishes the usefulness of PacBio data.
>
>
>
> On Thu, May 15, 2014 at 1:53 PM, Chenling Antelope <
> chenlingantelope@xxxxxxxxx> wrote:
>
>> Hi Chris,
>> My genome is about 1.4 Mb, and I have 28245 reads with median length
>> ~2000. I am really inexperienced with pacbio, so could you clarify what you
>> mean by filter? (I am assembling a parasite genome from sequencing results
>> from whole organism by pulling out reads with homology with a closely
>> related genome and then extracting those reads.)
>> Best,
>> Chenling
>>
>>
>> 2014-05-15 11:32 GMT-07:00 Chris Hoefler <hoeflerb@xxxxxxxxx>:
>>
>>> Just FYI, you will need a lot more than 16 GB to error-correct and
>>> assemble your reads. If you can get access to a high memory machine (or a
>>> cluster) that would be best. What is your expected genome size? What is
>>> your post-filter median read length and yield?
>>>
>>>
>>> Best,
>>> Chris
>>>
>>>
>>> On Thu, May 15, 2014 at 11:04 AM, Chenling Antelope <
>>> chenlingantelope@xxxxxxxxx> wrote:
>>>
>>>> THANKS Bastien and Andrej :)
>>>>
>>>>
>>>> 2014-05-15 0:26 GMT-07:00 Andrej Benjak <abenjak@xxxxxxxxx>:
>>>>
>>>>> Hi Chenling,
>>>>>
>>>>> For correcting PacBio reads and/or de novo assemblies you can use the
>>>>> SMRT portal. As an alternative to the PIA local installation, you can
>>>>> download the PacBio virtual machine with the SMRT portal installed and
>>>>> configured (not the latest version, but almost):
>>>>>
>>>>> https://github.com/PacificBiosciences/Bioinformatics-Training/wiki/SMRT-Analysis-Virtual-Machine-Install
>>>>>
>>>>>
>>>>> Cheers,
>>>>> Andrej
>>>>>
>>>>>
>>>>>
>>>>> On 05/15/2014 09:02 AM, Bastien Chevreux wrote:
>>>>>
>>>>> On 15 May 2014, at 2:55, Chenling Antelope <chenlingantelope@xxxxxxxxx> wrote:
>>>>>
>>>>>  Thanks Bastien for the answer!
>>>>> However I am currently unable to correct my reads because I lack the glib 
>>>>> version required by celera.
>>>>>
>>>>>  Then you should get that from somewhere :-)
>>>>>
>>>>>
>>>>>  Also, I used miramem to estimate the RAM required, which is a lot 
>>>>> smaller than my actual RAM 16G
>>>>>
>>>>>  miramem does not know about PacBio reads yet, especially not about the 
>>>>> worst memory eater for that scenario: the Smith-Waterman overlapper.
>>>>>
>>>>>
>>>>>  Is there something else I can do to troubleshoot?
>>>>>
>>>>>  You could try to remove all reads >= 10kb (or 9kb, 8kb, etc.) to save 
>>>>> memory at the overlap stage.
>>>>>
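One way to drop the long reads before the overlap stage, sketched under the assumption that the reads are in a simple FASTQ file with exactly four lines per record (no wrapped sequences). `drop_long_reads` is a hypothetical helper, not part of MIRA or any PacBio tool.

```python
import io


def drop_long_reads(fastq_in, fastq_out, max_length=10_000):
    """Copy records whose sequence is shorter than max_length.

    Assumes strict 4-line FASTQ records. Returns (kept, dropped).
    """
    kept = dropped = 0
    while True:
        header = fastq_in.readline()
        if not header:
            break  # end of input
        if not header.strip():
            continue  # skip stray blank lines
        seq = fastq_in.readline()
        plus = fastq_in.readline()
        qual = fastq_in.readline()
        if len(seq.rstrip("\n")) < max_length:
            fastq_out.write(header + seq + plus + qual)
            kept += 1
        else:
            dropped += 1
    return kept, dropped


# Tiny in-memory example with a cutoff of 8 bp instead of 10 kb.
src = io.StringIO("@a\nACGT\n+\nIIII\n@b\nACGTACGTAC\n+\nIIIIIIIIII\n")
dst = io.StringIO()
result = drop_long_reads(src, dst, max_length=8)
print(result)
```

For real data the two file-like arguments would be opened files, and the cutoff can be lowered step by step (10 kb, 9 kb, 8 kb, ...) as suggested above until the overlap stage fits in memory.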
>>>>> But again: it makes absolutely no sense to currently use MIRA with 
>>>>> non-corrected PacBio reads. These simply contain too much crap which MIRA 
>>>>> is not prepared for. You will get “something” as result, but it will be 
>>>>> total nonsense.
>>>>>
>>>>> B.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> Chris Hoefler, PhD
>>> Postdoctoral Research Associate
>>> Straight Lab
>>> Texas A&M University
>>> 2128 TAMU
>>> College Station, TX 77843-2128
>>>
>>
>>
>
>
> --
> Chris Hoefler, PhD
> Postdoctoral Research Associate
> Straight Lab
> Texas A&M University
> 2128 TAMU
> College Station, TX 77843-2128
>



-- 
Chris Hoefler, PhD
Postdoctoral Research Associate
Straight Lab
Texas A&M University
2128 TAMU
College Station, TX 77843-2128
