[mira_talk] Re: exiting mira without error message

  • From: Chris Hoefler <hoeflerb@xxxxxxxxx>
  • To: "mira_talk@xxxxxxxxxxxxx" <mira_talk@xxxxxxxxxxxxx>
  • Date: Thu, 15 May 2014 17:21:48 -0500

> My genome is about 1.4 Mb, and I have 28245 reads with median length
> ~2000. I am really inexperienced with pacbio, so could you clarify what you
> mean by filter?
>

Your provider should have sent a report with some basic info about the
sequencing run and quality of the reads. Usually it has pre-filter and
post-filter statistics. Filtering is based on two parameters: read length
and quality. There can be a fair number of reads that abort prematurely,
giving you a large population of <50 bp reads. Usually you want to just
throw those out. Some reads can also have an abnormally low quality, due to
bad fluorescence signal or photobleaching, so you want to throw those out
too. For the yield report, the filtering parameters are typically min
length=50 bp and min quality=0.75. Depending on your intended downstream
use, you may want more stringent filtering. For example, the preassembler
used for error correction prefers min length=500 bp and min quality=0.80.
Your median read length will also be calculated post-filtering, and for
P4-C2 should be around 4000-5000 bp. Your 2000 bp median read length sounds
like a pre-filter calculation. If it is the post-filter result, you should
speak to your sequencing provider because that is less than spec.
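
To make the two filtering parameters concrete, here is a minimal Python sketch. It is not how SMRT portal does it internally. The read names, lengths and quality values are made up for illustration, and filter_reads is a hypothetical helper, not part of any PacBio tool. It only shows the idea: drop reads below a minimum length or a minimum read quality, then compute the median on what is left.

```python
from statistics import median

def filter_reads(reads, min_length=50, min_quality=0.75):
    """Keep reads that pass BOTH thresholds.

    reads: iterable of (name, length_bp, read_quality) tuples.
    This tuple layout is an assumption made for this example only.
    """
    return [r for r in reads if r[1] >= min_length and r[2] >= min_quality]

# Made-up reads: (name, length in bp, read quality between 0 and 1)
reads = [
    ("r1", 35, 0.82),    # aborted early, too short
    ("r2", 4200, 0.86),
    ("r3", 1800, 0.60),  # low quality
    ("r4", 5100, 0.81),
    ("r5", 450, 0.78),
]

# Thresholds typically used for the yield report
yield_set = filter_reads(reads, min_length=50, min_quality=0.75)

# More stringent thresholds, as preferred by the preassembler
preassembly_set = filter_reads(reads, min_length=500, min_quality=0.80)

# The median read length is computed on the post-filter set
yield_median = median([r[1] for r in yield_set])
preassembly_median = median([r[1] for r in preassembly_set])

print(len(yield_set), yield_median)
print(len(preassembly_set), preassembly_median)
```

Note how the median moves depending on which filter you apply, which is why it matters whether a reported median is pre-filter or post-filter.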

> (I am assembling a parasite genome from sequencing results from whole
> organism by pulling out reads with homology with a closely related genome
> and then extracting those reads.)


Based on my back of the envelope calculation, you have about 40X coverage.
It will probably be less after filtering. This is not very much data. You
should be getting about 75k reads (pre-filter) or 25k reads (post-filter)
from a 75k ZMW smrtcell (this is old technology). With the new (~1 yr old)
150k ZMW smrtcells, you should be getting 150k reads (pre-filter) or 50k
reads (post-filter). So multiple problems here. I think you should look at
the filtering parameters and speak with your sequencing provider about the
quality of the run. A single 150k ZMW smrtcell should be giving you closer
to 170X coverage post-filter. If you were using the older smrtcells, I
recommend switching to the newer and using the P4-C2 enzyme.
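
For reference, the back-of-the-envelope numbers above can be reproduced with a few lines of Python. This is only a rough sketch: it uses the median read length as a stand-in for the mean, and the 4750 bp figure is simply the midpoint of the 4000-5000 bp range, which is my assumption for illustration.

```python
def coverage(n_reads, read_length_bp, genome_size_bp):
    """Rough fold coverage: total sequenced bases divided by genome size."""
    return n_reads * read_length_bp / genome_size_bp

genome_size = 1.4e6  # 1.4 Mb genome

# What you have now: 28245 reads at ~2000 bp
current = coverage(28245, 2000, genome_size)

# What one 150k ZMW smrtcell should give post-filter:
# ~50k reads at an assumed 4750 bp (midpoint of 4000-5000 bp)
expected = coverage(50000, 4750, genome_size)

print(f"current:  {current:.0f}X")
print(f"expected: {expected:.0f}X")
```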

You could try working with the data, but I think you will end up needlessly
banging your head against the wall. With such low coverage you would not
be able to use self-correction, which gives the best results. Correcting
with other technologies (e.g., Illumina) might work, but it will most likely
give back a lower median read length than you already have, which diminishes
the usefulness of PacBio data.



On Thu, May 15, 2014 at 1:53 PM, Chenling Antelope <
chenlingantelope@xxxxxxxxx> wrote:

> Hi Chris,
> My genome is about 1.4 Mb, and I have 28245 reads with median length
> ~2000. I am really inexperienced with pacbio, so could you clarify what you
> mean by filter? (I am assembling a parasite genome from sequencing results
> from whole organism by pulling out reads with homology with a closely
> related genome and then extracting those reads.)
> Best,
> Chenling
>
>
> 2014-05-15 11:32 GMT-07:00 Chris Hoefler <hoeflerb@xxxxxxxxx>:
>
>> Just FYI, you will need a lot more than 16 GB to error-correct and
>> assemble your reads. If you can get access to a high memory machine (or a
>> cluster) that would be best. What is your expected genome size? What is
>> your post-filter median read length and yield?
>>
>>
>> Best,
>> Chris
>>
>>
>> On Thu, May 15, 2014 at 11:04 AM, Chenling Antelope <
>> chenlingantelope@xxxxxxxxx> wrote:
>>
>>> THANKS Bastien and Andrej :)
>>>
>>>
>>> 2014-05-15 0:26 GMT-07:00 Andrej Benjak <abenjak@xxxxxxxxx>:
>>>
>>>> Hi Chenling,
>>>>
>>>> For correcting PacBio reads and/or de novo assemblies you can use the
>>>> SMRT portal. As an alternative to the PIA local installation, you can
>>>> download the PacBio virtual machine with the SMRT portal installed and
>>>> configured (not the latest version, but almost):
>>>>
>>>> https://github.com/PacificBiosciences/Bioinformatics-Training/wiki/SMRT-Analysis-Virtual-Machine-Install
>>>>
>>>>
>>>> Cheers,
>>>> Andrej
>>>>
>>>>
>>>>
>>>> On 05/15/2014 09:02 AM, Bastien Chevreux wrote:
>>>>
>>>> On 15 May 2014, at 2:55 , Chenling Antelope <chenlingantelope@xxxxxxxxx> wrote:
>>>>
>>>>  Thanks Bastien for the answer!
>>>> However I am currently unable to correct my reads because I lack the glib 
>>>> version required by celera.
>>>>
>>>>  Then you should get that from somewhere :-)
>>>>
>>>>
>>>>  Also, I used miramem to estimate the RAM required, which is a lot smaller 
>>>> than my actual RAM 16G
>>>>
>>>>  miramem does not know about PacBio reads yet, especially not about the 
>>>> worst memory eater for that scenario: the Smith-Waterman overlapper.
>>>>
>>>>
>>>>  Is there something else I can do to trouble shoot?
>>>>
>>>>  You could try to remove all reads >= 10kb (or 9kb, 8kb, etc.) to save 
>>>> memory at the overlap stage.
>>>>
>>>> But again: it makes absolutely no sense to currently use MIRA with 
>>>> non-corrected PacBio reads. These simply contain too much crap which MIRA 
>>>> is not prepared for. You will get “something” as result, but it will be 
>>>> total nonsense.
>>>>
>>>> B.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>
>>
>> --
>> Chris Hoefler, PhD
>> Postdoctoral Research Associate
>> Straight Lab
>> Texas A&M University
>> 2128 TAMU
>> College Station, TX 77843-2128
>>
>
>


-- 
Chris Hoefler, PhD
Postdoctoral Research Associate
Straight Lab
Texas A&M University
2128 TAMU
College Station, TX 77843-2128
