[mira_talk] Re: not=4 does not appear to be working on my Mac OS; system freezes after long read name lengths

From: John DeFilippo <defilippo.john@xxxxxxxxx>
To: mira_talk@xxxxxxxxxxxxx
Date: Mon, 1 Sep 2014 08:45:12 -0400

Hi Chris,

Again, I appreciate the depth of your responses.

> you can't do much with PacBio data without error-correction 

I’m not sure what was done to the raw data before we received it. The only 
thing we were told was: ‘It looks like your output is around 76 megabases once 
the linker sequences are removed and the forward and reverse strands are edited 
into a high quality consensus.’

(For an 800 Mb genome, to me this only looks like 0.1x coverage - what am I 
missing?)
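
For reference, the coverage arithmetic can be written out as a minimal sketch. 
It assumes the quoted 76 megabases is the total usable sequence and that 800 Mb 
is the genome size; the function name fold_coverage is made up for 
illustration.

```python
def fold_coverage(total_bases: float, genome_size: float) -> float:
    """Average depth of coverage = total sequenced bases / genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


pacbio_bases = 76e6   # ~76 Mb of consensus sequence, as quoted above
genome_size = 800e6   # ~800 Mb genome size (estimate from this thread)

coverage = fold_coverage(pacbio_bases, genome_size)
print(f"{coverage:.3f}x coverage")
```

Under those assumptions the result rounds to the 0.1x figure mentioned above.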

> (scaffolding might be possible in some cases, but it is rarely the best 
> option)

Does that refer to using the PacBio long reads as scaffolds to align short 
reads (assembled as far as contigs) to?

> There is a lot that can be done short of a full genome assembly,

Is just trying to get the short reads into even some degree of contigs using 
the long reads feasible?

> A method that comes up periodically on this list is to fish out reads (using 
> mirabait or similar) associated with specific features that you are 
> interested in (ex. genes, metabolic pathways, ribosomal sequence, viral 
> insertions, etc) and assemble them independently for comparison with a 
> reference or whatever else you are trying to do.

This actually would be something we’d be interested in. We’re immunologists by 
trade, and are primarily interested in the immune genes. What I’ve done so far 
is to use the 1,000 or so immune-related gene proteins from the urchin paper 
you referenced as tblastn queries (E < 0.001) against our IonTorrent database 
(I removed almost 2/3 of the reads using a filter requiring that 90% of the 
bases in a sequence have a quality value of Phred >= 20). We got significant 
hits from all but a handful of genes. 
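
To make the read filter concrete, here is a rough sketch of the rule described 
above (at least 90% of bases with Phred >= 20). It is not the tool I actually 
used; it assumes 4-line FASTQ records with Phred+33 encoded qualities, and the 
function names are hypothetical.

```python
def passes_quality_filter(quality_line, min_phred=20, min_fraction=0.90, offset=33):
    """Return True if at least min_fraction of bases have Phred >= min_phred.

    Assumes Phred+33 ASCII encoding (an assumption, not stated in the thread).
    """
    if not quality_line:
        return False
    good = sum(1 for ch in quality_line if ord(ch) - offset >= min_phred)
    return good / len(quality_line) >= min_fraction


def filter_fastq(lines):
    """Yield (header, sequence, quality) for 4-line FASTQ records that pass."""
    records = [line.rstrip("\n") for line in lines]
    for i in range(0, len(records) - 3, 4):
        header, seq, _, qual = records[i:i + 4]
        if passes_quality_filter(qual):
            yield header, seq, qual


# Tiny made-up example: '?' is Phred 30, '+' is Phred 10 under Phred+33.
example = [
    "@read1\n", "ACGTACGTAC\n", "+\n", "??????????\n",
    "@read2\n", "ACGTACGTAC\n", "+\n", "++++++++++\n",
]
kept = [header for header, _, _ in filter_fastq(example)]
print(kept)
```

With the made-up input, only the first read survives, since every base in the 
second one falls below Phred 20.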

Is running mirabait doable on my quad core MacBook Pro with 16 GB memory, or 
would I have to use our HPCC?

I see this caveat in the manual: ‘Do not compute hash statistics from a file 
with sequences, but instead treat the baitfilename as file name of a valid 
mirabait hash statistics file and load it from disk… there are currently no 
fool-guards implemented. This means that the user must absolutely make sure to 
use the same mirabait value for 'k' both in the run which generated the hash 
statistics file and in the search using the pre-computed file or else results 
will be (horribly) wrong.’ 

Does the fact that I have no idea what this means doom me?
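
As far as I can piece it together, the caveat is about k-mer lookups: a table 
built from the bait sequences with one value of k cannot be searched 
meaningfully with another. The toy sketch below illustrates that idea only. It 
is not how mirabait is implemented, it ignores reverse complements, and every 
name and sequence in it is invented.

```python
def kmers(sequence, k):
    """Return the set of all substrings of length k in sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def build_bait_index(bait_sequences, k):
    """Collect the k-mers of all bait sequences (a stand-in for a hash file)."""
    index = set()
    for seq in bait_sequences:
        index |= kmers(seq.upper(), k)
    return index


def count_bait_hits(read, index, k):
    """Count distinct k-mers of the read that occur in the bait index."""
    return len(kmers(read.upper(), k) & index)


bait = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"  # invented bait sequence
read = "CCATTGTAATGGGCCGCTGAAAGG"                 # invented read overlapping the bait

index_k8 = build_bait_index([bait], k=8)           # "pre-computed" with k = 8

same_k = count_bait_hits(read, index_k8, k=8)      # search with the same k
wrong_k = count_bait_hits(read, index_k8, k=6)     # search with a different k

print("hits with matching k:  ", same_k)
print("hits with mismatched k:", wrong_k)
```

In this toy version a mismatched k simply finds nothing, because a 6-mer can 
never equal an 8-mer. A real tool working from a binary hash file has no such 
safety net, which is presumably why the manual warns that results can be wrong 
rather than just empty.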

> and assemble them independently for comparison with a reference or whatever 
> else you are trying to do.

Using MIRA? And could these be done on my Mac?

Thanks again for your help.

John
--
You have received this mail because you are subscribed to the mira_talk mailing 
list. For information on how to subscribe or unsubscribe, please visit 
http://www.chevreux.org/mira_mailinglists.html

Follow-Ups:
- [mira_talk] Re: not=4 does not appear to be working on my Mac OS; system freezes after long read name lengths
  - From: Chris Hoefler
