[mira_talk] Re: not=4 does not appear to be working on my Mac OS; system freezes after long read name lengths

From: John DeFilippo <defilippo.john@xxxxxxxxx>
To: mira_talk@xxxxxxxxxxxxx
Date: Mon, 1 Sep 2014 18:15:46 -0400
Hi Chris,

> This is a good starting place. 

OK… a plan! I appreciate your steering me towards this option.

> Put the known gene sequences that have hits in your data set in a multi-fasta 
> file and use as the bait.

Just to be clear, the bait will be the full gene nt sequence of the immune 
genes from the already sequenced urchin that I previously used as tblastn 
queries and got hits against my urchin species database - the full nt sequence 
of the ‘reference’ genes are the bait, correct? 

> and assemble them independently for comparison with a reference or whatever 
> else you are trying to do. 

So even though we only got a 16% alignment of our Proton reads against that 
reference genome, since now we’re only using genes with known significant 
(protein) hits, assembling these with MIRA after mirabait becomes feasible.

>  your Ion data sets total 12 Gb

Right, we have a 3GB FASTQ file from our PGM run with about 6 million reads of 
mean length about 230 bp, and a 9 GB FASTA file from our Proton run with about 
34 million reads of mean length 118 bp, both of which I’ve used as databases to 
query the other urchin’s immune genes against. Do I incorporate each of them in 
mirabait individually or combined? And I’m guessing my PacBio data is totally 
out the picture here.

> You might try a few different k-mer sizes depending on how similar the 
> homologs are. 

Clearly I’ll need to play around with this, but right now I don’t have a clue 
what might be appropriate k-mer sizes to try. Poking around, I see that on 
Penn's Galaxy server, the default k-mer size for mirabait is 31 - a decent 
starting point?

Thanks again. 

John


On Sep 1, 2014, at 4:50 PM, Chris Hoefler <hoeflerb@xxxxxxxxx> wrote:

> 
> I’m not sure what was done to the raw data before we received it. The only 
> thing we were told was: 'It looks like your output is around 76 megabases 
> once the linker sequences are removed and the forward and reverse strands are 
> edited into a high quality consensus.’
> 
> It sounds like they ran error-correction on it. Making some assumptions about 
> your library (10 kb bead-selected) and sequencing chemistry (P4-C2), you 
> should expect a yield of around 250 Mb per SMRT cell, so 3 cells will give 
> you around 750 Mb. After error-correction your yield will diminish a bit, in 
> my experience usually by 4-5X. It sounds like you took a pretty big hit. I 
> expect that is largely because you don't have enough coverage of the genome. 
> The error-correction only works efficiently if there is a sufficient number 
> of overlapping reads. Without that, it won't work well.
> 
> Does that refer to using the PacBio long reads as scaffolds to align short 
> reads assembled as far as contains to?
> 
> Before error-correction methods were available, using the raw PacBio reads to 
> scaffold contigs was successful to some degree. Basically a modified BLAST 
> algorithm is used to align the reads to high-confidence contigs, effectively 
> joining them by some known distance (the length of the read - the overlaps). 
> The advantages of using PacBio reads to scaffold (as opposed to a mate-pair 
> library, for example), is the ability to know the precise size of the gap in 
> bases, and if you have several spanning reads you can fill the gap with 
> sequence from the reads. However, you still need a decent coverage for this 
> to work well, and in cases where you have that coverage you are often better 
> off just error-correcting and assembling de novo instead. Also, if you have 
> missassemblies in your contigs, scaffolding will give you a lot of garbage, 
> and you have to be careful with repeats.
> 
> Is just trying to get the short reads into even some degree of contigs using 
> the long reads feasible?
> 
> Given the coverage you have, no, I wouldn't expect anything great. You may 
> have enough links to support scaffolds in a few places, but it is not going 
> to help the assembly much.
> 
> This actually would be something we’d be interested in. We’re immunologists 
> by trade, and are primarily interested in the immune genes. What I’ve done so 
> far is to use the 1,000 or so immune-related gene proteins from the urchin 
> paper you referenced as tblastn queries (E < 0.001) against our IonTorrent 
> database (I removed almost 2/3 of the reads using a filter of 90% of bases in 
> a sequence had to have a quality quality cut-off value >= Phred=20). We got 
> significant hits from all but a handful of genes
> 
> This is a good starting place. Put the known gene sequences that have hits in 
> your data set in a multi-fasta file and use as the bait. It is probably best 
> to just do a few at a time. You might try a few different k-mer sizes 
> depending on how similar the homologs are. The reads that are extracted can 
> then be used for a de novo assembly of those genes.
> 
> Is running mirabait doable on my quad core MacBook Pro with 16 GB memory,
> 
> Should be. Well, your Ion data sets total 12 Gb, so make sure you give as 
> much free memory available to Mira as you can.
> 
> I see this caveat in the manual:
> 
> Only important if you use -L. So don't use it.
> 
> Using MIRA? And could these be done on my Mac?
> 
> Yes. The number of reads you extract for each gene will be a small fraction 
> of your total reads, so it should be completely manageable on your Mac.
> 
> Again, I appreciate the depth of your responses.
> 
> No problem. Hope it helps. 
> 
> 
> On Mon, Sep 1, 2014 at 7:45 AM, John DeFilippo <defilippo.john@xxxxxxxxx> 
> wrote:
> Hi Chris,
> 
> Again, I appreciate the depth of your responses.
> 
> > you can't do much with PacBio data without error-correction
> 
> I’m not sure what was done to the raw data before we received it. The only 
> thing we were told was: 'It looks like your output is around 76 megabases 
> once the linker sequences are removed and the forward and reverse strands are 
> edited into a high quality consensus.’
> 
> (For an 800 Mb genome, to me this only looks like 0.1x coverage - what am I 
> missing?)
> 
> > (scaffolding might be possible in some cases, but it is rarely the best 
> > option)
> 
> Does that refer to using the PacBio long reads as scaffolds to align short 
> reads assembled as far as contains to?
> 
> > There is a lot that can be done short of a full genome assembly,
> 
> Is just trying to get the short reads into even some degree of contigs using 
> the long reads feasible?
> 
> > A method that comes up periodically on this list is to fish out reads 
> > (using mirabait or similar) associated with specific features that you are 
> > interested in (ex. genes, metabolic pathways, ribosomal sequence, viral 
> > insertions, etc) and assemble them independently for comparison with a 
> > reference or whatever else you are trying to do.
> 
> This actually would be something we’d be interested in. We’re immunologists 
> by trade, and are primarily interested in the immune genes. What I’ve done so 
> far is to use the 1,000 or so immune-related gene proteins from the urchin 
> paper you referenced as tblastn queries (E < 0.001) against our IonTorrent 
> database (I removed almost 2/3 of the reads using a filter of 90% of bases in 
> a sequence had to have a quality quality cut-off value >= Phred=20). We got 
> significant hits from all but a handful of genes.
> 
> Is running mirabait doable on my quad core MacBook Pro with 16 GB memory, or 
> would have to use our HPCC?
> 
> I see this caveat in the manual: ‘Do not compute hash statistics from a file 
> with sequences, but instead treat the baitfilename as file name of a valid 
> mirabait hash statistics file and load it from disk… there are currently no 
> fool-guards implemented. This means that the user must absolutely make sure 
> to use the same mirabait value for 'k' both in the run which generated the 
> hash statistics file and in the search using the pre-computed file or else 
> results will be (horribly) wrong.’
> 
> Does the fact that I have no idea what this means doom me?
> 
> > and assemble them independently for comparison with a reference or whatever 
> > else you are trying to do.
> 
> Using MIRA? And could these be done on my Mac?
> 
> Thanks again for your help.
> 
> John
> --
> You have received this mail because you are subscribed to the mira_talk 
> mailing list. For information on how to subscribe or unsubscribe, please 
> visit http://www.chevreux.org/mira_mailinglists.html
> 
> 
> 
> -- 
> Chris Hoefler, PhD
> Postdoctoral Research Associate
> Straight Lab
> Texas A&M University
> 2128 TAMU
> College Station, TX 77843-2128
Follow-Ups:
- [mira_talk] Re: not=4 does not appear to be working on my Mac OS; system freezes after long read name lengths
  - From: Chris Hoefler
References:
- [mira_talk] Re: not=4 does not appear to be working on my Mac OS; system freezes after long read name lengths
  - From: John DeFilippo
- [mira_talk] Re: not=4 does not appear to be working on my Mac OS; system freezes after long read name lengths
  - From: Chris Hoefler
[mira_talk] Re: not=4 does not appear to be working on my Mac OS; system freezes after long read name lengths

Other related posts: