[mira_talk] Re: not=4 does not appear to be working on my Mac OS; system freezes after long read name lengths

  • From: Chris Hoefler <hoeflerb@xxxxxxxxx>
  • To: "mira_talk@xxxxxxxxxxxxx" <mira_talk@xxxxxxxxxxxxx>
  • Date: Tue, 2 Sep 2014 11:53:35 -0500

> How is N handled?
>

It is ignored. Anything non-ACGT will not be matched.

So for mirabait, would it be better to use my FASTQ or FASTA data?


Fastq. The reads that you get from mirabait should also be fastq (ie:
specify fastq not fasta as the output format) so that you can use them
directly for assembly. I don't usually quality filter anything that goes to
Mira because Mira has it's own routines for dealing with sequencing errors.
And Bastien says no to any preprocessing in the manual. Just convert from
BAM to fastq and that's it.

The PacBio data was given to us in both formats and I have no idea what
> filters were applied to it.


Just use the fastq file. No filters should have been applied.

This is good news. Our grant dollars are scarce, so what we spent on the
> PacBio was a big deal for us. It really bothered me that we couldn’t get
> any return on that investment without spending much more (which we are not
> likely to do).


Since PacBio has little to no discernible sequencing bias, if you can get
to 1-2X coverage of your error-corrected sequence, you can probably pull
out everything you are looking for fairly easily, taking advantage of the
long read lengths.



On Tue, Sep 2, 2014 at 10:45 AM, John DeFilippo <defilippo.john@xxxxxxxxx>
wrote:

> Hi Chris,
>
> Thanks for the mini tutorial!
>
> if the kmer from your read has some gaps in it, it won't hit the bait.
>
>
> How is N handled?
>
> As of Mira 4.0.1 you can use multiple input files, so you can use both at
> the same time. However, they have to be the same format (ex: both fastq).
>
>
> For all our data (PGM, Proton, PacBio), I have both FASTQ and FASTA. I
> converted the Ion Torrent data myself on Galaxy. For the PGM data I just
> used a 175 bp read length filter, but didn’t filter out low quality (I
> didn’t really understand it at the time). With the Proton, as I said
> before, I removed about 2/3 of the reads using a read length filter of 100
> bp and a quality filter of 90% of bases in a sequence had to have a quality
> quality cut-off value >= Phred=20. The PacBio data was given to us in both
> formats and I have no idea what filters were applied to it.
>
> So for mirabait, would it be better to use my FASTQ or FASTA data?
>
> And you need qualities if you are planning to assemble these reads. Do you
> have a separate fasta.qual file for your Proton run, or a fastq?
>
>
> I don’t have fasta.qual files, so for the assembly I’ll use FASTQ.
>
> a fairly manual process, though, and for 1000 genes that would not be fun
> times.
>
>
> When I ran tblastn on our PGM data, I executed them manually, one at a
> time, and yeh, not fun.
>
> Some script-fu may help if you happen to know some Perl (or other).
>
>
> I don’t know any Perl or Python (the latter is on my ’to learn’ list), but
> by the time I ran tblastn on our Proton data I was able to at least make up
> a very simple bash script to automate the process. Talk about time well
> spent learning.
>
> And I’m guessing my PacBio data is totally out the picture here.
>
> Since it is error-corrected, you might as well include it.
>
>
> This is good news. Our grant dollars are scarce, so what we spent on the
> PacBio was a big deal for us. It really bothered me that we couldn’t get
> any return on that investment without spending much more (which we are not
> likely to do).
>
> Thanks again.
>
> John
>
> On Sep 2, 2014, at 10:49 AM, Chris Hoefler <hoeflerb@xxxxxxxxx> wrote:
>
>
> Just to be clear, the bait will be the full gene nt sequence of the immune
>> genes from the already sequenced urchin that I previously used as tblastn
>> queries and got hits against my urchin species database - the full nt
>> sequence of the ‘reference’ genes are the bait, correct?
>>
>
> Yes.
>
> So even though we only got a 16% alignment of our Proton reads against
>> that reference genome, since now we’re only using genes with known
>> significant (protein) hits, assembling these with MIRA after mirabait
>> becomes feasible.
>
>
> Yeah.... Ok, so this will require some playing around. It is worth it to
> just grab some test sequences so you can get a feel for how mirabait works.
>
> Here is an example gene fragment to use as a bait,
> >bait_test
> TCTGTCTTAT GTCTTTCCTC TATTTTATTT TTGACCGTCC TCATCCGTAC ATATGCGATG
> TTCGGCAATG AATTTGTCAT TGATCATTTG ACTAGCATGA CATTCGTTGT CTTCTGCGGT
>
> And here are three "reads",
> >read1
> ACTGACCAATGCCAAACGGCACGGCGGCGCCGCAGCCTGT
> >read2
> GGCAATCCTGAGGCATTATTTTTGACCTGTAAATGCAGAA
> >read3
> CGGAGTCCCATCTGGCATTTTTGATCGTCCAGCTGCAGCT
>
> Running mirabait on these with a kmer size of 8 will give you reads 2 and
> 3. A kmer size of 10-12 will give you just read2. A larger kmer size will
> give you nothing, and a kmer size of 6 will give you all three reads.
>
> So, I like the description Bastien has given it, "a grep for kmers". Kmer
> in this context is a stretch of nucleotides that match exactly between the
> read and the bait. So if the kmer from your read has some gaps in it, it
> won't hit the bait. In your case, I would try a few kmer sizes and see what
> you manage to get back. You can use your alignments as a guide (if you have
> stretches of at least 10 nucleotides that are identical between homologs,
> use a kmer setting of 10, for example). The tradeoff with smaller kmer
> sizes is more reads (to try to get a full ORF) vs reads that don't belong
> to the ORF. The latter is not necessarily a problem because you will
> assemble the reads in the end and take only the "contig" that matches your
> ORF, but if you have a highly repetitive sequence of some kind (like a
> transposon), this might be hard to sort out. With a stricter kmer setting
> you may only get gene fragments. This is also ok, because you can then use
> those fragments as bait to build outward until you get the whole ORF.
>
> For highly divergent sequences you may have more problems. If you end up
> getting too many reads outside the ORF (with a small kmer setting), you can
> try using your tblastn results to identify a few reads that match your ORF,
> and then use those reads as bait to try to extract the whole ORF. This
> would end up being a fairly manual process, though, and for 1000 genes that
> would not be fun times. Some script-fu may help if you happen to know some
> Perl (or other).
>
> Right, we have a 3GB FASTQ file from our PGM run with about 6 million
>> reads of mean length about 230 bp, and a 9 GB FASTA file from our Proton
>> run with about 34 million reads of mean length 118 bp, both of which I’ve
>> used as databases to query the other urchin’s immune genes against. Do I
>> incorporate each of them in mirabait individually or combined?
>
>
> As of Mira 4.0.1 you can use multiple input files, so you can use both at
> the same time. However, they have to be the same format (ex: both fastq).
> And you need qualities if you are planning to assemble these reads. Do you
> have a separate fasta.qual file for your Proton run, or a fastq?
>
> And I’m guessing my PacBio data is totally out the picture here.
>
>
> Since it is error-corrected, you might as well include it. You may not get
> many hits, but when you do you are more likely to span the entire ORF with
> 1 or 2 reads.
>
>
> On Mon, Sep 1, 2014 at 5:15 PM, John DeFilippo <defilippo.john@xxxxxxxxx>
> wrote:
>
>> Hi Chris,
>>
>> This is a good starting place.
>>
>>
>> OK… a plan! I appreciate your steering me towards this option.
>>
>> Put the known gene sequences that have hits in your data set in a
>> multi-fasta file and use as the bait.
>>
>>
>> Just to be clear, the bait will be the full gene nt sequence of the
>> immune genes from the already sequenced urchin that I previously used as
>> tblastn queries and got hits against my urchin species database - the full
>> nt sequence of the ‘reference’ genes are the bait, correct?
>>
>> and assemble them independently for comparison with a reference or
>>> whatever else you are trying to do.
>>
>>
>> So even though we only got a 16% alignment of our Proton reads against
>> that reference genome, since now we’re only using genes with known
>> significant (protein) hits, assembling these with MIRA after mirabait
>> becomes feasible.
>>
>>  your Ion data sets total 12 Gb
>>
>>
>> Right, we have a 3GB FASTQ file from our PGM run with about 6 million
>> reads of mean length about 230 bp, and a 9 GB FASTA file from our Proton
>> run with about 34 million reads of mean length 118 bp, both of which I’ve
>> used as databases to query the other urchin’s immune genes against. Do I
>> incorporate each of them in mirabait individually or combined? And I’m
>> guessing my PacBio data is totally out the picture here.
>>
>> You might try a few different k-mer sizes depending on how similar the
>> homologs are.
>>
>>
>> Clearly I’ll need to play around with this, but right now I don’t have a
>> clue what might be appropriate k-mer sizes to try. Poking around, I see
>> that on Penn's Galaxy server, the default k-mer size for mirabait is 31 - a
>> decent starting point?
>>
>> Thanks again.
>>
>> John
>>
>>
>> On Sep 1, 2014, at 4:50 PM, Chris Hoefler <hoeflerb@xxxxxxxxx> wrote:
>>
>>
>> I’m not sure what was done to the raw data before we received it. The
>>> only thing we were told was: 'It looks like your output is around 76
>>> megabases once the linker sequences are removed and the forward and reverse
>>> strands are edited into a high quality consensus.’
>>
>>
>> It sounds like they ran error-correction on it. Making some assumptions
>> about your library (10 kb bead-selected) and sequencing chemistry (P4-C2),
>> you should expect a yield of around 250 Mb per SMRT cell, so 3 cells will
>> give you around 750 Mb. After error-correction your yield will diminish a
>> bit, in my experience usually by 4-5X. It sounds like you took a pretty big
>> hit. I expect that is largely because you don't have enough coverage of the
>> genome. The error-correction only works efficiently if there is a
>> sufficient number of overlapping reads. Without that, it won't work well.
>>
>> Does that refer to using the PacBio long reads as scaffolds to align
>>> short reads assembled as far as contains to?
>>
>>
>> Before error-correction methods were available, using the raw PacBio
>> reads to scaffold contigs was successful to some degree. Basically a
>> modified BLAST algorithm is used to align the reads to high-confidence
>> contigs, effectively joining them by some known distance (the length of the
>> read - the overlaps). The advantages of using PacBio reads to scaffold (as
>> opposed to a mate-pair library, for example), is the ability to know the
>> precise size of the gap in bases, and if you have several spanning reads
>> you can fill the gap with sequence from the reads. However, you still need
>> a decent coverage for this to work well, and in cases where you have that
>> coverage you are often better off just error-correcting and assembling de
>> novo instead. Also, if you have missassemblies in your contigs, scaffolding
>> will give you a lot of garbage, and you have to be careful with repeats.
>>
>> Is just trying to get the short reads into even some degree of contigs
>>> using the long reads feasible?
>>
>>
>> Given the coverage you have, no, I wouldn't expect anything great. You
>> may have enough links to support scaffolds in a few places, but it is not
>> going to help the assembly much.
>>
>> This actually would be something we’d be interested in. We’re
>>> immunologists by trade, and are primarily interested in the immune genes.
>>> What I’ve done so far is to use the 1,000 or so immune-related gene
>>> proteins from the urchin paper you referenced as tblastn queries (E <
>>> 0.001) against our IonTorrent database (I removed almost 2/3 of the reads
>>> using a filter of 90% of bases in a sequence had to have a quality quality
>>> cut-off value >= Phred=20). We got significant hits from all but a handful
>>> of genes
>>
>>
>> This is a good starting place. Put the known gene sequences that have
>> hits in your data set in a multi-fasta file and use as the bait. It is
>> probably best to just do a few at a time. You might try a few different
>> k-mer sizes depending on how similar the homologs are. The reads that are
>> extracted can then be used for a de novo assembly of those genes.
>>
>> Is running mirabait doable on my quad core MacBook Pro with 16 GB memory,
>>
>>
>>  Should be. Well, your Ion data sets total 12 Gb, so make sure you give
>> as much free memory available to Mira as you can.
>>
>> I see this caveat in the manual:
>>
>>
>> Only important if you use -L. So don't use it.
>>
>> Using MIRA? And could these be done on my Mac?
>>
>>
>> Yes. The number of reads you extract for each gene will be a small
>> fraction of your total reads, so it should be completely manageable on your
>> Mac.
>>
>> Again, I appreciate the depth of your responses.
>>
>>
>> No problem. Hope it helps.
>>
>>
>> On Mon, Sep 1, 2014 at 7:45 AM, John DeFilippo <defilippo.john@xxxxxxxxx>
>> wrote:
>>
>>> Hi Chris,
>>>
>>> Again, I appreciate the depth of your responses.
>>>
>>> > you can't do much with PacBio data without error-correction
>>>
>>> I’m not sure what was done to the raw data before we received it. The
>>> only thing we were told was: 'It looks like your output is around 76
>>> megabases once the linker sequences are removed and the forward and reverse
>>> strands are edited into a high quality consensus.’
>>>
>>> (For an 800 Mb genome, to me this only looks like 0.1x coverage - what
>>> am I missing?)
>>>
>>> > (scaffolding might be possible in some cases, but it is rarely the
>>> best option)
>>>
>>> Does that refer to using the PacBio long reads as scaffolds to align
>>> short reads assembled as far as contains to?
>>>
>>> > There is a lot that can be done short of a full genome assembly,
>>>
>>> Is just trying to get the short reads into even some degree of contigs
>>> using the long reads feasible?
>>>
>>> > A method that comes up periodically on this list is to fish out reads
>>> (using mirabait or similar) associated with specific features that you are
>>> interested in (ex. genes, metabolic pathways, ribosomal sequence, viral
>>> insertions, etc) and assemble them independently for comparison with a
>>> reference or whatever else you are trying to do.
>>>
>>> This actually would be something we’d be interested in. We’re
>>> immunologists by trade, and are primarily interested in the immune genes.
>>> What I’ve done so far is to use the 1,000 or so immune-related gene
>>> proteins from the urchin paper you referenced as tblastn queries (E <
>>> 0.001) against our IonTorrent database (I removed almost 2/3 of the reads
>>> using a filter of 90% of bases in a sequence had to have a quality quality
>>> cut-off value >= Phred=20). We got significant hits from all but a handful
>>> of genes.
>>>
>>> Is running mirabait doable on my quad core MacBook Pro with 16 GB
>>> memory, or would have to use our HPCC?
>>>
>>> I see this caveat in the manual: ‘Do not compute hash statistics from a
>>> file with sequences, but instead treat the baitfilename as file name of a
>>> valid mirabait hash statistics file and load it from disk… there are
>>> currently no fool-guards implemented. This means that the user must
>>> absolutely make sure to use the same mirabait value for 'k' both in the run
>>> which generated the hash statistics file and in the search using the
>>> pre-computed file or else results will be (horribly) wrong.’
>>>
>>> Does the fact that I have no idea what this means doom me?
>>>
>>> > and assemble them independently for comparison with a reference or
>>> whatever else you are trying to do.
>>>
>>> Using MIRA? And could these be done on my Mac?
>>>
>>> Thanks again for your help.
>>>
>>> John
>>> --
>>> You have received this mail because you are subscribed to the mira_talk
>>> mailing list. For information on how to subscribe or unsubscribe, please
>>> visit http://www.chevreux.org/mira_mailinglists.html
>>>
>>
>>
>>
>> --
>> Chris Hoefler, PhD
>> Postdoctoral Research Associate
>> Straight Lab
>> Texas A&M University
>> 2128 TAMU
>> College Station, TX 77843-2128
>>
>>
>>
>
>
> --
> Chris Hoefler, PhD
> Postdoctoral Research Associate
> Straight Lab
> Texas A&M University
> 2128 TAMU
> College Station, TX 77843-2128
>
>
>


-- 
Chris Hoefler, PhD
Postdoctoral Research Associate
Straight Lab
Texas A&M University
2128 TAMU
College Station, TX 77843-2128

Other related posts: