[mira_talk] Re: large hybrid assembly w/ minimal ram

  • From: Bastien Chevreux <bach@xxxxxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Mon, 1 Nov 2010 21:37:58 +0100

On Freitag 29 Oktober 2010 Wachholtz, Michael wrote:
> [...]
> My questions regard making the best hybrid assembly with this data,
> and flagging inter & intra organism SNPs also.
> I have seen two methods described with mira. The first being that we
> could assemble each solexa lane separately ( I think our RAM can only
> handle 1 lane assembly at a time) then break the assembled contigs &
> unassembled reads into 454 pseudo-reads. Then combine with 454 reads
> and assemble with 454 settings. My questions regarding this are: how
> would we fragment the solexa contigs into pseudo reads for 454? Do I
> just break the contigs into 500bp chunks?

There will be a description for doing things like this in the 3.2.1 version of 
the docs, but at the moment let's just say "yes".

Instead of doing it yourself, you should perhaps investigate the 

  fasta2frag.tcl

script of the MIRA distribution which does a number of things so that MIRA can 
work with it right away. E.g.

  fasta2frag.tcl -l 500 -i 250 -r 2 in.fasta out.fasta

will create 500bp fragments, overlapping with each other by 250 bp and every 
second fragment reverse complemented. With that, MIRA will be totally happy 
and even the standard quality checking algorithms won't clip anything.

Depending on the length of your Solexa contigs, you may want to investigate 
the "-p" option of fasta2frag.

> Do I need to adjust the
> quality scores since solexa uses a different scoring scheme? 

It's only slightly different, the scheme from the newer pipelines is even more 
"phred" like. No, no need to adjust scores. Well, at least I think ... maybe 
you'll want to play with it.

> Also,
> since it is so computationally expensive to assemble solexa with mira
> (we are assembling 1 lane currently, and is already at the 24hr
> mark...still running), is there another fast and ACCURATE solxea
> assembly program that will produce contigs WITH quality scores?

If you find one, tell me please. Fast: yes, there are now a lot of them. But 
fast and accurate (at least accurate to what I want it to be) ... difficult. 
You might want to try WGS, but I think they don't do transcriptome assembly, 
just genome.

> The next method I've seen described is to assemble the 454 reads and
> use them as a backbone to map/assemble the solexa reads (which would
> be less expensive in contrast to assembling solexa runs without a
> backbone, as in the above method). If I do this, will it be able to
> extend/improve/join the 454 contigs/singletons I already have?

In "normal" mapping a bit yes, but then probably not really as you'd like it. 
MIRA will map reads to the backbones and there will be overhangs to the left 
and right, but only so far as the reads will have an overlap to the backbone.

In "mapping and also build new contigs" mode, yes, backbones will be extended, 
but then again in a fashion that might sometimes surprise: assume two 
backbones (bb1 and bb2) and you now map new data to them and the new data can 
bridge both backbones. Then MIRA will map everything it can to backbone 1, 
then continue on building and extending backbone 1 with new reads without even 
looking to backbone 2. In the most ideal case, backbone 1 will then be 
completely mapped and extended with sequences from backbone 2, and backbone 2 
will be "empty" and devoid of any mapped sequence.

But in no use case will MIRA join two backbones. Sorry, not foreseen in the 
near future.

> Will
> these improved contigs show up in the output files? My plan would take
> an iterative approach, trying to extend/join contigs with each solexa
> run. Since I have 11 solexa datasets, I would assemble these to the
> backbone one at a time (what my RAM permits), but with each iteration
> I would want the backbone to improve and also include
> leftover(unassembled) solexa reads from the previous iteration. The
> only problem I see with this is that the output will only include
> sequences comprised of solexa reads? In the next iteration I will want
> to include the same 454 contigs/singletons and the new solexa novel
> contigs/unassembled reads, as well as 454 contigs that were joined or
> extended. This would require me merging the dataset somehow, having to
> filter what has been mapped and unmapped to remove redundant
> sequences. Correct?

Yes.

> I also assume this would make it more difficult to
> catch SNPs (which isn't a problem because I can always use SAMTools in
> the RNA-Seq analysis to catch SNPs through the solexa reads)
> 
> Has anyone tried one of these methods or prefers a particular one, and
> can share the details/problems?

Whatever you try: a bit more RAM to your machine should be considered. Say ... 
40 GiB more? Never hurts. You *will* run into problems with MIRA and 30m 
Solexa reads, even if they were only 36mers. Use "miramem" to get an idea how 
much RAM you'll need.

And last, but not least: I'm not sure I would want to tackle this problem of 
yours with MIRA. Then again, I wouldn't know anything else giving you "an easy 
ride" through the type of data you have.

Best,
  Bastien

-- 
You have received this mail because you are subscribed to the mira_talk mailing 
list. For information on how to subscribe or unsubscribe, please visit 
http://www.chevreux.org/mira_mailinglists.html

Other related posts: