[mira_talk] Re: large hybrid assembly w/ minimal ram

  • From: Laurent MANCHON <lmanchon@xxxxxxxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Mon, 01 Nov 2010 21:51:40 +0100

Hi,

Personally, I use velveth and velvetg to assemble Solexa reads because they are
much faster: it takes about 6 hours to assemble 30 million 75 bp reads.
Then I use MIRA to re-assemble the contigs produced by Velvet.
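
The two-step Velvet run described above could be scripted roughly as follows. This is only a sketch: the file name, output directory, k-mer length of 31 and minimum contig length of 100 are assumptions for illustration, not values from this thread, and the follow-up MIRA step is not shown. The helper names are hypothetical; the velveth/velvetg options used (-fastq, -short, -cov_cutoff, -min_contig_lgth) are standard Velvet options.

```python
import subprocess
from pathlib import Path


def velvet_commands(reads_fastq, out_dir="velvet_out", kmer=31, min_contig_len=100):
    """Build the velveth and velvetg command lines for one lane of short reads.

    All default values are illustrative assumptions, not tuned settings.
    """
    if kmer % 2 == 0:
        # Velvet expects an odd hash length.
        raise ValueError("kmer must be odd")
    velveth = ["velveth", out_dir, str(kmer), "-fastq", "-short", reads_fastq]
    velvetg = ["velvetg", out_dir,
               "-cov_cutoff", "auto",
               "-min_contig_lgth", str(min_contig_len)]
    return [velveth, velvetg]


def run_velvet(reads_fastq, out_dir="velvet_out", **kwargs):
    """Run both steps and return the path where Velvet writes its contigs."""
    for cmd in velvet_commands(reads_fastq, out_dir, **kwargs):
        subprocess.run(cmd, check=True)  # raises if Velvet exits with an error
    return Path(out_dir) / "contigs.fa"
```

Calling run_velvet("lane1.fastq") would need Velvet installed and on the PATH; velvet_commands alone only builds the argument lists, so it can be inspected without running anything.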


Laurent




Wachholtz, Michael wrote:
I am currently trying to do a hybrid transcriptome assembly with both
454 and Solexa reads, which will lead to an eventual RNA-Seq analysis.
The research concerns 2 strains of buffalograss, one that is
resistant to chinch bugs (tetraploid) and another that is susceptible
to chinch bugs (hexaploid). We have 5 half-plate runs of 454 data
(~400,000 reads/run) and 11 lanes of Solexa data (each lane
producing 30 million 55 bp reads). Our best computer is a quad-core with
a 1.5 TB hard drive and 25 GB of RAM.
My questions concern making the best hybrid assembly from this data,
and also flagging inter- and intra-organism SNPs.
I have seen two methods described with MIRA. The first is that we
could assemble each Solexa lane separately (I think our RAM can only
handle 1 lane assembly at a time), then break the assembled contigs and
unassembled reads into 454 pseudo-reads, then combine these with the 454
reads and assemble with 454 settings. My questions regarding this are: how
would we fragment the Solexa contigs into pseudo-reads for 454? Do I
just break the contigs into 500 bp chunks? Do I need to adjust the
quality scores, since Solexa uses a different scoring scheme? Also,
since it is so computationally expensive to assemble Solexa reads with
MIRA (we are assembling 1 lane currently, and it is already at the 24-hour
mark and still running), is there another fast and ACCURATE Solexa
assembly program that will produce contigs WITH quality scores? I've
tried ABySS, but can't figure out how to get it to output a consensus
quality score file for each contig.
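
One way the fragmentation step asked about above might look, as a minimal sketch and not a recipe confirmed for MIRA: contigs are cut into 500 bp chunks as in the question, but with an overlap between neighbouring chunks, since chunks that share no sequence give an assembler nothing to rejoin them on. The 100 bp overlap and the flat quality value of 30 are assumptions chosen for illustration, and the function names are hypothetical. The last function applies the standard conversion from Solexa-scaled scores to Phred scores, Q_phred = 10 * log10(10^(Q_solexa / 10) + 1); it is only relevant if the read files really use the older Solexa scaling rather than Phred scores.

```python
import math


def fragment_contig(seq, chunk_len=500, overlap=100):
    """Split one contig into overlapping chunks of at most chunk_len bases."""
    if overlap >= chunk_len:
        raise ValueError("overlap must be smaller than chunk_len")
    if len(seq) <= chunk_len:
        return [seq]
    step = chunk_len - overlap
    chunks = []
    start = 0
    while start + chunk_len < len(seq):
        chunks.append(seq[start:start + chunk_len])
        start += step
    # Anchor the final chunk at the contig end so no bases are dropped.
    chunks.append(seq[-chunk_len:])
    return chunks


def pseudo_reads(contig_name, seq, flat_quality=30, **kwargs):
    """Return (name, sequence, qualities) tuples for one contig.

    flat_quality is an assumed placeholder: contigs without consensus
    qualities carry no real per-base confidence.
    """
    reads = []
    for i, chunk in enumerate(fragment_contig(seq, **kwargs), start=1):
        reads.append((f"{contig_name}_pseudo{i}", chunk, [flat_quality] * len(chunk)))
    return reads


def solexa_to_phred(q_solexa):
    """Convert one Solexa-scaled quality score to the nearest Phred score."""
    return round(10 * math.log10(10 ** (q_solexa / 10) + 1))
```

For a 1,200 bp contig this yields three 500 bp pseudo-reads starting at positions 0, 400 and 700. The two scales differ mainly at the low end: a Solexa score of 0 maps to Phred 3, while 40 stays 40.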

The next method I've seen described is to assemble the 454 reads and
use them as a backbone to map/assemble the Solexa reads (which would
be less expensive than assembling Solexa runs without a
backbone, as in the above method). If I do this, will it be able to
extend/improve/join the 454 contigs/singletons I already have? Will
these improved contigs show up in the output files? My plan would take
an iterative approach, trying to extend/join contigs with each Solexa
run. Since I have 11 Solexa datasets, I would assemble these to the
backbone one at a time (what my RAM permits), but with each iteration
I would want the backbone to improve and also include
leftover (unassembled) Solexa reads from the previous iteration. The
only problem I see with this is that the output will only include
sequences composed of Solexa reads? In the next iteration I will want
to include the same 454 contigs/singletons and the new novel Solexa
contigs/unassembled reads, as well as 454 contigs that were joined or
extended. This would require me to merge the datasets somehow, filtering
what has been mapped and unmapped to remove redundant
sequences. Correct? I also assume this would make it more difficult to
catch SNPs (which isn't a problem, because I can always use SAMtools in
the RNA-Seq analysis to catch SNPs through the Solexa reads).
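
The bookkeeping between iterations could be sketched along these lines. It assumes the sequences are already loaded as name-to-sequence dictionaries and that the names of the reads placed in contigs are known from the assembler output; how to obtain those from MIRA is not covered here, and the function names are hypothetical. The sketch only removes exact duplicates (a sequence and its reverse complement count as the same); an old backbone contig that is contained in a longer, extended contig would need a separate containment check.

```python
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Reverse-complement a DNA sequence (A, C, G, T, N only)."""
    return seq.translate(COMPLEMENT)[::-1]


def canonical(seq):
    """Strand-independent key: the smaller of the sequence and its reverse complement."""
    upper = seq.upper()
    return min(upper, reverse_complement(upper))


def leftover_reads(all_reads, placed_names):
    """Reads (name -> sequence) that were not placed in any contig."""
    return {name: seq for name, seq in all_reads.items() if name not in placed_names}


def merge_nonredundant(*datasets):
    """Merge name -> sequence dicts, keeping the first copy of each sequence.

    Exact duplicates on either strand are dropped; contained sequences are not.
    """
    seen = set()
    merged = {}
    for dataset in datasets:
        for name, seq in dataset.items():
            key = canonical(seq)
            if key in seen:
                continue
            if name in merged:
                raise ValueError(f"duplicate sequence name: {name}")
            seen.add(key)
            merged[name] = seq
    return merged
```

The input for the next iteration would then be something like merge_nonredundant(backbone_contigs, novel_contigs, leftover_reads(lane_reads, placed_names)), with datasets listed in order of priority because the first copy of a duplicated sequence is the one kept.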

Has anyone tried one of these methods, or does anyone prefer a particular
one and can share the details/problems?



--
You have received this mail because you are subscribed to the mira_talk mailing 
list. For information on how to subscribe or unsubscribe, please visit 
http://www.chevreux.org/mira_mailinglists.html
