[mira_talk] Re: large hybrid assembly w/ minimal ram

  • From: "Wachholtz, Michael" <mwachholtz@xxxxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Mon, 1 Nov 2010 16:35:14 -0500

I think that is the pipeline I will use: assemble the Solexa data with
the Velvet/Oases package. I like it because it is fast and is one of
the few Solexa transcriptome assemblers; everything else seems geared
toward genomic assemblies. The only thing I dislike is that it does not
output consensus quality values. So if I fragment these contigs and turn
them into 454 pseudo-reads, they will have no quality scores.  I'm
working with 2 strains, one hexaploid and the other tetraploid. I
fear I am swimming into deep dark waters, but I am hoping that MIRA will
help me identify the majority of inter- and intra-organism SNPs. I
would also like to catch indels greater than 3bp. Does anyone know how
to tweak the MIRA 454 parameters to help catch indels while still coping
with the sequencing-error/homopolymer issues in 454 reads?
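
On the fragmenting step mentioned above, here is a rough Python sketch of
one way it could be done. It is only an illustration under assumptions:
the 500bp window, the 250bp overlap and the flat placeholder quality of 30
are arbitrary values chosen for the example, not MIRA recommendations, and
all function names are made up. It cuts each contig into overlapping
pseudo-reads and builds matching FASTA and QUAL text so that every
pseudo-read carries some quality value.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from the lines of a FASTA file."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the contig name.
            name, chunks = (line[1:].split() or ["unnamed"])[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def fragment(seq: str, length: int = 500, overlap: int = 250) -> List[Tuple[int, str]]:
    """Cut seq into windows of `length` bp that overlap by `overlap` bp.

    Contigs no longer than `length` are returned whole. A final window is
    anchored at the contig end so the last bases are not dropped.
    """
    if not 0 <= overlap < length:
        raise ValueError("overlap must be >= 0 and smaller than length")
    if len(seq) <= length:
        return [(0, seq)]
    step = length - overlap
    starts = list(range(0, len(seq) - length + 1, step))
    if starts[-1] + length < len(seq):
        starts.append(len(seq) - length)
    return [(s, seq[s:s + length]) for s in starts]


def pseudo_reads(records, length=500, overlap=250, placeholder_qual=30):
    """Yield (read_name, sequence, qualities) for every fragment.

    The quality is a constant placeholder (an assumption), because the
    contigs themselves come without consensus quality values.
    """
    for name, seq in records:
        for start, frag in fragment(seq, length, overlap):
            yield f"{name}_pseudo_{start}", frag, [placeholder_qual] * len(frag)


def format_fasta_qual(reads) -> Tuple[str, str]:
    """Return (fasta_text, qual_text) for the given pseudo-reads."""
    fasta, qual = [], []
    for name, seq, quals in reads:
        fasta.append(f">{name}\n{seq}\n")
        qual.append(f">{name}\n{' '.join(map(str, quals))}\n")
    return "".join(fasta), "".join(qual)
```

Usage would be something like opening the contig FASTA, passing the file
handle to read_fasta(), and writing the two strings returned by
format_fasta_qual(pseudo_reads(...)) to a .fasta and a .qual file. Whether
a flat quality value is acceptable for SNP calling in polyploid strains is
an open question; it only keeps the reads from having no scores at all.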

On Mon, Nov 1, 2010 at 3:51 PM, Laurent MANCHON <lmanchon@xxxxxxxxxxxxxx> wrote:
> --Hi,
>
> personally I use velveth & velvetg to assemble Solexa reads because it is
> much faster;
> it takes ~6 hours to assemble 30 million 75bp reads.
> Then I use MIRA to re-assemble the contigs provided by Velvet.
>
>
> Laurent --
>
>
>
>
> Wachholtz, Michael wrote:
>>
>> I am currently trying to do a hybrid transcriptome assembly with both
>> 454 and Solexa reads, which will lead to an eventual RNA-Seq analysis.
>> The research concerns 2 strains of buffalograss, one that is
>> resistant to chinch bugs (tetraploid) and another that is susceptible
>> to chinch bugs (hexaploid). We have 5 half-plate runs of 454 data
>> (~400,000 reads/run) and 11 lanes of Solexa data (each lane
>> producing 30 million 55bp reads). Our best computer is a quad-core with
>> a 1.5 terabyte HD and 25GB RAM.
>> My questions regard making the best hybrid assembly with this data,
>> and flagging inter & intra organism SNPs also.
>> I have seen two methods described with mira. The first being that we
>> could assemble each solexa lane separately ( I think our RAM can only
>> handle 1 lane assembly at a time) then break the assembled contigs &
>> unassembled reads into 454 pseudo-reads. Then combine with 454 reads
>> and assemble with 454 settings. My questions regarding this are: how
>> would we fragment the solexa contigs into pseudo reads for 454? Do I
>> just break the contigs into 500bp chunks? Do I need to adjust the
>> quality scores since Solexa uses a different scoring scheme? Also,
>> since it is so computationally expensive to assemble Solexa data with MIRA
>> (we are assembling 1 lane currently, and it is already at the 24hr
>> mark...still running), is there another fast and ACCURATE Solexa
>> assembly program that will produce contigs WITH quality scores? I've
>> tried ABySS, but can't figure out how to get it to output a consensus
>> quality score file for the contigs.
>>
>> The next method I've seen described is to assemble the 454 reads and
>> use them as a backbone to map/assemble the solexa reads (which would
>> be less expensive in contrast to assembling solexa runs without a
>> backbone, as in the above method). If I do this, will it be able to
>> extend/improve/join the 454 contigs/singletons I already have? Will
>> these improved contigs show up in the output files? My plan would take
>> an iterative approach, trying to extend/join contigs with each solexa
>> run. Since I have 11 solexa datasets, I would assemble these to the
>> backbone one at a time (what my RAM permits), but with each iteration
>> I would want the backbone to improve and also include
>> leftover(unassembled) solexa reads from the previous iteration. The
>> only problem I see with this is that the output will only include
>> sequences comprised of solexa reads? In the next iteration I will want
>> to include the same 454 contigs/singletons and the new solexa novel
>> contigs/unassembled reads, as well as 454 contigs that were joined or
>> extended. This would require me to merge the datasets somehow and
>> filter what has been mapped and unmapped to remove redundant
>> sequences. Correct? I also assume this would make it more difficult to
>> catch SNPs (which isn't a problem because I can always use SAMTools in
>> the RNA-Seq analysis to catch SNPs through the solexa reads)
>>
>> Has anyone tried one of these methods or prefers a particular one, and
>> can share the details/problems?
>>
>>
>
>
> --
> You have received this mail because you are subscribed to the mira_talk
> mailing list. For information on how to subscribe or unsubscribe, please
> visit http://www.chevreux.org/mira_mailinglists.html
>
