[mira_talk] Re: large hybrid assembly w/ minimal ram

  • From: Marshall Hampton <hamptonio@xxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Sun, 14 Nov 2010 17:55:37 -0600

Hi Michael,

I'm very curious to hear how things go for you.  I am doing a project
with a mammalian transcriptome, differential expression from 18
samples.  We have 3.7 million 454 reads I have assembled with MIRA,
and very soon we should get another 60 million reads from an Illumina
run.  So overall it's very similar to what you are doing.

-Marshall Hampton

On Mon, Nov 1, 2010 at 4:35 PM, Wachholtz, Michael
<mwachholtz@xxxxxxxxxxx> wrote:
> I think that is the pipeline I will use: assemble the solexa data using
> the velvet/oases package. I like this because it is fast and one of
> the few solexa transcriptome assemblers; everything else seems catered
> to genomic assemblies. The only thing I dislike is that no consensus
> quality sequence is output. So if I fragment these contigs and turn
> them into 454 pseudo-reads, they will have no quality scores.  I'm
> working with 2 strains, one hexaploid and the other tetraploid. I
> fear I am swimming into deep dark waters, but I am hoping that MIRA will
> help me to identify the majority of inter- & intra-organism SNPs. I
> would like to catch indels greater than 3bp also. Does anyone know how
> to tweak the MIRA 454 parameters to help catch indels but also deal
> with sequencing error/homopolymer issues in 454 reads?
>
> On Mon, Nov 1, 2010 at 3:51 PM, Laurent MANCHON <lmanchon@xxxxxxxxxxxxxx> 
> wrote:
>> Hi,
>>
>> Personally I use velveth & velvetg to assemble Solexa reads because it is
>> much faster:
>> it takes ~6 hours to assemble 30 million 75bp reads.
>> Then I use MIRA to re-assemble the contigs provided by Velvet.
>>
>>
>> Laurent
>>
>>
>>
>>
>> Wachholtz, Michael wrote:
>>>
>>> I am currently trying to do a hybrid transcriptome assembly with both
>>> 454 and Solexa reads, which will lead to an eventual RNA-Seq analysis.
>>> The research is regarding 2 strains of buffalograss, one which is
>>> resistant to chinch bugs (tetraploid) and another that is susceptible
>>> to chinch bugs (hexaploid). We have 5 half-plate runs of 454 data
>>> (~400,000 reads/run) and 11 lane runs of solexa data (each lane
>>> producing 30 million 55bp reads). Our best computer is quad-core with
>>> a 1.5 terabyte HD and 25GB RAM.
>>> My questions regard making the best hybrid assembly with this data,
>>> and flagging inter- & intra-organism SNPs also.
>>> I have seen two methods described with mira. The first being that we
>>> could assemble each solexa lane separately (I think our RAM can only
>>> handle 1 lane assembly at a time) then break the assembled contigs &
>>> unassembled reads into 454 pseudo-reads. Then combine with 454 reads
>>> and assemble with 454 settings. My questions regarding this are: how
>>> would we fragment the solexa contigs into pseudo-reads for 454? Do I
>>> just break the contigs into 500bp chunks? Do I need to adjust the
>>> quality scores since solexa uses a different scoring scheme? Also,
>>> since it is so computationally expensive to assemble solexa with mira
>>> (we are assembling 1 lane currently, and it is already at the 24hr
>>> mark...still running), is there another fast and ACCURATE solexa
>>> assembly program that will produce contigs WITH quality scores? I've
>>> tried abyss, but can't figure out how to get a consensus quality score
>>> file to output for each contig.
>>>
>>> The next method I've seen described is to assemble the 454 reads and
>>> use them as a backbone to map/assemble the solexa reads (which would
>>> be less expensive in contrast to assembling solexa runs without a
>>> backbone, as in the above method). If I do this, will it be able to
>>> extend/improve/join the 454 contigs/singletons I already have? Will
>>> these improved contigs show up in the output files? My plan would take
>>> an iterative approach, trying to extend/join contigs with each solexa
>>> run. Since I have 11 solexa datasets, I would assemble these to the
>>> backbone one at a time (what my RAM permits), but with each iteration
>>> I would want the backbone to improve and also include
>>> leftover (unassembled) solexa reads from the previous iteration. The
>>> only problem I see with this is that the output will only include
>>> sequences comprised of solexa reads? In the next iteration I will want
>>> to include the same 454 contigs/singletons and the new solexa novel
>>> contigs/unassembled reads, as well as 454 contigs that were joined or
>>> extended. This would require me merging the dataset somehow, having to
>>> filter what has been mapped and unmapped to remove redundant
>>> sequences. Correct? I also assume this would make it more difficult to
>>> catch SNPs (which isn't a problem because I can always use SAMTools in
>>> the RNA-Seq analysis to catch SNPs through the solexa reads)
>>>
>>> Has anyone tried one of these methods or prefers a particular one, and
>>> can share the details/problems?
>>>
>>>
>>
>>
>> --
>> You have received this mail because you are subscribed to the mira_talk
>> mailing list. For information on how to subscribe or unsubscribe, please
>> visit http://www.chevreux.org/mira_mailinglists.html
>>


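The thread asks how Solexa contigs might be broken into 454-style pseudo-reads, and whether Solexa qualities need adjusting. A minimal Python sketch of one possible approach follows. It is not a documented MIRA or Velvet procedure. The function names, the 500bp chunk size (taken from the question above), the 100bp overlap and the placeholder quality of 30 are all assumptions for illustration. The quality conversion only applies if the reads really are on the older Solexa scale rather than already Phred-scaled.

```python
import math


def solexa_to_phred(q_solexa):
    """Map a Solexa-scaled quality to the nearest integer Phred quality."""
    return round(10 * math.log10(10 ** (q_solexa / 10) + 1))


def fragment_contig(name, seq, chunk=500, overlap=100, dummy_qual=30):
    """Yield (read_name, read_seq, quals) pseudo-reads tiled along one contig.

    chunk, overlap and dummy_qual are illustrative defaults (assumptions),
    not values recommended by MIRA or by anyone in this thread.
    """
    if not 0 <= overlap < chunk:
        raise ValueError("overlap must be >= 0 and smaller than chunk")
    if not seq:
        return
    step = chunk - overlap
    start, index = 0, 1
    while True:
        piece = seq[start:start + chunk]
        # Contigs without consensus qualities get one placeholder value per base.
        yield f"{name}_frag{index}", piece, [dummy_qual] * len(piece)
        if start + chunk >= len(seq):
            break  # this piece already reaches the end of the contig
        start += step
        index += 1


def to_fasta_and_qual(reads):
    """Format pseudo-reads as a FASTA string and a matching .qual string."""
    fasta, qual = [], []
    for name, seq, quals in reads:
        fasta.append(f">{name}\n{seq}")
        qual.append(f">{name}\n{' '.join(str(q) for q in quals)}")
    return "\n".join(fasta) + "\n", "\n".join(qual) + "\n"


if __name__ == "__main__":
    # Hypothetical 1200bp contig, used only to show the tiling.
    demo = list(fragment_contig("contig1", "ACGT" * 300))
    for read_name, read_seq, _ in demo:
        print(read_name, len(read_seq))
    fasta_text, qual_text = to_fasta_and_qual(demo)
    print(fasta_text.count(">"), "FASTA records")
```

Overlapping the chunks is a design choice meant to let the assembler rejoin neighbouring pseudo-reads. Whether a flat placeholder quality is appropriate for SNP calling in polyploid data is an open question that this sketch does not settle.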