[mira_talk] Re: large hybrid assembly w/ minimal ram

  • From: "Wachholtz, Michael" <mwachholtz@xxxxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Sun, 14 Nov 2010 22:33:14 -0600

I found two valuable tools for dealing with Solexa data. Our project
will eventually end up with 24 lanes of Solexa data, roughly
720,000,000 55bp reads. This is a MASSIVE amount of data. We are using
the FASTX-Toolkit and another program called fastq. The FASTX-Toolkit
includes quality-filtering programs. On a per-lane basis (~30,000,000
reads per lane), filtering out any read with a quality score below 30
removes about 40% of the reads; a lower cutoff, say 20, removes only
about 20%. But since we have so many reads, we think it is safe to use
such strict criteria. After that, for each lane, we used the fastq
program to collapse identical reads; it keeps the read with the best
average score. This removes about 40% of the reads again. Some
sequences had hundreds to thousands of identical copies, most likely
from highly expressed (housekeeping) genes. Our Illumina transcriptome
data is not normalized.
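For anyone curious, the collapsing logic is roughly the following (a
minimal Python sketch of the idea, not the actual fastq program; it
assumes phred+64 qualities as in Illumina 1.3+ data, and that one
lane's reads fit in RAM):

    # collapse_identical.py -- keep only the best-scoring copy of
    # each identical sequence in a FASTQ file.
    import sys

    OFFSET = 64  # quality ASCII offset; use 33 for Sanger-style files

    def mean_quality(qual):
        return sum(ord(c) - OFFSET for c in qual) / float(len(qual))

    best = {}  # sequence -> (mean quality, header, quality string)
    with open(sys.argv[1]) as fh:
        while True:
            header = fh.readline().rstrip()
            if not header:
                break
            seq = fh.readline().rstrip()
            fh.readline()               # the '+' separator line
            qual = fh.readline().rstrip()
            mq = mean_quality(qual)
            if seq not in best or mq > best[seq][0]:
                best[seq] = (mq, header, qual)

    for seq, (mq, header, qual) in best.items():
        sys.stdout.write("%s\n%s\n+\n%s\n" % (header, seq, qual))
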
After that we assemble in ABySS or Velvet. We are still running trials
to determine the best k-mer value to use (trying to keep it low to
catch weakly expressed genes). One reason we used such strict quality
filtering is that these programs, to my knowledge, will not output a
consensus quality sequence for their contigs. MIRA will output
consensus quality scores, but assembling Solexa data with MIRA is far
too slow (it took over 3 days to assemble 15,000,000 reads).
After the contigs are assembled, we simply make fake quality files for
each contig, giving every basepair a score of 30. Then we fragment
these contigs using the fasta2fragment program included with MIRA.
After that, we can treat the fragments either as Sanger reads or as
454 pseudo-reads (for 454, the fake quality scores would have to be
scaled to the 454 range, so a score of 40-50 instead of 30) and
assemble them together with our 454 reads in MIRA or Newbler. Newbler
is much faster, but at some point we will assemble with MIRA to help
catch SNPs & indels and produce an assembly that is easy to process &
visualize.
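To make the fragmenting step concrete, here is a minimal Python sketch
of the idea (this is not the MIRA script itself; fragment length,
overlap, and the constant fake score are the knobs to adjust):

    # pseudo_reads.py -- fragment assembled contigs into overlapping
    # chunks and emit a matching .qual file with a constant fake score.
    import sys

    FRAG_LEN = 500   # pseudo-read length
    STEP     = 250   # offset between fragments (50% overlap)
    SCORE    = 30    # fake per-base quality (scale up for 454)

    def read_fasta(path):
        name, seq = None, []
        for line in open(path):
            line = line.rstrip()
            if line.startswith(">"):
                if name:
                    yield name, "".join(seq)
                name, seq = line[1:].split()[0], []
            else:
                seq.append(line)
        if name:
            yield name, "".join(seq)

    fasta_out = open("pseudo_reads.fasta", "w")
    qual_out  = open("pseudo_reads.fasta.qual", "w")
    for name, seq in read_fasta(sys.argv[1]):
        for i in range(0, max(len(seq) - FRAG_LEN, 0) + 1, STEP):
            frag = seq[i:i + FRAG_LEN]
            frag_name = "%s_frag%d" % (name, i)
            fasta_out.write(">%s\n%s\n" % (frag_name, frag))
            qual_out.write(">%s\n%s\n" %
                           (frag_name,
                            " ".join([str(SCORE)] * len(frag))))
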
Our transcriptome is littered with transposable elements. Our game
plan is to do a rough, quick assembly, annotate it, then remove the
contigs (and the reads that map to them) that are transposable
elements or housekeeping genes (a sketch of that read-removal
bookkeeping follows below), and finally reassemble with MIRA to get a
refined, easy-to-process assembly. We recently acquired a copy of the
pre-release Newbler 2.5, which everyone is raving about since a recent
publication rated it the best assembler tested; however, the program
has its pros and cons. I believe the pipeline I explained above will
make our dataset MIRA friendly. Hopefully I will have more results
within the next few weeks. All in all, our dataset is making me pull
my hair out. We are working with 2 strains, one tetraploid and the
other hexaploid. The variation between and within strains will be a
challenge for any assembler.
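The removal step itself is just bookkeeping. Assuming you can export a
read-to-contig table from the rough assembly (the two-column format
used here is hypothetical; your assembler's actual report will
differ), dropping the blacklisted reads looks roughly like this:

    # drop_mapped_reads.py -- sketch: remove reads that mapped to
    # blacklisted contigs (transposable elements, housekeeping genes)
    # before reassembly. Assumes a hypothetical two-column
    # "read<TAB>contig" table; adapt to your assembler's real output.
    import sys

    blacklist = set(line.strip() for line in open("te_contigs.txt"))

    bad_reads = set()
    for line in open("read_to_contig.tsv"):
        read, contig = line.rstrip().split("\t")
        if contig in blacklist:
            bad_reads.add(read)

    keep = True
    for line in open(sys.argv[1]):      # the original reads, FASTA
        if line.startswith(">"):
            keep = line[1:].split()[0] not in bad_reads
        if keep:
            sys.stdout.write(line)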

The FASTX-Toolkit can be found here: http://hannonlab.cshl.edu/fastx_toolkit/
The command we used for filtering was "fastq_quality_filter -q 30 -p
100 -i s5sequence.fa -o outfile.fa", which keeps only reads in which
100% of the basepairs have a score of at least 30. Since you have a
smaller dataset, you may want to be less strict with your filtering
criteria.
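If you want to see where a sensible cutoff sits for your own data, you
can tabulate survival rates at a few thresholds first; a rough Python
sketch (the OFFSET of 64 assumes phred+64 data, use 33 for
Sanger-style files):

    # survival.py -- fraction of reads passing an "all bases >= q"
    # filter at several cutoffs, to help pick a -q value.
    import sys

    OFFSET = 64
    CUTOFFS = [10, 20, 30]
    passed = dict((q, 0) for q in CUTOFFS)
    total = 0

    with open(sys.argv[1]) as fh:
        while True:
            if not fh.readline():       # header; empty at EOF
                break
            fh.readline()               # sequence
            fh.readline()               # '+' separator
            qual = fh.readline().rstrip()
            total += 1
            minq = min(ord(c) - OFFSET for c in qual)
            for q in CUTOFFS:
                if minq >= q:
                    passed[q] += 1

    for q in CUTOFFS:
        print("q>=%d: %.1f%% of %d reads kept"
              % (q, 100.0 * passed[q] / max(total, 1), total))
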
The fastq program can be found here:
https://github.com/brentp/bio-playground/tree/master/reads-utils/
After compiling the source code, we ran "./fastq filter --adjust 64
--unique /path/to/your.fasta > unique.fasta" to remove identical
reads. This may affect coverage statistics in the assembler, but that
shouldn't matter much here, since the library is not normalized and
some transcripts will have very low coverage regardless.
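A side note on --adjust 64: presumably this flag is needed because our
qualities are phred+64 encoded (Illumina 1.3+). If a downstream tool
expects the phred+33 encoding instead, re-encoding is trivial; a
sketch:

    # phred64to33.py -- re-encode FASTQ quality strings from phred+64
    # (Illumina 1.3+) to phred+33 (Sanger): subtract 31 per character.
    import sys

    with open(sys.argv[1]) as fh:
        while True:
            header = fh.readline()
            if not header:
                break
            sys.stdout.write(header)
            sys.stdout.write(fh.readline())   # sequence
            sys.stdout.write(fh.readline())   # '+' separator
            qual = fh.readline().rstrip()
            sys.stdout.write("".join(chr(ord(c) - 31) for c in qual))
            sys.stdout.write("\n")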

On Sun, Nov 14, 2010 at 5:55 PM, Marshall Hampton <hamptonio@xxxxxxxxx> wrote:
> Hi Michael,
>
> I'm very curious to hear how things go for you.  I am doing a project
> with a mammalian transcriptome, differential expression from 18
> samples.  We have 3.7 million 454 reads I have assembled with MIRA,
> and very soon we should get another 60 million reads from an Illumina
> run.  So overall it's very similar to what you are doing.
>
> -Marshall Hampton
>
> On Mon, Nov 1, 2010 at 4:35 PM, Wachholtz, Michael
> <mwachholtz@xxxxxxxxxxx> wrote:
>> I think that is the pipeline I will use: assemble the Solexa data
>> using the velvet/oases package. I like this because it is fast and is
>> one of the few Solexa transcriptome assemblers; everything else seems
>> catered to genomic assembly. The only thing I dislike is that no
>> consensus quality sequence is output, so if I fragment these contigs
>> and turn them into 454 pseudo-reads, they will have no quality
>> scores. I'm working with 2 strains, one hexaploid and the other
>> tetraploid. I fear I am swimming into deep dark waters, but I am
>> hoping that MIRA will help me to identify the majority of inter- &
>> intra-organism SNPs. I would also like to catch indels greater than
>> 3bp. Does anyone know how to tweak the MIRA 454 parameters to help
>> catch indels while also dealing with sequencing error/homopolymer
>> issues in 454 reads?
>>
>> On Mon, Nov 1, 2010 at 3:51 PM, Laurent MANCHON <lmanchon@xxxxxxxxxxxxxx> 
>> wrote:
>>> Hi,
>>>
>>> Personally I use velveth & velvetg to assemble Solexa reads because
>>> Velvet is much faster;
>>> it takes ~6 hours to assemble 30 million 75bp reads.
>>> Then I use MIRA to re-assemble the contigs produced by Velvet.
>>>
>>>
>>> Laurent --
>>>
>>>
>>>
>>>
>>> Wachholtz, Michael wrote:
>>>>
>>>> I am currently trying to do a hybrid transcriptome assembly with both
>>>> 454 and Solexa reads, which will lead to an eventual RNA-Seq analysis.
>>>> The research concerns 2 strains of buffalograss, one resistant to
>>>> chinch bugs (tetraploid) and another susceptible to chinch bugs
>>>> (hexaploid). We have 5 half-plate runs of 454 data (~400,000
>>>> reads/run) and 11 lanes of Solexa data (each lane producing
>>>> 30 million 55bp reads). Our best computer is a quad-core with a
>>>> 1.5-terabyte HD and 25GB of RAM.
>>>> My questions concern making the best hybrid assembly from this data,
>>>> and also flagging inter- & intra-organism SNPs.
>>>> I have seen two methods described with MIRA. The first is that we
>>>> could assemble each Solexa lane separately (I think our RAM can only
>>>> handle 1 lane assembly at a time), then break the assembled contigs &
>>>> unassembled reads into 454 pseudo-reads, then combine those with the
>>>> 454 reads and assemble with 454 settings. My questions regarding this
>>>> are: how would we fragment the Solexa contigs into pseudo-reads for
>>>> 454? Do I just break the contigs into 500bp chunks? Do I need to
>>>> adjust the quality scores since Solexa uses a different scoring
>>>> scheme? Also, since it is so computationally expensive to assemble
>>>> Solexa with MIRA (we are assembling 1 lane currently, and it is
>>>> already at the 24hr mark...still running), is there another fast and
>>>> ACCURATE Solexa assembly program that will produce contigs WITH
>>>> quality scores? I've tried ABySS, but I can't figure out how to get a
>>>> consensus quality score file output for each contig.
>>>>
>>>> The next method I've seen described is to assemble the 454 reads and
>>>> use them as a backbone against which to map/assemble the Solexa reads
>>>> (which would be less expensive than assembling the Solexa runs
>>>> without a backbone, as in the above method). If I do this, will it be
>>>> able to extend/improve/join the 454 contigs/singletons I already
>>>> have? Will these improved contigs show up in the output files? My
>>>> plan would take an iterative approach, trying to extend/join contigs
>>>> with each Solexa run. Since I have 11 Solexa datasets, I would
>>>> assemble them against the backbone one at a time (what my RAM
>>>> permits), but with each iteration I would want the backbone to
>>>> improve and also to include the leftover (unassembled) Solexa reads
>>>> from the previous iteration. The only problem I see with this is that
>>>> the output would include only sequences composed of Solexa reads? In
>>>> the next iteration I would want to include the same 454
>>>> contigs/singletons and the novel Solexa contigs/unassembled reads, as
>>>> well as any 454 contigs that were joined or extended. This would
>>>> require me to merge the datasets somehow, filtering what has and has
>>>> not been mapped to remove redundant sequences. Correct? I also assume
>>>> this would make it more difficult to catch SNPs (which isn't a
>>>> problem, because I can always use SAMtools in the RNA-Seq analysis to
>>>> catch SNPs from the Solexa reads).
>>>>
>>>> Has anyone tried one of these methods, or does anyone prefer a
>>>> particular one and can share the details/problems?
>>>>
>>>>
>>>
>>>
>>
>

--
You have received this mail because you are subscribed to the mira_talk mailing 
list. For information on how to subscribe or unsubscribe, please visit 
http://www.chevreux.org/mira_mailinglists.html
