[mira_talk] Re: large hybrid assembly w/ minimal ram

  • From: "Wachholtz, Michael" <mwachholtz@xxxxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Tue, 16 Nov 2010 20:39:24 -0600

I recently discovered that ABySS has a trans-ABySS package. While this
program is geared towards expression analysis with a reference genome,
it has a merge.pl script. We have been on the fence about what k-mer
value to use, since the results are hard to interpret. This package
will instead do an assembly for every k-mer value from i/2 to i (where
i is the read length) and merge all the contigs into a final assembly.

Our computer is a quad-core with 25GB RAM. It takes ABySS less than
1 hour to assemble ~100,000,000 reads. Very fast!!

Since all of our Illumina reads were filtered to contain mostly 30+
quality scores, we just run this assembly through MIRA's fasta2frag
program. This will output a quality score file for the fragments,
putting in a value of 30 for each bp (saves me the work of writing a
script for this). Then we just treat the fragments as Sanger reads and
do a hybrid assembly with our 454 reads in MIRA.

If anyone has done Illumina transcriptome assembly with the
Velvet/Oases package instead of ABySS, I would like to hear your
thoughts about the advantages or the technique you used. While ABySS
seems to do a fine job of catching SNPs and logging them as "popped
bubbles", I'm not sure how it handles indels & transcript variants.

Once we have a complete assembly, our goal is to do RNA-Seq analysis
with the original Illumina data. While MIRA will catch a large
majority of SNPs during assembly, some of the SNP/variation data will
have been lost in the ABySS assembly. However, this "lost" information
can easily be recovered when we map the reads using bowtie and the
SAM/BAM tools.
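
For anyone who wants to see the two mechanical steps above spelled out,
here is a minimal Python sketch. It is an illustration only, not what
trans-ABySS or fasta2frag actually run: the function names (kmer_sweep,
read_fasta, write_constant_qual) are made up for this example, i/2 is
assumed to round down for odd read lengths, and the quality writer only
attaches a constant score to each contig base. It does not cut the
contigs into fragments the way fasta2frag does.

```python
import io


def kmer_sweep(read_length):
    """Return every k from read_length/2 (rounded down) to read_length, inclusive."""
    return list(range(read_length // 2, read_length + 1))


def read_fasta(handle):
    """Yield (name, sequence) pairs from an open FASTA stream."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            fields = line[1:].split()
            # Keep only the first word of the header as the sequence name.
            name, chunks = (fields[0] if fields else ""), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def write_constant_qual(fasta_in, qual_out, quality=30, per_line=20):
    """Write a FASTA-style quality file giving every base the same score."""
    for name, seq in read_fasta(fasta_in):
        qual_out.write(f">{name}\n")
        scores = [str(quality)] * len(seq)
        for start in range(0, len(scores), per_line):
            qual_out.write(" ".join(scores[start:start + per_line]) + "\n")


if __name__ == "__main__":
    # Hypothetical 36 bp reads: one assembly per k in this list.
    print(kmer_sweep(36))

    # Tiny in-memory example instead of real contig files.
    contigs = io.StringIO(">contig1\nACGTACGT\n>contig2\nGGCC\n")
    qual = io.StringIO()
    write_constant_qual(contigs, qual, quality=30)
    print(qual.getvalue(), end="")
```

The constant score of 30 only makes sense here because the input reads
were already filtered to mostly 30+ qualities; with unfiltered data a
flat value would overstate the confidence of the contig bases.
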

On Tue, Nov 16, 2010 at 2:55 PM, Sven Klages
<sir.svencelot@xxxxxxxxxxxxxx> wrote:
> oh, yes. I see, .. I just wanted to use it for my own data and was quite
> astonished ;-)
> fasta output, no qualities ... not of any use for me either ..
> cheers,
> Sven
>
> 2010/11/16 Wachholtz, Michael <mwachholtz@xxxxxxxxxxx>
>>
>> I have, but the output is in fasta format with no quality scores. The
>> only advantage this program has is that it will output how many
>> identical reads there were. I prefer the fastq program in that it will
>> retain the quality score of best sequence and will output in fastq
>> format.
>>
>> On Mon, Nov 15, 2010 at 5:18 AM, Sven Klages
>> <sir.svencelot@xxxxxxxxxxxxxx> wrote:
>> > Hi Michael,
>> >
>> > 2010/11/15 Wachholtz, Michael <mwachholtz@xxxxxxxxxxx>
>> >>
>> >> [...]
>> >>
>> >> it is safe to use such strict criteria. After that, for each lane, we
>> >> used the fastq program to collapse/remove any identical reads. This
>> >
>> > [...]
>> >
>> > just a short question. You have successfully used the FASTX-Toolkit to
>> > quality-clip your data;
>> > this tool collection also contains a program to remove duplicates from
>> > NGS
>> > data:
>> >
>> > FASTQ/A Collapser
>> > Collapsing identical sequences in a FASTQ/A file into a single sequence
>> > (while maintaining reads counts)
>> >
>> > Have you tried this for your data?
>> >
>> > cheers,
>> > Sven
>> >
>> >
>>
>> --
>> You have received this mail because you are subscribed to the mira_talk
>> mailing list. For information on how to subscribe or unsubscribe, please
>> visit http://www.chevreux.org/mira_mailinglists.html
>
>
