[mira_talk] Re: large hybrid assembly w/ minimal ram

  • From: "Wachholtz, Michael" <mwachholtz@xxxxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Thu, 19 May 2011 12:56:22 -0500

Regarding Solexa reads and transcriptomes, I have read several papers
that use de Bruijn graph assemblers, but with multiple k-mer values
and merging the results into a final assembly. We are working with
hexaploid and tetraploid plants, so there is a lot of allelic diversity
and many paralogues within the transcriptome. Ideally, we would like to
capture this complexity in the assembly. As mentioned at the beginning
of this thread, we have 454 reads and Solexa reads. I would like your
opinion on this idea:

I have noticed that removing duplicate reads from the Solexa data will
reduce the data size by 30-50%. I assume these reads are from highly
expressed genes. I am thinking of running a Velvet assembly with a very
high k-mer value. This assembly should create reliable sequences for
the highly expressed genes. After that, any reads that were not
assembled into these complete transcripts can be reused in a hybrid
assembly (w/ 454 reads) via MIRA. I assume that removing duplicate
reads and reads from highly expressed transcripts will 1) reduce my
Solexa data size significantly and 2) allow more detailed assembly of
low-abundance transcripts. Can anyone think of problems with this
method?
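
To make the duplicate-removal step concrete, here is a minimal Python
sketch of collapsing identical reads while keeping the copy with the best
quality string and a count of how many copies were seen. It is only an
illustration of the idea, not the FASTX-Toolkit collapser or any other
existing tool. The function names, the use of mean quality to pick the
"best" copy, and the Phred+33 default offset are all assumptions (older
Illumina/Solexa pipelines wrote Phred+64, so pass offset=64 for those).

```python
import io


def read_fastq(handle):
    """Yield (header, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual


def mean_quality(qual, offset=33):
    """Mean Phred score of a quality string (offset is an assumption)."""
    if not qual:
        return 0.0
    return sum(ord(c) - offset for c in qual) / len(qual)


def collapse_duplicates(records, offset=33):
    """Collapse reads with identical sequence.

    Returns a list of (header, sequence, quality, copies), one entry per
    distinct sequence, in order of first appearance. The header and
    quality come from the copy with the highest mean quality.
    """
    best = {}  # sequence -> [header, quality, mean score, copies]
    for header, seq, qual in records:
        score = mean_quality(qual, offset)
        entry = best.get(seq)
        if entry is None:
            best[seq] = [header, qual, score, 1]
            continue
        entry[3] += 1
        if score > entry[2]:
            entry[0], entry[1], entry[2] = header, qual, score
    return [(h, s, q, n) for s, (h, q, _, n) in best.items()]


# Tiny in-memory example: r1 and r2 share a sequence, r2 has better quality.
demo = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nJJJJ\n@r3\nTTTT\n+\nIIII\n")
collapsed = collapse_duplicates(read_fastq(demo))
```

The copy count is kept because it is the only trace of expression level
left after collapsing. Note that the whole table is held in memory, so for
a full lane this would need to be run per file or replaced by a sort-based
approach.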

On Tue, Nov 30, 2010 at 9:14 PM, Wachholtz, Michael
<mwachholtz@xxxxxxxxxxx> wrote:
> I am still configuring trans-abyss, and yes it is not user friendly.
> Our Illumina reads are single end, so many of the steps in trans-abyss
> are skipped. We are only using trans-abyss to merge our multi-k-mer
> assembly to remove redundant contigs. I realize ABySS won't catch
> indels well, but we are only using it to help make our 454 assembly
> better. Since we have no reference genome, we sequenced a normalized
> transcriptome via 454. We then did non-normalized sequencing with
> Illumina. We are merely assembling the Illumina reads and hoping that
> they will close some gaps & join contigs in our 454 assembly.
>
> On Mon, Nov 29, 2010 at 2:00 PM, Robin Kramer <kodream@xxxxxxxxx> wrote:
>> Sven,
>>
>> The problem is that bowtie itself has only limited support for indels
>> since it isn't a true SW aligner, and ABySS in its scaffolding stage
>> doesn't support indels at all (even if bowtie generates them).
>>
>>
>> I am curious as to your experience with the trans package.  Did it do
>> a good job?  The last I checked it was something akin to an NxN blast
>> search and required quite a bit of external configuration to use, and
>> since it was only a perl script I was guessing that it was itself
>> quite slow.
>>
>>
>> On 11/16/10, Wachholtz, Michael <mwachholtz@xxxxxxxxxxx> wrote:
>>> I recently discovered that ABySS has a trans-abyss package. While this
>>> program is geared towards expression analysis with a reference genome,
>>> it has a merge.pl script. We have been on the fence about what
>>> k-mer value to use. The results are hard to interpret, but this
>>> package will do an assembly for every k-mer value from i/2 to i (where
>>> i is the read length) and merge all the contigs into a final assembly.
>>> Our computer is quad-core with 25GB RAM. It takes ABySS less than
>>> 1 hour to assemble ~100,000,000 reads. Very fast!! Since all of our
>>> Illumina reads were filtered to contain mostly 30+ quality scores, we
>>> just run this assembly through MIRA's fasta2frag program. This will
>>> output a quality score file for the fragments, putting in a value of 30
>>> for each bp (saves me the work of writing a script for this). Then we
>>> just treat the fragments as Sanger reads and do a hybrid assembly with
>>> our 454 reads in MIRA. If anyone has done Illumina transcriptome
>>> assembly with the velvet/oases package instead of ABySS, I would like
>>> to hear your thoughts about the advantages or the technique you used.
>>> While ABySS seems to do a fine job of catching SNPs and logging them as
>>> "popped bubbles", I'm not sure how it handles indels & transcript
>>> variants. Once we have a complete assembly, our goal is to do RNA-Seq
>>> analysis with the original Illumina data. While MIRA will catch a large
>>> majority of SNPs during assembly, some of the SNP/variation data will
>>> have been lost in the ABySS assembly. However, this "lost" information
>>> can easily be found when we map reads using bowtie and the bam/sam
>>> tools.
>>>
>>> On Tue, Nov 16, 2010 at 2:55 PM, Sven Klages
>>> <sir.svencelot@xxxxxxxxxxxxxx> wrote:
>>>> oh, yes. I see, .. I just wanted to use it for my own data and was quite
>>>> astonished ;-)
>>>> fasta output, no qualities ... not of any use for me either ..
>>>> cheers,
>>>> Sven
>>>>
>>>> 2010/11/16 Wachholtz, Michael <mwachholtz@xxxxxxxxxxx>
>>>>>
>>>>> I have, but the output is in fasta format with no quality scores. The
>>>>> only advantage this program has is that it will output how many
>>>>> identical reads there were. I prefer the fastq program in that it will
>>>>> retain the quality score of the best sequence and will output in fastq
>>>>> format.
>>>>>
>>>>> On Mon, Nov 15, 2010 at 5:18 AM, Sven Klages
>>>>> <sir.svencelot@xxxxxxxxxxxxxx> wrote:
>>>>> > Hi Michael,
>>>>> >
>>>>> > 2010/11/15 Wachholtz, Michael <mwachholtz@xxxxxxxxxxx>
>>>>> >>
>>>>> >> [...]
>>>>> >>
>>>>> >> it is safe to use such strict criteria. After that, for each lane, we
>>>>> >> used the fastq program to collapse/remove any identical reads. This
>>>>> >
>>>>> > [...]
>>>>> >
>>>>> > just a short question. You have successfully used the FASTX-Toolkit to
>>>>> > quality-clip your data;
>>>>> > this tool collection also contains a program to remove duplicates from
>>>>> > NGS data:
>>>>> >
>>>>> > FASTQ/A Collapser
>>>>> > Collapsing identical sequences in a FASTQ/A file into a single sequence
>>>>> > (while maintaining reads counts)
>>>>> >
>>>>> > Have you tried this for your data?
>>>>> >
>>>>> > cheers,
>>>>> > Sven
>>>>> >
>>>>> >
>>>>>
>>>>> --
>>>>> You have received this mail because you are subscribed to the mira_talk
>>>>> mailing list. For information on how to subscribe or unsubscribe, please
>>>>> visit http://www.chevreux.org/mira_mailinglists.html
>>>>
>>>>
>>>
>>
>
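
For the step in the quoted 16 Nov message where assembled contigs are
given a flat quality of 30 per base, a minimal Python sketch of what such
a "script for this" could look like is given here. It is not fasta2frag
and does not fragment the contigs; it only writes a FASTA-style quality
file with one constant score per base. The function names, the 20 values
per line layout, and the choice to keep only the first word of each
header as the record name are assumptions made for illustration.

```python
import io


def read_fasta(handle):
    """Yield (name, sequence) pairs from a FASTA stream."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the record name.
            name, chunks = (line[1:].split() or [""])[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def flat_qual_text(fasta_handle, score=30, width=20):
    """Return quality-file text giving every base the same score.

    score: constant quality written for each base (30, as in the thread).
    width: number of quality values per output line (an assumption).
    """
    lines = []
    for name, seq in read_fasta(fasta_handle):
        lines.append(">" + name)
        values = [str(score)] * len(seq)
        for start in range(0, len(values), width):
            lines.append(" ".join(values[start:start + width]))
    return "".join(line + "\n" for line in lines)


# Tiny in-memory example with two short contigs.
contigs = io.StringIO(">contig1 len=5\nACG\nTA\n>contig2\nAC\n")
qual_text = flat_qual_text(contigs, score=30, width=3)
```

A flat score discards any per-base confidence the assembler might have
had, so it only makes sense when, as described in the thread, the input
reads were already filtered to mostly 30+ quality.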

