[mira_talk] Re: Metagenomic Assembly

From: Veljo Kisand <vkisand@xxxxxx>
To: mira_talk@xxxxxxxxxxxxx
Date: Tue, 20 Dec 2011 08:38:55 +0200

On 12/19/2011 11:02 PM, Thomas, Dallas wrote:

Bastien
I am in the process of beginning a rather large metagenomic assembly and am having difficulties deciding where to begin due to the size and complexity of this assembly project. I have roughly 10.27 million SE 454 reads and 600 million PE Illumina reads (300 per end). This gives roughly 16X coverage for the community of 1000 species at 4MB per genome.
At the moment I am debating where I should start. Currently I have been toying with a few different approaches listed below:
1. Trim Illumina Sequences with DynamicTrim and LengthSort. Perform a series of assemblies with varied kmer sizes, using Velvet and SOAPdenovo. Combine contigs from each of these assemblies and perform a final hybrid assembly with Mira.
2. Trim Illumina Sequences with DynamicTrim and LengthSort. Select a relatively large random set of the trimmed Illumina Reads and perform a hybrid assembly with Mira. Split remaining Illumina reads into batches and try to align to the contigs from the hybrid assembly.
3. Perform #2 without the trimming.
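
The random-subset step in approaches 2 and 3 could be sketched roughly as follows. This is only an illustration, not part of any of the tools named above: it assumes the paired reads sit in two unwrapped FASTQ files (four lines per record) with mates in the same order, and the function names, subset size, batch size and seed are all hypothetical.

```python
import random
from itertools import islice


def read_fastq(handle):
    """Yield FASTQ records as tuples of four lines (assumes unwrapped FASTQ)."""
    while True:
        record = tuple(islice(handle, 4))
        if len(record) < 4:
            return
        yield record


def choose_indices(total_pairs, subset_size, seed=0):
    """Pick which read pairs (by position) go into the hybrid-assembly subset."""
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    return set(rng.sample(range(total_pairs), min(subset_size, total_pairs)))


def split_pairs(pairs, chosen, batch_size):
    """Return (subset, batches): chosen pairs, and the rest cut into batches.

    Everything is kept in lists for clarity; with hundreds of millions of
    reads each pair would have to be written to a file instead.
    """
    subset, batches, current = [], [], []
    for index, pair in enumerate(pairs):
        if index in chosen:
            subset.append(pair)
            continue
        current.append(pair)
        if len(current) == batch_size:
            batches.append(current)
            current = []
    if current:  # last, possibly smaller, batch
        batches.append(current)
    return subset, batches
```

With two open FASTQ handles `r1` and `r2`, the pairs would be passed in as `zip(read_fastq(r1), read_fastq(r2))`, so both mates of a pair always end up in the same subset or batch.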
I have wondered about the use of quality trimming especially after reading //www.freelists.org/post/mira_talk/IlluminaSolexa-sequence-coverage-VERY-LONG and //www.freelists.org/post/mira_talk/IlluminaSolexa-sequence-coverage-VERY-LONG,1 and yet trimming might help to decrease the overall number of Illumina reads.
Any ideas on how best to proceed here would be very much appreciated. If anyone else out there has any ideas please feel free to jump in.

Interesting topic. My view as a microbial ecologist would be:

Is the assumption of 1000 species based on, let's say, rRNA gene frequencies in the raw reads? If not, I would not be so sure: there could easily be 10 000 species or more in any average microbial community. That means the coverage is low and remains low... :-( So success in de novo assembly is random anyway :-) Bastien mentioned that the size is beyond MIRA's performance, and I have no better suggestions either. "Normal" amounts of 454 reads (500 000 to 1 million reads) perform quite well, giving quite a few contigs in the range of 3000 to 10 000 bp, which is a step forward for me because it allows easier recruitment mapping. I am afraid that playing around with assembly parameters, quality trimming etc. would not remove the danger of getting artificial chimeric contigs originating from closely related genomes. So my gut feeling is that it is a bit of a waste of time to try to get contigs as long as possible out of diverse metagenomes with the help of de novo assembly, even when the coverage is orders of magnitude higher. As a preliminary step before mapping back to known genomes, though, it sounds reasonable to me.
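
How strongly the coverage depends on the assumed species count can be checked with a little arithmetic. The sketch below uses the read counts and the 4 Mb genome size from the original question; the read lengths (about 400 bp for 454, 100 bp for Illumina) are not given anywhere in the thread and are assumptions, as is the simplification that all genomes are equally abundant.

```python
def mean_coverage(read_sets, n_species, genome_size_bp):
    """Average depth if all genomes were equally abundant:
    total sequenced bases divided by total genome length."""
    total_bases = sum(n_reads * read_len for n_reads, read_len in read_sets)
    return total_bases / (n_species * genome_size_bp)


# Read counts are from the thread; read lengths are ASSUMED (not in the post).
READ_SETS = [
    (10.27e6, 400),  # 454 reads, assumed ~400 bp each
    (600e6, 100),    # Illumina reads, assumed 100 bp each
]
GENOME_SIZE_BP = 4e6  # 4 Mb per genome, as in the post

for n_species in (1_000, 10_000):
    depth = mean_coverage(READ_SETS, n_species, GENOME_SIZE_BP)
    print(f"{n_species} species: about {depth:.1f}X")
```

Under these assumptions the 1000-species case gives roughly 16X, in line with the figure quoted in the question, while 10 000 species gives only about 1.6X. In a real community abundances are uneven, so the rarer genomes would sit well below that average.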


regards,
Veljo

References:
- [mira_talk] Metagenomic Assembly
  - From: Thomas, Dallas
