[mira_talk] Re: Metagenomic Assembly

  • From: Veljo Kisand <vkisand@xxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Tue, 20 Dec 2011 08:38:55 +0200

On 12/19/2011 11:02 PM, Thomas, Dallas wrote:

Bastien

I am beginning a rather large metagenomic assembly and am having difficulty deciding where to begin, given the size and complexity of the project. I have roughly 10.27 million SE 454 reads and 600 million PE Illumina reads (300 million per end). This gives roughly 16X coverage for a community of 1000 species at 4 Mb per genome.

At the moment I am debating where I should start. Currently I have been toying with a few different approaches listed below:

1. Trim Illumina sequences with DynamicTrim and LengthSort. Perform a series of assemblies with varied k-mer sizes, using Velvet and SOAPdenovo. Combine contigs from each of these assemblies and perform a final hybrid assembly with MIRA.

2. Trim Illumina sequences with DynamicTrim and LengthSort. Select a relatively large random set of the trimmed Illumina reads and perform a hybrid assembly with MIRA. Split remaining Illumina reads into batches and try to align to the contigs from the hybrid assembly.

3. Perform #2 without the trimming.
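
The read-splitting step in approaches 2 and 3 could be sketched roughly as follows. This is only an illustration under stated assumptions, not a tested pipeline: it assumes plain 4-line FASTQ records, and the names `read_fastq` and `assign_reads`, the sampling fraction, and the batch size are all hypothetical. For paired-end data the two mates of a pair would have to be assigned together, which this sketch does not handle.

```python
import itertools
import random


def read_fastq(handle):
    """Yield FASTQ records as tuples of 4 lines (assumes unwrapped FASTQ)."""
    while True:
        record = tuple(itertools.islice(handle, 4))
        if len(record) < 4:
            return
        yield record


def assign_reads(records, fraction, batch_size, seed=0):
    """Yield (destination, record) pairs in a single streaming pass.

    Each record goes to 'subset' with probability `fraction` (the random
    set used for the hybrid assembly); every other record goes to
    'batch_0', 'batch_1', ... with at most `batch_size` records per batch.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    n_rest = 0
    for rec in records:
        if rng.random() < fraction:
            yield "subset", rec
        else:
            yield f"batch_{n_rest // batch_size}", rec
            n_rest += 1
```

A caller would open one output file per destination name and write each record's four lines to it. Because nothing is held in memory, the same loop should work whether the input holds thousands or hundreds of millions of reads.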

I have wondered about the use of quality trimming, especially after reading //www.freelists.org/post/mira_talk/IlluminaSolexa-sequence-coverage-VERY-LONG and //www.freelists.org/post/mira_talk/IlluminaSolexa-sequence-coverage-VERY-LONG,1 and yet trimming might help to decrease the overall number of Illumina reads.

Any ideas on how best to proceed here would be very much appreciated. If anyone else out there has any ideas please feel free to jump in.


Interesting topic. My view as a microbial ecologist would be:
Is the assumption of 1000 species based on, let's say, rRNA gene frequencies in the raw reads? If not, I would not be so sure: there could easily be 10 000 species or more in any average microbial community. That means the coverage is low and remains low... :-( So the success of de novo assembly is random anyway :-) Bastien mentioned that the size is beyond MIRA's performance, and I have no better suggestions either. "Normal" amounts of 454 reads (500 000 to 1 million reads) perform quite well, giving quite a few contigs in the range of 3000 to 10 000 bp, which is a step forward for me because it allows easier recruitment mapping. I am afraid that playing around with assembly parameters, quality trimming, etc. would not remove the danger of getting artificial chimeric contigs originating from closely related genomes. So my gut feeling is that it is a bit of a waste of time to try to get contigs as long as possible out of diverse metagenomes with de novo assembly, even when the coverage is orders of magnitude higher. As a preliminary step before mapping back to known genomes, though, it sounds reasonable to me.
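
To put rough numbers on the coverage point, here is a small back-of-the-envelope sketch. It takes the figures from the original question (16X for 1000 species at 4 Mb per genome) and asks what the mean coverage would be if the community held 10 000 species instead. It assumes, unrealistically, that all species are equally abundant and have the same genome size; the function name `mean_coverage` is only illustrative.

```python
def mean_coverage(total_bases, n_species, genome_size):
    """Mean per-genome coverage, assuming equal abundance and genome size."""
    return total_bases / (n_species * genome_size)


GENOME_SIZE = 4_000_000  # 4 Mb per genome, as stated in the question

# Total sequenced bases implied by "16X for 1000 species at 4 Mb".
total_bases = 16 * 1000 * GENOME_SIZE

for n_species in (1_000, 10_000):
    coverage = mean_coverage(total_bases, n_species, GENOME_SIZE)
    print(n_species, coverage)
# prints:
# 1000 16.0
# 10000 1.6
```

In a real community the abundances are very uneven, so a few dominant genomes would sit well above this mean and most of the rest well below it.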

regards,
Veljo
