I was thinking along the same lines. Even if it turned out not to be an issue with the assembly, it would certainly speed things up by cutting down the number of reads.

Shaun

From: Keith Robison <keith.e.robison@xxxxxxxxx>
To: mira_talk@xxxxxxxxxxxxx
Date: 2012-05-01 07:05 PM
Subject: [mira_talk] Re: Metagenome assembly
Sent by: mira_talk-bounce@xxxxxxxxxxxxx

If you are seeing a lot of human contamination, one approach would be to use Bowtie2 against a human reference assembly, then take everything that didn't align and feed that into MIRA. In a similar manner, if there were some well-known bacterium dominating the data, you could use the same approach to deplete those reads.

Keith

On Tue, May 1, 2012 at 7:47 PM, Shaun Tyler <Shaun.Tyler@xxxxxxxxxxxxxxx> wrote:

Another thing I thought I'd mention, in case you're curious, also has to do with the data I got from the Edena assembly. When I started checking some of the large contigs I thought they were all junk, because a basic BLASTn was returning nothing, and I would have expected at least some segments to match well enough. But when I switched to BLASTx I got hits that were 100% identical and contiguous in the matching genome. Maybe this has to do with the consensus called from the mixed population, but I somehow don't think so. So far I haven't had time to look into this any further, but it sure is weird!

Shaun

Edena v3 development version 110920

Loading file "out.ovl"... done
reads length: 90
number of reads: 24447638
number of nodes: 24175258
number of edges: 14877278
minimum overlap size: 50
Concatenating overlaps graph... done
Renumbering nodes... done
Updated number of nodes: 18649163
Discarding non-usable reads... done
16781194 nodes corresponding to 20161590 reads have been discarded (82.5%)
Removing dead-end path... done
889629 dead-ends (l<=179nt) have been removed, corresponding to 1119882 reads (4.6%)
Concatenating overlaps graph... done
Renumbering nodes... done
Updated number of nodes: 800286
Contextual cleaning: step1... done
Contextual cleaning: step2... done
605279 edges have been cleaned out
Concatenating overlaps graph... done
Renumbering nodes... done
Updated number of nodes: 701294
Removing dead-end path... done
3837 dead-ends (l <= 179nt) have been removed, corresponding to 13623 reads (0.06%)
Concatenating overlaps graph... done
Renumbering nodes... done
Updated number of nodes: 692763
Nodes coverage sampling: mean: 15.28  median: 9.82  sd: 48.00
minimum average coverage required for the contigs: 2.45
Resolving bubbles... done
bubbles resolved: 28
Concatenating overlaps graph... done
Renumbering nodes... done
Updated number of nodes: 692679
Estimating pairing distance... done
paired-end allowed distance range(s) [min,max] (observed distribution)
dataset 1: [88,303] (mean=196.119  sd=53.898)
Sorting nodes... done
Building contigs... done
Number of contigs: 21801  sum: 5436997  N50: 259  mean: 249.392  max: 47186  min: 100
Contig elongations were stopped due to: branching: 6962  dead-end: 36640

From: Bastien Chevreux <bach@xxxxxxxxxxxx>
To: mira_talk@xxxxxxxxxxxxx
Date: 2012-04-26 03:54 PM
Subject: [mira_talk] Re: Metagenome assembly
Sent by: mira_talk-bounce@xxxxxxxxxxxxx

On Apr 25, 2012, at 22:44, Shaun Tyler wrote:

Does anyone have experience assembling metagenome data with MIRA? I have a feeling this might be one of those applications that will give MIRA a nervous breakdown. The data is 100 bp paired-end Illumina reads from libraries derived from nasal swabs. There is slightly in excess of 2 Gbp of data per sample (25 M reads or so).
25M reads alone are not a big problem; I've done RNASeq assemblies with 40 to 50M, and I know of at least two users who ventured into the 100M area (though I would not recommend that). You just need a machine that is big enough.

However, I fear that some aspects of metagenomes will indeed lead to problems. If you assemble the data in "genome" mode, I think MIRA will have a hard time guessing the "coverage" of this "genome", and that will lead to misassemblies. If you assemble in EST mode, things will probably go faster, but there again I am almost sure misassemblies will happen. The thing is: in a metagenome there is no such thing as an "average coverage", because that "average coverage" is mainly driven by the population ratios. I have no idea how to get around this.

In any case, if you are making trials, set

  -SK:bph=31

This will probably greatly reduce misassemblies, at the expense of low-abundance genomes being less well assembled.

Would love to hear back from you on that.

B.
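Keith's host-depletion suggestion earlier in the thread (map the reads against a human reference, keep only what didn't align, assemble the rest) can be sketched as follows. This is a minimal illustration of the filtering step only, assuming you have already run Bowtie2 with SAM output; the read names and toy records below are invented for the example:

```python
def unmapped_reads(sam_lines):
    """Yield (name, seq, qual) for reads the aligner could not place.

    In the SAM format, bit 0x4 of the FLAG field (column 2) marks a
    segment as unmapped, so those are the non-host reads you would
    write back to FASTQ and feed to MIRA.
    """
    for line in sam_lines:
        if line.startswith("@"):            # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        if flag & 0x4:                      # unmapped bit set
            yield fields[0], fields[9], fields[10]

# Toy data: r1 is mapped (flag 0), r2 is unmapped (flag 4).
sam = [
    "@HD\tVN:1.0",
    "r1\t0\tchr1\t100\t42\t5M\t*\t0\t0\tACGTA\tIIIII",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tTTGCA\tIIIII",
]
print([name for name, seq, qual in unmapped_reads(sam)])  # → ['r2']
```

In practice Bowtie2 can also write unaligned pairs directly to FASTQ, which avoids the intermediate SAM parsing step; the snippet above just makes the "take everything that didn't align" logic explicit.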