[mira_talk] Re: Metagenome assembly

  • From: Keith Robison <keith.e.robison@xxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Tue, 1 May 2012 20:04:52 -0400

If you are seeing a lot of human contamination, one approach would be to
use Bowtie2 against a human reference assembly.  Then take everything that
didn't align and feed that into MIRA.  In a similar manner, if there were
some well-known bacterium dominating the data, you could also use this
approach to deplete those reads.
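
For example, something along these lines would keep just the read pairs
that do not align concordantly to a human index (an untested sketch: the
index name, FASTQ names and thread count are placeholders, and it assumes
bowtie2 is installed and on the PATH):

    # Untested sketch: deplete human reads before handing the rest to MIRA.
    # "human_index", the FASTQ names and the thread count are placeholders.
    import subprocess

    subprocess.run(
        [
            "bowtie2",
            "-x", "human_index",              # prebuilt human Bowtie2 index
            "-1", "sample_R1.fastq",          # paired-end Illumina input
            "-2", "sample_R2.fastq",
            "--un-conc", "nonhuman_%.fastq",  # pairs that did not align concordantly;
                                              # bowtie2 replaces % with the mate number
            "-S", "/dev/null",                # discard the alignments themselves
            "-p", "8",                        # threads
        ],
        check=True,
    )
    # nonhuman_1.fastq and nonhuman_2.fastq then go into MIRA.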

Keith

On Tue, May 1, 2012 at 7:47 PM, Shaun Tyler <Shaun.Tyler@xxxxxxxxxxxxxxx> wrote:
>
>
> Another thing I thought I'd mention in case you're curious also has to do
> with the data I got from the Edena assembly.  When I started checking out
> some of the large contigs I thought they were all crap, because a basic
> BLASTn was returning nothing and I would have thought at least some
> segments would match well enough.  But when I switched to doing BLASTx I
> got hits that were 100% identical and contiguous in the matching genome?!
> Maybe this has to do with the consensus being called from a mixed
> population, but I somehow don't think so.  So far I haven't had time to
> look at this any further, but it sure is weird!
>
> Shaun
>
>
>
>
>    Edena v3 development version 110920
>
> Loading file "out.ovl"...  done
>    reads length:             90
>    number of reads:          24447638
>    number of nodes:          24175258
>    number of edges:          14877278
>    minimum overlap size:     50
> Concatenating overlaps graph... done
>    Renumbering nodes... done
>    Updated number of nodes: 18649163
> Discarding non-usable reads... done
>    16781194 nodes corresponding to 20161590 reads have been discarded (82.5%)
> Removing dead-end path... done
>    889629 dead-ends (l<=179nt) have been removed
>    corresponding to 1119882 reads (4.6%)
> Concatenating overlaps graph... done
>    Renumbering nodes... done
>    Updated number of nodes: 800286
> Contextual cleaning: step1... done
> Contextual cleaning: step2... done
>    605279 edges have been cleaned out
> Concatenating overlaps graph... done
>    Renumbering nodes... done
>    Updated number of nodes: 701294
> Removing dead-end path...done
>    3837 dead-ends (l <= 179nt) have been removed
>    corresponding to 13623 reads (0.06%)
> Concatenating overlaps graph... done
>    Renumbering nodes... done
>    Updated number of nodes: 692763
> Nodes coverage sampling:
>    mean: 15.28
>    median: 9.82
>    sd: 48.00
>    minimum average coverage required for the contigs: 2.45
> Resolving bubbles... done
>    bubbles resolved: 28
> Concatenating overlaps graph... done
>    Renumbering nodes... done
>    Updated number of nodes: 692679
> Estimating pairing distance... done
>    paired-end allowed distance range(s) [min,max]  (observed distribution)
>    dataset 1: [88,303]  (mean=196.119 sd=53.898)
> Sorting nodes...done
> Building contigs... done
> Number of contigs:  21801
>    sum:  5436997
>    N50:  259
>    mean: 249.392
>    max:  47186
>    min:  100
> Contigs elongations were stopped due to:
>    branching: 6962
>    dead-end:  36640
>
>
> From: Bastien Chevreux <bach@xxxxxxxxxxxx>
> To: mira_talk@xxxxxxxxxxxxx
> Date: 2012-04-26 03:54 PM
> Subject: [mira_talk] Re: Metagenome assembly
> Sent by: mira_talk-bounce@xxxxxxxxxxxxx
> ------------------------------
>
>
>
> On Apr 25, 2012, at 22:44, Shaun Tyler wrote:
>
>
>    Does anyone have experience assembling metagenome data with MIRA?  I
>    have a feeling this might be one of those applications that will give
>    MIRA a nervous breakdown.  The data is 100 bp paired-end Illumina reads
>    from libraries derived from nasal swabs.  There is slightly in excess
>    of 2 Gbp of data per sample (25 M reads or so).
>
>
> 25m reads alone are not a big problem: I've done RNASeq assemblies with 40
> to 50m, and I know at least two users who ventured into the 100m area (but
> I'd not recommend doing that). You just need a machine which is big enough.
>
> However, I fear that some aspects of metagenomes will indeed lead to
> problems. If you assemble the data in "genome" mode, I think MIRA will have
> a hard time guessing the "coverage" of this "genome" ... and that will lead
> to misassemblies. If you assemble in EST mode, things will probably go
> faster, but then again I am almost sure misassemblies will happen.
>
> The thing is: in metagenomes, there is no such thing as an "average
> coverage" because this "average coverage" will be mainly driven by
> population ratios. I have no idea how to get around this.
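>
> As a back-of-the-envelope illustration (the genome sizes and abundance
> fractions below are invented; only the ~2 Gbp of read data comes from the
> numbers above), the coverage each community member actually gets is
> roughly the total read bases times its abundance fraction divided by its
> genome size, so any single "average" mostly reflects whatever happens to
> dominate the sample:
>
>     # Toy numbers only: the community composition and genome sizes are
>     # made up; the ~2 Gbp of read data is taken from the post above.
>     total_bases = 2.0e9
>
>     # (name, genome size in bp, abundance fraction), all hypothetical
>     community = [
>         ("organism_A", 2.8e6, 0.80),
>         ("organism_B", 4.5e6, 0.15),
>         ("organism_C", 2.0e6, 0.05),
>     ]
>
>     for name, genome_size, fraction in community:
>         coverage = total_bases * fraction / genome_size
>         print(f"{name}: ~{coverage:.0f}x")
>     # A single "average coverage" over the whole dataset is dominated by
>     # organism_A, while organism_C sits at a completely different depth.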
>
> In any case: if you are making trials, set
>   -SK:bph=31
>
> This will probably greatly reduce misassemblies at the expense of genomes
> with low abundance being less well assembled.
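>
> For reference, a minimal sketch of how such a trial run might be scripted
> (the MIRA 3.x-style command line, the project name and the job string are
> assumptions here; check them against the manual of your MIRA version):
>
>     # Untested sketch assuming a MIRA 3.x-style command line and that
>     # "mira" is on the PATH; project name and job string are placeholders.
>     import subprocess
>
>     subprocess.run(
>         [
>             "mira",
>             "--project=nasal_metag",                # placeholder project name
>             "--job=denovo,genome,accurate,solexa",  # de novo assembly of Illumina reads
>             "-SK:bph=31",                           # larger bases per hash, as above
>         ],
>         check=True,
>     )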
>
> Would love to hear back from you on that.
>
> B.
>
>
