[mira_talk] Re: Metagenome assembly

  • From: Shaun Tyler <Shaun.Tyler@xxxxxxxxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Wed, 2 May 2012 08:24:33 -0500

I was thinking along the same lines.  Even if it turned out not to be an
issue with the assembly it would certainly speed things up by cutting down
the number of reads.

Shaun






From:   Keith Robison <keith.e.robison@xxxxxxxxx>
To:     mira_talk@xxxxxxxxxxxxx
Date:   2012-05-01 07:05 PM
Subject:        [mira_talk] Re: Metagenome assembly
Sent by:        mira_talk-bounce@xxxxxxxxxxxxx



If you are seeing a lot of human contamination, one approach would be to
use Bowtie2 against a human reference assembly.  Then take everything that
didn't align & feed that into MIRA.  In a similar manner, if there were
some well-known bacterium dominating the data, you could also use this
approach to deplete those reads.
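
A minimal sketch of that depletion step, assuming paired-end FASTQ input and a prebuilt Bowtie2 index of a human reference (all file and index names here are placeholders, not from the original mail):

```shell
# Align read pairs against the human reference; --un-conc writes pairs
# that fail to align concordantly (the putative non-human fraction).
# The % in the path is replaced by 1 and 2 for the two mates.
bowtie2 -x human_grch37 -1 sample_R1.fastq -2 sample_R2.fastq \
        --un-conc depleted_R%.fastq -S /dev/null --threads 4

# depleted_R1.fastq and depleted_R2.fastq then go into MIRA.
```

The same index-and-deplete pattern works for any dominant known organism, not just human.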

Keith

On Tue, May 1, 2012 at 7:47 PM, Shaun Tyler <Shaun.Tyler@xxxxxxxxxxxxxxx>
wrote:

  Another thing I thought I'd mention in case you're curious also has to do
  with the data I got from the Edena assembly.  When I started checking out
  some of the large contigs I thought they were all crap because a basic
  BLASTn was returning nothing and I would have thought at least some
  segments would match well enough.  But when I switched to doing BLASTx I
  got hits that were 100% identical and contiguous in the matching genome.
  Maybe this has to do with the consensus called from the mixed population,
  but I somehow don't think so.  So far I haven't had time to look at this
  any further, but it sure is weird!

  Shaun





        Edena v3 development version 110920
  Loading file "out.ovl"...  done
     reads length:             90
     number of reads:          24447638
     number of nodes:          24175258
     number of edges:          14877278
     minimum overlap size:     50
  Concatenating overlaps graph... done
     Renumbering nodes... done
     Updated number of nodes: 18649163
  Discarding non-usable reads... done
     16781194 nodes corresponding to 20161590 reads have been discarded (82.5%)
  Removing dead-end path... done
     889629 dead-ends (l<=179nt) have been removed
     corresponding to 1119882 reads (4.6%)
  Concatenating overlaps graph... done
     Renumbering nodes... done
     Updated number of nodes: 800286
  Contextual cleaning: step1... done
  Contextual cleaning: step2... done
     605279 edges have been cleaned out
  Concatenating overlaps graph... done
     Renumbering nodes... done
     Updated number of nodes: 701294
  Removing dead-end path...done
     3837 dead-ends (l <= 179nt) have been removed
     corresponding to 13623 reads (0.06%)
  Concatenating overlaps graph... done
     Renumbering nodes... done
     Updated number of nodes: 692763
  Nodes coverage sampling:
     mean: 15.28
     median: 9.82
     sd: 48.00
     minimum average coverage required for the contigs: 2.45
  Resolving bubbles... done
     bubbles resolved: 28
  Concatenating overlaps graph... done
     Renumbering nodes... done
     Updated number of nodes: 692679
  Estimating pairing distance... done
     paired-end allowed distance range(s) [min,max]  (observed distribution)
     dataset 1: [88,303]  (mean=196.119 sd=53.898)
  Sorting nodes...done
  Building contigs... done
  Number of contigs:  21801
     sum:  5436997
     N50:  259
     mean: 249.392
     max:  47186
     min:  100
  Contigs elongations were stopped due to:
     branching: 6962
     dead-end:  36640



  From: Bastien Chevreux <bach@xxxxxxxxxxxx>
  To: mira_talk@xxxxxxxxxxxxx
  Date: 2012-04-26 03:54 PM
  Subject: [mira_talk] Re: Metagenome assembly
  Sent by: mira_talk-bounce@xxxxxxxxxxxxx



  On Apr 25, 2012, at 22:44 , Shaun Tyler wrote:

        Does anyone have experience assembling metagenome data with Mira?
        I have a feeling this might be one of those applications that will
        give Mira a nervous breakdown.  The data is 100 bp paired end
        Illumina reads from libraries derived from nasal swabs.  There is
        slightly in excess of 2 Gbp of data per sample (25 M reads or so).

  25m reads alone are not a big problem: I've done RNASeq assemblies with
  40 to 50m, and I know at least two users who ventured into the 100m area
  (but I'd not recommend doing that).  You just need a machine which is big
  enough.

  However, I fear that some aspects of metagenomes will indeed lead to
  problems. If you assemble the data in "genome" mode, I think MIRA will
  have a hard time guessing the "coverage" of this "genome" ... and that
  will lead to misassemblies. If you assemble in EST mode, things will
  probably go faster, but there again I am almost sure misassemblies will
  happen.

  The thing is: in metagenomes, there is no such thing as an "average
  coverage" because this "average coverage" will be mainly driven by
  population ratios. I have no idea how to get around this.

  In any case: if you are making trials, set
    -SK:bph=31

  This will probably greatly reduce misassemblies at the expense of genomes
  with low abundance being less well assembled.
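
  As a point of reference (not from the original mail), a MIRA 3.x command line carrying that switch might look like the following; the project name and job string are placeholder assumptions, so check the manual for your version:

```shell
# Hypothetical MIRA 3.x de novo run on paired-end Illumina data.
# "nasal_meta" and the job string are placeholders; only -SK:bph=31
# is taken from the mail above.
mira --project=nasal_meta --job=denovo,genome,accurate,solexa \
     -SK:bph=31 > assembly_log.txt 2>&1
```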

  Would love to hear back from you on that.

  B.



