[mira_talk] Re: Metagenomic assembly

  • From: Torben Nielsen <torben@xxxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Sun, 11 May 2014 12:12:45 -1000

> On 09 May 2014, at 23:21 , Chayan Roy <chayan.roy93@xxxxxxxxx> wrote:
>> I am using six Ion Torrent datasets (avg read length 195 bp) and four Proton 
>> datasets (~178 bp). For all of them I performed de novo assembly with default 
>> parameters. But after looking at contig_stat_pass1.txt after the 
>> assembly, I reran it with -AS:nop=1 (I know this was a really weird 
>> thing to do), and this resulted in much longer contigs (in that 
>> case I might be sacrificing accuracy, is that so?)
> 
> You’re not sacrificing accuracy … you’re completely butchering it. Your 
> “long” contigs will have a lot of misassemblies.

I have run about 25 large metagenomic assemblies this year. The smaller ones 
are newer MiSeq full runs with 25M read pairs, while some of the larger ones are 
HiSeq runs with about twice the data. I run 6 passes, which seems to be what it 
takes for the number of contigs to “stabilize”. I asked Bastien about that some 
time ago and, as I recall, he commented that he’d mostly used up to 4 (it’s been 
a while, but that’s what I remember). I got a version that logs the number of 
contigs broken in each pass, played with it for a while, and settled on 6 as 
a good compromise. I tried up to 8.

I need to stop looking at the first pass contigs. It makes me cry when I see 
almost 1M long contigs in the first pass and I *know* I’m lucky to have 250K 
left at the end of the 6th pass. My conclusion for metagenomics is to not look 
at the contig lengths till the end of the 3rd pass. Or just keep a bottle of 
wine handy and drown your sorrows. My passes take two days apiece so there’s 
plenty of time to sober up.

If you really really want longer contigs and are aware of the dangers, consider 
shredding your contigs and reassembling. I have done that and I have gotten 
significantly longer contigs out of it. In effect, I am equalizing coverage 
this way. That said, I gave up on it and decided to go for analytical 
approaches that work fine on long contigs and do not require fully assembled 
genomes. In much of what I am working on, the species question isn’t easily 
answerable anyway.
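
A minimal sketch of the shredding step, to make the idea concrete. It cuts each 
contig into overlapping fixed-length fragments that can then be fed back into an 
assembler as if they were reads. The function names and the frag_len and step 
values are assumptions chosen for illustration, not settings from this thread. 
Because every contig is tiled at the same depth (roughly frag_len / step), the 
fragments no longer carry the uneven coverage of the original reads.

```python
def shred(seq, frag_len=500, step=250):
    """Cut one sequence into overlapping fragments of frag_len bases.

    A new fragment starts every `step` bases; a final fragment is added
    when needed so that the end of the sequence is always covered.
    Sequences no longer than frag_len are returned unchanged.
    """
    if len(seq) <= frag_len:
        return [seq]
    last_start = len(seq) - frag_len
    frags = [seq[i:i + frag_len] for i in range(0, last_start + 1, step)]
    if last_start % step != 0:
        frags.append(seq[last_start:])  # cover the tail of the contig
    return frags


def shred_contigs(contigs, frag_len=500, step=250):
    """Shred a {name: sequence} dict into a list of (fragment_name, fragment)."""
    out = []
    for name, seq in contigs.items():
        for n, frag in enumerate(shred(seq, frag_len, step)):
            out.append((f"{name}_frag{n}", frag))
    return out


if __name__ == "__main__":
    # Toy input: one 1200-base contig and one short contig (hypothetical data).
    toy = {"contigA": "ACGT" * 300, "contigB": "ACGT" * 50}
    for frag_name, frag in shred_contigs(toy):
        print(frag_name, len(frag))
```

The misassemblies already present in the contigs are carried into the fragments, 
so the dangers mentioned above still apply to whatever the reassembly produces.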

Another possible approach is to play with partial ordering of your contigs. 
That leads to graphs and you can look at paths through the graph to find 
potentially much longer ones. I’ve put in a fair amount of work doing that and 
I can get very long contigs, but I am not so sure what it means.
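
One way to picture the graph idea, as a sketch under stated assumptions rather 
than a description of the method used here: treat each contig as a node, treat 
each "A comes before B" relation as a directed edge, and look for the path with 
the largest summed contig length. The source of the ordering evidence is left 
open; the edge list below is hypothetical. The sketch assumes the graph has no 
cycles and ignores overlaps between neighbouring contigs when summing lengths.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def longest_path(lengths, edges):
    """Return (path, total_length) for the heaviest path in a contig DAG.

    lengths: {contig_name: length in bases}
    edges:   iterable of (a, b) pairs meaning "a is ordered before b"
    Raises graphlib.CycleError if the ordering contains a cycle.
    """
    preds = {name: set() for name in lengths}
    for a, b in edges:
        preds[b].add(a)

    # best[node] = (length of heaviest path ending at node, previous node)
    best = {}
    for node in TopologicalSorter(preds).static_order():
        total, prev = lengths[node], None
        for p in preds[node]:
            candidate = best[p][0] + lengths[node]
            if candidate > total:
                total, prev = candidate, p
        best[node] = (total, prev)

    # Walk back from the node where the heaviest path ends.
    end = max(best, key=lambda n: best[n][0])
    path, node = [], end
    while node is not None:
        path.append(node)
        node = best[node][1]
    return path[::-1], best[end][0]


if __name__ == "__main__":
    # Hypothetical contigs and ordering relations, for illustration only.
    lengths = {"c1": 1000, "c2": 500, "c3": 2000, "c4": 300}
    edges = [("c1", "c2"), ("c1", "c3"), ("c2", "c4"), ("c3", "c4")]
    print(longest_path(lengths, edges))
```

A path found this way is only a hypothesis about how contigs might be chained: 
it says nothing about whether the contigs on it come from the same organism.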

Torben


