[mira_talk] Recommended parameters for de novo amplicon assemblies

From: Martin Mokrejs <mmokrejs@xxxxxxxxxxxxxxxxxx>
To: mira_talk@xxxxxxxxxxxxx
Date: Wed, 18 May 2011 16:48:49 +0200
Hi,
  I am trying to verify some 454 amplicon data and would like to know how to 
best
assemble the reads. I have several comments and I wish it provokes some answers/
comments back from you: ;)

1. After discussion with Bastien there is no guarantee that adaptor-containing 
region
   will NOT be involved in a HSP during the alignment, even if the adaptor 
region is
   in the 'to be trimmed region' as defined by trim-points. That is bad and as 
I see
   it now, one should drop adaptor-containing regions before submission into 
assembly
   program. It is a matter of trust, or you are asking for a trouble. I already
   suspected that newbler merges reads via low-qual region and now ... mira 
does that
   as well. :((

2. Too many reads are recognized as repeats, and although I tried to raise
   -ASSEMBLY:coverage_threshold value that is still not enough.

3. When a consensus is constructed I would prefer mira to be more conservative 
and
   not extend it into regions where a few reads were longer. Such reads actually
   define the sequence of the consensus. But if you think that these reads are
   PCR-specific chimeras, for example. adaptor-rlmid-crap-amplicon-rlimd-adaptor
   you really to have in the consensus only the amplicon region. Interestingly,
   the -CL:propose_end_clips=yes is not that "highly effective" and these few,
   e.g. 8 reads out of 200 make it into the consensus. So what option could I 
use
   to to get the consensus shorter? Or do I have to learn how to use consed to
   shorten the consensus? ;-)

   This leads me in mind to the situation what would happen to sample RLMID 
keys.
   Imagine you assemble reads with adaptor-rlmid-amplicon-rlmid_rc-adaptor and 
if the
   -CL:propose_end_clips=yes would work one would expect to have in consensus 
only
   the shared, amplicon region. That would be fine for me, though so far I tried
   to assemble only pre-clustered reads split by their RLMID so this 'pec' 
feature
   had not much chances to show its capabilities (and especially as I mostly had
   it disabled (see below the command line).

4. The sequence editor should be disabled by default but to be sure I tried to
   specifically enable it and later to disable it. The numbers of resulting 
contigs
   varied and I wasn't sure what to actually prefer. How would you go about 
analysis
   of 16S amplicons (aiming to unleash broad spectrum of microbes in a sample)?

5. An obvious question could be "Why doing de novo assembly of amplicons at all
   instead of reference-guided assembly?". Well, because this shows the real PCR
   artifacts, something (I believe) one would not realize if the read (just its 
part,
   not whole read, but does ever realize that?) aligned to the reference. 
Moreover,
   although this seems to be a simple task it looks like it is not, and that
   assemblers do make interesting mistakes.

6. I tried phrap but assembly of 6k reads took over a day (until I killed the 
task)
   while mira did that in dozens of seconds. What is your preferred program to 
align
   globally the reads?
   Imagine they should all overlap (200-400bp PCR products), more or less 
end-to-end,
   there is nothing particularly difficult I believe, no reason to waste CPU in
   analysis of repeats motifs, chimeras, etc.
   I should have tried clustalw2 or muscle I think at first. ;)

7. Following the previous point, I have a hard time to figure out how to make 
mira
   merge many contigs together (few differ by supposedly sequencing errors 
within
   the core amplicon). Instead, I want to have the contigs split based on their
   ends (so to move out reads with ligated crap to their either end). It is more
   important for me to cluster by length of the sequence then by the actual 
sequence
   "inside" The sequences "inside" the reads are mostly 98-99% same (I think 
because
   the length of the amplicon is so long that the few differences do not make up
   a large percentage).

8. In general, mostly the de novo assembly is doable and gave me just a single 
contig,
   as expected. However, in the cases briefly sketched above I received more 
contigs.
   The problem is that they tend to be defined by sequence differences within 
the "core"
   amplicon region and somehow, the PCR-derived chimeras are scattered through 
all the
   alternative contigs. If I look into the "core amplicon" sequences of several 
of
   those contigs, they all point to same bacterium and therefore I think could 
be
   merged together.

The above findings are based on few experiments with mira-3.2.0 with:

 --job=denovo,genome,accurate,454 454_SETTINGS -ED:automatic_contig_editing=no 
-AL:min_overlap=60 -CLIPPING:lowercase_clip=yes 
-ASSEMBLY:coverage_threshold=1000 COMMON_SETTINGS 
-SKIM:number_of_threads=8,bases_per_hash=15 
-CLIPPING:apply_skim_chimeradetectionclip=no,pec=no 
-ASSEMBLY:skim_each_pass=on,uniform_read_distribution=no,spoiler_detection=no 
-GE:not=8 -OUT:rld=no,org=yes,orw=yes


Thanks for your comments. The text turned out to be be a bit long but I hope you
could follow me. ;)
Martin

-- 
You have received this mail because you are subscribed to the mira_talk mailing 
list. For information on how to subscribe or unsubscribe, please visit 
http://www.chevreux.org/mira_mailinglists.html
Follow-Ups:
- [mira_talk] Re: Recommended parameters for de novo amplicon assemblies
  - From: Bastien Chevreux
[mira_talk] Recommended parameters for de novo amplicon assemblies

Other related posts: