Hi, I am trying to verify some 454 amplicon data and would like to know how to best assemble the reads. I have several comments and I wish it provokes some answers/ comments back from you: ;) 1. After discussion with Bastien there is no guarantee that adaptor-containing region will NOT be involved in a HSP during the alignment, even if the adaptor region is in the 'to be trimmed region' as defined by trim-points. That is bad and as I see it now, one should drop adaptor-containing regions before submission into assembly program. It is a matter of trust, or you are asking for a trouble. I already suspected that newbler merges reads via low-qual region and now ... mira does that as well. :(( 2. Too many reads are recognized as repeats, and although I tried to raise -ASSEMBLY:coverage_threshold value that is still not enough. 3. When a consensus is constructed I would prefer mira to be more conservative and not extend it into regions where a few reads were longer. Such reads actually define the sequence of the consensus. But if you think that these reads are PCR-specific chimeras, for example. adaptor-rlmid-crap-amplicon-rlimd-adaptor you really to have in the consensus only the amplicon region. Interestingly, the -CL:propose_end_clips=yes is not that "highly effective" and these few, e.g. 8 reads out of 200 make it into the consensus. So what option could I use to to get the consensus shorter? Or do I have to learn how to use consed to shorten the consensus? ;-) This leads me in mind to the situation what would happen to sample RLMID keys. Imagine you assemble reads with adaptor-rlmid-amplicon-rlmid_rc-adaptor and if the -CL:propose_end_clips=yes would work one would expect to have in consensus only the shared, amplicon region. That would be fine for me, though so far I tried to assemble only pre-clustered reads split by their RLMID so this 'pec' feature had not much chances to show its capabilities (and especially as I mostly had it disabled (see below the command line). 4. The sequence editor should be disabled by default but to be sure I tried to specifically enable it and later to disable it. The numbers of resulting contigs varied and I wasn't sure what to actually prefer. How would you go about analysis of 16S amplicons (aiming to unleash broad spectrum of microbes in a sample)? 5. An obvious question could be "Why doing de novo assembly of amplicons at all instead of reference-guided assembly?". Well, because this shows the real PCR artifacts, something (I believe) one would not realize if the read (just its part, not whole read, but does ever realize that?) aligned to the reference. Moreover, although this seems to be a simple task it looks like it is not, and that assemblers do make interesting mistakes. 6. I tried phrap but assembly of 6k reads took over a day (until I killed the task) while mira did that in dozens of seconds. What is your preferred program to align globally the reads? Imagine they should all overlap (200-400bp PCR products), more or less end-to-end, there is nothing particularly difficult I believe, no reason to waste CPU in analysis of repeats motifs, chimeras, etc. I should have tried clustalw2 or muscle I think at first. ;) 7. Following the previous point, I have a hard time to figure out how to make mira merge many contigs together (few differ by supposedly sequencing errors within the core amplicon). Instead, I want to have the contigs split based on their ends (so to move out reads with ligated crap to their either end). It is more important for me to cluster by length of the sequence then by the actual sequence "inside" The sequences "inside" the reads are mostly 98-99% same (I think because the length of the amplicon is so long that the few differences do not make up a large percentage). 8. In general, mostly the de novo assembly is doable and gave me just a single contig, as expected. However, in the cases briefly sketched above I received more contigs. The problem is that they tend to be defined by sequence differences within the "core" amplicon region and somehow, the PCR-derived chimeras are scattered through all the alternative contigs. If I look into the "core amplicon" sequences of several of those contigs, they all point to same bacterium and therefore I think could be merged together. The above findings are based on few experiments with mira-3.2.0 with: --job=denovo,genome,accurate,454 454_SETTINGS -ED:automatic_contig_editing=no -AL:min_overlap=60 -CLIPPING:lowercase_clip=yes -ASSEMBLY:coverage_threshold=1000 COMMON_SETTINGS -SKIM:number_of_threads=8,bases_per_hash=15 -CLIPPING:apply_skim_chimeradetectionclip=no,pec=no -ASSEMBLY:skim_each_pass=on,uniform_read_distribution=no,spoiler_detection=no -GE:not=8 -OUT:rld=no,org=yes,orw=yes Thanks for your comments. The text turned out to be be a bit long but I hope you could follow me. ;) Martin -- You have received this mail because you are subscribed to the mira_talk mailing list. For information on how to subscribe or unsubscribe, please visit http://www.chevreux.org/mira_mailinglists.html