[mira_talk] Re: Recommended parameters for de novo amplicon assemblies

  • From: Bastien Chevreux <bach@xxxxxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Wed, 18 May 2011 19:29:21 +0200

On May 18, 2011, at 16:48 , Martin Mokrejs wrote:
> 1. After discussion with Bastien there is no guarantee that 
> adaptor-containing region
>   will NOT be involved in a HSP during the alignment, even if the adaptor 
> region is
>   in the 'to be trimmed region' as defined by trim-points.

Nope, never wrote that :-)

MIRA will *never* expand into clipped 454 sequence. So adaptors which are in 
the clipped part (as defined per input) will never contribute to an alignment.

Adaptors or adaptor-remains in the non-clipped part however do contribute (if 
they were not caught by other clipping mechanisms).

> That is bad and as I see
>   it now, one should drop adaptor-containing regions before submission into 
> assembly
>   program. It is a matter of trust, or you are asking for a trouble.

Having adaptors, sequencing vectors and similar beast in input data is always 
asking for trouble. Clean as thoroughly is possible, that's giving assemblers 
the best possibility to give you good assemblies.

> I already
>   suspected that newbler merges reads via low-qual region and now ... mira 
> does that
>   as well. :((

An overlap is an overlap is an overlap. If it's 200 bases long, it's more than 
valid, even if the bases are low qual. If you do not want low quality in, 
define clipping filters to get rid of them prior to the assembly.

There's certainly a conflict of interest though: clipping low quality but 
otherwise perfect sequence certainly does not help increasing N50, and that's 
what users are asking for.

> 2. Too many reads are recognized as repeats, and although I tried to raise
>   -ASSEMBLY:coverage_threshold value that is still not enough.

Those come from a different repeat detection mechanism based on k-mers. Have a 
look at the -SK:fenn:fexn:fer:fehr:fecr:nrr parameters.

> 3. When a consensus is constructed I would prefer mira to be more 
> conservative and
>   not extend it into regions where a few reads were longer. Such reads 
> actually
>   define the sequence of the consensus. But if you think that these reads are
>   PCR-specific chimeras, for example. 
> adaptor-rlmid-crap-amplicon-rlimd-adaptor
>   you really to have in the consensus only the amplicon region.

Filter out those reads prior to assembly, there's no other way. The assembler 
cannot test for every possible adaptor/mid/whatever chimera-contamination, 
that's certainly stuff for pre-assembly cleaning steps.

That is because there are so many of them as protocols change and people also 
cook up their own stuff.

> Interestingly,
>   the -CL:propose_end_clips=yes is not that "highly effective" and these few,
>   e.g. 8 reads out of 200 make it into the consensus.

-CL:pec is highly effective as long your input data does not contain unclipped 
adaptor sequences somewhere in crappy reads. Because then MIRA will simply 
think it's "normal" sequence.

-CL:pec clips away the ends of reads (which are notoriously plagued by 
sequencing errors and sometimes small unclipped adaptor fragments) by doing 
k-mer comparison with the complete project. Simply, but very effective for this 
task as it clips both sequencing errors and unclipped adaptor fragments. But as 
I wrote ... if there's some complete adaptor somewhere in your data, your SOL 
regarding clipping of adaptor fragments somewhere else.

>   This leads me in mind to the situation what would happen to sample RLMID 
> keys.
>   Imagine you assemble reads with adaptor-rlmid-amplicon-rlmid_rc-adaptor and 
> if the
>   -CL:propose_end_clips=yes would work one would expect to have in consensus 
> only
>   the shared, amplicon region.

Wrong strategy. I'd do: precleaning of reads so that only the amplicon remains.

> 4. The sequence editor should be disabled by default but to be sure I tried to
>   specifically enable it and later to disable it. The numbers of resulting 
> contigs
>   varied and I wasn't sure what to actually prefer. How would you go about 
> analysis
>   of 16S amplicons (aiming to unleash broad spectrum of microbes in a sample)?

Editor remains on by default :-) It's simply too effective for normal genome 
projects.

> 5. An obvious question could be "Why doing de novo assembly of amplicons at 
> all
>   instead of reference-guided assembly?". Well, because this shows the real 
> PCR
>   artifacts, something (I believe) one would not realize if the read (just 
> its part,
>   not whole read, but does ever realize that?) aligned to the reference. 
> Moreover,
>   although this seems to be a simple task it looks like it is not, and that
>   assemblers do make interesting mistakes.

Actually, MIRA would perhaps help you in finding short artefacts by putting SNP 
markers if you use different strains for backbone and reads. Butif the artefact 
is too large, the the read would not be aligned anymore, granted.

> 6. I tried phrap but assembly of 6k reads took over a day (until I killed the 
> task)
>   while mira did that in dozens of seconds. What is your preferred program to 
> align
>   globally the reads?
>   Imagine they should all overlap (200-400bp PCR products), more or less 
> end-to-end,
>   there is nothing particularly difficult I believe, no reason to waste CPU in
>   analysis of repeats motifs, chimeras, etc.
>   I should have tried clustalw2 or muscle I think at first. ;)

Or perhaps tried the "est" setting of MIRA :-) But there you might need to 
switch off some clipping (poly-A/T etc.). Instead, try the EST pathfinder: 
-AS:ugpf=no

If want to keep the "--job=genome", you could trick MIRA into setting repeat 
thresholds which are very high like this:

-SK:fexn=50:fer=60:fehr=70:fecr=80


> 7. Following the previous point, I have a hard time to figure out how to make 
> mira
>   merge many contigs together (few differ by supposedly sequencing errors 
> within
>   the core amplicon). Instead, I want to have the contigs split based on their
>   ends (so to move out reads with ligated crap to their either end).

Again: ligated crap should be filtered out in pre-assembly.

> The above findings are based on few experiments with mira-3.2.0 with:

Try 3.2.1.17 :-)

> --job=denovo,genome,accurate,454 454_SETTINGS -ED:automatic_contig_editing=no 
> -AL:min_overlap=60 -CLIPPING:lowercase_clip=yes 
> -ASSEMBLY:coverage_threshold=1000 COMMON_SETTINGS 
> -SKIM:number_of_threads=8,bases_per_hash=15 
> -CLIPPING:apply_skim_chimeradetectionclip=no,pec=no 
> -ASSEMBLY:skim_each_pass=on,uniform_read_distribution=no,spoiler_detection=no 
> -GE:not=8 -OUT:rld=no,org=yes,orw=yes

*grml*

Keep "-CL:pec=yes" if you do not have a very good reason to switch it off. I 
currently have only two reason why I'd want to switch it off:
1. low coverage in genome assemblies
2. necessity to catch very lowly expressed genes in EST / RNASeq

Similarly, the spoiler detection should be switched off only for the very same 
reasons as -CL:pec. You seem to say that most of your amplicons are very well 
covered, so ... why don't you run two assembly rounds? The first with settings 
aiming at high quality data, the second you take the debris from the first and 
try to get whatever is in there allowing low quality.

B. :-)


--
You have received this mail because you are subscribed to the mira_talk mailing 
list. For information on how to subscribe or unsubscribe, please visit 
http://www.chevreux.org/mira_mailinglists.html

Other related posts: