[mira_talk] Re: misassembly problems
- From: Bastien Chevreux <bach@xxxxxxxxxxxx>
- To: mira_talk@xxxxxxxxxxxxx
- Date: Fri, 27 Mar 2009 20:51:22 +0100
On Friday 27 March 2009 Giuseppe D'Auria wrote:
> I assembled a really complicate microbial genomes full of IS (full I
> mean really full). I found several, I think, misassembled reads. The
> project is half-plate GS-FLX20 Paired-Ends assembly. No much complicated
> for mira (less than 4h in accurate mode), these are the parameters:
Hi Giuseppe,
I learned the hard way that some bacteria really are almost as awful as
eukaryotic genomes. IS can be one cause; multiple phages/prophages in high copy
number within the genome are another.
> [...]
> I decided to increase the 'nop' to 12 and 'rbl' to 6 whit the hope this
> can improve my previous attempt I applied just using standard parameters
> (accurate mode).
Presently, going beyond 7 or 8 passes probably does not help much: the
bacteria I've seen tend to stabilise quite quickly. To be honest, the 7 passes
of "accurate" 454 assemblies are more a feature that was pretty useful
for GS20 sequences; FLX would generally need fewer (but I keep it, as many
people still have some GS20 data).
> Go to the problem.
> I found several contigs whit reads probably erroneously assembled (look
> at the light-blue A at position 8430).
That's a weakness of the current assembly engine: if it does not recognise a
repeat correctly, it is too lenient in handling the sequences, and the result is
what you see.
Actually, this has been my development area for a few weeks now, and I think
that sometime in April, I'll be ready to launch a version with a new assembly
engine which should be ... pretty good, according to first results.
> I said misassembled because the
> respective forward or reverse partner is in another contig and if I
> disassemble and try to join again (manually) it make sense. The problem
> is that this events causes wrong contigs whit big problem when I go to
> Gap4 (people call it finishing .... ironic ???).
> Can I fix parameters in order to avoid this kinds of errors in the
> contigs, if yes which one?.
At the moment, your best option for the assembly is, as Jan suggested, the
-highlyrepetitive switch. It can help a lot there. One of the major helper flags
that gets set by -highlyrepetitive is the option to mask nasty repeats during
skim (-SK:mnr). You might also want to adapt -SK:rt, as the current default
value of "8" could be a bit too harsh (try 4 first; if there are still too many
misassemblies, increase in steps of 2).
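
The -SK:rt trial sequence described above can be sketched in a few lines of
Python. This is only an illustration: the function names, the upper bound of 8
(the default value mentioned above) and the "option=value" spelling are
assumptions, and the sketch prints option strings rather than running an
assembly.

```python
def rt_candidates(start=4, step=2, stop=8):
    """Return the -SK:rt values to try, in order.

    start and step follow the advice above (try 4 first, then go up in
    steps of 2); stop=8 is an assumed upper bound, not a MIRA limit.
    """
    return list(range(start, stop + 1, step))


def option_strings(values):
    """Build one option fragment per trial value.

    Only the switches named in this mail are used; the "=value" form is
    an assumption about how the value is passed.
    """
    return [f"-highlyrepetitive -SK:rt={v}" for v in values]


if __name__ == "__main__":
    # One assembly run per line; check each result in gap4 before moving on.
    for opts in option_strings(rt_candidates()):
        print(opts)
```

Each printed line would be appended to the usual assembly command for one run,
stopping at the first value that gives an acceptable result.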
There's another thing you could do: if you find such cases of obvious
misassembly in gap4, mark the bases of *all* reads in the column that shows a
misassembly with the tag "SRMr". Then, once done, convert the gap database
back to CAF and use this as input for a de-novo assembly (switching off all
clippings this time). MIRA will use the newly marked bases as repeat markers
and won't make the same mistake.
Regards,
Bastien