[mira_talk] Re: misassembly problems

On Friday 27 March 2009 Giuseppe D'Auria wrote:
> I assembled a really complicated microbial genome full of IS (by full I
> mean really full). I found several, I think, misassembled reads. The
> project is a half-plate GS-FLX20 paired-end assembly. Not very complicated
> for mira (less than 4h in accurate mode); these are the parameters:

Hi Giuseppe,

I learned the hard way that some bacteria really are almost as awful as 
eukaryotic genomes. IS can be one cause, multiple phages/prophages in high copy 
number within the genome another.

> [...]
> I decided to increase the 'nop' to 12 and 'rbl' to 6 with the hope this
> can improve on my previous attempt, where I just used standard parameters
> (accurate mode).

Presently, going beyond 7 or 8 passes probably does not help much: the 
bacteria I've seen tend to stabilise quite quickly. To be honest, the 7 passes 
of "accurate" 454 assemblies are more of a feature that was pretty useful 
for GS20 sequences. FLX generally would need fewer (but I keep it, as many 
people still have some GS20 data).


> Now to the problem.
> I found several contigs with reads probably erroneously assembled (look
> at the light-blue A at position 8430).

That's a weakness of the current assembly engine: if it does not recognise a 
repeat correctly, it is too lenient in handling the sequences and the result is 
what you see.

Actually, this has been my main development area for a few weeks now, and I 
think that sometime in April I'll be ready to launch a version with a new 
assembly engine which should be ... pretty good, according to first results.

> I said misassembled because the
> respective forward or reverse partner is in another contig and if I
> disassemble and try to join again (manually) it makes sense. The problem
> is that these events cause wrong contigs, with big problems when I go to
> Gap4 (people call it finishing .... ironic ???).
> Can I adjust parameters in order to avoid these kinds of errors in the
> contigs, and if yes, which ones?
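
The check described in the quoted text (a read whose forward or reverse 
partner ended up in another contig) can be sketched in a few lines of Python. 
This is only an illustration, not part of MIRA or gap4: the function names, 
the read-to-contig dictionary and the list of pairs are hypothetical inputs 
you would have to extract from your own assembly output.

```python
from collections import defaultdict


def find_split_pairs(read_to_contig, pairs):
    """Return the pairs whose two reads were placed in different contigs.

    read_to_contig: dict mapping read name -> contig name (hypothetical input)
    pairs: iterable of (forward_read, reverse_read) names (hypothetical input)
    """
    split = []
    for fwd, rev in pairs:
        c_fwd = read_to_contig.get(fwd)
        c_rev = read_to_contig.get(rev)
        # Skip pairs where one partner was not assembled at all.
        if c_fwd is None or c_rev is None:
            continue
        if c_fwd != c_rev:
            split.append((fwd, c_fwd, rev, c_rev))
    return split


def count_by_contig_pair(split):
    """Count split pairs per unordered pair of contigs."""
    counts = defaultdict(int)
    for _, c1, _, c2 in split:
        counts[tuple(sorted((c1, c2)))] += 1
    return dict(counts)


if __name__ == "__main__":
    placement = {"r1.f": "c1", "r1.r": "c1", "r2.f": "c1", "r2.r": "c2"}
    pairs = [("r1.f", "r1.r"), ("r2.f", "r2.r")]
    print(find_split_pairs(placement, pairs))
```

Note that a split pair is only a hint: partners near contig ends legitimately 
land in neighbouring contigs, so a misassembly is more likely when the split 
reads sit well inside a contig.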

At the moment, your best option for the assembly is, as Jan suggested, the 
-highlyrepetitive switch. It can help a lot there. One of the major helper 
flags that get set by -highlyrepetitive is the option to mask nasty repeats 
during skim (-SK:mnr). You might also want to adjust -SK:rt, as the current 
default value of "8" could be a bit too harsh (try 4 first; if there are still 
too many misassemblies, increase in steps of 2).
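
The "try 4 first, then increase in steps of 2" schedule can be written down 
as a small Python sketch that only builds the argument lists and does not run 
anything. Several things here are assumptions rather than documented usage: 
the project name is hypothetical, the upper bound of 12 is arbitrary, the 
`-SK:rt=<value>` spelling is assumed, and any other options your run needs 
(job type, input format) are left out.

```python
def rt_candidates(start=4, step=2, stop=12):
    """Values of -SK:rt to try in order: start, start+step, ... up to stop.

    The upper bound `stop` is an arbitrary assumption, not a MIRA limit.
    """
    return list(range(start, stop + 1, step))


def mira_args(rt, project="myproject"):
    """Build an argument list for one trial run.

    `project` is a hypothetical name and the `-SK:rt=<value>` spelling is an
    assumption; check both against the documentation of your MIRA version.
    """
    return ["mira", f"-project={project}", "-highlyrepetitive", f"-SK:rt={rt}"]


if __name__ == "__main__":
    for rt in rt_candidates():
        # Print each candidate command so it can be reviewed before running.
        print(" ".join(mira_args(rt)))
```

Checking the result in gap4 after each trial, and stopping at the first value 
that gives an acceptable assembly, keeps the number of runs small.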

There's another thing you could do: if you find such cases of obvious 
misassembly in gap4, mark the bases of *all* reads in the column that shows a 
misassembly with the tag "SRMr". Then, once done, convert the gap database 
back to CAF and use this as input for a de-novo assembly (switching off all 
clippings this time). MIRA will use the newly marked bases as repeat markers 
and won't make the same mistake.

Regards,
  Bastien



