[mira_talk] Re: minimum number of reads to join contigs

  • From: Bastien Chevreux <bach@xxxxxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Thu, 10 Jun 2010 18:34:23 +0200

On Thursday 10 June 2010 Davide Scaglione wrote:
> Is there any way to tell MIRA to join reads/build contigs only if a certain
>  number of reads is producing the join?

No-can-do.

> To give an example from my dataset: I'm assembling 1500000 454 ESTs;
>  especially for very large contigs, there are chimeric reads that wrongly
>  join two different big piled-up chunks of reads coming from different
>  genes. And this is bad, for annotation and for everything else. Say,
>  a big contig with 25x coverage, another contig with 25x coverage,
>  joined by only one read in the middle... of course an NCBI blastx reveals
>  that it's a misassembly.

There may be a way out of the situation, but it probably comes with a loss 
of data: chimera detection.

You have this:

r1 xxxxxxxxxxxxxxxx
r2 xxxxxxxxxxxxxxxxx
r3 xxxxxxxxxxxxxxxxx
r4 xxxxxxxxxxxxxxxxxxooooooooooooo
r5                     ooooooooooo
r6                     ooooooooooo
r7                       ooooooooo

with r4 being the chimera. The chimera detection in MIRA works by searching 
for sequence stretches which are not covered by overlaps. If you now use the 
chimera detection of MIRA, it will almost certainly flag r4 as a chimera and 
only use a part of it (x or o, depending on which part is longer). There is 
always a chance that r4 is a valid read though, but that's a risk to take.
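
To make the idea concrete, here is a minimal Python sketch of "look for 
stretches not covered by overlaps". It is not MIRA's actual code; the function 
names and the representation of overlaps as (start, end) intervals in read 
coordinates are assumptions for illustration. The coordinates for r4 are read 
off the diagram above.

```python
def uncovered_stretches(read_len, overlaps):
    """Return half-open (start, end) stretches of a read that no overlap covers.

    overlaps: (start, end) half-open intervals in read coordinates, one per
    overlapping partner read (hypothetical representation).
    """
    gaps = []
    pos = 0  # everything left of pos is already known to be covered
    for start, end in sorted(overlaps):
        if start > pos:
            gaps.append((pos, start))
        pos = max(pos, end)
    if pos < read_len:
        gaps.append((pos, read_len))
    return gaps


def trim_if_chimeric(read_len, overlaps):
    """Return the (start, end) part of the read to keep.

    The read is treated as chimeric if an uncovered stretch lies strictly
    inside it. Only the first such stretch is considered, and the longer
    side is kept.
    """
    for gap_start, gap_end in uncovered_stretches(read_len, overlaps):
        if gap_start > 0 and gap_end < read_len:
            left = (0, gap_start)
            right = (gap_end, read_len)
            if left[1] - left[0] >= right[1] - right[0]:
                return left
            return right
    return (0, read_len)


# r4 from the diagram: 18 x followed by 13 o, so 31 bases in total.
r4_overlaps = [
    (0, 16),   # r1
    (0, 17),   # r2
    (0, 17),   # r3
    (20, 31),  # r5
    (20, 31),  # r6
    (22, 31),  # r7
]
print(uncovered_stretches(31, r4_overlaps))
print(trim_if_chimeric(31, r4_overlaps))
```

In this toy model, nothing overlaps r4 across the x/o junction, so the read is 
cut there and the longer x part is kept.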

Now, that would be totally fine if one did not have to account for lowly 
expressed genes. Imagine this situation:

s1 xxxxxxxxxxxxxxxxx
s2         xxxxxxxxxxxxxxxxxxxxxxxxx
s3                          xxxxxxxxxxxxxxx

Look at s2; from an overlap perspective, s2 could also very well be a chimera, 
leading to a break of an otherwise perfectly valid contig. This is why chimera 
detection is switched off by default in MIRA.

> Because setting only a fixed integer as parameter might be a problem for
>  low-coverage contigs/regions, an idea could be to set a drop-in-coverage
>  threshold at which MIRA splits contigs, e.g. in regions where the coverage
>  drops under a certain percentage of the average of the contig (or better,
>  of the previous window of, let's say, 50 bp).

A similar idea has been on my TODO list for quite some time, but I never got 
around to investigating it further, sorry.
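
For illustration only, the windowed variant of the idea could look like the 
Python sketch below. This is not a MIRA feature; the function name, the 
per-base coverage list as input and the `min_fraction` threshold are all 
assumptions. Note that it only compares against the window to the left, so it 
sees a drop when approaching from that side.

```python
def coverage_drops(coverage, window=50, min_fraction=0.2):
    """Return positions whose coverage is below min_fraction of the mean
    coverage of the preceding `window` positions.

    coverage: per-base coverage of one contig (hypothetical input).
    """
    drops = []
    if len(coverage) <= window:
        return drops
    window_sum = sum(coverage[:window])
    for pos in range(window, len(coverage)):
        mean_prev = window_sum / window
        if coverage[pos] < min_fraction * mean_prev:
            drops.append(pos)
        # slide the window one base to the right
        window_sum += coverage[pos] - coverage[pos - window]
    return drops


# Two 25x piles joined by a 5 bp stretch covered by a single read.
joined = [25] * 100 + [1] * 5 + [25] * 100
print(coverage_drops(joined))

# A lowly expressed gene: coverage is low everywhere, but there is no drop.
low = [3] * 60 + [2] * 60
print(coverage_drops(low))
```

Because the threshold is relative to the local coverage, the low-coverage 
contig is left alone while the single-read bridge is reported.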

At the moment, the only thing you can do is to write a parser that searches 
for this kind of thing in a contig, extract the corresponding reads and 
reassemble them with chimera detection switched on.
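
A rough Python sketch of the search step, under the assumption that you have 
already pulled the read placements out of the assembly result into 
(name, start, end) tuples; reading them from the actual output files is left 
out here. The function name and the `max_bridge` / `min_flank` thresholds are 
made up for the example.

```python
def find_bridging_reads(placements, contig_len, max_bridge=1, min_flank=10):
    """Find thin spots flanked by well-covered regions, and the reads on them.

    placements: (name, start, end) tuples, half-open contig coordinates
                (hypothetical input format).
    Returns (thin_positions, sorted names of reads covering a thin position).
    """
    # Per-base coverage from a difference array.
    diff = [0] * (contig_len + 1)
    for _name, start, end in placements:
        diff[start] += 1
        diff[end] -= 1
    coverage, depth = [], 0
    for delta in diff[:contig_len]:
        depth += delta
        coverage.append(depth)

    # Highest coverage seen strictly left / strictly right of each position.
    best_left = [0] * contig_len
    for p in range(1, contig_len):
        best_left[p] = max(best_left[p - 1], coverage[p - 1])
    best_right = [0] * contig_len
    for p in range(contig_len - 2, -1, -1):
        best_right[p] = max(best_right[p + 1], coverage[p + 1])

    thin = [
        p for p in range(contig_len)
        if 0 < coverage[p] <= max_bridge
        and best_left[p] >= min_flank
        and best_right[p] >= min_flank
    ]
    bridging = sorted({
        name for name, start, end in placements
        if any(start <= p < end for p in thin)
    })
    return thin, bridging


# Two 25x piles from different genes, joined by one read in the middle.
placements = (
    [("geneA_%d" % i, 0, 100) for i in range(25)]
    + [("geneB_%d" % i, 110, 210) for i in range(25)]
    + [("chimera", 80, 130)]
)
thin, bridging = find_bridging_reads(placements, contig_len=210)
print(thin[0], thin[-1], bridging)
```

The reads of a contig flagged this way are the ones you would then extract and 
reassemble with chimera detection switched on.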

Regards,
  Bastien

