[mira_talk] Re: minimum number of reads to join contigs

  • From: "Davide Scaglione" <gianza@xxxxxxxxxx>
  • To: <mira_talk@xxxxxxxxxxxxx>
  • Date: Thu, 10 Jun 2010 19:00:11 +0200

Thanks a lot for your detailed and helpful reply!
Just to double-check: when you talk about chimera detection, are you referring to "spoiler detection" (-AS:sd)?

In that case I think I will proceed like this: because I'm using three different strains, I will assemble each strain separately with spoiler detection switched on. Then I will assemble the resulting contigs together with spoiler detection off, assuming that the contigs are good and real ones and thus eligible to join with whatever they match.

Is it a good plan?

Regards

Davide

PS: I swear, as soon as I get a final assembly I will stop bothering you!

--------------------------------------------------
From: "Bastien Chevreux" <bach@xxxxxxxxxxxx>
Sent: Thursday, June 10, 2010 6:34 PM
To: <mira_talk@xxxxxxxxxxxxx>
Subject: [mira_talk] Re: minimum number of reads to join contigs

On Thursday 10 June 2010 Davide Scaglione wrote:
> Is there any way to tell MIRA to join reads/build contigs only if a certain
> number of reads supports the join?

No-can-do.

> To give an example from my dataset: I'm assembling 1,500,000 454 ESTs.
> Especially in very large contigs, there are chimeric reads that wrongly
> join two different big piled-up chunks of reads coming from different
> genes. And this is bad, for annotation and for everything else. Say a big
> contig with 25x coverage and another contig with 25x coverage are joined
> by only one read in the middle: of course an NCBI blastx reveals that
> it's a misassembly.

There may be a way out of the situation, but it probably comes with a loss
of data: chimera detection.

You have this:

r1 xxxxxxxxxxxxxxxx
r2 xxxxxxxxxxxxxxxxx
r3 xxxxxxxxxxxxxxxxx
r4 xxxxxxxxxxxxxxxxxxooooooooooooo
r5                     ooooooooooo
r6                     ooooooooooo
r7                       ooooooooo

with r4 being the chimera. The chimera detection in MIRA works by searching for
sequence stretches which are not covered by overlaps. If you now use the
chimera detection of MIRA, it will almost certainly flag r4 as a chimera and
only use a part of it (x or o, depending on which part is longer). There is
always a chance that r4 is a valid read though, but that's a risk to take.

Now, that would be totally fine if one did not have to account for lowly
expressed genes. Imagine this situation:

s1 xxxxxxxxxxxxxxxxx
s2         xxxxxxxxxxxxxxxxxxxxxxxxx
s3                          xxxxxxxxxxxxxxx

Look at s2: no other read spans its middle, so from an overlap perspective s2
could just as well be a chimera, leading to a break of an otherwise perfectly
valid contig. This is why chimera detection is switched off by default in MIRA.
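
To make the two pictures above concrete, here is a minimal Python sketch of the "stretch not covered by overlaps" idea. It is not MIRA's actual code: the function names are made up for illustration, overlaps are given as half-open intervals on the read, and the coordinates are read off the ASCII diagrams only approximately.

```python
from typing import List, Optional, Tuple

Interval = Tuple[int, int]  # half-open [start, end) in read coordinates


def uncovered_stretches(read_len: int, overlaps: List[Interval]) -> List[Interval]:
    """Return the parts of a read that no overlap with another read covers."""
    gaps = []
    pos = 0  # end of the region covered so far
    for start, end in sorted(overlaps):
        if start > pos:
            gaps.append((pos, start))
        pos = max(pos, end)
    if pos < read_len:
        gaps.append((pos, read_len))
    return gaps


def chimera_trim(read_len: int, overlaps: List[Interval]) -> Optional[Interval]:
    """If the read has an interior uncovered stretch, return the longer
    flank to keep; return None if the read does not look chimeric.
    Simplification: only the first interior gap is considered."""
    interior = [(s, e) for s, e in uncovered_stretches(read_len, overlaps)
                if s > 0 and e < read_len]
    if not interior:
        return None
    s, e = interior[0]
    left, right = (0, s), (e, read_len)
    return left if s >= read_len - e else right


# r4: overlaps with r1..r3 on the x part, with r5..r7 on the o part
r4 = chimera_trim(31, [(0, 16), (0, 17), (0, 17), (20, 31), (20, 31), (22, 31)])
# s2: overlaps with s1 at its start and with s3 at its end
s2 = chimera_trim(25, [(0, 9), (17, 25)])
print(r4, s2)
```

Both calls return a trimmed interval, so under this simple criterion the valid low-coverage read s2 gets flagged exactly like the real chimera r4, which is the trade-off described above.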

> Because setting only a fixed integer as a parameter might be a problem for
> low-coverage contigs/regions, an idea could be to set a coverage-drop
> threshold at which MIRA splits contigs, e.g. in regions where the coverage
> drops below a certain percentage of the average of the contig (or better,
> of the previous window of, let's say, 50 bp).

A similar idea has been on my TODO list for quite some time, but I never got
around to investigating it further, sorry.

At the moment, the only thing you can do is to write a parser that searches
for these kinds of things in a contig, extract the corresponding reads and
reassemble them with chimera detection switched on.
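
As a rough starting point for such a parser, the coverage-drop test suggested in the quoted text could be sketched in Python as below. This assumes you have already extracted a per-base coverage list for a contig from the assembly output (that step is not shown). The function name and the 20% threshold are hypothetical; only the 50 bp window comes from the suggestion above.

```python
from typing import List


def coverage_drops(coverage: List[int], window: int = 50,
                   min_fraction: float = 0.2) -> List[int]:
    """Return positions whose coverage is below min_fraction of the mean
    coverage of the preceding `window` positions."""
    hits = []
    running = sum(coverage[:window])  # sum over the current preceding window
    for i in range(window, len(coverage)):
        mean_prev = running / window
        if coverage[i] < min_fraction * mean_prev:
            hits.append(i)
        # slide the window one position to the right
        running += coverage[i] - coverage[i - window]
    return hits


# Two 25x blocks joined by a 10 bp stretch covered by a single read
suspect = coverage_drops([25] * 100 + [1] * 10 + [25] * 100)
# A uniformly low-coverage contig is not flagged
low = coverage_drops([2] * 200)
print(suspect[0], suspect[-1], low)
```

Because the threshold is relative to the local window rather than a fixed read count, a contig that is thinly covered throughout is left alone, while a single-read bridge between two deep pile-ups stands out. The reads covering the flagged positions would then be the candidates to extract and reassemble.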

Regards,
 Bastien

--
You have received this mail because you are subscribed to the mira_talk mailing list. For information on how to subscribe or unsubscribe, please visit http://www.chevreux.org/mira_mailinglists.html

