[mira_talk] Re: TGICL and Mira

On Tuesday 25 November 2008 12:18, Laurent MANCHON wrote:
> I have made clustering with TGICL (TIGR software assembler) and MIRA on
> the same input of data (276 000 ESTs)
> and the results are very different, with TGICL i obtain 76000 contigs
> and 57533 singletons and for MIRA i obtain 110000 contigs and i don't
> know where singletons are stored.

Hi Laurent,

Earlier versions of MIRA stored singlets (named "_s") together with the contigs 
("_c"). Since one of the early 2.9.x versions, singlets are by default only 
listed by name in the "debris" file of the info directory, but this can be 
changed with -OUT:sssip (beware, there was a bugfix in 2.9.29x5 that restored 
the intended behaviour).

> Why this difference ?

This is a bit difficult to diagnose from afar, but this is a recurring 
question ("why do assemblies differ?"). It was frequent enough for me to 
document an exemplary case for EST data, please have a look at:
      http://www.chevreux.org/mira_ex_est.html

The cases shown there are normally clustered together by other assemblers, 
even when using stringent conditions such as the 98% identity you set.

There may be, of course, quite a number of other possibilities, but I'd first 
investigate this kind of thing. One thing you could try is clustering the 
contigs from MIRA, and then having a look at what TGICL did for the resulting 
clusters.
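
A related check can be sketched in a few lines of Python. This is only a 
rough illustration: it assumes you have already extracted, for each read, the 
name of the TGICL contig and of the MIRA contig it ended up in (how to get 
these from the respective output files is not covered here), and all contig 
and read names below are made up. It lists the TGICL contigs whose reads MIRA 
spread over more than one contig.

```python
from collections import defaultdict

def tgicl_contigs_split_by_mira(read_to_tgicl, read_to_mira):
    """Return {tgicl_contig: sorted list of MIRA contigs} for every TGICL
    contig whose reads are spread over more than one MIRA contig.

    Both arguments are plain dicts mapping read name -> contig name.
    Reads missing from read_to_mira (e.g. debris) are ignored.
    """
    mira_ids = defaultdict(set)
    for read, tgicl_id in read_to_tgicl.items():
        mira_id = read_to_mira.get(read)
        if mira_id is not None:
            mira_ids[tgicl_id].add(mira_id)
    # Keep only the TGICL contigs that correspond to several MIRA contigs.
    return {t: sorted(m) for t, m in mira_ids.items() if len(m) > 1}

# Hypothetical example data.
read_to_tgicl = {"r1": "CL1Contig1", "r2": "CL1Contig1", "r3": "CL2Contig1"}
read_to_mira = {"r1": "oyster_c1", "r2": "oyster_c2", "r3": "oyster_c3"}
split = tgicl_contigs_split_by_mira(read_to_tgicl, read_to_mira)
```

The contigs reported this way are the ones worth inspecting by eye for the 
mismatch patterns discussed in this thread.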

> command i use with TGICL: tgicl oysterdb -l 60 -v 20 -p98
> command i use with MIRA: mira -project=oyster -fasta -estmode -AL:mrs=98

Setting the required identity to 98% has good and bad effects. For transcripts 
that reach a high enough coverage (which can be defined via -CO:mrpg), MIRA 
will automatically disassemble contigs which show "suspicious" data. 
"Suspicious" means all kinds of mismatch patterns that point to either 
splice variants, ploidy differences or transcripts from closely related 
sequences which are just a tad different: 1 base per read is enough, and 
niceties like splice variants (or ploidy differences or repeats) with a 
one-codon difference are also split apart:

.............***.................
.............***.................
.............***.................
.............***.................
.................................
.................................
.................................
.................................

So, -AL:mrs=98 will have no effect on contigs MIRA splits anyway.
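
To make the picture above a bit more concrete, here is a toy Python sketch of 
the idea. It is not MIRA's actual algorithm, just an illustration under 
simple assumptions: reads are already aligned to equal length, and a column 
counts as "suspicious" when at least two different bases are each supported 
by min_reads_per_group reads (a hypothetical parameter standing in for the 
idea behind -CO:mrpg).

```python
from collections import Counter, defaultdict

def split_by_discordant_columns(reads, min_reads_per_group=2):
    """Toy illustration only, NOT MIRA's real routine.

    reads: dict mapping read name -> aligned sequence (all the same length).
    Returns (discordant_columns, groups), where groups maps the bases seen
    at the discordant columns to the list of reads carrying them.
    """
    if not reads:
        return [], {}
    names = list(reads)
    length = len(reads[names[0]])
    discordant = []
    for col in range(length):
        counts = Counter(reads[n][col] for n in names)
        # A column is suspicious if two or more bases are well supported.
        supported = [b for b, c in counts.items() if c >= min_reads_per_group]
        if len(supported) >= 2:
            discordant.append(col)
    groups = defaultdict(list)
    for n in names:
        key = "".join(reads[n][c] for c in discordant)
        groups[key].append(n)
    return discordant, dict(groups)

# Eight made-up reads: four per variant, differing in three adjacent columns.
variant_a = "ACGTGATTACA"
variant_b = "ACGTCCCTACA"
reads = {f"r{i}": variant_a for i in range(1, 5)}
reads.update({f"r{i}": variant_b for i in range(5, 9)})
columns, groups = split_by_discordant_columns(reads)

# With only one read per variant nothing is flagged: coverage is too low.
low_cov = {"r1": variant_a, "r5": variant_b}
low_columns, low_groups = split_by_discordant_columns(low_cov)
```

In the eight-read case the three differing columns are flagged and the reads 
fall into two groups, independent of any identity threshold. In the two-read 
case no column has enough support, so the reads stay together.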

For really low coverage data, things are a bit different. There, the 98% 
identity will help to split apart a few of the cases which could not be 
caught due to low coverage. That's the good side.

The bad side: this also prevents assembly of reads which have a slightly 
higher error rate but would normally cause no harm (because their errors are 
outweighed by the alignment coverage and quality values of other reads).
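
A small worked example of that trade-off, as a Python sketch. It uses plain 
percent identity over an ungapped overlap, which is an assumption made for 
illustration; how MIRA actually scores an alignment for -AL:mrs is not 
described in this mail.

```python
def percent_identity(a, b):
    """Percent of identical positions between two equally long sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    if not a:
        raise ValueError("sequences must not be empty")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)

def passes_threshold(a, b, min_identity=98.0):
    """True if the overlap reaches the required identity (illustrative)."""
    return percent_identity(a, b) >= min_identity

# A made-up 500-base overlap with 10 and with 11 sequencing errors.
reference = "A" * 500
ten_errors = "C" * 10 + "A" * 490      # 98.0% identity
eleven_errors = "C" * 11 + "A" * 489   # 97.8% identity

accepted = passes_threshold(reference, ten_errors)
rejected = passes_threshold(reference, eleven_errors)
```

So a single additional error in a 500-base overlap is enough to push an 
otherwise harmless read below the 98% cutoff.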

So, using the -AL:mrs trick is a matter of taste.

> The both programs take 1 day of computation on server with 4 opterons
> processors and 64 GB memory, maybe some specific parameters can increase
> the speed of the computation.

You are using an older version of MIRA (either one of the early 2.9.x series 
or even MIRA 2.8.3). I can tell from the "-estmode" parameter, which has been 
retired for quite a while now.

You might want to test out the newest release from the 2.9 series, it should 
be faster and have better EST routines :-)


Hope it helps,
  Bastien

-- 
You have received this mail because you are subscribed to the mira_talk mailing 
list. For information on how to subscribe or unsubscribe, please visit 
http://www.chevreux.org/mira_mailinglists.html
