[mira_talk] Re: TGICL and Mira

Bastien Chevreux wrote:
On Tuesday 25 November 2008 12:18, Laurent MANCHON wrote:
I have clustered the same input data (276,000 ESTs) with TGICL (the TIGR
software assembler) and with MIRA, and the results are very different:
with TGICL I obtain 76,000 contigs and 57,533 singletons, and with MIRA
I obtain 110,000 contigs, and I don't know where the singletons are
stored.

Hi Laurent,

Earlier versions of MIRA stored singlets (named "_s") together with the contigs ("_c"). Since one of the early 2.9.x versions, singlets are by default only listed in the "debris" file of the info directory, but this can be changed with -OUT:sssip (beware, there was a bugfix in 2.9.29x5 that restored the intended behaviour).

Why this difference ?

This is a bit difficult to diagnose from afar, but this is a recurring question ("why do assemblies differ?"). It was frequent enough for me to document an exemplary case for EST data, please have a look at:
      http://www.chevreux.org/mira_ex_est.html

Those cases are normally clustered together by other assemblers, even when using most stringent conditions of 98% identity as you did.

There may be, of course, quite a number of other possibilities, but I'd first investigate this kind of thing. One thing you could try is clustering the contigs from MIRA with TGICL, and then have a look at what TGICL did with the resulting clusters.

command i use with TGICL: tgicl oysterdb -l 60 -v 20 -p98
command i use with MIRA: mira -project=oyster -fasta -estmode -AL:mrs=98

Setting the required identity to 98% has good and bad effects. For transcripts that reach a high enough coverage (which can be defined via -CO:mrpg), MIRA will automatically disassemble contigs which show "suspicious" data. "Suspicious" means all kinds of mismatch patterns that point to splice variants, ploidy differences or transcripts from closely related sequences which are just a tad different. A difference of 1 base per read is enough, and niceties like splice variants (or ploidy differences or repeats) with a one-codon difference are also split apart:

.............***.................
.............***.................
.............***.................
.............***.................
.................................
.................................
.................................
.................................

So, -AL:mrs=98 will have no effect on contigs MIRA splits anyway.
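
To make the pattern above concrete, here is a minimal Python sketch. It is not MIRA's actual algorithm, only an illustration under simple assumptions: the reads are already aligned and of equal length, and a column counts as "suspicious" when at least two different characters are each supported by several reads. The function name suspicious_columns and the min_group parameter are made up for this example.

```python
from collections import Counter

def suspicious_columns(reads, min_group=2):
    """Return the alignment columns where at least two different
    characters are each supported by min_group or more reads.

    Hypothetical helper for illustration only; assumes all reads are
    already aligned and have the same length.
    """
    columns = []
    for index, column in enumerate(zip(*reads)):
        counts = Counter(column)
        # Characters backed by enough reads to be more than a random error.
        supported = [base for base, n in counts.items() if n >= min_group]
        if len(supported) >= 2:
            columns.append(index)
    return columns

# Two groups of reads that differ consistently at two positions.
variant_reads = [
    "ACGTTTGACCA",
    "ACGTTTGACCA",
    "ACGTCAGACCA",
    "ACGTCAGACCA",
]

# A single read with one isolated error: not enough support to be flagged.
noisy_reads = [
    "ACGTTTGACCA",
    "ACGTTTGACCA",
    "ACGTTTGACCA",
    "ACGTCTGACCA",
]

print(suspicious_columns(variant_reads))
print(suspicious_columns(noisy_reads))
```

The first read set is flagged at the two columns where both groups disagree consistently; the second is not flagged at all, because an isolated mismatch in one read looks like a sequencing error rather than a variant. This also shows why coverage matters: with too few reads, neither group reaches the support needed to be recognised.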

For really low coverage data, things are a bit different. There, the 98% identity will help to split apart a few of the cases which could not be caught due to low coverage. That's the good side.

The bad side: this also prevents assembly of reads which have a slightly higher error rate but would normally cause no harm (due to the errors being overthrown by alignment coverage and quality values of other reads).
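
The trade-off can be sketched in a few lines of Python. This is only an illustration: it uses plain percent identity over an ungapped overlap, whereas the way MIRA actually computes the value compared against -AL:mrs may differ. The names percent_identity and passes_threshold are hypothetical.

```python
def percent_identity(a, b):
    """Percentage of identical positions between two aligned,
    equal-length overlap strings (simplified: no gap handling)."""
    if len(a) != len(b):
        raise ValueError("overlap strings must have the same length")
    if not a:
        raise ValueError("overlap must not be empty")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)

def passes_threshold(a, b, min_identity):
    """True if the overlap reaches the required identity (in percent)."""
    return percent_identity(a, b) >= min_identity

# A 100-base overlap in which one read carries 3 sequencing errors.
clean = "ACGT" * 25
noisy = list(clean)
for position in (10, 50, 90):
    noisy[position] = "T"  # the clean read has "G" at these positions
noisy = "".join(noisy)

print(percent_identity(clean, noisy))      # 3 mismatches in 100 bases
print(passes_threshold(clean, noisy, 98))  # rejected at 98%
print(passes_threshold(clean, noisy, 95))  # accepted at a looser setting
```

With a 98% requirement, the read with three errors in a 100-base overlap is kept out of the contig, although at higher coverage those errors would simply have been outvoted by the other reads.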

So, using the -AL:mrs trick is a matter of taste.

Both programs take 1 day of computation on a server with 4 Opteron
processors and 64 GB of memory; maybe some specific parameters can
increase the speed of the computation.

You are using an older version of MIRA (either one of the early 2.9.x series or even MIRA 2.8.3). I can tell from the "-estmode" parameter, which has been retired for quite a while now.

You might want to test out the newest release from the 2.9 series, it should be faster and have better EST routines :-)


Hope it helps,
  Bastien

Okay, thank you for this help.
Just one more thing: in TGICL I used the parameters -l 60 -v 20
(with l: minimum overlap length (default 30) and v: maximum length of
unmatched overhangs (default 30)).
What are the corresponding parameters I have to use in MIRA?

regards,
Laurent

--
+---------------------------------------------+
Laurent Manchon Email: lmanchon@xxxxxxxxxxxxxx
+---------------------------------------------+
