[mira_talk] Re: TGICL and Mira
- From: Laurent MANCHON <lmanchon@xxxxxxxxxxxxxx>
- To: mira_talk@xxxxxxxxxxxxx
- Date: Wed, 26 Nov 2008 09:40:09 +0100
Bastien Chevreux wrote:
On Tuesday 25 November 2008 12:18, Laurent MANCHON wrote:
I have clustered the same input data (276 000 ESTs) with TGICL (the TIGR
assembler) and with MIRA,
and the results are very different: with TGICL I obtain 76000 contigs
and 57533 singletons, and with MIRA I obtain 110000 contigs and I don't
know where the singletons are stored.
Hi Laurent,
Earlier versions of MIRA stored singlets (named "_s") together with the contigs
("_c"). Since one of the early 2.9.x versions, singlets are by default just
named in the "debris" file of the info directory, but this can be changed
with -OUT:sssip (beware, there was a bugfix in 2.9.29x5 that restored the
intended behaviour).
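
To get a number comparable with TGICL's singleton count, one could count the entries of the debris file. A minimal Python sketch, assuming the file lists one read name per line; the function name and the path in the usage comment are hypothetical, so check the info directory of your own run for the real file name:

```python
from pathlib import Path


def count_debris_reads(debris_path):
    """Count non-empty lines, assuming one read name per line."""
    with Path(debris_path).open() as handle:
        return sum(1 for line in handle if line.strip())


# Hypothetical usage; the real file name depends on the MIRA version:
# print(count_debris_reads("path/to/info_directory/debris_file"))
```

The count covers whatever MIRA lists in that file, so treat it as an upper bound on the singlets rather than an exact figure.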
Why this difference ?
This is a bit difficult to diagnose from afar, but this is a recurring
question ("why do assemblies differ?"). It was frequent enough for me to
document an exemplary case for EST data, please have a look at:
http://www.chevreux.org/mira_ex_est.html
Those cases are normally clustered together by other assemblers, even when
using the most stringent conditions of 98% identity, as you did.
There may be, of course, quite a number of other possibilities, but I'd first
investigate this kind of thing. One thing you could try is clustering the
contigs from MIRA, and then have a look at what TGICL did for the resulting
clusters.
Command I use with TGICL: tgicl oysterdb -l 60 -v 20 -p98
Command I use with MIRA: mira -project=oyster -fasta -estmode -AL:mrs=98
Setting the required identity to 98% has good and bad effects. For transcripts
that reach a high enough coverage (which can be defined via -CO:mrpg), MIRA
will automatically disassemble contigs which show "suspicious"
data. "Suspicious" being all kinds of mismatch patterns that point to either
splice variants, ploidy differences or transcripts from closely related
sequences which are just a tad different: 1 base per read is enough, and also
niceties like splice variants (or ploidy differences or repeats) with one
codon difference are split apart:
.............***.................
.............***.................
.............***.................
.............***.................
.................................
.................................
.................................
.................................
So, -AL:mrs=98 will have no effect on contigs MIRA splits anyway.
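
A quick calculation shows why an identity cutoff alone cannot separate such variants. This is only an illustrative sketch: the 500-base overlap is an assumed figure, the function name is made up, and the formula is a plain ungapped percent identity, not MIRA's actual scoring:

```python
def percent_identity(overlap_length, mismatches):
    """Percentage of matching positions in an ungapped overlap."""
    if overlap_length <= 0:
        raise ValueError("overlap_length must be positive")
    return 100.0 * (overlap_length - mismatches) / overlap_length


# Assumed example: two reads overlap over 500 bases and differ in one codon.
identity = percent_identity(overlap_length=500, mismatches=3)
threshold = 98.0  # the identity cutoff used in this thread
print(f"identity = {identity:.1f}%, passes cutoff: {identity >= threshold}")
```

Under these assumptions the two variants are well above 98% identical, so a clusterer that looks only at identity would merge them.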
For really low coverage data, things are a bit different. There, the 98%
identity will help to split apart a few of the cases which could not be
caught due to low coverage. That's the good side.
The bad side: this also prevents assembly of reads which have a slightly
higher error rate but would normally cause no harm (due to the errors being
overruled by the alignment coverage and quality values of other reads).
So, using the -AL:mrs trick is a matter of taste.
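
The point about coverage masking the errors of a single read can be illustrated with a simple per-column majority vote. This is a deliberately simplified sketch, not MIRA's consensus algorithm: it ignores quality values, assumes the reads are already aligned to equal length, and uses made-up sequences:

```python
from collections import Counter


def column_consensus(aligned_reads):
    """Majority base per column for equal-length, pre-aligned reads."""
    if not aligned_reads:
        return ""
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("reads must already be aligned to equal length")
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*aligned_reads)
    )


reads = [
    "ACGTACGTAC",
    "ACGTACGTAC",
    "ACGTACGTAC",
    "ACGAACGTCC",  # made-up read with two sequencing errors
]
print(column_consensus(reads))
```

The fourth read is only 80% identical to the others, yet the consensus is unaffected; a strict identity cutoff would exclude such a read although it does no harm here.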
Both programs take 1 day of computation on a server with 4 Opteron
processors and 64 GB of memory; maybe some specific parameters could
speed up the computation.
You are using an older version of MIRA (either one of the early 2.9.x series
or even MIRA 2.8.3); I can tell from the "-estmode" parameter, which has been
retired for quite a while now.
You might want to test out the newest release from the 2.9 series, it should
be faster and have better EST routines :-)
Hope it helps,
Bastien
Okay, thank you for this help.
Just one more thing: in TGICL I used the parameters -l 60 -v 20
(-l: minimum overlap length (default 30); -v: maximum length of
unmatched overhangs (default 30)).
Which parameters correspond to these in MIRA?
regards,
Laurent
--
+---------------------------------------------+
Laurent Manchon
Email: lmanchon@xxxxxxxxxxxxxx
+---------------------------------------------+