[mira_talk] Re: assembly options for non-redundant contigs

  • From: Sven Klages <sir.svencelot@xxxxxxxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Thu, 4 Jun 2009 08:47:07 +0200

Just to comment on TGICL,

it was not designed to work with 454 data, and in our hands the
clustering itself segfaults (AFAIR) on big 454 datasets, mainly when
there are very deep clusters. It is distributed as 32-bit binaries; the
author is not sure whether everything compiles fine on 64-bit ...
There is no further development.

When run just on the clusters, cap3 only consumes a lot of memory when you
deal with very deep clusters;
we had a big, bad cDNA lib where >150,000 reads formed one
cluster! That was killing one of
the TGICL programs, the sorter I think.

Cap3 is still some kind of standard assembler. If you want/need to cluster
your data before assembly
you could also try 'wcdest' (http://code.google.com/p/wcdest/), which
performs quite well in our
hands. It uses pthreads or MPI for parallelisation.
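
Since very deep clusters are what hurt cap3 and the TGICL sorter, one
could screen cluster sizes before handing clusters to the assembler. A
minimal Python sketch, assuming a plain-text cluster file with one cluster
per line and whitespace-separated read names (an optional trailing '.' is
stripped). The file format, the function names and the threshold are
assumptions for illustration only, so check what your clustering tool
actually writes:

```python
from typing import Iterable, List, Tuple


def parse_clusters(lines: Iterable[str]) -> List[List[str]]:
    """Parse one cluster per line; members are whitespace-separated read names.

    Assumed format (hypothetical): a trailing '.' on a line is tolerated
    and stripped, and empty lines are skipped.
    """
    clusters = []
    for line in lines:
        line = line.strip()
        if line.endswith("."):
            line = line[:-1]
        members = line.split()
        if members:
            clusters.append(members)
    return clusters


def split_by_depth(
    clusters: List[List[str]], max_reads: int
) -> Tuple[List[List[str]], List[List[str]]]:
    """Return (ok, too_deep): clusters with at most / more than max_reads reads."""
    ok = [c for c in clusters if len(c) <= max_reads]
    too_deep = [c for c in clusters if len(c) > max_reads]
    return ok, too_deep


if __name__ == "__main__":
    # Toy input; a real threshold would be far larger (e.g. tens of thousands).
    example = ["r1 r2 r3.", "r4.", "r5 r6 r7 r8 r9."]
    ok, too_deep = split_by_depth(parse_clusters(example), max_reads=3)
    print(len(ok), "clusters to assemble,", len(too_deep), "set aside")
```

Clusters that end up in the "too deep" list could then be subsampled or
assembled separately on a machine with more memory.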

Sorry for pointing to other software on a MIRA list ..

cheers,
Sven

2009/6/3 Laurent Manchon <lmanchon@xxxxxxxxxxxxxx>

> -- Richard,
>
> which parameters do you use with cap3?
> In the past I ran tests to compare results from Mira and Tgicl (TIGR
> software using cap3),
> and as you said, the results from Tgicl were very different in
> quality and quantity (fewer contigs than Mira).
> But to assemble 450000 cDNA reads cap3 needs a lot of memory, and I
> always got a segmentation fault.
> So today I use Mira, which is able to handle big inputs. Many other
> assemblers exist, and it takes too much
> time to compare them all against each other to establish which is the best.
>
>
> Laurent --
>
>
> Richard Gregory a écrit :
>
>> Hi Laurent,
>>
>> Thanks for the suggestion. Have tested your options on the pre-Titanium
>> dataset, which I'm using as the benchmark because I have an assembly using
>> V2.6.15 to compare to. Extending the table from my  previous email:
>>
>> number     total    number of
>> of reads   bases     contigs
>> 169796    2865603      8540    V2.9.15
>> 149758    6167756     24376    V2.9.43
>> 132790    6409873     25099    V2.9.43_Laurent
>>
>> Looking at contigs >500 bp, V2.9.49 with your options produced 1269
>> contigs, slightly fewer than the V2.9.49 options I was using.
>>
>> The real test as far as I'm concerned is reassembly with something else,
>> such as cap3. Do the contigs assemble, or are they kept separate? By this
>> measure V2.9.15 is easily the least redundant.
>>
>>  Mira     Mira     Cap3
>> Contigs  Contigs  Contigs
>>  In      Used      Out
>>  8540     2277      630   V2.9.15
>> 24376    17545     1167   V2.9.43
>> 25099    18261     1146   V2.9.43_Laurent
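
A quick way to put numbers on "least redundant" from the table above. This
is a minimal sketch: it assumes "Used" means the Mira contigs that cap3
merged into its output contigs, which is one reading of that column, and
the function and variable names are purely illustrative:

```python
# Figures copied from the table above:
# (label, Mira contigs in, Mira contigs used by cap3, cap3 contigs out)
ROWS = [
    ("V2.9.15", 8540, 2277, 630),
    ("V2.9.43", 24376, 17545, 1167),
    ("V2.9.43_Laurent", 25099, 18261, 1146),
]


def redundancy_summary(contigs_in, contigs_used, cap3_out):
    """Return (fraction of Mira contigs merged by cap3,
    mean number of Mira contigs per cap3 contig)."""
    return contigs_used / contigs_in, contigs_used / cap3_out


if __name__ == "__main__":
    for label, n_in, n_used, n_out in ROWS:
        frac, per_contig = redundancy_summary(n_in, n_used, n_out)
        print(f"{label:16s} merged: {frac:6.1%}   "
              f"Mira contigs per cap3 contig: {per_contig:5.1f}")
```

A lower merged fraction means fewer Mira contigs were similar enough for
cap3 to join them: roughly 27% for V2.9.15 against roughly 72-73% for the
two V2.9.43 runs.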
>>
>>
>> Richard
>>
>>
>> Laurent MANCHON wrote:
>>
>>> -- Hi Richard,
>>>
>>> this is the command line I use with 454 cDNA Titanium sequences (with Mira
>>> 2.9.43):
>>>
>>> mira -project=myproject -job=denovo,est,normal,454 -SK:mnr=yes -SK:rt=4
>>> -GE:not=2 -CO:asir=yes -CO:mr 454_SETTINGS
>>> -AL:mrs=95:egp=yes:egpl=reject_codongaps:megpp=100 -LR:mxti=no -CO:rodirs=10
>>> -AL:mo=60 -CL:cpat=yes
>>>
>>> input: 434802 sequences
>>> output: 51559 contigs and 197939 singlets
>>>
>>> Laurent --
>>>
>>>
>>>
>>> Richard Gregory a écrit :
>>>
>>>> Hi All,
>>>>
>>>> We've been using Mira for a while now; it handles cDNA much better than
>>>> Newbler and gives us more confidence in the result.
>>>>
>>>> The latest batch of data is proving to be a problem: the current project
>>>> requires contigs that contain all similar reads, rather than being split
>>>> into multiple contigs over minor (or maybe large) differences. We are
>>>> having trouble finding the options to achieve this. Does anybody know
>>>> if/how this can be done?
>>>>
>>>> The sequence data is 1.5 plates of 454 Titanium cDNA and half a plate of
>>>> pre-Titanium cDNA. We have tried many options: genome or est,
>>>> --noclippings, -SK:mmhr=1, -DP:ure=no, -AS:ard=no, -AL:egp=no,
>>>> -AS:sd=no, -CO:mr=no, -AL:mrs=55, -SK:mnr=yes, -SK:hss=1:pr=70, but the
>>>> result is basically the same. The best option set so far was -fasta
>>>> -job=denovo,est,draft,454 --noclippings -SK:mmhr=1 -DP:ure=no -AS:ard=no
>>>> -AS:sd=no -GE:not=4 -SK:hss=1:pr=70, but this was only marginally better
>>>> and doesn't achieve the desired result.
>>>>
>>>> The only clue comes from previous assemblies with earlier versions of
>>>> Mira, which produced much less redundancy, i.e. ~8000 contigs, where
>>>> V2.9.43 now produces ~18000. Mapping this onto a reference showed ~1500
>>>> contigs could be the same gene. Assembling the ~1500 contigs with cap3
>>>> produced ~3 contigs, one containing hundreds of contigs.
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Richard
>>>>
>>>>
>>>
>>>
>>>
>>
>
>
