[mira_talk] Re: eukaryotic cDNA: assembler parameterization pipeline

  • From: Laurent MANCHON <lmanchon@xxxxxxxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Thu, 04 Jun 2009 09:58:54 +0200

-- Dear Alexie,

it's worth a try.
Please send it to me at lmanchon at univ-montp2.fr

thank you,

Laurent --


Alexie Papanicolaou wrote:
Gosh, mira_talk has proven a difficult list to post to, even when subscribed (maybe a @googlemail vs @gmail thing)... Apologies if you get this more than once.

Never mind:


Dear Richard, Laurent (and anyone else interested),

I've written a short pipeline for parameterization of MIRA for cDNA data.

Essentially, it launches MIRA and Newbler with an array of parameters (from a simple config file) and saves the user from defining them many times. Once the runs finish, the script uses a BLASTX approach to identify contig quality. It's all pretty automated, so when I come to work in the morning my run has finished :-)
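
[Illustration only, not Alexie's script: a minimal Python sketch of how a sweep over an array of parameters could be expanded into one MIRA command line per combination. The switches and values are ones quoted elsewhere in this thread; the GRID layout, the build_commands helper and the numbered project names are assumptions made for the example. It only prints the commands and does not launch anything.]

```python
import itertools
import shlex

# Hypothetical sweep definition. The switches and values all appear in this
# thread; which ones to vary is an illustrative choice, not a recommendation.
GRID = {
    "-AL:mrs": ["55", "95"],
    "-AL:egp": ["yes", "no"],
}


def build_commands(grid, project="myproject"):
    """Return one argument list per combination of parameter values."""
    keys = list(grid)
    commands = []
    combos = itertools.product(*(grid[k] for k in keys))
    for i, values in enumerate(combos, start=1):
        # Give each run its own project name so outputs do not collide.
        args = ["mira", f"-project={project}_{i}", "-job=denovo,est,normal,454"]
        args += [f"{k}={v}" for k, v in zip(keys, values)]
        commands.append(args)
    return commands


if __name__ == "__main__":
    for cmd in build_commands(GRID):
        # Dry run: print each command line instead of executing it.
        print(shlex.join(cmd))
```

[Each printed line could then be handed to a job scheduler or run overnight; how the real pipeline does this is not described in the thread.]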

There is also a trim_assembly.pl script to reduce redundancy. Most of the redundancy in eukaryotic cDNA is due to the 3' UTR, which cannot be assembled due to heterozygosity (i.e. you can assemble it if you sequenced clones or a single individual). I can usually get down to 15-30K meaningful contigs.

Please do note that data preprocessing is essential for a successful assembly.

... there are also other pipelines: one to process the data directly from the sequencer and another to take the MIRA output and make an annotated dataset using BLAST, KEGG, GO, EC and InterProScan. I'm currently using all these programs to create chado databases for all insects in dbEST and in my sequencer drawer...

If people would like to try the scripts, drop me a line. It's pretty user friendly...

cheers
alexie




On Wed, 2009-06-03 at 09:31 +0200, Laurent MANCHON wrote:
-- Hi Richard,

this is the command line I use with 454 cDNA Titanium sequences (with 
Mira 2.9.43):

mira -project=myproject -job=denovo,est,normal,454 -SK:mnr=yes -SK:rt=4 
-GE:not=2 -CO:asir=yes -CO:mr 454_SETTINGS 
-AL:mrs=95:egp=yes:egpl=reject_codongaps:megpp=100 -LR:mxti=no 
-CO:rodirs=10 -AL:mo=60 -CL:cpat=yes

input: 434802 sequences
output: 51559 contigs and 197939 singlets

Laurent --
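
[Illustration only: a quick Python check of what the read counts above imply. It assumes every input sequence ends up either in a contig or as a singlet, which may not hold if MIRA discarded reads during clipping; the variable names are made up for the example.]

```python
# Counts reported in the message above.
input_reads = 434802
contigs = 51559
singlets = 197939

# Assumption: reads not reported as singlets were placed in contigs.
reads_in_contigs = input_reads - singlets
singlet_fraction = singlets / input_reads
mean_reads_per_contig = reads_in_contigs / contigs

print(f"reads in contigs:      {reads_in_contigs}")
print(f"singlet fraction:      {singlet_fraction:.1%}")
print(f"mean reads per contig: {mean_reads_per_contig:.1f}")
```

[Under that assumption, a little under half of the reads stay unassembled and the average contig holds fewer than five reads.]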



Richard Gregory wrote:
> Hi All,
>
> We have been using Mira for a while now; it handles cDNA much better than 
> Newbler and gives us more confidence in the result.
>
> The latest batch of data is proving to be a problem: the current 
> project requires contigs that contain all similar reads and are not 
> split into multiple contigs over minor (or maybe large) differences. We 
> are having trouble finding the options to achieve this. Does anybody 
> know if/how this can be done?
>
> The sequence data is 1.5 plates of 454 Titanium cDNA and half a plate 
> of pre-Titanium cDNA. We have tried many options: genome or est,
> --noclippings, -SK:mmhr=1, -DP:ure=no, -AS:ard=no, -AL:egp=no, 
> -AS:sd=no, -CO:mr=no, -AL:mrs=55, -SK:mnr=yes, -SK:hss=1:pr=70, but 
> the result is basically the same. The best option set was  -fasta 
> -job=denovo,est,draft,454 --noclippings -SK:mmhr=1 -DP:ure=no 
> -AS:ard=no -AS:sd=no -GE:not=4 -SK:hss=1:pr=70  , but this was only 
> marginally better and doesn't achieve the desired result.
>
> The only clue comes from previous assemblies with earlier versions of Mira, 
> which produced much less redundancy, i.e. ~8000 contigs, whereas 
> V2.9.43 now produces ~18000. Mapping these onto a reference showed ~1500 
> contigs could be the same gene.  Assembling the ~1500 contigs with 
> cap3 produced ~3 contigs, one containing hundreds of contigs.
>
>
> Thanks,
>
> Richard
>


    

-- 
Alexie Papanicolaou
Richard ffrench-Constant group
CEC-Biology
Univ. Exeter in Cornwall
Penryn
TR10 9EZ
United Kingdom

        

