[mira_talk] Re: big file in the log directory

  • From: Bastien Chevreux <bach@xxxxxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Thu, 24 Mar 2011 01:23:42 +0100

On Wednesday 23 March 2011 22:29:09 Stephanie Pearl wrote:
> So, it also has come to my attention that my "strains" aren't closely
> related enough to be assembled in the manner in which I am trying to --
> they are actually closely related species, ~4000 years diverged. So I
> guess the messy command line is now a moot point.

There's worse: 4 M, 40 M and 400 M years come to mind. If you told me 4 
billion years, I'd really start to worry :-)

> The goal for my project is to assemble 3 different closely related species
> (1 of which has already been assembled by someone else -- this is the one
> with the Sanger reads) for further analysis. I had thought that the mixed
> assembly would use information from each set of ESTs and produce 3
> differently assembled outputs for each set of reads, but perhaps that's not
> the case?

D'oh. Hit me, I'm dumb and can't read. I overlooked the "est" thing in your 
command line and went on thinking it to be a genome assembly. Oh well.

In that case, using "-SB:lsd" is even more important. Keep it in mind if you 
assemble all reads together.

If you put all reads together in one assembly, MIRA will not make three 
separate contigs out of that, but will mix the reads together whenever 
possible. When it has strain data, it will also mark SNPs in the contigs 
(otherwise it will build different contigs even if there is only a 
one-nucleotide difference).

Please have a look at 

  http://mira-assembler.sourceforge.net/docs/DefinitiveGuideToMIRA.html#sect1_est_difference_assembly_clustering
and
  http://mira-assembler.sourceforge.net/docs/DefinitiveGuideToMIRA.html#sect3_input:_two_strains_454_with_xml_ancillary_data_polya_already_removed

and the other sections in that chapter to get a feeling for how this might 
affect your data.

> Would you just recommend a de novo assembly for each of the three sets of 
> reads?

Depends on what you are looking for:
- a maximum of clean transcripts? Then assemble each data set on its own with 
"mira --job=est".
- a set of contigs, all strains mixed together, with SNPs marked? Then assemble 
all sets together with "mira --job=est ... -SB:lsd=yes" and strain data.
- a set of contigs (one for each strain) where there are known differences 
between the strains, as well as a light, clustering-like assembly of contigs 
with SNPs? Use the miraSearchESTSNPs pipeline.
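
For the second option you need a strain data file that tells MIRA which read 
belongs to which strain. A minimal sketch of how one could build such a table 
from the read names in per-strain FASTA files follows. It assumes the file is 
plain text with one "readname strainname" pair per line and is named after 
the project (something like <project>_straindata_in.txt); please check both 
assumptions against the guide for your MIRA version. The strain names and 
read names in the demo are made up.

```python
def read_names(fasta_lines):
    """Yield the read name (first token after '>') of every FASTA header line."""
    for line in fasta_lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:  # skip malformed headers that carry no name
                yield tokens[0]


def strain_table(strain_to_fasta):
    """Return 'readname<TAB>strainname' lines, one per read.

    strain_to_fasta maps a strain name to an iterable of FASTA lines
    (an open file handle works as well as a list of strings).
    The two-column layout is an assumption about the strain data format.
    """
    rows = []
    for strain, fasta_lines in strain_to_fasta.items():
        for name in read_names(fasta_lines):
            rows.append(f"{name}\t{strain}")
    return "\n".join(rows) + "\n"


if __name__ == "__main__":
    # Hypothetical in-memory data standing in for three per-species FASTA files.
    demo = {
        "speciesA": [">readA1 some description", "ACGTACGT", ">readA2", "GGCCGGCC"],
        "speciesB": [">readB1", "TTAATTAA"],
        "speciesC": [">readC1", "CCGGAATT"],
    }
    print(strain_table(demo), end="")
```

With real data you would pass open file handles instead of lists and write 
the returned text to the strain data file before starting the assembly.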

B.
