[mira_talk] Re: MIRA / large contigs

  • From: Martin MOKREJŠ <mmokrejs@xxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Wed, 14 May 2014 22:07:19 +0200

Hi,

Bastien Chevreux wrote:
> On 13 May 2014, at 14:01, Sabrina Rodriguez 
> <sabrina.rodriguez@xxxxxxxxxxxxxxxx> wrote:
>> In some cases, even though I obtain large contigs (> 500 bp) according to 
>> the <project>_info_assembly.txt file in the <project>_info directory, no 
>> "LargeContigs" files are generated in the <project>_result directory.
> 
> Hello Sabrina,
> 
> I base my answer on the assumption you are using a MIRA 4.x version. If not, 
> please upgrade.
> 
> There may be a couple of reasons for what you are seeing. Let’s go through:
> 
> 1. A bug in MIRA. Possible, but I’d be a little bit surprised.
> 
> 2. Is the “large contigs” info file in the info directory populated, that is, 
> does it contain contig names? If yes, then “something” went wrong “somewhere” 
> when MIRA, after the main assembly, called itself to extract the contigs. 
> However, I think that this scenario is unlikely as you observe this only “for 
> some cases / datasets”.
> 
> 3. If the “large contigs” info file in the info directory is not populated, 
> your data set is … weird, and fools the heuristic which determines what to 
> consider a “large contig.” The heuristic works like this: during assembly, 
> MIRA looks at all contigs >= 5 kb to determine their average coverage. Then, 
> at the end of the assembly, it defines as “large” contigs all contigs >= 500 bp 
> whose coverage is at least 50% of the previously calculated average coverage 
> (33% on projects with a coverage < 40x).

Um, for EST projects I would propose a limit of only 2000 bp; 5 kb is too much. 
Did you say 5 kb is for genome assemblies only? ;-)

Can one decrease the 50% threshold (for EST projects ...)?
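
As a rough illustration of the heuristic quoted above, here is a minimal Python 
sketch. It is not MIRA's actual code: the Contig class, the function name, and 
the unweighted mean are assumptions made for illustration, and "projects with a 
coverage < 40x" is interpreted here as the average computed from the long 
contigs. The min_len and min_len_for_stats arguments stand in for what 
-MI:lcs and -MI:lcs4s are described as controlling.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Contig:
    # Hypothetical container, not a MIRA data structure.
    name: str
    length: int      # contig length in bp
    coverage: float  # average coverage of the contig


def select_large_contigs(
    contigs: List[Contig],
    min_len: int = 500,             # stands in for -MI:lcs
    min_len_for_stats: int = 5000,  # stands in for -MI:lcs4s
    low_cov_cutoff: float = 40.0,
) -> List[Contig]:
    """Sketch of the 'large contig' selection described in the thread."""
    # Step 1: only contigs at least min_len_for_stats long feed the statistics.
    stats_contigs = [c for c in contigs if c.length >= min_len_for_stats]
    if not stats_contigs:
        # No contig long enough: no average coverage, hence no large contigs.
        return []

    # Assumption: a plain unweighted mean of the per-contig coverages.
    avg_cov = sum(c.coverage for c in stats_contigs) / len(stats_contigs)

    # Step 2: 50% of the average, or 33% for low-coverage projects.
    fraction = 0.33 if avg_cov < low_cov_cutoff else 0.5
    threshold = fraction * avg_cov

    return [
        c for c in contigs
        if c.length >= min_len and c.coverage >= threshold
    ]
```

With the defaults, a data set whose longest contig is 4543 bp yields an empty 
list, which matches the first example discussed below; calling the function 
with min_len_for_stats=2000 lets that contig contribute to the statistics.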

> 
> BTW: you can change the 500 bp and 5000 bp limits via -MI:lcs and -MI:lcs4s 
> parameters, maybe you want to test lcs=500 and lcs4s=2000. But please read on.
> 
>> In one example, I have obtained contig lengths ranging from 120 bp to 4543 bp.
>> In a second example, I have obtained contig lengths between 107 and 9432 bp.
> 
> So, in the first example I can totally understand why MIRA did not extract 
> any contig as “large” contig: there was none >= 5 kbp to calculate statistics 
> on, hence no average coverage estimation could be given. In the second 
> example, however, I wonder a little bit what kind of other effect prevented 
> at least the 9 kbp contig from being regarded as large.
> 
> However, in case you were not assembling some viral data, your assembly stats 
> point to some deeper problem: projects with a max contig size of 9 kbp (let 
> alone 4 kbp) are a total catastrophe. Something feels very wrong there.

Most likely bad adapter/primer/artifact removal. ;)

Martin

-- 
Martin Mokrejs, PhD.
454 / IonTorrent / Evrogen MINT / Clontech SMART adapter/artifact removal (... 
too many protocols to name here)
http://www.bioinformatics.cz/software/supported-protocols/
