[mira_talk] Re: megahubs in MIRA

  • From: Bastien Chevreux <bach@xxxxxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Tue, 20 Mar 2012 23:07:56 +0100

On Mar 20, 2012, at 14:36 , Isabelle Lesur wrote:
> I am trying to perform an assembly consisting of 84 086 Sanger reads with 
> qualities.

Hello Isabelle,

sounds like you have a large bacterium, between 6 and 8 Mb, right? There are 
still projects doing this with Sanger? Wow.

> -bash-3.2$ mira --project=OCV2_prime_full_length 
> --job=denovo,genome,normal,sanger --noclipping=all --notraceinfo 
> SANGER_SETTINGS

So, you already pre-processed the data, physically removing known sequencing 
vector sequence etc. from the reads. Is that correct?

> And the assembly stopped because of Megahubs.
> Total megahubs: 8

DON'T PANIC (written in large, friendly letters :-) Ok, there are several 
routes you can follow, some of them lazier than others. I'm all for lazy, so 
let's take the lazy route.

The number of megahubs is tiny, really, meaning that you can tell MIRA to allow 
for a certain amount of them. Do that via "-SK:mmhr=10" (this allows 10% of the 
reads to be megahubs; with 8 out of 84 086 reads you have only about 0.01%). 
Restart the assembly and don't change any other parameter (especially not 
-SK:nrr).
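
To make the arithmetic behind that threshold explicit, here is a minimal Python 
sketch. The function names are hypothetical and are not part of MIRA; the 
sketch also assumes that the limit is a plain "percentage of reads" comparison, 
which is how the parameter is described above, not a statement about MIRA's 
internals.

```python
def megahub_ratio_percent(megahubs: int, total_reads: int) -> float:
    """Return the share of reads flagged as megahubs, in percent."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return 100.0 * megahubs / total_reads


def within_limit(megahubs: int, total_reads: int, max_ratio_percent: float) -> bool:
    """Assumed check: is the megahub share at or below the allowed percentage?"""
    return megahub_ratio_percent(megahubs, total_reads) <= max_ratio_percent


if __name__ == "__main__":
    # Numbers from this thread: 8 megahubs among 84 086 Sanger reads,
    # checked against the 10% allowed by -SK:mmhr=10.
    ratio = megahub_ratio_percent(8, 84086)
    print(f"megahub share: {ratio:.4f}% of reads")
    print("below the 10% limit:", within_limit(8, 84086, 10.0))
```

With the numbers from this project the share comes out at roughly 0.01% of the 
reads, far below the 10% allowed.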

While MIRA does its job, it's time for you to head over to this part of the 
manual:

  http://mira-assembler.sourceforge.net/docs/DefinitiveGuideToMIRA.html#chap_hard

You basically want to read all of chapter 12, because this will allow you to 
track down - if present - remaining unclipped vector sequences. It will also 
give you an idea of how repetitive your genome is. Have a look especially at:

  12.2. How MIRA tags different repeat levels
  12.3. The readrepeats info file
  12.4. Pipeline to find worst contaminants or repeats in sequencing data

This should enable you to quickly find out whether there is any contamination 
remaining in your data. In case you need help, feel free to post the hash 
statistics of your bug.

> I then used RepeatMasker to mask the repeats in my sequences and clipped all 
> the Ns.

No! No no no no no :-) Do not use RepeatMasker on your data. Just feed it to 
MIRA as it is: MIRA handles "normal" repeats well enough. What it will not cope 
with is (too much) sequencing vector.

Also have a look at the result of this first assembly. If it is very 
fragmented, chances are high that you will need to track down the 
contamination. If not, then the megahubs might have been a false alarm. Maybe a 
high copy plasmid?

Hope that helps,
  Bastien
