[mira_talk] Re: megahubs in MIRA

  • From: Isabelle Lesur <isabelle.lesur.kupin@xxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Thu, 22 Mar 2012 14:12:13 +0100

Thanks a lot Bastien!!!

It looks like it is working just fine now!


Isabelle

2012/3/20 Bastien Chevreux <bach@xxxxxxxxxxxx>

> On Mar 20, 2012, at 14:36 , Isabelle Lesur wrote:
>
> I am trying to perform an assembly consisting of 84 086 Sanger reads with 
> qualities.
>
>
> Hello Isabelle,
>
> sounds like a large bacterium you have, between 6 and 8 MB, right? There
> are still projects doing this with Sanger? Wow.
>
> -bash-3.2$ mira --project=OCV2_prime_full_length 
> --job=denovo,genome,normal,sanger --noclipping=all --notraceinfo 
> SANGER_SETTINGS
>
>
> So, you already pre-processed the data, physically removing known
> sequencing vector sequence etc. from the reads. Is that correct?
>
> And the assembly stopped because of megahubs.
> Total megahubs: 8
>
>
> DON'T PANIC (written in large, friendly letters :-) Ok, there are several
> routes you can follow, some of them more lazy than others. I'm all for
> lazy, so let's follow this route.
>
> The number of megahubs is tiny, really, meaning that you can tell MIRA to
> allow for a certain amount of them. Do that via "-SK:mmhr=10" (allows 10%
> of the reads to be megahubs; with 8 megahubs in 84 086 reads you have only
> about 0.01%). Restart the assembly and don't change any other parameter
> (especially not -SK:nrr).
>
> While MIRA does its job, it's time for you to head over to this part of
> the manual:
>
>
> http://mira-assembler.sourceforge.net/docs/DefinitiveGuideToMIRA.html#chap_hard
>
> You basically want to read all of chapter 12, because this will allow you
> to track down - if present - remaining unclipped vector sequences. It will
> also give you an idea on how repetitive your genome is. Have especially a
> look at
>
> 12.2. How MIRA tags different repeat levels
> <http://mira-assembler.sourceforge.net/docs/DefinitiveGuideToMIRA.html#sect_hard_how_MIRA_tags_different_repeat_levels>
>
> 12.3. The readrepeats info file
> <http://mira-assembler.sourceforge.net/docs/DefinitiveGuideToMIRA.html#sect_hard_the_readrepeats_info_file>
>
> 12.4. Pipeline to find worst contaminants or repeats in sequencing data
> <http://mira-assembler.sourceforge.net/docs/DefinitiveGuideToMIRA.html#sect_hard_pipeline_to_find_worst_contaminants_or_repeats_in_sequencing_data>
>
> In case you need help, feel free to post the hash statistics of your bug.
>
> But this should enable you to quickly find out whether there is a
> contamination remaining in your data.
>
> I then used RepeatMasker to mask the repeats in my sequences and clipped all
> the Ns.
>
>
> No! No no no no no :-) Do not use RepeatMasker on your data. Just feed it
> to MIRA as it handles "normal" repeats well enough. It just will not cope
> with (too much) sequencing vector.
>
> Also have a look at the result of this first assembly. If it is very
> fragmented, chances are high you will need to find out the contamination.
> If not, then the megahubs might have been a false alarm. Maybe a high copy
> plasmid?
>
> Hope that helps,
>   Bastien
>
>
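
As a rough illustration of the advice quoted above, the sketch below computes the megahub share from the two counts given in the thread (8 megahubs, 84 086 reads) and prints the original command line with the one extra switch, -SK:mmhr=10. The placement of the switch before SANGER_SETTINGS is an assumption on my part, as is the use of awk for the arithmetic; the command is only echoed, not run.

```shell
#!/bin/sh
# Counts taken from the thread: 84 086 reads, "Total megahubs: 8".
reads=84086
megahubs=8

# Share of reads flagged as megahubs, in percent (LC_ALL=C keeps "." as decimal mark).
ratio=$(LC_ALL=C awk -v m="$megahubs" -v r="$reads" 'BEGIN { printf "%.4f", 100 * m / r }')
echo "megahub ratio: ${ratio}%"

# Limit suggested in the thread: allow up to 10% of the reads to be megahubs.
mmhr=10

# Command line as quoted in the thread, plus the one new switch.
# Assumption: -SK:mmhr is accepted before the SANGER_SETTINGS marker.
cmd="mira --project=OCV2_prime_full_length --job=denovo,genome,normal,sanger --noclipping=all --notraceinfo -SK:mmhr=${mmhr} SANGER_SETTINGS"

# Only print the command; run it by hand once it looks right.
echo "$cmd"
```

The computed share is far below the 10% limit, which is why raising the limit rather than changing the data is the lazy route here.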
