On Mar 20, 2012, at 14:36, Isabelle Lesur wrote:
> I am trying to perform an assembly consisting of 84 086 Sanger reads with
> qualities.

Hello Isabelle,

sounds like a large bacterium you have, between 6 and 8 Mb, right? There are still projects doing this with Sanger? Wow.

> -bash-3.2$ mira --project=OCV2_prime_full_length
> --job=denovo,genome,normal,sanger --noclipping=all --notraceinfo
> SANGER_SETTINGS

So, you already pre-processed the data, physically removing known sequencing vector sequence etc. from the reads. Is that correct?

> And the assembly stopped because of Megahubs.
> Total megahubs: 8

DON'T PANIC (written in large, friendly letters :-)

Ok, there are several routes you can follow, some of them lazier than others. I'm all for lazy, so let's take the lazy one.

The number of megahubs is tiny, really, meaning that you can tell MIRA to allow for a certain amount of them. Do that via "-SK:mmhr=10" (this allows 10% of the reads to be megahubs; with 8 megahubs in 84 086 reads you have only about 0.01%). Restart the assembly and don't change any other parameter (especially not -SK:nrr).

While MIRA does its job, it's time for you to head over to this part of the manual:

  http://mira-assembler.sourceforge.net/docs/DefinitiveGuideToMIRA.html#chap_hard

You basically want to read all of chapter 12, because this will allow you to track down - if present - remaining unclipped vector sequences. It will also give you an idea of how repetitive your genome is. Have a look especially at

  12.2. How MIRA tags different repeat levels
  12.3. The readrepeats info file
  12.4. Pipeline to find worst contaminants or repeats in sequencing data

In case you need help, feel free to post the hash statistics of your bug. But this should enable you to quickly find out whether there is a contamination remaining in your data.

> I then used RepeatMasker to mask the repeats in my sequences and clipped all
> the Ns.

No! No no no no no :-) Do not use RepeatMasker on your data.
Just feed the unmasked reads to MIRA, as it handles "normal" repeats well enough. It just will not cope with (too much) sequencing vector.

Also have a look at the result of this first assembly. If it is very fragmented, chances are high that you will need to track down the contamination. If not, then the megahubs might have been a false alarm. Maybe a high-copy plasmid?

Hope that helps,
  Bastien
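
P.S. For anyone who wants to sanity-check the megahub numbers, here is a minimal sketch in Python. It only does the arithmetic: the share of reads flagged as megahubs, compared against a percentage allowance like the one passed via -SK:mmhr. The function names are made up for illustration, and the "less than or equal" comparison is an assumption about how such a threshold would behave, not a description of MIRA's internals.

```python
def megahub_ratio_percent(megahubs: int, total_reads: int) -> float:
    """Return the share of reads flagged as megahubs, in percent."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return 100.0 * megahubs / total_reads


def within_allowance(megahubs: int, total_reads: int, mmhr_percent: float) -> bool:
    """Check the megahub share against an allowed percentage.

    Assumption: the allowance is inclusive (ratio <= limit). This is a
    hypothetical helper, not MIRA's actual check.
    """
    return megahub_ratio_percent(megahubs, total_reads) <= mmhr_percent


if __name__ == "__main__":
    # Numbers from this thread: 8 megahubs among 84 086 Sanger reads.
    ratio = megahub_ratio_percent(8, 84086)
    print(f"{ratio:.4f}% of reads are megahubs")   # roughly 0.0095%
    print("within a 10% allowance:", within_allowance(8, 84086, 10.0))
```

With the numbers above, the megahub share comes out at roughly 0.0095%, far below a 10% allowance.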