Thanks a lot Bastien!!!
It looks like it is working just fine now!

Isabelle

2012/3/20 Bastien Chevreux <bach@xxxxxxxxxxxx>

> On Mar 20, 2012, at 14:36, Isabelle Lesur wrote:
>
> > I am trying to perform an assembly consisting of 84 086 Sanger reads
> > with qualities.
>
> Hello Isabelle,
>
> sounds like a large bacterium you have, between 6 and 8 Mb, right? There
> are still projects doing this with Sanger? Wow.
>
> > -bash-3.2$ mira --project=OCV2_prime_full_length
> > --job=denovo,genome,normal,sanger --noclipping=all --notraceinfo
> > SANGER_SETTINGS
>
> So, you already pre-processed the data, physically removing known
> sequencing vector sequence etc. from the reads. Is that correct?
>
> > And the assembly stopped because of Megahubs.
> > Total megahubs: 8
>
> DON'T PANIC (written in large, friendly letters :-) Ok, there are several
> routes you can follow, some of them more lazy than others. I'm all for
> lazy, so let's follow this route.
>
> The number of megahubs is tiny, really, meaning that you can tell MIRA to
> allow for a certain amount of them. Do that via "-SK:mmhr=10" (allows 10%
> of the reads to be megahubs, you have only about 0.01%). Restart the
> assembly, don't change any other parameter (especially not -SK:nrr).
>
> While MIRA does its job, it's time for you to head over to this part of
> the manual:
>
> http://mira-assembler.sourceforge.net/docs/DefinitiveGuideToMIRA.html#chap_hard
>
> You basically want to read all of chapter 12, because this will allow you
> to track down - if present - remaining unclipped vector sequences. It will
> also give you an idea of how repetitive your genome is. Have a look
> especially at
>
> 12.2. How MIRA tags different repeat levels
> <http://mira-assembler.sourceforge.net/docs/DefinitiveGuideToMIRA.html#sect_hard_how_MIRA_tags_different_repeat_levels>
> 12.3. The readrepeats info file
> <http://mira-assembler.sourceforge.net/docs/DefinitiveGuideToMIRA.html#sect_hard_the_readrepeats_info_file>
> 12.4. Pipeline to find worst contaminants or repeats in sequencing data
> <http://mira-assembler.sourceforge.net/docs/DefinitiveGuideToMIRA.html#sect_hard_pipeline_to_find_worst_contaminants_or_repeats_in_sequencing_data>
>
> In case you need help, feel free to post the hash statistics of your bug.
>
> But this should enable you to quickly find out whether there is a
> contamination remaining in your data.
>
> > I then used RepeatMasker to mask the repeats in my sequences and clipped
> > all the Ns.
>
> No! No no no no no :-) Do not use RepeatMasker on your data. Just feed it
> to MIRA, as it handles "normal" repeats well enough. It just will not cope
> with (too much) sequencing vector.
>
> Also have a look at the result of this first assembly. If it is very
> fragmented, chances are high you will need to find out the contamination.
> If not, then the megahubs might have been a false alarm. Maybe a high copy
> plasmid?
>
> Hope that helps,
> Bastien
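
For anyone finding this thread later, here is a small shell sketch of the fix described above. It only does the arithmetic (8 megahubs out of 84 086 reads, compared with the 10% passed to -SK:mmhr) and prints the restart command rather than running it. The read count, megahub count, original command line and the -SK:mmhr=10 switch all come from the thread. The variable names are made up for illustration, and putting -SK:mmhr ahead of SANGER_SETTINGS is an assumption (that it is a general rather than a technology-specific parameter), so please check the manual for your MIRA version.

```shell
#!/bin/sh
# Numbers taken from the thread: 84 086 Sanger reads, 8 megahubs reported.
reads=84086
megahubs=8
max_ratio=10   # percent; the value given to -SK:mmhr in the reply

# Share of reads flagged as megahubs, in percent, to four decimals.
# LC_ALL=C keeps the decimal separator a dot regardless of locale.
ratio=$(LC_ALL=C awk -v m="$megahubs" -v r="$reads" \
    'BEGIN { printf "%.4f", 100 * m / r }')
echo "megahub ratio: ${ratio}% (allowed: ${max_ratio}%)"

# Original command from the thread plus the one new switch; nothing else
# is changed. ASSUMPTION: -SK:mmhr is placed before SANGER_SETTINGS.
cmd="mira --project=OCV2_prime_full_length --job=denovo,genome,normal,sanger --noclipping=all --notraceinfo -SK:mmhr=${max_ratio} SANGER_SETTINGS"

# Print the command instead of executing it, so this runs without MIRA.
echo "$cmd"
```

The computed ratio is far below the allowed 10%, which is why raising the limit is a reasonable first step before looking for contamination.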