Hi,

I'm a bit short on time to analyse this (this run is somewhat of a sidetrack at the moment). Anyway: I think the megahub is actually an artefact; the single read in the megahub logfile is a long (15 kb) fake read comprising a complete PCR product. (The complete run is 300000 GS20, 16000 Sanger and ~20 PCR fake reads.) I don't know if this is a problem; I simply allowed for the megahub (mmhr=1) and the run seems to be ok. (Using the PCR fake reads as reference contigs in a mapping assembly would be possible too, but that requires some manual joining in the end, and I want to avoid that as far as possible.)

Thanks for a great software.

B

On Wed, 24 Jun 2009 18:43:04 +0200 Bastien Chevreux <bach@xxxxxxxxxxxx> wrote:

> On Wednesday 24 June 2009 Björn Nystedt wrote:
> > running MIRA (V2.9.45x1) I get a report of 1 megahub in my data (it might be
> > a problem with my vector clipping and I am still investigating that).
> > Anyway, I had trouble finding info in the manuals about the log file:
> > *posmatch_megahubs_preassembly.0.lst
> > In my case, I have a single read name in this file. Does anyone know how to
> > interpret that? Björn Nystedt
>
> Hi Björn,
>
> basically, having one read as 'megahub' means that you seem to have a number
> of reads which are quite repetitive and one of them (by chance) gets over the
> threshold of 'being a megahub'.
>
> Could you please have a look at the new manual in the *46 distribution, which
> is a first draft on how to assemble 'nasty' data. There's a section which
> deals with how to find out which parts are causing problems (it's now also
> available online: Finding out repetitive parts in reads
> http://chevreux.org/uploads/media/mirav2946_hard.html#section_6).
>
> Basically, I'd propose you have a look at the hash statistics of your project
> (described in help file).
> Then, restart the assembly with -SK:mnr=yes and -SK:nrr=XXX. For XXX,
> I'd suggest a rather high number that you determine from the hash statistics,
> where things 'look funny'. During that run a file will be created that
> contains both the read names and the masked parts of the reads, so you will
> be able to quickly find out what is causing havoc in your data. Don't go too
> low with -SK:nrr, as you might then also find legitimate repetitive sequence
> (rRNAs come to mind in bacteria) and not only the contaminants. Guessing a
> bit, I'd say that choosing nrr=20 is a good first start.
>
> I would be interested to see the hash statistics of your project, could you
> please send it to me to have a look at? Thanks.
>
> Regards,
> Bastien
>
> --
> You have received this mail because you are subscribed to the mira_talk
> mailing list. For information on how to subscribe or unsubscribe, please
> visit http://www.chevreux.org/mira_mailinglists.html

--
====================================
Björn Nystedt (Sällström)
PhD Student
Molecular Evolution
EBC, Uppsala University
Norbyv. 18C, 752 36 Uppsala
Sweden
phone: +46 (0)18-471 45 88
email: Bjorn.Nystedt@xxxxxxxxx
====================================