[mira_talk] Re: number of nasty repeats option inconsistancy since release 2.9.43x1
- From: Bastien Chevreux <bach@xxxxxxxxxxxx>
- To: mira_talk@xxxxxxxxxxxxx
- Date: Tue, 23 Jun 2009 20:13:40 +0200
On Dienstag 23 Juni 2009 Jorge.DUARTE@xxxxxxxxxxxx wrote:
> i just wanted to point out that in release 43, this option was -SK:rt,
> and at least since release 45, it is called -SK:nrr.
Hi Jorge,
it's not just a renaming that has happened, the meaning has also changed.
> Problem is, when trying to assemble some eukaryote sequences,
> i still get a reference to 'rt' naming in the log for versions 45 and 46:
My bad, forgot to change that text. Will be fixed.
> Basically, the proposed solution for organisms with complex repeats
> doesn't seem to work anymore since release 45 !
It should work as intended (at least it does for me), although you will need to
use lower values for -SK:nrr than you did for -SK:rt.
Here's the background: -SK:rt was pretty much like an axe: if you set
-SK:rt=8, then the top 8% of all k-mers were simply removed from all further
comparison. This could lead, and in the past has led, to situations where way too
much was cut away. E.g., if in 1 million reads "only" 20000 contained highly
repetitive k-mers (2%), then the 8% cutback from -SK:rt was axing 6% more
k-mers, and that hit "innocent bystanders".
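
As a rough illustration of the percentage cut described above, here is a toy sketch in Python. It is not MIRA's actual code: the function name, the ranking by raw count, and the made-up k-mer table are all assumptions chosen to mirror the 2% vs. 8% example.

```python
def mask_top_percent(counts, percent):
    """Return the k-mers in the top `percent` % when ranked by frequency.

    Hypothetical helper for illustration only; it cuts a fixed share of
    the distinct k-mers regardless of how repetitive they really are.
    """
    # Rank k-mers from most to least frequent (name breaks ties).
    ranked = sorted(counts, key=lambda km: (-counts[km], km))
    n_cut = int(len(ranked) * percent / 100)
    return set(ranked[:n_cut])


# Toy table: 100 distinct k-mers, of which only 2 (2%) are real repeats.
counts = {"rep0": 500, "rep1": 500}
counts.update({f"uniq{i:02d}": 10 for i in range(98)})

masked = mask_top_percent(counts, 8)
innocent = masked - {"rep0", "rep1"}
print(len(masked), "k-mers masked,", len(innocent), "of them not repetitive")
```

With an 8% cut on this table, the two real repeats are masked together with six k-mers that occur at ordinary frequency, which is the "innocent bystander" effect.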
The new algorithm with -SK:nrr (e.g. -SK:nrr=4) behaves differently. It guesses
the overall coverage of a project and then basically says: every k-mer that is
present more than (in this example) 4 times as often as the estimated coverage
will be removed from comparison. This should allow for much more fine-grained
masking of nasty repeats and lets the user know (instead of guessing)
what has been removed from the SKIM comparison step.
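
The coverage-based rule can be sketched the same way. Again this is only a toy model and not MIRA's implementation: how MIRA guesses the coverage is not described here, so using the median k-mer count as the estimate is purely an assumption, as are the function name and the sample data.

```python
from statistics import median


def mask_by_coverage_ratio(counts, nrr, coverage=None):
    """Return the k-mers seen more than `nrr` times the estimated coverage.

    Hypothetical helper for illustration only. If no coverage estimate is
    given, the median k-mer count is used as a stand-in (an assumption).
    """
    if coverage is None:
        coverage = median(counts.values())
    threshold = nrr * coverage
    return {km for km, c in counts.items() if c > threshold}


# Toy table: 98 k-mers at ordinary frequency, 2 highly repetitive ones.
kmer_counts = {"rep0": 500, "rep1": 500}
kmer_counts.update({f"uniq{i:02d}": 10 for i in range(98)})

repeats = mask_by_coverage_ratio(kmer_counts, nrr=4)
print(sorted(repeats))
```

Here the estimated coverage is 10, so with nrr=4 only k-mers occurring more than 40 times are masked. The cut-off follows from the data rather than from a fixed share of k-mers, so nothing at ordinary frequency is removed.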
Please also have a look at the new help file on assembling 'hard' genomes which
is present in the downloads or on http://chevreux.org/mira_manuals.html
> I tried with the same data with release 43, and everything goes fine when
> using -SK:mnr=yes:rt=8
I'd suggest you start with -SK:nrr=4 and see whether this works.
As I have only limited experience with eukaryotes and not that many data sets
available: could you please send me the log or an excerpt containing the skim
hash statistics? I'd adapt the manuals accordingly.
And I'm curious to know whether what I thought up really is as 'better' as I think
it is. Feedback appreciated.
Regards,
Bastien