[mira_talk] Re: number of nasty repeats option inconsistency since release 2.9.43x1

On Tuesday, 23 June 2009, Jorge.DUARTE@xxxxxxxxxxxx wrote:
> i just wanted to point out that in release 43, this option was -SK:rt,
> and at least since release 45, it is called -SK:nrr.

Hi Jorge,

it's not just a renaming that has happened, the meaning has also changed.

> Problem is, when trying to assemble some eukaryote sequences,
> i still get a reference to 'rt' naming in the log for versions 45 and 46:

My bad, forgot to change that text. Will be fixed.

> Basically, the proposed solution for organisms with complex repeats
> doesn't seem to work anymore since release 45 !

It should work as intended (at least it does for me), although you will need to 
use lower values for -SK:nrr than you did for -SK:rt.

Here's the background: -SK:rt was pretty much like an axe. If you set 
-SK:rt=8, then the top 8% of all k-mers were simply removed from all further 
comparison. This could lead, and in the past has led, to situations where way 
too much was cut away. E.g., if in 1 million reads "only" 20000 contained 
highly repetitive k-mers (2%), then the 8% cutback from -SK:rt was axing 6% 
more k-mers, and that hit "innocent bystanders".
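
To make the "innocent bystanders" effect concrete, here is a toy sketch in 
Python. It is not MIRA code: the function name mask_top_percent, the made-up 
k-mer counts, and the reading of "top 8%" as "the most frequent 8% of distinct 
k-mers" are all assumptions for illustration only.

```python
import math

def mask_top_percent(kmer_counts, percent):
    """Toy model of a flat cut: drop the most frequent `percent` % of k-mers."""
    # Rank distinct k-mers from most to least frequent.
    ranked = sorted(kmer_counts, key=kmer_counts.get, reverse=True)
    n_removed = math.ceil(len(ranked) * percent / 100)
    return set(ranked[:n_removed])

# Made-up data: 100 distinct k-mers, only 2 of them truly repetitive.
kmer_counts = {f"repeat{i}": 500 for i in range(2)}
kmer_counts.update({f"normal{i}": 20 + (i % 3) for i in range(98)})

removed = mask_top_percent(kmer_counts, 8)
bystanders = {k for k in removed if k.startswith("normal")}
print(len(removed), "k-mers removed,", len(bystanders), "of them not repeats")
```

With these numbers the flat 8% cut removes 8 k-mers although only 2 are 
repetitive, so 6 ordinary k-mers get axed along with them.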

The new algorithm with -SK:nrr (e.g. -SK:nrr=4) behaves differently. It 
estimates the overall coverage of a project and then basically says: every 
k-mer that is present more than (in this example) 4 times as often as the 
estimated coverage will be removed from comparison. This should allow for much 
more fine-grained masking of nasty repeats, and it lets the user know (instead 
of guessing) what has been removed from the SKIM comparison step.
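
The coverage-relative rule can be sketched the same way. Again, this is only 
a toy model and not MIRA code: the function name mask_by_coverage and the 
made-up counts are hypothetical, and using the median k-mer count as the 
coverage estimate is an assumption, since how MIRA derives its estimate is not 
described here.

```python
from statistics import median

def mask_by_coverage(kmer_counts, nrr, coverage=None):
    """Toy model: drop k-mers seen more than nrr times the estimated coverage."""
    if coverage is None:
        # Assumption: take the median k-mer count as the coverage estimate.
        coverage = median(kmer_counts.values())
    threshold = nrr * coverage
    return {kmer for kmer, count in kmer_counts.items() if count > threshold}

# Made-up data: 100 distinct k-mers, only 2 of them truly repetitive.
kmer_counts = {f"repeat{i}": 500 for i in range(2)}
kmer_counts.update({f"normal{i}": 20 + (i % 3) for i in range(98)})

removed = mask_by_coverage(kmer_counts, nrr=4)
print(sorted(removed))
```

Here the threshold depends on the data rather than on a fixed fraction, so 
only the two highly repetitive k-mers are masked and the k-mers at normal 
coverage are left alone.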

Please also have a look at the new help file on assembling 'hard' genomes which 
is present in the downloads or on http://chevreux.org/mira_manuals.html

> I tried the same data with release 43, and everything goes fine when
> using -SK:mnr=yes:rt=8

I'd suggest you start with -SK:nrr=4 and see whether this works.

As I have only limited experience with eukaryotes and not that many data sets 
available: could you please send me the log or an excerpt containing the skim 
hash statistics? I'd adapt the manuals accordingly.

And I'm curious to know whether what I thought up really is 'better', as I 
think it is. Feedback appreciated.

Regards,
  Bastien

