[mira_announce] New version 2.9.45

Dear all,

I have just uploaded 2.9.45 for 64 and 32 bit platforms:

http://www.chevreux.org/tmp/mira_2.9.45_dev_linux-gnu_i686_32.tar.bz2
http://www.chevreux.org/tmp/mira_2.9.45_dev_linux-gnu_x86_64.tar.bz2

These versions will also make it to the official download page, but I simply 
don't have time to do it this evening anymore. I compiled them on a 2.9.27 
kernel, so these versions should work again for most people who had a "kernel 
too old" error message in the last compiles.

The complete list of changes since 2.9.44 can be found below. I'll highlight 
just two things I think many people might find useful (the documentation in 
the walkthroughs is lagging behind a bit, so try them out on small projects to 
see the effects):

1) the new -GE:crhf parameter
--------------------------------
When switched on it sets tags in reads which show by colour the repeat status 
of every k-mer in every read. This is *extremely* useful in finishing as one 
immediately sees, even in unpaired data, whether a join is "safe" because it 
happens in non-repetitive areas or whether it must be made with caution 
because one is joining highly repetitive areas.

Furthermore, looking at the contigs one can now see how MIRA built the 
contigs: most contigs stop at highly repetitive areas which could not be 
safely crossed by paired or unpaired data. If they stop in non-repetitive 
areas, they mostly fade out to below average coverage (so you know that no 
gDNA fragment crossed this gap) or they probably stopped because some reads 
with bad sequences slipped through QC and wreaked havoc with the alignment 
(you'll see that in manual joins and can remove those reads).

Please make sure you include the HAF0 to HAF7 tags in the tag definition 
files of your finishing programs; see the support directory of the package 
for more information. The important tags are:

HAF2 = this k-mer is present at below average frequency (<0.5x)
HAF3 = this k-mer is present at average frequency (~0.5x to 1.5x)
HAF4 = this k-mer is present at above average frequency, but not quite enough 
to be really repetitive (~ 1.5x to 1.9x)
HAF5 = this k-mer is repetitive and probably present at 2x to 8x in the genome 
HAF6 = this k-mer is present more than 8x in the genome
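
To make the bins above concrete, here is a minimal Python sketch that maps a 
k-mer frequency ratio (observed frequency divided by the average frequency) to 
a tag name. The function name haf_tag is hypothetical and this is not MIRA's 
code. The thresholds are given only approximately above, so the exact boundary 
handling here (including putting the 1.9x to 2x gap into HAF4) is an 
assumption.

```python
def haf_tag(freq_ratio):
    """Map a k-mer frequency ratio (observed / average) to a HAF tag name.

    Illustrative only: boundaries follow the approximate ranges listed in
    the announcement; which side each boundary falls on is an assumption.
    """
    if freq_ratio < 0:
        raise ValueError("frequency ratio cannot be negative")
    if freq_ratio < 0.5:
        return "HAF2"  # below average frequency
    if freq_ratio <= 1.5:
        return "HAF3"  # average frequency
    if freq_ratio < 2.0:
        return "HAF4"  # above average, not quite repetitive
    if freq_ratio <= 8.0:
        return "HAF5"  # repetitive, probably 2x to 8x in the genome
    return "HAF6"      # present more than 8x in the genome


if __name__ == "__main__":
    for ratio in (0.3, 1.0, 1.7, 4.0, 20.0):
        print(ratio, haf_tag(ratio))
```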


2) the -SK:nrr parameter
------------------------------
This replaces the -SK:rt parameter which could be used only in a black-box 
trial-and-error method.

-SK:nrr is a switch which helps enormously in tackling even the most vile 
genomes (eukaryotes or 'funny' prokaryotes). In short: one can set the repeat 
level allowed to be searched for by SKIM. Example: setting "-SK:nrr=2" lets 
MIRA behave like Newbler: mask everything that is repetitive. Setting the 
value to 3 tells MIRA to mask everything which is probably present in more 
than 3 copies, and so on.

However, this masking does not mask entire reads, but works on a base-by-base 
basis, using k-mers of -SK:bph length. Therefore, repetitive stretches which 
are shorter than the average read length do not cause a contig stop.

Furthermore, the masking affects only the search: when reads are built into 
contigs, the sequence is still present (and hence a consensus).
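
The base-by-base idea can be sketched roughly as follows in Python. This is 
not SKIM's actual algorithm, only an illustration under stated assumptions: 
the function names, the avg_freq argument (the average k-mer frequency, 
assumed to be estimated elsewhere) and the inclusive ">= nrr" comparison are 
all hypothetical. A base is flagged when any k-mer covering it looks 
repetitive, and the read itself is left untouched.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer of length k across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def mask_repeats(read, counts, k, avg_freq, nrr):
    """Return one boolean per base: True if the base is covered by at least
    one k-mer whose estimated copy number (count / avg_freq) is >= nrr.

    Assumption: the inclusive threshold is a guess, not MIRA's definition.
    """
    masked = [False] * len(read)
    for i in range(len(read) - k + 1):
        if counts[read[i:i + k]] / avg_freq >= nrr:
            for j in range(i, i + k):
                masked[j] = True
    return masked


if __name__ == "__main__":
    # Toy data: "CCC" is given a count of 4, every other k-mer a count of 1.
    toy_counts = Counter({"AAA": 1, "AAC": 1, "ACC": 1, "CCC": 4,
                          "CCG": 1, "CGG": 1, "GGG": 1})
    flags = mask_repeats("AAACCCGGG", toy_counts, k=3, avg_freq=1.0, nrr=2)
    print("".join("x" if f else "." for f in flags))
```

Because only the flagged bases are skipped during the search, a short 
repetitive stretch inside a read still leaves unmasked flanks that can seed 
overlaps.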

So, when dealing with really bad genomes, start with -SK:nrr=2 to get a good 
first idea and then try out higher numbers. MIRA gives you some hints in the 
assembly (search for "Repeat ratio histogram" in the output) on what you can 
expect to have in your data. I still need to streamline that a bit and write 
docs, but it should be pretty obvious how to read the table.


Regards,
  Bastien


2.9.45
------
- to accommodate the Solexa paired-end naming scheme, CAF files now allow the
  "/" character in identifiers (like read names).
- -SK:rt has been renamed to -SK:nrr and its meaning has changed (please read
  the changed documentation). This gives easier control over the handling of
  repetitive sequences.
- skimming for nasty sequences (-SK:mnr) now uses the same algorithms as
  -CL:pec which are faster and better than the old ones.
- new parameter -CL:pecbph
- SKIM3 now removes some massive temporary files from the log directory
- MRMr tags renamed MNRr
- updated support files GTAGDB and consedtaglib.txt


2.9.44x7
--------
- speed up of SKIM hit reduction. Important for large eukaryotic assemblies or
  de-novo prokaryotic Solexa assemblies, reducing the time of that step from
  several hours to under one hour or even minutes.


2.9.44x6
--------
- added "solexa" as naming scheme to -GE:rns (using "/1" and "/2" to
  distinguish forward and reverse reads)
- added -GE:crhf to color reads by hash frequency. Very handy for
  finishing. Needs tags "HAF0" to "HAF7" to be defined for gap4 (or consed or
  other finishing tools)
- new log file: "miralog.usedids" which logs all reads (after clipping etc.)
  which go into contig assembly
- statistics regarding the read pool are now printed out after all operations
  that might change read lengths (read extension or clipping)


2.9.44x5
--------
- added unpadded read position to "*_info_readtaglist.txt"
- -SK:pr can now be set individually by sequencing technology


