[mira_talk] MIRA 3.1.15: test driving for interested parties

From: Bastien Chevreux <bach@xxxxxxxxxxxx>
To: mira_talk@xxxxxxxxxxxxx
Date: Wed, 2 Jun 2010 20:58:55 +0200

Dear all,

3.0.5 contains a nasty bug (the "extendADS" problem) which some people are 
running into and which stops an assembly cold. While a workaround is simple 
(turning off -DP:ure), it robs the de-novo assembly of some of its power when 
Sanger sequences are present. I'm not yet ready to release a new full version, 
as I recently made some important changes to improve speed when handling 
really large read numbers.

The current head of the development branch (3.1.15) passes my usual tests for 
de-novo assemblies and I also have worked on 4 mapping projects with it, so I 
feel that it should be OK from an algorithm point of view.

However, the documentation is not up-to-date (I'm converting it to DocBook 
right now and reworking it a bit in the process) and I still want to polish a 
few things.

But if anyone is interested in test driving the current head and giving 
feedback, please feel free to do so:

  http://www.chevreux.org/tmp/mira_3.1.15_dev_linux-gnu_x86_64_static.tar.bz2

Note that docs are missing completely in this archive; please refer to the 
(rather terse) change log below to learn about new features / parameters 
of MIRA.

Regards,
  Bastien

3.1.15
------
- new parameter -CO:emeas1clpec. Automatically sets emea to 1 if proposed end
  clipping is used (ends will be "clean"). Improves recognition of
  misassemblies in cases where only the outer fringes of reads differ.
- change in template handling: to be lenient, MIRA internally added/subtracted
  10% of the given insertsizes (or at least 1kb). Not anymore! This would give
  problems with very small libraries (Solexa) or when the given values were
  "lenient enough" and were made "too lenient" by this and subsequently
  flagged in different post-processing tools.
- change in handling template insert size info from XML: previously, MIRA set
  stdev to a minimum of 500 bases and used 2*stdev to calculate minimum and
  maximum insert sizes. The 500 bases minimum rule has been removed, and MIRA
  now uses 3*stdev.
- new parameter: -GE:tpbd to give template partner build direction on the
  command line. Defines whether the template partner of a read (in a
  read-pair) must have the same direction (1) or reverse direction (-1) in a
  contig.
- change: when --job=...,454 is used, the default minimum overlap is not 40
  anymore, but 20. 40 was too conservative: overlaps at weak contig joins were
  discarded too often.
- improved graph reduction algorithm: some more small overlaps at low coverage
  sites are taken to Smith-Waterman. This helps to find some more weak contig
  joins.
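
To illustrate the change in XML insert size handling described above, here is
a minimal Python sketch comparing the old and new rules. It is not MIRA code:
the function name, the symmetric window around the mean, and the clamping of
the lower bound at 0 are assumptions made for illustration only.

```python
def insert_size_window(mean, stdev, legacy=False):
    """Return (min_size, max_size) derived from a library's mean and stdev.

    Hypothetical helper, not part of MIRA. Assumes the window is
    symmetric around the mean and that sizes cannot be negative.
    """
    if legacy:
        # Old rule: stdev raised to at least 500 bases, then 2*stdev.
        stdev = max(stdev, 500)
        factor = 2
    else:
        # New rule (3.1.15): stdev used as given, then 3*stdev.
        factor = 3
    low = max(0, mean - factor * stdev)  # clamping at 0 is an assumption
    high = mean + factor * stdev
    return low, high


if __name__ == "__main__":
    # Example libraries with made-up values.
    libraries = [("small Solexa library", 300, 30),
                 ("Sanger library", 3000, 400)]
    for name, mean, stdev in libraries:
        old = insert_size_window(mean, stdev, legacy=True)
        new = insert_size_window(mean, stdev)
        print(f"{name}: old rule {old}, new rule {new}")
```

With the made-up small library, the old 500 base minimum dominates the window
completely, which is the kind of problem the change is meant to avoid.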


3.1.14
------
- speed up of routine to find and mark IUPAC bases and unsure bases (IUPc &
  UNSc). Very noticeable when using annotated genomes as mapping reference.
- bugfix: IUPc & UNSc were not searched for anymore (introduced in 3.1.12 with
  the -CO:asir bugfix)
- re-activated '-d' in convert_project
- adjusted miramem estimator for mapping of Solexa reads


3.1.13
------
- improvements for large assemblies with millions of reads where setting up
  data for new contigs during build is sped up. Especially noticeable in EST
  assemblies, but also genome assemblies with Solexa.


3.1.12
------
- new option to speed up assemblies with millions of reads: -AS:mrpc controls
  the minimum number of reads a contig must potentially have before it is
  really assembled. This prevents all the small junk contigs with very low
  numbers of reads in, e.g., Solexa sequencing from being assembled and can
  speed up the assembly by days.
- MIRA now uses the tcmalloc library from Google perftools if available. It is
  highly recommended as it optimises memory allocation and saves a lot of
  memory on multiple pass assemblies. E.g., memory usage for 810k 454 FLX
  reads, 45x coverage, 5 pass genome de-novo accurate:
              3.0.5    8272988 kB
             3.1.11    8273012 kB
             3.1.12    9492956 kB
     3.1.12tcmalloc    6758916 kB
- change: adapted some estimators in miramem, hopefully giving better
  estimates for RAM usage during MIRA assemblies.
- bugfix: array iterator overrun in contig building which probably had no
  noticeable effect. If it had any, then perhaps the rejection of weak matches
  that would otherwise have been barely accepted.
- bugfix: -CO:asir sometimes set repeat markers instead of SNP markers.
- bugfix: mira could try to check physical presence of SCF data even for
  non-Sanger reads
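
To put the memory figures from the tcmalloc entry above into proportion, here
is a small Python sketch that computes the relative saving of the tcmalloc
build. The numbers are copied from the table above; the helper name is made
up for illustration.

```python
# Memory usage in kB, copied from the 3.1.12 change log entry above.
usage_kb = {
    "3.0.5": 8272988,
    "3.1.11": 8273012,
    "3.1.12": 9492956,
    "3.1.12tcmalloc": 6758916,
}


def relative_saving(baseline_kb, candidate_kb):
    """Fraction of memory saved by candidate relative to baseline."""
    return (baseline_kb - candidate_kb) / baseline_kb


if __name__ == "__main__":
    tcmalloc_kb = usage_kb["3.1.12tcmalloc"]
    for version, kb in usage_kb.items():
        if version == "3.1.12tcmalloc":
            continue
        saving = relative_saving(kb, tcmalloc_kb)
        print(f"tcmalloc build vs {version}: {saving:.1%} less memory")
```

For this one data set the tcmalloc build needs a bit less than a fifth less
memory than 3.0.5 and well over a quarter less than plain 3.1.12.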


3.1.11
------
- optimisation: memory pre-allocation routines for read growth help to reduce
  memory fragmentation and hence lower the memory requirement overall.
- bugfix: -CO:mr=no was not fully respected. While not used during contig
  building, possible repeats were always marked in result files and then
  transferred to following iterations.
- bugfix extendADS(): acquireSequences() could throw due to 0 length of a
  sequence

-- 
You have received this mail because you are subscribed to the mira_talk mailing 
list. For information on how to subscribe or unsubscribe, please visit 
http://www.chevreux.org/mira_mailinglists.html