[mira_announce] MIRA V3.4.0 available

  • From: Bastien Chevreux <bach@xxxxxxxxxxxx>
  • To: mira_announce@xxxxxxxxxxxxx
  • Date: Sun, 21 Aug 2011 19:18:16 +0200

Dear all,

MIRA V3.4.0 has been released at SourceForge.

  http://sourceforge.net/projects/mira-assembler/

The highlights of this version in marketing-compatible keywords (and
exclamation marks) on one slide:

==========================
|- faster!               |  
|- better!               |
|- IonTorrent!           |
|- PacBio!               |
|- more utilities!       |
|- better documentation! |
|- update is for free!   |
==========================

For all others, here's the somewhat more verbose version: development of the
3.4 series of MIRA concentrated on making assemblies with 30m to 100m reads
more "liveable", i.e., on reducing the memory and disk footprint of MIRA as
well as improving run times. At the same time, an updated assembly strategy
for both genome and EST / RNASeq data was devised to reduce the influence of
chimeras and intronic data on the assembly. MIRA is now also pretty smart in
handling de-novo Solexa projects with "low coverage" (<30x) as well as "high
coverage" (>= 100x).
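
To make the coverage figures concrete, here is a minimal sketch of how the
average coverage of a project can be estimated before starting an assembly.
It uses the standard rule of thumb (number of reads times mean read length,
divided by genome size). The function names are hypothetical, the thresholds
are simply the ones quoted above, and none of this is meant to describe how
MIRA itself decides between strategies.

```python
def estimated_coverage(num_reads, mean_read_len, genome_size):
    """Average coverage = total sequenced bases / genome size."""
    return num_reads * mean_read_len / genome_size


def coverage_class(coverage):
    """Label a coverage value using the thresholds from this announcement.

    Assumption: anything between the two quoted thresholds is called
    "intermediate" here; the announcement does not name that range.
    """
    if coverage < 30:
        return "low"
    if coverage >= 100:
        return "high"
    return "intermediate"


if __name__ == "__main__":
    # Hypothetical project: 2 million 75mers on a 5 Mb genome.
    cov = estimated_coverage(2_000_000, 75, 5_000_000)
    print(f"{cov:.1f}x ({coverage_class(cov)})")
```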

While we are at it: default parameters for Solexa de-novo were adapted to work
with at least 75mers. While doing de-novo assembly with smaller read lengths
is still possible for MIRA, the whole concept of ultra-short-read de-novo
assembly is a silly idea in the first place. So don't do it ... the additional 
cost for >= 75mers is peanuts on HiSeq.

The ability to handle IonTorrent data is also new in MIRA, as implementing
support for this kind of sequencing technology was comparatively simple and
straightforward. MIRA supports all read lengths presently on the market
(100bp, 220bp) out of the box, and longer read lengths should not pose a
problem. Current IonTorrent data behaves very much like early 454 GS20 reads
and I am curious whether Life will be able to achieve the same length and
quality improvements within 12 months as 454 did in 2006. Time will tell.

For PacBio, results are a mixed bag: CCS reads as well as error-corrected CLR
data work extremely well with MIRA; at least I'm happy with how the E. coli
C227-11 demo data from the PacBio DevNet gets assembled. I suppose MIRA will
still need a couple more rules regarding the error profile of those reads,
but I'll be able to add them only once I've seen more data. What does not
work at all at the moment (and gives me a terrible headache) are the
uncorrected CLR reads: those with an accuracy of only 80% to 85%. I'm not
sure how to tackle them efficiently.

For mapping assemblies, many smaller and bigger improvements make daily life
easier and improve results with those data sets. Two examples are the
improved mapping quality of reads in highly repetitive regions of a genome
when the reference sequence is not optimal, and the new ability to load
backbone sequences and annotation from GFF3 format files (saving will follow
shortly).

Quality control and automated clipping have been another focus in the past
few months. Notable developments there are automated clipping of known
adaptors in Solexa and IonTorrent data, improvements in the detection and
avoidance of chimeric reads, and some new automated editing algorithms which
edit away pretty clear cases of sequencing errors.

Regarding utilities, 'convert_project' has been revamped to be able to
convert large assembly or data files with less memory. It also got a number of
new options to cover even more use cases. The new tool 'mirabait' makes it
possible to quickly extract reads from a huge data set based on matching
k-mers.
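
For readers unfamiliar with the idea of "baiting" reads by k-mers, the
following is a minimal sketch of the general technique in Python. It is not
mirabait's implementation and does not reflect its options: the function
names, the k-mer length and the `min_hits` threshold are all assumptions
made for illustration only.

```python
def revcomp(seq):
    """Reverse complement of a DNA sequence (A, C, G, T, N only)."""
    return seq.upper().translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def kmers(seq, k):
    """Yield every overlapping k-mer of seq, upper-cased."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_bait_index(bait_seqs, k):
    """Collect all k-mers of the bait sequences, on both strands."""
    index = set()
    for seq in bait_seqs:
        index.update(kmers(seq, k))
        index.update(kmers(revcomp(seq), k))
    return index


def bait_reads(reads, bait_index, k, min_hits=1):
    """Yield (name, seq) for reads sharing at least min_hits k-mers
    with the bait index. `reads` is any iterable of (name, seq) pairs,
    so a large file can be streamed without loading it into memory."""
    for name, seq in reads:
        hits = sum(1 for km in kmers(seq, k) if km in bait_index)
        if hits >= min_hits:
            yield name, seq


if __name__ == "__main__":
    # Hypothetical toy data; real input would be parsed from read files.
    index = build_bait_index(["ACGTACGTAGCTAGCTAGGATCC"], k=8)
    reads = [("r1", "TTTTACGTACGTAGCTTTT"), ("r2", "G" * 16)]
    for name, _ in bait_reads(reads, index, k=8):
        print(name)
```

Only the bait k-mers are held in memory in this sketch, while the reads are
processed one at a time; that is the property that makes the approach
attractive for huge data sets.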

For detailed changes, please consult the src/mira/CHANGES_old.txt file in the
source distribution.


Reports of any errors, unusual behaviour or inconsistencies in the
documentation are gladly accepted via the usual channels.


Have fun with MIRA,
  Bastien
