Dear all,

MIRA V3.4.0 has been released at SourceForge.

  http://sourceforge.net/projects/mira-assembler/

The highlights of this version in marketing-compatible keywords (and
exclamation marks) on one slide:

  ==========================
  |- faster!               |
  |- better!               |
  |- IonTorrent!           |
  |- PacBio!               |
  |- more utilities!       |
  |- better documentation! |
  |- update is for free!   |
  ==========================

For all others, here's the somewhat more verbose version:

Development of the 3.4 series of MIRA concentrated on making assemblies
with 30m to 100m reads more "liveable", i.e., reducing the memory and
disk footprint of MIRA as well as improving run-times. At the same time,
an updated assembly strategy for both genome and EST / RNASeq data was
devised to reduce the influence of chimeras and intronic data on the
assembly. MIRA is now also pretty smart in handling de-novo Solexa
projects with "low coverage" (<30x) as well as "high coverage"
(>= 100x).

While we are at it: default parameters for Solexa de-novo were adapted
to work with at least 75mers. While doing de-novo assembly with smaller
read lengths is still possible for MIRA, the whole concept of
ultra-short-read de-novo assembly is a silly idea in the first place. So
don't do it ... the additional cost for >= 75mers is peanuts on HiSeq.

The new ability to handle IonTorrent data also made its appearance in
MIRA, as implementing support for this kind of sequencing technology was
comparatively simple and straightforward. MIRA supports all kinds of
read lengths presently on the market (100bp, 220bp) out of the box, and
longer read lengths should not pose a problem. Current IonTorrent data
behaves very much like early 454 GS20 reads and I am curious whether
Life will be able to achieve the same length and quality improvement
within 12 months as 454 did in 2006. Time will tell.

For PacBio, results are a mixed bag: CCS reads as well as
error-corrected CLR data work extremely well with MIRA, at least I'm
happy with how the E.
coli C227-11 demo data from the PacBio DevNet gets assembled. I suppose
MIRA will still need to get a couple more rules regarding the error
profile of those reads, but I'll be able to do that only once I've seen
more data. What does not work at all at the moment (and gives me a
terrible headache) are the CLR reads: those with an accuracy of only 80%
to 85%. I'm not sure how to tackle them efficiently.

For mapping assemblies, many smaller and bigger improvements make daily
life easier and improve results with those data sets. Examples are the
improved mapping quality of reads in highly repetitive regions of a
genome when the reference sequence is not optimal, and the new ability
to load backbone sequences and annotation from GFF3 format files (saving
will follow shortly).

Quality control and automated clipping have been another focus in the
past few months. Notable developments there are automated clipping of
known adaptors in Solexa and IonTorrent data, improvements in the
detection and avoidance of chimeric reads, and some new automated
editing algorithms which edit away pretty clear cases of sequencing
errors.

Regarding utilities, 'convert_project' has been revamped to be able to
convert large assembly or data files with less memory. It also got a
number of new options to cover even more use cases. The new tool
'mirabait' makes it possible to quickly extract reads based on matching
k-mers from a huge data set.

For detailed changes, please consult the src/mira/CHANGES_old.txt file
in the source distribution.

Reports of any errors, unusual behaviour or inconsistencies in the
documentation are gladly accepted via the usual channels.

Have fun with MIRA,
  Bastien