[mira_announce] New version 2.9.44x2 / Call for testers

Dear all,

I'm looking for brave testers who want to beta-test the new core assembly 
engine of MIRA. 64-bit only at the moment, sorry.

More specifically, those who have 
- Sanger projects of eukaryotes (anything up to 50MB or 100MB)
- Sanger or 454 projects (or Sanger/454 hybrids) of smaller organisms (anything
   up to 20MB or so, more if you have large machines)

The data can be paired-end (it'll certainly help to get larger contigs) but 
does not need to be.

What I'd like to learn from the above is how MIRA reacts in unexpected 
cases (e.g. those pesky massive repeats of eukaryotes). For those who have 
paired-end data, I'd also be interested to know whether using it makes a large 
difference in the memory consumed by the alignment graphs after the SW 
alignment step.

For the ultimately valiant (because I have not optimized anything there yet;
that's on the TODO list for the coming weeks):
- Sanger/Solexa or 454/Solexa hybrid de-novo assemblies. But there you should
  do yourself a favour and try that only for bacteria, with something like 3
  to 4 million Solexa reads. No Solexa paired-end support here yet, as I still
  need more info on how the reads behave: direction of sequencing etc. (if
  someone knows, please mail me)

Normal mapping assemblies with any of these sequencing technologies should 
work as before, so there is no real need to test them.

Please report anything that seems funny to you. I also welcome the log files of 
successful or failed assemblies.


Here's an excerpt from the README of the package:

Important for version 2.9.44x2
==============================

------------------------------------------------------------------------------
THIS IS A HARDCORE TEST VERSION ...

... because despite the nominally small jump from 2.9.43 to 2.9.44x2, this
version has tons of new stuff activated that I have worked on over the past
six months or so. This is basically the new core of MIRA for the upcoming
version 3.

THAT BEING SAID ...

I trust this version enough to use it for my current and coming sequencing
projects and I welcome anyone to give me feedback while using it.

This version will not appear on the official download site; instead, fetch it
from here (64 bit only, sorry):
http://www.chevreux.org/tmp/mira_2.9.44x2_dev_linux-gnu_x86_64.tar.bz2
------------------------------------------------------------------------------


Most notable changes for users (on the positive side):

- contig building will stop at or in repeats if they cannot be cleanly
  crossed, be it by long reads or with paired-end reads. In the synthetic and
  live data I work with during development, I haven't yet encountered a single
  case of misassembly where contig parts would have been joined that did not
  belong together.
  In contrast, the old version made assumptions on how the repeat would
  continue, and this sometimes caused misassemblies.

- besides the improvement in quality, the contig building process got
  noticeably faster.

- chimeras are definitely a problem of the past. They are now recognised and
  cut back to the longest clean sequence. The catch: contig parts held
  together by a single read will now fall apart. This is a problem for low
  coverage projects only and therefore, with the high throughput sequencing
  technologies, not much of a threat nowadays.
  In contrast, the old version did not give special attention to chimeras. The
  old assembly engine used routines that prevented use of chimeras most of the
  time ... but not always, and therefore, misassemblies could happen there.

- enhanced use of paired-ends. If repeats can be crossed by paired-ends in
  theory, they almost certainly will be crossed in practice. Even if there are
  several 100% identical repeats in the genome, I expect crossings to be
  correct.

- enhanced ability to handle smaller eukaryotes (expected, not verified
  yet). I got mixed feedback in the past regarding the ability to handle
  smaller eukaryotes in the range from 10 to 100MB. Sometimes the memory
  consumption would simply explode. New routines are addressing this but as I
  have no eukaryotic assembly project at the moment, I can't confirm.
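
As a rough illustration of the "cut back to the longest clean sequence" idea
from the chimera item above: the sketch below is not MIRA's code. It assumes a
hypothetical per-base flag list (True = base looks clean) that some earlier
chimera check has already produced, and simply keeps the longest unbroken
clean stretch of the read. The function names are made up for this example.

```python
from typing import List, Tuple


def longest_clean_span(clean: List[bool]) -> Tuple[int, int]:
    """Return (start, end) of the longest run of True values, end exclusive.

    `clean` is a hypothetical per-base mask; how such a mask would be
    computed is outside the scope of this sketch.
    """
    best_start, best_end = 0, 0
    run_start = None
    for i, ok in enumerate(clean):
        if ok:
            if run_start is None:
                run_start = i  # a new clean run begins here
            if i + 1 - run_start > best_end - best_start:
                best_start, best_end = run_start, i + 1
        else:
            run_start = None  # run is interrupted by a suspicious base
    return best_start, best_end


def clip_read(sequence: str, clean: List[bool]) -> str:
    """Cut a read back to its longest clean stretch."""
    start, end = longest_clean_span(clean)
    return sequence[start:end]


if __name__ == "__main__":
    read = "ACGTACGTACGTACG"
    mask = [True] * 5 + [False] * 2 + [True] * 8
    print(clip_read(read, mask))
```

With a mask like this, a read that was the only link between two contig parts
is shortened, which is why such parts can now fall apart.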


CURRENT CAVEATS (on the negative side):

- I haven't tested EST assembly with the new routines. They may work, but they
  also might fail miserably. Feedback appreciated.

- while the overall memory requirement for larger projects (eukaryotes) is
  noticeably lower, the overall memory requirement for smaller projects has
  increased a bit (2 bytes per base in the reads, to be exact)
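
To get a feeling for what "2 bytes per base" means in practice, here is a
small back-of-the-envelope calculation. The only figure taken from the text
above is the 2 bytes per base; the read count and the read length in the
example are assumptions picked for illustration, and the function name is
made up.

```python
def extra_memory_bytes(num_reads: int, mean_read_length: int,
                       bytes_per_base: int = 2) -> int:
    """Estimate the additional memory from a fixed per-base overhead."""
    return num_reads * mean_read_length * bytes_per_base


if __name__ == "__main__":
    # Assumed example: 3.5 million short reads of 36 bases each.
    extra = extra_memory_bytes(3_500_000, 36)
    print(f"{extra} bytes, about {extra / 2**20:.1f} MiB")
```

For real projects, plug in your own read count and average read length.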


Have fun with it.

Regards,
  Bastien

