[mira_announce] New version 2.9.44x2 / Call for testers
- From: Bastien Chevreux <bach@xxxxxxxxxxxx>
- To: mira_announce@xxxxxxxxxxxxx
- Date: Sat, 18 Apr 2009 23:03:25 +0200
Dear all,
I'm looking for brave testers who want to beta-test the new core assembly
engine of MIRA. 64-bit only at the moment, sorry.
More specifically, those who have
- Sanger projects of eukaryotes (anything up to 50MB or 100MB)
- Sanger or 454 projects (or Sanger/454 hybrids) of smaller organisms (anything
up to 20MB or so, more if you have large machines)
The data can be paired-end (it'll certainly help to get larger contigs) but
does not need to be.
What I'd like to learn from the above is how MIRA reacts in unexpected
cases (e.g. those pesky massive repeats of eukaryotes). For those who have
paired-end data, I'd also like to know whether the memory consumed by the
alignment graphs after the SW alignment step differs much from runs without
paired-end data.
For the ultimately valiant (because I did not optimize anything there yet as
that's on the TODO for the coming weeks):
- Sanger/Solexa or 454/Solexa hybrid de-novo assemblies. But there you should
do yourself a favour and do that only for bacteria and something like 3 to 4
million Solexa reads. No Solexa paired-end here yet, as I still need more
info on how the reads behave: direction of sequencing etc. (if someone
knows, please mail me).
Normal mapping assemblies with either sequencing technology should work like
before, so no real need to test.
Please report anything that seems funny to you. I also welcome the log files of
successful or failed assemblies.
Here's an excerpt from the README of the package:
Important for version 2.9.44x2
==============================
------------------------------------------------------------------------------
THIS IS A HARDCORE TEST VERSION ...
... because despite the nominally small jump from 2.9.43 to 2.9.44x2, this
version now has tons of new stuff activated that I worked on over the past
six months or so. These are basically the new core of MIRA for the upcoming
version 3.
THAT BEING SAID ...
I trust this version enough to use it for my current and coming sequencing
projects and I welcome anyone to give me feedback while using it.
This version will not appear on the official download site. Instead, fetch it
from here (64 bit only, sorry):
http://www.chevreux.org/tmp/mira_2.9.44x2_dev_linux-gnu_x86_64.tar.bz2
------------------------------------------------------------------------------
Most notable changes for users (on the positive side):
- contig building will stop at or in repeats if they cannot be cleanly
crossed, be it by long reads or with paired-end reads. In the synthetic and
live data I work with during development, I haven't yet encountered a single
case of misassembly where contig parts would have been joined that did not
belong together.
In contrast, the old version made assumptions on how the repeat would
continue and this sometimes caused misassemblies.
- besides the improvement in quality, the contig building process got
noticeably faster.
- chimeras are definitely a problem of the past. They are now recognised and
cut back to the longest clean sequence. The catch: contig parts held
together by a single read will now fall apart. This is a problem for low
coverage projects only and therefore, with the high throughput sequencing
technologies, not much of a threat nowadays.
In contrast, the old version did not give special attention to chimeras. The
old assembly engine used routines that prevented use of chimeras most of the
time ... but not always, and therefore, misassemblies could happen there.
- enhanced use of paired-ends. If repeats can be crossed by paired-ends in
theory, they almost certainly will be crossed in practice. Even if there are
several 100% identical repeats in the genome, I expect crossings to be
correct.
- enhanced ability to handle smaller eukaryotes (expected, not verified
yet). I got mixed feedback in the past regarding the ability to handle
smaller eukaryotes in the range from 10 to 100MB. Sometimes the memory
consumption would simply explode. New routines are addressing this but as I
have no eukaryotic assembly project at the moment, I can't confirm.
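To picture what "cut back to the longest clean sequence" means for a chimeric
read, here is a toy sketch in Python. It is not MIRA's actual routine: the
function name longest_clean_segment is made up for illustration, and it simply
assumes that the suspected chimeric breakpoint positions within a read are
already known.

```python
def longest_clean_segment(read_length, breakpoints):
    """Return the (start, end) half-open interval of the longest stretch
    of a read that contains no suspected chimeric breakpoint.

    Illustration only: how breakpoints are detected is NOT modelled here,
    they are assumed to be given as positions inside the read.
    """
    # Keep only breakpoints that fall strictly inside the read.
    inner = sorted(b for b in set(breakpoints) if 0 < b < read_length)
    cuts = [0] + inner + [read_length]

    best = (0, 0)
    # Walk over every stretch between two neighbouring cut positions.
    for start, end in zip(cuts, cuts[1:]):
        if end - start > best[1] - best[0]:
            best = (start, end)
    return best


# Hypothetical 800-base read with one suspected breakpoint at position 520:
print(longest_clean_segment(800, [520]))  # (0, 520)
```

The shorter part (bases 520 to 800 in this made-up example) is discarded, which
is also why contig parts that were held together only by such a read fall
apart.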
CURRENT CAVEATS (on the negative side):
- I haven't tested EST assembly with the new routines. They may work, but they
also might fail miserably. Feedback appreciated.
- while the overall memory requirements for larger projects (eukaryotes) are
noticeably lower, the overall memory requirements for smaller projects have
increased a bit (2 bytes per base in each read, to be exact).
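As a rough back-of-the-envelope check of the "2 bytes per base" figure, the
extra memory can be estimated from the read count and the mean read length.
The Python sketch below uses a hypothetical helper, extra_memory_bytes, and a
made-up project size; it only multiplies the numbers out and says nothing about
MIRA's total memory footprint.

```python
def extra_memory_bytes(num_reads, mean_read_length, bytes_per_base=2):
    """Estimate the additional memory caused by a fixed per-base overhead.

    bytes_per_base defaults to the 2 bytes per read base quoted above.
    """
    return num_reads * mean_read_length * bytes_per_base


# Hypothetical project: 500,000 reads of 250 bases each.
extra = extra_memory_bytes(500_000, 250)
print(f"{extra / 1e6:.0f} MB extra")  # 250 MB extra
```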
Have fun with it.
Regards,
Bastien