[mira_talk] Re: assembly parameters and more
- From: Bastien Chevreux <bach@xxxxxxxxxxxx>
- To: mira_talk@xxxxxxxxxxxxx
- Date: Wed, 11 Mar 2009 21:16:02 +0100
On Wednesday 11 March 2009 Davide Sassera (davide.sassera) wrote:
> I’m currently assembling a 1,5 Mb bacterial genome.
> I have half
> a titanium plate (520K reads), half a normal gs-flx plate (270k reads) and
> a little bit of sanger just to spice it up.
> I know it’s a lot of sequences for such a small genome
Hello Davide,
there's no such thing as overkill :-) Well, perhaps there is, but a little bit
of overkill never harmed anyone ... in sequencing.
> but we were afraid of having
> issues with contaminating DNA and chimeras formed by whole genome
> amplification, both of which in fact we have.
Ouch. Difficult organism to get growing?
> I’m working
> on a 3.16Ghz dual core with 16GB of ram, and since I was sure my system was
> going to handle the data, I went with an ultra accurate assembly:
> [...]
> Now on with the questions:
> Am I being too strict with the parameters I’m using?
Hmmm. Yes and no. Let's go through.
Jan already pointed out that -AS:mrl=30 is quite low. He's right. You have a
coverage of at least 100x, you really should not bother with such small
things. Leave it at the default or set it even to 80, getting rid of small
junk doesn't harm in this case. Btw, paired-end reads smaller than this would
still get included as they may provide valuable links for the assembly.
-AS:rbl=8 is overdoing. Set it to 2, that should be enough. Reason is that
these innner loops are mostly used to get rid of simple sequencing errors ...
and after the third pass 99% of what could be corrected has been corrected
anyway.
Speaking of passes: having 10 passes (-AS:nop=10) is also probably too much. I
use 7 in the accurate mode just because the old GS20 sequences had more
sequencing errors than I though, and this helped to get ri of them for good.
But the newer FLX are quite ok. Also, the repeats should be pretty much
resolved after pass 5 or 6.
-SK:pr=97 is not the ideal way to enforce strict alignments. This probably
throws away too many good alignments in the SKIM phase. A better way would be
to keep -SK:pr at 80 (or 90) but be stricter in the Smith-Waterman alignment
phase: -AL:mrs=80 is default in accurate mode, you might want to try 90. On
the other hand, fiddling with these strictness parameters is needed only on
rare occasions (you might have one) as MIRA pretty well finds out by itself
where misassemblies happened and marks them as "don't do that in the next
pass". On the other hand, you wrote you have chimeras, so keeping this around
90 might be a good idea.
> Now: it
> seems things are harder than I thought, after 6 days it is still doing the
> second step and it’s swapping 12gigs!!!
*This* is most unusual. MIRA is a memory hog, but should not be that greedy.
Can you please send me the output log, I'd like to check a few things.
> Do the assembly steps take all the same time? It seems that step 2 is
> taking much longer than step 1
That is to be expected and a direct effect of -AS:rbl=8. Indeed, MIRA cheats in
the first pass and does an "rbl=1" if it knows that there are more passes to
come. As I wrote, keep -AS:rbl at 2 and things should be a lot quicker.
> Do all step take the same memory? Again it seems the second step is more
> demandingIf my assumptions are correct I will either wait for months or the
> assembly will stop for lack of memory soon, right?
Memory consumption tends to rise a bit with each pass, especially when MIRA
needs to keep a list of overlaps it shouldn't take (du to repeats). However,
using 28GB for a 1.5mb bacterium is really too much, even for MIRA.
> So what should I do now?
Pray? No, just kidding. First, stop that assembly and send me that output log
... I have the bad feeling that there's something lurking in your data set
which is not normal.
Then, while I look through that, restart your project with this:
mira -project=050309 -job=denovo,accurate,454,sanger,genome
-GE:not=2
-AS:klrs=1 -AL:mrs=90 -CO:rodirs=15
454_SETTINGS -AS:mrl=80 -AL:mrs=90 -CO:rodirs=20:mrpg=10
There are a few other differences to your call:
- -SK:mnr is off as you should not need it for a simple bacterium. If yes, then
you are in deep trouble anyway.
- -AL:mrs is higher to catch chimeras a bit earlier
- -CO:rodirs is set a bit stricter. This could lead to a few reads with
sequencing errors not to be built in, but ... you have more than enough of
them.
- -CO:mrpg has been increased for 454 reads. This is due to the high coverage
you have which in turn might lead to sequencing errors, which might amass at
a given position, to be wrongly seen as repeat. Given a coverage of 100, you
could also try 20 or 30 for this value.
> Stop bothering you guys (Bastien above all) with stupid questions?
Well, if I don't hear of problems, I can't improve the assembler. Your choice
:-)
> Thank you
> in advance, I really feel that the constant updates and this competent and
> relaxed mailing list make Mira stand above all the competitors
MIRA does a few things differently than other assembler. Some better, some
worse. I'm working on getting rid of the latter.
Regards,
Bastien
--
You have received this mail because you are subscribed to the mira_talk mailing
list. For information on how to subscribe or unsubscribe, please visit
http://www.chevreux.org/mira_mailinglists.html
Other related posts: