[mira_talk] Re: Assembling "contigs" from previous assemblies (Was Re: debris file and lrc)
- From: Bastien Chevreux <bach@xxxxxxxxxxxx>
- To: mira_talk@xxxxxxxxxxxxx
- Date: Thu, 19 Feb 2009 11:39:32 +0100
On Wednesday 18 February 2009 mark.rose@xxxxxxxxxxxx wrote:
> Thanks for the quick and robust reply. Given the things you said I'm
> wondering whether the fact that I am trying to assemble sequences
> (contigs) derived from formerly assembled 454 reads (and in the future
> Sanger based assembled contigs) and not the reads themselves. Moreover
> these contig sets (coming from different partial genome assemblies) are
> possibly only (and perhaps minimally) overlapping with contigs from the
> other contig sets I'm attempting to assemble. Am I understanding you
> correctly in thinking that such non-overlapping, unique (by virtue of
> the previous subset assembly) sequences would wind up in the debris? If
> so, what is the difference between such "debris" and singlets in the
> project sequence results files? I'm wondering whether these debris
> sequences (which incidentally appear normal and above the sequence
> length cut-offs) should be included in my result set for this project.
Hi Mark,
this usage is indeed quite different from what MIRA expects when given the
"--job=genome" shortcut, namely a lot of error-prone reads that form contigs
with "a certain coverage". Certain algorithms (e.g. quality control and
read distribution) will get confused and produce ... well, nothing usable, as
you have experienced.
Let's see how to correct for this. But first a question from my side: MIRA
normally refuses to assemble input reads longer than 20kb. Are all
your contigs smaller than this (I wouldn't believe so), or did you fragment
them (and if so, how)?
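To see which input sequences run into the 20kb limit mentioned above, a small check along the following lines might help. This is only a sketch: the function names are made up for illustration, the input is assumed to be plain FASTA, and "20kb" is taken to mean 20000 bases, which may not be the exact threshold MIRA uses.

```python
def fasta_lengths(lines):
    """Return {sequence name: length} for an iterable of FASTA lines."""
    lengths = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header line as the sequence name.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def too_long(lengths, limit=20000):
    """Keep only the sequences longer than `limit` bases (assumed limit)."""
    return {name: n for name, n in lengths.items() if n > limit}


# Tiny in-memory example; for a real file pass an open file handle instead.
example = [">contig_a", "ACGT" * 10, ">contig_b", "ACGT" * 6000]
print(too_long(fasta_lengths(example)))
```

For a file on disk, `fasta_lengths(open("contigs.fasta"))` would do the same, where the file name is just a placeholder.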
As I have never used MIRA for that kind of task, the following "how to configure
MIRA for that job" is somewhat theoretical. You may need to tweak a few things
and make a few attempts before it works as expected. I suggest you build a
small test set from a couple of input sequences you know should overlap and run
your trials on that.
I'd start configuring like this (and this applies only to a scenario where you
have "low-coverage, long sequences"; if you mix that with real shotgun data,
there might be better ways):
- "--job=denovo,genome,normal" This basically tells MIRA that it needs to
assemble de novo and that the input is not EST data.
- then switch off all clippings, since you basically expect that your input is
not too bad: "--noclippings"
- in the same vein, switch off read extension (-DP:ure=no). Remember to do
this for every input type, i.e., if you declared that you load 454 sequences,
put this in the part for 454 parameters ("454_SETTINGS
-DP:ure=no")
- then switch off the automatic repeat detection (-AS:ard=no), as this does a
coverage analysis, which in your case is pretty counter-productive
- in case you work with the "keep contigs in memory" option, switch off the
spoiler detection (-AS:sd=no)
- (minor point, but might be useful) you might want to think about giving MIRA
strain information in case the contigs come from different strains (or even
different cultures of the same strain). Have a look at the "-SB" parameters
that deal with strain information.
- it might be necessary to increase the sensitivity of SKIM. Try
"-SK:hss=1:pr=70"
- if you expect the ends of your input sequences to be more noisy, try to
switch off the extra gap penalty (-AL:egp=no, sequencing type dependent!).
Alas, this might lead to a scenario where "almost identical repeats within
the genome" get more easily assembled together, so use with care.
- you want singlets in your assembly. As nothing should have been filtered out
with the above setup, switch on the saving of singlets: "-OUT:sssip=yes"
(again, this is sequencing type dependent)
- I'm not sure whether the contig misassembly detection is useful in this
scenario. Once you've settled the remaining parameters and are using them on
your real data set, check whether "-CO:mr=no" gives better results.
Alternatively, increase the number of "reads" needed to tag misassemblies;
that would be the "-CO:mrpg" parameter (the latter is again sequencing type
dependent)
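The settings listed above can be collected into one argument list, for example with a small helper like the one below. This is a sketch under assumptions: the function and its keyword arguments are hypothetical, only switches named in this mail are used, and placing the sequencing type dependent switches after a "454_SETTINGS" marker follows the "454_SETTINGS -DP:ure=no" example. The executable name, project name and input options are deliberately left out, and "-CO:mr=no" / "-CO:mrpg" are omitted because they are only worth trying once the rest is settled.

```python
# Switches that the list above does not describe as sequencing type dependent.
GENERAL = [
    "--job=denovo,genome,normal",  # de novo assembly, input is not EST data
    "--noclippings",               # switch off all clippings
    "-AS:ard=no",                  # no automatic repeat detection
    "-SK:hss=1:pr=70",             # more sensitive SKIM
]

# Switches described as sequencing type dependent.
PER_TECHNOLOGY = [
    "-DP:ure=no",      # no read extension
    "-OUT:sssip=yes",  # save singlets
]


def build_mira_args(technologies=("454_SETTINGS",), in_memory=False,
                    noisy_ends=False):
    """Assemble the suggested switches; all argument names are hypothetical."""
    args = list(GENERAL)
    if in_memory:
        # Only relevant with the "keep contigs in memory" option.
        args.append("-AS:sd=no")
    for tech in technologies:
        args.append(tech)
        args.extend(PER_TECHNOLOGY)
        if noisy_ends:
            # Use with care: may merge almost identical repeats.
            args.append("-AL:egp=no")
    return args


print(" ".join(build_mira_args()))
```

Keeping the optional switches behind keyword arguments makes it easy to run the small test set once per variant and compare the results.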
Hope that helps.
Regards,
Bastien
--
You have received this mail because you are subscribed to the mira_talk mailing
list. For information on how to subscribe or unsubscribe, please visit
http://www.chevreux.org/mira_mailinglists.html