[mira_talk] Re: Assembling "contigs" from previous assemblies (Was Re: debris file and lrc)

On Wednesday 18 February 2009 mark.rose@xxxxxxxxxxxx wrote:
> Thanks for the quick and robust reply.  Given the things you said I'm
> wondering whether the fact that I am trying to assemble sequences
> (contigs) derived from formerly assembled 454 reads (and in the future
> Sanger based assembled contigs) and not the reads themselves.  Moreover
> these contig sets (coming from different partial genome assemblies) are
> possibly only (and perhaps minimally) overlapping with contigs from the
> other contig sets I'm attempting to assemble.  Am I understanding you
> correctly in thinking that such non-overlapping, unique (by virtue of
> the previous subset assembly) sequences would wind up in the debris?  If
> so, what is the difference between such "debris" and singlets in the
> project sequence results files?  I'm wondering whether these debris
> sequences (which incidentally appear normal and above the sequence
> length cut-offs) should be included in my result set for this project.

Hi Mark,

this usage is indeed quite different from what MIRA expects when given the
"--job=genome" short-cut: many error-prone reads that form contigs with
"a certain coverage". Some algorithms (e.g. the quality control and read
distribution) will get confused and produce ... well, nothing usable, as
you have found out.

Let's see how to correct for this. But first a question from my side: MIRA 
normally refuses to assemble input reads longer than 20kb. Are all your 
contigs shorter than this (I doubt it), or did you fragment them (and if so, 
how)?
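
To answer the length question quickly, something along these lines could list
the input sequences above the limit. This is only a sketch: it assumes the
contigs are in FASTA format and that "20kb" means 20,000 bases; the function
names are made up for illustration.

```python
MAX_LEN = 20000  # assumption: the 20kb limit taken as 20,000 bases


def fasta_lengths(lines):
    """Return {sequence name: length} for an iterable of FASTA lines."""
    lengths = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The name is the first word after '>'; the rest is description.
            fields = line[1:].split()
            name = fields[0] if fields else ""
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def too_long(lines, max_len=MAX_LEN):
    """Return the sorted names of sequences longer than max_len."""
    return sorted(n for n, length in fasta_lengths(lines).items()
                  if length > max_len)
```

Passing an open file handle of the contig FASTA to too_long() gives the names
of the sequences that would need special treatment.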

As I have never used MIRA for that kind of task, the following "how to 
configure MIRA for that job" is somewhat theoretical. You may need to tweak a 
few things and make a few attempts before it works as expected. I suggest you 
make a small test set with a couple of input sequences you know should overlap 
and run trials on that.
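
One way to pull such a test set out of a larger file is sketched here. It
assumes FASTA input; select_records and the contig names in the usage note are
hypothetical.

```python
def select_records(lines, wanted):
    """Return the FASTA lines of the records whose name is in `wanted`."""
    wanted = set(wanted)
    keep = False
    selected = []
    for line in lines:
        if line.startswith(">"):
            # Decide per record: the name is the first word after '>'.
            fields = line[1:].split()
            keep = bool(fields) and fields[0] in wanted
        if keep:
            selected.append(line.rstrip("\n"))
    return selected
```

For example, select_records(open("all_contigs.fasta"), ["contigA", "contigB"])
would return only the lines of those two records, which can then be written to
a small test file.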

I'd start configuring like this (this applies only to a scenario where you 
have "low-coverage, long sequences"; if you mix that with real shotgun data, 
there might be better ways):
- "--job=denovo,genome,normal"  This basically tells MIRA that it needs to
  assemble from scratch and that the input is not EST data.
- then switch off all clippings, as you basically expect that your input is
  not too bad: "--noclippings"
- in the same vein, switch off read extension (-DP:ure=no). Remember to do
  this for every input type, i.e., if you declared to load 454 sequences, put
  it in the part for 454 parameters ("454_SETTINGS -DP:ure=no")
- then switch off the automatic repeat detection (-AS:ard=no) as this does
  coverage analysis which in your case is pretty counter-productive
- in case you work with the "keep contigs in memory" option, switch off the
  spoiler detection (-AS:sd=no)
- (minor point, but might be useful) you might want to think about giving MIRA
  strain information in case the contigs come from different strains (or even
  different cultures of the same strain). Have a look at the "-SB" parameters
  that deal with strain information.
- it might be necessary to increase the sensitivity of SKIM. Try out 
  "-SK:hss=1:pr=70"
- if you expect the ends of your input sequences to be more noisy, try to
  switch off the extra gap penalty (-AL:egp=no, sequencing type dependent!).
  Alas, this might lead to a scenario where "almost identical repeats within
  the genome" get more easily assembled together, so use with care.
- you want singlets in your assembly. As nothing should have been filtered out
  with the above setup, switch on the saving of singlets: "-OUT:sssip=yes"
  (again, this is sequencing type dependent)
- I'm not sure whether the contig misassembly detection is useful in this
  scenario. Once you've settled the remaining parameters and are using it on
  your real data set, try out whether "-CO:mr=no" gives better results.
  Alternatively, increase the number of "reads" needed to tag misassemblies,
  that would be the "-CO:mrpg" parameter (the last one again sequencing type
  dependent)
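
Put together, the list above could be turned into a command line roughly as
follows. This is an untested sketch, just like the configuration itself: the
executable name "mira", the "--project=" option, the project name and the
function build_mira_args are my assumptions here, while the parameters are
the ones from the list. The sequencing type dependent ones are placed after
the "454_SETTINGS" marker.

```python
import shlex


def build_mira_args(project, seq_type="454", keep_contigs_in_memory=False,
                    noisy_ends=False, misassembly_detection=True):
    """Assemble a MIRA argument list from the suggestions above."""
    # General parameters first. "mira" and "--project=" are assumptions.
    args = ["mira", "--project=" + project,
            "--job=denovo,genome,normal",
            "--noclippings",       # no clipping of the input
            "-AS:ard=no",          # no automatic repeat detection
            "-SK:hss=1:pr=70"]     # more sensitive SKIM
    if keep_contigs_in_memory:
        args.append("-AS:sd=no")   # no spoiler detection
    if not misassembly_detection:
        args.append("-CO:mr=no")   # only after testing on real data
    # Sequencing type dependent parameters go after the type marker.
    args += [seq_type + "_SETTINGS",
             "-DP:ure=no",         # no read extension
             "-OUT:sssip=yes"]     # save singlets
    if noisy_ends:
        args.append("-AL:egp=no")  # no extra gap penalty, use with care
    return args


if __name__ == "__main__":
    # Print the command instead of running it, so it can be checked first.
    print(shlex.join(build_mira_args("contig_test")))
```

Printing rather than executing the command is deliberate: it makes it easy to
compare against the documentation of your MIRA version before a trial run.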

Hope that helps.

Regards,
  Bastien


