[mira_talk] Re: Assembling "contigs" from previous assemblies (Was Re: debris file and lrc)

Hi Mark,

I don't think velvet is a good choice, as it depends
on a high coverage of input data AFAIK.

What about treating your data like a "normalised cDNA
library"? No genome reconstruction, low coverage etc?

--job=denovo,EST,normal

Or, just like I do for EST data, clustering to group
sequences with overlaps/identities and subsequent
assembly of each generated cluster separately with your
assembler of choice (e.g. cap3).
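
A minimal sketch of the clustering idea, not the tool Sven actually uses: sequences are grouped by single linkage whenever they share at least one exact k-mer on either strand, and each resulting group could then be written out and assembled on its own. The function name, the k-mer criterion and the default k=12 are assumptions made for illustration; a real pipeline would more likely rely on alignment-based overlap detection.

```python
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    revcomp = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, revcomp)


def cluster_by_shared_kmers(seqs, k=12):
    """Group sequence names that share at least one k-mer (single linkage).

    seqs maps a sequence name to its sequence string.
    Returns a sorted list of sorted name lists, one list per cluster.
    """
    parent = {name: name for name in seqs}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_seen = {}  # canonical k-mer -> name of first sequence containing it
    for name, seq in seqs.items():
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            key = canonical(seq[i:i + k])
            owner = first_seen.setdefault(key, name)
            if owner != name:
                # Shared k-mer: merge the two clusters.
                parent[find(name)] = find(owner)

    clusters = defaultdict(list)
    for name in seqs:
        clusters[find(name)].append(name)
    return sorted(sorted(members) for members in clusters.values())
```

Singleton clusters are kept in the result, so sequences without any detected overlap are not lost.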

I am not really sure whether either of these suggestions
will work in practice ... just give it a try :-)

Another 2p,
Sven

+++ mark.rose@xxxxxxxxxxxx (19.02.2009 15:55):
Hi Bastien and Sven

Let me explain more clearly what my situation and goals are.  I have
two projects at the moment that are relevant.  Both involve assembling
contigs from previous assemblies. My purpose in assembling
these sequence datasets is to produce single datasets to be used as
references for resequencing projects aimed at SNP detection.  As such I
am not interested in creating accurate representations of large
stretches of genome but rather non-redundant, "uni-sequence" sets.

Projects
1) I have contigs from two 454 assemblies of reads derived from 2
different genomic reduction methods.  The reads from these two methods
likely overlap to some unknown extent.  My goal is to create a single
dataset from these two, assembling contigs from each where overlap
occurs but retaining non-overlapping contigs (of which there are likely
to be many).  No contig from either of these 2 assemblies exceeds 5kb
and most are considerably smaller.
2) I have 3 assemblies of 454 reads (all derived from different genomic
reduction methods as above), a collection of partially sequenced BACs
and end sequence from a fosmid library.  My approach to this is first to
perform some dataset redundancy cleanup by aligning the 454 contigs and
the fosmid sequence to the large unmasked BAC reference (this organism
has a high repetitive sequence content as does the one above).  I am
using mosaik to do this.  Sequences that align will be discarded.  I will
then take a masked version of the BACs and split them on the masked
elements into subsequences.  This will remove the masked (naturally
repetitive) elements which are not of interest.   I have not done this
yet and so don't know how large the resulting unmasked subsequences will
be. If necessary I can further fragment these sequences to get them
under 20kb.  Then finally I will assemble these unmasked subsequences
with the unaligned contigs from the 454 assemblies and the fosmid
sequences (which are each under 1kb).
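
The splitting and fragmenting steps described above can be sketched as follows. This is only an illustration under stated assumptions: masked elements are taken to be hard-masked as runs of "N" (soft-masked, lowercase input would need a different pattern), and the function names as well as the optional overlap between fragments are hypothetical choices, not something from this thread.

```python
import re


def split_on_masked(seq, mask_char="N", min_len=1):
    """Return the unmasked subsequences of seq, dropping runs of mask_char.

    Assumes hard masking, i.e. repeats replaced by mask_char.
    Pieces shorter than min_len are discarded.
    """
    parts = re.split(re.escape(mask_char) + "+", seq)
    return [p for p in parts if len(p) >= min_len]


def fragment(seq, max_len=20000, overlap=0):
    """Cut seq into pieces of at most max_len bases.

    Consecutive pieces share `overlap` bases, which may help an
    assembler join them again (an assumption, untested here).
    """
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must be >= 0 and smaller than max_len")
    if len(seq) <= max_len:
        return [seq]
    step = max_len - overlap
    pieces = []
    for start in range(0, len(seq), step):
        pieces.append(seq[start:start + max_len])
        if start + max_len >= len(seq):
            break  # the last piece already reaches the end of seq
    return pieces
```

Applying fragment() to each piece returned by split_on_masked() would give subsequences that contain no masked runs and stay under the 20kb limit.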

Just wondering too, given my limited goals for these assemblies, whether
a tool like velvet might be an alternative?


Thanks for your help

Mark
-----Original Message-----
From: mira_talk-bounce@xxxxxxxxxxxxx
[mailto:mira_talk-bounce@xxxxxxxxxxxxx] On Behalf Of Bastien Chevreux
Sent: Thursday, February 19, 2009 5:40 AM
To: mira_talk@xxxxxxxxxxxxx
Subject: [mira_talk] Re: Assembling "contigs" from previous assemblies
(Was Re: debris file and lrc)

On Wednesday 18 February 2009 mark.rose@xxxxxxxxxxxx wrote:
Thanks for the quick and robust reply. Given the things you said, I'm
wondering whether the problem is that I am trying to assemble sequences
(contigs) derived from formerly assembled 454 reads (and in the future
Sanger based assembled contigs) and not the reads themselves. Moreover,
these contig sets (coming from different partial genome assemblies) are
possibly only (and perhaps minimally) overlapping with contigs from the
other contig sets I'm attempting to assemble. Am I understanding you
correctly in thinking that such non-overlapping, unique (by virtue of
the previous subset assembly) sequences would wind up in the debris? If
so, what is the difference between such "debris" and singlets in the
project sequence results files? I'm wondering whether these debris
sequences (which incidentally appear normal and above the sequence
length cut-offs) should be included in my result set for this project.

Hi Mark,

this usage is indeed quite different from what MIRA expects when being
given the "--job=genome" short-cut: a lot of error-prone reads that form
contigs with "a certain coverage". Certain algorithms (e.g. the quality
control and read distribution) will get confused and produce ... well,
nothing usable as you've had to experience.

Let's see how to correct for this. But first a question from my side:
MIRA normally refuses to assemble input reads longer than 20kb.
Are all your contigs smaller than this (I wouldn't believe so), or did
you fragment them (and if so, how)?
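
One quick way to answer the length question is to scan the input FASTA for entries above the limit. The sketch below is illustrative only: the helper names are made up, and the 20kb default simply mirrors the figure mentioned above.

```python
import io


def read_fasta(handle):
    """Yield (name, sequence) pairs from an open FASTA file handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            fields = line[1:].split()
            name, chunks = (fields[0] if fields else ""), []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def over_limit(handle, limit=20000):
    """Return (name, length) for every sequence longer than limit."""
    return [(n, len(s)) for n, s in read_fasta(handle) if len(s) > limit]


# Tiny in-memory demo with a limit of 20 bases instead of 20kb.
demo = io.StringIO(">c1 test\nACGT\nAC\n>c2\n" + "A" * 25 + "\n")
print(over_limit(demo, limit=20))  # prints [('c2', 25)]
```

For a real file, pass an open file handle instead of the StringIO object.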

As I have never used MIRA for that kind of task, the following "how to
configure MIRA for that job" is somewhat theoretical. It might be that
you will need to tweak a few things and do a few tries before it works
as expected. I suggest you make a small test set with a couple of input
sequences you know should overlap and do trials on that.
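
Building such a test set can be as simple as copying a few named records out of the full FASTA file. A rough sketch, assuming the record ID is the first word of each header line; the function name and the example IDs are hypothetical:

```python
def extract_subset(lines, wanted):
    """Return the FASTA lines of all records whose ID is in `wanted`.

    lines:  iterable of FASTA lines (for example an open file)
    wanted: set of record IDs, where the ID is the first word after '>'
    """
    keep = False
    out = []
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            # Decide per record whether its lines are copied.
            keep = bool(fields) and fields[0] in wanted
        if keep:
            out.append(line.rstrip("\n"))
    return out
```

The returned lines can be joined with newlines and written to a small input file for trial runs.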

I'd start configuring like this (and this applies only to a scenario
where you have "low-coverage, long sequences"; if you mix that with real
shotgun data then there might be better ways):
- "--job=denovo,genome,normal". This basically tells MIRA that it needs
  to assemble anew and that the input is not EST data.
- then switch off all clippings, as you basically expect that your input
  is not too bad: "--noclippings"
- along the same vein, switch off read extension (-DP:ure=no). Remember
  to do this for every input type, i.e., if you declared to load 454
  sequences, do this by putting it in the part for 454 parameters
  ("454_SETTINGS -DP:ure=no")
- then switch off the automatic repeat detection (-AS:ard=no) as this
  does coverage analysis, which in your case is pretty
  counter-productive
- in case you work with the "keep contigs in memory" option, switch off
  the spoiler detection (-AS:sd=no)
- (minor point, but might be useful) you might want to think about
  giving MIRA strain information in case the contigs come from different
  strains (or even different cultures of the same strain). Have a look
  at the "-SB" parameters that deal with strain information.
- it might be necessary to increase the sensitivity of SKIM. Try out
  "-SK:hss=1:pr=70"
- if you expect the ends of your input sequences to be more noisy, try
  to switch off the extra gap penalty (-AL:egp=no, sequencing type
  dependent!). Alas, this might lead to a scenario where "almost
  identical repeats within the genome" get more easily assembled
  together, so use with care.
- you want singlets in your assembly. As nothing should have been
  filtered out with the above setup, switch on the saving of singlets:
  "-OUT:sssip=yes" (again, this is sequencing type dependent)
- I'm not sure whether the contig misassembly detection is useful in
  this scenario. Once you've settled the remaining parameters and are
  using it on your real data set, try out whether "-CO:mr=no" gives
  better results. Alternatively, increase the number of "reads" needed
  to tag misassemblies, that would be the "-CO:mrpg" parameter (the last
  one again sequencing type dependent)
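
To keep trial runs reproducible, the options listed above could be collected by a small helper like the one below. This is a sketch, not a tested recipe: the function and its arguments are hypothetical, it only arranges parameters quoted in this mail, and placing every option that is not explicitly called sequencing type dependent into the general part is an assumption. The program name, project name and input files are deliberately left out.

```python
def mira_params(tech="454", keep_contigs_in_memory=False,
                noisy_ends=False, misassembly_detection=True):
    """Return the parameter list suggested in this mail as separate tokens."""
    general = [
        "--job=denovo,genome,normal",  # assemble anew, input is not EST data
        "--noclippings",               # input is expected to be clean
        "-AS:ard=no",                  # no automatic repeat detection
        "-SK:hss=1:pr=70",             # more sensitive SKIM
    ]
    if keep_contigs_in_memory:
        general.append("-AS:sd=no")    # no spoiler detection
    if not misassembly_detection:
        general.append("-CO:mr=no")    # optional, try on real data

    # Options the mail describes as sequencing type dependent go after
    # the "<tech>_SETTINGS" marker, e.g. "454_SETTINGS".
    per_tech = ["-DP:ure=no", "-OUT:sssip=yes"]
    if noisy_ends:
        per_tech.append("-AL:egp=no")  # use with care, see above

    return general + [tech + "_SETTINGS"] + per_tech


print(" ".join(mira_params()))
```

Strain information (-SB) and -CO:mrpg are not included because no concrete values were suggested for them.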

Hope that helps.

Regards,
  Bastien



--
You have received this mail because you are subscribed to the mira_talk
mailing list. For information on how to subscribe or unsubscribe, please
visit http://www.chevreux.org/mira_mailinglists.html
