[mira_talk] Re: debris file and lrc

  • From: <mark.rose@xxxxxxxxxxxx>
  • To: <mira_talk@xxxxxxxxxxxxx>
  • Date: Wed, 18 Feb 2009 10:18:28 -0500

Hi Bastien

Thanks for the quick and robust reply.  Given the things you said, I'm
wondering whether the cause is the fact that I am trying to assemble
sequences (contigs) derived from previously assembled 454 reads (and, in
the future, Sanger-based assembled contigs) and not the reads themselves.
Moreover, these contig sets (coming from different partial genome
assemblies) possibly overlap only minimally with contigs from the
other contig sets I'm attempting to assemble.  Am I understanding you
correctly in thinking that such non-overlapping, unique (by virtue of
the previous subset assembly) sequences would wind up in the debris?  If
so, what is the difference between such "debris" and singlets in the
project sequence results files?  I'm wondering whether these debris
sequences (which incidentally appear normal and above the sequence
length cut-offs) should be included in my result set for this project.

If you would be so kind as to reply to this query soon, I would greatly
appreciate it as I am under pressure to get this assembly done.

Thank you very much

Mark



-----Original Message-----
From: mira_talk-bounce@xxxxxxxxxxxxx
[mailto:mira_talk-bounce@xxxxxxxxxxxxx] On Behalf Of Bastien Chevreux
Sent: Wednesday, February 18, 2009 8:43 AM
To: mira_talk@xxxxxxxxxxxxx
Subject: [mira_talk] Re: debris file and lrc

On Wednesday 18 February 2009 mark.rose@xxxxxxxxxxxx wrote:
> Could you list reasons for why input sequences are discarded into the 
> *_info_debrislist.txt file (I can't find much of a description of it 
> in the manual).  I've been trying to assemble some fasta contigs 
> originally derived from 454 read assemblies and roughly 3/4 of my 
> input sequence wind up in this debris file.

Hello Mark,

MIRA puts everything it could not assemble into the debris. Reads can
land in the debris at various stages (I need to improve the debris
file by appending the reason why each read landed there):

- after loading (and clipping), if the reads are too short
  Important MIRA parameters:
    -AS:mrl and the complete -CL category
- after the SKIM process (quick comparison of each read against every
  other read), if a read does not potentially match any other read
  Important MIRA parameter:
    -SK:pr
- during the pre-assembly Smith-Waterman alignments, if it turns out
  that a read does not match any other read with the given criteria
  Important MIRA parameters:
    -AL:mo:mrs:egp:egpl:megpp
- during/after the contig assembly stage (less frequent, but still
  possible): if a read turns out to be "strange" and doesn't want to
  integrate fully into a contig. This happens, e.g., for chimeras or
  for sequences that have a larger portion of sequencing vector or
  adaptor sequence
  Important MIRA parameters:
    -CO:rodirs, -CO:mr (and the parameters dependent on that) as well
    as the same -AL parameters as for the Smith-Waterman alignments

For assembly of genomic data, -AS:ard (automatic repeat detection) and
the parameters dependent on this can also play a role. Here's a
simplified example: 
assume you have an average coverage of 10 and for a certain repeat you
have 21 sequences. MIRA will then assume that you have 2 copies of this
repeat and allow a coverage of 10 for every repeat copy. This means that
you'll have 2 copies with 10 sequences each, plus one remaining sequence
that could match but, due to the strict handling of repetitive coverage,
is not assembled into any repeat copy and therefore lands in the debris
file.

The same case with 22 sequences would result in 2 copies with 10
sequences each and a third copy with 2 sequences (very probably named
"lrc", see below).
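
The arithmetic of this simplified example can be sketched in a few lines
of Python. This is only an illustration of the two cases above, not
MIRA's actual algorithm: the function name and the
min_reads_per_contig threshold of 2 are assumptions inferred from the
example (one leftover read goes to the debris, two leftover reads form
their own contig).

```python
def split_repeat_reads(n_reads, avg_coverage, min_reads_per_contig=2):
    """Toy model of the repeat example; NOT MIRA's real implementation.

    Returns (contig_sizes, n_debris). min_reads_per_contig is a
    hypothetical threshold chosen to reproduce the 21 vs. 22 read cases.
    """
    # Each assumed repeat copy is allowed avg_coverage reads.
    full_copies, remainder = divmod(n_reads, avg_coverage)
    contig_sizes = [avg_coverage] * full_copies

    if remainder >= min_reads_per_contig:
        # Enough leftover reads to form an extra (probably "lrc") contig.
        contig_sizes.append(remainder)
        n_debris = 0
    else:
        # Too few leftover reads: they end up in the debris.
        n_debris = remainder
    return contig_sizes, n_debris


print(split_repeat_reads(21, 10))  # ([10, 10], 1)
print(split_repeat_reads(22, 10))  # ([10, 10, 2], 0)
```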

Now, if in your case 3/4 of the sequences land in the debris file, you
are in trouble. Possible reasons:
- bad sequence quality (and clipped away)
- sequencing vector (or adaptor sequence) not removed at all from your
  input (if present: is the ancillary data loaded?)
- assembly of EST data with "--job=genome" (this is a big no-no, as
  options switched on there like -AS:ard or -CL:pec wreak havoc for
  EST data)
- other sequence related reasons I currently cannot think of
- a bug in MIRA (which I hope it is not)
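
To put a number on how much of the input lands in the debris, something
like the following Python sketch could be used. It assumes that the
*_info_debrislist.txt file holds one sequence name per line (first
whitespace-delimited token) and that the input is plain FASTA; both are
assumptions about the file layout, and the function names are made up
for this illustration.

```python
def fasta_names(fasta_lines):
    """Collect sequence names (first token after '>') from FASTA lines."""
    names = []
    for line in fasta_lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                names.append(tokens[0])
    return names


def debris_fraction(fasta_lines, debris_lines):
    """Fraction of input sequences whose name appears in the debris list.

    Assumes one name per debris line; blank lines are ignored.
    """
    names = set(fasta_names(fasta_lines))
    debris = {line.split()[0] for line in debris_lines if line.strip()}
    if not names:
        return 0.0
    return len(names & debris) / len(names)


# Tiny in-memory example; real use would pass open file handles instead.
fasta = [">c1 len=4", "ACGT", ">c2", "GGCC", ">c3", "TTAA", ">c4", "AATT"]
debris = ["c1", "c3", "c4", ""]
print(debris_fraction(fasta, debris))  # 0.75
```

With real files, the two arguments can simply be file objects opened on
the input FASTA and on the debris list, since both functions only
iterate over lines.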

> Also can you explain what "lrc" means in the names of assembled
> contigs?

It is a leftover from larger changes that are currently being made to
the algorithms. It was introduced more for debugging purposes, but a few
users thought it could help them, so I left it in.

Basically, MIRA starts to build contigs in areas that are "rock solid",
i.e., not a repetitive region (main decision point) and nice coverage of
good reads. 
If during the assembly MIRA reaches a point where it cannot start
building a contig in a non-repetitive region, it will name the contig
"lrc" instead of "c".

In normal genome projects this happens mostly towards the end of the
assembly process for debris of repetitive copies which were formerly
thrown out due to -AS:ard (see above).

You know that something is wrong if you have "lrc" contigs very early on
in the assembly (with low contig numbers).

Hope that helps,
  Bastien


--
You have received this mail because you are subscribed to the mira_talk
mailing list. For information on how to subscribe or unsubscribe, please
visit http://www.chevreux.org/mira_mailinglists.html 
--------------------------------------------------------

This message may contain confidential information. If you are not the 
designated recipient, please notify the sender immediately, and delete the 
original and any copies. Any use of the message by you is prohibited.
