[mira_talk] Re: debris file and lrc

On Wednesday 18 February 2009 mark.rose@xxxxxxxxxxxx wrote:
> Could you list reasons for why input sequences are discarded into the
> *_info_debrislist.txt file (I can't find much of a description of it in
> the manual).  I've been trying to assemble some fasta contigs originally
> derived from 454 read assemblies and roughly 3/4 of my input sequence
> wind up in this debris file.

Hello Mark,

MIRA puts everything it could not assemble into the debris file. Reads can land 
in the debris at various stages (I still need to improve the debris file by 
appending the reason why each read landed there):

- after loading (and clipping), if the reads are too short
  Important MIRA parameters:
    -AS:mrl and the complete -CL category
- after the SKIM process (quick comparison of each read against every other
  read), if a read does not potentially match any other read
  Important MIRA parameter:
    -SK:pr
- during the pre-assembly Smith-Waterman alignments, if it turns out that a
  read does not match any other read with the given criteria
  Important MIRA parameters:
    -AL:mo:mrs:egp:egpl:megpp
- during/after the contig assembly stage (less frequent, but still possible):
  if a read turns out to be "strange" and doesn't want to integrate fully into
  a contig. This happens, e.g., for chimeras or for sequences that have a
  larger portion of sequencing vector or adaptor sequence
  Important MIRA parameters:
    -CO:rodirs, -CO:mr (and the parameters dependent on that) as well as the
     same -AL parameters as for the Smith-Waterman alignments 

For assembly of genomic data, -AS:ard (automatic repeat detection) and the 
parameters dependent on it can also play a role. Here's a simplified example: 
assume you have an average coverage of 10 and for a certain repeat you have 21 
sequences. MIRA will then assume that you have 2 copies of this repeat and 
allow a coverage of 10 for each copy. This means that you'll have 2 copies 
with 10 sequences each, plus one remaining sequence that could match but, due 
to the strict handling of repetitive coverage, is not assembled into any repeat 
copy and therefore lands in the debris file.

The same case with 22 sequences would result in 2 copies with 10 sequences 
each and a third copy with 2 sequences (very probably named "lrc", see below).
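
The arithmetic of these two cases can be sketched in a few lines of Python. 
This is only a toy model of the simplified example, not MIRA's actual repeat 
handling; the function name and the min_reads_per_contig threshold of 2 are 
assumptions inferred from the 21 vs. 22 sequences example.

```python
def split_repeat_reads(n_reads, avg_coverage, min_reads_per_contig=2):
    """Toy model of the simplified example, not MIRA's real algorithm."""
    # Each repeat copy is allowed at most avg_coverage reads.
    full_copies, leftover = divmod(n_reads, avg_coverage)
    # Assumption: leftover reads form an extra (probably "lrc") contig only
    # if there are at least min_reads_per_contig of them; otherwise they
    # end up in the debris.
    if leftover >= min_reads_per_contig:
        return {"full_copies": full_copies,
                "extra_copy_reads": leftover,
                "debris": 0}
    return {"full_copies": full_copies,
            "extra_copy_reads": 0,
            "debris": leftover}


print(split_repeat_reads(21, 10))  # 2 full copies, 1 read in debris
print(split_repeat_reads(22, 10))  # 2 full copies, extra copy with 2 reads
```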

Now, if in your case 3/4 of the sequences land in the debris file, you are in 
trouble. Possible reasons:
- bad sequence quality (the bad parts then get clipped away)
- sequencing vector (or adaptor sequence) not removed at all from your input
  (if present: is the ancillary data loaded?)
- assembly of EST data with "--job=genome" (this is a big no-no, as options
  switched on there like -AS:ard or -CL:pec wreak havoc for EST data)
- other sequence-related reasons I currently cannot think of
- a bug in MIRA (which I hope it is not)
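
To put a number on how much of the input ends up in the debris, a small script 
like the following may help. It is a sketch under assumptions: that the debris 
list carries the read name as the first whitespace-separated token of each 
line, and that the input is plain FASTA. The function names and the file names 
in the commented usage are hypothetical.

```python
def fasta_names(lines):
    """Return the read names (first token after '>') of FASTA header lines."""
    return {line[1:].split()[0] for line in lines
            if line.startswith(">") and line[1:].split()}


def debris_names(lines):
    """Return read names from a debris list (assumed: first token per line)."""
    return {line.split()[0] for line in lines if line.split()}


def debris_fraction(fasta_lines, debris_lines):
    """Fraction of the input reads that are listed in the debris file."""
    reads = fasta_names(fasta_lines)
    if not reads:
        return 0.0
    return len(reads & debris_names(debris_lines)) / len(reads)


# Hypothetical usage, file names are placeholders:
# with open("input.fasta") as fa, open("project_info_debrislist.txt") as deb:
#     print(debris_fraction(fa, deb))
```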

> Also can you explain what "lrc" means in the names of assembled contigs?

A remnant of larger changes that are currently being made to the algorithms. 
It was introduced more for debugging purposes, but a few users thought it 
could help them, so I left it in.

Basically, MIRA starts to build contigs in areas that are "rock solid", i.e., 
not in a repetitive region (the main decision point) and with nice coverage by 
good reads. If during the assembly MIRA reaches a point where it can no longer 
start building a contig in a non-repetitive region, it will name the contig 
"lrc" instead of "c".

In normal genome projects this happens mostly towards the end of the assembly 
process for debris of repetitive copies which were formerly thrown out due to 
-AS:ard (see above).

You know that something is wrong if you have "lrc" contigs very early on in 
the assembly (with low contig numbers).
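
A quick check for this warning sign could look like the sketch below. It 
assumes that contig names end in "c" or "lrc" followed by the contig number, 
after an underscore (e.g. a hypothetical "myproject_lrc3"). That layout, the 
function name and the cutoff of 10 are assumptions for illustration only, so 
adjust them to what your MIRA version actually writes.

```python
import re

# Assumed name layout: <project>_c<number> or <project>_lrc<number>.
CONTIG_RE = re.compile(r"_(lrc|c)(\d+)$")


def early_lrc_contigs(contig_names, cutoff=10):
    """Return the names of "lrc" contigs whose number is at most cutoff."""
    early = []
    for name in contig_names:
        match = CONTIG_RE.search(name)
        if match and match.group(1) == "lrc" and int(match.group(2)) <= cutoff:
            early.append(name)
    return early


# Hypothetical contig names for illustration.
names = ["myproject_c1", "myproject_lrc3", "myproject_c4", "myproject_lrc250"]
print(early_lrc_contigs(names))
```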

Hope that helps,
  Bastien

